Research Applications of the National Biodiversity Network

A summary of the NBN Research Applications Workshop, January 20th 2014 – Nick J. B. Isaac

NERC Centre for Ecology & Hydrology, Maclean Building, Benson Lane, Crowmarsh Gifford, Wallingford OX10 8BB

Background
The National Biodiversity Network (NBN) provides a focus for collating and disseminating data about UK biodiversity among conservation agencies, NGOs and the community of voluntary wildlife recorders. Since 2000, the activities of the NBN have been coordinated by a charitable body known as the NBN Trust (NBNT). The most visible element of NBN activities is the NBN Gateway, which provides access to >90 million species records via a range of sophisticated web services. 

In recent years, the NBN has sought to strengthen links with the UK research community. To pursue this agenda, NBNT convened a workshop exploring the research applications of NBN data, which was hosted by the Natural History Museum on 20th January 2014.
 

The aims of the workshop were to demonstrate ways in which NBN data can be useful to the research sector, and to identify challenges and opportunities for the future. 

The Workshop
About 30 people attended the workshop. The majority of delegates were members of the UK research community with an interest in using and analysing biodiversity data, but some data providers and policy makers were also represented. The agenda was split equally between presentations and discussion sessions (including breakout groups). The four presentations are summarised below, followed by a review of the issues raised. Not surprisingly, the discussions were dominated by terrestrial examples: this was recognised and a similar workshop focussing on the marine environment was suggested.

Soundbites

“Resurveys are fundamental for monitoring long-term change”

“A little bit of metadata (for each record) would go a long way”

“We need better coverage of functionally important species to deal with future environmental challenges”

“Feedback between recorders, schemes and professional scientists is fundamental”

Presentations

The day began with an overview of the NBN Gateway from Paula Lightfoot, the then Data Access Officer of the NBNT. Paula described the Gateway as a reconnaissance tool for exploring the data, through a combination of text queries and interactive maps, all built on a RESTful API (Application Programming Interface). The Gateway now holds over 90 million species records, of which around 75% are of birds, Lepidoptera and vascular plants; the spatial coverage of less charismatic groups (e.g. many invertebrate groups) and less accessible habitats (especially marine but also the uplands) remains sparse and patchy. A small portion of the records come from structured and effort-based surveys, such as the UK Butterfly Monitoring Scheme (incorporated within the Butterflies for the New Millennium dataset), Breeding Bird Survey and Shorewatch. Some of the structured datasets record species’ absence, but the vast majority of data are relatively unstructured presence-only records, mostly generated by volunteers. These two types of data (structured vs relatively unstructured, ‘opportunistic’) were the focus of the next two talks.

Download Paula's presentation
 

Chris Thomas presented a colourful and provocative call-to-arms on how the data collection process can be made fit for the environmental challenges of the 21st Century. In essence, we need to move away from ‘recording stuff’ to large-scale citizen science, characterized by repeated surveys of key sites, appropriately stratified by landcover, climate etc. Although data are often presented at coarse spatial and temporal resolution (maps summarising a species’ current distribution), the precise spatial coordinates are crucial for understanding species’ microclimatic niches, and to ensure that surveys conducted at different points in time are truly comparable. The vast survey effort of the Butterflies for the New Millennium (BNM) project (1.6 million records) achieved near complete coverage of the British Isles at hectad (100 km²) resolution, yet very few of the records were located in the same hectare as the original survey of 1970-82, so it’s not strictly true to report changes in the status of individual grid cells as extinction or colonization events. True comparability can only be achieved by structured resurveys of historical datasets. The UK’s four Boreal butterflies provide a case in point: just 421 species-sample locations (1 km²) were sufficient to show that distributions of three species are shifting northwards, but the much larger (but coarser) data in the BNM were not.
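The point about spatial resolution can be illustrated with a minimal sketch. It assumes records carry easting and northing coordinates in metres on a square grid; the two coordinate pairs and all function names are hypothetical and are not taken from any dataset discussed at the workshop.

```python
# Cell sizes in metres for the three resolutions mentioned in the text.
HECTAD_M = 10_000   # 10 km x 10 km = 100 km^2
MONAD_M = 1_000     # 1 km x 1 km = 1 km^2
HECTARE_M = 100     # 100 m x 100 m = 1 hectare


def grid_cell(easting_m, northing_m, cell_size_m):
    """Return the (column, row) index of the square cell containing a point."""
    return (int(easting_m // cell_size_m), int(northing_m // cell_size_m))


def same_cell(a, b, cell_size_m):
    """True if two (easting, northing) points fall in the same grid cell."""
    return grid_cell(*a, cell_size_m) == grid_cell(*b, cell_size_m)


# Hypothetical records of one species, made decades apart and a few km apart.
historical = (451_230.0, 207_815.0)
modern = (454_790.0, 208_402.0)

for name, size in [("hectad", HECTAD_M), ("1 km square", MONAD_M), ("hectare", HECTARE_M)]:
    print(f"same {name}: {same_cell(historical, modern, size)}")
```

The two records share a hectad but not a 1 km square or a hectare, so a hectad map would show the cell as occupied in both periods even though no location was actually revisited.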

Download Chris' presentation
 

I (Nick Isaac) was asked to talk on what can be done to remove biases from volunteer-collected data, of the type that makes up the majority of NBN records. I identified four sources of bias that complicate the use of such data: variation in sampling intensity over time, variation in sampling intensity across space, variation in sampling effort per survey and variation in detectability. I presented results of a computer simulation, showing how simple correction factors are not sufficient to control all forms of bias, but that Bayesian occupancy models and site selection criteria would be a robust and powerful combination. Occupancy models make it possible to model the bias, rather than trying to remove it, and this approach is likely to be most fruitful in the long term. However, I concluded that a small amount of metadata about sampling intensity (e.g. whether a species list is complete) would go a long way to making the data fit for the analyses of the future.
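The idea of modelling detection rather than ignoring it can be sketched with a toy simulation. This is not the simulation presented at the workshop and it is not Bayesian: it is a deliberately simplified maximum-likelihood occupancy model, fitted by grid search, with constant occupancy and detection probabilities. All parameter values and names are assumptions chosen for illustration.

```python
import math
import random

TRUE_PSI = 0.6   # assumed probability that a site is occupied
TRUE_P = 0.3     # assumed probability of detection per visit, if occupied
N_SITES = 2000
N_VISITS = 3


def simulate_detections(n_sites, n_visits, psi, p, rng):
    """Return the number of visits with a detection at each site."""
    counts = []
    for _site in range(n_sites):
        occupied = rng.random() < psi
        detections = 0
        if occupied:
            detections = sum(1 for _visit in range(n_visits) if rng.random() < p)
        counts.append(detections)
    return counts


def log_likelihood(psi, p, n_visits, histogram):
    """Log-likelihood of the occupancy model given detection counts per site."""
    total = 0.0
    for d, n_sites_with_d in histogram.items():
        # Occupied site: binomial probability of d detections in n_visits.
        lik = psi * math.comb(n_visits, d) * p**d * (1 - p) ** (n_visits - d)
        if d == 0:
            # A site with no detections may also be genuinely unoccupied.
            lik += 1 - psi
        total += n_sites_with_d * math.log(lik)
    return total


def fit_occupancy(counts, n_visits, step=0.01):
    """Grid-search maximum-likelihood estimates of (psi, p)."""
    histogram = {}
    for d in counts:
        histogram[d] = histogram.get(d, 0) + 1
    grid = [i * step for i in range(1, int(round(1 / step)))]
    return max(
        ((psi, p) for psi in grid for p in grid),
        key=lambda pair: log_likelihood(pair[0], pair[1], n_visits, histogram),
    )


rng = random.Random(2014)
counts = simulate_detections(N_SITES, N_VISITS, TRUE_PSI, TRUE_P, rng)

# Naive estimate: treat every site without a detection as unoccupied.
naive_psi = sum(1 for d in counts if d > 0) / N_SITES
psi_hat, p_hat = fit_occupancy(counts, N_VISITS)

print(f"true occupancy {TRUE_PSI}, naive {naive_psi:.2f}, modelled {psi_hat:.2f}")
```

Under these assumptions the naive proportion of sites with a record underestimates occupancy, because occupied sites where the species was missed are counted as absences, whereas the model estimates detectability from the repeat visits and corrects for it. Real recording data would also need the other sources of bias to be addressed.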

Download Nick's presentation
 

Bill Sutherland built further on the idea of making our data fit for the challenges of the future. As an example, he pointed out how the scientific community was unprepared for the rise of biofuels, and asked us to consider the data that would be required to deal with challenges such as geoengineering, artificial life and nanotechnology. Challenges such as these could threaten functionally-important taxa for which there is limited capacity in the voluntary sector (e.g. mycorrhizae, earthworms). To date we’ve relied on the data already to hand, in terms of both the threats and the species groups under consideration. Platforms such as IPBES and EU BON have created momentum to collate large quantities of data, but we should think about what other information would be worth collecting (e.g. environmental levels of nano-silver). The very process of collecting biodiversity data is likely to change in the future. Technology is already beginning to change the way that citizens interact with biodiversity data, through initiatives such as Discover Life (multi-access keys), eBird (which provides instant feedback) and iSpot (for identification). Technology also makes it possible to record species in ways that the NBN has not considered (e.g. metabarcoding and environmental DNA).

Discussion

Two breakout groups picked up on the issues raised by the presentations: each was followed by a plenary discussion. Several themes emerged.

Research Use of NBN data

Accessing the data has recently become easier, due to changes in the download format and the development of the new API. An ‘rNBN’ package is in development, which will make it possible to query the NBN Gateway directly from the R environment for statistical computing. However, several barriers were identified that have limited the use of NBN data and services in research.
 

A general point is that the NBN Gateway is not designed with research applications in mind. Most NBN activities are motivated by government priorities, whereas most research applications are curiosity driven. Most academic users of NBN data are interested in questions at the national scale; most non-academic users want local summaries of the data (e.g. for offsetting or Local Authority Planning). 
 

One barrier is that the process for accessing data remains difficult to navigate. The NBN Terms and Conditions were felt to have restricted research use to date. Downloading records for academic research requires written permission from the data providers. This message is conveyed in red capital letters, which gives the erroneous impression that records cannot be downloaded until the permission has been granted. Perhaps for this reason, a straw poll of delegates revealed that just one actually uses the Gateway to download data directly. Most delegates access the data via existing relationships with the data providers (BTO, CEH etc). Others felt they had insufficient knowledge (or access to knowledge) about the quality of the data available. However, Master’s and undergraduate student projects do not require written permission, and NBN receives a large number of such data requests each year.
 

Conversely, the Terms and Conditions depend on users choosing to conform, because it is possible to access large amounts of data without the knowledge of the data owners. For this reason some organisations have become more restrictive in what they make available on the NBN Gateway, which goes against the push towards open data.
 

Once data has been accessed, researchers need to know how to cite the data. The existing guidance on citing data relies on examples: this is reasonably easy to follow when accessing a single data source, but less straightforward when multiple datasets are combined. Rather than providing examples of how to cite data, it would be useful to researchers if the appropriate citation were generated automatically and included in the download package.
 

A long-standing issue for research use of NBN data has been that individual datasets are not static. The working model is that datasets are periodically replaced when new records become available (e.g. annually), rather than appending the newest records to the existing set. This means that the same data query submitted on different dates is liable to return a different set of records, thus hampering any attempt to reproduce the results of a research paper using data downloaded from the NBN. As the Open Science agenda gathers momentum, funding agencies and journals are increasingly demanding that source code and raw data files be made widely available. At present the NBN Terms and Conditions make it difficult for researchers to publish the data in its raw form. However, if these issues could be resolved then the NBN could fulfil a central role in the academic environment. Two complementary ideas were suggested as potential solutions. One is that datasets could be time-stamped and permanently citable using Digital Object Identifiers, thus streamlining the data citation process. Another is that researchers publish the queries submitted to the Gateway’s API and/or the unique identities (e.g. row numbers) of the records used in analysis. Both approaches could help satisfy the reproducibility requirement whilst obviating the need to publish the records themselves. The example of GenBank provides an obvious model for the NBN to follow: individual gene sequences can be cited in the literature using accession numbers.
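The second suggestion, publishing the query and the identities of the records used, might look something like the sketch below. It is purely illustrative: the query fields, record identifiers and function name are hypothetical and do not reflect the Gateway’s actual API or identifier scheme.

```python
import hashlib
import json


def build_manifest(query, record_ids, accessed):
    """Summarise a data download so that it can be published without the records.

    query      -- the parameters sent to the data service (hypothetical fields)
    record_ids -- unique identifiers of the records actually used in the analysis
    accessed   -- the date on which the query was run, as an ISO 8601 string
    """
    ids = sorted(set(record_ids))
    # The digest is independent of the order in which records were returned,
    # so anyone holding the same set of records can verify it.
    digest = hashlib.sha256("\n".join(ids).encode("utf-8")).hexdigest()
    return {
        "query": query,
        "accessed": accessed,
        "n_records": len(ids),
        "record_ids_sha256": digest,
        "record_ids": ids,
    }


# Hypothetical example values, for illustration only.
manifest = build_manifest(
    query={"taxon": "Example species", "start_year": 1970, "end_year": 2010},
    record_ids=["rec-0003", "rec-0001", "rec-0002"],
    accessed="2014-01-20",
)
print(json.dumps(manifest, indent=2))
```

A manifest of this kind could accompany a paper in place of the raw records. If a dataset is later replaced, rerunning the query and comparing digests would show whether the set of records has changed.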

Data Quality and Derived Data Products
The biases inherent in NBN data have been widely discussed. Some progress towards understanding and mitigating these biases has been made, to the point where it is conceivable to provide some kind of interpretation in addition to the raw data. The idea of providing derived data products on the Gateway is already under discussion within NBN. The option to view (and download) modelled species’ distributions, accounting for sampling intensity, would provide a useful context for understanding the raw data and would conceivably be useful in a range of applications.

If derived data products are provided on the NBN Gateway, why not also provide access to related datasets that would be useful? The NBN could become the central repository for national scale datasets to support analyses of biodiversity change (e.g. environmental layers, species traits, phylogenies).
 

A particular feature of NBN records is that the datasets themselves are heterogeneous: they include intensive local surveys and opportunistic national schemes. To date, little attention has been paid to the issues of combining these sorts of data into a single analytical framework.

How to facilitate improvements in data collection?
A majority of data on the NBN Gateway were opportunistically collected by volunteers: we know nothing about the sampling intensity. The NBN can play an important role in facilitating a transition from volunteer recorders to true citizen scientists, by working with its component organisations to promote examples of good practice and initiating a discussion about what sorts of data would be desirable.

There was broad agreement that we’d like to see changes to the data collection process, in order to bring it closer to a designed sampling regime. Mobile Apps and other technology have a crucial role in facilitating changes in behaviour, for example by harvesting metadata about sampling effort. Changing the behaviour of volunteers is perceived as difficult, because most recorders aren’t motivated by data targets, survey protocols or the bigger picture of biodiversity change. However, many are motivated by a strong sense of connection with their ‘local patch’, and this could form a useful basis for deepening the relationship between ‘producers’ and ‘consumers’ of biodiversity data. It was pointed out that BirdTrack was initially unpopular among BTO members, until its usefulness became apparent. Thus, behavioural change can be achieved if the Apps are easy to use and give the recorders something they actually want (the Facebook or Google model). Increased use of social media and mobile technology has the potential to widen the participation in voluntary recording (e.g. through increased uptake, by encouraging cross-taxon recording), but also deepen that relationship, by providing instant feedback to the user and/or highlighting gaps (cf the eBird model). On the other hand, Apps can exacerbate some biases, e.g. they tend to generate lists with fewer species than conventional recording. There will inevitably be a diversity of Apps developed: it’s unclear whether NBN should try to influence this development, although it does have an obvious role in publicising the success stories.
 

There is a limit to what can be achieved using the voluntary community alone. There was a suggestion that NBN could have a role in commissioning surveys to fill gaps in coverage (of species, grid cells or habitats that are poorly-sampled by the existing datasets). In addition to data collection, the voluntary community provides essential data validation services for all records submitted to the national schemes and societies. A general growth in public participation in collecting biodiversity data places an increasing burden on this small pool of experts. Supporting this community is a key challenge for the NBN partnership.

NBN & Structured data

The NBN Gateway holds relatively little data from structured surveys. Two obvious ways for NBN to develop in this area were suggested. One is to provide mechanisms for structured data, including that collected for scientific research, to be collated and disseminated (i.e. expand the remit of the NBN so it can properly handle a variety of data types). Some repositories for scientists to submit and share ecological data already exist, but no ‘industry standard’ has so far emerged. The second is to act as a forum or catalyst for designing and implementing new structured schemes, in the way that recently occurred with pollinators.

NBN & Policy

The NBN has a role in bridging the gap between government conservation agencies and the research community. It can act as a coordinated front representing its member organisations to the government (although several are themselves wholly owned by government). Whether the NBN can influence government policy is less clear.
 

Most research applications are curiosity-driven, but NBN activities have, to date, been driven by policy requirements. Similarly, most research uses of NBN data are national in scale, yet non-academic users are concerned with local information (e.g. offsetting, Local Authority planning). The current Gateway takes a ‘one-size-fits-all’ approach that may not be sustainable in the long term, although the REST web services offer the potential to develop bespoke interfaces for niche requirements (e.g. rNBN).
 

In spite of this dichotomy, UK biodiversity policy influences all members of the NBN, including the academic community who seek government funds for research. The National Ecosystem Assessment is seen as successful and will probably return in some form: the NBN should be prepared to contribute to this and emerging debates on Green Infrastructure, Biodiversity Offsetting, Natural Capital and the development of IPBES. What can the NBN do to ensure government can take effective action on pollination, invasive species and tree health?  

Conclusions

The NBN was not designed with the research community in mind. Recent developments in the Gateway and in modelling approaches have opened up the potential for greater use of the data, although several areas for improvement were identified (especially around terms and conditions). A number of questions were raised about the role of NBN, and what strategic direction it should take. The NBN can play an important role in building links between the professional science and citizen science communities, in order to ensure that NBN data is fit for the environmental challenges of the 21st century.

The NBN Trust would like to thank Nick for this detailed write-up of the day’s proceedings and outputs. If you have any comments, you can contact Nick by email at njbi@ceh.ac.uk

You can see how data from the NBN are being used for research on our Examples of Use page.
