Herbarium or museum records, or even a single collector’s records, are all aggregations of records taken at different times and by different collectors. The flow of biological observations can pass from observer to end user through multiple digital aggregators. In his analysis of the Australian millipedes, Bob Mesibov also acted as a data aggregator. At any node in the data flow, errors can be detected, introduced or addressed.
Data should be published in secure locations where they can be preserved and improved in perpetuity; the ALA is a good example. We are moving beyond storage of data by individuals or institutions that do not have a strategy for enduring digital data integration, storage and access.
One of the most powerful outcomes of publishing digital data is that inherent legacy problems are revealed despite the concerted work of dedicated taxonomists over decades or longer. Exposing data provides the opportunity for the community to detect and correct errors. Indeed, much of the admirable work achieved by Mesibov (2013) was enabled by having data exposed beyond the originating institutions.
The ability to identify and correct data issues is also the responsibility of the whole community and not any single agent such as the ALA. There is a need to seamlessly and effectively integrate expert knowledge and automated processes so that all amendments form part of a persistent digital knowledge base about species. Talented and committed individuals can make enormous progress in error detection and correction (as seen in Mesibov, 2013), but how do we ensure that when a project like that on millipedes ceases, the data and all associated work are not lost? This implies standards in capturing and linking this information and maintaining the data with all amendments documented. To achieve this, the biodiversity research community needs to be motivated and empowered to work in a collaborative fashion.
Data quality is of the highest concern, but data may not have to be 100% accurate to have utility in some projects. Quality issues affecting some users may be of secondary or no importance to others. For example, a locational inaccuracy of 20 km on a record will likely not invalidate its use in regional or continental-scale studies. Access to information on a type specimen is likely to be of value even if georeferences are incomplete. The term ‘fitness for use’ may therefore be more appropriate than ‘data quality’ in many circumstances.
This is not an excuse to ignore errors, but a recognition that effective use depends on knowledge about the data. The goal of the aggregators is to understand how much confidence is appropriate in each element of each record and to enable users to filter data based on these confidence measures. The philosophy of most aggregators is therefore to correct what is obviously correctable and to flag remaining issues rather than hide the associated records.
Some data quality issues can be detected without domain-specific taxonomic expertise and some cannot. The correction of detected issues also may or may not require domain-specific expertise. There are therefore four possible scenarios: each combination of detection and correction, with or without domain expertise.
Specialist domain expertise is required to detect and correct many of the issues raised by Mesibov (2013). Agencies such as the ALA do not generally have this type of expertise. The ALA does, however, expose the data and provide infrastructure and processes that help to detect and to address data issues. An example of the quality controls undertaken by the ALA can be seen here.
A more fundamental issue is that most biodiversity data today are managed and published through a wide range of heterogeneous databases and processes. Consistency is required for guaranteed, stable, persistent access to each data record and in establishing standardised approaches to registering and handling corrections. ‘Aggregators’ such as the ALA have a key role in addressing this challenge but ultimately it will depend on widespread changes in the culture of biodiversity data management.
Another component of the infrastructure supporting error detection and correction in the ALA is a sophisticated annotations service that uses crowdsourcing. Issues that are detected, along with potential corrections, are returned to the data provider. Note, however, that some data providers may not have the resources to address the issues, or may no longer exist.
See Belbin, et al. (2013) below for the full paper.
Belbin L, Daly J, Hirsch T, Hobern D, LaSalle J (2013). A specialist’s audit of aggregated occurrence records: An ‘aggregator’s’ perspective. ZooKeys 305: 67-76. doi: 10.3897/zookeys.305.5438
Chapman, AD (2005a). Principles and Methods of Data Cleaning – Primary Species and Species Occurrence Data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. 75p.
Chapman, AD (2005b). Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. 61p.
Costello MJ, Michener WK, Gahegan M, Zhang Z-Q, Bourne P, Chavan V (2012). Quality assurance and intellectual property rights in advancing biodiversity data publications version 1.0, Copenhagen: Global Biodiversity Information Facility, 40p, ISBN: 87‐92020‐49‐6.
Mesibov R (2013) A specialist’s audit of aggregated occurrence records. ZooKeys 293: 1-18. doi: 10.3897/zookeys.293.5111
Otegui J, Ariño AH, Encinas MA, Pando F (2013) Assessing the Primary Data Hosted by the Spanish Node of the Global Biodiversity Information Facility (GBIF). PLoS ONE 8(1):
Due to demand, I have produced a detailed user manual for the Atlas of Living Australia’s Spatial Portal (SP). This is intended to be a stand-alone document that provides the context for, and philosophy behind, the SP, as well as a deeper insight into its many tools. The manual covers all aspects of the SP with advice, examples and references. I’m very grateful to Margaret Cawsey, John Tann and John Busby for providing extensive feedback and suggestions. I hope that all of their comments have been incorporated.
An issue we all face in putting together a document like this is that the SP and other Atlas tools are constantly improving in response to user feedback, through the work of a talented programming team. This document will therefore become out of date over time, but we will update the manual as often as possible after system updates.
Some items that we are looking forward to working on include:
Feedback on any of these items would be valued.
We are regularly asked about how we process and manage data. This blog provides a reasonably technical overview of:
Note that these processes continue to evolve over time to better detect issues and address your needs so your feedback is welcome.
Our philosophy is to:
In this approach, the Atlas does not make an overall assertion about the quality of the data. This is based on the premise that we do not know (or need to know) the purpose to which the data will be put. In other words, each user’s fitness-for-purpose scenario is different, so we put as much information as possible in the hands of users to help them make this call.
To take this point further – some data may be unusable to some users in some circumstances, but will still be valuable in other contexts. With this in mind, the Atlas will never restrict data from a search based on the results of the data processing, but will expose the results and allow users to filter certain data in or out of display and later analyses.
As data is processed and ingested into the Atlas, a large number of tests are run against the data. These result in assertions about the content and quality of the data. All the assertion types and descriptions of the tests are outlined in this spreadsheet.
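To make the idea concrete, here is a minimal sketch in Python of how a suite of tests could attach assertions to a record. The test and assertion names here are invented for illustration and are not the Atlas’ actual ones; the field names follow Darwin Core.

```python
from datetime import date

# Each test inspects one record and returns an assertion name, or None if
# the record passes. (Test/assertion names are invented for illustration.)

def check_coordinates(record):
    """Flag records with missing coordinates."""
    if record.get("decimalLatitude") is None or record.get("decimalLongitude") is None:
        return "MISSING_COORDINATES"
    return None

def check_event_date(record):
    """Flag records dated in the future."""
    d = record.get("eventDate")
    if d is not None and d > date.today():
        return "FUTURE_DATE"
    return None

def run_tests(record, tests=(check_coordinates, check_event_date)):
    """Run every test and collect the assertions raised against the record."""
    return [a for a in (t(record) for t in tests) if a is not None]

# A record lacking coordinates attracts one assertion; a user could later
# filter all records carrying MISSING_COORDINATES out of a search.
flagged = run_tests({"scientificName": "Genus species"})
```

The key design point is that the record itself is never discarded: the assertions ride along with it, and filtering is left to the user.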
The results of the tests and assertions are visible on a record page and can also be used in searches, as well as to include or exclude records from display or analyses, as per the following screenshots:
The taxonomy provided with the original occurrence record can vary from a scientific name to a classification from subspecies to kingdom. Scientific names are parsed using the GBIF ECAT name parser, a useful Java library developed by GBIF. This code extracts the key components of the name (genus, specific epithet, authorship, etc.), taking into account the nomenclatural rules (botanical and zoological) for scientific names. The Atlas extends the GBIF ECAT name parser to handle phrase names as defined in the National Species List (NSL). The NSL includes the Australian Faunal Directory (AFD), the Australian Plant Name Index (APNI) and the Australian Plant Census. The classification is matched to taxa in the NSL.
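By way of illustration only, extracting the components of a name looks roughly like this. The toy regex below is far simpler than the GBIF ECAT parser: it handles only plain binomials with optional authorship, not phrase names, infraspecific ranks or other nomenclatural edge cases.

```python
import re

# Toy parser for a plain binomial with optional authorship; the real ECAT
# parser handles far more (infraspecific ranks, hybrids, phrase names, etc.).
NAME_RE = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+(?P<epithet>[a-z-]+)(?:\s+(?P<authorship>.+))?$"
)

def parse_name(name):
    """Split a scientific name into genus, specific epithet and authorship."""
    m = NAME_RE.match(name.strip())
    return m.groupdict() if m else None

parsed = parse_name("Acacia dealbata Link")
# parsed == {"genus": "Acacia", "epithet": "dealbata", "authorship": "Link"}
```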
The type of the name matching is classified into one of the categories on the left (as of 4/10/2013).
Taxonomic name issues that are flagged include:
Further information about the Name Matching algorithms used by the ALA can be found in the ALA Names List wiki.
Detecting duplicate occurrence records is one of the tests performed during data processing. If a potential duplicate is detected, the record is flagged, which allows users to optionally discard duplicate records from searches, analysis and mapping. A discussion by Simon Bennett on duplicate records is available here. Duplicates may occur in observational data due to historical merging between datasets. For specimen data, “duplicate” may be inaccurate, as there may be multiple specimens taken from the same individual or multiple specimens taken as part of the same collection event.
The Atlas uses the scientific name, decimal latitude, decimal longitude, collector and collection date to detect potential duplicate records. Here are some additional implementation details:
When a group of duplicate records is identified, one record is chosen as the “representative” of the duplicates. The representative record is the one that is most complete in terms of geospatial information and other metadata.
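A minimal sketch of this clustering in Python (field names follow Darwin Core; the completeness measure here simply counts populated fields, which is a simplification of the actual selection logic):

```python
from collections import defaultdict

# Group records by the five fields used for duplicate detection, then pick
# the most complete record of each cluster as its "representative".
KEY_FIELDS = ("scientificName", "decimalLatitude", "decimalLongitude",
              "recordedBy", "eventDate")

def duplicate_clusters(records):
    """Return lists of records sharing the same values for KEY_FIELDS."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec.get(f) for f in KEY_FIELDS)].append(rec)
    return [g for g in groups.values() if len(g) > 1]

def representative(cluster):
    """Choose the record with the most populated fields."""
    return max(cluster, key=lambda r: sum(v is not None for v in r.values()))

records = [
    {"scientificName": "Genus species", "decimalLatitude": -41.2,
     "decimalLongitude": 146.3, "recordedBy": "A. Collector",
     "eventDate": "2009-05-01", "locality": None},
    {"scientificName": "Genus species", "decimalLatitude": -41.2,
     "decimalLongitude": 146.3, "recordedBy": "A. Collector",
     "eventDate": "2009-05-01", "locality": "Dip Range"},
]
clusters = duplicate_clusters(records)
```

Here the second record wins as representative because it carries a populated locality that the first lacks.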
The duplicate detection process is run as a batch job across all data each week.
Here is an example record that has been marked as a duplicate. In this particular case the specimens were possibly collected as part of the same collection event.
As part of the work on fishmap, the Atlas worked with CSIRO CMAR to expose fish distributions. These distributions are represented as polygons covering the areas where each species is believed to occur. Each distribution has been developed by an expert in the taxonomic group.
A similar suite of expert distributions has been developed for birds by BirdLife International. Both sets of distributions are used to add assertions to occurrence records that fall outside these polygons. As with many assertions, the fact that an assertion is made does not necessarily mean that the record is an ‘error’. For example, in the case of these expert distributions, a new observation may extend the species’ known range.
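The underlying check is a point-in-polygon test. Here is a self-contained sketch; a real implementation would use a spatial library and real range polygons, and the unit square below is purely a stand-in for an expert range.

```python
# Flag an occurrence whose point falls outside a species' expert
# distribution polygon, using a simple ray-casting point-in-polygon test.

def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edge crossings of a horizontal ray extending to the right.
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

# Unit square as a stand-in for an expert range polygon.
expert_range = [(0, 0), (1, 0), (1, 1), (0, 1)]
inside = point_in_polygon(0.5, 0.5, expert_range)   # inside: no assertion
outside = point_in_polygon(2.0, 0.5, expert_range)  # outside: flag the record
```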
We hope to add expert ranges for other groups as they become available. This process is run across all data each week.
‘Environmental outlier’ detection is run across all occurrence data that have been classified to species level. In this context, ‘environmental outlier’ means that an occurrence record lies outside one or more of the expected environmental ranges of the species. This check intersects all point locations for each species with 5 selected environmental surfaces, and then runs an algorithm known as the Reverse Jackknife (see here for details).
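As a rough illustration of the idea only (this is a simplified gap-based stand-in, not the actual Reverse Jackknife algorithm), a record is suspect in one environmental dimension if its value sits at an extreme, separated from the rest of the species’ values by an unusually large gap:

```python
# Simplified stand-in for outlier detection in one environmental dimension:
# an extreme value is suspect if the gap separating it from its neighbour is
# much larger than the average gap between successive sorted values.

def suspect_extremes(values, gap_factor=3.0):
    """Return extreme values separated from the rest by a gap larger than
    gap_factor times the mean gap between successive sorted values."""
    xs = sorted(values)
    if len(xs) < 3:
        return []
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    mean_gap = sum(gaps) / len(gaps)
    out = []
    if gaps[0] > gap_factor * mean_gap:
        out.append(xs[0])
    if gaps[-1] > gap_factor * mean_gap:
        out.append(xs[-1])
    return out

# Hypothetical annual-mean-temperature values at a species' record locations:
temps = [14.1, 14.5, 14.8, 15.0, 15.2, 28.9]
outliers = suspect_extremes(temps)
```

The record sitting at 28.9 °C would be flagged for this surface, while the tightly grouped remainder would not.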
The 5 environmental layers were selected because, as a group, they account for most of the terrestrial environment of Australia at a 1 km resolution. It is recognized that these 5 layers may not cover the significant environmental determinants of all terrestrial species, or be appropriate at all scales, but it is a start. For marine species, the environmental layers are less well developed, but watch this space.
For more information about the selection of the environmental layers see Williams KJ, Belbin L, Austin MP, Stein JL, Ferrier S (2012). Which environmental variables should I use in my biodiversity model? International Journal of Geographical Information Science 26(11): 2009-2047. doi: 10.1080/13658816.2012.698015
This process is run across all species each week.
Below is a display taken from an occurrence record that has been marked as a potential environmental outlier against 3 of the 5 environments.
Records in blue are not considered outliers. The records in red are considered potential outliers for this environmental surface. The actual record being viewed is considered a potential outlier against this environment and is coloured yellow.
As part of the Australian National Data Service (ANDS) funded collaboration with the Centre for Tropical Biodiversity & Climate Change and the eResearch Centre, James Cook University, the Atlas now supports what we are calling “query assertions”. These are pre-defined queries (based on, say, “this species in this location at this time”) that dynamically flag any records that match. For example, all records matching the query “the (fictitious) Atlas Nematode in the Condamine Alliance NRM (or any polygon) between 1970 and 1980” could be flagged as, say, “introduced”.
This gives expert users a quick way of flagging issues against a large number of records. This type of assertion also differs from ad-hoc annotations provided for a single record (see “Flag an issue” on a record page) in that they are applied to all new data that matches an existing query assertion as it is loaded into the Atlas.
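In code terms, a query assertion is essentially a stored predicate applied to records at load time. A sketch (the field names and the flag value are assumptions for illustration; the Atlas applies these through its own processing pipeline):

```python
# A stored "query assertion": a predicate built once from a query, then
# applied to every record, including records ingested later.

def make_query_assertion(species, region, start_year, end_year, flag):
    def apply_flag(rec):
        """Append the flag to the record if it matches the stored query."""
        if (rec.get("scientificName") == species
                and rec.get("region") == region
                and start_year <= rec.get("year", 0) <= end_year):
            rec.setdefault("assertions", []).append(flag)
        return rec
    return apply_flag

# The (fictitious) example from the text, as a reusable assertion:
flag_introduced = make_query_assertion(
    "Atlas nematode", "Condamine Alliance NRM", 1970, 1980, "introduced")

rec = {"scientificName": "Atlas nematode",
       "region": "Condamine Alliance NRM", "year": 1975}
flag_introduced(rec)
```

Because the predicate is stored rather than applied once, any new matching record picks up the same flag when it is loaded.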
A listing of records that have been marked with query assertions is available here. An example record that has been marked in this way is here. The annotation also has a link to other records that have been marked with the same assertion. Here is an example set of records marked with a single assertion.
This process is run weekly across all data. Note: there isn’t currently a user interface in the Atlas for adding this type of assertion (web services must be used), but we hope to add one in the near future.
Records are marked as spatially suspect if they fail one or more of a subset of the spatial tests. As with other tests, users may include/exclude data with specific spatial assertions. All tests are outlined in this spreadsheet.
Examples of the issues that will cause a record to be marked as suspect are included in the spreadsheet of tests referenced above.
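Two of the simplest checks of this kind, sketched in Python (the assertion names are illustrative rather than the Atlas’ own; the authoritative list of tests is in the spreadsheet):

```python
# Illustrative spatial checks; assertion names are not the Atlas' own.

def spatial_assertions(lat, lon):
    """Flag coordinates that are zeroed or outside the valid lat/lon range
    (the latter often indicates transposed latitude and longitude)."""
    issues = []
    if lat == 0 and lon == 0:
        issues.append("ZERO_COORDINATES")
    if not (-90 <= lat <= 90) or not (-180 <= lon <= 180):
        issues.append("COORDINATES_OUT_OF_RANGE")
    return issues

# Latitude and longitude accidentally swapped for a Tasmanian record:
swapped = spatial_assertions(147.3, -42.9)
```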
The suite of Atlas tests cannot identify all errors. For example, a record may contain a valid scientific name but it may have been misnamed at the time of observation or subsequently. If the misnamed species occurs in the same environmental or spatial conditions, it is unlikely to be automatically detected, unless other tests flag an issue. Such a naming error may be detected for example, by a taxonomist who recognizes the record and knows about the corresponding museum specimen or a history of misnaming.
There are four possible combinations of detecting and correcting errors by automated tests and by people. Automated tests may be able to both detect and correct a range of errors as noted above. In some cases, automated tests can detect errors but cannot correct them. For example, a record may be flagged as being a genuine environmental outlier but how does one correct the location of the observation?
In summary, the automatic tests within the Atlas of Living Australia are necessary but insufficient. Human intervention is required to design the tests and to evaluate the results. In many cases, human intervention is also required to detect and correct errors that the automatic tests cannot; see http://www.pensoft.net/journals/zookeys/article/5438/a-specialist%E2%80%99s-audit-of-aggregated-occurrence-records-an-%E2%80%98aggregator%E2%80%99s%E2%80%99-perspective
The Atlas therefore values all annotations to records when potential issues are detected by users. The Atlas is looking into methods that will support the bulk import of record annotations where an external analysis has been performed by taxonomic/domain experts.
You can run the Atlas tests against your data by uploading it to the Atlas’ sandbox tool. Note: uploaded data are removed from the Atlas periodically. The tests run in the sandbox do not currently include duplicate detection, environmental outlier detection or expert distribution outlier detection, but this is another area we are working on at the moment.