
A specialist’s audit of aggregated occurrence records: An ‘aggregator’s’ perspective

‘Data quality’ has been a hot topic in the Atlas of Living Australia (ALA) for many years. In just about all the presentations I have ever given on the ALA, data quality comes up at question time.

A recent ZooKeys paper (Mesibov 2013) highlighted data quality issues in aggregated data sets in the ALA and GBIF. This admirable work by Bob Mesibov was, however, enabled by the ALA’s exposure of integrated datasets. While we welcome his highlighting of data issues, the ALA and its counterparts internationally have been working diligently to minimise such errors, though not as fast as the community may wish.

What’s odd?

Herbarium or museum records, or even a single collector’s records, are all aggregations of records taken at different times and, often, by different collectors. The flow of biological observations can pass from observer to end user through multiple digital aggregators. Bob Mesibov is himself a data aggregator in his analysis of Australian millipedes. At any node in the data flow, errors can be detected, introduced or addressed.

Data should be published in secure locations where they can be preserved and improved in perpetuity, and the ALA is a good example. We are moving beyond storage of data by individuals or institutions that lack a strategy for enduring digital data integration, storage and access.

One of the most powerful outcomes of publishing digital data is that inherent legacy problems are revealed despite the concerted work of dedicated taxonomists over decades or longer. Exposing data provides the opportunity for the community to detect and correct errors. Indeed, much of the admirable work achieved by Mesibov (2013) was enabled by having data exposed beyond the originating institutions.

Identifying and correcting data issues is also the responsibility of the whole community, not of any single agent such as the ALA. There is a need to seamlessly and effectively integrate expert knowledge and automated processes so that all amendments form part of a persistent digital knowledge base about species. Talented and committed individuals can make enormous progress in error detection and correction (as seen in Mesibov 2013), but how do we ensure that when a project like that on millipedes ceases, the data and all associated work are not lost? This implies standards for capturing and linking this information and maintaining the data with all amendments documented. To achieve this, the biodiversity research community needs to be motivated and empowered to work collaboratively.

Data quality is of the highest concern, but data may not have to be 100% accurate to have utility in some projects. Quality issues affecting some users may be of secondary or no importance to others. For example, a locational inaccuracy of 20 km will likely not invalidate a record’s use in regional- or continental-scale studies. Access to information on a type specimen is likely to be of value even if georeferences are incomplete. The term ‘fitness for use’ may therefore be more appropriate than ‘data quality’ in many circumstances.

This is not an excuse to ignore errors, but a recognition that effective use depends on knowledge about the data. The goal of the aggregators is to understand how much confidence is appropriate in each element of each record and to enable users to filter data based on these confidence measures. The philosophy of most aggregators is therefore to correct what is obviously correctable and to flag remaining potential issues rather than hiding the associated records.

Some data quality issues can be detected without domain-specific taxonomic expertise and some cannot; likewise, the correction of detected issues may or may not require domain-specific expertise. There are therefore four possible scenarios.

Specialist domain expertise is required to detect and correct many of the issues raised by Mesibov (2013). Agencies such as the ALA do not generally have this type of expertise. The ALA does, however, expose the data and provide infrastructure and processes that help to detect and address data issues. An example of the quality controls undertaken by the ALA can be seen here.

A more fundamental issue is that most biodiversity data today are managed and published through a wide range of heterogeneous databases and processes. Consistency is required for guaranteed, stable, persistent access to each data record and for establishing standardised approaches to registering and handling corrections. ‘Aggregators’ such as the ALA have a key role in addressing this challenge, but ultimately it will depend on widespread changes in the culture of biodiversity data management.

Another component of the infrastructure supporting error detection and correction in the ALA is a sophisticated annotations service that uses crowdsourcing. Issues that are detected, along with potential corrections, are returned to the data provider. Note, however, that some data providers may not have the resources to address the issues, or may no longer exist.

See Belbin et al. (2013) below for the full paper.

References

Belbin L, Daly J, Hirsch T, Hobern D, La Salle J (2013) A specialist’s audit of aggregated occurrence records: An ‘aggregator’s’ perspective. ZooKeys 305: 67-76. doi: 10.3897/zookeys.305.5438

Chapman AD (2005a) Principles and Methods of Data Cleaning – Primary Species and Species Occurrence Data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. 75 pp.

Chapman AD (2005b) Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. 61 pp.

Costello MJ, Michener WK, Gahegan M, Zhang Z-Q, Bourne P, Chavan V (2012) Quality assurance and intellectual property rights in advancing biodiversity data publications, version 1.0. Global Biodiversity Information Facility, Copenhagen. 40 pp. ISBN: 87-92020-49-6.

Mesibov R (2013) A specialist’s audit of aggregated occurrence records. ZooKeys 293: 1-18. doi: 10.3897/zookeys.293.5111

Otegui J, Ariño AH, Encinas MA, Pando F (2013) Assessing the Primary Data Hosted by the Spanish Node of the Global Biodiversity Information Facility (GBIF). PLoS ONE 8(1): 

Due to demand, I have produced a detailed user manual for the Atlas of Living Australia’s Spatial Portal (SP). It is intended as a stand-alone document that provides the context for, and philosophy behind, the SP, as well as a deeper insight into its many tools. The manual covers all aspects of the SP with advice, examples and references. I’m very grateful to Margaret Cawsey, John Tann and John Busby for providing extensive feedback and suggestions. I hope that all of their comments have been incorporated.

Koppen climate classification

An issue we all face in putting together a document like this is that the SP and other Atlas tools are constantly responding to user feedback and improving through the work of a talented programming team. As such, the manual will become out of date over time, but we will update it as often as possible after system updates.

Some items that we are looking forward to working on include:

  1. Greatly extend the current Area Report to integrate a far wider range of bio-environmental information within the Atlas. Comprehensive bio-oriented reports on any defined area will be of great use for local government, environmental impact assessments, land owners, land managers, researchers and others. As ever, an ‘area’ in the SP can be defined in any of 14 different ways.
  2. Develop a library of scripts, leveraging existing Atlas web services, to enable users of the R package to access Atlas data. This is in response to feedback from researchers that efficient access to Atlas data via R will be highly valuable (a sketch of this kind of web-service access follows this list). This work will also improve the web services themselves and deliver updated documentation on how to use them. We are still considering integrating more analysis tools within the portal itself, but this work with R will immediately support a wide range of analyses.
  3. Improve our approach to data on invasive, alien and pest species so that it parallels how we currently handle conservation status. This domain is of immense cost and value to Australia, so it is vital that we present a useful and consistent interface to the relevant information for the broadest range of applications. This includes the ability to answer questions such as: which invasive species occur in a given area? By further leveraging our lists tool, we will soon be making this information available in the extended Area Report mentioned above.
  4. Continue outreach to the scientific community.
    1. I visited Professor Steven Chown and Associate Professor Melodie McGeoch at Monash in September, and it was gratifying to see how the Atlas is being used in teaching and research there. We are looking at ways that the Atlas can better assist in this environment.
    2. A presentation on ‘Data Quality’ has been accepted for ESA2013. This continues to be a hot topic, and one that the Atlas continues to work on.
    3. We hope to have a resource available in the near future to help deliver targeted material for education and training purposes.
  5. In general, we continue to review and manage the spatial layers in the Spatial Portal as resources allow, so do let us know if there are particular layers and climate models/scenarios that you would like to have incorporated. Note that we have recently included the Koppen climate classification (see image above).
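As a taste of what the planned R scripts will wrap, below is a minimal sketch, in Python, of pulling occurrence records from the Atlas’s public web services. The endpoint URL and parameter names are assumptions based on the biocache occurrence search service; check the current web-service documentation (and the eventual R library) before relying on them.

```python
# Minimal sketch of querying the Atlas's occurrence web services.
# The endpoint and parameter names are assumptions for illustration;
# consult the current ALA web-service documentation before use.
import requests

BASE_URL = "https://biocache.ala.org.au/ws/occurrences/search"  # assumed endpoint

def search_occurrences(taxon_name, rows=20):
    """Fetch up to `rows` occurrence records matching a taxon name."""
    params = {"q": f'taxon_name:"{taxon_name}"', "pageSize": rows}
    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("occurrences", [])

if __name__ == "__main__":
    for occ in search_occurrences("Acacia acuminata", rows=5):
        print(occ.get("scientificName"),
              occ.get("decimalLatitude"), occ.get("decimalLongitude"))
```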

Feedback on any of these items would be valued.

Lee Belbin

We are regularly asked how we process and manage data. This blog post provides a reasonably technical overview of:

  1. some of the processes occurrence data goes through;
  2. how the results of this processing are visible to users; and
  3. how users can explore and filter data to be “fit for purpose”.

Note that these processes continue to evolve over time to better detect issues and address your needs, so your feedback is welcome.

General approach

Our philosophy is to:

  • Preserve the original data as provided by the data provider
  • Interpret the data in terms of the taxonomy provided by the National Species Lists, geospatial information and attribution.
  • Assert additional information about the data based on a range of tests. These assertions are designed to help Atlas users make clear decisions about data use.

In this approach, the Atlas does not make an overall assertion about the quality of data. This is based on the premise that we do not know (or need to know) the purpose to which the data will be put. In other words, each user’s fitness-for-purpose scenario is different, so we put as much information as possible in the hands of the user to help you make this call.

To take this point further: some data may be unusable to some users in some circumstances, but will still be valuable in other contexts. With this in mind, the Atlas will never withhold data from a search based on the results of the data processing, but will expose those results and allow users to filter certain data in or out of display and later analyses.
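To make this three-part philosophy concrete, here is a minimal sketch of how a processed record might be structured, with the raw values preserved alongside the interpreted values and a list of assertions. The field names are illustrative assumptions, not the Atlas’s actual internal schema.

```python
# Illustrative sketch of the preserve / interpret / assert approach.
# Field names are assumptions, not the Atlas's internal schema.
record = {
    # 1. Preserve: the original values exactly as supplied by the provider
    "raw": {
        "scientificName": "Acacia acuminata Benth.",
        "decimalLatitude": "-31.95",
        "decimalLongitude": "115.86",
    },
    # 2. Interpret: name matched to the National Species Lists,
    #    coordinates parsed and geospatially resolved
    "processed": {
        "matchedName": "Acacia acuminata",
        "decimalLatitude": -31.95,
        "decimalLongitude": 115.86,
        "stateProvince": "Western Australia",
    },
    # 3. Assert: outcomes of automated tests; the record is never hidden
    "assertions": ["inferredDuplicateRecord"],  # illustrative assertion name
}
```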

Simple diagram of some of the processes run over occurrence data

Assertions

As data is processed and ingested into the Atlas, a large number of tests are run against the data.  These result in assertions about the content and quality of the data.  All the assertion types and descriptions of the tests are outlined in this spreadsheet.
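The overall pattern is simple: every test is applied to every record, and failures are recorded as assertions rather than causing the record to be rejected. A rough sketch of that pattern, with assumed function and field names, is below; the real test suite is the one described in the spreadsheet linked above.

```python
# Sketch of the ingest pattern: run every test, collect assertions,
# keep the record regardless of the outcome. Names are illustrative.
def run_tests(record, tests):
    """Apply each test to a record and attach the resulting assertions."""
    assertions = []
    for test in tests:
        outcome = test(record)   # each test returns an assertion name or None
        if outcome is not None:
            assertions.append(outcome)
    record["assertions"] = assertions   # exposed to users, used for filtering
    return record
```

On the search side, these assertion names can then be used to include or exclude records, as the screenshots below show.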

Data quality summary displayed on record page

The results of the tests and assertions are visible on a record page and can also be used in searches, as well as for including or excluding records from display or analyses, as per the following screenshots:

Listing of tests run on a record page

The "record issues" facet which can be used to include/exclude records with specific assertions

Mapping of Acacia acuminata with the record issues facet selected. The occurrence points are coloured by the issues detected with the records. Within the Spatial Portal, users can select records with or without issues and produce a layer from this selection.

Taxonomy

The taxonomy provided with an original occurrence record can vary from a bare scientific name to a full classification from subspecies to kingdom. Scientific names are parsed using the useful ECAT name parser, a Java library developed by GBIF. This code extracts the key components of the name (generic name, specific epithet, authorship, etc.), taking into account the botanical and zoological nomenclatural rules for scientific names. The Atlas extends the GBIF ECAT name parser to handle phrase names as defined in the National Species List (NSL). The NSL includes the Australian Faunal Directory (AFD), the Australian Plant Name Index (APNI) and the Australian Plant Census. The classification is matched to taxa in the NSL.
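As a toy illustration of what a name parser does (this is not the GBIF ECAT parser, which handles far more cases, such as hybrids, phrase names and author teams), consider:

```python
# Toy illustration of name parsing: split a simple binomial into parts.
# Real parsers such as GBIF's ECAT library handle many more cases.
import re

NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+"      # generic name
    r"(?P<epithet>[a-z][a-z\-]+)"      # specific epithet
    r"(?:\s+(?P<authorship>.+))?$"     # optional authorship
)

def parse_name(scientific_name):
    match = NAME_PATTERN.match(scientific_name.strip())
    return match.groupdict() if match else None

print(parse_name("Acacia acuminata Benth."))
# {'genus': 'Acacia', 'epithet': 'acuminata', 'authorship': 'Benth.'}
```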

The type of name match is classified into one of a set of categories (as of 4/10/2013), as shown below.

Name matching metrics can be used to select records matched in a certain way.

Taxonomic name issues can be used to include/exclude records that may have issues, such as being associated with a misapplied name.

Taxonomic name issues that are flagged include:

  • Parent concept is synonym of child: Arises when a species is divided into one or more subspecies and the species name is marked as a synonym of one of the subspecies.
  • Match to misapplied name: A warning to indicate that the supplied scientific name has been misapplied in the past.
  • Matched to homonym: During the match a homonym was detected.  This will often cause the match to be applied at a higher level in the classification.
  • Affinity species: The supplied scientific name included an aff. marker.
  • Confer species: The supplied scientific name included a cf. marker.
  • Associated name excluded:  The scientific name is both an accepted name and excluded from another name.
  • Matched to excluded species: The scientific name matched to a species that is considered excluded from Australia.
  • Species plural: The scientific name was supplied with a spp. marker.

Further information about the Name Matching algorithms used by the ALA can be found in the ALA Names List wiki.

Duplicate detection

Detecting duplicate occurrence records is one of the tests performed during data processing. If a potential duplicate is detected, the record is flagged, which allows users to optionally discard duplicate records from searches, analysis and mapping. A discussion by Simon Bennett on duplicate records is available here. Duplicates may occur in observational data due to historical data merging between datasets. For specimen data, “duplicate” may be inaccurate, as there may be multiple specimens taken from the same individual or multiple specimens taken as part of the same collection event.

The Atlas uses the scientific name, decimal latitude, decimal longitude, collector and collection date to detect potential duplicate records. Here are some additional implementation details (a simplified sketch of the comparison logic follows this list):

  • Records are grouped at a species level. Synonyms are mapped to the accepted name, and the accepted name is used in this grouping.
  • Collection dates are duplicates where year, month and day are identical. Paired empty values are considered identical.
  • Collector names are compared; if one is null it is considered a duplicate, otherwise a Levenshtein distance is calculated, with an acceptable threshold indicating a duplicate.
  • Latitudes and Longitudes are duplicates when they are identical at the same precision. Null values are excluded from consideration.
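Here is the simplified sketch of the comparison logic promised above. The similarity threshold is an assumption, and difflib’s SequenceMatcher stands in for a proper Levenshtein-distance library; the Atlas’s actual implementation will differ in detail.

```python
# Simplified sketch of the duplicate comparison described above.
# The threshold is an assumption; SequenceMatcher stands in for a
# Levenshtein-distance implementation.
from difflib import SequenceMatcher

def similar_collector(a, b, threshold=0.85):
    """Collector names match if either is missing or they are near-identical."""
    if not a or not b:
        return True      # a null collector does not rule out a duplicate
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def is_potential_duplicate(rec_a, rec_b):
    """Compare two records already grouped under the same accepted name."""
    same_date = (rec_a["year"], rec_a["month"], rec_a["day"]) == \
                (rec_b["year"], rec_b["month"], rec_b["day"])
    same_place = (rec_a["decimalLatitude"] == rec_b["decimalLatitude"] and
                  rec_a["decimalLongitude"] == rec_b["decimalLongitude"])
    return same_date and same_place and \
           similar_collector(rec_a.get("collector"), rec_b.get("collector"))
```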

When a group of duplicate records is identified, one record is chosen as the “representative” of the group: the record that is most complete in terms of geospatial information and other metadata.

The duplicate detection process is run as a batch job across all data each week.

Duplicate detection status values from the facet. Users can use this facet to select records marked as duplicates by the criteria used to associate the records.

Here is an example record that has been marked as a duplicate. In this particular case, the specimens were possibly collected as part of the same collection event.

Associated record details for PERTH 8480311

Expert distribution outlier

As part of the work on fishmap, the Atlas worked with CSIRO CMAR to expose fish distributions. These distributions are represented as polygons describing where each species is believed to occur. Each distribution has been developed by an expert in the taxonomic group.

A similar suite of expert distributions has been developed for birds by BirdLife International. Both sets of distributions are used to add assertions to occurrence records that fall outside these polygons. As with many assertions, the fact that an assertion is made does not necessarily mean that the record is an ‘error’; in the case of these expert distributions, for example, a new observation may extend the species’ range.
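The underlying test is a point-in-polygon check. Below is a sketch using the shapely library with a made-up rectangular ‘range’; real expert distributions are far more detailed polygons, and the assertion name is illustrative.

```python
# Sketch of the expert-distribution test: flag records outside the polygon.
# The rectangle and the assertion name are made-up illustrations.
from shapely.geometry import Point, Polygon

expert_range = Polygon([(145.0, -20.0), (155.0, -20.0),
                        (155.0, -10.0), (145.0, -10.0)])  # toy rectangle

def outside_expert_distribution(lon, lat):
    """True if the occurrence point falls outside the expert polygon."""
    return not expert_range.contains(Point(lon, lat))

if outside_expert_distribution(160.0, -15.0):
    print("assert: recordOutsideExpertDistribution")  # illustrative name
```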

Chaetodon reticulatus (Reticulate Butterflyfish). Image by: J.E. Randall. Australian National Fish Collection Images. Licence: CC BY-NC 3.0

Distribution map for Reticulate Butterflyfish as supplied by CSIRO CMAR

The expert range and the occurrence record location are displayed on the record page:
Record outside of expert distribution area

We hope to add expert ranges for other groups as they become available. This process is run across all data each week.

Environmental outlier detection

Mountain thornbill. Image by: Tom Tarrant. Licence: http://creativecommons.org/licenses/by-nc-sa/2.0/

‘Environmental outlier’ detection is run across all occurrence data that have been classified to species level. Here, ‘environmental outlier’ means that an occurrence record lies outside one or more of the expected environmental ranges of the species. This check intersects all point locations for each species with 5 selected environmental surfaces and then runs an algorithm known as the reverse jackknife (see here for details; a simplified sketch follows).
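For the curious, here is a simplified sketch of one common formulation of the reverse jackknife (after Chapman 2005), applied to a single environmental variable. The critical-value formula is drawn from that formulation and is not a transcription of the Atlas’s implementation.

```python
# Simplified reverse-jackknife sketch for one environmental variable,
# after one common formulation (Chapman 2005). Illustrative only.
import math

def reverse_jackknife_outliers(values):
    """Return the values flagged as outliers at the tails of the distribution."""
    xs = sorted(values)
    n = len(xs)
    if n < 3:
        return []
    # Critical gap: larger samples and wider ranges tolerate larger gaps
    critical = (0.95 * math.sqrt(n) + 0.2) * (xs[-1] - xs[0]) / 50.0
    low_cut, high_cut = 0, n            # indices bounding the non-outlier core
    for i in range(n // 2):             # low tail: a large gap cuts off outliers
        if xs[i + 1] - xs[i] > critical:
            low_cut = i + 1
    for i in range(n - 1, n // 2, -1):  # high tail, scanning inward
        if xs[i] - xs[i - 1] > critical:
            high_cut = i
    return xs[:low_cut] + xs[high_cut:]

# A species whose values for one surface cluster tightly, plus one outlier:
print(reverse_jackknife_outliers([12.1, 12.3, 12.4, 12.6, 12.7, 19.5]))  # [19.5]
```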

The 5 environmental layers were selected because, as a group, they account for most of the terrestrial environmental variation in Australia at a 1 km resolution. It is recognised that these 5 layers may not cover the significant environmental determinants of all terrestrial species, or work equally well at all scales, but it is a start. For marine species, the environmental layers are less well developed, but watch this space.

For more information about the selection of the environmental layers, see Williams KJ, Belbin L, Austin MP, Stein JL, Ferrier S (2012) Which environmental variables should I use in my biodiversity model? International Journal of Geographical Information Science 26(11): 2009-2047. doi: 10.1080/13658816.2012.698015

This process is run across all species each week.

Mountain thornbill records in the spatial portal

Below is a display taken from an occurrence record that has been marked as a potential environmental outlier against 3 of the 5 environments. Records in blue are not considered outliers; records in red are considered potential outliers for the environmental surface shown. The actual record being viewed is considered a potential outlier against this environment and is coloured yellow.

Record considered an environmental outlier for 3 surfaces

Query assertions (new!)

As part of the Australian National Data Service (ANDS) funded collaboration with the Centre for Tropical Biodiversity & Climate Change and the eResearch Centre at James Cook University, the Atlas now supports what we are calling “query assertions”. These are pre-defined queries (based on, say, “this species in this location at this time”) that dynamically flag records matching the query. For example, a query such as “all records for the Atlas Nematode (fictitious) in the Condamine Alliance NRM region (or any polygon) with a date range of 1970 to 1980” can be flagged as, for example, “introduced”, or with some other criterion.

This gives expert users a quick way of flagging issues against a large number of records. This type of assertion also differs from the ad-hoc annotations provided for a single record (see “Flag an issue” on a record page) in that it is applied to all new data matching an existing query assertion as the data are loaded into the Atlas.
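Conceptually, a query assertion is just a stored predicate that is evaluated against each record, including new records as they arrive. The sketch below illustrates the idea; the class, field names and flag value are assumptions, not the Atlas’s implementation.

```python
# Sketch of a stored "query assertion" evaluated against records at load
# time. Class and field names are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class QueryAssertion:
    species: str
    region: object        # e.g. a shapely Polygon for the NRM region
    year_range: tuple     # (start, end), inclusive
    flag: str             # e.g. "introduced"

    def matches(self, record):
        return (record["species"] == self.species
                and self.year_range[0] <= record["year"] <= self.year_range[1]
                and self.region.contains(record["point"]))

def apply_query_assertions(record, stored):
    """At ingest, flag the record with every stored query it matches."""
    for qa in stored:
        if qa.matches(record):
            record.setdefault("assertions", []).append(qa.flag)
    return record
```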

There are blog articles on the development activities for the project here and here.

Mountain thornbill records as displayed in the JCU Edgar portal

A listing of records that have been marked with query assertions is available here. An example record that has been marked in this way is here. The annotation also has a link to other records that have been marked with the same assertion. Here is an example listing of records marked with a single assertion.

A record marked with a query assertion

This process is run weekly across all data. Note: there isn’t currently a user interface in the Atlas for adding this type of assertion (web services must be used), but we hope to add one in the near future.

Spatial validity

Records are marked as spatially suspect if they fail one or more of a subset of the spatial tests. As with other tests, users may include or exclude data with specific spatial assertions. All tests are outlined in this spreadsheet.

Examples of issues that will cause a record to be marked as suspect include (a sketch of a few of these checks follows this list):

  • Coordinates given as 0,0. Typically a result of bad default values for empty database fields.
  • Coordinates out of range (latitude > 90 or < -90, or longitude > 180 or < -180).
  • Marine species on land or vice-versa.
  • Supplied coordinates are the centre of a country.
  • The coordinates given are the centre of a State or Territory, suggesting they were generated after the collection event, erroneously, by software.
  • Records marked as geospatially suspect by users.
  • Environmental outliers.
  • Expert distribution outliers.
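A sketch of a few of these checks is below. The centroid coordinates and tolerance are illustrative assumptions; the real tests use authoritative gazetteer centroids and are listed in the spreadsheet referenced above.

```python
# Sketch of a few spatial-validity checks from the list above.
# Centroid and tolerance values are illustrative assumptions.
AUSTRALIA_CENTROID = (-25.27, 133.78)   # made-up illustrative centroid
TOLERANCE = 0.001                        # degrees

def spatial_assertions(lat, lon):
    issues = []
    if lat == 0 and lon == 0:
        issues.append("zeroCoordinates")        # bad default database values
    if not -90 <= lat <= 90 or not -180 <= lon <= 180:
        issues.append("coordinatesOutOfRange")
    if (abs(lat - AUSTRALIA_CENTROID[0]) < TOLERANCE and
            abs(lon - AUSTRALIA_CENTROID[1]) < TOLERANCE):
        issues.append("coordinatesCentreOfCountry")
    return issues

print(spatial_assertions(0, 0))   # ['zeroCoordinates']
```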

Automated tests and human annotations

The suite of Atlas tests cannot identify all errors. For example, a record may contain a valid scientific name, but the specimen may have been misidentified at the time of observation or subsequently. If the misidentified species occurs in the same environmental or spatial conditions, the error is unlikely to be detected automatically unless other tests flag an issue. Such a naming error may be detected, for example, by a taxonomist who recognises the record and knows about the corresponding museum specimen or a history of misidentification.

There are four possible combinations of detecting and correcting errors by automated tests and by people. Automated tests may be able to both detect and correct a range of errors, as noted above. In other cases, automated tests can detect errors but cannot correct them. For example, a record may be flagged as a genuine environmental outlier, but how does one correct the location of the observation?

In summary, the automatic tests within the Atlas of Living Australia are necessary but insufficient. Human intervention is required to design the tests and evaluate the results. In many cases, human intervention is also required to detect and correct errors that the automatic tests cannot: see http://www.pensoft.net/journals/zookeys/article/5438/a-specialist%E2%80%99s-audit-of-aggregated-occurrence-records-an-%E2%80%98aggregator%E2%80%99s%E2%80%99-perspective

The Atlas therefore values all annotations to records where potential issues are detected by users. The Atlas is also looking into methods to support the bulk import of record annotations where an external analysis has been performed by taxonomic or domain experts.

The Sandbox

You can run the Atlas tests against your own data by uploading it to the Atlas’ sandbox tool. Note: uploaded data is removed from the Atlas periodically. The tests run in the sandbox do not currently include duplicate detection, environmental outlier detection or expert distribution outlier detection, but this is another area we are working on at the moment.