International data quality workshops focus on fitness for use

At a recent meeting in Gainesville, Florida, an international group finalised a standard suite of data quality tests for use across biodiversity platforms.

Data quality has been an important issue for digital biodiversity data since the inception of platforms such as the Global Biodiversity Information Facility (GBIF) and the Atlas of Living Australia (ALA).

Other than data availability, data quality is probably the most significant concern for users of biodiversity data, especially in the research community. For this reason, GBIF and Biodiversity Information Standards (TDWG) joined forces in 2015 to establish a formal Interest Group on data quality. The ALA has taken a leading role in this group.

Three task groups were set up to establish a framework on data, data quality tests and assertions, and a use case library. Members include the ALA, Integrated Digitized Biocollections (iDigBio), GBIF and Kurator.

One of the first outcomes was agreement that data quality is better expressed as ‘fitness for use’. Data that suits some applications does not suit others, and judgements about data quality are often personal because researchers have their own views on suitability and methodology.

Task Group 2, chaired by Lee Belbin (Science Advisor Data, ALA), was charged with establishing a suite of core tests that data aggregators, such as the ALA, iDigBio and GBIF, can apply to help users identify data issues. In January 2018, the task group met at iDigBio in Gainesville, Florida, and finalised a set of core tests and assertions. These core tests will be implemented initially across three international platforms – GBIF, the ALA and iDigBio.

This outcome is the result of many hours of work. The group reviewed over 250 tests currently in use by eight agencies around the world. The tests were classified, evaluated and refined, and new tests were developed. The final suite of 98 tests and their assertions is based on the Darwin Core standard and includes code, documentation and test data for each test.

Data quality is a pressing issue for the ongoing success of digital biodiversity platforms. The TDWG Interest Group hopes an internationally agreed, standard suite of core tests can be implemented by all data providers and data collectors, enabling greater and more appropriate use of biodiversity data.

For more information, please contact Lee Belbin or visit https://github.com/tdwg/bdq.

Members of Task Group 2 (Data Quality Tests and Assertions) attending the Gainesville workshop. Left to right: Lee Belbin (TG2 Chair, ALA), Arthur Chapman (Data Quality Interest Group Chair, Australia), Paul Morris (Kurator, USA), John Wieczorek (Mr Darwin Core, Argentina), Paula Zermoglio (Vocabulary Task Group 4 Chair). Missing is Alex Thompson of iDigBio (USA).