We are regularly asked about how we process and manage data. This blog provides a reasonably technical overview of:
Note that these processes continue to evolve over time to better detect issues and address your needs, so your feedback is welcome.
Our philosophy is to:
In this approach, the Atlas does not make an overall assertion about the quality of data. This is based on the premise that we do not know (or need to know) the purpose to which the data will be put. In other words, each user’s fitness-for-purpose scenario is different – so we put as much information in the hands of the user as possible to help you make this call.
To take this point further – some data may be unusable to some users in some circumstances, but will still be valuable in other contexts. With this in mind, the Atlas will never restrict data from a search based on the results of the data processing, but will expose the results and allow users to filter certain data in or out of display and later analyses.
As data is processed and ingested into the Atlas, a large number of tests are run against the data. These result in assertions about the content and quality of the data. All the assertion types and descriptions of the tests are outlined in this spreadsheet.
The results of the tests and assertions are visible on a record page, and can also be used in searches to include or exclude records from display or analyses – as per the following screenshots:
The taxonomy provided with an original occurrence record can vary from a bare scientific name to a full classification at any rank from subspecies to kingdom. Scientific names are parsed using GBIF’s Java ECAT name parser library. This code extracts the key components of the name (genus, specific epithet, authorship, etc.), taking into account the botanical and zoological nomenclatural rules for scientific names. The Atlas extends the GBIF ECAT name parser to handle Phrase Names as defined in the National Species List (NSL). The NSL includes the Australian Faunal Directory (AFD), the Australian Plant Name Index (APNI) and the Australian Plant Census. The classification is matched to taxa in the NSL.
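To give a feel for what name parsing involves, here is a much-simplified sketch in Python. This is not the GBIF ECAT parser (which handles hybrids, ranks, phrase names and the full nomenclatural rules); it only splits a plain binomial into genus, specific epithet and authorship, and all names in the snippet are illustrative.

```python
import re

# Hedged sketch: a toy name parser, NOT the GBIF ECAT library.
# It only handles a simple "Genus epithet Authorship" binomial.
NAME_RE = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)"        # capitalised genus
    r"\s+(?P<epithet>[a-z][a-z-]+)"   # lower-case specific epithet
    r"\s*(?P<authorship>.*)$"         # whatever remains: authorship
)

def parse_name(name: str) -> dict:
    """Return the components of a simple binomial, or {} if unparseable."""
    m = NAME_RE.match(name.strip())
    return {k: v.strip() for k, v in m.groupdict().items()} if m else {}

parts = parse_name("Macropus rufus (Desmarest, 1822)")
# parts == {"genus": "Macropus", "epithet": "rufus",
#           "authorship": "(Desmarest, 1822)"}
```

Once a name is decomposed like this, each component can be matched against the NSL independently, which is what makes fuzzy and partial matches possible.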
The type of name match is classified into one of the categories on the left (as of 4/10/2013).
Taxonomic name issues that are flagged include:
Further information about the Name Matching algorithms used by the ALA can be found in the ALA Names List wiki.
Detecting duplicate occurrence records is one of the tests performed during data processing. If a potential duplicate is detected, the record is flagged, which allows users to optionally discard duplicate records from searches, analysis and mapping. A discussion by Simon Bennett on duplicate records is available here. Duplicates may occur in observational data due to historical merging between datasets. For specimen data, “duplicate” may be inaccurate, as there may be multiple specimens taken from the same individual, or multiple specimens taken as part of the same collection event.
The Atlas uses the scientific name, decimal latitude, decimal longitude, collector and collection date to detect potential duplicate records. Here are some additional implementation details:
When a group of duplicate records is identified, one record is chosen as the “representative” of the group. The representative record is the one that is most complete in terms of geospatial information and other metadata.
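The grouping described above can be sketched as follows. This is a hedged illustration, not the Atlas implementation: the field names, the coordinate rounding precision and the “most non-empty fields wins” completeness rule are all assumptions standing in for the Atlas’ actual schema and rules.

```python
from collections import defaultdict

def duplicate_key(rec: dict) -> tuple:
    # Composite key over the fields the text names: scientific name,
    # decimal latitude/longitude, collector and collection date.
    return (
        rec["scientificName"].lower(),
        round(rec["decimalLatitude"], 4),   # assumed rounding precision
        round(rec["decimalLongitude"], 4),
        rec["collector"].lower(),
        rec["eventDate"],
    )

def group_duplicates(records: list) -> dict:
    """Bucket records sharing the same composite key."""
    groups = defaultdict(list)
    for rec in records:
        groups[duplicate_key(rec)].append(rec)
    return groups

def representative(group: list) -> dict:
    # Stand-in completeness rule: the record with the most
    # non-empty fields represents the duplicate group.
    return max(group, key=lambda r: sum(1 for v in r.values()
                                        if v not in (None, "")))
```

Keying on normalised values (lower-casing names, rounding coordinates) is what lets near-identical records from different datasets fall into the same bucket.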
The duplicate detection process is run as a batch job across all data each week.
Here is an example record that has been marked as a duplicate. In this particular case the specimens were possibly collected as part of the same collection event.
As part of the work on fishmap, the Atlas worked with CSIRO CMAR to expose fish distributions. These distributions are represented as polygons where each species is believed to occur. Each distribution has been developed by an expert in the taxonomic group.
A similar suite of expert distributions has been developed for birds by BirdLife International. Both sets of distributions are used to add assertions to occurrence records that fall outside these polygons. As with many assertions, the fact that an assertion is made does not necessarily mean that the record is an ‘error’. For example, in the case of these expert distributions, a new observation may extend the species’ range.
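At its core, this check is a point-in-polygon test: does the record’s coordinate fall inside the expert distribution? Below is a hedged sketch using the standard ray-casting algorithm; the real distributions are complex multi-part polygons handled by spatial tooling, and the field names here are illustrative.

```python
def point_in_polygon(lon: float, lat: float, polygon: list) -> bool:
    """Ray-casting test. `polygon` is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count crossings of an eastward horizontal ray from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > lon:
                inside = not inside
    return inside

def outside_expert_range(record: dict, polygon: list) -> bool:
    # Flag, never suppress: a record outside the polygon may simply
    # be a genuine range extension rather than an error.
    return not point_in_polygon(record["decimalLongitude"],
                                record["decimalLatitude"], polygon)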
We hope to add expert ranges for other groups as they become available. This process is run across all data each week.
‘Environmental outlier’ detection is run across all occurrence data that has been classified to species level. Here, ‘environmental outlier’ means that an occurrence record lies outside one or more of the expected environmental ranges of the species. This check intersects all point locations for each species with five selected environmental surfaces, and then runs an algorithm known as the Reverse Jackknife. See here for details.
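The flavour of the test can be conveyed with a simplified leave-one-out (jackknife-style) sketch. To be clear, this is not the Reverse Jackknife algorithm the Atlas actually runs (which uses a different critical-value calculation); it is a hedged stand-in showing the general idea: for each species, a record’s value on an environmental surface is compared against the distribution of the remaining values.

```python
import statistics

def environmental_outliers(values: list, threshold: float = 3.0) -> list:
    """Return indices of values more than `threshold` standard
    deviations from the mean of the OTHER values (leave-one-out)."""
    outliers = []
    for i, v in enumerate(values):
        rest = values[:i] + values[i + 1:]
        mean = statistics.mean(rest)
        sd = statistics.stdev(rest)
        if sd > 0 and abs(v - mean) / sd > threshold:
            outliers.append(i)
    return outliers
```

Running this per species against each of the five environmental surfaces yields, for every record, a list of surfaces on which it looks anomalous – which is exactly the shape of the assertion shown in the screenshots below.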
The five environmental layers were chosen because, as a group, they account for most of the terrestrial environment of Australia at a 1 km resolution. It is recognized that these five layers may not cover the significant environmental determinants of all terrestrial species, or work at all scales, but it is a start. For marine species, the environmental layers are less well developed, but watch this space.
For more information about the selection of the environmental layers, see Williams, K.J., Belbin, L., Austin, M.P., Stein, J.L. and Ferrier, S. (2012). “Which environmental variables should I use in my biodiversity model?”, International Journal of Geographical Information Science, vol. 26, no. 11, pp. 2009–2047. DOI: 10.1080/13658816.2012.698015, http://www.scopus.com/inward/record.url?eid=2-s2.0-84867849849&partnerID=MN8TOARS
This process is run across all species each week.
Below is a display taken from an occurrence record that has been marked as a potential environmental outlier against three of the five environmental surfaces.
Records in blue are not considered outliers. The records in red are considered potential outliers for this environmental surface. The actual record being viewed is considered a potential outlier against this environment and is coloured yellow.
As part of the Australian National Data Service (ANDS) funded collaboration with the Centre for Tropical Biodiversity & Climate Change and the eResearch Centre at James Cook University, the Atlas now supports what we are calling “query assertions”. These are pre-defined queries (based on, say, “this species in this location at this time”) that dynamically flag records matching the query. For example, a query such as “all records for the Atlas Nematode (fictitious) in the Condamine Alliance NRM (or any polygon) with a date range of 1970 to 1980” can flag matching records as, for example, “introduced”.
This gives expert users a quick way of flagging issues against a large number of records. This type of assertion also differs from ad-hoc annotations provided for a single record (see “Flag an issue” on a record page) in that they are applied to all new data that matches an existing query assertion as it is loaded into the Atlas.
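A query assertion can be thought of as a stored predicate applied to every record, including newly loaded ones. The sketch below is a hedged illustration only – it uses a bounding box where the Atlas supports arbitrary polygons, and the field names, species and flag value are all assumptions taken from the fictitious example above.

```python
from datetime import date

def make_query_assertion(species, bbox, start, end, flag):
    """Build a function that flags any record matching the stored query.
    `bbox` is (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox

    def apply(rec: dict) -> dict:
        matches = (
            rec["scientificName"] == species
            and min_lat <= rec["decimalLatitude"] <= max_lat
            and min_lon <= rec["decimalLongitude"] <= max_lon
            and start <= rec["eventDate"] <= end
        )
        if matches:
            # Attach the assertion; records outside the query are untouched.
            rec.setdefault("assertions", []).append(flag)
        return rec

    return apply

# Illustrative use, mirroring the fictitious example in the text:
flag_introduced = make_query_assertion(
    "Atlas Nematode", (148.0, -29.0, 152.0, -26.0),
    date(1970, 1, 1), date(1980, 12, 31), "introduced")
```

Because the predicate is stored rather than applied once, it can be re-run against each weekly data load, which is what distinguishes query assertions from one-off annotations on a single record.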
A listing of records that have been marked with query assertions is available here. An example record that has been marked in this way is here. The annotation also includes a link to other records that have been marked with the same assertion. Here is an example listing of records marked with a single assertion.
This process is run weekly across all data. Note: there isn’t currently a user interface in the Atlas for adding this type of assertion (web services must be used), but we hope to add one in the near future.
Records are marked as spatially suspect if they fail one or more of a subset of the spatial tests. As with other tests, users may include/exclude data with specific spatial assertions. All tests are outlined in this spreadsheet.
Examples of issues that will cause a record to be marked as suspect include:
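As a hedged sketch, a few basic spatial checks of this kind might look like the following. The issue codes and thresholds here are illustrative examples only, not the Atlas’ actual test names – the authoritative list is in the spreadsheet referred to above.

```python
def spatial_issues(lat, lon):
    """Return a list of illustrative spatial issue codes for a coordinate."""
    issues = []
    if lat is None or lon is None:
        return ["MISSING_COORDINATES"]
    if not (-90 <= lat <= 90) or not (-180 <= lon <= 180):
        issues.append("COORDINATES_OUT_OF_RANGE")
    if lat == 0 and lon == 0:
        issues.append("ZERO_COORDINATES")
    # Australian latitudes are negative; a positive latitude paired with a
    # plausible Australian longitude suggests a sign error or transposition.
    if lat > 0 and 110 <= lon <= 155:
        issues.append("POSSIBLY_NEGATED_LATITUDE")
    return issues
```

As with all Atlas assertions, a record failing such a check is flagged rather than removed, and users choose whether to filter it out.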
The suite of Atlas tests cannot identify all errors. For example, a record may contain a valid scientific name but it may have been misnamed at the time of observation or subsequently. If the misnamed species occurs in the same environmental or spatial conditions, it is unlikely to be automatically detected, unless other tests flag an issue. Such a naming error may be detected for example, by a taxonomist who recognizes the record and knows about the corresponding museum specimen or a history of misnaming.
There are four possible combinations of detecting and correcting errors by automated tests and by people. Automated tests may be able to both detect and correct a range of errors as noted above. In some cases, automated tests can detect errors but cannot correct them. For example, a record may be flagged as being a genuine environmental outlier but how does one correct the location of the observation?
In summary, the automated tests within the Atlas of Living Australia are necessary but not sufficient. Human intervention is required to design the tests and evaluate the results. In many cases, human intervention is also required to detect and correct errors that the automated tests cannot; see http://www.pensoft.net/journals/zookeys/article/5438/a-specialist%E2%80%99s-audit-of-aggregated-occurrence-records-an-%E2%80%98aggregator%E2%80%99s%E2%80%99-perspective
The Atlas therefore values all annotations to records when potential issues are detected by users. The Atlas is looking into methods that will support the bulk import of record annotations where an external analysis has been performed by taxonomic/domain experts.
You can run the Atlas tests against your data by uploading it to the Atlas’ sandbox tool. Note: uploaded data is removed from the Atlas periodically. The tests run in the sandbox do not currently include duplicate detection, environmental outlier detection or expert distribution outlier detection – but this is another area we are working on at the moment.