Sharing data through the Atlas

26 October, 2009

This is a DRAFT – please send us your comments and suggestions.

Background

Different researchers and institutions capture and store data in the forms and combinations which best meet their needs. However, in order to make these data more widely accessible and to ensure that they can be reused for different purposes, data providers need to consider the appropriate way to expose their data, using structures and terms which can be recognised by others. Once these decisions have been made, they need to ensure that their data are transformed (or “mapped”) into these standard forms, and that the data set is associated with a good description and additional information on the source and ownership of the data, again in a form which can be understood by other users.

Sharing data therefore depends on the consistent use of agreed standards. Many different standards exist for different purposes. This document identifies the standards which are being adopted by the Atlas of Living Australia and seeks to explain how they can be used to improve the reuse of Australian biodiversity data.

TDWG produces an annual Technical Roadmap which provides recommendations for best practices in data sharing between biodiversity informatics projects. This document seeks to reflect these recommendations. The Technical Roadmap is available as a PDF (106KB).

The ALA and data standards

The ALA aims to bring together data from many sources. To do this it will support a range of different data standards, many of which are already in use by various communities around Australia. The ALA aims to shield users from this complexity by providing services which handle these source data formats and offer consistent views of merged data.

The ALA will therefore respond to developing needs and requirements by continuing to identify and support relevant data standards. This page provides a snapshot of the range of standards which are already recognised as important. Additional information will be added as it becomes available.

The information on this page has been divided into three major topics, each with a number of sub-topics:

  1. Data standards – specific information on standards supported for sharing different types of data.
    1. Species occurrence data (specimens and observations)
    2. Names, classifications and checklist data
    3. Structured descriptions and keys
    4. Species fact sheets
    5. Images and other multimedia
  2. Protocols – general information on sharing data in ways that the ALA and end users can access.
    1. TAPIR (and DiGIR and BioCASe)
    2. Tab-delimited and comma-separated data
    3. LSIDs and RDF data
    4. OAI Protocol for Metadata Harvesting (OAI-PMH)
    5. Plain URL
  3. Metadata standards – general information on describing data resources and registering them with the ALA.
    1. Dublin Core
    2. Ecological Metadata Language (EML)

The term “metadata” refers to the information a data provider supplies to describe a data set and to help users to access it. This may include descriptions of the contents of the data set, how the data were collected, information about the collectors and owners of the data and the uses they approve for the data, and technical details required to access or interpret the data.

As a quick shorthand – when considering the task of sharing data – the specific data standards address the question of what information can be shared, the protocols address the question of how the information is to be shared and accessed, and the metadata standards address the question of why users should be interested in accessing the information. (This is an oversimplification, but may help to explain the significance of the following sections.)

The section on protocols includes some information on software packages which implement these standards.

Please consider the question of copyright and usage restrictions when making data available. The ALA has developed some draft guidelines on the subject and recommends the adoption of Creative Commons licences in most situations.

Since the ALA deals with biodiversity data, the expectation is that all data items will include a label or data element identifying the species or other taxon covered by the resource. Where feasible, data providers are encouraged to use scientific names found in the AFD or the APC, or in the Catalogue of Life (particularly for any taxa not found in AFD/APC). However, it is recognised that some data sets (particularly historical collections and literature) use names which may not be currently accepted species names. In such situations, data providers are advised to use the scientific names found in the original sources. The ALA will develop or use Taxonomic Services to attempt to map such data appropriately to current species concepts.

Over the coming months, this document will be enhanced with examples for each of the standards identified.

For more information on any of these recommendations, please contact us.

1. Data standards

The ALA aims to enable access to any type of biodiversity information. The following subsections provide guidelines for some major classes of data. These represent the classes of data on which the ALA particularly expects to focus in the earliest stages of its development. As the project proceeds, other classes will be handled in more detail.

Many of the data standards below include a unique identifier for each record. The ALA strongly recommends that all data providers supply such identifiers. Some data objects will already have an appropriate globally unique identifier. Publications may be associated with a DOI. Species characteristics and other terms from controlled vocabularies may be associated with stable HTTP URIs as identifiers. In other cases, the ALA follows the guidelines from TDWG and recommends the use of Life Science Identifiers (LSIDs) for this purpose. Constructing an LSID to serve as a record identifier gives a good guarantee of uniqueness and also allows the data provider to offer LSID resolution as a means for users to access the associated data directly.

1a. Species occurrence data (specimens and observations)

Quick recommendations

Detailed recommendations

Two families of data standards are in wide use around the world for exchanging data records for individual specimens in natural history collections and herbaria (including living collections such as culture collections and seed banks) and also for observations of living organisms in the field:

Each of these standards has been used in several different versions. The ALA will support the same range of versions supported by GBIF (see http://www.gbif.org/participation/participant-nodes/resources/how-to-establish-a-participant-node/):

Both Darwin Core and ABCD were developed through the work of TDWG, which has recently been reworking all of its data standards to make them more compatible and easier to use in a wide range of different situations. As part of this activity, TDWG has developed a TaxonOccurrence vocabulary which will replace both Darwin Core and ABCD. As this standard is more widely adopted, the ALA will support and promote it as a preferred data model.

It should be noted that Darwin Core, ABCD and the TaxonOccurrence vocabulary are all designed to handle the exchange of “presence” data, i.e. data sets which comprise records of the actual occurrence of different species. These data standards are inadequate to represent the full complexity of many ecological data sets. In particular, they lack any standard mechanism for expressing the level and standardisation of recording effort or for recording the absence of a species of interest. The TDWG Observation and Specimen Records Interest Group is exploring ways to represent this further level of information, and the ALA will participate in this process. In the meantime, the ALA encourages data providers to use one of these existing standards to share their data and to provide a text description of any underlying data collection methods in the metadata for the data set.

Some data sets (particularly ecological data sets) will include data elements for which there is no corresponding property in Darwin Core or in the TaxonOccurrence vocabulary. In such cases data providers are recommended to use appropriate terms from other vocabularies to label these additional elements (see for example the range of vocabularies and ontologies listed by Marine Metadata Interoperability or SPIRE). If these elements are of general importance to biodiversity projects, it is also recommended to initiate discussion with TDWG to make standard recommendations on their representation.

1b. Names, classifications and checklist data

Quick recommendations

Detailed recommendations

Taxonomic and nomenclatural data are frequently much more complex than species occurrence data, with many nested relationships between data items. Different data sets contain different combinations of nomenclatural data (information on the origin and validity of scientific names), taxonomic data (judgments on which names correspond to real species, how these species are to be classified and how they relate to earlier names and species concepts), bibliographic data (details of nomenclatural and taxonomic literature) and often a range of legislative, distributional and other information.

In addition, the hierarchical nature of classifications and checklists is handled in different ways in different databases. Some databases have a single table to accommodate taxa of all ranks, and rely on parent-child relationships and explicit taxon rank elements. Others flatten the hierarchy and assign different ranks to different database columns. There is similar variation in the way that different databases model the relationships between the names each database accepts for a species and the various synonyms identified in the database for the same species.
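The two layouts described above can be converted into one another. The following is a minimal sketch, assuming a hypothetical parent-child table held as a list of dictionaries; the field names (`id`, `parent_id`, `rank`, `name`) and the sample taxa are illustrative only and are not taken from any ALA schema. It walks each species record up through its parents and produces one flattened row with a column per rank.

```python
from typing import Dict, List, Optional

# Hypothetical parent-child table: one row per taxon, whatever its rank.
TAXA: List[Dict[str, Optional[object]]] = [
    {"id": 1, "parent_id": None, "rank": "family", "name": "Macropodidae"},
    {"id": 2, "parent_id": 1, "rank": "genus", "name": "Macropus"},
    {"id": 3, "parent_id": 2, "rank": "species", "name": "Macropus rufus"},
    {"id": 4, "parent_id": 2, "rank": "species", "name": "Macropus giganteus"},
]


def flatten(taxa, leaf_rank="species"):
    """Return one row per taxon of leaf_rank, with a column for each rank."""
    by_id = {t["id"]: t for t in taxa}
    rows = []
    for taxon in taxa:
        if taxon["rank"] != leaf_rank:
            continue
        row = {}
        seen = set()  # guards against accidental cycles in the parent links
        current = taxon
        while current is not None and current["id"] not in seen:
            seen.add(current["id"])
            row[current["rank"]] = current["name"]
            # by_id.get(None) returns None, which ends the walk at the root.
            current = by_id.get(current["parent_id"])
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in flatten(TAXA):
        print(row)
```

Going in the other direction (from rank columns to parent-child rows) loses nothing as long as every rank column is filled, which is one reason the single-table layout is often the easier starting point.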

All of this makes it hard to provide generic recommendations for sharing data of this kind. To simplify the problem slightly, we can distinguish data sets according to their primary purpose:

There are several relevant data standards which have been used to share such data. The ALA will seek to support data shared using any of the following models:

Nomenclatural and taxonomic databases can be exposed as a TCS document (accessed via a Plain URL) or as a set of smaller TCS documents each representing a single taxon name or taxon concept (and accessed via a TAPIR search interface). It is, however, preferable in such cases to use the TaxonName and TaxonConcept vocabularies and to use LSIDs (and perhaps also OAI-PMH) in addition to, or instead of, TAPIR.

For other data sets, it will often be more appropriate simply to share the entire data set as a single document in a tab-delimited or comma-separated format, with terms from the TaxonName and/or TaxonConcept vocabularies or the SPICE Common Data Model as column headings and explanation of the purpose of the list included as part of the metadata for the data set. This approach may be applicable for listing protected species (red lists, CITES, etc.), species exhibiting some characteristic (e.g. endemic, invasive, venomous or leaf-mining taxa) or checklists for protected areas or localities.

1c. Structured descriptions and keys

Quick recommendations

Detailed recommendations

Resources intended to assist users with identifying a set of organisms can be divided into two main categories:

The ALA is working with the Taxonomic Resource Information Network (TRIN), IdentifyLife, the Encyclopedia of Life (EOL), Key To Nature and others to explore ways to manage access to descriptive data and identification tools. For the present, the recommendation is to share data using Delta or SDD wherever possible, and to register other resources as URLs with associated metadata.

1d. Species fact sheets

Quick recommendations

Detailed recommendations

TDWG (in conjunction with GBIF, EOL and a range of other projects) has been developing the Species Profile Model as a vocabulary for use in advertising and exchanging blocks of information about different species.

This vocabulary is still under development but is intended to address a number of needs. For the ALA, particular benefits of SPM are likely to include:

At present all of these uses are still being explored. The ALA would be interested to hear from any projects or researchers interested in trial use of SPM.

1e. Images and other multimedia

Quick recommendations

Detailed recommendations

Images and other multimedia resources are typically shared across the Internet in three different ways:

This variation is a complication in the task of managing access to image resources. Any of these three approaches may meet a user’s needs to view images and other multimedia objects for a particular organism, but only the first approach easily supports display of alternate views (e.g. thumbnails). The ALA will allow data providers to specify in their metadata which of these three approaches relates to each data resource, either on a record-by-record basis or for an entire data set (see below).

The preferred approach to sharing metadata about images (including the actual links to the images) will be OAI-PMH using properties from the standard OAI Dublin Core mapping (oai_dc). Where this is not feasible, a set of metadata fields for every image in a collection can be shared as a tab-delimited or comma-separated file. These metadata fields may include:

Subject and Identifier must be supplied for every record, but the other Dublin Core properties may be supplied on a record-by-record basis or as a set of default values within the metadata for the entire data set.
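The rule above (Subject and Identifier on every record, other properties optionally inherited from data-set defaults) can be sketched as follows. This is an illustration only: the lower-case keys, the function name `complete_record`, and the sample values are assumptions made for the example, not an ALA interface. `creator` and `rights` are standard Dublin Core elements chosen here simply to show how defaults are inherited.

```python
REQUIRED = ("subject", "identifier")

# Hypothetical data-set level defaults supplied once in the metadata.
DATASET_DEFAULTS = {
    "creator": "Example Museum",
    "rights": "Creative Commons Attribution",
}


def complete_record(record, defaults):
    """Merge data-set defaults into one image record.

    Subject and Identifier must come from the record itself; any other
    property falls back to the data-set default when the record omits it.
    """
    missing = [name for name in REQUIRED if not record.get(name)]
    if missing:
        raise ValueError("record is missing: " + ", ".join(missing))
    merged = dict(defaults)
    # Only non-empty record values override the defaults.
    merged.update({key: value for key, value in record.items() if value})
    return merged


if __name__ == "__main__":
    image = {
        "subject": "Macropus rufus",
        "identifier": "http://example.org/images/1.jpg",
    }
    print(complete_record(image, DATASET_DEFAULTS))
```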

The ALA will be able to use OAI-PMH or the tab-delimited/comma-separated data set to harvest basic metadata about the multimedia objects in a collection.

This same set of Dublin Core properties can also be used for registering a single multimedia object as an online resource.

2. Protocols

The ALA will use a range of standard approaches to link to data resources. Each has particular benefits.

In addition to those protocols addressed in the following subsections, the ALA expects to support the Open Geospatial Consortium (OGC) Web Feature Service (WFS) for exchange of species occurrence data (using properties from the Darwin Core or the TaxonOccurrence vocabulary). Specific recommendations will be added at a later date on the use of WFS and other OGC standards.

2a. TAPIR (and DiGIR and BioCASe)

TAPIR is the current version of the remote query protocol developed and promoted by TDWG. It is effectively an enhanced replacement for the two earlier TDWG data access protocols, DiGIR and BioCASe. The ALA will interface with existing DiGIR or BioCASe resources but recommends the use of TAPIR for future implementations.

TAPIR is a general-purpose language for querying remote databases and for retrieving data in formats suited to the needs of the requestor. It relies on the existence of agreed community data models. Each such model provides a set of agreed properties which can be mapped against a range of distributed databases, even though their underlying data structures may be very different.

For example, a collection database, an ecological data set and a repository of biodiversity image metadata may all contain fields which represent the scientific name of an organism and the spatial coordinates and date associated with a particular occurrence of that organism. TAPIR allows all of these databases to advertise the fact that they have records containing fields which correspond to the ScientificName, EarliestDateCollected, Latitude and Longitude concepts from a model such as the Darwin Core. A client application can then send standard requests (as XML documents or a set of parameters encoded in a URL) to each of these resources and receive the results back as an XML response document. The format of the response document and the data elements to be inserted into this format can be defined by the client requesting the data. This makes it possible, for example, for the client to request the results as a KML file which can be displayed directly in Google Earth or Google Maps.
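The idea of mapping differently structured databases onto shared concepts can be sketched outside TAPIR itself. The example below is not TAPIR code and does not use any TAPIR software; it only illustrates the mapping step, assuming two hypothetical local schemas whose field names (`taxon_name`, `coll_date`, and so on) are invented for the example. The four concept names are those mentioned in the paragraph above.

```python
# Concepts named in the text, taken from a model such as the Darwin Core.
CONCEPTS = ("ScientificName", "EarliestDateCollected", "Latitude", "Longitude")

# Hypothetical mappings: shared concept -> local field name.
COLLECTION_MAPPING = {
    "ScientificName": "taxon_name",
    "EarliestDateCollected": "coll_date",
    "Latitude": "lat_dd",
    "Longitude": "lon_dd",
}
IMAGE_MAPPING = {
    "ScientificName": "depicts",
    "EarliestDateCollected": "taken_on",
    "Latitude": "gps_lat",
    "Longitude": "gps_lon",
}


def to_shared_model(local_record, mapping):
    """Re-express one local record using the shared concept names."""
    return {
        concept: local_record.get(mapping[concept])
        for concept in CONCEPTS
        if concept in mapping
    }


if __name__ == "__main__":
    specimen = {"taxon_name": "Macropus rufus", "coll_date": "1987-03-14",
                "lat_dd": -23.7, "lon_dd": 133.9}
    photo = {"depicts": "Macropus rufus", "taken_on": "2008-11-02",
             "gps_lat": -31.9, "gps_lon": 141.5}
    print(to_shared_model(specimen, COLLECTION_MAPPING))
    print(to_shared_model(photo, IMAGE_MAPPING))
```

Once both records use the same keys, a client can treat them uniformly, which is the property the remote query protocol relies on.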

TAPIR supports five types of request:

Several implementations of the TAPIR protocol are available. The TAPIR server implementations all include configuration interfaces to assist data providers in mapping their relational databases to standards such as Darwin Core, ABCD and the TDWG TaxonOccurrence vocabulary. Available software includes:

The ALA is working with GBIF and others to simplify the installation and use of these tools.

For more information on TAPIR, see:

2b. Tab-delimited and comma-separated data

TAPIR is a powerful tool for supporting remote access to a database but some data providers may not need the flexibility it offers, or may be unable to run the TAPIR software on a web server. In such circumstances it may be simpler to share a flat file representation of the data. The simplest approach, supported by database and spreadsheet software, is to export the content as tab-delimited or comma-separated data and then to register a URL from which the data file can be downloaded. This makes it possible to associate the data with any relevant descriptive metadata. Note that data should always be shared using UTF-8 as the character encoding.

One possible weakness in such an approach is that it may be difficult for users to interpret the significance of the various columns in such a data set. The ALA makes the following recommendations to address this problem:

The property identifiers should be constructed using either the short or long form specified in this table. The long forms are formal URIs and are preferred since they should allow other clients to interpret the data correctly. The short forms are suggested standard abbreviations likely only to be interpreted correctly by the ALA and similar biodiversity informatics projects.

Vocabulary | Long form | Short form
Dublin Core | Element URI (e.g. “http://purl.org/dc/elements/1.1/description”) | Prefix property name with “dc:” (e.g. “dc:description”)
TaxonOccurrence | Prefix property name with “http://rs.tdwg.org/ontology/voc/TaxonOccurrence#” (e.g. “http://rs.tdwg.org/ontology/voc/TaxonOccurrence#catalogueNumber”) | Prefix property name with “to:” (e.g. “to:catalogueNumber”)
TaxonConcept | Prefix property name with “http://rs.tdwg.org/ontology/voc/TaxonConcept#” (e.g. “http://rs.tdwg.org/ontology/voc/TaxonConcept#nameString”) | Prefix property name with “tc:” (e.g. “tc:nameString”)
TaxonName | Prefix property name with “http://rs.tdwg.org/ontology/voc/TaxonName#” (e.g. “http://rs.tdwg.org/ontology/voc/TaxonName#year”) | Prefix property name with “tn:” (e.g. “tn:year”)
Other vocabulary | Property URI | N/A
Unmatched term | Plain text column name | N/A
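The relationship between the short and long forms in the table can be sketched in a few lines. This is a minimal illustration, not an ALA tool: the function names `long_form` and `write_tsv` and the sample row are hypothetical. It expands short-form headings to their URIs, leaves other URIs and plain-text column names untouched, and writes a tab-delimited file encoded as UTF-8.

```python
import csv
import io

# Prefixes from the table above: short form -> URI prefix.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "to": "http://rs.tdwg.org/ontology/voc/TaxonOccurrence#",
    "tc": "http://rs.tdwg.org/ontology/voc/TaxonConcept#",
    "tn": "http://rs.tdwg.org/ontology/voc/TaxonName#",
}


def long_form(heading):
    """Expand a short-form heading such as 'to:catalogueNumber' to its URI.

    Headings with an unknown prefix (including full URIs) and plain-text
    column names are returned unchanged.
    """
    prefix, separator, name = heading.partition(":")
    if separator and prefix in PREFIXES:
        return PREFIXES[prefix] + name
    return heading


def write_tsv(headings, rows):
    """Return the data as UTF-8 encoded, tab-delimited bytes."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([long_form(h) for h in headings])
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


if __name__ == "__main__":
    data = write_tsv(
        ["to:catalogueNumber", "dc:description", "field notes"],
        [["A-001", "Pressed specimen", "collected after rain"]],
    )
    print(data.decode("utf-8"))
```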

This approach will allow relatively simple (“flat”) database structures to be shared. It will however not accommodate all needs. Two particular issues need to be considered:

It would be possible for the ALA to develop more complex recommendations on ways to use tab-delimited/comma-separated formats to overcome these issues. At present the recommendation is to avoid the use of these formats for these more complex situations. In one special case, sharing of synonymised checklist data, the ALA recommends the use of the SPICE Common Data Model properties to supply column headings. This model includes separate sets of properties for the accepted name and for the synonym and therefore avoids the general problems described above.

2c. LSIDs and RDF data

Life Science Identifiers (LSIDs) are globally unique identifiers which can be used both to identify a particular data object uniquely and to assist users with retrieving associated metadata in a standard format.

An LSID is a string with the following format: urn:lsid:<Authority>:<Namespace>:<ObjectID>[:<Version>]

Where:

Example LSIDs include:

TDWG has adopted LSIDs as its recommended standard for assigning globally unique identifiers to data records and has developed a range of software components to assist data providers and users in exchanging data via LSIDs. This software is listed on the TDWG wiki and includes:

Note that web browsers cannot directly resolve an LSID. When presenting an LSID for display to humans, it is recommended that the identifier should be prefixed with the address for the LSID Web Resolver web site, e.g. instead of showing urn:lsid:ipni.org:names:302735-2, show http://lsid.tdwg.org/urn:lsid:ipni.org:names:302735-2.
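The LSID format and the display recommendation above can be sketched as follows. This is a minimal illustration: the names `Lsid`, `parse_lsid` and `display_url` are hypothetical, and the validation is limited to the overall shape of the string given in the format above rather than the full LSID specification.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    version: Optional[str]


def parse_lsid(text):
    """Split urn:lsid:<Authority>:<Namespace>:<ObjectID>[:<Version>]."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("not an LSID: " + text)
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: " + text)
    if any(part == "" for part in parts[2:5]):
        raise ValueError("LSID has an empty component: " + text)
    version = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], version)


# Address of the LSID Web Resolver mentioned in the text.
RESOLVER = "http://lsid.tdwg.org/"


def display_url(text):
    """Return a browser-friendly link for an LSID shown to humans."""
    parse_lsid(text)  # raises ValueError if the string is not an LSID
    return RESOLVER + text


if __name__ == "__main__":
    lsid = "urn:lsid:ipni.org:names:302735-2"
    print(parse_lsid(lsid))
    print(display_url(lsid))
```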

Responses to requests for LSID metadata are expected to return this information encoded as RDF. The TDWG TaxonOccurrence, TaxonConcept and TaxonName vocabularies were in part developed to make it possible to use them in RDF documents. See the TDWG wiki page on LSID Vocabularies for more details.

For more information on LSIDs, see:

2d. OAI Protocol for Metadata Harvesting (OAI-PMH)

The Open Archives Initiative (OAI) has developed a simple protocol for harvesting of metadata from web repositories. OAI-PMH is a set of six request types which allow clients to browse the contents of such a repository and, at a later date, to be able to discover more recent changes. These requests are:

Essentially OAI-PMH can be seen as a more sophisticated and flexible version of RSS. A data provider can advertise a set of resources via their metadata, and can optionally use Sets to subdivide these resources into categories of importance to users. Users can browse all metadata documents or the documents for a single Set and can limit requests to those metadata documents added, modified and even deleted since a specified date.

OAI-PMH allows repositories to offer a range of metadata formats but mandates that the OAI Dublin Core (oai_dc) format must be supported.
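As an illustration of the incremental harvesting described above, the sketch below builds the URL for a `ListRecords` request, one of the request types defined by the OAI-PMH specification. The `verb`, `metadataPrefix`, `from` and `set` parameters are those of the protocol; the base URL, the set name and the function name `list_records_url` are assumptions made for the example.

```python
from urllib.parse import urlencode


def list_records_url(base_url, since=None, set_spec=None,
                     metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL.

    since    -- optional date (YYYY-MM-DD); limits the harvest to records
                added, modified or deleted on or after that date
    set_spec -- optional Set name; limits the harvest to one Set
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if since:
        params["from"] = since
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical repository address and Set name.
    print(list_records_url("http://example.org/oai",
                           since="2009-10-01", set_spec="images"))
```

A harvester would typically remember the date of its last successful request and pass it as `since` the next time, so that only changed records are transferred.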

The ALA recommends the use of OAI-PMH particularly for repositories of images, documents, etc. It is highly suited to situations in which data records can be represented with plain Dublin Core and in which the contents of the repository change over time. The OAI-PMH service should itself be registered as a resource in the ALA metadata repository (along with any default metadata properties applying to all records). The ALA will then use this interface to harvest the underlying data records.

For more information on OAI-PMH, see:

2e. Plain URL

In cases in which none of the above protocols is appropriate, it is possible simply to share a resource by publishing it in a web-browser-compatible format (word processor document, PDF, HTML, JPEG, MPEG, Flash, Java applet, etc.) and providing sufficient metadata to ensure that a user can understand the nature of the resource and how to use it.

In such cases, the ALA is able to direct users to the resource and may be able to add value by indexing the content of the resource (for text resources) but it will typically not be possible to use the information in more complex analyses.

3. Metadata standards

The ALA is developing a metadata repository for capturing information about data resources of relevance to Australian biodiversity. This is a work-in-progress and the following guidelines will be extended and perhaps modified in coming months.

In general, metadata can solve several critical problems with managing data:

  1. Data access – providing the basic technical details on where data are stored and how to retrieve them
  2. Data discovery – enabling users to find data resources which match search criteria
  3. Data description – giving users the information needed to determine whether the resource contents are relevant and usable for their needs
  4. Data ownership – expressing information on the ownership and reuse of each data resource
  5. Data longevity – providing the information framework for managing long-term archival of data resources while ensuring that they remain discoverable and usable into the future

The goal of the ALA is to gather sufficient metadata from different sources to meet the following requirements:

Achieving this goal will depend on bringing together metadata gathered from different sources:

This section addresses only the first three of these sources, i.e. the metadata provided explicitly by the data provider.

Data providers should consider the range of expected users for their data and include appropriate metadata terms expected by those communities. For example, materials developed for use in education might consider the Metadata Guidelines provided by the Learning Federation and incorporate some of the suggested terms from the edna metadata standard. The ALA will seek to preserve and propagate such domain-specific metadata fields.

3a. Dublin Core

Dublin Core is a set of general-purpose metadata elements which can be used to describe any data resource. The ALA expects particularly to harvest Dublin Core metadata for resources exposed via OAI-PMH and for images and other multimedia shared through a tab-delimited/comma-separated file of metadata. Dublin Core properties may however also be used to provide additional metadata on a record-by-record basis for data records exposed through TAPIR or LSIDs.

See Images and other multimedia for specific recommendations on the use of Dublin Core properties for multimedia objects.

In general, the ALA recommends that the Dublin Core Subject property should contain the scientific name for the organism or taxon covered by a particular resource, or alternatively an LSID or other unique identifier for the relevant taxon in a recognised taxonomic resource (e.g. AFD/APC or Catalogue of Life).

3b. Ecological Metadata Language (EML)

EML is a metadata specification developed and promoted for use with ecological data. It includes a number of features which make it highly suitable as a standard metadata representation for the ALA. In particular (unlike Dublin Core), it includes elements explicitly intended to capture information on the taxonomic and geographic scope of a data set, and on any methods which went into the data capture.

EML can be represented as an XML document describing a data set. Typically the EML document therefore contains the metadata and an associated tab-delimited/comma-separated file contains the experimental data.

The ALA and GBIF are both exploring the use of EML as a preferred metadata representation. The ALA therefore expects to make specific recommendations in the near future. In the meantime, the ALA commits to handle metadata captured as EML documents.

Morpho is a free standalone tool for editing EML documents. GBIF is working on an EML editor integrated into its data portal.
