Sharing data through the Atlas

  • By Miles Nicholls
  •  October 26, 2009
  •  Tags:  Blogs & news Data

This is a DRAFT – please send us your comments and suggestions.

Background

Different researchers and institutions capture and store data in the forms and combinations which best meet their needs. However, in order to make these data more widely accessible and to ensure that they can be reused for different purposes, data providers need to consider the appropriate way to expose their data, using structures and terms which can be recognised by others. Once these decisions have been made, they need to ensure that their data are transformed (or “mapped”) into these standard forms, and that the data set is associated with a good description and additional information on the source and ownership of the data, again in a form which can be understood by other users.

Sharing data therefore depends on the consistent use of agreed standards. Many different standards exist for different purposes. This document identifies the standards which are being adopted by the Atlas of Living Australia and seeks to explain how they can be used to improve the reuse of Australian biodiversity data.

TDWG produces an annual Technical Roadmap which provides recommendations for best practices in data sharing between biodiversity informatics projects. This document seeks to reflect these recommendations. The Technical Roadmap is available as a PDF (106KB).

The ALA and data standards

The ALA aims to bring together data from many sources. To do this it will support a range of different data standards, many of which are already in use by various communities around Australia. The ALA aims to shield users from this complexity by providing services which handle these source data formats and offer consistent views of merged data.

The ALA will therefore respond to developing needs and requirements by continuing to identify and support relevant data standards. This page provides a snapshot of the range of standards which are already recognised as important. Additional information will be added as it becomes available.

The information on this page has been divided into three major topics, each with a number of sub-topics:

  1. Data standards – specific information on standards supported for sharing different types of data.
    1. Species occurrence data (specimens and observations)
    2. Names, classifications and checklist data
    3. Structured descriptions and keys
    4. Species fact sheets
    5. Images and other multimedia
  2. Protocols – general information on sharing data in ways that the ALA and end users can access.
    1. TAPIR (and DiGIR and BioCASe)
    2. Tab-delimited and comma-separated data
    3. LSIDs and RDF data
    4. OAI Protocol for Metadata Harvesting (OAI-PMH)
    5. Plain URL
  3. Metadata standards – general information on describing data resources and registering them with the ALA.
    1. Dublin Core
    2. Ecological Metadata Language (EML)

The term “metadata” refers to the information a data provider supplies to describe a data set and to help users to access it. This may include descriptions of the contents of the data set, how the data were collected, information about the collectors and owners of the data and the uses they approve for the data, and technical details required to access or interpret the data.

As a quick shorthand – when considering the task of sharing data – the specific data standards address the question of what information can be shared, the protocols address the question of how the information is to be shared and accessed, and the metadata standards address the question of why users should be interested in accessing the information. (This is an oversimplification, but may help to explain the significance of the following sections.)

The section on protocols includes some information on software packages which implement these standards.

Please consider the question of copyright and usage restrictions when making data available. The ALA has developed some draft guidelines on the subject and recommends the adoption of Creative Commons licences in most situations.

Since the ALA deals with biodiversity data, the expectation is that all data items will include a label or data element identifying the species or other taxon covered by the resource. Where feasible, data providers are encouraged to use scientific names found in the AFD, the APC or the Catalogue of Life (particularly for any taxa not found in AFD/APC). However, it is recognised that some data sets (particularly historical collections and literature) use names which may not be currently accepted species names. In such situations data providers are advised to use the scientific names found in the original sources. The ALA will develop or use Taxonomic Services to attempt to map such data appropriately to current species concepts.

Over the coming months, this document will be enhanced with examples for each of the standards identified.

For more information on any of these recommendations, please contact us.

1. Data standards

The ALA aims to enable access to any type of biodiversity information. The following subsections provide guidelines for some major classes of data. These represent the classes of data on which the ALA particularly expects to focus in the earliest stages of its development. As the project proceeds, other classes will be handled in more detail.

Many of the data standards below include a unique identifier for each record. The ALA strongly recommends that all data providers supply such identifiers. Some data objects will already have an appropriate globally unique identifier. Publications may be associated with a DOI. Species characteristics and other terms from controlled vocabularies may be associated with stable HTTP URIs as identifiers. In other cases, the ALA follows the guidelines from TDWG and recommends the use of Life Science Identifiers (LSIDs) for this purpose. Constructing an LSID to serve as a record identifier gives a good guarantee of uniqueness and also allows the data provider to offer LSID resolution as a direct means for users to access the associated data.

1a. Species occurrence data (specimens and observations)

Quick recommendations

Detailed recommendations

Two families of data standards are in wide use around the world for exchanging data records for individual specimens in natural history collections and herbaria (including living collections such as culture collections and seed banks) and also for observations of living organisms in the field:

  • Darwin Core (http://www.tdwg.org/activities/darwincore/) is a simple set of descriptive properties which can be used to describe the main data elements in any species occurrence database. Darwin Core itself is very simple and can easily be extended with additional properties, either from a set of standard Darwin Core Extensions or from any other data standard appropriate for describing an organism occurrence. It is particularly widely used for describing zoological specimens and in the oceanographic community. Darwin Core data can be shared with the ALA, OZCAM, GBIF, OBIS and many other networks.
  • ABCD (Access to Biological Collections Data, http://www.tdwg.org/activities/abcd/) is a complex structure for describing specimens and observations in great detail. It supports all of the information included in the Darwin Core and can also support more complex requirements (e.g. including a history of successive identifications for a single specimen). ABCD is widely used in botanical collections and in seed banks. ABCD data can be shared with the ALA, GBIF, Bioversity International and other networks. The HISPID standard developed by Australia’s Virtual Herbarium (see http://hiscom.chah.org.au/wiki/index.php/HISPID_Mapping_to_ABCD) is an extension to ABCD 2.06, and HISPID is the preferred form for data sharing within the AVH.

Each of these standards has been used in several different versions. The ALA will support the same range of versions supported by GBIF (see http://www.gbif.org/participation/participant-nodes/resources/how-to-establish-a-participant-node/):

  • Darwin Core 1.2
  • Darwin Core 1.4 (sometimes known as “Darwin Core 2”)
  • The MaNIS versions of Darwin Core
  • The OBIS Schema (an extension of Darwin Core 1.2)
  • ABCD 1.20
  • ABCD 1.48
  • ABCD 2.05
  • ABCD 2.06
  • HISPID 5

Both Darwin Core and ABCD were developed through the work of TDWG, which has recently been reworking all of its data standards to make them more compatible and easier to use in a wide range of different situations. As part of this activity, TDWG has developed a TaxonOccurrence vocabulary which will replace both Darwin Core and ABCD. As this standard is more widely adopted, the ALA will support and promote it as a preferred data model.

It should be noted that Darwin Core, ABCD and the TaxonOccurrence vocabulary are all designed to handle the exchange of “presence” data, i.e. data sets which comprise records of the actual occurrence of different species. These data standards are inadequate to represent the full complexity of many ecological data sets. In particular they lack any standard mechanism for expressing the level and standardisation of recording effort or for recording the absence of a species of interest. The TDWG Observation and Specimen Records Interest Group is exploring ways to represent this further level of information, and the ALA will participate in this process. In the meantime, the ALA encourages data providers to use one of these existing standards to share their data and to provide a text description of any underlying data collection methods in the metadata for the data set.

Some data sets (particularly ecological data sets) will include data elements for which there is no corresponding property in Darwin Core or in the TaxonOccurrence vocabulary. In such cases data providers are recommended to use appropriate terms from other vocabularies to label these additional elements (see for example the range of vocabularies and ontologies listed by Marine Metadata Interoperability or SPIRE). If these elements are of general importance to biodiversity projects, it is also recommended to initiate discussion with TDWG to make standard recommendations on their representation.

1b. Names, classifications and checklist data

Quick recommendations

  • To establish a public web service (i.e. a searchable web interface), use the TDWG TaxonName and/or TaxonConcept vocabularies and make the data accessible via LSIDs, and optionally a TAPIR search interface
  • To share simple checklist data as a freestanding document, use a tab-delimited or comma-separated format with terms from the TDWG TaxonName and/or TaxonConcept vocabularies as column headings

Detailed recommendations

Taxonomic and nomenclatural data are frequently much more complex than species occurrence data, with many nested relationships between data items. Different data sets contain different combinations of nomenclatural data (information on the origin and validity of scientific names), taxonomic data (judgments on which names correspond to real species, how these species are to be classified and how they relate to earlier names and species concepts), bibliographic data (details of nomenclatural and taxonomic literature) and often a range of legislative, distributional and other information.

In addition, the hierarchical nature of classifications and checklists is handled in different ways in different databases. Some databases have a single table to accommodate taxa of all ranks, and rely on parent-child relationships and explicit taxon rank elements. Others flatten the hierarchy and assign different ranks to different database columns. There is similar variation in the way that different databases model the relationships between the names each database accepts for a species and the various synonyms identified in the database for the same species.

All of this makes it hard to provide generic recommendations for sharing data of this kind. To simplify the problem slightly, we can distinguish data sets according to their primary purpose:

  • Nomenclatural databases – resources dedicated to cataloguing the published names for a group of organisms (without addressing whether the name is regarded as the current name for a valid species). Examples include: APNI, IPNI, Index Fungorum and ZooBank. Such databases serve as foundations for taxonomic databases.
  • Taxonomic databases – resources representing the views of an individual taxonomist or a community on the list of valid species (or taxa of other ranks) within a particular taxonomic group, the relationship between these species and the various names which have been applied to the group, and the classification of these species into a taxonomic hierarchy. Such data sets may provide a global view, as with the databases making up the Catalogue of Life, or a regional view, as with the APC and AFD.
  • Checklists – resources listing the species of relevance to a particular topic. Examples include checklists of organisms found in a region or protected area, lists of organisms protected under some legislation, etc. Such lists are usually based on one or more reference taxonomies but do not represent an authoritative attempt to define valid species. Their purpose is to inform users of some characteristics of the set of species included in the list (e.g. that they can all be found in the same region). This approach can also be used to communicate the set of species included within a web site.

There are several relevant data standards which have been used to share such data. The ALA will seek to support data shared using any of the following models:

  • The Catalogue of Life has integrated data sets using the SPICE Common Data Model (CDM). This is a very simple model particularly suited to sharing data using tab-delimited or comma-separated data formats. Each record can contain a scientific name and a set of major taxonomic ranks associated with a species (kingdom, phylum/division, class, order, family, genus, species, infraspecific rank) and then makes an assertion as to whether it is an accepted name for a species or should be regarded as a synonym, in which case the record also contains the accepted species name and an associated set of major ranks. It is an efficient standard for sharing the key data elements used by the Catalogue of Life, and is associated with the SPICE Protocol, which supports simple tree-walking operations through the checklist.
  • TDWG has developed the Taxon Concept Schema (TCS) as a richer model for representing information on taxon names, taxon concepts (i.e. judgments on valid species), taxonomic publications, type specimens and relationships asserted between different taxonomies. This standard provides a complete model for representing and exchanging all of this information, either as static documents or in response to database queries. It is however being superseded by the TDWG TaxonName and TaxonConcept vocabularies.
  • As part of the reworking of all of its data standards, TDWG has reworked elements from TCS to supply two separate vocabularies for TaxonName and TaxonConcept records. These vocabularies are intended to be more flexible than TCS and to be used in a wider range of situations. They are already being used by IPNI and Index Fungorum and are being adopted for the APC and AFD web services currently under development.

Nomenclatural and taxonomic databases can be exposed as a TCS document (accessed via a Plain URL) or as a set of smaller TCS documents each representing a single taxon name or taxon concept (and accessed via a TAPIR search interface). It is however preferable in such cases to use the TaxonName and TaxonConcept vocabularies and to use LSIDs (and perhaps also OAI-PMH) in addition to, or instead of, TAPIR.

For other data sets, it will often be more appropriate simply to share the entire data set as a single document in a tab-delimited or comma-separated format, with terms from the TaxonName and/or TaxonConcept vocabularies or the SPICE Common Data Model as column headings and explanation of the purpose of the list included as part of the metadata for the data set. This approach may be applicable for listing protected species (red lists, CITES, etc.), species exhibiting some characteristic (e.g. endemic, invasive, venomous or leaf-mining taxa) or checklists for protected areas or localities.
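
As a rough illustration of the single-document approach, the sketch below writes a small checklist as tab-delimited text. It assumes the short-form column headings described in section 2b (“tc:nameString”, “tn:year”) and adds one plain-text heading, “listing status”, for a term with no vocabulary match. The rows and the status values are illustrative only and are not taken from any real list.

```python
import csv
import io

# Column headings: two short-form TDWG property identifiers plus one
# plain-text heading for a term with no matching vocabulary property.
COLUMNS = ["tc:nameString", "tn:year", "listing status"]

# Illustrative rows only; the status values are invented for this sketch.
ROWS = [
    {"tc:nameString": "Phascolarctos cinereus", "tn:year": "1817",
     "listing status": "example status A"},
    {"tc:nameString": "Macropus giganteus", "tn:year": "1790",
     "listing status": "example status B"},
]


def write_checklist(rows, columns=COLUMNS):
    """Return the checklist as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


checklist_text = write_checklist(ROWS)
# Encode explicitly as UTF-8 before saving or serving the file.
checklist_bytes = checklist_text.encode("utf-8")
```

The purpose of the list (for example, which legislation or locality it covers) would still be recorded in the metadata for the data set rather than in the file itself.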

1c. Structured descriptions and keys

Quick recommendations

  • Wherever possible share data on the characters of individual taxa or specimens using TDWG SDD or Delta (accessed via a Plain URL).
  • For other materials (e.g. dichotomous keys stored as word processor documents, taxonomic treatments as PDFs, online dynamic identification tools) simply share the resource via a Plain URL.

Detailed recommendations

Resources intended to assist users with identifying a set of organisms can be divided into two main categories:

  • Structured data sets which express the relationships between a set of taxa or specimens and a set of descriptive characters. Such data sets allow software to reuse the information in many different contexts. The Delta language is a rich model for encoding taxonomic descriptions and is supported by an extensive community and suite of tools. A number of other software applications, including Lucid, provide environments for editing descriptive data and publishing such data as dynamic keys (either installed on a user’s computer or accessed via a web browser). The TDWG Structured Descriptive Data (SDD) standard is an XML data model for storing and exchanging structured descriptions. It is supported by Lucid as an output format.
  • Unstructured resources which are not exposed using a machine-readable standard like Delta or SDD. Examples include word processor documents and PDFs of taxonomic treatments or dichotomous keys, or online dynamic keys accessible only through key playing software.

The ALA is working with the Taxonomic Resource Information Network (TRIN), IdentifyLife, the Encyclopedia of Life (EOL), Key To Nature and others to explore ways to manage access to descriptive data and identification tools. For the present, the recommendation is to share data using Delta or SDD wherever possible, and to register other resources as URLs with associated metadata.

1d. Species fact sheets

Quick recommendations

  • To share information from species fact sheets where the information is stored in a structured form in a database, consider sharing the data using TDWG SPM and TAPIR.
  • Alternatively share a set of links to fact sheets as tab-delimited or comma-separated data with at least two columns. One of these should hold the name of the species (or other taxon) covered by each fact sheet and should have the SPM aboutTaxon property as its heading. The other should hold the URL for the fact sheet and should have the SPM hasInformation property as its heading. If these URLs relate to XHTML pages, it may also be appropriate to embed tags in these pages to mark sections which correspond to specific SPM properties. (These recommendations are provisional and may change.)
  • To share a single fact sheet, use a Plain URL. If the fact sheet is an XHTML page, it may also be appropriate to embed tags in this page to mark sections which correspond to specific SPM properties.

Detailed recommendations

TDWG (in conjunction with GBIF, EOL and a range of other projects) has been developing the Species Profile Model as a vocabulary for use in advertising and exchanging blocks of information about different species.

This vocabulary is still under development but is intended to address a number of needs. For the ALA, particular benefits of SPM are likely to include:

  • Serving as a primary categorisation for organising biodiversity data resources (Biology, Description, Behaviour, Conservation, Legislation, etc.)
  • Allowing users to select relevant elements from a structured database of species information (e.g. selecting just descriptive data)
  • Providing an XML envelope for transporting a range of different content relating to a given species – a Description element could contain SDD data, a Distribution element could contain RDF TaxonOccurrence data, and other elements could contain plain text
  • Serving as a set of XHTML tags for marking sections within a species page.

At present all of these uses are still being explored. The ALA would be interested to hear from any projects or researchers interested in trial use of SPM.

1e. Images and other multimedia

Quick recommendations

  • To share information on a collection of multimedia objects (particularly a collection that is being actively maintained) and to share links with these objects, consider using OAI-PMH.
  • For collections of multimedia objects where OAI-PMH is not appropriate or feasible, share a set of links to objects as tab-delimited or comma-separated data with at least two columns. One of these should hold the name of the species represented by each object (with the Dublin Core Subject property as the heading) and the other should hold the URL for the object (Dublin Core Identifier). Additional columns may be used to hold a description of the object (Dublin Core Description), the type of multimedia object (Dublin Core Type), the format (Dublin Core Format), the copyright holder (Dublin Core Publisher) and the photographer/recorder (Dublin Core Creator). (These recommendations are provisional and may change.)
  • To share a single multimedia object, use a Plain URL.

Detailed recommendations

Images and other multimedia resources are typically shared across the Internet in three different ways:

  • As multimedia binaries in various formats (JPEG, GIF, PNG, MPEG, AVI, etc.).
  • As HTML web pages displaying a multimedia object with associated information.
  • As links to image viewer tools which provide additional functions (e.g. pan-and-zoom) and may restrict access to the original binary objects.

This variation is a complication in the task of managing access to image resources. Any of these three approaches may meet a user’s needs to view images and other multimedia objects for a particular organism, but only the first approach easily supports display of alternate views (e.g. thumbnails). The ALA will allow data providers to specify in their metadata which of these three approaches relates to each data resource, either on a record-by-record basis or for an entire data set (see below).

The preferred approach to sharing metadata about images (including the actual links to the images) will be OAI-PMH using properties from the standard OAI Dublin Core mapping (oai_dc). Where this is not feasible, a set of metadata fields for every image in a collection can be shared as a tab-delimited or comma-separated file. These metadata fields may include:

  • Dublin Core Subject (mandatory) – the species (or most precise taxon) represented by the object.
  • Dublin Core Identifier (mandatory) – the URL for accessing the resource.
  • Dublin Core Title (recommended) – the text to be displayed as a label for the image – this should normally include both scientific and vernacular names where available, followed by collection and specimen identifiers where applicable. If Title is omitted, the Subject will be used in its place.
  • Dublin Core Description (recommended) – the full description of the image.
  • Dublin Core Type (recommended) – the type of multimedia object – StillImage is assumed by default. The following terms from the DCMI Type Vocabulary are appropriate:
  • Dublin Core Format (recommended) – the MIME type for the multimedia object.
  • Dublin Core Creator (recommended) – the photographer or recorder responsible for the multimedia object.
  • Dublin Core Publisher (recommended) – the institution or other party holding copyright for the multimedia object.
  • Dublin Core Rights (recommended) – information about property rights or reuse of the multimedia object.

Subject and Identifier must be supplied for every record, but the other Dublin Core properties may be supplied on a record-by-record basis or as a set of default values within the metadata for the entire data set.
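
The rules above can be sketched in a few lines of code. This is only an illustration of how a harvester might combine record-level values with data set defaults; the function name, the “dc:” column labels and the example values are assumptions made for this sketch, not part of any ALA software.

```python
# Properties that must be present on every record (not supplied as defaults).
MANDATORY = ("dc:subject", "dc:identifier")


def complete_record(record, defaults):
    """Merge data set defaults into one record and apply the fallback rules."""
    missing = [name for name in MANDATORY if not record.get(name)]
    if missing:
        raise ValueError("record is missing mandatory properties: "
                         + ", ".join(missing))

    # Record-level values take priority over data set defaults.
    merged = dict(defaults)
    merged.update({key: value for key, value in record.items() if value})

    # If Title is omitted, the Subject is used in its place.
    if not merged.get("dc:title"):
        merged["dc:title"] = merged["dc:subject"]
    # StillImage is assumed when no Type is given.
    if not merged.get("dc:type"):
        merged["dc:type"] = "StillImage"
    return merged


# Hypothetical record and defaults, for illustration only.
example = complete_record(
    {"dc:subject": "Phascolarctos cinereus",
     "dc:identifier": "http://example.org/images/1.jpg"},
    {"dc:publisher": "Example Museum"},
)
```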

The ALA will be able to use OAI-PMH or the tab-delimited/comma-separated data set to harvest basic metadata about the multimedia objects in a collection.

This same set of Dublin Core properties can also be used for registering a single multimedia object as an online resource.

2. Protocols

The ALA will use a range of standard approaches to link to data resources. Each has particular benefits.

In addition to those protocols addressed in the following subsections, the ALA expects to support the Open Geospatial Consortium (OGC) Web Feature Service (WFS) for exchange of species occurrence data (using properties from the Darwin Core or the TaxonOccurrence vocabulary). Specific recommendations will be added at a later date on the use of WFS and other OGC standards.

2a. TAPIR (and DiGIR and BioCASe)

TAPIR is the current version of the remote query protocol developed and promoted by TDWG. It is effectively an enhanced replacement for the two earlier TDWG data access protocols, DiGIR and BioCASe. The ALA will interface with existing DiGIR or BioCASe resources but recommends the use of TAPIR for future implementations.

TAPIR is a general-purpose language for querying remote databases and for retrieving data in formats suited to the needs of the requestor. It relies on the existence of agreed community data models. Each such model provides a set of agreed properties which can be mapped against a range of distributed databases, even though their underlying data structures may be very different.

For example, a collection database, an ecological data set and a repository of biodiversity image metadata may all contain fields which represent the scientific name of an organism and the spatial coordinates and date associated with a particular occurrence of that organism. TAPIR allows all of these databases to advertise the fact that they have records containing fields which correspond to the ScientificName, EarliestDateCollected, Latitude and Longitude concepts from a model such as the Darwin Core. A client application can then send standard requests (as XML documents or a set of parameters encoded in a URL) to each of these resources and receive the results back as an XML response document. The format of the response document and the data elements to be inserted into this format can be defined by the client requesting the data. This makes it possible, for example, for the client to request the results as a KML file which can be displayed directly in Google Earth or Google Maps.

TAPIR supports five types of request:

  • Metadata – get description and ownership details for a data set
  • Capabilities – get information on the technical aspects of the TAPIR instance and the concepts (properties) shared
  • Inventory – list distinct values (with record counts) for a supported concept (e.g. list all scientific names)
  • Search – search for records matching specified search criteria
  • Ping – verify that TAPIR instance is active
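
For requests encoded as URL parameters, a client might build the request address along the following lines. This is a minimal sketch: the endpoint address is hypothetical, and the use of an “op” parameter to name the operation reflects our reading of the TAPIR key-value encoding and should be checked against the TAPIR specification. Inventory and search requests need further parameters which are not shown here.

```python
from urllib.parse import urlencode

# The five TAPIR request types listed above.
OPERATIONS = {
    "metadata": "description and ownership details for a data set",
    "capabilities": "technical details and concepts shared",
    "inventory": "distinct values for a supported concept",
    "search": "records matching specified criteria",
    "ping": "check that the TAPIR instance is active",
}


def tapir_url(endpoint, operation):
    """Build a URL-encoded TAPIR request for one of the five operations.

    Assumes the operation is named by an "op" parameter.
    """
    if operation not in OPERATIONS:
        raise ValueError("unknown TAPIR operation: " + operation)
    return endpoint + "?" + urlencode({"op": operation})


# Hypothetical endpoint, for illustration only.
ping_url = tapir_url("http://example.org/tapir/collection", "ping")
```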

Several implementations of the TAPIR protocol are available. The TAPIR server implementations all include configuration interfaces to assist data providers in mapping their relational databases to standards such as Darwin Core, ABCD and the TDWG TaxonOccurrence vocabulary. Available software includes:

The ALA is working with GBIF and others to simplify the installation and use of these tools.

For more information on TAPIR, see:

2b. Tab-delimited and comma-separated data

TAPIR is a powerful tool for supporting remote access to a database but some data providers may not need the flexibility it offers, or may be unable to run the TAPIR software on a web server. In such circumstances it may be simpler to share a flat file representation of the data. The simplest approach, supported by database and spreadsheet software, is to export the content as tab-delimited or comma-separated data and then to register a URL from which the data file can be downloaded. This makes it possible to associate the data with any relevant descriptive metadata. Note that data should always be shared using UTF-8 as the character encoding.

One possible weakness in such an approach is that it may be difficult for users to interpret the significance of the various columns in such a data set. The ALA makes the following recommendations to address this problem:

  • Where columns correspond to Dublin Core properties, the columns should be labeled with the appropriate Dublin Core element identifier.
  • Where columns correspond to properties from the TDWG vocabularies, the columns should be labeled with the appropriate TDWG TaxonOccurrence, TaxonConcept or TaxonName property identifiers. Where applicable, and no ambiguity arises, terms from multiple vocabularies may be combined (e.g. a TaxonOccurrence record may include properties from the TaxonConcept vocabulary), but note comments below on more complex data sets.
  • Where columns do not correspond to any property in the TDWG vocabularies, the columns should be labeled with a URI identifying a suitable term from another ontology or vocabulary. If this is not possible a plain text column heading may be used.

The property identifiers should be constructed using either the short or long form specified in this table. The long forms are formal URIs and are preferred since they should allow other clients to interpret the data correctly. The short forms are suggested standard abbreviations likely only to be interpreted correctly by the ALA and similar biodiversity informatics projects.

Vocabulary | Long form | Short form
Dublin Core | Element URI (e.g. “http://purl.org/dc/elements/1.1/description”) | Prefix property name with “dc:” (e.g. “dc:description”)
TaxonOccurrence | Prefix property name with “http://rs.tdwg.org/ontology/voc/TaxonOccurrence#” (e.g. “http://rs.tdwg.org/ontology/voc/TaxonOccurrence#catalogueNumber”) | Prefix property name with “to:” (e.g. “to:catalogueNumber”)
TaxonConcept | Prefix property name with “http://rs.tdwg.org/ontology/voc/TaxonConcept#” (e.g. “http://rs.tdwg.org/ontology/voc/TaxonConcept#nameString”) | Prefix property name with “tc:” (e.g. “tc:nameString”)
TaxonName | Prefix property name with “http://rs.tdwg.org/ontology/voc/TaxonName#” (e.g. “http://rs.tdwg.org/ontology/voc/TaxonName#year”) | Prefix property name with “tn:” (e.g. “tn:year”)
Other vocabulary | Property URI | N/A
Unmatched term | Plain text column name | N/A
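
A client reading such a file could translate short-form headings into the preferred long forms with a simple lookup. The sketch below uses the four prefixes from the table; the function name is our own, and headings that are already URIs or plain text are passed through unchanged.

```python
# Short-form prefixes and the URI each one stands for (from the table above).
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "to": "http://rs.tdwg.org/ontology/voc/TaxonOccurrence#",
    "tc": "http://rs.tdwg.org/ontology/voc/TaxonConcept#",
    "tn": "http://rs.tdwg.org/ontology/voc/TaxonName#",
}


def expand_heading(heading):
    """Return the long-form URI for a short-form column heading.

    Headings with no recognised prefix (full URIs, plain text) are
    returned unchanged.
    """
    prefix, separator, name = heading.partition(":")
    if separator and name and prefix in PREFIXES:
        return PREFIXES[prefix] + name
    return heading


headings = ["to:catalogueNumber", "tc:nameString", "field notes"]
expanded = [expand_heading(h) for h in headings]
```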

This approach will allow relatively simple (“flat”) database structures to be shared. It will however not accommodate all needs. Two particular issues need to be considered:

  • Many-to-one joins – This issue does not relate to all database joins. The important question is which table in the database is the “root” table for the records to be exported. As an example, consider sharing TaxonOccurrence records from a database with just two separate tables, one for Specimens and one for Taxa. There will be a many-to-one relationship between these tables but the table to be considered as the “root” for TaxonOccurrence records is the Specimens table. Each record in the Specimens table is joined to a single record in the Taxa table and the whole database could be “flattened” with a single record for each TaxonOccurrence (whereas it could not be flattened with a single record for each TaxonConcept if those records are to include all the specimen data). There are situations in which the proper mapping of a database should include multiple nested subrecords (e.g. a history of consecutive identifications for a specimen, or a series of synonyms for an accepted species name). In such cases, it is not possible to share the data using a single tab-delimited/comma-separated file.
  • Ambiguity – In some cases, there may be several ways in which different data objects can be connected. As an example, a TaxonOccurrence record could be related to a named person as the person who collected a specimen or as the person who identified it. A database might contain detailed information for each of these roles (name, email address, title, etc.) and it might be appropriate to use another vocabulary (e.g. vCard) to share this detailed information. However a column labelled “vcard:email” would be ambiguous. Is this the email address for the collector or the identifier?
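
The Specimens/Taxa case described in the first point can be sketched as follows. This is an illustrative example under stated assumptions: the two in-memory tables, the taxon_id join column and all of the values are hypothetical, and only the column headings follow the prefix convention given earlier. Each specimen joins to exactly one taxon, so the result is one flat row per TaxonOccurrence.

```python
import csv
import io

# Hypothetical two-table database: many specimens point at one taxon.
taxa = {
    1: {"tc:nameString": "Eucalyptus globulus"},
    2: {"tc:nameString": "Acacia dealbata"},
}
specimens = [
    {"to:catalogueNumber": "A100", "taxon_id": 1},
    {"to:catalogueNumber": "A101", "taxon_id": 1},
    {"to:catalogueNumber": "A102", "taxon_id": 2},
]


def flatten(specimens, taxa):
    """Yield one flat row per specimen, copying in its single joined taxon."""
    for specimen in specimens:
        row = {k: v for k, v in specimen.items() if k != "taxon_id"}
        row.update(taxa[specimen["taxon_id"]])
        yield row


rows = list(flatten(specimens, taxa))

# Write the flattened rows as tab-delimited text with prefixed headings.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["to:catalogueNumber", "tc:nameString"],
    delimiter="\t",
)
writer.writeheader()
writer.writerows(rows)
flat_text = buffer.getvalue()
```

The reverse direction does not work in the same way: a row per taxon would need a variable number of specimen columns, which is the nested-subrecord situation the flat formats cannot express.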

It would be possible for the ALA to develop more complex recommendations on ways to use tab-delimited/comma-separated formats to overcome these issues. At present the recommendation is to avoid the use of these formats for these more complex situations. In one special case, sharing of synonymised checklist data, the ALA recommends the use of the SPICE Common Data Model properties to supply column headings. This model includes separate sets of properties for the accepted name and for the synonym and therefore avoids the general problems described above.

2c. LSIDs and RDF data

Life Science Identifiers (LSIDs) are globally unique identifiers which serve two purposes: they identify a particular data object uniquely, and they help users to retrieve associated metadata in a standard format.

An LSID is a string with the following format: urn:lsid:<Authority>:<Namespace>:<ObjectID>[:<Version>]

Where:

  • Authority is an identifier for the organisation issuing the identifier (usually the root DNS name for the organisation)
  • Namespace is selected by the organisation as a name for a class of identified objects
  • ObjectID is a unique identifier for an object in that class
  • Version is an optional additional string to provide versioning information
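
The format above can be checked mechanically by splitting on colons. The following is a minimal sketch rather than a full validator: the Lsid type and parse_lsid function are names invented for this example, and the second sample identifier (with a version) is hypothetical.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    version: Optional[str] = None


def parse_lsid(text):
    """Split an LSID string into its components, or raise ValueError."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("expected 5 or 6 colon-separated parts: %r" % text)
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("an LSID must start with 'urn:lsid:': %r" % text)
    if any(part == "" for part in parts[2:5]):
        raise ValueError("authority, namespace and object ID are required")
    # The version component is optional; treat an empty one as absent.
    version = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(parts[2], parts[3], parts[4], version)


unversioned = parse_lsid("urn:lsid:ipni.org:names:302735-2")
versioned = parse_lsid("urn:lsid:example.org:specimens:12345:2")  # hypothetical
```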

Example LSIDs include:

TDWG has adopted LSIDs as its recommended standard for assigning globally unique identifiers to data records and has developed a range of software components to assist data providers and users in exchanging data via LSIDs. This software is listed on the TDWG wiki and includes:

  • LaunchPad for Internet Explorer – plug-in allowing Internet Explorer to handle LSIDs natively
  • LaunchPad for Mozilla Firefox – plug-in allowing Firefox to handle LSIDs natively
  • LSID Server Conformance Test Tool – simple check of protocol conformance for any LSID
  • Lean PHP Resolver – simple PHP server-side LSID framework
  • Perl LSID API – server-side and client-side LSID implementation
  • J2EE LSID API – server-side and client-side LSID implementation
  • MS .NET LSID API – server-side and client-side LSID implementation
  • LSID Web Resolver (http://lsid.tdwg.org/) – web page for resolving any LSID

Note that web browsers cannot directly resolve an LSID. When presenting an LSID for display to humans, it is recommended that the identifier should be prefixed with the address for the LSID Web Resolver web site, e.g. instead of showing urn:lsid:ipni.org:names:302735-2, show http://lsid.tdwg.org/urn:lsid:ipni.org:names:302735-2.
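
The display recommendation amounts to simple string prefixing. A minimal sketch, assuming a helper named display_link (not part of any TDWG software):

```python
# Address of the LSID Web Resolver named in the text.
RESOLVER = "http://lsid.tdwg.org/"


def display_link(lsid):
    """Return a browser-friendly address for an LSID."""
    if not lsid.lower().startswith("urn:lsid:"):
        raise ValueError("not an LSID: %r" % lsid)
    return RESOLVER + lsid


link = display_link("urn:lsid:ipni.org:names:302735-2")
```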

Responses to requests for LSID metadata are expected to return this information encoded as RDF. The TDWG TaxonOccurrence, TaxonConcept and TaxonName vocabularies were in part developed to make it possible to use them in RDF documents. See the TDWG wiki page on LSID Vocabularies for more details.

For more information on LSIDs, see:

2d. OAI Protocol for Metadata Harvesting (OAI-PMH)

The Open Archives Initiative (OAI) has developed a simple protocol for harvesting metadata from web repositories. OAI-PMH is a set of six request types which allow clients to browse the contents of such a repository and, at a later date, to discover more recent changes. These requests are:

  • GetRecord – retrieve an individual metadata record selected by its identifier
  • Identify – get information about the repository
  • ListIdentifiers – get header information for metadata records optionally filtered by a timestamp and/or a Set identifier (see ListSets)
  • ListMetadataFormats – get information on the response formats supported by the repository
  • ListRecords – get full information for metadata records optionally filtered by a timestamp and/or a Set identifier
  • ListSets – get information on the Sets which may be used to organise the contents of a repository into subcategories

Essentially OAI-PMH can be seen as a more sophisticated and flexible version of RSS. A data provider can advertise a set of resources via their metadata, and can optionally use Sets to subdivide these resources into categories of importance to users. Users can browse all metadata documents or the documents for a single Set and can limit requests to those metadata documents added, modified and even deleted since a specified date.

OAI-PMH allows repositories to offer a range of metadata formats but mandates that the OAI Dublin Core (oai_dc) format must be supported.
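
OAI-PMH requests are ordinary HTTP requests in which the request type is passed as a verb argument alongside optional arguments such as metadataPrefix, from, until and set. The sketch below only builds request URLs and does not contact a server. The base URL, the Set name “images” and the helper name oai_request are assumptions made for the example.

```python
from urllib.parse import urlencode

BASE_URL = "http://repository.example.org/oai"  # hypothetical repository

VERBS = {
    "GetRecord",
    "Identify",
    "ListIdentifiers",
    "ListMetadataFormats",
    "ListRecords",
    "ListSets",
}


def oai_request(verb, **arguments):
    """Build the URL for one OAI-PMH request."""
    if verb not in VERBS:
        raise ValueError("unknown OAI-PMH verb: %r" % verb)
    params = {"verb": verb}
    for key, value in arguments.items():
        if value is not None:
            # "from" is a Python keyword, so callers pass from_ instead.
            params[key.rstrip("_")] = value
    return BASE_URL + "?" + urlencode(params)


# Ask the repository to describe itself.
identify_url = oai_request("Identify")

# Harvest Dublin Core records in one Set that changed since a given date.
changed_url = oai_request(
    "ListRecords",
    metadataPrefix="oai_dc",
    from_="2009-10-01",
    set="images",
)
```

A harvester would typically record the date of each successful harvest and pass it as the from argument next time, so that only new, modified or deleted records are returned.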

The ALA recommends the use of OAI-PMH particularly for repositories of images, documents, etc. It is highly suited to situations in which data records can be represented with plain Dublin Core and in which the contents of the repository change over time. The OAI-PMH service should itself be registered as a resource in the ALA metadata repository (along with any default metadata properties applying to all records). The ALA will then use this interface to harvest the underlying data records.

For more information on OAI-PMH, see:

2e. Plain URL

In cases in which none of the above protocols is appropriate, it is possible simply to share a resource by publishing it in a web-browser-compatible format (word processor document, PDF, HTML, JPEG, MPEG, Flash, Java applet, etc.) and providing sufficient metadata to ensure that a user can understand the nature of the resource and how to use it.

In such cases, the ALA is able to direct users to the resource and may be able to add value by indexing the content of the resource (for text resources) but it will typically not be possible to use the information in more complex analyses.

3. Metadata standards

The ALA is developing a metadata repository for capturing information about data resources of relevance to Australian biodiversity. This is a work-in-progress and the following guidelines will be extended and perhaps modified in coming months.

In general, metadata can solve several critical problems with managing data:

  1. Data access – providing the basic technical details on where data are stored and how to retrieve them
  2. Data discovery – enabling users to find data resources which match search criteria
  3. Data description – giving users the information needed to determine whether the resource contents are relevant and usable for their needs
  4. Data ownership – expressing information on the ownership and reuse of each data resource
  5. Data longevity – providing the information framework for managing long-term archival of data resources while ensuring that they remain discoverable and usable into the future

The goal of the ALA is to gather sufficient metadata from different sources to meet the following requirements:

  • Associate each data resource with the species to which it relates
  • (Where relevant) associate each data resource with gazetteer terms and geographical areas to which it relates
  • Associate each data resource with a range of other terms from relevant controlled vocabularies and ontologies
  • Display appropriate information to users on the ownership for each data resource and any usage restrictions
  • Assist users in connecting directly to each data resource using appropriate software
  • Make connections on behalf of users between related data resources

Achieving this goal will depend on bringing together metadata gathered from different sources:

  • Metadata supplied by a user when registering a resource
  • Metadata harvested for individual objects exposed through OAI-PMH
  • Dublin Core fields included as part of TAPIR or LSID responses or tab-delimited/comma-separated data sets
  • Summary data gathered by the ALA through inspecting the data shared using TAPIR or tab-delimited/comma-separated data sets (e.g. the ALA will take an inventory of ScientificName values to understand which taxa are included in each such data set)
  • (In the longer term) summary data gathered by the ALA through inspection of the contents of text resources (e.g. an inventory of the scientific names found in a text document).
  • User-provided annotations on data resources or individual records (see Data Annotation Services)

This section addresses only the first three of these sources, i.e. the metadata provided explicitly by the data provider.

Data providers should consider the range of expected users for their data and include appropriate metadata terms expected by those communities. For example, materials developed for use in education might consider the Metadata Guidelines provided by the Learning Federation and incorporate some of the suggested terms from the edna metadata standard. The ALA will seek to preserve and propagate such domain-specific metadata fields.

3a. Dublin Core

Dublin Core is a set of general-purpose metadata elements which can be used to describe any data resource. The ALA expects particularly to harvest Dublin Core metadata for resources exposed via OAI-PMH and for images and other multimedia shared through a tab-delimited/comma-separated file of metadata. Dublin Core properties may however also be used to provide additional metadata on a record-by-record basis for data records exposed through TAPIR or LSIDs.

See Images and other multimedia for specific recommendations on the use of Dublin Core properties for multimedia objects.

In general, the ALA recommends that the Dublin Core Subject property should contain the scientific name for the organism or taxon covered by a particular resource, or alternatively an LSID or other unique identifier for the relevant taxon in a recognised taxonomic resource (e.g. AFD/APC or Catalogue of Life).
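
As a rough illustration of this recommendation, the following sketch builds a small Dublin Core record in the oai_dc XML form with the scientific name in the Subject property. The record contents (title, name and format) are invented for the example, and the helper name dublin_core_record is an assumption. The two namespace URIs are the standard Dublin Core and OAI Dublin Core namespaces.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Use the conventional prefixes when the record is serialised.
ET.register_namespace("dc", DC)
ET.register_namespace("oai_dc", OAI_DC)


def dublin_core_record(fields):
    """Build an oai_dc record from (element name, value) pairs."""
    root = ET.Element("{%s}dc" % OAI_DC)
    for name, value in fields:
        ET.SubElement(root, "{%s}%s" % (DC, name)).text = value
    return root


# Hypothetical metadata for a single image resource.
record = dublin_core_record([
    ("title", "Leaf photograph"),
    ("subject", "Eucalyptus globulus"),
    ("format", "image/jpeg"),
])
xml_text = ET.tostring(record, encoding="unicode")
```

Where a taxon identifier is available, it could be supplied in the same way, as the value of the subject element in place of the name.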

3b. Ecological Metadata Language (EML)

EML is a metadata specification developed and promoted for use with ecological data. It includes a number of features which make it highly suitable as a standard metadata representation for the ALA. In particular (unlike Dublin Core), it includes elements explicitly intended to capture information on the taxonomic and geographic scope of a data set, and on any methods which went into the data capture.

EML can be represented as an XML document describing a data set. Typically the EML document therefore contains the metadata and an associated tab-delimited/comma-separated file contains the experimental data.

The ALA and GBIF are both exploring the use of EML as a preferred metadata representation. The ALA therefore expects to make specific recommendations in the near future. In the meantime, the ALA commits to handling metadata captured as EML documents.

Morpho is a free standalone tool for editing EML documents. GBIF is working on an EML editor integrated into its data portal.