The different challenges of integrating data from many sources

  • By Lynne Sealie
  •  May 24, 2012
  •  Tags:  Blogs & news Data Mapping & analysis

 

By Juliette Bryan, Data Analyst at the ALA

One of the challenges the Atlas of Living Australia (ALA) faces is integrating biodiversity occurrence data that arrive in many different forms. Most of our data comes from museums, herbaria, other biological collections, State conservation agencies and BirdLife Australia. While these data are generally well structured, data from other sources, contributed by both amateurs and professionals, may be inconsistent in format.

Some of the data we receive has been recorded over many years or decades by a variety of individuals from a regional group or nature club, and so can be of high value for certain research questions. However, even within a single club, members have often adopted their own methods for recording data and, while some standards exist, the format is often very inconsistent.

Data from government agencies or organisations with a background in biodiversity data often require very little restructuring before they can be loaded into the Atlas of Living Australia.

Before any data is loaded into the ALA it must be mapped to Darwin Core terms. The Darwin Core is a body of standards primarily based on taxa and their occurrence in nature as documented by observations, specimens and samples, and related information. Darwin Core terms include the basis of record, location, event date, sampling protocol, recorded by, identified by, species common and scientific names, and associated media.

When reading and cleaning the more difficult, unstructured data, we use a variety of tools. Open source tools such as MySQL, Talend and Pentaho are all useful in mapping the data to Darwin Core terms. The more structured data usually only require changing the column names to match their corresponding Darwin Core terms and then the data is ready to load.
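For well-structured data, the column renaming described above can be sketched in a few lines of Python. This is only an illustration, not the ALA's actual tooling: the supplier headings in `COLUMN_MAP` and the helper `rename_to_darwin_core` are hypothetical, and only the Darwin Core term names come from the standard.

```python
import csv
import io

# Hypothetical mapping from a supplier's column headings to Darwin Core terms.
COLUMN_MAP = {
    "Common name": "vernacularName",
    "Site": "locality",
    "Count": "individualCount",
    "Date": "eventDate",
}

def rename_to_darwin_core(csv_text, column_map):
    """Return rows as dicts keyed by Darwin Core terms.

    Raises ValueError if a source column has no mapping, so that
    unexpected columns are noticed rather than silently dropped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    unmapped = [name for name in reader.fieldnames if name not in column_map]
    if unmapped:
        raise ValueError(f"No Darwin Core mapping for columns: {unmapped}")
    return [{column_map[key]: value for key, value in row.items()} for row in reader]

# Made-up input in the supplier's own headings.
sample = "Common name,Site,Count,Date\nDusky Moorhen,Watsons Block,3,12/10/2011\n"
rows = rename_to_darwin_core(sample, COLUMN_MAP)
print(rows[0])
```

Failing on unmapped columns is a design choice for the sketch; a real load might instead route such columns to a remarks field.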

Example 1: Raw data which will be mapped to Darwin Core terms

2011 (Spring) October 12th

Waterbird Sanctuary

Watsons block

Tips Billabong

Racecourse

Corridor/River

Purple Swamphen

3+2chicks

4+2chicks

5

Dusky Moorhen

3

4

Eurasian Coot

20

23

Below is the data mapped to Darwin Core terms. As we were given the locality in the original data we used the ALA Spatial Portal (http://spatial.ala.org.au/) to create the columns: coordinatePrecision, coordinateUncertaintyInMeters, decimalLatitude and decimalLongitude.

vernacularName | locality | individualCount | occurrenceRemarks | eventDate | samplingProtocol
Purple Swamphen | Waterbird Sanctuary | | 3+2chicks | 12/10/2011 | Present: Time: 8:30am – 11.30am Weather: 6 – 18 deg
Dusky Moorhen | Watsons Block | 3 | | 12/10/2011 | Present: Time: 8:30am – 11.30am Weather: 6 – 18 deg
Eurasian Coot | Watsons Block | 20 | | 12/10/2011 | Present: Time: 8:30am – 11.30am Weather: 6 – 18 deg
Purple Swamphen | Tips Billabong | | 4+2chicks | 12/10/2011 | Present: Time: 8:30am – 11.30am Weather: 6 – 18 deg
Dusky Moorhen | Tips Billabong | 4 | | 12/10/2011 | Present: Time: 8:30am – 11.30am Weather: 6 – 18 deg
Eurasian Coot | Tips Billabong | 23 | | 12/10/2011 | Present: Time: 8:30am – 11.30am Weather: 6 – 18 deg

 

coordinatePrecision | coordinateUncertaintyInMeters | georeferencedDate | decimalLatitude | decimalLongitude
0.00001 | 100 | 13/12/2011 | -36.93344 | 149.87758
0.00001 | 100 | 13/12/2011 | -36.93281 | 149.87255
0.00001 | 100 | 13/12/2011 | -36.93281 | 149.87255
0.00001 | 100 | 13/12/2011 | -36.93313 | 149.875
0.00001 | 100 | 13/12/2011 | -36.93313 | 149.875
0.00001 | 100 | 13/12/2011 | -36.93313 | 149.875

 

georeferenceProtocol | georeferenceSources | georeferencedBy
Location description looked up on map and digitised in ALA spatial portal | Panboola Wetlands Map provided by surveyors and ALA spatial portal | Miles Nicholls (ALA)
Location description looked up on map and digitised in ALA spatial portal | Panboola Wetlands Map provided by surveyors and ALA spatial portal | Miles Nicholls (ALA)
Location description looked up on map and digitised in ALA spatial portal | Panboola Wetlands Map provided by surveyors and ALA spatial portal | Miles Nicholls (ALA)
Location description looked up on map and digitised in ALA spatial portal | Panboola Wetlands Map provided by surveyors and ALA spatial portal | Miles Nicholls (ALA)
Location description looked up on map and digitised in ALA spatial portal | Panboola Wetlands Map provided by surveyors and ALA spatial portal | Miles Nicholls (ALA)
Location description looked up on map and digitised in ALA spatial portal | Panboola Wetlands Map provided by surveyors and ALA spatial portal | Miles Nicholls (ALA)
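The reshaping in Example 1, from a sheet with one column per site to one row per species and site, can be sketched as follows. This is a minimal illustration and not the workflow actually used (the article names MySQL, Talend and Pentaho). It assumes that non-numeric counts such as "3+2chicks" go into the remarks column rather than individualCount, and it includes only the cells that appear in the mapped rows above. The function names are hypothetical.

```python
import re
from datetime import datetime

# Raw sheet: a survey date, a list of sites, and one list of counts per species.
SURVEY_DATE = "2011 (Spring) October 12th"
SITES = ["Waterbird Sanctuary", "Watsons Block", "Tips Billabong"]
COUNTS = {
    "Purple Swamphen": ["3+2chicks", "", "4+2chicks"],
    "Dusky Moorhen": ["", "3", "4"],
    "Eurasian Coot": ["", "20", "23"],
}

def parse_survey_date(text):
    """Turn a heading like '2011 (Spring) October 12th' into 'dd/mm/yyyy'."""
    match = re.fullmatch(r"(\d{4}) \(\w+\) (\w+) (\d{1,2})(?:st|nd|rd|th)", text)
    if match is None:
        raise ValueError(f"Unrecognised date: {text!r}")
    year, month, day = match.groups()
    parsed = datetime.strptime(f"{day} {month} {year}", "%d %B %Y")
    return parsed.strftime("%d/%m/%Y")

def unpivot(sites, counts, survey_date):
    """Emit one record per non-empty (site, species) cell."""
    event_date = parse_survey_date(survey_date)
    records = []
    for site_index, site in enumerate(sites):
        for species, cells in counts.items():
            cell = cells[site_index]
            if not cell:
                continue  # nothing recorded for this species at this site
            is_number = cell.isdigit()
            records.append({
                "vernacularName": species,
                "locality": site,
                # Assumption: only plain integers are treated as counts.
                "individualCount": cell if is_number else "",
                "occurrenceRemarks": "" if is_number else cell,
                "eventDate": event_date,
            })
    return records

records = unpivot(SITES, COUNTS, SURVEY_DATE)
for record in records:
    print(record)
```

Keeping free-text counts in a remarks field preserves what the observer wrote without forcing it into a numeric column.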

Below is the same data displayed in the ALA

 

 

Example 2: Data digitised from a series of publications

Scientific Name | Year of collection | Attribution | Identified by | Recorded by | Synonymy | station no | Material Studied
Caulastrea echinulata (Edwards & Haime, 1849) | 1976 | Veron J. E. N., Pichon M., Maya Wijsman-Best, 1977, Schleractinia of Eastern Australia Part 2, Families Faviidae, Trachyphylliidae, Australian Institute of Marine Science (AIMS), Australian Government Publishing Service, Canberra | J. Veron, M. Pichon, M. Wijsman-Best | J. Veron | Dasyphyllia echinulata Edwards & Haime, 1849; Edwards & Haime (1857); Ortmann (1888). Caulastrea echinulata (Edwards & Haime, 1849); Matthai (1928); Nemenzo (1959); Wijsman-Best (1972). Caulastrea aiharai Yabe & Sugiyama, 1935; Yabe, Sugiyama & Eguchi (1936). | 9,36,90 | Yonge Reef, Palm Islands (4 specimens). These localities include collecting stations 9, 36, 90.
Psammocora explanulata van der Horst, 1922 | 1975 | Veron J. E. N., Pichon M., 1976, Schleractinia of Eastern Australia Part 1, Families Thamnasteriidae, Astrocoeniidae, Pocilloporidae, Australian Institute of Marine Science (AIMS), Australian Government Publishing Service, Canberra | J. Veron, M. Pichon | J. Veron | Psammocora explanulata van der Horst, 1922; Wells (1954). | 55 | Palm Islands (2 specimens), collecting station 55.

Additionally, we were provided with a lookup table of 259 stations and their locations. As the lookup gave a locality for each station number, we used the ALA Spatial Portal (http://spatial.ala.org.au/) to create the columns Uncertainty in Km, Latitude and Longitude. A sample of the data is shown below. The Location Remarks column records whether a particular station is a dredging station.

Station No | Station location | Longitude | Latitude | Uncertainty in Km | Location Remarks
1 | Great Detached Reef | 144.028 | -11.694 | 6 |
2 | Tijou Reef | 143.95 | -13.166 | 2 |
3 | Yonge Reef | 145.657 | -14.693 | 0.5 |
4 | Bowl Reef | 147.545 | -18.512 | 1 |

The data was transposed by station number to create one row per station, so a record listing stations 9, 36 and 90 became three rows. These rows were then joined to the station lookup table to pick up the locality and mapping coordinates. Below is the final data mapped to Darwin Core terms.

scientificName | year | recordedBy | identifiedBy | locality | locationId
Caulastrea echinulata (Edwards & Haime, 1849) | 1976 | J. Veron | J. Veron, M. Pichon, M. Wijsman-Best | Yonge Reef | 9
Caulastrea echinulata (Edwards & Haime, 1849) | 1976 | J. Veron | J. Veron, M. Pichon, M. Wijsman-Best | Electra Head, Great Palm Island | 36
Caulastrea echinulata (Edwards & Haime, 1849) | 1976 | J. Veron | J. Veron, M. Pichon, M. Wijsman-Best | Pelorus Island, Palm Islands, W | 90
Psammocora explanulata van der Horst, 1922 | 1975 | J. Veron | J. Veron, M. Pichon | Orpheus Island (Palm Islands), NW point | 55

 

locationRemarks | decimalLongitude | decimalLatitude | coordinateUncertaintyInMeters
 | 145.623 | -14.597 | 500
 | 146.69 | -18.73 | 500
 | 146.488 | -18.551 | 500
 | 146.48 | -18.567 | 500

 

occurrenceRemarks | previousIdentifications | references
Yonge Reef, Palm Islands (4 specimens). These localities include collecting stations 9, 36, 90. | Dasyphyllia echinulata Edwards & Haime, 1849; Edwards & Haime (1857); Ortmann (1888). Caulastrea echinulata (Edwards & Haime, 1849); Matthai (1928); Nemenzo (1959); Wijsman-Best (1972). Caulastrea aiharai Yabe & Sugiyama, 1935; Yabe, Sugiyama & Eguchi (1936). | Veron J. E. N., Pichon M., Maya Wijsman-Best, 1977, Schleractinia of Eastern Australia Part 2, Families Faviidae, Trachyphylliidae, Australian Institute of Marine Science (AIMS), Australian Government Publishing Service, Canberra
Yonge Reef, Palm Islands (4 specimens). These localities include collecting stations 9, 36, 90. | Dasyphyllia echinulata Edwards & Haime, 1849; Edwards & Haime (1857); Ortmann (1888). Caulastrea echinulata (Edwards & Haime, 1849); Matthai (1928); Nemenzo (1959); Wijsman-Best (1972). Caulastrea aiharai Yabe & Sugiyama, 1935; Yabe, Sugiyama & Eguchi (1936). | Veron J. E. N., Pichon M., Maya Wijsman-Best, 1977, Schleractinia of Eastern Australia Part 2, Families Faviidae, Trachyphylliidae, Australian Institute of Marine Science (AIMS), Australian Government Publishing Service, Canberra
Yonge Reef, Palm Islands (4 specimens). These localities include collecting stations 9, 36, 90. | Dasyphyllia echinulata Edwards & Haime, 1849; Edwards & Haime (1857); Ortmann (1888). Caulastrea echinulata (Edwards & Haime, 1849); Matthai (1928); Nemenzo (1959); Wijsman-Best (1972). Caulastrea aiharai Yabe & Sugiyama, 1935; Yabe, Sugiyama & Eguchi (1936). | Veron J. E. N., Pichon M., Maya Wijsman-Best, 1977, Schleractinia of Eastern Australia Part 2, Families Faviidae, Trachyphylliidae, Australian Institute of Marine Science (AIMS), Australian Government Publishing Service, Canberra
Palm Islands (2 specimens), collecting station 55. | Psammocora explanulata van der Horst, 1922; Wells (1954). | Veron J. E. N., Pichon M., 1976, Schleractinia of Eastern Australia Part 1, Families Thamnasteriidae, Astrocoeniidae, Pocilloporidae, Australian Institute of Marine Science (AIMS), Australian Government Publishing Service, Canberra
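The two steps in Example 2, splitting the station list into one row per station and joining each row to the station lookup, can be sketched as below. This is an illustration under stated assumptions rather than the process actually used. The `STATIONS` dictionary holds only the three stations shown in the final rows above, its uncertainty values in kilometres are back-calculated from the 500 m shown there, and `explode_and_join` is a hypothetical helper name.

```python
# Assumed subset of the station lookup:
# station number -> (locality, longitude, latitude, uncertainty in km).
STATIONS = {
    "9": ("Yonge Reef", 145.623, -14.597, 0.5),
    "36": ("Electra Head, Great Palm Island", 146.69, -18.73, 0.5),
    "90": ("Pelorus Island, Palm Islands, W", 146.488, -18.551, 0.5),
}

# One digitised record, using the column headings of the source table.
record = {
    "Scientific Name": "Caulastrea echinulata (Edwards & Haime, 1849)",
    "Year of collection": "1976",
    "station no": "9,36,90",
}

def explode_and_join(record, stations):
    """Create one Darwin Core row per station listed in the record."""
    rows = []
    for station_no in record["station no"].split(","):
        station_no = station_no.strip()
        # A KeyError here means the station is missing from the lookup.
        locality, longitude, latitude, uncertainty_km = stations[station_no]
        rows.append({
            "scientificName": record["Scientific Name"],
            "year": record["Year of collection"],
            "locality": locality,
            "locationId": station_no,
            "decimalLongitude": longitude,
            "decimalLatitude": latitude,
            # The lookup stores kilometres; Darwin Core expects metres.
            "coordinateUncertaintyInMeters": round(uncertainty_km * 1000),
        })
    return rows

rows = explode_and_join(record, STATIONS)
for row in rows:
    print(row["locationId"], row["locality"], row["coordinateUncertaintyInMeters"])
```

Fields that apply to the whole publication record, such as the references and previous identifications, would simply be copied onto every row in the same way as scientificName.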

Below is the same data displayed in the ALA