One of the COMPADRE/COMADRE core committee members (Owen Jones) recently attended the “csv,conf” in Berlin. This one-day conference was a fringe event of the bigger Open Knowledge Festival and was about data – it was for those who collect or aggregate it, those who make it available online and those who analyse and visualise it.
There were a heap of interesting talks – for example, Felienne Hermans spoke on why we should treat spreadsheets such as those in Excel (which we use as our COMPADRE data entry platform) as a kind of code, and apply the techniques of good coding practice to them (e.g. by building in error checking and validation at each step). Another was Karthik Ram’s talk about a new package for R called testdat, which will be a useful tool to validate our COMPADRE/COMADRE metadata. For example, it can help identify outliers that could represent data entry errors, as well as problems such as non-numeric entries in numeric columns.
I gave a short talk about the COMPADRE and COMADRE population matrix databases. I covered some of the history of the databases and highlighted why these kinds of data are so important. I also highlighted some of the issues we have had to deal with along the way – one of these is how to handle data entry and error checking/validation on what are fast becoming large and unwieldy spreadsheets. We have to balance the need for a cheap, easy-to-use tool against the need for robust, error-free output.
Excel is great because it is already familiar to the COMPADRINOS* and has a gentle learning curve. On the other hand, it is not always “what-you-see-is-what-you-get”, which means that errors can creep in unnoticed. For example, a number can be stored as a text string, so that 0.00 is treated as different from 0, or sometimes as a date. Very frustrating!
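The text-versus-number problem is easy to reproduce outside Excel. Here is a minimal Python sketch of the idea. It is only an illustration, not part of the COMPADRE tooling, and the helper name `normalise_cell` is hypothetical.

```python
def normalise_cell(raw):
    """Return the cell as a float if it parses as a number, else None.

    Hypothetical helper for illustration only.
    """
    try:
        return float(raw.strip())
    except ValueError:
        return None


# Compared as text, the two cells look different ...
text_equal = "0.00" == "0"

# ... but compared as numbers they are the same value.
numeric_equal = normalise_cell("0.00") == normalise_cell("0")

# A cell that was turned into a date no longer parses as a number at all.
date_like = normalise_cell("01/02/2014")

print(text_equal, numeric_equal, date_like)
```

Comparing cells only after converting them to numbers avoids the first problem, and a cell that fails to convert is a good candidate for a manual check.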
Fortunately, we do not distribute COMPADRE/COMADRE data in its “raw” Excel form – we save the spreadsheets as CSV files and then combine them into a structured RData list object. While doing this, Rob Salguero-Gomez and I, the supervisors of COMPADRE and COMADRE, have developed routines that carry out a range of error checks and validations on all the metadata and matrices, allowing us to identify and correct errors and inconsistencies before data distribution**.
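To give a flavour of what such checks can look like, here is a rough Python sketch that scans one numeric column of a CSV file for non-numeric entries and for outliers. It is not our actual routine and it does not use testdat. The column name `survival`, the sample values and the function `check_numeric_column` are all made up for illustration. Outliers are flagged with a modified z-score based on the median absolute deviation, using the conventional 3.5 cut-off, which is simply one reasonable choice among many.

```python
import csv
import io
import statistics

# Made-up sample data: one entry uses the letter "O" instead of a zero,
# and one entry has lost its decimal point.
SAMPLE_CSV = """species,survival
A,0.50
B,0.52
C,0.48
D,O.51
E,0.49
F,51
G,0.51
"""


def check_numeric_column(rows, column, threshold=3.5):
    """Return (non_numeric, outliers) as lists of (row_number, value)."""
    non_numeric, parsed = [], []
    for row_number, row in enumerate(rows, start=2):  # row 1 is the header
        try:
            parsed.append((row_number, float(row[column])))
        except (TypeError, ValueError):
            non_numeric.append((row_number, row[column]))

    outliers = []
    values = [value for _, value in parsed]
    if len(values) >= 3:
        median = statistics.median(values)
        # Median absolute deviation: a spread measure that is itself
        # not dragged around by the outliers we are looking for.
        mad = statistics.median(abs(v - median) for v in values)
        if mad > 0:
            outliers = [
                (n, v) for n, v in parsed
                if 0.6745 * abs(v - median) / mad > threshold
            ]
    return non_numeric, outliers


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
non_numeric, outliers = check_numeric_column(rows, "survival")
print("Non-numeric entries:", non_numeric)
print("Possible outliers:", outliers)
```

Reporting the spreadsheet row number alongside each flagged value makes it easy to go back to the original file and correct the entry by hand.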
Here’s the abstract for the talk:
Evolutionary biologists aim to make sense of population behaviour in species across the tree of life. However, the collection of animal and plant population data is laborious and costly so analyses that try to generalise across many species are not feasible unless data are shared among researchers, or obtained from the literature. I will report on the 30+ year journey of construction of two databases that collate demographic data from published literature on more than 2000 species with an aim of making it openly available to all. I will briefly outline why these data are important, describe the process of data production, and contemplate the lessons learned along the way.
Unfortunately the talk wasn’t recorded, but you can find the slides for it here at Figshare.
*The COMPADRINOS are the wonderful team of students based at the MPIDR in Rostock who do the data acquisition and data entry work for the databases.
**No doubt some errors will still creep in – please let us know if you spot any (compadre-contact AT demog DOT mpg DOT de)