Embracing sampling uncertainty in analyses with COM(P)ADRE

by Patrick Barks (University of Southern Denmark, email: barks@biology.sdu.dk)

The COM(P)ADRE Plant and Animal Matrix Databases together contain thousands of population projection matrices from hundreds of individual studies. The availability of these matrices to researchers has led to fascinating comparative analyses in the fields of ecology, evolution, and demography, at taxonomic, spatial, and temporal scales that would not otherwise be possible (see here for a list of relevant publications).

One of the challenges inherent in such analyses is that it’s often difficult to obtain information regarding the degree of sampling uncertainty associated with the values that populate projection matrices (i.e. stage- or age-specific transition rates based on survival, growth, and reproduction). These transition rates are almost always estimates of population parameters based on samples (population in the statistical sense), and therefore have associated sampling uncertainty, as do any parameters derived from them (e.g. population growth rate, damping ratio, life expectancy, etc.) [1]. Transition estimates based on a small number of individuals will tend to have large uncertainty, while those based on larger samples have less uncertainty. For example, the figure below depicts the sampling uncertainty associated with a stage-specific survival rate of 40% estimated from a random sample of N = 5 individuals vs. N = 50 individuals.

[Figure: sampling uncertainty]
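To put rough numbers on the N = 5 vs. N = 50 comparison, here is a minimal Python sketch (not the code used to draw the figure). It assumes the number of survivors is a binomial draw with a true survival rate of 40%. The function names and the 0.1 tolerance are illustrative choices, not part of the original analysis.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k survivors out of n when each survives with probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def survival_estimate_summary(n, p=0.4, tolerance=0.1):
    """Return the standard error of the estimated survival rate, and the
    probability that the estimate k/n lands within `tolerance` of p."""
    se = math.sqrt(p * (1 - p) / n)
    prob_close = sum(
        binomial_pmf(k, n, p)
        for k in range(n + 1)
        if abs(k / n - p) <= tolerance + 1e-12  # small slack for float rounding
    )
    return se, prob_close

for n in (5, 50):
    se, prob_close = survival_estimate_summary(n)
    print(f"N = {n:2d}: standard error = {se:.3f}, "
          f"P(estimate within 0.1 of 0.4) = {prob_close:.2f}")
```

The standard error shrinks with the square root of the sample size, so the tenfold larger sample gives an estimate roughly three times more precise.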

Whereas sampling uncertainty is routinely incorporated into statistical analyses in the original studies that produce projection matrices, it is rarely incorporated into analyses that use published projection matrix data from sources like COM(P)ADRE. This omission may lead to bias or overconfidence in some types of analyses.

To investigate this possibility, we are initiating a study to examine the nature and distribution of sampling uncertainty among projection matrices in the COMPADRE Plant Matrix Database. Our goals are to:

  • understand whether uncertainty in transition rates is relevant for analyses based on COMPADRE,
  • assess the types of variables or analyses that are most likely to be affected by sampling uncertainty, and
  • develop resources to help researchers incorporate sampling uncertainty into their analyses.

To this end, we are currently working to obtain additional data for as many of the matrices in COMPADRE as possible, with a specific focus on matrices from unmanipulated, wild populations. To estimate sampling uncertainty we generally require more information than is available in the original papers (e.g. stage-specific sample sizes, counts of reproductive structures, etc.), so we will be contacting many authors over the coming weeks to request these data. We sincerely appreciate the time and effort taken by researchers to make their hard-won datasets available to us. The inclusion of these matrices in COMPADRE is already a great service to the scientific community and we hope that our study will further increase their utility to researchers, and help to improve inferences derived from the COM(P)ADRE databases.

An example analysis using COMPADRE

To make the issue of sampling uncertainty more concrete, we’ll work through an example analysis with COMPADRE. Specifically, we’ll test the hypothesis that relatively long-lived species tend to experience relatively low year-to-year variation in population growth rates (λ). For simplicity, we’ll limit this analysis to species categorized as herbaceous perennials, and unmanipulated populations with at least three annual transition matrices in COMPADRE (and a few more selection criteria noted in the RMarkdown document here).

[Figure: variance in lambda as a function of life expectancy]

In the figure above, each point represents a population (as defined in COMPADRE), and the best-fit line is from a linear mixed model that accounts for non-independence of populations from the same species. There are of course different modeling approaches we could have taken — estimate life expectancy at the species level rather than population level, use a more complete model of phylogenetic non-independence, etc. — but we’ll save some of that for later.

For now, we’d like to know: how wide are the error bars associated with each point in the figure above? The regression model assumed zero uncertainty in both life expectancy and variance(log λ), but as previously noted, both variables are estimates of population parameters with inherent sampling uncertainty. Let’s take a detour here to try to estimate sampling uncertainty for a single population.

Modeling uncertainty in transition rates

Consider a set of matrices available in COMPADRE from a 6-year study of the perennial forb Agrimonia eupatoria (Rosaceae) in southern Sweden (Kiviniemi 2002) [2]. The matrices give us point estimates for each transition rate in each year, which we can use to calculate point estimates for derived parameters such as life expectancy, λ, and variance(log λ). But to estimate the uncertainty in all these parameters, we need information from outside COMPADRE [3].

First, we need to know how the transition rates were estimated. Based on the original paper, survival transitions were estimated directly from the fates of marked individuals (i.e. Aij = (number transitioned from stage i in year t to stage j in year t+1) / (number in stage i in year t)), and the single fecundity transition was estimated using the anonymous reproduction method (i.e. Asr = (number in seedling stage in year t+1) / (number in reproductive stage in year t)). Given this methodology, to reconstruct the raw counts from which each transition rate was estimated, all we need are the denominators in the equations above (i.e. stage-specific sample sizes for each transition period), which the original paper helpfully provides.
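The reconstruction step amounts to multiplying each rate by its denominator and rounding to a whole number of individuals. A minimal Python sketch follows. The rates and sample sizes are hypothetical and are not taken from Kiviniemi's paper, and `reconstruct_counts` is an illustrative helper name.

```python
def reconstruct_counts(rates, n_start):
    """Turn estimated transition rates back into whole-number counts.

    rates   -- estimated rates out of one stage (one per destination stage)
    n_start -- number of individuals in that stage in year t (the denominator)
    """
    counts = [round(rate * n_start) for rate in rates]
    # A rate that does not map onto a whole number of individuals suggests a
    # rounding issue or a mismatched sample size, so report the largest gap.
    worst_gap = max(abs(rate * n_start - c) for rate, c in zip(rates, counts))
    return counts, worst_gap

# Hypothetical stage with 20 marked individuals in year t:
# 25% stay in the stage and 60% move to the next stage.
survival_counts, gap = reconstruct_counts([0.25, 0.60], n_start=20)
n_died = 20 - sum(survival_counts)

# Hypothetical fecundity of 1.5 seedlings per reproductive plant, 12 such plants.
seedling_counts, _ = reconstruct_counts([1.5], n_start=12)

print(survival_counts, n_died, seedling_counts)  # [5, 12] 3 [18]
```

Reporting the largest gap between a rate times its denominator and the nearest integer is a cheap consistency check on whether a published rate and sample size belong together.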

The figure below shows point estimates for each transition rate (open circles), as well as 90% and 99% confidence intervals based on the relevant sampling distribution (thin and thick bars; assuming multinomial and Poisson distributions for the survival and fecundity transitions, respectively) [4].

[Figure: estimates and confidence intervals for each transition rate]
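One simple way to obtain intervals of this kind is a parametric bootstrap: plug the observed proportions into a multinomial (for fates of marked individuals) or the observed count into a Poisson (for seedlings), simulate many datasets, and read off percentiles. The Python sketch below does this for hypothetical counts. It is an assumption about how such intervals could be built, not a reproduction of the analysis behind the figure, and all function names are illustrative.

```python
import math
import random

def draw_multinomial(rng, n, probs):
    """Assign n individuals to fates with the given probabilities; return counts per fate."""
    fates = rng.choices(range(len(probs)), weights=probs, k=n)
    return [fates.count(i) for i in range(len(probs))]

def draw_poisson(rng, mean):
    """Poisson draw via Knuth's multiplication method (fine for modest means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def percentile_interval(samples, level):
    """Central interval containing roughly `level` of the simulated values."""
    s = sorted(samples)
    lo = s[round((1 - level) / 2 * (len(s) - 1))]
    hi = s[round((1 + level) / 2 * (len(s) - 1))]
    return lo, hi

rng = random.Random(42)
reps = 5000

# Hypothetical raw counts: 20 juveniles, of which 5 stay, 12 grow, 3 die.
fate_counts = [5, 12, 3]
n_juveniles = sum(fate_counts)
probs = [c / n_juveniles for c in fate_counts]
growth_draws = [draw_multinomial(rng, n_juveniles, probs)[1] / n_juveniles
                for _ in range(reps)]

# Hypothetical fecundity: 18 seedlings counted for 12 reproductive plants.
n_seedlings, n_reproductive = 18, 12
fecundity_draws = [draw_poisson(rng, n_seedlings) / n_reproductive
                   for _ in range(reps)]

for name, draws in [("growth", growth_draws), ("fecundity", fecundity_draws)]:
    for level in (0.90, 0.99):
        lo, hi = percentile_interval(draws, level)
        print(f"{name}: {level:.0%} interval = ({lo:.2f}, {hi:.2f})")
```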

To estimate the sampling distributions of the derived parameters, we generate thousands of simulated projection matrices by repeatedly drawing from the sampling distribution of each transition rate. The sampling distributions for the derived parameters are summarized below, again alongside the corresponding point estimates.

[Figure: sampling distributions for the derived parameters]
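The simulation idea can be sketched in Python with a deliberately tiny, hypothetical two-stage model (juvenile and adult) rather than the actual Agrimonia matrices. With two stages, each survival transition reduces to a binomial draw, and both λ and life expectancy have closed forms, which keeps the example short. All counts and function names here are assumptions made for illustration.

```python
import math
import random

def draw_binomial(rng, n, p):
    """Number of successes among n independent trials with probability p."""
    return sum(rng.random() < p for _ in range(n))

def draw_poisson(rng, mean):
    """Poisson draw via Knuth's multiplication method (fine for modest means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def growth_rate(g, s, f):
    """Dominant eigenvalue (lambda) of the matrix [[0, f], [g, s]]."""
    return (s + math.sqrt(s * s + 4 * f * g)) / 2

def life_expectancy(g, s):
    """Expected years lived from the juvenile stage: one year as a juvenile,
    plus g / (1 - s) years as an adult (infinite if adults never die)."""
    return math.inf if s >= 1 else 1 + g / (1 - s)

def simulate(n_juv, n_matured, n_adult, n_survived, n_recruits, reps=5000, seed=1):
    """Redraw every transition from its sampling distribution and recompute
    lambda and life expectancy for each simulated matrix."""
    rng = random.Random(seed)
    g_hat, s_hat = n_matured / n_juv, n_survived / n_adult
    results = []
    for _ in range(reps):
        g = draw_binomial(rng, n_juv, g_hat) / n_juv
        s = draw_binomial(rng, n_adult, s_hat) / n_adult
        f = draw_poisson(rng, n_recruits) / n_adult
        results.append((growth_rate(g, s, f), life_expectancy(g, s)))
    return results

# Hypothetical counts: 12 of 30 juveniles mature, 14 of 20 adults survive,
# and 24 recruits are attributed to the 20 adults.
results = simulate(30, 12, 20, 14, 24)
lambdas = sorted(lam for lam, _ in results)
lifespans = sorted(le for _, le in results)
mid, lo, hi = len(results) // 2, round(0.05 * len(results)), round(0.95 * len(results)) - 1
print(f"lambda: median {lambdas[mid]:.2f}, 90% interval ({lambdas[lo]:.2f}, {lambdas[hi]:.2f})")
print(f"life expectancy: median {lifespans[mid]:.2f}, "
      f"90% interval ({lifespans[lo]:.2f}, {lifespans[hi]:.2f})")
```

For matrices with more stages, the closed forms would be replaced by a numerical eigenvalue routine and the fundamental matrix (I - U)^-1, but the resampling loop keeps the same shape.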

For some transition periods, the point estimate for life expectancy is quite far from the respective confidence interval. This can occur when the sample of individuals in one or more stage classes experiences 100% survival, in which case the point estimate for life expectancy may be very high. But the sampling distribution for those one or more survival parameters will only include values ≤ the point estimate (e.g. see the 1996-97 seedling-to-juvenile transition) [5].
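The post does not spell out how the distribution for a 100%-survival estimate was constructed. Purely as an assumption for illustration, the Python sketch below draws survival probabilities from a Beta distribution (a uniform prior updated with the observed fates). This shows the qualitative effect: every simulated value sits at or below the point estimate of 1, so quantities that grow with survival are pulled downward. The sample size of 8 is hypothetical.

```python
import random

def survival_draws(n_alive, n_total, reps=2000, seed=1):
    """Draw survival probabilities from Beta(alive + 1, dead + 1).

    This is an assumed stand-in for the sampling distribution, chosen only
    because it remains well defined when every sampled individual survives.
    """
    rng = random.Random(seed)
    n_dead = n_total - n_alive
    return [rng.betavariate(n_alive + 1, n_dead + 1) for _ in range(reps)]

# Hypothetical stage class in which all 8 sampled individuals survived.
point_estimate = 8 / 8
draws = sorted(survival_draws(n_alive=8, n_total=8))

print("point estimate:", point_estimate)
print("median of draws:", round(draws[len(draws) // 2], 3))
print("all draws at or below the point estimate:", all(d <= point_estimate for d in draws))
```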

Incorporating sampling uncertainty into our analysis

We can now add the sampling uncertainty for the population of Agrimonia eupatoria to our original figure.

[Figure: sampling uncertainty for the population of Agrimonia eupatoria]

On one hand, the sampling uncertainty for this population seems high. On the other hand, we have a lot of data, and the relationship between life expectancy and variance(log λ) is strong. If we extrapolate (wildly) from this and a few other populations for which we have data, to make simple assumptions about the distribution of sampling uncertainty among all populations, we find that the observed degree of uncertainty is unlikely to change the results of the current analysis. The figure below depicts predictions from an extension of the previously described mixed-effects model that now also incorporates simulated measurement error in both life expectancy and variance(log λ). The results are essentially unchanged.

[Figure: model predictions incorporating simulated measurement error]

Perhaps this will be the case for many analyses with COMPADRE. But perhaps some types of analyses based on smaller subsets of data, or with more marginal effect sizes, will be more strongly influenced by sampling uncertainty. Either way, we think it warrants investigation, and we hope to report back with the answer.

References

Kiviniemi, K. (2002). Population dynamics of Agrimonia eupatoria and Geum rivale, two perennial grassland species. Plant Ecology, 159, 153-169. https://doi.org/10.1023/A:1015506019670

Notes

[1] Some of the matrices in COM(P)ADRE may in fact be based on data from entire biological populations rather than samples. Whether these map to ‘statistical populations’ will depend on the research question.

[2] Kiviniemi (2002) studied two populations of A. eupatoria, denoted A and B. Our analysis only includes population B, because the annual matrices for population A were mostly non-ergodic.

[3] Apart from uncertainty in the underlying transition rates, the uncertainty in variance(log λ) is also a function of the number of years over which it was estimated. This latter component of uncertainty is straightforward to model, but we ignore it here for simplicity.

[4] Because the transition rates reported in the original paper were estimated independently across years and transition types, our estimates of sampling error make the same assumptions. But now that we’ve reconstructed the raw data, we could of course model the transitions using a more nuanced correlational structure, e.g. partially pooling across years, or allowing for correlations among transition types.

[5] Note also that the point estimate for life expectancy for the 1994-95 transition was incalculable, because the estimated transition rates implied a 100% survival loop between the final two stage classes (i.e. infinite life expectancy).

We got the grant!

On August 5, the COMPADRE/COMADRE team was awarded an NSF grant to further develop our matrix databases. The funded project, “An Open-Access Global Repository of Plant and Animal Demographic Data”, will be led by Judy Che-Castaldo at Lincoln Park Zoo in Chicago, IL. This funding comes from the Advances in Biological Informatics program and will extend the functionality of the database and make it more user-friendly.

Celebrating our grant success at the ESA meeting in Portland, OR

There are three main parts to the funded project. In the first part, we will finish transferring our data into a relational database, which will run more efficiently and be less error-prone than our old system of spreadsheets. A second part will be to build a data-entry portal that our digitization team will use, so that the data input will be consistent across our digitization nodes around the world. Down the line, this portal will be opened to other researchers who can then contribute their own matrix data. The third part of the project will be to refresh our database website to make it more accessible to a wide range of users, including researchers, teachers, students, and conservation managers.

In addition to improvements to the database itself, we will also be bringing on board a project coordinator who will oversee data digitization and communication across all of our participating nodes. Together, we will develop educational materials and hold user engagement workshops at several scientific conferences each year to spread the word and encourage even greater use of our demographic matrix data for research and in classrooms.

We are so excited about this next step in the COMPADRE/COMADRE project! We hope you will follow along and give us your feedback as we continue to make our databases better and more useful for you.

By Judy Che-Castaldo

Our upcoming workshop in Portland, OR

Over the last few years we have run numerous workshops on using the COMPADRE Plant Matrix Database and COMADRE Animal Matrix Database, and on matrix population models (MPMs) more generally.

Where better to run our next workshop than the upcoming Ecological Society of America (ESA) meeting in Portland, Oregon?

This yearly conference brings together academics, students, and practitioners for a few days of talks and workshops on ecology and allied fields. Attendance is in the thousands — the last time it was in Portland (2012) the meeting drew an amazing crowd of 5000! Although not all attendees will have an interest in MPMs (shame!), there are sure to be more than a handful who’d like to know more.

To that end, this year we are running a half-day workshop entitled “Introduction to Matrix Population Models and Comparative Population Biology Using the COM(P)ADRE Matrix Databases”.

Drawing from experience garnered over the last few years, we will take attendees on a five-hour journey from the very basics of matrix modelling to comparative MPM analysis using R. The expert instructors are drawn from the COM(P)ADRE committees and include Owen Jones (Uni Southern Denmark), Roberto Salguero-Gomez (Uni Oxford), Judy Che-Castaldo (Lincoln Park Zoo) and Iain Stott (Uni Southern Denmark).

ESA 2017 attendees were given an opportunity to book a place at the workshop when they registered for the main conference. However, it should be possible to register as a last-minute attendee on-site.

If you can’t make it this time, rest assured that we will continue to run similar workshops regularly at relevant meetings/conferences. We also run them on request: we blogged about one of those here.

Here’s looking forward to some matrix modeling fun in a few days!

Promotional poster now available in Portuguese

With the generous help of Mariana Silva Ferreira, a PhD student based at the Federal University of Rio de Janeiro – Brazil, we now have a Portuguese version of our COMPADRE poster. Thank you Mariana!

Mariana’s translation is to Brazilian Portuguese, which I understand to be very similar to European Portuguese.

We hope that the poster will bring our matrix database to a new audience of researchers and students.

You can download the PDF by clicking on the image below.

COMPADRE Poster in Brazilian Portuguese

You can find the poster in other languages including English (of course), Italian, German, French, Chinese, Japanese, Turkish, Hungarian and Afrikaans here.

COMPADRE poster now available in 10 languages.

The COMPADRE Plant Matrix Database is a global enterprise containing data collected from all corners of the world. In our efforts to encourage the use of COMPADRE by as diverse a group of users as possible we have translated our poster summarising the enterprise into three more languages: Turkish, Afrikaans and Hungarian. This means that the poster is now available in 10 languages!

Here are the new posters. The others can be found here.

Turkish Version

Afrikaans Version

Hungarian Version

COMPADRE at Berlin’s csv,conf.

One of the COMPADRE/COMADRE core committee members (Owen Jones) recently attended the “csv,conf” in Berlin. This one-day conference was a fringe event of the bigger Open Knowledge Festival and was about data – it was for those who collect or aggregate it, those who make it available online and those who analyse and visualise it.

There were a heap of interesting talks – for example, Felienne Hermans spoke on why we should treat spreadsheets like Excel (which we use as our COMPADRE data entry platform) as a kind of code, and apply the techniques of good coding practice to them (e.g. by building in error checking and validation at each step). Another was Karthik Ram’s talk about a new package for R called testdat, which will be a useful tool to validate our COMPADRE/COMADRE metadata. For example, it can help identify outliers that could represent data entry errors, and things like non-numeric entries in numeric columns.

I gave a short talk about the COMPADRE and COMADRE population matrix databases. I covered some of the history of the databases and highlighted why these kinds of data are so important. I also highlighted some of the issues we have had to deal with along the way – one of these is how to handle data entry and error checking/validation on what are fast becoming large and unwieldy spreadsheets. We need to balance the need for a cheap, easy-to-use tool, with the need to have a robust error-free output.

Excel is great because it is already familiar to the COMPADRINOs* and has an easy learning curve. On the other hand, the fact that it is not always “what-you-see-is-what-you-get” means that errors can creep in unnoticed. For example, a number can be registered as a text string so that 0.00 is recorded as being different than 0, or sometimes as a date — very frustrating!

Fortunately, we do not distribute COMPADRE/COMADRE data in its “raw” Excel form – we save the spreadsheets out as CSV files and then combine them into a structured RData list object. While doing this, Rob Salguero-Gomez and I, the supervisors of COMPADRE and COMADRE, have developed routines to carry out a range of error checks and validations for all the metadata and matrices, allowing us to identify and correct any errors and inconsistencies before data distribution**.
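To illustrate the kind of problem described above, here is a minimal Python sketch of a numeric-column check on values read from a CSV file as text. It is not one of the actual COMPADRE/COMADRE validation routines (which are written for R). The function name and the example cells are made up for illustration.

```python
def check_numeric_column(cells):
    """Parse text cells as numbers.

    Returns (parsed, problems): a float (or None) for every cell, and a list
    of (row number, raw text) pairs for the cells that are not numeric.
    """
    parsed, problems = [], []
    for row, raw in enumerate(cells, start=1):
        try:
            parsed.append(float(raw))
        except ValueError:
            parsed.append(None)
            problems.append((row, raw))
    return parsed, problems

# Cells as they might arrive from a spreadsheet export: two spellings of zero,
# a valid rate, a value silently converted to a date, and a text placeholder.
cells = ["0", "0.00", "0.35", "1996-03-01", "n/a"]
parsed, problems = check_numeric_column(cells)

print(cells[0] == cells[1])    # False: as text, "0" and "0.00" differ
print(parsed[0] == parsed[1])  # True: as numbers, they are the same
print(problems)                # [(4, '1996-03-01'), (5, 'n/a')]
```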

Here’s the abstract for the talk:

Evolutionary biologists aim to make sense of population behaviour in species across the tree of life. However, the collection of animal and plant population data is laborious and costly so analyses that try to generalise across many species are not feasible unless data are shared among researchers, or obtained from the literature. I will report on the 30+ year journey of construction of two databases that collate demographic data from published literature on more than 2000 species with an aim of making it openly available to all. I will briefly outline why these data are important, describe the process of data production, and contemplate the lessons learned along the way.

Unfortunately the talk wasn’t recorded, but you can find the slides for it here at Figshare.

*The COMPADRINOS are the wonderful team of students based at the MPIDR in Rostock that do the data acquisition and data entry work for the databases.

**No doubt some errors will still creep in – please let us know if you spot any (compadre-contact AT demog DOT mpg DOT de)

COMPADRE Posters

The COMPADRE Plant Matrix Database is an international enterprise. The database contains globally distributed data, and has an international committee with representatives from all corners of the globe.

To help promote the use of COMPADRE we have produced a series of posters in several languages. So far we have them available in English, Spanish, French, German, Japanese, Italian and Chinese. Get them here and feel free to print and distribute them!

A selection of the COMPADRE Plant Matrix Database posters. These posters describe the database and advertise that it is available for anyone via the web.
