Sharing health data: good intentions are not enough
Elizabeth Pisani a & Carla AbouZahr b
a. London School of Hygiene and Tropical Medicine, Keppel Street, WC1E 7HT, London, England.
b. Department of Health Statistics and Informatics, World Health Organization, Geneva, Switzerland.
Correspondence to Elizabeth Pisani (e-mail: firstname.lastname@example.org).
(Submitted: 18 November 2009 – Revised version received: 05 January 2010 – Accepted: 07 January 2010.)
Bulletin of the World Health Organization 2010;88:462-466. doi: 10.2471/BLT.09.074393
As they prepare for careers in science, today’s students doubtless hear the same clichés as we did a generation ago: science advances collaboratively; we reproduce and extend the work of others; we stand on the shoulders of giants. In some fields, such as genomics, these axioms are becoming true. In epidemiology and public health, however, data sharing and collaboration remain more aspirational than real.
Students embark on a career in health research in the spirit of sharing; they want to help improve the well-being of others. For all the talk of collaboration, they will enter a world in which another axiom dominates: “publish or perish”. That system puts the interests of public health researchers in direct conflict with the interests of public health.
Benefits of sharing
The situation was not so different in genomics less than 15 years ago. Then, after years of hoarding their findings in individual laboratories and progressing at an expensive snail’s pace, in 1996 researchers agreed to share all their data openly.1 Now laboratories sequence during the day and post their results that same night; other researchers can begin to stand on their shoulders the very next day. As a result, genetic research is advancing faster than any other area of biomedicine.2
Genomics has taught us that sharing data with other scientists is a way to add value without costing a lot. It allows the same data to be used to answer new questions that may be relevant far beyond the original study. And it allows for meta-analyses that are free from the distortions introduced when only summary results are available.3,4 We could get far more out of public health research if we followed a similar path, if we squeezed more scientific and policy insights out of data that have already been collected.
Routine health and service use statistics can be just as useful for policy analysis as research data. Many countries are reluctant to release detailed service use data because analysis by disinterested outsiders may contradict politically acceptable interpretations. Most countries do, however, contribute aggregate statistics freely to large international databases maintained by multilateral organizations, although they are not always granted free access to those databases when they want to use them. Such restrictions on access, imposed unnecessarily by agencies wanting to protect their institutional mandates, cripple the potential utility of these expensive resources. Researchers and governments are also reluctant to see the data they provide used and manipulated by others in ways they don’t understand because secondary users (including international agencies) do not always publish their methods.
Research data are desperately underused too, in part because of a critical shortage of competent data managers.5 In other fields – genetics, banking and retailing – data management is a valuable skill. People are trained and develop careers in the field. In public health research, data management is the poor cousin of analysis. Undervalued and underfunded, inadequate data management undermines the rest of the scientific enterprise. One review in the United Kingdom of Great Britain and Northern Ireland found that many of the variables collected in epidemiological studies were never cleaned and coded, so they could not be used even by the primary researchers, let alone shared.6 In complex population-based surveys in developing countries, data management and analysis skills are in even shorter supply, so a higher proportion of data probably goes to waste.7
When we’re dealing with public health research, wasted data can translate into shorter, less healthy lives. Improving data management so that data can be shared is a first step to reducing that waste. But it will not be enough. We need to change the incentives that pit the interests of individual researchers against the interests of public health, that pit institutional interests against the more rapid advancement of knowledge and understanding. Governments may hold micro-data back from international organizations, but there’s no excuse for international organizations to limit access to the aggregate data that governments do provide.
It’s easier to understand why individual researchers are reluctant to share data they have collected. That reluctance will certainly remain entrenched as long as their employers – research councils, foundations and universities – regard publication of research papers in peer-reviewed biomedical journals as the main yardstick of success.8 If, however, “publish [papers] or perish” were to be replaced by “publish [data] or perish”, the picture might change rapidly, as it did in genomics.
What did that experience teach us? That a change in the culture of science requires the buy-in of key research teams, yes, but that it also requires considerable and very concrete commitments from funders. The two largest funders of the Human Genome Project, the Wellcome Trust and the National Institutes of Health, invested massively in the infrastructure needed to share data on a large scale for the long term. They also changed funding mechanisms to emphasize team work and the value of roles such as data management, rather than just looking at publication and citation records. Inevitably the rapid change of culture raised some tensions, but those have now largely been resolved.2 It would be perfectly feasible for research funders to take similar steps in other fields so that personal and professional incentives are aligned rather than in conflict.
Genomics and the social sciences (which have a dramatically better record of sharing data than most biomedical sciences) have developed techniques to deal with two of the other main obstacles to sharing of public health research data – confidentiality and consent. In part because of the development of research tissue banks (biobanks), broad consent procedures are increasingly becoming a norm.9 Anonymization removes some of the obstacles associated with consent, and techniques for protecting identities are improving constantly. Despite concerns about the theoretical possibility of identifying individuals in shared data sets, no breaches of confidentiality have yet been recorded in anonymized data sets.10 Social and economic sciences have also gone further in making the sharing of data sets easy through standard metadata, both for aggregate data through Statistical Data and Metadata Exchange (SDMX) standards and for individual data using Data Documentation Initiative (DDI 3.0) standards. A further lesson from other fields: it is possible to make data widely available to the research community while still safeguarding integrity, through the use of standardized data use agreements and licences.11,12 These define who may use data and how, and may require secondary analysts to contribute both derived data and a record of their analytic methods back to the database, so that primary and other users can both verify and benefit from their work.
The data that we collect and don’t make full use of do not come free. The collection of routine health statistics is paid for by our tax money. Most research aiming to reduce ill-health in the developing world is also funded either from the public purse or by charitable foundations. It is irrational to invest so much in collecting data and yet so little in ensuring that we make the best use of it.13 It is also ethically unsound; people who participate in research have a right to expect that the results will be used to improve life for them and/or for their communities.
Funders and standard-setters have been aware of this for some time. Gradually, they are urging or adopting policies that aim to increase the use and recycling of data. Although they don’t all yet practise what they preach, several international organizations, including the Organisation for Economic Co-operation and Development and the World Health Organization, have issued statements calling for increased access to routine statistics and other publicly-funded data.14,15 Many biomedical journals have recently addressed the importance of data sharing in editorials and commentary articles.16–18 A few biomedical journals expect researchers to make the data that underlie research articles available to others on request. An even smaller number of journals have followed the lead of Annals of Internal Medicine and now require authors to state whether and how they will make protocols, analysis tools and data available to others. But even Annals stops short of requiring authors to publish data sets along with their articles. “If we did that, we’d have a very thin journal,” commented editor Christine Laine at a recent conference on biomedical publication.19
There are indications that public and foundation funders of public health research wish to strengthen data sharing policies, shepherding epidemiologists down the road already travelled by geneticists.20–23 Many field researchers who have battled difficult climates, erratic electricity supplies, fuel shortages and recalcitrant local authorities will doubtless resent increasing pressure to “give data away”. Some are also apprehensive that people looking at the data in the comfort of some distant, well resourced office will spot the errors that are the inevitable by-product of research in the real world.
Governments are equally reluctant to expose their data to interpretations other than those published by their official statisticians. There is a fear, too, that data may be used by others not just for professional but for economic gain. This is sometimes cast as a “north–south” divide; one spectre raised is of pharmaceutical companies exploiting data from developing countries to develop products that those countries then can’t afford.24
Feelings of ownership over hard-won data, viscerally held even by researchers who support the idea of data sharing in principle, are understandable. And peer reviewers, mostly researchers themselves, are reluctant to approve funding for data management if it cuts into budgets for data collection. But funders of science are themselves under pressure to get the most out of expensive research studies. They have to wrestle with two important questions: how much data sharing is desirable and how much is feasible?
Researchers sometimes argue that interpretation of their data is so dependent on understanding local conditions that the data would be worthless to other scientists. This is often a reflection of inadequate documentation, but also a necessary failure of imagination. Sailors keeping log books on whaling boats in the 1600s could not have predicted that, centuries later, the data would be an important source of information for climate change scientists.25 Most funders have stringent peer-review procedures; few invest in research that they believe is of only very localized importance, and few wish to support research that produces data of such poor quality that it has no further value. Publicly-funded data can also be invaluable to students learning data management and analysis skills. It thus seems fair to expect that almost all public health research funded by taxpayers or charities might be useful to secondary analysts. If a piece of research is considered worthy of publication in a peer-reviewed journal, the underlying data should also be worth publishing.
How feasible would it be to make these data available to the scientific community? Technically, the challenges are not trivial, but they have been overcome in several other fields; they are broken down here into manageable parts. We maintain that the major constraints to feasibility are a cultural resistance to change from within our own scientific community, and a reluctance of any institution to take leadership of the data sharing agenda. We also believe, however, that the imperative to share data will only grow stronger. The research community should look at this pressure from funders as an opportunity rather than an imposition.
Goals for funders and researchers
Here we propose several goals to which funders and researchers can jointly aspire and towards which progress can be measured: (i) all data of potential public health importance funded by taxpayers or foundations will be appropriately documented and archived in formats accessible to the wider scientific community; (ii) all data provided by governments to databases developed by publicly-funded organizations will be freely available to any user, at the level of detail at which it was provided; (iii) the publication of a research article in a biomedical journal will be accompanied by the publication of the data set upon which the analysis is based; (iv) funders and employers of researchers will consider publication of well managed data sets as an important indicator of success in research, and will reward researchers professionally for sharing data; and (v) all planned research will budget and be funded to manage data professionally to a quality adequate for archiving and sharing.
Plan of work
These goals can only be achieved with considerable investment in several practical areas. We propose the following plan of work, necessary to underpin progress towards our stated goals.
Fill the gaps in data management
There is a need to develop metadata standards, which will lead to improved documentation and allow data to be combined more easily across time, locations and sources. This will probably require the extension of DDI and SDMX standards to encompass areas of public health interest. Agreement is also required on standards for anonymization and safeguarding of confidentiality.
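As a rough illustration only, and not a standard proposed in this paper, the sketch below shows two common de-identification steps that an agreed anonymization standard might specify: replacing a direct identifier with a keyed hash and coarsening age into bands. The field names, the five-year band width and the key handling are all assumptions made for the example. Keyed hashing of this kind is pseudonymization rather than full anonymization, so it would normally be one element of a wider confidentiality procedure.

```python
import hashlib
import hmac

# Hypothetical field names; a real standard would define which
# variables count as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "address"}


def pseudonymize_id(raw_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256), truncated to 16 hex characters."""
    digest = hmac.new(secret_key, raw_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def age_band(age: int, width: int = 5) -> str:
    """Coarsen an exact age into a band such as '35-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record that is safer to share."""
    # Drop direct identifiers outright.
    shared = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Keep records linkable across files without exposing the original ID.
    shared["participant_id"] = pseudonymize_id(str(record["participant_id"]), secret_key)
    # Replace exact age with an age band.
    shared["age_band"] = age_band(shared.pop("age"))
    return shared


# Invented example record, for illustration only.
record = {
    "participant_id": "ID-00123",
    "name": "Example Person",
    "address": "1 Example Road",
    "age": 37,
    "test_result": "negative",
}
shared = deidentify(record, secret_key=b"key-held-only-by-the-data-custodian")
print(shared)
```

The same key yields the same pseudonym every time, so records for one participant can still be linked across files, while anyone without the key cannot recompute the pseudonym from a known identifier.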
We need to develop a search portal that will allow data to be discovered across a range of repositories, and standards for repositories similar to those used for registries of clinical trials.26 We also need to invest in training in data management for public health, especially in developing countries, and the development of career paths in bioinformatics.
Increase incentives to share data
We need to further develop and adopt reliable citation standards for data sets, such as those proposed by the DataCite collaboration,27 and ensure they are indexed in databases such as PubMed. Standards and procedures for peer review or quality control of data sets are also needed. Digital fingerprinting of data would allow tracing of secondary use,28,29 and we should develop methods and measures to track the value that sharing data adds to the work of both primary researchers and funders of research. There is a need to agree on norms and standards governing fair use periods for primary researchers, data access policies and data use agreements.
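To make the idea of a digital fingerprint concrete, the following minimal sketch computes a content hash for a small data set so that a citation can state exactly which version of the data was analysed. It is a simplified stand-in that assumes SHA-256 over a canonical JSON form of each row; it is not the fingerprinting scheme described in the cited proposals, and the function name and example rows are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(rows: list[dict]) -> str:
    """Return a SHA-256 fingerprint that ignores row order and key order."""
    # Serialize every row in a canonical form: sorted keys, no extra spaces.
    canonical_rows = sorted(
        json.dumps(row, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
        for row in rows
    )
    digest = hashlib.sha256()
    for line in canonical_rows:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # separator so that row boundaries affect the hash
    return digest.hexdigest()


# Invented example rows, for illustration only.
original = [
    {"site": "A", "year": 2008, "cases": 41},
    {"site": "B", "year": 2008, "cases": 17},
]
reordered = [
    {"cases": 17, "year": 2008, "site": "B"},
    {"year": 2008, "site": "A", "cases": 41},
]
corrected = [
    {"site": "A", "year": 2008, "cases": 42},
    {"site": "B", "year": 2008, "cases": 17},
]

print(dataset_fingerprint(original) == dataset_fingerprint(reordered))
print(dataset_fingerprint(original) == dataset_fingerprint(corrected))
```

A secondary analyst who quotes the fingerprint alongside the data set citation allows others to check that they are working from identical data, and any later correction to the data produces a different fingerprint.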
To underpin the long-term viability of data libraries, we need to invest in expanding existing infrastructure to cover curation and access of data of public health importance. This calls for a business or funding model that assures the long-term viability of data archives.
All of these areas have already been identified as critical to promoting data sharing, often repeatedly so.5,30–32 Funders, governments, publishers and many researchers want these things to happen, it seems. Some of the organizations calling for greater sharing of public health research data have expressed willingness to pay for parts of the work. But none are willing to take charge of the agenda, committing themselves to orchestrating the dull, messy but essential work of developing the norms and standards that will allow data sharing to revolutionize public health research.
It is time to move beyond expressions of good intentions and to get on with the practical work that will allow data to be shared. The first thing that is needed is leadership. We challenge other participants in this round table to commit to coordinating, funding or carrying out the work described in this paper. Only after someone takes the lead in tackling these issues will today’s students of public health be able to climb onto the shoulders of the current giants in our field.
References
1. Smith D, Carrano A. International large-scale sequencing meeting. Human Genome News 1996; 7. Available from: http://www.ornl.gov/sci/techresources/Human_Genome/publicat/hgn/v7n6/19intern.shtml [accessed 26 February 2010].
2. Kaye J, Heeney C, Hawkins N, de Vries J, Boddington P. Data sharing in genomics – re-shaping scientific practice. Nat Rev Genet 2009; 10: 331-5 doi: 10.1038/nrg2573 pmid: 19308065.
3. Nüesch E, Trelle S, Reichenbach S, Rutjes AWS, Bürgi E, Scherer M, et al. The effects of excluding patients from the analysis in randomised controlled trials: meta-epidemiological study. BMJ 2009; 339: b3244 doi: 10.1136/bmj.b3244 pmid: 19736281.
4. Elobeid MA, Padilla MA, McVie T, Thomas O, Brock DW, Musser B, et al. Missing data in randomized clinical trials for weight loss: scope of the problem, state of the field and performance of statistical methods. PLoS ONE 2009; 4: e6624 doi: 10.1371/journal.pone.0006624 pmid: 19675667.
5. Lord P, MacDonald A, Sinnot R, Ecklund D, Westhead M, Jones A. Large-scale data sharing in the life sciences: data standards, incentives, barriers and funding models (The “Joint data standards study”). Glasgow & Edinburgh: National e-Science Centre; 2006. Available from: http://www.nesc.ac.uk/technical_papers/uk.html [accessed 26 February 2010].
6. Corti L, Wright M. MRC Population data archiving and access. London: Medical Research Council; 2002.
7. Chandramohan D, Shibuya K, Setel P, Cairncross S, Lopez AD, Murray CJ, et al. Should data from demographic surveillance systems be made more widely available to researchers? PLoS Med 2008; 5: e57 doi: 10.1371/journal.pmed.0050057 pmid: 18303944.
8. Field D, Sansone SA, Collis A, Booth T, Dukes P, Gregurick SK, Kennedy K, et al. Omics data sharing. Science 2009; 326: 234-6 doi: 10.1126/science.1180598.
9. Mascalzoni D, Hicks A, Pramstaller P, Wjst M. Informed consent in the genomics era. PLoS Med 2008; 5: e192 doi: 10.1371/journal.pmed.0050192 pmid: 18798689.
10. Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet 2008; 4: e1000167 doi: 10.1371/journal.pgen.1000167 pmid: 18769715.
11. Application to use restricted microdata. Minneapolis: IPUMS International. Available from: https://international.ipums.org/international/ [accessed 26 February 2010].
12. UK Data Archive. End user licence. Colchester: University of Essex; 2008. Available from: http://www.data-archive.ac.uk/aandp/access/licence.asp [accessed 26 February 2010].
13. Pisani E, Whitworth J, Zaba B, AbouZahr C. Time for fair trade in research data. Lancet 2010; 375: 703-5 doi: 10.1016/S0140-6736(09)61486-0 pmid: 19913902.
14. OECD Principles and guidelines for access to research data from public funding. Paris: Organisation for Economic Co-operation and Development; 2007.
15. Global strategy and plan of action on public health, innovation and intellectual property. Geneva: World Health Organization; 2008.
16. How to encourage the right behaviour. Nature 2002; 416: 1 doi: 10.1038/416001b.
17. Data's shameful neglect. Nature 2009; 461: 145 doi: 10.1038/461145a.
18. PLoS Medicine Editors. Next stop, don't block the doors: opening up access to clinical trials results. PLoS Med 2008; 5: e160 doi: 10.1371/journal.pmed.0050160 pmid: 18630986.
19. Laine C, Berkwits M, Mulrow C, Schaeffer MB, Griswold M, Goodman S. Reproducible research: biomedical researchers’ willingness to share information to enable others to reproduce their results. In: Sixth International Congress on Peer Review and Biomedical Publication, Vancouver, Canada, 10–12 September 2009. Available from: http://www.ama-assn.org/public/peer/abstracts-0910.pdf [accessed 26 February 2010].
20. NIH guide: final NIH statement on sharing research data. Bethesda: National Institutes of Health; 2003. Available from: http://grants.nih.gov/grants/oer.htm [accessed 26 February 2010].
21. MRC Policy on data sharing and preservation. London: Medical Research Council; 2008. Available from: http://www.mrc.ac.uk/PolicyGuidance/EthicsAndGovernance/DataSharing/PolicyonDataSharingandPreservation/index.htm [accessed 26 February 2010].
22. Policy on data management and sharing. Wellcome Trust; 2007. Available from: http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTX035043.htm [accessed 26 February 2010].
23. Sharing public health data: a code of conduct. London: Wellcome Trust; 2008. Available from: http://www.wellcome.ac.uk/About-us/Policy/Spotlight-issues/Data-sharing/Public-health-and-epidemiology/index.htm [accessed 26 February 2010].
24. Supari SF. Saatnya dunia berubah: tangan Tuhan di balik virus flu burung [in Indonesian]. Jakarta: Sulaksana Watinsa Indonesia; 2008.
25. International Comprehensive Ocean-Atmosphere Data Set. Washington, DC: National Oceanic and Atmospheric Administration; 2009. Available from: http://icoads.noaa.gov/ [accessed 26 February 2010].
26. International Clinical Trials Registry Platform, WHO Registry Criteria, version 2.1. Geneva: World Health Organization; 2009. Available from: http://www.who.int/ictrp/network/criteria_summary/en/index.html [accessed 26 February 2010].
27. DataCite - International initiative to facilitate access to research data. Hannover: German National Library of Science and Technology; 2009. Available from: http://www.datacite.org/ [accessed 26 February 2010].
28. Paskin N. Digital Object Identifier (DOI) System. In: Encyclopedia of library and information sciences. New York: Taylor & Francis; 2008.
29. Altman M, King G. A proposed standard for the scholarly citation of quantitative data. D-Lib 2007.
30. Lowrance W. Access to collections of data and materials for health research. London: Medical Research Council; 2006.
31. Pisani E. OpenEpi: a new culture for public health data? London: Wellcome Trust; 2008.
32. National Academy of Sciences. Ensuring the integrity, accessibility and stewardship of research data in the digital age. Washington, DC: National Academy Press; 2009.