How we’re building a team for better data at Our World in Data

Data is too often published in a way that’s hard to understand, check and build upon – reinforcing the low valuation it gets in society. Here is how we’re trying to break out of this bad equilibrium.

I joined Our World in Data (OWID) as a researcher back in 2017, at a time when the whole team could fit comfortably in a car.

Recently we’ve been able to grow the team, and one thing this has allowed us to do is divide our work into more specialized roles. Today, OWID is produced as a close collaboration between three teams: a research team, a data team, and a team of software and web developers.

This is not a common way of organizing scientific research. But I think it’s our key strength, and the primary engine that has enabled us to have a much bigger impact than I could ever have imagined when I joined.

In this post, I explain how these teams work together, and why I think that, in building this kind of collaborative team, OWID has hit on a good approach to doing research.

This post is mainly intended to provide more background and motivation to people considering applying for our current job vacancy (as of 31/01/2022). We are looking for a data analyst with a very good knowledge of the research and data relating to economic development – on topics like poverty, inequality and economic growth. They will be working closely with me and the founder and director of OWID, Max Roser, as we revise and expand our work on these topics in the coming months and years.

However, this post may also be of interest to people who follow our work more generally and are curious about how we organize ourselves as a team.

The division of labor

An important, very old idea in economics is that the division of labor between workers specializing in different tasks can increase their collective productivity.

It is surprising then to see how limited the degree of specialization often is within the production of economic research itself. Researchers are often responsible for the whole chain of activities from beginning to end: understanding the literature, finding and handling the data, conducting analyses and building models, writing papers, making the tables and figures, and communicating their research to wider audiences.

One downside to this way of organizing research is that it’s hard for a single person to be an expert on all these different fronts. Managing all these tasks yourself generally means doing some of them badly, or at least inefficiently. The fact that people have different strengths is what makes collaboration so powerful: colleagues gain from each other’s comparative advantages.

As the team at OWID grew in recent years, it became increasingly clear that we needed to coordinate our work within three subteams: researchers, whose focus is understanding and communicating the academic research frontier on a given topic; web developers and engineers, who build the digital publishing infrastructure we use; and a data team that manages the data and charts we use across all our work and ensures their quality.¹

This specialization allows people to focus on what they enjoy and what they’re best at, and to invest their energy in becoming better at it still. This clear division of labor is still very new for us and is a work in progress. But, as I discuss below, it has already massively improved the quality and efficiency of our work, increasing its impact by orders of magnitude when compared to what each of us could do individually.

Limits to specialization

One reason why a lot of scientific research is conducted by one-man bands is that it can be difficult to bring together people with different backgrounds to produce good research. Important understandings often arise within each part of the research production line that somehow need to be shared across them. For instance, those in the weeds of the data are the ones most aware of its quirks or shortcomings that need to be taken into account. Conversely, it’s easy to make inappropriate comparisons or transformations of the data without having a very good understanding of the methodologies and research context behind it.

This problem – that increased specialization can come at the cost of making coordination harder – is a general one and not unique to scientific research.² And OWID is of course also subject to this trade-off. Whilst we organize our work into specialist teams, these teams still need to be able to understand one another to make collaboration possible.

As such, we value people who are able to wear multiple hats: our specialists must also be generalists to a degree. Our researchers need to be able to code well in order to collaborate with and learn from the data and development teams. And the data team we are building includes people who not only have excellent data skills, but who also have an excellent knowledge of the data and research in a given area of our work.

Already in our data team we have experts on global health, political science, energy and the environment, and the Sustainable Development Goals. With our current vacancy we are looking to add to this an expert on economic data – the metrics that inform us about issues like poverty, inequality and growth.

Collaboration within research can be made easier

Whilst some trade-off between the gains from increased specialization and the costs of coordination is perhaps inevitable, the terms of this trade-off are not fixed. Much can be done to facilitate collaboration and shared understanding across different research tasks.

For instance, whilst computer programming is an integral part of most fields of scientific research, many researchers are not familiar with standard tools and practices used within the wider programming and software development community to facilitate collaboration: version control systems like Git, and platforms built on them like GitHub; strategies for making code more readable and reusable; norms concerning documentation; and helpful interactive formats for sharing scripts that analyze data like Jupyter notebooks or Markdown.
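As a rough illustration of what readable, reusable and documented code can look like in practice, here is a minimal Python sketch. The function name, the sample incomes and the 2.15 threshold are all hypothetical and chosen purely for illustration; this is not code from OWID's own data work.

```python
from typing import Iterable


def poverty_headcount_ratio(incomes: Iterable[float], poverty_line: float) -> float:
    """Return the share of people whose income is below the poverty line.

    Args:
        incomes: Income per person, all expressed in the same units.
        poverty_line: Threshold in the same units as `incomes`.

    Raises:
        ValueError: If `incomes` is empty.
    """
    values = list(incomes)
    if not values:
        raise ValueError("incomes must contain at least one value")
    # Count people strictly below the line, then divide by the total.
    below = sum(1 for income in values if income < poverty_line)
    return below / len(values)


if __name__ == "__main__":
    # Made-up numbers for illustration only, not real data.
    sample_incomes = [1.0, 1.5, 3.0, 10.0]
    print(poverty_headcount_ratio(sample_incomes, poverty_line=2.15))
```

A descriptive name, type hints, a docstring and an explicit error for empty input are small things, but they are what let a colleague review the calculation or reuse it without having to ask the author what it does.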

Partly this reflects a language barrier: academic researchers are often only proficient with software packages and programming languages that have little currency outside their discipline. Stata – ubiquitous in many social sciences, but almost unheard of outside of this context – is a notable example of this.

In turn this adds to the already substantial cost that researchers looking to skill up in industry-standard tools and practices face. Up-skilling often means starting from scratch with new programming languages, learning new systems, and shifting established habits to make code and data work together in a more structured way.

At OWID we’re making this investment to lower the costs of collaboration and coordination. We’re training our researchers in the more widely-used language Python – as a lingua franca to enable collaboration and learning across the teams. And this learning is allowing us to adopt the tools and practices needed to switch to a more collaborative approach to data and research.

Our experience at OWID is that combining researchers, data specialists and software developers generates a positive feedback loop for cooperation: working closely with data specialists and developers substantially lowers the costs researchers face in adopting the tools and practices that make collaboration between these teams easier.

Collaboration within research should be made easier

Shifting scientific research towards a more specialized, collaborative approach is not just about increasing productivity. There are also benefits for the quality of the research in terms of its reliability and value for others.

Increased reliability

In the one-man-band approach to organizing research that is common in many research disciplines, often only one pair of eyes will ever look at many of the computational steps involved in processing the data and producing the final analyses and outputs before they are published.

That’s inherently fragile. And tellingly, it’s very different to what professional programmers do in the software industry, where code review is the norm.

Moreover, it makes it harder for researchers to justify taking the trouble to produce and publish their data in a replicable, well-documented way that’s easy for others to understand and check. And because it’s often hard to scrutinize the data and code underlying research, people rarely do.

Overall, it means mistakes are more likely to occur and are less likely to be spotted, contributing to wider concerns about the replicability of published scientific research.

Increased value

Many of the poor practices that make data hard to scrutinize also make it harder for others to build on.

The value of good data work lies not just in the initial research it informs but in the chain of subsequent innovations and learning that it can enable. Data and code shared in an open, accessible and well-documented way can be used by others over and over to conduct their own analyses and build new tools. Publishing data is not only an end in itself, it is the input into other people’s work.

The one-man-band approach to research makes it harder to realize these positive externalities, by encouraging data and code that are poorly documented, poorly structured, written in idiosyncratic programming languages, or published without regard to standards and norms that increase reusability.

A bad equilibrium for research data

If, as I am arguing, a more specialized, collaborative and accessible approach to organizing research is such a good idea, why hasn’t it already happened?

Firstly, it is of course not true that there has been no progress in this direction. Many research teams do excellent data work and publish in a way that’s easy to replicate and build on. OWID’s efforts here are very much part of a growing movement. Nonetheless, given that the tools to enable a much more collaborative approach have been around for a long time, the pace of change has been surprisingly, disappointingly slow.

One possible explanation for this I often hear is that the institutions in which researchers conduct their work do not provide the right incentives. Good data work is often undervalued relative to other contributions to research. There are, for instance, very few peer-reviewed academic journals where researchers can publish data itself, as a valuable research output in its own right.³

The way that research outputs get valued and funded can make it difficult even for researchers who are committed to open, accessible data to prioritize this aspect of their work.

Whilst I very much agree with this complaint, the structure of incentives also in turn requires explanation: why is it that good data work is valued so little in society?

I see this as a chicken-and-egg situation. When data so often gets published in a way that undermines its value – hard to understand, hard to check and hard to build upon – it is perhaps not so surprising that it is given a low valuation in society. We are in a bad equilibrium where poor data practice within research is both the expectation and the reality.

Data is valued when it’s managed and presented well

As well as improving efficiency, the move towards a more specialized team has also greatly increased the quality of work. For instance, we are improving the structure and documentation of our data pipelines and moving more of our data work to GitHub to make it easier for others to check it and build on it (see for instance our datasets on CO2 emissions, energy, and COVID-19). New and improved tools like our Data Explorers have made it easier for people to find the data they want and put it in a wider context. And we are improving the design and structure of our website to make our work more discoverable, both within our website and through search engines.

Working in this way with three specialized teams is still relatively new for us. Many of the changes it has led to are still very much a work in progress. In many aspects of our work, we are still far from where we would like to be.

But the impact that we’ve been able to have as a small team so far points to this strategy paying off. Our work is used by international organizations like the WHO, the UN, the OECD, the WTO and the IMF; by governments and world leaders; by leading news outlets like The New York Times, The Guardian and The Economist; by researchers in public health, climate science, economics, and many other fields, including by Nobel laureates; by Google in its search results; and by the 100 million visitors to our website over the last year.

The impact we are having is recognized by the individuals and organizations that support our work. And this support in turn allows us to continue to improve on the many areas of our work that still fall short of our aspirations.

As well as being a big source of motivation for me personally, seeing the increasing quality and impact of the work of our team also makes me optimistic about the future of data in research more generally. It demonstrates that a bad data equilibrium – where both the expectation and the reality are that data is hard to access, understand and build on – is not at all inevitable. It proves just how much people value data when it’s managed and communicated well.

Building a team for good data and research

Underpinning the impact of our work at OWID is the team of generalist-specialists who produce it: colleagues with particular expertise matched to their particular responsibilities, but who also have the broad range of skills and knowledge required to work in the collaborative way we do.

We are currently looking to hire one more such colleague to join our data team: someone with substantial experience and excellent skills in managing data, but someone who also has an excellent knowledge of the research and data on topics like poverty, inequality and economic growth.

We are looking for this person as we embark on a substantial revision and expansion of our work in these areas, which I will be working on from the research side. Based on our recent experience, I’ve no doubt that finding the right data specialist with this range of skills and experience will have a transformational effect on the quality and impact of our work here.

If you think that sounds like you, please consider applying to our current vacancy. I’m looking forward to hearing from you and to working together to push the world towards a better equilibrium for data – a world where data is as valuable and as valued as it should be!

Data Analyst (Poverty and Economic Development)

Here are the details of our vacancy and how to apply.

Further reading

If you’re interested in the question of the division of labor within scientific research and possible future directions, I recommend reading The speed of science, an essay by Saloni Dattani – a colleague here at OWID – and Nathaniel Bechhofer.

Acknowledgements

Thanks to all the researchers, data scientists and developers at OWID who commented on and helped improve this article.

Endnotes

  1. In their essay, The speed of science, Saloni Dattani (a colleague here at OWID) and Nathaniel Bechhofer make some suggestions for what a division of labor within scientific research in general might look like and how this could help the speed and quality of scientific output.

  2. A helpful statement of this idea of a trade-off between the gains from specialization and the costs of coordination is the 1992 paper by economists Gary Becker and Kevin Murphy, ‘The division of labor, coordination costs, and knowledge’. Becker, Gary S., and Kevin M. Murphy. 1992. “The Division of Labor, Coordination Costs, and Knowledge.” The Quarterly Journal of Economics 107 (4): 1137–60. Available at NBER here.

  3. One notable exception here is Nature’s journal Scientific Data, where in 2020 we published our COVID-19 testing dataset. See Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020).

Cite this work

Our articles and data visualizations rely on work from many different people and organizations. When citing this article, please also cite the underlying data sources. This article can be cited as:

Joe Hasell (2022) - “How we’re building a team for better data at Our World in Data” Published online at OurWorldinData.org. Retrieved from: 'https://ourworldindata.org/building-a-team-for-better-data' [Online Resource]

BibTeX citation

@article{owid-building-a-team-for-better-data,
    author = {Joe Hasell},
    title = {How we’re building a team for better data at Our World in Data},
    journal = {Our World in Data},
    year = {2022},
    note = {https://ourworldindata.org/building-a-team-for-better-data}
}

Reuse this work freely

All visualizations, data, and code produced by Our World in Data are completely open access under the Creative Commons BY license. You have the permission to use, distribute, and reproduce these in any medium, provided the source and authors are credited.

The data produced by third parties and made available by Our World in Data is subject to the license terms from the original third-party authors. We will always indicate the original source of the data in our documentation, so you should always check the license of any such third-party data before use and redistribution.

All of our charts can be embedded in any site.