Accelerating public policy research with easier, safer synthetic data
In this blog, Dr Paul Calcraft and Dr Alex Sutherland from the Behavioural Insights Team explain how simple, low-fidelity synthetic data could be used to accelerate policy research without compromising on accuracy or data security.
To continuously improve how policies are developed and run, governments need to understand how policies actually affect people’s day-to-day lives. This isn’t just an activity for government – it may be better for external researchers and academics to measure the impact of policies, rather than the government marking its own homework. Regardless of who is trying to understand the impact of policies, doing so well requires data, and lots of it. But expanding the use of data for research also increases concern about individual privacy.
We’ve found a way to ease the research process without sacrificing individual privacy. If widely adopted, we believe it will lead to better, faster, more policy-relevant research being produced by academics, research institutes and government departments themselves.
The solution we are proposing is synthetic data, but perhaps not as you currently know it. (And if you’re like most people, you don’t currently know it at all. And if you’re really impatient you can skip right to the bit where we make it easy.)
Synthetic data is an emerging technology that some people claim is the perfect solution to performing research on private data. But despite being invented in the 1990s, it has not been widely adopted by governments. The Behavioural Insights Team, funded by ADR UK, investigated why, and whether the UK government may be missing opportunities to accelerate public policy research.
Synthetic data is like an artist’s impression of the information
Synthetic data is like taking some original data and smudging it to look more like an artist’s impression. It’s a new version of a dataset that is generated at random, but made to follow the structure and some of the patterns of the original dataset. Each piece of information in the dataset is meant to be plausible (e.g., an athlete’s height will usually be between 1.5 and 2.2 metres, and would never be 1 kilometre), but it is chosen randomly from the range of possible values, not by pointing to any original individual in the dataset. The full report goes into more detail on how it works.
Synthetic data can be created at different levels of fidelity to more closely or loosely match the original data. With high fidelity synthetic data, relationships in the data, such as correlations, are preserved. We can see overall what the dataset looks like, and the variables within it, but if we zoom in too much we’re not getting a high resolution picture of an individual.
Fig. 1: Examples of High-Fidelity and Low-Fidelity Synthetic Data.
Using high-fidelity synthetic data you can run statistics and often get the right answer, but no actual individual’s information is present. In the athlete example, you could establish that weight and height are strongly correlated (taller people tend to be heavier) and to what extent (the correlation is about 60%). But you wouldn’t be able to check Usain Bolt’s height or weight, or even know whether Usain Bolt was in the original dataset at all.
Critics are concerned that the privacy guarantees of high-fidelity synthetic data are not ironclad. They are also concerned that the randomness introduced to improve privacy could harm the accuracy of research (the U.S. Census Bureau's move to synthetic data is worrying researchers). We tend to agree, and perhaps more importantly, so do the people that hold the data and decide whether to release it.
However, there’s a different way to use synthetic data that is lower risk and more private, yet could accelerate policy research in all areas of government. Low-fidelity synthetic data doesn’t preserve any of the relationships between any of the columns of data. So in the athlete example, weight and height would not be related, but any row of data would still have a sensible weight or height. And they would be stored in the same format as the weight and height information in the original data. Low-fidelity data, because it is based on such a simplified view of the dataset, has a much lower risk of exposing or disclosing any personal information.
Having a dataset like this, without any real or accurate individual information, still allows a researcher to understand what a dataset looks like. This helps them make a more informed decision about what they might be able to learn from the real data if they were to get access to it. Or it may indicate that it isn’t worth pursuing an application. There are endless examples of researchers following time-consuming legal and ethical processes to obtain data they then cannot use. Once they receive it, they might find it doesn’t contain the information they thought it would, or it’s too hard to use for their project. This is inefficient and frustrating, contributing to wasted resources. Providing low-fidelity synthetic data early in the process, without requiring full ethical and legal processes (because there is no personal data involved), would side-step much of this.
Another benefit can be when there is a delay in the researcher receiving the data. This is very common when individual privacy is involved because there are, appropriately, many ethical and legal processes to navigate. But with access to low-fidelity synthetic data, the researcher can begin figuring out how precisely they will analyse it, so when it does arrive they are ready to go. In many cases they can write their entire analysis plan (and even their analysis code), figure out all the quirks, and get others to check it over, using only this low-fidelity synthetic data. When the real data arrives, it is then much quicker for them to run the analysis and deliver the results, as illustrated below.
Fig. 2: Improving the timelines of data analysis.
Low-fidelity synthetic data can also be used in researcher training. Many datasets are unique in the way they are laid out or the way they work. ADR UK is expanding its training programmes to help researchers understand how to work with important datasets that work or are linked together in surprising or unusual ways. Synthetic data can be used as the bedrock for training. Researchers can work with a synthetic version to really understand how each type of dataset works and be trained appropriately, without requiring each trainee to write an application to use each dataset.
Unfortunately, synthetic data is currently poorly understood, especially the distinction between high-fidelity and low-fidelity, and the benefits of low-fidelity synthetic data. It is not yet a widely known technology in government, even among government analysts and researchers. We and others are trying to change that, because we think this is a substantial missed opportunity.
We’re making it easy
As well as getting the word out, we have also released a Python notebook that guides a user through generating low-fidelity synthetic data. This is designed to be self-explanatory for anyone who works with data - even if they have limited experience with Python. The main reason we wrote this new notebook is that existing tools, methods and tutorials for generating synthetic data focus on high-fidelity synthetic data. By writing a guided script that only generates low-fidelity, lower risk synthetic data, we keep things simpler and safer for researchers and data owners.
BIT and ADR UK are also talking to the Office for National Statistics (ONS) about how to integrate this approach with the ONS Secure Research Service and the forthcoming ONS Integrated Data Service. Creating a central platform where synthetic datasets are available to browse by researchers could open up new possibilities for policy-driven research and cross-government collaboration. We are missing out on scores of valuable research projects simply because researchers don’t know that the data they’d need is actually available somewhere across government.
For government policy and services to work well for society and citizens, they must be built on good data. The widespread use of low-fidelity synthetic data can support that aim by making research safer, easier and quicker. Try out our code, make your data more widely available (or ask your partners if they could), and if you want to learn more, read the full report.