Report investigates how synthetic data can be used in government

Category: ADR England

14 December 2021

The report, Accelerating public policy research with synthetic data, was funded by ADR UK to help increase the uptake of administrative data research whilst maintaining data security. The principal recommendation of the report is that ADR UK should encourage the use and sharing of low-fidelity synthetic data across government and with researchers. This will raise awareness about the different sources of administrative data now accessible and help researchers to develop their applications to access it.

Synthetic data is an artificial copy of a dataset, generated at random but made to follow some of the patterns of the original dataset. It can be used for exploratory analysis without the need to access the original data through a trusted research environment. It also has huge potential for training researchers on how to use specific datasets. Access to data like this could considerably improve the efficiency and security of data analysis.

Alex Sutherland, Chief Scientist and Director of Research & Evaluation, Behavioural Insights Team, said: "Synthetic data can provide low risk, earlier access to key information researchers need to train up and get started on analysis projects. We've shown that low-fidelity synthetic data can be generated quickly and easily, so we can shorten research timelines and deliver policy-relevant results faster."

The report explores the complex terminology of synthetic data, particularly the differences between low-fidelity and high-fidelity synthetic data:

  • Low-fidelity data does not preserve any relationships between different pieces of information. It is easier to generate and has a low risk of disclosing information about individuals. It is useful for training researchers in how to analyse the data and for helping them write analysis code that can be run later on the real data.
  • High-fidelity data preserves the relationships between information, but no points correspond to real individuals. It is closer to the original dataset and therefore can be more useful but is more difficult to generate and has a higher risk of disclosing information about individuals. Like low-fidelity data, it is useful for training and for writing analysis code that can be run later on real data. However, in addition, it could be used to identify potential relationships in the data.
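The distinction can be illustrated with a minimal Python sketch. It is not the report's method or the Team's notebook: the toy "original" records, the column names (age, income, region) and the helper functions are all invented for illustration. The sketch assumes a low-fidelity copy can be approximated by drawing each column independently from simple per-column summaries, so each column looks plausible on its own while the relationships between columns are lost.

```python
import math
import random
import statistics
from collections import Counter


def make_toy_records(n, seed):
    """Invented 'original' data: income rises with age, region is categorical."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        age = rng.randint(20, 65)
        income = 1000 * age + rng.gauss(0, 3000)
        region = rng.choice(["North", "South", "East", "West"])
        records.append({"age": age, "income": income, "region": region})
    return records


def low_fidelity_copy(records, seed):
    """Draw every column independently, using only per-column summaries."""
    rng = random.Random(seed)
    n = len(records)
    synthetic = [{} for _ in range(n)]
    for column in records[0]:
        values = [r[column] for r in records]
        if isinstance(values[0], (int, float)):
            # Numeric column: keep only its mean and spread (draws are floats).
            mean, spread = statistics.mean(values), statistics.stdev(values)
            draws = [rng.gauss(mean, spread) for _ in range(n)]
        else:
            # Categorical column: keep only the category frequencies.
            counts = Counter(values)
            draws = rng.choices(list(counts), weights=list(counts.values()), k=n)
        for row, value in zip(synthetic, draws):
            row[column] = value
    return synthetic


def pearson(records, x, y):
    """Pearson correlation between two numeric columns."""
    xs = [r[x] for r in records]
    ys = [r[y] for r in records]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)


original = make_toy_records(2000, seed=1)
synthetic = low_fidelity_copy(original, seed=2)

# The same analysis code runs on both datasets because the columns match.
print("age/income correlation, original: ", round(pearson(original, "age", "income"), 2))
print("age/income correlation, synthetic:", round(pearson(synthetic, "age", "income"), 2))
```

In this toy setup the strong link between age and income in the original records is absent from the synthetic copy, which is why such data suits training and code testing rather than drawing conclusions. A high-fidelity generator would additionally have to model the joint behaviour of the columns, which is where the extra difficulty and disclosure risk come from.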

There are challenges with using and generating synthetic data, including balancing the privacy of individuals with the usefulness of the dataset. There are also concerns about the quality of synthetic data compared to real data, which varies depending on the techniques used to generate it.

Many of the recommendations outlined in the report focus on addressing these challenges. The Behavioural Insights Team (BIT) have grouped the recommendations under three themes: technological considerations; risk aversion and lack of knowledge; and use of advanced privacy-preserving technologies. They have also recommended that public attitudes towards synthetic data are explored, and that clear language and labelling are developed to help communication and understanding.

Dr Emma Gordon, Director of ADR UK, said: "We are pleased this report from the Behavioural Insights Team recommends the use of synthetic data to train researchers and test code. This will help more researchers to safely optimise their use of ever-more complex administrative datasets by improving their understanding of the data and testing how it can be used. As such, it will support ADR UK to maximise the research potential of these valuable data sources."

The team also developed a prototype Python notebook that generates low-fidelity synthetic data only, making it easy for researchers to do this themselves.

Read the full report, Accelerating public policy research with synthetic data.
