Improving data sharing and access with synthetic data

Category: Research using linked data, ADR England

23 September 2021

What if society-level patterns in behaviour and outcomes could be easily analysed by researchers to inform policy and services, without risking the privacy of any individual citizen? An idea from a Harvard professor in 1993 [1] may provide exactly that: synthetic data.

What is synthetic data?

Synthetic data is a new version of a dataset that is generated at random but made to follow the structure and some of the patterns of the original dataset. Each piece of information in the dataset is meant to be plausible (for example, an athlete’s height will usually be between 1.5 and 2.2 metres, and would never be one kilometre), but it is chosen randomly from the range of possible values, not by pointing to any original individual in the dataset.
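To make the idea concrete, here is a minimal sketch in Python of drawing plausible values at random. The column names, ranges and helper functions are all hypothetical and written by hand for illustration; they are not taken from any real dataset or from the project's own code.

```python
import random

# Hypothetical column specification: plausible ranges and options,
# written by hand rather than copied from any real record.
SPEC = {
    "height_m": ("float", 1.5, 2.2),
    "age": ("int", 18, 40),
    "sport": ("choice", ["running", "rowing", "cycling"]),
}


def synthetic_row(spec, rng):
    """Draw one row, picking each value independently from its plausible range."""
    row = {}
    for name, (kind, *args) in spec.items():
        if kind == "float":
            low, high = args
            row[name] = round(rng.uniform(low, high), 2)
        elif kind == "int":
            low, high = args
            row[name] = rng.randint(low, high)
        elif kind == "choice":
            (options,) = args
            row[name] = rng.choice(options)
        else:
            raise ValueError(f"unknown column kind: {kind}")
    return row


def synthetic_dataset(spec, n_rows, seed=0):
    """Generate n_rows synthetic rows; the seed makes the output repeatable."""
    rng = random.Random(seed)
    return [synthetic_row(spec, rng) for _ in range(n_rows)]


rows = synthetic_dataset(SPEC, 5)
for row in rows:
    print(row)
```

Every value is plausible on its own, but no row corresponds to a real person, and any relationship between columns (for example, between height and sport) is deliberately not reproduced.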

Data that is generated in this way reveals very little, if anything, about any individual in the original dataset, but still represents the data well as a whole. It can therefore be used for:

  • Doing some types of exploratory analysis on data without requiring a fully secure environment and information governance procedures.
  • Training researchers on how to handle particular datasets in unique or challenging formats (common in administrative data) without requiring full access to secure environments for all trainees.
  • Improving the efficiency and safety of analysing personal or confidential information by allowing researchers to write and test their analysis code on low-risk synthetic data before getting access to the real data. It may also be desirable for researchers never to access the real data, but to send their code to the data holder for the data holder to run securely, sending back only the aggregate results.

If synthetic data is to fully realise its potential, we should first understand how it is perceived by practitioners and other stakeholders in government.

Research outline

Building upon phase one of its engagement (Applying Behavioural Insights to Cross-government Data Sharing), the Behavioural Insights Team (BIT) will explore whether and how synthetic data could aid cross-government data sharing for research.

They will engage departmental researchers and key stakeholders in government administrative data to understand existing, future and potential applications of synthetic data.

Processes for accessing administrative data can be onerous. Many hurdles are necessary to ensure correct and careful use of this data, because much of it may be individually sensitive. However, these hurdles can also sometimes be a hindrance to research efforts. Project delays and cancellations owing to data sharing issues are common, impeding the use of high-quality research to inform policy and public services. Since synthetic data does not contain any records relating to existing individuals, it could be made more easily accessible to researchers than the real data, and accelerate research processes.

What is the potential of this work?

The extent to which the potential of synthetic data can be realised depends on the capacity and willingness of government departments to implement synthetic data approaches.

This research will address the following key questions:

  • To what extent is synthetic data already used by government departments, and where it is used, how?
  • What kind of applications do stakeholders envisage for cross-departmental use of synthetic data?
  • How do these applications differ from other uses of synthetic administrative data (for example, with respect to synthetic data intended for academic researchers or the general public)?
  • What are the potential drawbacks and limitations of synthetic data?
  • Could synthetic data be a valuable resource for training, especially in the analysis of administrative data?
  • Is there an appetite for the development of an automated pipeline for the production of synthetic data?
  • Is there a role for advanced forms of privacy protection (such as differential privacy) in these applications?
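For readers unfamiliar with differential privacy, the sketch below illustrates one standard technique, the Laplace mechanism, applied to a simple count. It is an illustration only and not a method the project has committed to; the function names, the toy `ages` list and the choice of `epsilon` are assumptions made for this example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a count with Laplace noise calibrated to the privacy budget epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1 / epsilon, rng)


# Toy data, made up for illustration.
ages = [23, 35, 41, 52, 67, 19, 44]
noisy = private_count(ages, lambda age: age >= 40, epsilon=1.0, rng=random.Random(0))
print(noisy)
```

A smaller `epsilon` adds more noise and gives stronger protection; a larger one gives more accurate answers. A plain floating-point implementation like this one is suitable for explanation only, not for real data releases.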

As part of this project, the BIT team will also produce a simple Python script that will generate low-fidelity synthetic data as an example of how one approach can be implemented. It may also serve as a single recommended mechanism for generating low-fidelity, low-risk data, without the complications of larger code libraries that are also capable of producing higher-risk, high-fidelity synthetic data. This could form the basis of an automated approach to the generation of synthetic data from deposited real datasets.
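As a rough illustration of what an automated, low-fidelity approach could look like, the sketch below reduces a table to coarse per-column summaries and then samples each column independently. This is not the BIT script: the `summarise` and `generate` functions, the toy records and the choice of summaries are assumptions made for this example.

```python
import random


def summarise(records):
    """Reduce each column to a coarse summary.

    Numeric columns keep only (min, max); other columns keep their set of
    categories. Relationships between columns are discarded.
    """
    summary = {}
    for column in records[0]:
        values = [r[column] for r in records]
        if all(isinstance(v, (int, float)) for v in values):
            summary[column] = ("numeric", min(values), max(values))
        else:
            summary[column] = ("categorical", sorted(set(values)))
    return summary


def generate(summary, n_rows, seed=0):
    """Sample each column independently from its summary."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for column, (kind, *params) in summary.items():
            if kind == "numeric":
                low, high = params
                row[column] = rng.uniform(low, high)
            else:
                row[column] = rng.choice(params[0])
        rows.append(row)
    return rows


# Toy stand-in for a deposited dataset (values are made up).
real = [
    {"age": 34, "region": "North"},
    {"age": 51, "region": "South"},
    {"age": 27, "region": "North"},
]
synthetic = generate(summarise(real), n_rows=4)
for row in synthetic:
    print(row)
```

Even coarse summaries can reveal something: an exact minimum or maximum is a real value, and a rare category may point to a small group. A real pipeline would likely need to round, clip or suppress such values before generating data.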

In the long term, this work will establish whether there is support for developing synthetic data-based approaches to data sharing, and how ADR UK and others can best support those approaches to maximise the technology’s potential.

Project details

  • Project lead: Paul Calcraft, Behavioural Insights Team
  • Funded value: This work is one phase of a wider programme of work being undertaken by the Behavioural Insights Team to address existing barriers to data linkage, with a total funded value of £318,050
  • Duration: March 2020 – December 2021

This project is funded via the ADR UK Strategic Hub Fund.

References

1: Rubin, D. B. (1993). "Discussion: Statistical Disclosure Limitation". Journal of Official Statistics, 9: 461–468.
