Why we’re exploring the potential of synthetic data to support public-good research
In this blog, ADR UK Research Manager Balint Stewart sets out why we are funding a grant holder to explore the use and potential of synthetic data. Synthetic data mimics the structure and characteristics of a real dataset, without containing any information about real people. This makes it a powerful tool for widening the use of administrative data for public-good research, while protecting the security of public data.
Find out more by registering for our applicant webinar on Friday 3 March, 13:00 – 14:30.
Applying for data access can be time-consuming and uncertain
Researchers increasingly rely on secure access to sensitive data to carry out their analyses. Since the mid-2000s, infrastructure known as trusted research environments has been developed to enable researchers to access and use such data for their investigations.
Many trusted research environments operate under the principles of the Five Safes framework. This stipulates that the following considerations should be made when designing a data access solution:
- Safe People (accredited researchers)
- Safe Projects (approved projects)
- Safe Data (accessed through a trusted research environment)
- Safe Place (accessed through a secure setting)
- Safe Outputs (checked for risk of any re-identification)
Typically, data services that operate trusted research environments establish their own processes for accrediting researchers and projects, but these processes can be lengthy. They also rely on imprecise information provided by the researcher about the data they are applying to access, because researchers cannot see the data they want to use until they have been through the application process and had their project approved.
As a consequence, research projects can be held up while a researcher waits for access to data. When the researcher does get access, they may find the data does not meet their expectations.
Synthetic data can help researchers understand the dataset more quickly, and enable hands-on training
Low-fidelity synthetic data presents a potential solution to this problem. This is a version of the data that resembles the real data, but does not include any information about real individuals. For example, low-fidelity synthetic data will show the structure of the real dataset, the types of variables it contains, and the type of analysis it will enable.
Accessing this type of data gives researchers the opportunity to understand the real data and plan their research accordingly, before going through the lengthy process of applying to use it. This has the potential to enable a researcher to:
- submit a higher quality application that is more likely to be approved
- generate code to analyse the data, based on their understanding of the structure of the synthetic data, while waiting for approval to access the real data. This can significantly reduce the time between initially applying to access data and completing analyses, since researchers are able to carry out these activities in parallel.
In addition, synthetic versions of secure data give trainers an accessible version of the data for students to practise on. This enables hands-on training for people who want to learn how to analyse secure administrative and social survey datasets.
Low-fidelity synthetic datasets are typically created by randomly generating values within each variable. These values roughly follow the distribution of the real data within the variable, but the relationships between the variables are not preserved (this is called univariate synthetic data). Consequently, low-fidelity synthetic data is a powerful tool in supporting data security. It is much less likely to inadvertently reproduce information about a real individual than ‘high-fidelity’ synthetic data, which mimics the real data much more closely. However, it can still be very useful to researchers in understanding the structure of the data and using it to generate code.
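To make the idea of univariate synthesis concrete, here is a minimal sketch in Python. It is an illustration only, not the tool mentioned in this blog or any method used by a data service. The function name `synthesise_univariate`, the example records and the choice to fit a normal distribution to numeric columns are all assumptions made for this example. Each column is sampled on its own, so the distribution within each variable is roughly kept while the relationships between variables are not.

```python
import random
import statistics
from collections import Counter


def synthesise_univariate(rows, n_rows, seed=0):
    """Generate n_rows synthetic records, one column at a time.

    Hypothetical helper for illustration. Each column is sampled
    independently, so relationships between columns are not preserved.
    """
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    synthetic_columns = {}

    for col in columns:
        values = [row[col] for row in rows]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        if is_numeric:
            # Numeric column: draw from a normal distribution with the
            # column's mean and standard deviation (a simplifying assumption).
            mean = statistics.mean(values)
            spread = statistics.pstdev(values)
            synthetic_columns[col] = [
                round(rng.gauss(mean, spread), 1) for _ in range(n_rows)
            ]
        else:
            # Categorical column: draw categories in proportion to how
            # often each one appears in the original column.
            counts = Counter(values)
            synthetic_columns[col] = rng.choices(
                list(counts.keys()), weights=list(counts.values()), k=n_rows
            )

    # Reassemble records from the independently generated columns.
    return [
        {col: synthetic_columns[col][i] for col in columns}
        for i in range(n_rows)
    ]


# Made-up records, invented purely for this example.
example_rows = [
    {"age": 23, "region": "North", "income": 18000},
    {"age": 35, "region": "South", "income": 32000},
    {"age": 47, "region": "South", "income": 41000},
    {"age": 58, "region": "East", "income": 39000},
]

synthetic = synthesise_univariate(example_rows, n_rows=5, seed=42)
for record in synthetic:
    print(record)
```

The synthetic records have the same variable names and types as the originals, which is enough to draft and test analysis code. Because each column is drawn separately, any link between, say, age and income in the original records is lost, which is the trade-off described above.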
How new funding will make an impact
Several methods for producing synthetic data already exist, including an easy-to-use tool for generating low-fidelity synthetic data. Some progress has also been made in making synthetic versions of secure data available. Examples include synthetic versions of justice system datasets from the ADR UK-funded Data First programme at the Ministry of Justice, and the linked Grading and Admissions Data for England (GRADE).
Yet we are far from seeing synthetic data operationalised to the point where trusted research environments can produce it routinely and at scale. There is also a lack of evidence to support decisions among data owners and data services about how the governance around this might be best implemented. Data owners and services need real-world use case studies on costs and benefits, to inform more systematic approaches to creation and sharing of synthetic data.
That’s where our new funding opportunity comes in. In partnership with the Economic and Social Research Council and UK Research and Innovation, ADR UK is funding an individual or team to explore how the potential of synthetic data can be harnessed at scale.
Recipients of this grant will evaluate the current uptake, utility and governance of synthetic versions of datasets held in trusted research environments including the Office for National Statistics (ONS) Secure Research Service and UK Data Service. The grant holder will produce a report based on this evaluation, setting out recommendations for how synthetic data production and provision can be achieved at scale.
This has the potential to inform an approach in which synthetic data is widely used for research in the public interest, supporting our ambitions in training and capacity building.
You can register for an applicant webinar on Friday 3 March, 13:00 – 14:30 to hear more, and to ask any questions. Applications are open now and will close on Tuesday 9 May.