Evaluating the benefits, costs and utility of synthetic data
Categories: Research using linked data, Public engagement, Potential, ADR UK Partnership
12 July 2024
Two complementary projects will explore the use of synthetic data. They will focus on low-fidelity synthetic data, which is artificial data that has been created to reflect the format of the original data (its layout and the types of information it contains) but without preserving any relationships between variables. The projects will explore the potential benefits, costs and utility of synthetic data for administrative data research.
Synthetic data – also known by other names such as artificial, dummy, simulated, mock or fake data – is an emerging area of development for supporting research using securely held administrative data. ADR UK has identified that low-fidelity synthetic datasets could be used:
- for training purposes
- to explore whether a dataset could be helpful for a specific research project
- to support researchers with developing their code, understanding the structure of the data, and testing different statistical methods, before they can get access to the real data.
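To make the idea of low-fidelity synthetic data concrete, the following is a minimal sketch in Python, not a method used by either project. It assumes a toy table held as a list of dictionaries; the function names `make_column_generator` and `low_fidelity_copy`, the example columns and all values are hypothetical. Each column is generated independently, so the output keeps the layout and value types of the input but none of the relationships between variables. Reusing the observed category labels and numeric ranges is a simplifying assumption that a data owner would need to review for disclosure risk.

```python
import random


def make_column_generator(values, rng):
    """Build a generator for one column based only on its type and range."""
    if all(isinstance(v, int) for v in values):
        lo, hi = min(values), max(values)
        return lambda: rng.randint(lo, hi)
    if all(isinstance(v, float) for v in values):
        lo, hi = min(values), max(values)
        return lambda: rng.uniform(lo, hi)
    # Anything else is treated as a categorical text column.
    # Assumption: the category labels themselves are not disclosive.
    categories = sorted(set(str(v) for v in values))
    return lambda: rng.choice(categories)


def low_fidelity_copy(rows, n, seed=0):
    """Return n artificial rows with the same columns and types as `rows`.

    Columns are sampled independently of one another, so relationships
    between variables in the original rows are deliberately not preserved.
    """
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    generators = {
        col: make_column_generator([row[col] for row in rows], rng)
        for col in columns
    }
    return [{col: generators[col]() for col in columns} for _ in range(n)]


# Hypothetical "real" rows, invented purely for illustration.
real_rows = [
    {"age": 34, "region": "Wales", "income": 21000.0},
    {"age": 51, "region": "Scotland", "income": 38500.0},
    {"age": 27, "region": "England", "income": 19250.0},
]

synthetic_rows = low_fidelity_copy(real_rows, n=5)
for row in synthetic_rows:
    print(row)
```

A researcher could write and test code against rows like these, for example checking that column names and types are handled correctly, before gaining access to the real data. Any statistical result computed from them would be meaningless by design.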
These projects will collect evidence and insights on synthetic data from the perspective of two stakeholder groups:
- Data owners and data providers, including trusted research environments: One project will explore the benefits and costs of synthetic data for this group, and mechanisms for providing synthetic data
- The public: One project will explore the public’s understanding of and attitudes towards synthetic data.
These projects will inform recommendations to scale the production and sharing of synthetic data, taking into account the views of these stakeholders. They will include developing shared terminology and agreed governance structures for synthetic data.
This is a jointly funded initiative from the Economic and Social Research Council (ESRC)’s Data & Infrastructure Programme and ADR UK.
Read more about the projects below.
Balancing the data scales: A cost-benefit analysis of low-fidelity synthetic data for data owners and providers
There is growing discourse around synthetic data. It has the potential not only to address data challenges in a fast-changing landscape, but also to foster innovation and rapidly advance data research. From optimising data sharing and utility, to sustaining and promoting reproducibility, to mitigating the risk of disclosure, synthetic data has emerged as a solution to various complexities in the data ecosystem.
The project
Using a mixed-methods approach, this project seeks to explore how data owners and trusted research environments view the operational, economic, and efficiency aspects of using low-fidelity synthetic data. It aims to produce a comprehensive report on its findings.
The project has various areas of work:
- Conduct stakeholder engagement and expert panel consultations. A thorough literature review will be conducted to map current knowledge and identify gaps related to synthetic data. This will focus on technical requirements, ethical considerations, and sharing models. The review will establish a foundational knowledge base for subsequent research phases.
- Create a survey targeting data creators. This will explore their readiness for, and perceptions of, the production and dissemination of synthetic data. It will address any technical, operational, and financial challenges, and investigate perceived and achieved benefits. It’s anticipated that the survey will provide valuable insights into scaling the production of synthetic data.
- Produce case studies with organisations, such as the Office for National Statistics. These will examine the practical aspects of synthetic data sharing mechanisms, usage, and cost structures, offering a real-world perspective on existing frameworks.
- Facilitate focus group discussions with representatives from trusted research environments. These will delve into the operational challenges and opportunities of synthetic data integration in secure environments, enhancing understanding of its practical implications.
Potential of the research
The project has three primary focuses:
- To assess various models for synthetic data sharing, evaluating their implications and efficiencies for data owners and trusted research environments. This will cover aspects such as production, curation procedures, metadata sharing, and data discoverability.
- To measure how synthetic data can improve efficiency for data owners and trusted research environments. This will analyse its impacts on resources, secure environment usage load, and researchers’ uptake of synthetic versus real datasets.
- To evaluate the costs incurred by data owners and trusted research environments in creating and maintaining low-fidelity synthetic data.
By achieving a clearer understanding of the efficiencies and costs associated with low-fidelity synthetic data, the project will provide critical insights that can inform policy decisions and operational strategies. Ultimately, this work aims to foster a more robust and accessible data-sharing ecosystem that can significantly benefit the research community.
Deliberative workshops with members of the public: Establishing trust in the use of synthetic data
Different types of synthetic data can pose different levels of risk to confidentiality, depending on how closely they match the original dataset. Low-fidelity synthetic data preserves fewer relationships between variables, so has lower disclosure risk, but may still be useful to researchers for training purposes, familiarisation with datasets, and developing and testing code. Data owners, including the NHS and other UK Government departments, have already begun making some low-fidelity datasets available for researchers, with a variety of access arrangements.
Despite enthusiasm from researchers to expand the provision of synthetic data, there has been no widespread consultation with the UK public.
The project
This project will undertake a public consultation with people from across the four UK nations, working with a community engagement agency (Egality Health) to help recruit the cohort.
There will be four workshops in the summer and autumn of 2024 to explore public attitudes towards the use of synthetic data for research. Workshop content and dissemination strategies will be guided by an expert steering group and public collaborators. The workshops will focus on:
- perceived benefits and risks of synthetic data
- the acceptability of different access arrangements
- language used to describe synthetic data
- techniques for communicating about synthetic data with the public.
Potential of the research
This project will provide vital information about public perceptions of the use of synthetic data for research, which has the potential to shape future decisions in this area.
The project’s outputs will include a set of recommendations for researchers and data owners based on the key themes identified from the first three workshops. In the final workshop, the project team will agree final recommendations with the public consultation group before producing a full written report.
The recommendations from the project will be relevant to data owners and providers who release, or are planning to release, synthetic data, as well as to the researchers who access it. The project team will also produce an accessible output for the public on the topic of synthetic data, based on the recommendations.
This research will address the following key questions:
What do members of the UK public…
- currently understand about ‘synthetic data’ and its application for research?
- feel are the potential benefits, risks and ethical issues relating to the use of synthetic data for research?
- think about different access methods for synthetic data?
- feel about the language currently used to describe synthetic data?
- believe are the best ways for data owners and researchers to communicate with the public about the use of synthetic data in research?
- think about future directions for the use of synthetic data in research?
ADR UK was unable to award funding for a third grant to explore the experiences of researchers using synthetic data. However, the researcher perspective is being considered through other means including interviews and a live dialogue at an international workshop on this topic. It is expected that this work will feed into the final joint report of the two funded projects.
Project details
Balancing the data scales: A cost-benefit analysis of low-fidelity synthetic data for data owners and providers
Project lead: Cristina Magder
Project co-leads: Dr. Hina Zahid, Maureen Haaker, Dr. J. Kasmire
Funded value: £112,986 (Full economic cost)
Duration: April 2024 – March 2025
Deliberative workshops with members of the public: Establishing trust in the use of synthetic data
Project leads: Dr. Fiona Lugg-Widger and Dr. Rob Trubey
Funded value: £120,893
Duration: March 2024 – February 2025