A new user guide for Python code to develop synthetic data
This blog by Emily Oliver, Head of Research & Capacity Building at ADR UK, explains the importance of synthetic data for research. It explores the potential of a prototype Python notebook and user guide in creating secure, low-fidelity versions of this data.
As ADR UK continues its mission to drive up the use of administrative data for research for public good, it’s remarkable how often the concept of synthetic data crops up.
Not because researchers don’t want to access the real data, but because they want a version of the data to play with before they embark on the journey to access the real data. ‘Play with’ could mean developing code, testing analysis methods, training others - or even using synthetic data to check whether the real data is going to give them what they need to answer their research question. All these uses show that having synthetic data available matters. As a champion of administrative data for research for public good, ADR UK needs to support its creation.
But early on, we also learned that data owners are not always comfortable with the concept of synthetic data. They have concerns about its disclosure risk, or about it being mistaken for real data, and are not always convinced by its potential utility.
To understand this better, ADR UK commissioned the Behavioural Insights Team to explore how synthetic data is perceived by practitioners and stakeholders in government. Based on their findings, they recommended that ADR UK focus on low-fidelity data creation. They followed this up by releasing a prototype Python notebook, a practical tool for implementing those recommendations.
What’s reassuring about this tool is that it only allows the user to generate low-fidelity data, which is the safer and lower-risk form of synthetic data. It creates a version of the data that follows the structure and some of the patterns found in the real data. As such, it is plausible and represents the data as a whole. At the same time, because it doesn’t preserve statistical relationships between columns, it reveals very little - if anything - about any individual in the dataset.
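To make the idea concrete, here is a minimal sketch of what 'low-fidelity' can mean in practice. It is not the code from the notebook: the function name `make_low_fidelity`, the example columns, and the choice to draw numeric columns from a normal distribution are all assumptions made for illustration. The point it shows is that each column is generated on its own, so the output keeps the column names and rough per-column patterns without carrying over relationships between columns.

```python
import numpy as np
import pandas as pd


def make_low_fidelity(real: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: build a synthetic table one column at a time."""
    rng = np.random.default_rng(seed)
    synthetic = {}
    for name in real.columns:
        column = real[name].dropna()
        if pd.api.types.is_numeric_dtype(column):
            # Numeric column: draw from a normal distribution that matches
            # the column's mean and spread (an illustrative assumption).
            synthetic[name] = rng.normal(column.mean(), column.std(ddof=0), size=n_rows)
        else:
            # Other columns: draw categories using their observed frequencies.
            freqs = column.value_counts(normalize=True)
            synthetic[name] = rng.choice(
                freqs.index.to_numpy(), size=n_rows, p=freqs.to_numpy()
            )
    # Every column was drawn independently, so no row in the output
    # corresponds to a row in the input.
    return pd.DataFrame(synthetic)


# Made-up example data, for illustration only.
real = pd.DataFrame({
    "age": [34, 51, 29, 62, 45, 38],
    "region": ["North", "South", "South", "East", "North", "South"],
})
synthetic = make_low_fidelity(real, n_rows=1000)
print(synthetic.head())
```

Because the columns are drawn independently, any link between, say, age and region in the original table is deliberately lost. That is the trade-off low-fidelity data makes: less analytical value, but very little scope for revealing anything about an individual.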
The tool has now been extensively tested and is available for use. You need Python (preferably Python 3), two common Python libraries (NumPy and pandas), and a software tool for viewing, editing, and running Python notebooks, such as VS Code or Jupyter.
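If you are unsure whether your machine meets these requirements, a quick check along the following lines may help. This is a sketch rather than the procedure from the user guide, and the function name `check_environment` is hypothetical. It only confirms that you are running Python 3 and that the two libraries can be found; it does not check for a notebook editor.

```python
import importlib.util
import sys


def check_environment(required=("numpy", "pandas")):
    """Hypothetical helper: report whether the basic requirements are met."""
    report = {"python3": sys.version_info.major >= 3}
    for name in required:
        # find_spec returns None when a package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


status = check_environment()
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'missing'}")
```

Anything reported as missing would need to be installed before running the notebook.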
Helpfully, the user guide provides clear, step-by-step instructions, including how to ensure your system can run it. It guides the user through methods to run the cells in the notebook, explains how output files can be saved, and even tells you how to check that the notebook has worked. There’s a useful section on troubleshooting as well as further information for more advanced users.
The tool’s creators, Dr Paul Calcraft, Martina Maglicic, and Dr Iorwerth Thomas, set out specifically to make something that would appeal to both data owners and researchers. They also wanted it to be self-explanatory for anyone who works with data - even if they have limited experience with Python. The result is clear and straightforward to use, maximising safety and minimising complexity.
Of course, low-fidelity synthetic data isn’t a silver bullet. There will be instances where higher-fidelity synthetic data is both more appropriate and more useful. In a recent workshop where different approaches to creating synthetic data were discussed, participants agreed that the value of any given tool depends entirely on how useful the resulting synthetic dataset is for its intended purpose.
ADR UK is currently seeking applications for funding to test the utility of low-fidelity synthetic data. New datasets created using the Python notebook are eligible. To help you decide if this is for you, we’re planning a workshop to introduce the notebook and its uses. If you would be interested in attending, please indicate your interest.