ADR UK Research Fellows: Methodological developments within administrative data research
Categories: Research using linked data, ADR UK Research Fellows, ADR England, Office for National Statistics
9 August 2021
ADR UK is funding three research teams as part of the Economic & Social Research Council (ESRC) Research Methods Development Grants.
The research teams are being funded to address methodological challenges related to ADR UK’s mission to join up the abundance of administrative data already being created by government and public bodies across the UK and make it available for research in the public interest. This is being done via the development of new research methods; the application of existing methods in a novel way; and the development of our understanding of methodological challenges.
Read about the projects being funded via the scheme below.
Creating longitudinal datasets for linked administrative data research using synthetic data
Dr Katie Harron (UCL), Professor Pam Sonnenberg (UCL), Professor Bianca Lucia De Stavola (UCL), Professor James Carpenter (London School of Hygiene & Tropical Medicine) and Professor Andrew Copas (UCL) aim to provide guidance on the appropriate use of synthetic administrative data. Synthetic data – artificial data that looks like the original data source without containing information on any ‘real’ individuals – preserves the statistical properties of the original data sources.
If widely shared, synthetic data could allow researchers to understand the structure of data sources, develop analysis plans and algorithms, and test different statistical models. This could be done in parallel to applying for access to linked administrative datasets, streamlining the research process. Final refinements and analyses could then be conducted on the real data.
This project will assess the feasibility, performance and comparability of a range of methods for generating synthetic data. The findings will demonstrate ways of making existing data more widely useable and improving access to new data linkages through advanced methods. Relevant guidelines for the generation of synthetic data in different settings (for example, for training or exploring data prior to applying for the real data) will also be developed.
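As a rough illustration of what "preserving statistical properties" can mean, the sketch below generates synthetic records by sampling one variable from its observed distribution and a second variable conditional on the first. This is only one simple approach (sometimes called sequential synthesis) and is not necessarily among the methods the project will assess. The variable names, categories and counts are invented for the example, and the sketch makes no attempt to assess disclosure risk.

```python
import random
from collections import Counter, defaultdict


def fit_sequential_model(records):
    """Estimate P(first) and P(second | first) from (first, second) pairs."""
    marginal = Counter(first for first, _ in records)
    conditional = defaultdict(Counter)
    for first, second in records:
        conditional[first][second] += 1
    return marginal, conditional


def sample_synthetic(marginal, conditional, n, rng):
    """Draw n synthetic pairs: first from its marginal, second given first."""
    firsts = rng.choices(list(marginal), weights=list(marginal.values()), k=n)
    synthetic = []
    for first in firsts:
        options = conditional[first]
        second = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        synthetic.append((first, second))
    return synthetic


# Invented toy "real" data: (age band, hospital admission status).
# These values are hypothetical and do not come from HES, NPD or Natsal-3.
real = (
    [("0-17", "none")] * 60
    + [("0-17", "one_or_more")] * 10
    + [("18+", "none")] * 20
    + [("18+", "one_or_more")] * 10
)

rng = random.Random(0)  # fixed seed so the example is repeatable
marginal, conditional = fit_sequential_model(real)
synthetic = sample_synthetic(marginal, conditional, 1000, rng)

# Compare the share of each age band in the real and synthetic data.
real_share = marginal["0-17"] / len(real)
synthetic_share = sum(1 for age, _ in synthetic if age == "0-17") / len(synthetic)
print(f"real share 0-17: {real_share:.2f}, synthetic share 0-17: {synthetic_share:.2f}")
```

A researcher could use records like these to draft and debug analysis code, then rerun the final analysis on the real data once access is granted.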
This project aims to explore the following research questions:
- What do data providers, researchers and the public identify as the existing facilitators of and barriers to the generation and use of synthetic versions of linked administrative data sources?
- What is the feasibility, performance and usefulness of a range of synthetic data generation methods?
- What are appropriate guidelines for the generation of synthetic data in different settings?
- What are the additional technical considerations required for the generation of linked administrative data?
For this project, the group will focus on generating synthetic versions of three datasets:
- The third National Survey of Sexual Attitudes and Lifestyles (Natsal-3)
- Hospital Episode Statistics (HES), which contains detailed information on health from records of admissions to NHS hospitals in England
- National Pupil Database (NPD), which contains details of school attainment, attendance and exclusion for pupils attending English state schools.
Project lead: Dr Katie Harron, UCL
Duration: January 2021 - June 2024
Funding amount: £161,389
Linkage of national longitudinal cohort studies and administrative data: A mutually beneficial arrangement
UCL researchers Dr Richard Silverwood, Professor Lisa Calderwood, Professor Bianca De Stavola and Professor George Ploubidis are analysing linked datasets to better understand methods for addressing missing data, data quality and residual confounding (incorrect findings that occur as a result of variables that have not been measured).
Research-ready administrative datasets tend to be population-representative and large in size, but as the data is not collected specifically for research purposes, it often lacks information in key domains.
Meanwhile, the UK has a long and successful history of national longitudinal cohort studies in which the same group of individuals are followed over time, potentially for several decades. These cohort studies aim to be population-representative at initiation and collect rich information across a wide range of research areas. However, they inevitably suffer from missing data, particularly due to decreasing levels of participant response over time. Linkages between existing cohort studies and administrative data, with the potential to draw on the different strengths of each data source, have therefore become increasingly common over recent years.
The overall aim of this project is to investigate ways to leverage the strengths of each data source in mutually beneficial ways to address important questions around missing data, data quality and residual confounding. This will be done using existing data linkages in health and education with UCL Centre for Longitudinal Studies (CLS) cohorts.
This project aims to explore the following research questions:
1. How can linked administrative data aid the handling of missing longitudinal cohort study data?
2. How can linked cohort study data improve our understanding of the quality of administrative data?
3. How can linked cohort study data help address residual confounding in analyses of administrative data?
This project will use data across two of the world-leading national longitudinal cohort studies run by CLS and funded by ESRC:
- The 1958 National Child Development Study (NCDS), which follows the lives of an initial 17,415 people born in Great Britain in a single week of 1958.
- Next Steps (formerly the Longitudinal Study of Young People in England), which follows a representative sample of young people born 1989-1990.
Additionally, the project will use three sources of administrative data linked to the cohort studies:
- National Pupil Database (NPD), which contains details of school attainment, attendance and exclusion for pupils attending English state schools
- Hospital Episode Statistics (HES), which contains detailed information on health from records of admissions to NHS hospitals in England
- Higher Education Statistics Agency (HESA) data, which contains detailed information on higher education attendance in the UK.
Project lead: Dr Richard Silverwood, UCL
Duration: January 2021 - June 2024
Funding amount: £157,577
Machine learning methods for studying the trajectories of young offenders in administrative data
Professor Imran Rasul (UCL), together with Dr Monica Costa-Dias and Dr Sarah Cattan of the Institute for Fiscal Studies, is using machine learning techniques to study young people’s life course trajectories in education and criminal behaviour.
They aim to understand how these trajectories start and evolve to help inform how policies can be best designed and targeted to prevent young people engaging in crime. To do so, they will identify different types of trajectory, describe the characteristics of individuals with different types of trajectory, and map their journeys through the education and criminal justice systems.
This project will then aim to advance the use of machine learning techniques to analyse UK administrative data. The team will apply a family of methods called ‘sequence analysis’, which has mostly been applied to survey data, to cluster individuals in terms of their educational and offending trajectories. They will test the viability and usefulness of these techniques in administrative data research. If successful, this will provide researchers with a tool to explore administrative data as a rich source of longitudinal information about individuals’ lives and contacts with public services.
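To give a flavour of how sequence analysis can cluster trajectories, the sketch below encodes each person's history as a string of yearly states, measures how different two histories are with an optimal-matching (edit) distance, and groups similar histories with a simple k-medoids procedure. The article does not specify which distance or clustering algorithm the project will use, so these are assumptions made for illustration. The state codes (S = in school, X = excluded, O = recorded offence), the costs and the toy sequences are all invented.

```python
def om_distance(a, b, indel=1.0, sub=2.0):
    """Optimal-matching (edit) distance between two state sequences."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, state_a in enumerate(a, start=1):
        curr = [i * indel]
        for j, state_b in enumerate(b, start=1):
            cost = 0.0 if state_a == state_b else sub
            curr.append(min(prev[j] + indel,       # delete from a
                            curr[j - 1] + indel,   # insert into a
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def k_medoids(sequences, k, max_iter=20):
    """Cluster sequences around k representative sequences (medoids)."""
    n = len(sequences)
    dist = [[om_distance(a, b) for b in sequences] for a in sequences]
    medoids = list(range(k))  # naive initialisation: the first k sequences
    labels = [0] * n
    for _ in range(max_iter):
        # Assign every sequence to its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Re-pick each medoid as the member closest to the rest of its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members, key=lambda m: sum(dist[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids


# Invented six-year trajectories, one character per year.
trajectories = ["SSSSSS", "SSXOOO", "SSSSSX", "SXOOOO", "SSSSSS", "SSXXOO"]
labels, medoids = k_medoids(trajectories, k=2)
for sequence, label in zip(trajectories, labels):
    print(sequence, "-> cluster", label)
```

In this toy example, the trajectories that stay mostly in school fall into one cluster and those that move from exclusion to offending fall into the other. Real administrative data would involve far more individuals, richer state definitions and a more careful choice of costs and number of clusters.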
This project aims to explore the following research questions:
- What are the different patterns of offending and the connections between trajectories in education and crime?
- What are the early family, school and neighbourhood characteristics that are associated with different types of trajectories?
- How do these trajectories interact with, and how are they shaped by, the criminal justice system?
- How viable and useful are sequence analysis methods to study the diversity of trajectories in crime and education in UK administrative and linked administrative data?
This project will link the following datasets:
- Ministry of Justice’s Police National Computer, which includes offending records and court outcomes
- National Pupil Database (NPD), which contains details of school attainment, attendance and exclusion for pupils attending English state schools.
Project lead: Professor Imran Rasul, UCL
Duration: January 2021 - June 2024
Funding amount: £171,441