Working with administrative data can be complex and challenging. This section explains some of these challenges to help you understand the data and use it effectively.
Understanding data linkage
Records for individuals within administrative data often have unique identifiers - such as an NHS number or National Insurance Number. You can therefore link data by matching common identifiers across datasets. The resulting linked dataset is then de-identified and added to a trusted research environment, and researchers can apply for accreditation to access this dataset. Note that while linking data by matching common identifiers can be effective, it is not immune to error. Errors in recording the unique identifier can still lead to missed matches and false positives, although the numbers are normally small.
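The matching step described above can be illustrated with a minimal sketch. The field names (`id_number`, `diagnosis`, `benefit`), the records, and the helper functions are all hypothetical. Dropping the identifier column stands in for de-identification, which in practice involves considerably more than this.

```python
# Hypothetical records: every field name and value here is invented for illustration.
health = [
    {"id_number": "A001", "diagnosis": "asthma"},
    {"id_number": "A002", "diagnosis": "diabetes"},
    {"id_number": "A0O3", "diagnosis": "eczema"},  # recording error: letter O, not zero
]
benefits = [
    {"id_number": "A001", "benefit": "yes"},
    {"id_number": "A003", "benefit": "no"},
]


def link_on_identifier(left, right, key):
    """Link two lists of records on an exact match of a shared identifier."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    linked, unmatched = [], []
    for row in left:
        matches = index.get(row[key], [])
        if not matches:
            unmatched.append(row)
        for match in matches:
            linked.append({**row, **match})
    return linked, unmatched


def drop_identifier(rows, key):
    """Remove the identifier column (a stand-in for real de-identification)."""
    return [{k: v for k, v in row.items() if k != key} for row in rows]


linked, unmatched = link_on_identifier(health, benefits, "id_number")
research_ready = drop_identifier(linked, "id_number")
```

In this toy example only A001 links. The mistyped "A0O3" record is a missed match: the person appears in both sources but the exact comparison fails, which is the kind of identifier error described above.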
What if administrative data does not contain these identifiers? Can it still be linked?
Linking data using statistical methods
In England, Scotland, and Northern Ireland, most administrative datasets do not contain the identifiers described above. In these instances, different statistical methods must be used to match individuals on the basis of identifiable characteristics - such as their name and postcode. Approaches to link data include clerical (i.e. manual), deterministic, and probabilistic methods - or some combination of these.
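To make the difference between deterministic and probabilistic methods concrete, here is a simplified sketch. The deterministic rule requires all fields to agree exactly. The probabilistic score sums per-field agreement weights in the style of the Fellegi-Sunter model. The m and u probabilities, the threshold, the field names, and the records are assumptions chosen for illustration; real tools typically estimate these parameters from the data and use fuzzy rather than exact comparisons.

```python
import math

# Assumed parameters (illustrative values, not estimated from any real data):
#   m = P(field agrees | the two records are the same person)
#   u = P(field agrees | the two records are different people)
FIELD_PARAMS = {
    "surname": {"m": 0.95, "u": 0.01},
    "postcode": {"m": 0.90, "u": 0.001},
    "birth_year": {"m": 0.98, "u": 0.02},
}
THRESHOLD = 5.0  # assumed cut-off above which a pair is treated as a match


def normalise(value):
    """Remove whitespace and case differences before comparing."""
    return "".join(str(value).split()).upper()


def deterministic_match(a, b, fields):
    """Match only if every field agrees exactly after normalisation."""
    return all(normalise(a[f]) == normalise(b[f]) for f in fields)


def match_weight(a, b, params=FIELD_PARAMS):
    """Sum log2 agreement or disagreement weights across fields."""
    total = 0.0
    for field, p in params.items():
        if normalise(a[field]) == normalise(b[field]):
            total += math.log2(p["m"] / p["u"])
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))
    return total


# Hypothetical records for illustration only.
record_a = {"surname": "Okafor", "postcode": "AB1 2CD", "birth_year": 1980}
record_b = {"surname": "Okafour", "postcode": "ab12cd", "birth_year": 1980}
record_c = {"surname": "Smith", "postcode": "ZZ9 9ZZ", "birth_year": 1975}

exact_ab = deterministic_match(record_a, record_b, list(FIELD_PARAMS))
weight_ab = match_weight(record_a, record_b)
weight_ac = match_weight(record_a, record_c)
```

Here the deterministic rule rejects the pair `record_a` and `record_b` because of one spelling difference in the surname, while the probabilistic score stays above the assumed threshold because postcode and birth year agree. The pair `record_a` and `record_c` disagrees on every field and scores below zero.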
It is important for researchers to understand how the data has been processed before, during, and after linkage. This is because processing methods can cause biases in the data. During linkage, for example, errors might not be randomly distributed across groups. If spelling errors during data collection are more frequent for individuals with foreign names, and the linkage algorithm matches individuals based on their names, these individuals may be less likely to appear in the linked dataset. Such linkage errors disproportionately affect disadvantaged groups, and can undermine data analyses and any evidence generated.
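One simple check for this kind of bias is to compare match rates across groups. The sketch below assumes each source record carries a group label and a flag showing whether it was linked. The field names (`group`, `linked`) and the records are hypothetical.

```python
from collections import defaultdict


def match_rate_by_group(records, group_key="group", linked_key="linked"):
    """Return the share of records that were linked, separately for each group."""
    totals = defaultdict(int)
    matched = defaultdict(int)
    for record in records:
        totals[record[group_key]] += 1
        if record[linked_key]:
            matched[record[group_key]] += 1
    return {group: matched[group] / totals[group] for group in totals}


# Hypothetical linkage outcomes, invented for illustration only.
outcomes = [
    {"group": "A", "linked": True},
    {"group": "A", "linked": True},
    {"group": "A", "linked": True},
    {"group": "A", "linked": True},
    {"group": "B", "linked": True},
    {"group": "B", "linked": False},
    {"group": "B", "linked": True},
    {"group": "B", "linked": False},
]

rates = match_rate_by_group(outcomes)
```

A large gap between groups, as in this invented example, suggests that linkage error is not random and that analyses of the linked data may under-represent one group.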
Resources and training
To learn more about the principles and execution of data linkages, explore the following resources:
- An Office for National Statistics working paper on the development of their data linkage methods for data hosted on secure distributed computer systems
- A cross-government review with contributed articles on linkage methods and recommendations for government data linkage
- GUILD: GUidance for Information about Linking Data sets
- A two-day Introduction to data linkage course, delivered by the National Centre for Research Methods
- Interactive blogs on probabilistic linkage
- Introduction to Splink: a software package for probabilistic record linkage at scale
- Tutorials and demos using the Splink data linkage tool
You can also explore the slide deck below on dealing with personally identifiable information and attribute data in data linkage and research.
Dealing with messy and complex data
Administrative data is, by nature, not primarily intended for statistical or research work. This can create 'messy' datasets which may contain missing values, duplicated records, inconsistent formats, or incorrect entries. What issues can arise from 'messy' data, how can they be dealt with, and what are the implications for interpreting and presenting research outcomes and statistics?
The two slide decks provided here are versions of those used at an in-person workshop at the ADR UK Conference 2023, which aimed to answer these questions. Participants included academics, data owners and analysts.
The slides include definitions and principles that should be followed. They then move on to how data can best be structured to facilitate analysis - from simple flat tables, to the use of spines and indices, to relational databases and graph databases. Messiness and error are then covered, as well as approaches to addressing error and its impact on linkage and analysis.
Additional resources are provided at the end of the second slide deck.
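As a small illustration of the kinds of messiness described in this section (missing values, duplicated records, inconsistent formats, and incorrect entries), the sketch below standardises names and dates and then removes duplicates. It is not taken from the slide decks. The field names, the accepted date formats, and the records are assumptions made for the example.

```python
from datetime import datetime

# Assumed input formats; real data may need more, and ambiguous dates need care.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Return an ISO date string, or None if the value is missing or invalid."""
    if text is None or not text.strip():
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # an impossible or unrecognised date


def clean(rows):
    """Standardise formats, then drop records that become exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        tidy = {
            # Collapse whitespace and apply crude title-casing to the name.
            "name": " ".join(row["name"].split()).title(),
            "date_of_birth": parse_date(row.get("date_of_birth")),
        }
        key = (tidy["name"], tidy["date_of_birth"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(tidy)
    return cleaned


# Hypothetical raw records, invented for illustration only.
raw = [
    {"name": "  jane  doe", "date_of_birth": "1990-03-05"},
    {"name": "Jane Doe", "date_of_birth": "05/03/1990"},  # same person, other format
    {"name": "John Roe", "date_of_birth": ""},            # missing value
    {"name": "Ann Poe", "date_of_birth": "31/02/1985"},   # impossible date
]

cleaned = clean(raw)
missing_dob = sum(1 for row in cleaned if row["date_of_birth"] is None)
```

Note that this sketch records both missing and invalid dates as `None`. In real work it is usually better to keep the two cases distinct and to document every cleaning decision, since each one can affect linkage and analysis.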