Understanding data linkage
This page provides a summary of the challenges of data linkage and signposts researchers to resources and training on data linkage.
Records for individuals in administrative data often carry a unique identifier, such as an NHS number or National Insurance number. Datasets can therefore be linked by matching a common identifier across them. The resulting linked dataset is then de-identified and added to a trusted research environment, and researchers can apply for accreditation to access it. Linking on a common identifier is effective but not immune to error: mistakes in recording the identifier can still lead to missed matches and false positives, although usually in small numbers.
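As a rough illustration of linking on a common identifier, the sketch below joins two small, made-up datasets on an exact identifier match and then replaces the identifier with an arbitrary study ID. The field names (unique_id, diagnosis, qualification), the records, and the helper functions are all hypothetical, and real de-identification involves considerably more than dropping one column.

```python
# Hypothetical records: field names and values are invented for illustration.
health_records = [
    {"unique_id": "0000000001", "diagnosis": "asthma"},
    {"unique_id": "0000000002", "diagnosis": "diabetes"},
    {"unique_id": "0000000003", "diagnosis": "eczema"},
]
education_records = [
    {"unique_id": "0000000001", "qualification": "degree"},
    {"unique_id": "0000000002", "qualification": "a_level"},
    # Simulated recording error: the last digit was mistyped,
    # so this row no longer matches 0000000003 (a missed match).
    {"unique_id": "0000000008", "qualification": "gcse"},
]


def link_on_identifier(left, right, key):
    """Join two lists of records where the identifier matches exactly."""
    right_by_key = {row[key]: row for row in right}
    linked = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            linked.append({**row, **match})
    return linked


def de_identify(rows, key):
    """Drop the identifier and assign an arbitrary study ID instead."""
    return [
        {"study_id": i, **{k: v for k, v in row.items() if k != key}}
        for i, row in enumerate(rows, start=1)
    ]


linked = de_identify(
    link_on_identifier(health_records, education_records, "unique_id"),
    "unique_id",
)
print(linked)
```

Only two of the three people are linked here, because a single mistyped digit is enough to break an exact match.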
What if administrative data does not contain these identifiers? Can it still be linked?
Linking data using statistical methods
In England, Scotland, and Northern Ireland, most administrative datasets do not contain the identifiers described above. In these instances, statistical methods must be used to match individuals on the basis of identifying characteristics, such as their name and postcode. Approaches to linking data include clerical (i.e. manual), deterministic, and probabilistic methods, or some combination of these.
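To make the difference between deterministic and probabilistic approaches more concrete, the sketch below compares one pair of made-up records in both ways. The probabilistic part uses one common formulation, the Fellegi-Sunter model, in which each field adds a log-likelihood-ratio weight depending on whether it agrees. The m and u probabilities, the threshold, the field names, and the records are all assumed values chosen for illustration, not estimates from any real dataset.

```python
import math

# Assumed parameters, invented for illustration:
#   m = P(field agrees | the two records belong to the same person)
#   u = P(field agrees | the two records belong to different people)
FIELD_PARAMS = {
    "surname": {"m": 0.95, "u": 0.01},
    "postcode": {"m": 0.90, "u": 0.001},
    "year_of_birth": {"m": 0.98, "u": 0.02},
}

THRESHOLD = 5.0  # hypothetical cut-off for accepting a pair as a match


def deterministic_match(a, b, fields):
    """Deterministic rule: every listed field must agree exactly."""
    return all(a[f] == b[f] for f in fields)


def match_weight(a, b, params):
    """Sum of per-field log2 likelihood ratios (Fellegi-Sunter style)."""
    weight = 0.0
    for field, p in params.items():
        if a[field] == b[field]:
            weight += math.log2(p["m"] / p["u"])
        else:
            weight += math.log2((1 - p["m"]) / (1 - p["u"]))
    return weight


# Two hypothetical records that differ only in the spelling of the surname.
record_a = {"surname": "nowak", "postcode": "AB1 2CD", "year_of_birth": 1984}
record_b = {"surname": "novak", "postcode": "AB1 2CD", "year_of_birth": 1984}

is_exact = deterministic_match(record_a, record_b, list(FIELD_PARAMS))
weight = match_weight(record_a, record_b, FIELD_PARAMS)

print("deterministic match:", is_exact)
print("probabilistic match:", weight > THRESHOLD)
```

Under these assumed values the strict deterministic rule rejects the pair because of the surname, while the probabilistic weight still clears the threshold because the postcode and year of birth agree. In practice, tools such as Splink estimate these parameters from the data rather than taking them as fixed inputs.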
It is important for researchers to understand how the data has been processed before, during, and after linkage, because processing methods can introduce bias. During linkage, for example, errors may not occur at random. If spelling errors during data collection are more frequent for individuals with foreign names, and the linkage algorithm matches individuals on their names, these individuals may be less likely to appear in the linked dataset. Such linkage errors disproportionately affect disadvantaged groups, and can undermine data analyses and any evidence generated.
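One simple way to check for this kind of bias is to compare match rates across groups in the source data. The sketch below is a minimal example under stated assumptions: the group labels, the linked flag, and the records are hypothetical, and a real assessment would also consider false matches and the characteristics of unlinked records.

```python
from collections import defaultdict

# Hypothetical linkage outcomes: each source record is flagged with
# whether the linkage process found a match for it.
source_records = [
    {"group": "A", "linked": True},
    {"group": "A", "linked": True},
    {"group": "A", "linked": True},
    {"group": "A", "linked": True},
    {"group": "A", "linked": False},
    {"group": "B", "linked": True},
    {"group": "B", "linked": True},
    {"group": "B", "linked": False},
    {"group": "B", "linked": False},
    {"group": "B", "linked": False},
]


def match_rate_by_group(records, group_field="group", linked_field="linked"):
    """Return the share of records in each group that were linked."""
    totals = defaultdict(int)
    linked = defaultdict(int)
    for record in records:
        totals[record[group_field]] += 1
        if record[linked_field]:
            linked[record[group_field]] += 1
    return {group: linked[group] / totals[group] for group in totals}


rates = match_rate_by_group(source_records)
for group, rate in sorted(rates.items()):
    print(f"group {group}: {rate:.0%} linked")
```

A large gap between groups, as in this made-up data, suggests that the linked dataset under-represents one group and that analyses may need to account for this.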
Resources and training
To learn more about the principles and practice of data linkage, explore the following resources:
- An Office for National Statistics working paper on the development of their data linkage methods for data hosted on secure distributed computer systems
- A cross-government review with contributed articles on linkage methods and recommendations for government data linkage
- GUILD: GUidance for Information about Linking Data sets
- A two-day Introduction to data linkage course, delivered by the National Centre for Research Methods
- Interactive blogs on probabilistic linkage
- Introduction to Splink: a software package for probabilistic record linkage at scale
- Tutorials and demos using the Splink data linkage tool
You can also explore the slide deck below on dealing with personally identifiable information and attribute data in data linkage and research.