Understanding data linkage

Records for individuals in administrative data often carry a unique identifier, such as an NHS number or National Insurance number. Datasets can therefore be linked by matching these common identifiers. The resulting linked dataset is then de-identified and added to a trusted research environment, and researchers can apply for accreditation to access it. Matching on common identifiers is effective but not immune to error: mistakes in recording an identifier can still lead to a (normally small) number of missed matches and false positives.
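As a rough illustration of the process described above, the sketch below links two toy datasets on a shared identifier and then strips that identifier from the result. Everything in it is an assumption made for the example: the field name `nhs_number`, the made-up identifier values (which are not valid NHS numbers), and the helper functions `link_on_identifier` and `deidentify`. Real de-identification involves considerably more than dropping one column.

```python
# Two toy datasets keyed by a hypothetical identifier field "nhs_number".
# The identifier values are invented for illustration only.
health = [
    {"nhs_number": "1000000001", "diagnosis": "asthma"},
    {"nhs_number": "1000000002", "diagnosis": "diabetes"},
    {"nhs_number": "1000000003", "diagnosis": "eczema"},
]
education = [
    {"nhs_number": "1000000001", "school": "A"},
    {"nhs_number": "1000000002", "school": "B"},
    {"nhs_number": "1000000030", "school": "C"},  # identifier mistyped at source
]


def link_on_identifier(left, right, key):
    """Join two lists of records on an exact match of `key`.

    Returns the linked records and the left-hand records with no match.
    """
    index = {row[key]: row for row in right}
    linked, unmatched = [], []
    for row in left:
        match = index.get(row[key])
        if match is None:
            unmatched.append(row)
        else:
            linked.append({**row, **match})
    return linked, unmatched


def deidentify(rows, key):
    """Replace the identifier with an arbitrary study ID (simplified)."""
    return [
        {"study_id": i, **{k: v for k, v in row.items() if k != key}}
        for i, row in enumerate(rows, start=1)
    ]


linked, unmatched = link_on_identifier(health, education, "nhs_number")
research_dataset = deidentify(linked, "nhs_number")
```

In this toy data the third person is a missed match: a single mistyped digit in one dataset is enough to stop an exact comparison from linking the two records.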

What if administrative data does not contain these identifiers? Can it still be linked?

Linking data using statistical methods

In England, Scotland, and Northern Ireland, most administrative datasets do not contain the identifiers described above. In these instances, statistical methods must be used to match individuals on the basis of identifying characteristics, such as name and postcode. Approaches include clerical (i.e. manual), deterministic, and probabilistic linkage, or some combination of these.
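To make the difference between deterministic and probabilistic linkage concrete, here is a minimal sketch. The deterministic rule requires every chosen field to agree after simple normalisation. The probabilistic score follows one common formulation, in the style of Fellegi and Sunter, where each field adds a log-weight depending on whether it agrees. The field names, the example records, and the m and u probabilities are all illustrative assumptions, not values estimated from real data or taken from any particular linkage service.

```python
import math

# Assumed per-field probabilities (illustrative values only):
#   m = P(field agrees | the two records are the same person)
#   u = P(field agrees | the two records are different people)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "postcode": (0.90, 0.001),
    "birth_year": (0.98, 0.02),
}


def normalise(value):
    """Lower-case a value and remove all whitespace before comparison."""
    return "".join(str(value).split()).lower()


def deterministic_match(a, b, fields):
    """Deterministic rule: every listed field must agree exactly."""
    return all(normalise(a[f]) == normalise(b[f]) for f in fields)


def match_weight(a, b, params=FIELD_PARAMS):
    """Probabilistic score: sum of log2 agreement/disagreement weights."""
    total = 0.0
    for field, (m, u) in params.items():
        if normalise(a[field]) == normalise(b[field]):
            total += math.log2(m / u)              # agreement adds evidence
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return total


rec_a = {"surname": "Okafor", "postcode": "AB1 2CD", "birth_year": 1990}
rec_b = {"surname": "Okafor", "postcode": "ab12cd", "birth_year": 1990}
rec_c = {"surname": "Okafer", "postcode": "AB1 2CD", "birth_year": 1990}  # misspelt
rec_d = {"surname": "Smith", "postcode": "ZZ9 9ZZ", "birth_year": 1975}

fields = list(FIELD_PARAMS)
exact_ab = deterministic_match(rec_a, rec_b, fields)  # formatting differences only
exact_ac = deterministic_match(rec_a, rec_c, fields)  # fails on the misspelling
weight_ac = match_weight(rec_a, rec_c)
weight_ad = match_weight(rec_a, rec_d)
```

With these assumed parameters, the misspelt record fails the deterministic rule but still receives a positive weight, while the unrelated record scores well below zero. In practice, pairs scoring between an upper and a lower threshold are often the ones sent for clerical review.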

It is important for researchers to understand how the data has been processed before, during, and after linkage, because processing methods can introduce bias. During linkage, for example, errors may not be randomly distributed. If spelling errors are more frequent during data collection for individuals with foreign names, and the linkage algorithm matches individuals on their names, those individuals are less likely to appear in the linked dataset. Such linkage errors disproportionately affect disadvantaged groups, and can undermine analyses and any evidence generated from them.
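One simple way to look for the kind of non-random linkage error described above is to compare match rates across groups in the source data. The sketch below assumes each source record carries a hypothetical `group` label and a `linked` flag recording whether it was matched; both field names and the toy records are invented for illustration.

```python
from collections import defaultdict


def match_rates_by_group(records, group_key, linked_key="linked"):
    """Return the proportion of records that were linked, per group."""
    totals = defaultdict(int)
    matched = defaultdict(int)
    for row in records:
        totals[row[group_key]] += 1
        if row[linked_key]:
            matched[row[group_key]] += 1
    return {group: matched[group] / totals[group] for group in totals}


# Toy source records: group "B" is linked far less often than group "A".
source_records = [
    {"group": "A", "linked": True},
    {"group": "A", "linked": True},
    {"group": "A", "linked": True},
    {"group": "A", "linked": False},
    {"group": "B", "linked": True},
    {"group": "B", "linked": False},
    {"group": "B", "linked": False},
    {"group": "B", "linked": False},
]

rates = match_rates_by_group(source_records, "group")
```

A large gap between groups, as in this toy data, suggests that the linked dataset under-represents one group and that analyses based on it may need to account for that.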

Resources and training

To learn more about the principles and execution of data linkages, explore the following resources: 

You can also explore the slide deck below on dealing with personally identifiable information and attribute data in data linkage and research.