Splink: Free software for probabilistic record linkage at scale
Categories: Blogs, ADR England, Crime & justice
24 August 2022
In this blog, Ministry of Justice (MoJ) data scientist Robin Linacre introduces Splink, a free and open source software library for record linkage at scale that has now been downloaded over three million times. This software implements a best-practice approach, and has been developed as part of the Data First programme, ADR UK’s collaboration with MoJ.
In 2019, my team was challenged to develop a new data linking methodology to produce new, higher quality linked datasets from the justice system. The ultimate goal - which we’ve now achieved - was to share new linked datasets with academic researchers, as part of the ADR UK-funded Data First programme.
Data linking is needed when data lacks unique identifiers that would allow people to be joined up across different datasets. In the absence of these unique identifiers, probabilistic record linkage is a technique that can be used to find which records pertain to the same person.
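To make the idea concrete, here is a minimal sketch of one common formulation of probabilistic linkage, the Fellegi-Sunter model. It is an illustration only, not Splink's own code. The field names, the m and u probabilities, and the prior are made-up values chosen for the example. Each field that agrees or disagrees shifts the evidence for or against a match, and the combined evidence is converted into a match probability.

```python
import math

# Hypothetical parameters, invented for illustration:
#   m = P(field agrees | records are the same person)
#   u = P(field agrees | records are different people)
PARAMS = {
    "first_name": {"m": 0.90, "u": 0.010},
    "surname": {"m": 0.95, "u": 0.005},
    "dob": {"m": 0.98, "u": 0.001},
}

# Assumed prior probability that two records drawn at random match.
PRIOR = 0.001


def match_probability(rec_a, rec_b, params=PARAMS, prior=PRIOR):
    """Return the probability that two records refer to the same person."""
    # Start from the prior, expressed as log2 odds.
    log_odds = math.log2(prior / (1 - prior))
    for field, p in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            # Agreement: add the positive match weight log2(m / u).
            log_odds += math.log2(p["m"] / p["u"])
        else:
            # Disagreement: add the negative weight log2((1 - m) / (1 - u)).
            log_odds += math.log2((1 - p["m"]) / (1 - p["u"]))
    odds = 2 ** log_odds
    return odds / (1 + odds)


record_1 = {"first_name": "john", "surname": "smith", "dob": "1980-01-01"}
record_2 = {"first_name": "jon", "surname": "smith", "dob": "1980-01-01"}
record_3 = {"first_name": "mary", "surname": "jones", "dob": "1975-06-30"}

print(match_probability(record_1, record_2))  # surname and dob agree
print(match_probability(record_1, record_3))  # nothing agrees
```

In this sketch, a rare value agreeing (such as date of birth, with a small u) counts for far more than a mismatch on a field that is often mistyped. That is why two records can still score as a likely match when one field differs.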
In deciding our approach, we noted two important observations:
- Record linkage is a pervasive problem across academic, public and private sector organisations - but there is little consistency in approach.
- No free software existed that was able to perform probabilistic linkage of large datasets of the size held at MoJ.
This suggested that new record linkage software would be useful both to Data First and to the wider record linkage community. This software would implement a best-practice methodology and be capable of linking very large datasets. As a result, Splink was born.
Splink is now in its third version. It is a freely available, open source Python package with the following key features:
- Faster and more accurate than other free tools
- Able to link huge datasets, of tens of millions of records or more
- Its development has benefitted from guidance from our academic advisors - three professors who are experts in data linkage
- The software produces a wide range of interactive data visualisations that help to build effective models, explain linkage predictions, diagnose problems, and quality assure models
- The software is compatible with multiple databases and big data processing engines, meaning it can run on a wide range of computer systems
This software has been used to link some of the largest datasets held by MoJ as part of Data First. More widely, demand for Splink has been higher than we expected, and we recently reached 3 million downloads. We’re aware of its use in other government departments, including the Office for National Statistics, and in the private sector. Splink has even been used for published research from Stanford University.
You can find out more about Splink on our website or download and start using Splink. We’d be very happy to hear from researchers interested in using Splink for their work - get in touch at robin.linacre@digital.justice.gov.uk.