Good research needs good data: What happens behind the scenes of data analysis?


14 October 2021

Rates of violent crime in England and Wales have increased in recent years, and large amounts of funding have been allocated to violence prevention initiatives as a result. It is crucial that such initiatives are evaluated to establish what successfully reduces violence; at the moment there is insufficient evidence on what works (and what does not).

In 2020, the Department for Education (DfE) and Ministry of Justice (MoJ) created a large dataset linking together de-identified education and crime data. Our project aims to investigate whether this linked dataset can be used to reliably assess whether interventions aimed at reducing violent crime were effective.

Administrative datasets such as these have several advantages. The key ones are their size and coverage (they generally contain data on a large proportion of the population over long periods of time), and the fact that the information is generally objectively measured. However, using administrative data comes with several challenges. For example, although the data has generally undergone various quality control checks, it is often still “messy”, and users often have to carry out extensive data management tasks to clean and prepare it for analysis.

Our project is no exception to this. One of the key tasks we had to complete was to work out who we could include in our analysis and what information was available on them. This was not quite as straightforward as it sounds! The “linked dataset” was in fact many different datasets, each containing different (and sometimes the same) pieces of information. For example, each school census – which is carried out three times during the school year – contains information on key characteristics such as the pupil’s year and month of birth, their gender and ethnicity, and the school year they are in at that time. These datasets are easily linkable via the unique pupil matching reference (a pseudonymous ID number) and – for the most part – this information is consistent across all censuses. However, what do you do about the minority with more than one ethnicity recorded, or with more than one year of birth? And what if this does not match up with their age as recorded in the MoJ data, for instance? What do you do if you want to know a pupil’s absence rate in year eight when they are recorded as being in year eight in both 2000/2001 and 2002/2003? These are just a few examples of the many decisions that had to be made before we could answer the question we set out to answer.
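To make this kind of decision concrete, here is a minimal Python sketch of one possible rule for reconciling conflicting values across censuses. It is not the rule used in our project: the field names, the example rows and the rule itself (take the most frequently recorded value, and on a tie prefer the most recent census) are all assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical census extracts: one row per pupil per census.
# Field names and values are illustrative, not the real DfE/MoJ schema.
census_rows = [
    {"pupil_id": "A1", "census_date": "2001-10-01", "year_of_birth": 1989, "ethnicity": "White British"},
    {"pupil_id": "A1", "census_date": "2002-01-15", "year_of_birth": 1989, "ethnicity": "White British"},
    {"pupil_id": "A1", "census_date": "2002-05-15", "year_of_birth": 1990, "ethnicity": "White British"},
    {"pupil_id": "B2", "census_date": "2001-10-01", "year_of_birth": 1988, "ethnicity": "Mixed"},
    {"pupil_id": "B2", "census_date": "2002-01-15", "year_of_birth": 1988, "ethnicity": "Black Caribbean"},
]


def resolve_field(rows, field):
    """Return (value, is_conflicted) for one pupil's rows.

    Assumed rule: take the most frequent value; on a tie, prefer the
    tied value that appears in the most recent census.
    """
    # ISO-format date strings sort chronologically.
    ordered = sorted(rows, key=lambda r: r["census_date"])
    counts = Counter(r[field] for r in ordered)
    top_count = max(counts.values())
    tied_values = {v for v, c in counts.items() if c == top_count}
    is_conflicted = len(counts) > 1
    # Walk from the latest census backwards; the first tied value wins.
    for row in reversed(ordered):
        if row[field] in tied_values:
            return row[field], is_conflicted


def resolve_pupils(rows, fields=("year_of_birth", "ethnicity")):
    """Collapse many census rows into one resolved record per pupil."""
    by_pupil = {}
    for row in rows:
        by_pupil.setdefault(row["pupil_id"], []).append(row)

    resolved = {}
    for pupil_id, pupil_rows in by_pupil.items():
        record = {}
        for field in fields:
            value, conflicted = resolve_field(pupil_rows, field)
            record[field] = value
            # Keep a flag so conflicted pupils can be reviewed or excluded later.
            record[field + "_conflicted"] = conflicted
        resolved[pupil_id] = record
    return resolved


resolved = resolve_pupils(census_rows)
```

Keeping a flag alongside each resolved value means the choice of rule can be revisited later, for example by re-running the analysis with conflicted pupils excluded to see whether the results change.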

These sorts of issues are faced by all researchers using administrative data and this processing inevitably takes time (and computing power!). However, the end product is – in our case – a dataset large enough to allow us to investigate our outcome of interest (serious violence). If we had used a survey or cohort study, there may have been insufficient numbers to study this relatively rare outcome. In addition, those involved in serious violence may be less likely to participate in research studies and/or less likely to report this type of outcome accurately. Administrative data is less exposed to both problems, which gives it two further key advantages for this work.

But what is the answer to the question we posed? In short, the answer is yes – the dataset can be used to evaluate violent crime reduction interventions, but with some caveats and cautions. The full findings will be described in our project reports, which will be published on the ADR UK website in due course.
