Administrative data is made up of information collected when people interact with public services. Because this data is collected by different people in different ways, inconsistencies and gaps in the datasets are inevitable. These gaps can affect the analyses of researchers using the data and the process of linking different datasets.
This page contains a commentary on considering bias in administrative data, and an overview of considering bias in linked data.
Considering bias in administrative data
Written by: Professor Iain Brennan, University of Hull (July 2023)
Because of their scale and their large catchment populations, administrative datasets can reveal associations and causal pathways that may have previously been unknown or unidentifiable. This potential is significant and exciting, but users of administrative data must carefully consider the conditions and mechanisms that led to the data being generated.
For example, users should recognise potential selection bias, which might be caused by the process of collecting, recording, and recoding administrative data. Administrative data showing patterns around, for instance, school exclusions, care received, or criminal convictions, could contain biases in who did and did not receive services or how services are administered. These biases in the data may therefore present an inaccurate view of exposures and outcomes. Using administrative data without considering how it was generated, or what might be missing from official records, risks reinforcing pre-existing biases. This could further stigmatise marginalised or discriminated-against groups.
In addition, bias in the data-generating process creates a form of statistical selection bias, which can adversely affect the accuracy of findings. For example, using a sample of patients in hospital to test for a risk factor for a health condition can produce spurious associations that are explained by the reason the patient was in hospital (this is known as Berkson’s paradox, or collider bias). Similarly, if racial bias affects the likelihood of a conviction resulting in a prison sentence, then examining the relationship between ethnicity and, say, speech, language and communication needs (which are also associated with receiving a prison sentence) within the prison population can produce a spurious association.
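To make this mechanism concrete, the following is a minimal simulation sketch (with invented probabilities, not real data): two conditions that are independent in the whole population each raise the chance of hospital admission, and they appear negatively associated once we look only at admitted patients.

```python
# Minimal simulation of collider bias (Berkson's paradox).
# Hypothetical illustration only: two independent conditions each make
# hospital admission more likely; conditioning on admission induces a
# spurious (negative) association between them.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

condition_a = rng.random(n) < 0.10          # independent of condition_b
condition_b = rng.random(n) < 0.10

# Admission is more likely if either condition is present.
p_admit = 0.02 + 0.40 * condition_a + 0.40 * condition_b
admitted = rng.random(n) < p_admit

def association(x, y):
    """Simple correlation between two binary indicators."""
    return np.corrcoef(x, y)[0, 1]

print("Whole population: ", round(association(condition_a, condition_b), 3))   # close to 0
print("Admitted patients:", round(association(condition_a[admitted],
                                              condition_b[admitted]), 3))      # negative
```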
Although there is now more awareness of these issues in fields like epidemiology, they are still largely absent from contemporary social science training. As more social scientists harness the power of administrative data, it is crucial that we understand its hidden biases. Not doing so risks our making incorrect inferences and reinforcing existing inequalities in society.
One potential solution is to increase the availability of training in causal inference. There is a growing range of accessible introductions and primers in causal inference, as well as several training courses. Training is intensive and can last up to five days, but most participants find it illuminating and rewarding.
Considering bias in linked data
Written by: Dean Jathoonia and Jen Hampton, Office for National Statistics
There are many ways of assessing the quality of matched data, including by measuring:
- match rate (how many records were matched out of all of the records one tried to match)
- precision (how many of the matches that were made should have been made)
- recall (how many of the matches that could have been made, were made)
- bias in the matched data.
These four measures should be explained to all users of linkage services. This can help them understand the results of matching and how those results may affect their analysis.
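As a rough illustration of how these measures relate to one another, here is a minimal sketch with invented counts (the variable names and figures are purely hypothetical, not drawn from any real linkage exercise):

```python
# Hypothetical record-level counts, for illustration only.
records_attempted = 1_000          # records we tried to match
records_with_true_counterpart = 950  # records that genuinely have a match available
records_linked = 900               # records the matching process actually linked
correctly_linked = 870             # linked records where the link is correct (true positives)

match_rate = records_linked / records_attempted           # matched out of all records we tried to match
precision = correctly_linked / records_linked             # made matches that should have been made
recall = correctly_linked / records_with_true_counterpart # possible matches that were made

print(f"match rate: {match_rate:.1%}, precision: {precision:.1%}, recall: {recall:.1%}")
```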
A simple example
Consider a linked dataset that contains a person’s ID and hairstyle, but no other information. If the original datasets (from which the linked dataset is derived) had been linked using surnames, then the resulting linked file could be biased against females, because women are more likely to change their surname after a change in marital status and are therefore less likely to match.
If an analyst is unaware of this, they may conclude that the majority of the population has short hair. This conclusion would be incorrect, because it does not account for the bias introduced by the linked data under-representing females.
A realistic example
In the UK, it is generally more difficult to match people whose names are not typically English, often because of spelling variations, linguistic nuances, and different naming conventions. This can lead to biases against certain groups, such as people of Asian ethnicity. Some health conditions, such as diabetes and high blood pressure, are more prevalent among Asian people. If an analyst were looking at a linked dataset containing health information, and that linked dataset were biased in this way, they could draw incorrect or inaccurate conclusions about the incidence of these conditions among the population, because the linked data does not accurately represent the true ethnic composition of the population.
Quantifying bias
In reality, bias is much more difficult to identify and quantify. It can occur in any variable, and in any dataset where there is likely to be some difference between different socio-demographic groups. It can also occur within nested groups. For example, there may be a difference between males and females, and another difference between young and elderly people, and therefore there could be two overlapping biases.
There are many ways of thinking about bias in linked data. It is often useful to consider the extent to which a given demographic group is under- or over-represented in linked data compared to the representation of that group in raw data, proportional to the overall match rate.
For example, if the raw data contained 100 males and 100 females and the overall match rate was 80%, then perfect proportional representation in the linked data would be 80 males and 80 females. (Note: this does not mean there has been a perfect match. In this case there clearly has not: only 80% of people have been matched. It simply means that the resulting linked dataset, whatever the match rate, is proportionally representative of the original dataset used as the denominator.)
In practice, this ‘perfect representation’ is rarely achieved. Some groups are over-represented and some are under-represented, with the match rate being an average across groups. If the example above had matched 90 males and 70 females in the linked data, there would be an over-representation of males and an under-representation of females, compared with the expected 80 of each based on the 80% overall match rate.
Having an over-representation is not in itself a problem for that specific group. Again, take the example above: 90 males and 70 females were matched with an overall match rate of 80%. The fact that 90% of males matched is actually a good thing - assuming the matches are true positives (matches that should be made) and not false positives (incorrect links). It is the discrepancy between this figure and the overall match rate that indicates a problem elsewhere - in this case, with the matching of females.
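A minimal sketch of this representation check, using the invented counts from the example above (group labels and variable names are illustrative only):

```python
# Illustrative check of proportional representation in a linked dataset:
# 100 males and 100 females in the raw data, an 80% overall match rate,
# and 90 males / 70 females actually matched.
raw_counts = {"male": 100, "female": 100}
matched_counts = {"male": 90, "female": 70}

overall_match_rate = sum(matched_counts.values()) / sum(raw_counts.values())  # 0.80

for group, raw in raw_counts.items():
    expected = raw * overall_match_rate   # 80 for each group in this example
    observed = matched_counts[group]
    ratio = observed / expected           # >1 suggests over-representation, <1 under-representation
    print(f"{group}: expected {expected:.0f}, observed {observed}, ratio {ratio:.2f}")
```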
What can we do about bias?
The ONS is producing a tool to help measure these biases. The key point is that finding a bias indicates one of two things:
- There may exist some issue in the matching algorithms that is not adequately capturing all true positive matches for a given group.
- In this case, the party linking the data (the linker) may wish to re-examine the data and the matchkeys or model to determine whether anything can be done to improve the matching. For example, if there were a bias against females compared to males, the linker might consider which matchkeys operate more successfully for males than for females, such as surname (where females are more likely to change their name, and therefore not match). If there were a bias against Chinese or Romanian people, the linker may wish to consider whether a transposition of forename and surname might help, as those cultures often write the surname first (whereas English naming conventions write the surname last); a sketch of such a transposition-tolerant matchkey follows this list.
- The matching algorithm is fine, but there exists some issue in the data that is skewing the matching process (such as missingness or a data recording issue) that prevents matches being made for a given group.
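As mentioned in the first point above, one possible adjustment is a matchkey that tolerates forename and surname transposition. The following is a minimal illustrative sketch with hypothetical field names and keys; it is not the actual matching implementation used by the ONS.

```python
# Illustrative sketch only: two simple matchkeys, the second tolerant of
# forename/surname transposition. Field names are hypothetical.
def standard_key(record):
    """Surname + forename initial + date of birth."""
    return (record["surname"].lower(), record["forename"][0].lower(), record["dob"])

def transposition_key(record):
    """Order-insensitive name key: helps where naming conventions place the
    family name first and it has been recorded in the forename field."""
    names = sorted([record["forename"].lower(), record["surname"].lower()])
    return (names[0], names[1], record["dob"])

a = {"forename": "Wei", "surname": "Zhang", "dob": "1990-01-01"}
b = {"forename": "Zhang", "surname": "Wei", "dob": "1990-01-01"}   # names transposed

print(standard_key(a) == standard_key(b))         # False: the standard key misses this pair
print(transposition_key(a) == transposition_key(b))  # True: the transposition-tolerant key links it
```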
In both cases, the linker should explain to the analyst where bias might arise. The analyst, in turn, should consider how any resultant bias may affect their analyses and conclusions, and to what extent, and should take this into account in their analytical commentary.
If the particular research interest is in subgroups where matching is more difficult, additional resources may be required to look at matching specifically within these groups. This may involve additional clerical matching, or fine-tuning of methods, so that the match meets the researcher’s needs. The researcher may also need to accept that they will have to work with lower-quality data for their community of interest.
An analyst needing high-level aggregate data on general trends may be prepared to tolerate a greater level of bias than a methodologist using a model that is highly sensitive to bias, such as dual system estimation. Bias should always be minimised, but once the linker has done as much as possible to ensure the matching algorithm performs well, the analyst must decide whether the remaining bias is acceptable. Either way, the analyst must be mindful of that bias when producing and analysing their statistics, and should communicate it in any publication or report.
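To illustrate why a method such as dual system estimation is so sensitive to linkage error, here is a minimal sketch using the standard Lincoln-Petersen estimator with invented figures: when true matches are missed (for example, because one group links less well), the population estimate is inflated.

```python
# Minimal sketch of dual system estimation (Lincoln-Petersen estimator)
# and its sensitivity to missed links. All figures are invented.
def dual_system_estimate(n1, n2, matched):
    """Estimated population size from two overlapping lists."""
    return n1 * n2 / matched

n1, n2 = 9_000, 8_000        # counts on two hypothetical lists
true_matches = 7_200         # records genuinely present on both lists

print(dual_system_estimate(n1, n2, true_matches))        # 10,000: the 'true' population here

# If the linkage misses 10% of the true matches, the estimate rises to about 11,111,
# an overestimate of roughly 11%.
print(dual_system_estimate(n1, n2, true_matches * 0.9))
```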
Used in these ways - focusing the analyst’s attention on bias and directing the linker to where the algorithm can be improved - an indication of bias works alongside measures of match rate, precision, and recall to provide an overall assessment of the linked data.
For further information, contact the Data Linkage Hub at ONS.