Covid-19 Infection Survey – upscaling the statistical processing pipeline
19 June 2023
Authors: Dr Charlotte Murphy (Office for National Statistics), University of Oxford, IQVIA, The Glasgow Lighthouse Laboratory, UK Health Security Agency, The University of Manchester and Wellcome Trust
Date: April 2020
This research project updated the statistical processing pipeline for the Covid-19 Infection Survey. The update allowed critical data to be produced and analysed in a timely manner, so that time-sensitive information reached government during the Covid-19 pandemic to inform critical decision-making and policymaking.
The Covid-19 Infection Survey was set up in April 2020 to estimate the prevalence of Covid-19 infection in the UK. The survey provided important information about the sociodemographic characteristics of the people and households who contracted Covid-19, their vaccination status, the presence of antibodies, and the symptoms they experienced. A statistical pipeline was required to process and analyse critical data including swab results, antibody tests, and demographic information. As the study expanded, demand grew for an updated and automated pipeline that would improve processing speed and resolve memory and storage-related issues.
The UK Government, Scottish Government, Welsh Government, and Northern Ireland Executive are the main users of the Covid-19 Infection Survey. It is used to track the Covid-19 pandemic and help inform decisions about restrictions and related policies. Academics and health researchers also use the survey to research the Covid-19 pandemic, the characteristics of those testing positive, and any associated inequalities.
This project used 10 datasets, including data from the ONS Secure Research Service, IQVIA, the Glasgow Lighthouse Laboratory, the UK BioSample Centre, the University of Oxford laboratory, and the UK Health Security Agency, to create the Covid-19 Infection Survey data.
The upscaling of the Covid-19 Infection Survey statistical pipeline was a collaboration between the statistical processing team, who oversaw the historic pipeline, a Python development team, and a proving team. These teams also liaised with external teams and parties, such as the Covid-19 Infection Survey methodology and analysis teams and the University of Oxford, as well as IQVIA, who provided the raw survey data.
The historic pipeline, written in Stata, was first expressed as language-independent logic so that it could be reimplemented in Python. This involved creating specifications for all aspects of the pipeline, parcelling the work into pre-defined releases to allow for a gradual transition to the new pipeline, and liaising with external parties to gather any new specifications or changes required. In total, the historic pipeline was parcelled into five main releases. The acceptance criterion for each release was to prove that the output from the new pipeline was identical to that of the historic pipeline or, where differences emerged, to provide evidence that the difference was an improvement. This process ensured results were not inadvertently changed or biased. The new Python-based pipeline was developed in an alternative secure network, the Data Access Platform (an internal ONS platform), because of its better processing capabilities.
For each release, the teams used a Kanban working style to pass specifications back and forth iteratively between the developers, the proving team, and the statistical processing team. Once a release had passed the proving stage, parallel runs were carried out in the production environment. Each parallel run involved running the new and historic pipelines side by side and comparing the outputs to confirm that the results matched. If a release passed this stage, it was moved into production, and the corresponding historic code and process within the ONS Secure Research Service were retired.
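The output comparison at the heart of the proving stage and the parallel runs can be illustrated with a minimal sketch. This is not the project's actual proving code: the function name `compare_outputs`, the column names (`visit_id`, `swab_result`, `age`) and the sample records are all hypothetical and invented for illustration. It assumes both pipelines write tabular output that shares a unique key column, and it uses only the Python standard library.

```python
import csv
import io


def compare_outputs(historic_rows, new_rows, key):
    """Compare two pipeline outputs row by row, matched on a key column.

    Returns the keys present on only one side, plus a list of
    (key, column, historic_value, new_value) tuples for differing cells.
    """
    historic = {row[key]: row for row in historic_rows}
    new = {row[key]: row for row in new_rows}

    only_in_historic = sorted(set(historic) - set(new))
    only_in_new = sorted(set(new) - set(historic))

    differences = []
    for k in sorted(set(historic) & set(new)):
        # Check every column that appears in either version of the row.
        for col in sorted(set(historic[k]) | set(new[k])):
            old_val = historic[k].get(col)
            new_val = new[k].get(col)
            if old_val != new_val:
                differences.append((k, col, old_val, new_val))

    return {
        "only_in_historic": only_in_historic,
        "only_in_new": only_in_new,
        "differences": differences,
    }


def read_csv_text(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up sample outputs (hypothetical columns and values).
HISTORIC_CSV = """visit_id,swab_result,age
V001,negative,34
V002,positive,51
V003,negative,27
"""

NEW_CSV = """visit_id,swab_result,age
V001,negative,34
V002,positive,51
V003,void,27
"""

report = compare_outputs(
    read_csv_text(HISTORIC_CSV), read_csv_text(NEW_CSV), key="visit_id"
)

# A release would pass only if nothing differs, or if every listed
# difference can be evidenced as an improvement.
outputs_identical = not any(report.values())
print(report["differences"])
```

In this sketch, any entry in the report would need to be reviewed and either fixed or documented as an intended improvement before the release could pass, mirroring the acceptance criterion described above.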
The researchers found that moving to the Python platform had several benefits. Automating the pipeline meant less human interaction was needed to run the code. The team also saw significant improvements in processing time (roughly three to four hours), as the Python pipeline ran considerably faster than the Stata pipeline.
The output of the new pipeline was also considerably smaller. Data compression currently saves roughly 50% of storage space compared with the equivalent output from the historic pipeline.
Several improvements were also made to the data cleaning and quality assurance processes. As a result, the quality of both new and historical data improved, enabling the team to produce higher-quality outputs.
The major impact of this project is the ability to continue producing and delivering the Covid-19 Infection Survey data in a timely manner. This ensures time-sensitive information is disseminated to government to inform decision-making and policymaking.
In the past twelve months, these data have informed government about regional and sub-regional infection rates, the Omicron variant, and antibody levels and immunity. These findings have directly influenced policy on the easing of lockdowns and the response to emerging variants.
Publications and reports
- ONS fortnightly antibodies report, November 2022: Coronavirus (COVID-19) Infection Survey, antibody data, UK: 2 November 2022
- ONS weekly bulletin, November 2022: Coronavirus (COVID-19) Infection Survey, UK: 18 November 2022
- ONS monthly bulletin, November 2022: Coronavirus (COVID-19) Infection Survey, characteristics of people testing positive for COVID-19, UK: 16 November 2022
- Welsh Government weekly releases, November 2022: Coronavirus (COVID-19) infection survey (positivity estimates)
- Scottish Government weekly releases, July 2022: Coronavirus (COVID-19): infection survey
- University of Cambridge MRC Biostatistics Unit, August 2022: Nowcasting and Forecasting of the COVID-19 Pandemic
- Oxford Academic Clinical Infectious Diseases article, August 2022: Omicron-Associated Changes in Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) Symptoms in the United Kingdom
- Oxford Academic Open Forum Infectious Diseases article, September 2022: Risk of Long COVID in People Infected With Severe Acute Respiratory Syndrome Coronavirus 2 After 2 Doses of a Coronavirus Disease 2019 Vaccine: Community-Based, Matched Cohort Study
- Nature Medicine article, February 2022: Antibody responses and correlates of protection in the general population after two doses of the ChAdOx1 or BNT162b2 vaccines
- MedRxiv preprint article, January 2022: The challenge of limited vaccine supplies: impact of prior infection on anti-spike IgG antibody trajectories after a single COVID-19 vaccination
- MedRxiv preprint article, January 2022: Lineage replacement and evolution captured by the United Kingdom Covid Infection Survey
- The Lancet Regional Health Europe, December 2021: Monitoring populations at increased risk for SARS-CoV-2 infection in the community using population-level demographic and behavioural surveillance
- The New England Journal of Medicine correspondence, December 2021: Tracking the Emergence of SARS-CoV-2 Alpha Variant in the United Kingdom
Blogs, news posts, and videos
- NatCen report, April 2022: Public Confidence in Official Statistics 2021
Presentations and awards
- Commended for Reproducibility, ONS Research Excellence Awards 2022
About the ONS Secure Research Service
The ONS Secure Research Service is an accredited trusted research environment, using the Five Safes Framework to provide secure access to de-identified, unpublished data. If you use ONS Secure Research Service data and would like to discuss writing a future case study with us, please ensure you have reported your outputs here: Outputs Reporting Form