Covid-19 Infection Survey – upscaling the statistical processing pipeline

This research used data made available via the Office for National Statistics (ONS) Secure Research Service, which is being expanded and improved with ADR UK funding.

Authors: Dr Charlotte Murphy (Office for National Statistics), University of Oxford, IQVIA, The Glasgow Lighthouse Laboratory, UK Health Security Agency, The University of Manchester and Wellcome Trust

Date: April 2020

Research summary

This research project involved updating the statistical processing pipeline for the Covid-19 Infection Survey. The update enabled critical data to be produced and analysed in a timely manner. This ensured time-sensitive information was disseminated to government during the Covid-19 pandemic, to inform critical decision-making and policymaking.

The Covid-19 Infection Survey was set up in April 2020 to estimate the prevalence of Covid-19 infection in the UK. The survey provided important information about the sociodemographic characteristics of the people and households who contracted Covid-19, their vaccination status, the presence of antibodies, and the symptoms experienced. A statistical pipeline was required to process and analyse critical data including swab results, antibody tests, and demographic information. As the study expanded, there was increasing demand for an updated and automated pipeline to resolve speed, memory, and storage-related issues.

The UK Government, Scottish Government, Welsh Government, and Northern Ireland Executive are the main users of the Covid-19 Infection Survey. They use it to track the Covid-19 pandemic and to help inform decisions about restrictions and related policies. Academics and health researchers also use the survey to research the Covid-19 pandemic, the characteristics of those testing positive, and any associated inequalities.

Data used

This project used 10 datasets, including data from the ONS Secure Research Service, IQVIA, the Glasgow Lighthouse laboratory, UK BioSample Centre, University of Oxford laboratory, and UK Health Security Agency to create the Covid-19 Infection Survey data.

Methods used

The upscaling of the Covid-19 Infection Survey statistical pipeline was a collaboration between the statistical processing team, who oversaw the historic pipeline, a Python development team, and a proving team. They also liaised with external teams and parties, such as the Covid-19 Infection Survey methodology and analysis teams and the University of Oxford, as well as IQVIA, who provided the raw survey data.

The historic pipeline (written in the Stata coding language) was first described as language-independent logic so that it could be reimplemented in Python. This involved creating specifications for all aspects of the pipeline, parcelling the work into pre-defined releases to allow for a gradual transition to the new pipeline, and liaising with external parties to gather any new specifications or changes required. In total, the historic pipeline was parcelled into five main releases. The acceptance criterion for each release was to prove that the output from the new pipeline was identical to that of the historic pipeline or, where differences emerged, to provide evidence that they were an improvement. This process ensured results were not inadvertently changed or biased. The new Python-based pipeline was developed in an alternative secure network, the Data Access Platform (an internal ONS platform), because of its better processing capabilities.
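The acceptance check described above can be illustrated with a minimal sketch. It is not the ONS proving code: the function names, the `participant_id` key, and the `swab_result` column are hypothetical, and both outputs are assumed to be small enough to hold in memory as lists of rows. The sketch matches records from the two outputs on a key and reports every difference, so that each one can either be fixed or documented as an improvement.

```python
from typing import Dict, List

Row = Dict[str, str]


def compare_outputs(historic: List[Row], new: List[Row], key: str) -> Dict[str, list]:
    """Compare two pipeline outputs record by record, matched on `key`."""
    historic_by_key = {row[key]: row for row in historic}
    new_by_key = {row[key]: row for row in new}

    report: Dict[str, list] = {
        # Records present in only one of the two outputs.
        "missing_in_new": sorted(historic_by_key.keys() - new_by_key.keys()),
        "extra_in_new": sorted(new_by_key.keys() - historic_by_key.keys()),
        # (key, column, historic value, new value) for every differing cell.
        "changed": [],
    }
    for record_key in sorted(historic_by_key.keys() & new_by_key.keys()):
        old_row = historic_by_key[record_key]
        new_row = new_by_key[record_key]
        for column in sorted(old_row.keys() | new_row.keys()):
            if old_row.get(column) != new_row.get(column):
                report["changed"].append(
                    (record_key, column, old_row.get(column), new_row.get(column))
                )
    return report


def is_identical(report: Dict[str, list]) -> bool:
    """True when no record is missing, extra, or changed."""
    return not any(report.values())


# Hypothetical example data, for illustration only.
historic_output = [
    {"participant_id": "A1", "swab_result": "negative"},
    {"participant_id": "A2", "swab_result": "positive"},
]
new_output = [
    {"participant_id": "A1", "swab_result": "negative"},
    {"participant_id": "A2", "swab_result": "Positive"},
]

report = compare_outputs(historic_output, new_output, key="participant_id")
if not is_identical(report):
    for record_key, column, old_value, new_value in report["changed"]:
        print(f"{record_key}.{column}: {old_value!r} -> {new_value!r}")
```

Reporting each differing cell, rather than a single pass or fail flag, reflects the criterion that a difference is acceptable only if it can be evidenced as an improvement.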

For each release, the teams used a Kanban-style workflow to pass specifications back and forth iteratively between the developers, the proving team, and the statistical processing team. Once a release had passed the proving stage, parallel runs were carried out in the production environment. Each parallel run involved running the new and historic pipelines side by side and comparing the outputs to confirm the results were consistent. If a release passed this stage, it was moved into production, and the corresponding historic code and process within the ONS Secure Research Service were retired.

Research findings

The researchers found that moving to Python had several benefits. Automating the pipeline meant less human interaction was needed to run the code. The team also saw a significant improvement in processing time (roughly three to four hours), as the Python pipeline ran considerably faster than the Stata pipeline.

The output of the new pipeline was also considerably smaller, saving storage space. Data compression currently saves roughly 50% of storage space relative to the equivalent output from the historic pipeline.

Several improvements were also made to the data cleaning and quality assurance processes. As a result, the quality of both new and historical data improved, enabling the team to provide higher-quality outputs.

Research impact

The major impact of this project is the ability to continue producing and delivering the Covid-19 Infection Survey data in a timely manner. This ensures time-sensitive information is disseminated to government to inform decision-making and policymaking.

In the past twelve months, this data has informed government on regional and sub-regional infection rates, the Omicron variant, and antibody levels and immunity. These insights have directly influenced policy on the easing of lockdowns and the response to emerging variants.

Research outputs

Publications and reports

Blogs, news posts, and videos

Presentations and awards

About the ONS Secure Research Service

The ONS Secure Research Service is an accredited trusted research environment, using the Five Safes Framework to provide secure access to de-identified, unpublished data. If you use ONS Secure Research Service data and would like to discuss writing a future case study with us, please ensure you have reported your outputs here: Outputs Reporting Form