Sharing good code: a researcher’s reflection
In this blog, Hannah Hodge Waller of the Office for National Statistics (ONS) Secure Research Service discusses the process of creating and contributing code to a new managed code repository within the Secure Research Service. She interviews Dr Van Phan about how her team’s code was used in the pilot of this repository. Hannah was the Senior Statistical Code Sharing Manager for the Secure Research Service, and Van is a researcher for the ADR UK-supported Wage and Employment Dynamics project.
Over the last 18 months, we’ve been hard at work planning and creating a shared code repository within the Secure Research Service.
While many community- and government-led policies have helped to open up access to data, other resources that researchers might use, such as code that others have created, can also be very helpful. Sharing code increases transparency and accountability in research and reduces duplication.
The Code Sharing Repository holds output-checked code, submitted by researchers using the Secure Research Service. Some code can act as building blocks, for example by creating new derived variables that are not available in the original dataset. Ultimately, all code in the repository should add value to existing or future research.
The ONS Secure Research Service piloted this repository and worked with code created by the ADR UK-funded Wage and Employment Dynamics data enhancement and linkage project. I caught up with Dr Van Phan from the project to talk about her experience of preparing code for sharing.
An interview with Dr Van Phan
Could you tell me a bit about the Wage and Employment Dynamics project?
Wage and Employment Dynamics aims to enable a better understanding of the dynamics of earnings and employment in Great Britain by linking major datasets from different government bodies.
In phase 1, we explored links between the Annual Survey of Hours and Earnings (ASHE) and 2011 Census data. For phase 2, we’ll look at links between ASHE, Pay As You Earn Real Time Information, and HM Revenue and Customs’ Self-Assessment data. Phase 3 will analyse ASHE and the Migrant Worker Scan data.
Together, these phases provide a trio of linked datasets and projects. The result can be used as a sustainable wage and employment spine (forming a research-ready dataset) for wider groups of users, such as academics and policymakers.
What code has your team written and shared so far?
At this stage, the Wage and Employment Dynamics team has shared our code for phase 1, which has two basic functions:
- First, to transform and enhance the current version of ASHE by adding extra variables which we think may be helpful for research. We also added extra weights to deal with the cross-sectional non-response issues, and longitudinal sampling issues.
- Second, to incorporate this enriched version of ASHE with the linked Census-ASHE dataset and construct extra weights to remove any linkage biases.
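To give a flavour of what a non-response weight is, here is a minimal sketch in Python. It is not the Wage and Employment Dynamics team's method, which the interview does not describe. It assumes a simple cell-based adjustment in which each responding unit is weighted by the inverse of the response rate in its cell. The function name and the cell labels are hypothetical.

```python
from collections import Counter


def nonresponse_weights(sampled_cells, responded_cells):
    """Return a weight per cell: units sampled / units that responded.

    Illustrative assumption: each unit belongs to one cell (for example
    a region or an age band), and respondents within a cell stand in for
    the non-respondents in that same cell.
    Cells with no respondents are left out, because no weight can be
    computed for them.
    """
    sampled = Counter(sampled_cells)
    responded = Counter(responded_cells)
    return {cell: sampled[cell] / responded[cell] for cell in responded}


# Hypothetical toy data: cell "A" has 4 sampled units and 2 respondents,
# cell "B" has 6 sampled units and 6 respondents.
weights = nonresponse_weights(
    ["A"] * 4 + ["B"] * 6,
    ["A"] * 2 + ["B"] * 6,
)
```

In this toy example each respondent in cell "A" counts for two sampled units, while respondents in cell "B" keep a weight of one.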
This was definitely a complex and challenging example of code to use for our first pilot case. Did you have to change anything in your research processes to create sharable code?
We didn’t change too much. The code was initially set up by Professor Felix Ritchie, who is an experienced data analyst and an expert in data privacy and compliance. We provided additional annotations for each block of code in our files, which we hope will make it easier for users to follow and understand.
It was really great to work with you to make and implement suggestions to the code. What changes did you end up implementing?
The Statistical Code Sharing team helped us to review our initial code and suggested protecting the driver directory and controlling the version of each sub-file. We added an additional file to manage these. We also used the newly available Secure Research Service template to create a README file, which, alongside our data documentation, helps future users of the code.
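One way to picture a single file that manages the directory and the sub-file versions is the sketch below. It is only an illustration of the general idea, written in Python. The interview does not say what language or layout the team used, and every name here (ROOT_DIR, SubFile, the file names and version numbers) is hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SubFile:
    """One script in the project, with the version the project expects."""
    name: str
    version: str


# Hypothetical settings, kept in one place so that users change the root
# directory here only and never inside the individual sub-files.
ROOT_DIR = Path("project_root")
SUB_FILES = (
    SubFile("01_prepare_data.py", "1.0"),
    SubFile("02_build_weights.py", "1.2"),
)


def resolve(sub_file, root=ROOT_DIR):
    """Build the full path to a sub-file from the single root setting."""
    return root / sub_file.name


def describe(sub_files=SUB_FILES):
    """List each sub-file with its expected version, e.g. for a README."""
    return [f"{sf.name} (v{sf.version})" for sf in sub_files]
```

Keeping these settings in one file means that someone reusing the code has one obvious place to adjust, and can see which version of each script belongs together.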
What do you think are the main benefits to sharing your code and how do you think people will use it?
By adding extra variables and weights, we hope that future researchers across sectors will save time in their work and avoid having to duplicate this code.
Researchers from the Low Pay Commission are already using our code to conduct analysis for an upcoming report, and researchers from the Joseph Rowntree Foundation have also requested our code.
It's great to hear that people are using the code already! What advice would you give to other researchers wanting to share code?
I see these things as being important:
- The code should be standardised and well annotated, so people find it easy to follow and understand
- Make sure that the code and annotations do not reveal anything identifiable, including aggregated results
- Prepare a README file to help people understand and use the code
- Make sure that correct disclaimers are added to your code files and to the README files, being clear about the code’s limitations
- Work closely with other teams, including the data owners and support teams.
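Some of the points above can be partly checked before code is submitted. The sketch below, in Python, looks for a README and for a disclaimer in each file. It is a hypothetical helper, not part of the Secure Research Service process, and the "DISCLAIMER" marker is an assumption. It cannot tell whether code or comments reveal anything identifiable, so it is no substitute for output checking.

```python
# Hypothetical marker; a real project would agree its own wording.
DISCLAIMER_MARKER = "DISCLAIMER"


def check_submission(files):
    """Return a list of problems found in a set of files.

    `files` maps each file name to its text content. The check only
    covers two things: that a README is present, and that every file
    mentions a disclaimer (case-insensitive).
    """
    problems = []
    if not any(name.lower().startswith("readme") for name in files):
        problems.append("No README file found")
    for name, text in sorted(files.items()):
        if DISCLAIMER_MARKER not in text.upper():
            problems.append(f"{name}: no disclaimer found")
    return problems


# Hypothetical example: the README carries a disclaimer, the script does not.
example = {
    "README.md": "DISCLAIMER: this code is provided as is.",
    "analysis.py": "# builds the derived variables",
}
report = check_submission(example)
```

An empty list means both checks passed. Any message in the list points to a file that needs attention before the code goes forward for review.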
We have been working to make the Code Sharing Repository available to all researchers using the ONS Secure Research Service. The Code Sharing Drive will appear as a mapped drive within the Secure Research Service.
We have also set up a process for researchers to contribute their own code, where they think it will be useful to others.
Best practice guidance around writing effective and sharable code, including exemplar README files, is also being made available on the upcoming ADR UK Learning Hub website this autumn, so do watch this space!