Reproducibility is a cornerstone of data analysis: given the same data and methods, analysts should be able to reproduce one another’s findings. This is especially important when those analyses inform policies that directly affect people’s lives.
Reproducible analytical pipelines make analytical processes more transparent and results more accessible. They also support collaboration and make analysis more efficient by enabling code reuse and reducing duplicated effort.
ADR UK supports this approach by applying the FAIR Principles to improve the Findability, Accessibility, Interoperability and Reuse of data.
An introduction to statistical code
Statistical analysis is typically carried out using a programming language or statistical software package, such as R, Python, SPSS or Stata. These tools allow researchers to automate processes, re-run analyses, and document each analysis step.
The code written with these tools can be enormously helpful for showing how results were derived and for demonstrating reproducibility: someone else can run the same steps on the same data and obtain identical results.
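As a minimal sketch of what obtaining identical results can mean in practice, the Python snippet below fixes a random seed so that a bootstrap estimate comes out the same on every run. The function name `bootstrap_mean`, the seed value, and the sample data are hypothetical and chosen only for illustration.

```python
import random
import statistics


def bootstrap_mean(values, n_resamples=1000, seed=2024):
    """Return the average of the means of bootstrap resamples of `values`.

    A fixed seed makes the resampling repeatable, so two analysts running
    this function on the same data obtain the same number.
    """
    rng = random.Random(seed)  # local generator, so no hidden global state
    resample_means = []
    for _ in range(n_resamples):
        # Draw a sample of the same size, with replacement.
        resample = rng.choices(values, k=len(values))
        resample_means.append(statistics.fmean(resample))
    return statistics.fmean(resample_means)


# Hypothetical data, for illustration only.
weekly_hours = [35.0, 40.0, 37.5, 20.0, 45.0, 38.0]

first_run = bootstrap_mean(weekly_hours)
second_run = bootstrap_mean(weekly_hours)
print(first_run == second_run)  # same data, same method, same seed
```

Without a fixed seed, each run would draw different resamples and the two results would generally differ slightly, which makes it harder for another analyst to confirm a published figure.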
Analysts can share their code with other researchers so that those who are new to a dataset can build on existing code rather than starting from scratch. Sharing code also supports data cleaning work, allows variables that have already been derived to be reused, and helps to identify new methods and processes, which can spark conversations between researchers.
If you are not already a proficient coder, where should you start learning to code? The UK government and many data-oriented businesses have moved towards the use of open-source tools. R and Python, for example, are free to use, which makes it easier for others to run and reuse code written in them.
The section Writing good statistical code can help you to begin or continue your learning journey.
What is good code?
Good code follows several principles.
- Readable:
  - well documented
  - tidy and follows clean code conventions
  - includes a README file to help other researchers understand and use it.
- Modular:
  - split into smaller chunks or modules
  - uses functions or subroutines for repetitive code (more efficient).
- Employs version control:
  - makes errors easy to identify and correct
  - provides an audit trail of changes made to the code.
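The readable and modular principles above can be sketched in Python as follows. This is an illustrative example rather than a prescribed standard: the function names, the age bands, and the sample records are all hypothetical.

```python
def age_band(age):
    """Return a text label for the age band containing `age` (in years).

    The bands are hypothetical; a real project would take them from its
    data documentation.
    """
    if age < 0:
        raise ValueError(f"age must be non-negative, got {age}")
    if age < 16:
        return "0-15"
    if age < 65:
        return "16-64"
    return "65+"


def add_age_band(records):
    """Return copies of `records` with a derived 'age_band' field added.

    The input list and its dictionaries are left unchanged, so the step
    can be re-run safely.
    """
    return [{**record, "age_band": age_band(record["age"])} for record in records]


# Hypothetical records, for illustration only.
people = [{"id": 1, "age": 12}, {"id": 2, "age": 40}, {"id": 3, "age": 70}]
banded = add_age_band(people)
print([person["age_band"] for person in banded])
```

Keeping the banding rule in one small, documented function means it is written once, can be checked on its own, and can be reused by anyone who needs the same derived variable.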
Find out more in the next section Writing good statistical code. You can also read a blog on sharing good code.
Where can I share code?
Trusted research environments may have code sharing repositories, where researchers can view, create, and submit code. They can also provide feedback on other researchers’ code.
For instance, the ONS Secure Research Service (SRS) code sharing repository hosts ‘value-added’ code created and donated by researchers (see Code sharing in a trusted research environment). This code ranges from full pipelines to short snippets, but typically falls under one of the following categories:
- cleaning code
- preparatory work
- deriving new variables
- histories, e.g. employment histories.
Code sharing in a trusted research environment differs from code sharing more widely (such as through a public GitHub repository).
For example, all code shared within the ONS SRS code repository described above must be checked for statistical disclosure control, so that any accredited researcher with access to the SRS can access all of the code. Clearing code files is part of the standard output-checking procedure. Researchers can request that specific code be added to their project area.
For further information, consult Code sharing in a trusted research environment.