This section contains content on coding skills, writing and sharing code, considering code quality, and good code citation.
A fundamental cornerstone of data analysis is reproducibility: given the same data and methods, analysts should be able to reproduce one another’s findings. This is especially important if those analyses will inform policies that directly impact people’s lives.
Reproducible analytical pipelines make analytical processes more transparent and results more accessible. They also support collaboration and increase the efficiency with which analyses can be carried out. They do this by enabling reuse of code and reducing duplication of effort.
ADR UK supports this approach by applying the FAIR Principles to improve the Findability, Accessibility, Interoperability and Reuse of data.
An introduction to statistical code
When performing statistical analysis, a programming language is typically used. Examples are R, Python, SPSS, and Stata. Programming tools allow researchers to automate processes, re-run analysis, and document analysis steps.
The code used in these programs can be enormously helpful for showing how results were derived from analysis and for demonstrating reproducibility, where someone else can follow the same steps and obtain identical results.
Analysts can share their code with other researchers so that those who are new to a dataset can make use of existing code to build their own code. Sharing code is also beneficial for supporting data cleaning work, reusing variables already derived, and helping to identify new methods and processes which can spark conversations between researchers.
If you are not already a proficient coder, where should you start learning to code? The UK government and many data-oriented businesses have moved towards the use of open-source tools. R and Python, for example, support ‘reusable’ activities.
The section Writing good statistical code can help you to begin or continue your learning journey.
What is good code?
Good code is:
- well documented
- tidy, following clean code conventions
- accompanied by a README file that helps other researchers understand and use it
- split into smaller chunks or modules
- written with functions or subroutines for repetitive code, which is more efficient
- kept under version control, which makes errors easier to identify and correct and provides an audit trail of changes made to the code.
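As a minimal sketch of two of these principles (documentation and using functions for repetitive code), the Python example below replaces a repeated calculation with one documented function. The function name `standardise` and the sample values are hypothetical and chosen only for illustration; they do not come from any ONS guidance.

```python
from statistics import mean, stdev


def standardise(values):
    """Return z-scores for a list of numbers.

    Each value has the sample mean subtracted and is then divided by
    the sample standard deviation.
    """
    centre = mean(values)
    spread = stdev(values)
    if spread == 0:
        raise ValueError("Cannot standardise values with zero variance.")
    return [(value - centre) / spread for value in values]


# Hypothetical example values: the same function is reused for each
# variable instead of repeating the calculation by hand.
hours_worked = [35.0, 40.0, 37.5, 20.0]
weekly_pay = [520.0, 610.0, 575.0, 300.0]

hours_z = standardise(hours_worked)
pay_z = standardise(weekly_pay)
```

Keeping the calculation in one place means a correction only has to be made once, and the docstring tells a new reader what the function does without reading its body.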
Where can I share code?
Trusted research environments may have code sharing repositories, where researchers can view, create, and submit code. They can also provide feedback on other researchers’ code.
For instance, the ONS Secure Research Service (SRS) code sharing repository hosts ‘value-added’ code created and donated by researchers (see Code sharing in a trusted research environment). This can vary from pipelines to code snippets, but typically would fall under one of the following categories:
- cleaning code
- preparatory work
- deriving new variables
- histories, for example employment histories.
Code sharing in a trusted research environment differs from code sharing more widely (such as through a public-facing GitHub).
For instance, all code shared within the ONS SRS code repository outlined above must be checked for statistical disclosure control. This is so that any accredited researcher with access to the SRS can have access to all code. The code file clearance procedure is part of the standard outputs checking procedure. Researchers can request specific code to be added to their projects area.
For further information, consult Code sharing in a trusted research environment.
Writing good statistical code
When we write code, we should expect that someone else will want to read, understand, and use parts of it. Sharing our code can help other researchers save time and be more efficient and collaborative.
Writing good quality code can feel daunting. Two simple things to start you off are to:
- follow a quality assurance checklist
- create a README file.
Examples of these are set out in Code sharing in a trusted research environment.
The ONS has curated guides to writing good statistical code.
They have also delivered training on coding, which can be viewed below.
Beginners can refer to the helpful guide Principles – Quality assurance of code for analysis and research. This comprehensive document covers topics such as code review, testing methodologies (unit, system, and integration testing), and documentation standards (code comments, docstrings, README files). The purpose of this resource is to help developers improve the quality and reliability of their code by following ONS-approved quality assurance practices.
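To make the ideas of docstrings and unit testing concrete, here is a small sketch using Python's built-in `unittest` module. The function `employment_rate` and its rules are hypothetical, written only for this illustration, and are not taken from the ONS guide.

```python
import unittest


def employment_rate(employed, population):
    """Return the employment rate as a percentage.

    Raises ValueError if the population is not positive or if the
    number employed falls outside the range 0 to population.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    if not 0 <= employed <= population:
        raise ValueError("employed must be between 0 and population")
    return 100 * employed / population


class TestEmploymentRate(unittest.TestCase):
    """Unit tests: each test checks one behaviour of the function."""

    def test_typical_values(self):
        self.assertAlmostEqual(employment_rate(30, 40), 75.0)

    def test_rejects_zero_population(self):
        with self.assertRaises(ValueError):
            employment_rate(0, 0)

    def test_rejects_employed_above_population(self):
        with self.assertRaises(ValueError):
            employment_rate(50, 40)
```

If this were saved in a file, the tests could be run from the command line with `python -m unittest` followed by the file name. Each test covers a single expectation, so a failure points directly at the behaviour that has changed.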
Using reproducible analytical pipelines
The UK's National Statistician, Professor Sir Ian Diamond, said when he was the Head of the ONS Analysis Function: “Reproducibility is the cornerstone of analysis. Analysts should get the same results as each other when using the same data and methods.”
This and more is outlined in the Reproducible Analytical Pipelines (RAP) strategy, created by the UK Civil Service Analysis Function. It provides information on implementing reproducible analytical pipelines, and notes that training and capacity building are required for creating RAPs. It also highlights best practices and guidance for open-source analysis. Its purpose is to enable learners to perform analysis in line with data management, code versioning, documentation, packaging, code style, and input data validation.
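Input data validation, one of the practices listed above, can be sketched in a few lines of Python. The example below is an illustration under stated assumptions rather than a method from the RAP strategy: the function `validate_records`, the field names, and the age range are all hypothetical.

```python
def validate_records(records, required_fields, age_range=(0, 120)):
    """Return a list of human-readable problems found in the input records."""
    problems = []
    low, high = age_range
    for row_number, record in enumerate(records, start=1):
        # Check that every required field is present and not empty.
        for field in required_fields:
            if record.get(field) in (None, ""):
                problems.append(f"Row {row_number}: missing value for '{field}'")
        # Check that numeric ages fall inside the expected range.
        age = record.get("age")
        if isinstance(age, (int, float)) and not low <= age <= high:
            problems.append(f"Row {row_number}: age {age} outside {low} to {high}")
    return problems


# Hypothetical input data for illustration only.
records = [
    {"person_id": "A1", "age": 34, "sex": "F"},
    {"person_id": "A2", "age": 150, "sex": "M"},
    {"person_id": "", "age": 41, "sex": "F"},
]

problems = validate_records(records, required_fields=["person_id", "age", "sex"])
for problem in problems:
    print(problem)
```

Running the checks before any analysis step means that problems in the input are reported once, at the start of the pipeline, rather than surfacing later as unexplained results.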
To find out more on the principles, practices, and training around reproducible and open research, explore the links below:
- The UK Reproducibility Network is a peer-led network of academic researchers championing open research principles and training for UK-based researchers. Their ‘primer’ series provides short video introductions to key concepts in reproducible and open research across disciplines.
- The Turing Way: A Handbook for Reproducible Data Science is a comprehensive guide that is suitable for beginners, covering the importance and benefits of reproducibility in scientific research. It provides valuable information on version control and offers a command dictionary complemented by practical advice on establishing good practices. The resource also outlines best practices in research data management (FAIR principles, formatting, toolkits), structuring repositories, and accessibility and collaboration guidelines, as well as topics related to ethical aspects of data science.
- The Framework for Open and Reproducible Research Training website provides teaching resources for reproducible research.
Code sharing in a trusted research environment
Some trusted research environments have a code sharing repository, which holds code submitted by researchers. Useful, value-added code can take a variety of forms, including:
- cleaning code
- preparatory work
- deriving new variables.
All code added to a repository is checked for statistical disclosure control. The code file clearance process is part of the standard outputs checking procedure.
Code sharing best practices
This section sets out the process for researchers who wish to share their code. It includes information you need to provide and processes you should follow.
As a researcher your responsibilities are to:
- ensure you provide a README file for your code
- ensure your code is not malicious or disclosive in any way
- work with the relevant team(s) to make any changes where advised
- follow coding best practices, so that code is readable and understandable
- ensure additional materials (such as look up tables and linked code) are shared through the correct processes.
The relevant team at the trusted research environment may be able to:
- point you to best practices guidance and templates
- ensure that the code has a sensible folder structure and file names
- check that your code has key documentation to assist future researchers using the shared code (such as annotation, software versions, and a README file).
The team are not responsible for:
- ensuring the code fully functions
- ensuring the validity of research or outputs.
Code quality assurance
There are many different approaches to creating and writing code. This quality assurance checklist, created by the ONS, aligns best practice and standards across various languages. The document is designed for self-evaluation against the categories included.
Good code should contain a README file. This sets out how the code should be used, any additional packages or data the user may require, and how to attribute the code and/or data.
In the ONS SRS code repository, a sample README document is provided. A .md (Markdown) file and an .Rmd (R Markdown) file are also included within the SRS code repository for you to copy and edit as you wish.
It is recommended to edit these files within RStudio, which requires no knowledge of R. A brief tutorial can be found at Creating a README file in RMarkdown.
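For readers who prefer to see the shape of a README before opening a template, the Python sketch below assembles a short Markdown README covering the three items described above: how to use the code, what packages or data are needed, and how to attribute it. This is not the ONS sample README; the function `build_readme`, the section headings, and every value passed to it are hypothetical placeholders.

```python
def build_readme(title, usage, requirements, attribution):
    """Return the text of a short Markdown README."""
    requirement_lines = "\n".join(f"- {item}" for item in requirements)
    return (
        f"# {title}\n\n"
        f"## How to use this code\n\n{usage}\n\n"
        f"## Required packages and data\n\n{requirement_lines}\n\n"
        f"## How to cite\n\n{attribution}\n"
    )


# All values below are hypothetical placeholders.
readme_text = build_readme(
    title="Example cleaning code",
    usage="Run clean_data.py before any analysis scripts.",
    requirements=["Python 3.10 or later", "Access to the relevant dataset"],
    attribution="Please cite the authors and the DOI once one has been assigned.",
)

print(readme_text)
```

The printed text could be saved as a README.md file alongside the code and then edited by hand.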
From time to time, you may wish to update your code. This may be because you have improved it, made a method more efficient, or rectified errors.
This updated code must go through the correct procedures so that disclosure risk remains low and the code documentation can be assessed.
Researchers should inform the relevant team(s) at the trusted research environment if they wish to be informed when another researcher has found errors in or made suggestions to their code. If errors are found and the researcher is not contactable, the code submission may be removed.
Citing code with a digital object identifier
A digital object identifier (DOI) is a persistent identifier for documents, datasets, and other research outputs such as code. The unique identifier is attached to the code, and a metadata record is maintained. The record points to the most recent URL (web address) of the publication and holds information about the author, publication date, and other details researchers need to cite the work.
You may be able to source a metadata record and DOI for your code through your organisation. Economic and Social Research Council researchers can use the UK Data Service’s Reshare repository.
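To illustrate how a metadata record supports citation, the Python sketch below turns a small record into a one-line citation that ends with a DOI link. The class `CodeRecord`, the function `format_citation`, the citation layout, and all values in the record (including the DOI) are hypothetical; your organisation or repository may require a different style.

```python
from dataclasses import dataclass


@dataclass
class CodeRecord:
    """Minimal metadata needed to cite a piece of shared code."""

    authors: list
    year: int
    title: str
    publisher: str
    doi: str


def format_citation(record):
    """Return a one-line citation ending with a DOI link."""
    author_text = "; ".join(record.authors)
    return (
        f"{author_text} ({record.year}). {record.title} [code]. "
        f"{record.publisher}. https://doi.org/{record.doi}"
    )


# Hypothetical record: the authors, title, publisher, and DOI are
# placeholders and do not refer to a real deposit.
record = CodeRecord(
    authors=["Researcher, A.", "Analyst, B."],
    year=2024,
    title="Example cleaning code",
    publisher="Example Repository",
    doi="10.1234/example",
)

print(format_citation(record))
```

Because the DOI stays the same even if the web address of the code changes, a citation built this way continues to lead readers to the current location.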
The ONS Secure Research Service has written a range of articles and blogs on code sharing:
- Sharing good code: a researcher's reflection
- Sharing researcher-generated code and value-added documentation in a trusted research environment.
There are also some links from the ONS Secure Research Service you might find useful:
Available shared code
This page will be used to display information about shared code that is available in one of ADR UK’s partner TREs and will be added to as more code becomes available.
ONS Secure Research Service
All shared code can be viewed in the ONS Secure Research Service (SRS) code repository. The repository can be accessed in the SRS via a shortcut. You can view any folder and copy it to your own workspace for use or interrogation.
This table contains information about available code.
| Code share title | Relevant datasets | Short description | Language | Authors | DOI | Case study (if applicable) |
|---|---|---|---|---|---|---|
| Code to support ASHE linked to 2011 Census | Annual Survey of Hours and Earnings (ASHE) linked to 2011 Census - England and Wales | Code supports Phase 1 of the linking process matched on an employee's name, sex, age, and residential postcode | Stata | Whittard, Ritchie | TBC | Blog: New linked dataset available to provide insights into earnings and employment in Britain |
Report an issue with code
If you encounter an issue with code, email email@example.com. Include a clear description of the issue, point to the relevant section(s) of the code affected, and state its impact on functionality and project dependencies.