Getting started with LEO: studying the post-16 activities of young people
23 October 2025
Dave Thomson is Chief Statistician at FFT Education and one of the co-investigators on the Youth Transitions Community Catalyst. He specialises in analysing administrative data related to education in England and has been leading work to map data sources and develop training materials for the project.
In this blog, Dave explores how researchers can use the Longitudinal Education Outcomes (LEO) dataset to study the post-16 activities of young people - from education and training to employment - and shares practical insights from working with this powerful but complex data resource.
Over the last five years, researchers studying education in England have increasingly drawn on a relatively new data resource, Longitudinal Education Outcomes (LEO). LEO links together multiple administrative datasets – from school histories in the National Pupil Database, to employment and earnings data from HM Revenue & Customs, and benefits data from the Department for Work and Pensions.
Perhaps its most widely known use has been to calculate the returns to higher education. But the potential goes far beyond that. Among many other things, LEO can be used to explore post-16 pathways, transitions into (and out of) education, training and work, and long-term outcomes for different groups of young people.
For example:
- We’ve used it to study the impact of raising the participation age (the age up to which young people are required to continue learning)
- We’ve examined post-16 pathways for lower attainers
- Colleagues at Impetus-PEF have used it in their youth jobs gap analysis, which focuses on young people not in education, employment or training (NEET).
The challenges of working with LEO
The strength of LEO is its scale. With (almost) complete coverage of the state-educated population in England since 2002, it allows researchers to study small groups in a way that would be impossible with survey data. But that same scale brings challenges:
- Size and complexity: the datasets are vast, requiring significant processing power and storage.
- Fragmentation: post-16 education data is spread across several sources - the School Census, the Individualised Learner Record (ILR) and the HESA student record. Employment and benefits data are held separately, as is the NCCIS dataset of post-16 activities recorded by local authorities.
- Sparse documentation: guidance is limited, which makes it harder for new researchers to get started.
In short, even once you’ve navigated the application process, the bar to using LEO is high.
A resource to help: the Youth Transitions Hub
To lower that barrier, we’ve put together a GitHub code repository and wiki. The repository contains the code we routinely use to take raw LEO data and transform it into a more usable form.
What it includes:
- Data extraction: pulling out the bits you actually need and leaving the rest
- Data cleaning: fixing inconsistencies and standardising formats
- Reshaping: structuring the data so it can be queried more easily
- Indicators:
  - yearly and monthly summaries of each person’s post-16 activities
  - “pathways” showing the routes learners take through school and further education.
Because LEO is provided in a SQL Server database, the code is written in T-SQL. We know not everyone will be familiar with this, so we’ve added detailed comments explaining what each script does.
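To give a flavour of what a monthly summary involves, here is a minimal sketch in Python rather than T-SQL. It is not the repository's code: the spell records, field layout, activity labels and function names are all made up for illustration and do not correspond to real LEO tables. It simply expands each activity spell into the calendar months it touches, then lists each person's activities month by month.

```python
from collections import defaultdict
from datetime import date

# Hypothetical spell records: (person_id, activity, start, end).
# These are invented for illustration and are not real LEO fields or values.
spells = [
    ("P1", "school", date(2018, 9, 1), date(2019, 7, 31)),
    ("P1", "employment", date(2019, 7, 15), date(2019, 10, 31)),
    ("P2", "further_education", date(2018, 9, 1), date(2018, 12, 20)),
]


def months_between(start, end):
    """Yield (year, month) for every calendar month a spell touches."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def monthly_activities(spells):
    """Map (person_id, year, month) to a sorted list of activities in that month."""
    summary = defaultdict(set)
    for person_id, activity, start, end in spells:
        for year, month in months_between(start, end):
            summary[(person_id, year, month)].add(activity)
    return {key: sorted(activities) for key, activities in summary.items()}


monthly = monthly_activities(spells)
```

In this made-up example, person P1 has two overlapping spells in July 2019, so `monthly[("P1", 2019, 7)]` lists both activities. A real summary would also need a rule for deciding which activity counts as the main one in months like this.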
If you use the repository and run into difficulties, we’d love to hear from you - just head to the discussion page and get in touch.
For more insights from the Youth Transitions Community Catalyst project, join our LinkedIn group.