A Practical Introduction for Researchers
Associate Professor of AI
UWE Bristol
When an AI helps you analyse data, it often uses something called a “neural network” behind the scenes. Think of a neural network as a series of interconnected layers, much like how your brain processes information.
Here’s a simplified view of a neural network:
In a neural network, “parameters” are the internal variables that the network learns from the data. They are crucial because they determine how the network transforms the input data to produce the output.
During the AI’s “learning” process, these parameters (the network’s weights and biases) are constantly adjusted until the network can accurately perform the task (e.g., categorising data, making predictions).
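To make "weights and biases" concrete, here is a minimal sketch of a single tiny layer in plain Python. The numbers are arbitrary, hypothetical values picked for illustration, and the function names are made up for this example; real networks have millions or billions of these parameters and learn them from data rather than having them typed in.

```python
# A tiny fully connected layer written in plain Python (no libraries needed).
# All numbers below are hypothetical values chosen only for illustration.

def dense_layer(inputs, weights, biases):
    """Compute one output per neuron: sum(input * weight) + bias."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(x * w for x, w in zip(inputs, neuron_weights)) + bias
        outputs.append(total)
    return outputs


def count_parameters(weights, biases):
    """Parameters = every individual weight plus every bias."""
    return sum(len(row) for row in weights) + len(biases)


inputs = [1.0, 2.0, 3.0]         # three input values
weights = [[0.5, -1.0, 0.25],    # weights for neuron 1
           [1.0, 0.0, -0.5]]     # weights for neuron 2
biases = [0.1, -0.2]             # one bias per neuron

outputs = dense_layer(inputs, weights, biases)
n_params = count_parameters(weights, biases)

print("Layer outputs:", outputs)
print("Number of parameters:", n_params)
```

Changing any single weight or bias changes the outputs, which is exactly what the learning process exploits.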
Yes! Artificial Intelligence (AI) can help you analyse your research data, even if you have little to no programming experience.
Our Goal: To learn how to instruct an AI to perform data analysis tasks for us using Python.
We will cover:
When browsing models in LM Studio, you’ll encounter filenames that look like a jumble of letters and numbers. Deciphering these is key to selecting the right model for your needs.
Let’s take a common filename as an example:
llama-2-7b-chat.Q5_K_M.gguf
Each part tells us something important about the model.
llama-2-7b-chat
- llama-2: Indicates the base model architecture. This is Meta’s Llama 2. Other common ones include Mistral, Mixtral, Gemma, etc.
- 7b: Denotes the size of the model in billions of parameters. Larger models (e.g., 70b) are more capable but require more resources.
- chat: Specifies the fine-tuning. This model is trained for conversational AI. Other common suffixes include:
  - instruct: For instruction-following tasks.
  - code: Optimised for code generation.
.Q5_K_M
This is where it gets a bit more technical, but it’s crucial for performance vs. quality.
- Q5: Represents the quantisation level. This refers to roughly how many bits are used to store each parameter.
  - Lower numbers (e.g., Q2, Q3) mean smaller file sizes, faster inference, but lower quality.
  - Higher numbers (e.g., Q8) mean larger files, slower inference, but higher quality.
  - Q5 is often a good balance for general use.
- _K: Indicates a specific family of quantisation methods (the “k-quants”).
  - K (or _K_M, _K_S, _K_L): These are newer, more efficient quantisation methods in GGML/GGUF that aim to retain more information.
  - Older methods appear as Q4_0, Q5_0, etc., without the _K.
- _M: (In Q5_K_M) signifies a specific variation of the K quantisation, the “medium” level of a given K method (alongside _S for small and _L for large), offering a balance between quality and performance.
.gguf
- .gguf: This is the file format for models optimised to run efficiently on CPUs (and increasingly, GPUs) using the GGML library.
- Older models used .ggml or .bin files, but .gguf is the current and recommended format.
… Q5_K_M).
- For conversation: chat or instruct fine-tunes.
- For coding: code models.
- For general-purpose tasks: instruct or balanced base models.
- Start with Q5_K_M or Q4_K_M for a good balance.
- For higher quality, if your hardware allows: Q6_K or Q8_0.
- For limited hardware: Q3_K_M or Q2_K.
By understanding the components of a model filename in LM Studio, you can make informed decisions about:
Experiment with different quantisations to find the sweet spot for your system and tasks!
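If you would like to see the naming pattern as code, the sketch below splits a filename into the parts discussed above and makes a back-of-envelope size estimate. The function names are made up for this example, and it assumes the filename follows the name.quantisation.extension pattern of the example; not every model file does. The size estimate multiplies parameters by the nominal bit count only, so treat it as a rough lower bound: real files are usually somewhat larger.

```python
import re


def parse_model_filename(filename):
    """Split a name like 'llama-2-7b-chat.Q5_K_M.gguf' into its parts.

    Assumes exactly the pattern <name>.<quantisation>.<extension>.
    """
    stem, quant, extension = filename.rsplit(".", 2)
    # Look for a size tag such as '7b' or '70b' inside the model name.
    size_match = re.search(r"(\d+(?:\.\d+)?)b\b", stem, flags=re.IGNORECASE)
    # The digits after 'Q' give the nominal number of bits per parameter.
    bits_match = re.match(r"Q(\d+)", quant)
    return {
        "name": stem,
        "billions_of_parameters": float(size_match.group(1)) if size_match else None,
        "quantisation": quant,
        "nominal_bits": int(bits_match.group(1)) if bits_match else None,
        "format": extension,
    }


def rough_size_gb(billions_of_parameters, bits_per_parameter):
    """Back-of-envelope size: parameters x bits, converted to gigabytes."""
    total_bits = billions_of_parameters * 1e9 * bits_per_parameter
    return total_bits / 8 / 1e9


parts = parse_model_filename("llama-2-7b-chat.Q5_K_M.gguf")
estimate = rough_size_gb(parts["billions_of_parameters"], parts["nominal_bits"])

print(parts)
print(f"Rough size estimate: {estimate:.1f} GB")
```

Swapping in a Q8 or Q2 bit count shows quickly why the quantisation level matters so much for whether a model fits on your machine.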
Think of data analysis as a journey with four key steps. You can ask your AI to help with the Python code for every single one.
Data Cleaning & Preparation: The most important step! Getting your data tidy and ready for analysis (e.g., handling missing values).
Exploratory Data Analysis (EDA): Getting to know your data. Asking the AI to generate summary statistics and basic plots to find initial patterns.
Analysis & Modelling: Answering your core research questions. This could be comparing groups, looking for relationships, or even making predictions.
Visualisation & Interpretation: Creating clear, publication-quality charts and graphs to communicate your findings effectively.
Getting good code from an AI depends largely on how you ask. The key is to describe your goal and your data, not to try to write the code yourself.
Imagine you have a spreadsheet and you want to see what’s inside it.
A bad prompt:
This is too vague. The AI doesn’t know what kind of file, what you want to do with it, or what tools (libraries) to use.
A good prompt:
"I am a PhD student with no programming experience. I have a CSV file called `research_data.csv` that contains my experiment results.
Please write a simple Python script using the `pandas` library to:
1. Load the `research_data.csv` file.
2. Show me the first 5 rows so I can see what the columns and data look like."
This is great! It gives a role, context (research_data.csv), the specific tool (pandas), and a clear, step-by-step goal.
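For reference, a prompt like this might produce something along the lines of the sketch below. It is only one plausible answer, not what any particular AI will return. So that it runs without your real file, it reads from a small in-memory stand-in whose columns and values are hypothetical placeholders; with your own data you would pass "research_data.csv" instead.

```python
import io

import pandas as pd


def preview_csv(source, n_rows=5):
    """Load a CSV (a file path or a file-like object) and return its first rows."""
    data = pd.read_csv(source)
    return data.head(n_rows)


# Stand-in for research_data.csv so the example runs without the real file.
# These columns and values are hypothetical placeholders, not real results.
example_file = io.StringIO(
    "participant_id,score\n"
    "1,72\n"
    "2,85\n"
    "3,64\n"
    "4,90\n"
    "5,58\n"
    "6,77\n"
)

first_rows = preview_csv(example_file)
print(first_rows)

# With your own file you would write:
# first_rows = preview_csv("research_data.csv")
```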
For the AI to give you correct code, you must tell it about your data’s structure. Think of it as giving the AI your data’s CV or “data dictionary”.
Example prompt context:
"My CSV file has the following columns:
- `participant_id`: A unique number for each person.
- `treatment_group`: Text, either 'Control' or 'Test'.
- `response_time_sec`: The participant's response time in seconds, which is a number.
- `score`: The participant's score on a test, from 1 to 100."
Providing this stops the AI from guessing and helps it write much more accurate code.
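If you keep your data dictionary in one place, you can generate this prompt context instead of retyping it. The sketch below is one possible way to do that; the function name describe_columns is made up for this example, and the column descriptions are the ones from the example prompt above.

```python
# The data dictionary from the example prompt, stored as a Python dictionary.
data_dictionary = {
    "participant_id": "A unique number for each person.",
    "treatment_group": "Text, either 'Control' or 'Test'.",
    "response_time_sec": "The participant's response time in seconds, which is a number.",
    "score": "The participant's score on a test, from 1 to 100.",
}


def describe_columns(columns):
    """Turn a {column name: description} dictionary into prompt text."""
    lines = ["My CSV file has the following columns:"]
    for name, description in columns.items():
        lines.append(f"- `{name}`: {description}")
    return "\n".join(lines)


prompt_context = describe_columns(data_dictionary)
print(prompt_context)
```

You can paste the printed text straight into your prompt, and update the dictionary whenever your dataset changes.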
Goal: Use the AI to write Python code to load and clean a messy dataset.
Scenario: Below is the content of a file, synthetic_ashe-census_dataset.csv. Several of its numeric values are stored as text.
Your Task:
Ask your AI assistant to write a Python script that uses pandas to:
The data (synthetic_ashe-census_dataset.csv):
synth_ID,synth_sex,synth_ethpuk12_cen,synth_hlqpuk11_cen,synth_disability_cen,synth_lrespuk11_cen,synth_age_cen,synth_ecopuk_cen,synth_hours_cen,synth_health_cen,synth_mainlang_cen,synth_marstat_cen,synth_religion_cen,synth_carer_cen,synth_fmspuk11_cen,synth_dpcfamuk11_cen,synth_tenure,synth_pubpriv_ashe,synth_region,synth_occupation,synth_industry,synth_hbpay_ashe,synth_weight_ashe_cen,synth_weight_ashe
10000001,1,1,5,1,1,34,2,3,1,1,2,1,1,3,2,2,1,1,6,9,44.36,0.858283725,0.851008829
10000002,1,1,2,1,1,34,2,3,2,1,2,1,1,3,8,6,2,1,4,12,30.78,1.062945803,1.058440288
10000003,2,1,5,2,1,51,2,3,2,1,1,1,4,8,1,4,2,1,3,7,7.53,1.094525164,1.102092614
10000004,1,-99,5,1,1,36,2,3,1,1,1,2,1,3,1,5,2,1,9,1,37.03,0.858283725,0.85904896
Prompt Suggestion for your AI:
You are an expert data analyst who is helping a PhD student with no coding experience. I have the CSV data below.
Please write a complete Python script using the pandas library that:
1. Loads the data.
2. Identifies and counts any missing values.
4. Prints a statistical summary of the final, cleaned data.
Please add comments to the code to explain what each part does.
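To help you judge what your AI returns, here is a minimal sketch of a script covering the loading, missing-value counting and summary steps in the prompt. It is one possible answer, not the expected one. It pastes in the four rows shown above so that it runs without the file, and it assumes that the value -99 is a placeholder code for "missing"; that is a guess from the data, so check it against the dataset's documentation.

```python
import io

import pandas as pd

# The header and the four rows shown above, pasted in so the sketch runs
# without needing synthetic_ashe-census_dataset.csv on disk.
CSV_TEXT = """synth_ID,synth_sex,synth_ethpuk12_cen,synth_hlqpuk11_cen,synth_disability_cen,synth_lrespuk11_cen,synth_age_cen,synth_ecopuk_cen,synth_hours_cen,synth_health_cen,synth_mainlang_cen,synth_marstat_cen,synth_religion_cen,synth_carer_cen,synth_fmspuk11_cen,synth_dpcfamuk11_cen,synth_tenure,synth_pubpriv_ashe,synth_region,synth_occupation,synth_industry,synth_hbpay_ashe,synth_weight_ashe_cen,synth_weight_ashe
10000001,1,1,5,1,1,34,2,3,1,1,2,1,1,3,2,2,1,1,6,9,44.36,0.858283725,0.851008829
10000002,1,1,2,1,1,34,2,3,2,1,2,1,1,3,8,6,2,1,4,12,30.78,1.062945803,1.058440288
10000003,2,1,5,2,1,51,2,3,2,1,1,1,4,8,1,4,2,1,3,7,7.53,1.094525164,1.102092614
10000004,1,-99,5,1,1,36,2,3,1,1,1,2,1,3,1,5,2,1,9,1,37.03,0.858283725,0.85904896
"""

# ASSUMPTION: -99 is a code meaning "missing". Confirm this before relying on it.
MISSING_CODE = -99

# Load the data (with the real file: pd.read_csv("synthetic_ashe-census_dataset.csv")).
data = pd.read_csv(io.StringIO(CSV_TEXT))

# Make sure every column is numeric; anything that cannot be read as a
# number (for example stray text) becomes a missing value.
data = data.apply(pd.to_numeric, errors="coerce")

# Recode the assumed placeholder value as a proper missing value.
data = data.replace(MISSING_CODE, float("nan"))

# Count the missing values in each column.
missing_counts = data.isna().sum()
print("Missing values per column:")
print(missing_counts[missing_counts > 0])

# Print a statistical summary of the data.
summary = data.describe()
print(summary)
```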
Goal: To ask a research question and have the AI generate a plot to help answer it.
Scenario: Using the cleaned data from Exercise 1.
Your Research Question: “Is there a difference in the average score between the ‘Control’ and ‘Test’ groups?”
Your Task: Ask your AI assistant to extend the previous script to:
- Group the data by the synth_sex column.
- Calculate the average synth_tenure for each group.
- Create a chart using matplotlib or seaborn to visualise the comparison. The chart must have a title and labelled axes.
Prompt Suggestion for your AI:
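As a point of comparison for whatever your AI returns, the task above might be met by a script along these lines. This is a sketch under assumptions: it uses only the synth_sex and synth_tenure values from the four rows shown in Exercise 1, it saves the chart to a file named tenure_by_sex.png (a name chosen for this example), and it treats synth_tenure as a quantity. If synth_tenure is actually a category code, an average may not be meaningful, which is worth checking in the dataset's documentation.

```python
import matplotlib

matplotlib.use("Agg")  # draw to a file, so no display window is needed

import matplotlib.pyplot as plt
import pandas as pd

# synth_sex and synth_tenure values copied from the four rows in Exercise 1.
data = pd.DataFrame({
    "synth_sex": [1, 1, 2, 1],
    "synth_tenure": [2, 6, 4, 5],
})

# Group the rows by synth_sex and average synth_tenure within each group.
group_means = data.groupby("synth_sex")["synth_tenure"].mean()
print(group_means)

# Draw a bar chart with a title and labelled axes.
fig, ax = plt.subplots()
ax.bar(group_means.index.astype(str), group_means.values)
ax.set_title("Average synth_tenure by synth_sex")
ax.set_xlabel("synth_sex (coded value)")
ax.set_ylabel("Mean synth_tenure (coded value)")
fig.savefig("tenure_by_sex.png")
```

With only four rows the chart says nothing about the real question; the point is the shape of the code, which stays the same on the full dataset.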
AI is powerful, but you are still the scientist. You are responsible for your analysis.
The “context window” is the amount of text (input + output) an LLM can process at once. Managing it efficiently is crucial.
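A quick way to get a feel for this is to estimate how many tokens a prompt uses before you send it. The sketch below relies on a common rule of thumb of roughly four characters per token for English text; that ratio is an assumption, real tokenisers vary by model, and the window size and function names used here are hypothetical.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; a real tokeniser will give different counts."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(prompt, context_window, reserved_for_output):
    """True if the prompt plus the space reserved for the reply fits the window."""
    return estimate_tokens(prompt) + reserved_for_output <= context_window


# Hypothetical example: a long pasted prompt and a 2048-token window.
prompt = "Please summarise the columns in my CSV file. " * 100
needed = estimate_tokens(prompt)
ok = fits_in_context(prompt, context_window=2048, reserved_for_output=500)

print("Estimated prompt tokens:", needed)
print("Fits with 500 tokens reserved for the reply:", ok)
```

If the check fails, the usual remedies are to paste a smaller sample of your data, describe the columns instead of including every row, or start a fresh conversation.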
Running LLMs, especially large ones, consumes significant energy. We can make choices to reduce this impact.
Happy analysing!