Coding with AI

Wait, What?

Yes! Artificial Intelligence (AI) can help you analyse your research data, even if you have little to no programming experience.

Our Goal: To learn how to instruct an AI to perform data analysis tasks for us using Python.

We will cover:

The AI-powered data analytics workflow.
How to ask the AI for code (prompting).
Practical, hands-on exercises with Python.
Crucial ethical considerations for researchers.

Choosing an Appropriate Model with LM Studio

Understanding Model Filenames

When browsing models in LM Studio, you’ll encounter filenames that look like a jumble of letters and numbers. Deciphering these is key to selecting the right model for your needs.

Deconstructing the Filename: An Example

Let’s take a common filename as an example:

llama-2-7b-chat.Q5_K_M.gguf

Each part tells us something important about the model.

Filename Component 1: Model Family & Size

`llama-2-7b-chat`

llama-2: Indicates the base model architecture. This is Meta’s Llama 2. Other common ones include Mistral, Mixtral, Gemma, etc.
7b: Denotes the size of the model in billions of parameters. Larger models (e.g., 70b) are more capable but require more resources.
chat: Specifies the fine-tuning. This model is trained for conversational AI. Other common suffixes include:
- instruct: For instruction-following tasks.
- code: Optimised for code generation.
- (No suffix): Often a base model, less suitable for direct use without fine-tuning.

Filename Component 2: Quantisation & Method

`.Q5_K_M`

This is where it gets a bit more technical, but it’s crucial for performance vs. quality.

Q5: Represents the quantisation level. This refers to how many bits are used to store each parameter.
- Lower numbers (e.g., Q2, Q3) mean smaller file sizes, faster inference, but lower quality.
- Higher numbers (e.g., Q8) mean larger files, slower inference, but higher quality.
- Q5 is often a good balance for general use.
_K: Indicates the quantisation method, here one of the “k-quant” schemes.
- K (or _K_M, _K_S, _K_L): These are newer, more efficient quantisation methods in GGML/GGUF that aim to retain more information at the same file size.
- Older methods might just show Q4_0, Q5_0, etc., without the _K.
_M: (In Q5_K_M) signifies a variant of the K quantisation. S, M, and L stand for small, medium, and large; M offers a balance between quality and performance.
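To get a feel for why fewer bits means lower quality, here is a minimal sketch of plain uniform quantisation. It is a deliberately simplified stand-in, not the actual K-quant scheme used in GGUF files, and the random `weights` list is made-up data standing in for real model parameters.

```python
import random


def quantise(values, bits):
    """Round each value to one of 2**bits evenly spaced levels (uniform quantisation)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def mean_abs_error(original, approx):
    """Average distance between the original values and their rounded versions."""
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


random.seed(0)
# Made-up stand-in for model weights: 10,000 values drawn from a normal distribution.
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

errors = {bits: mean_abs_error(weights, quantise(weights, bits)) for bits in (2, 4, 8)}
for bits, err in errors.items():
    print(f"{bits}-bit: mean absolute error = {err:.4f}")
```

The error shrinks as the bit count grows, which is the same trade-off the Q2 to Q8 labels describe: fewer bits save space but lose detail.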

Filename Component 3: File Format

`.gguf`

.gguf: This is the file format for models optimised to run efficiently on CPUs (and increasingly, GPUs) using the GGML library.
- It’s the standard format for LM Studio and offers good compatibility and performance.
- You might occasionally see older .ggml or .bin files, but .gguf is the current and recommended format.
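The three filename components can also be pulled apart in code. The sketch below is only an illustration: `parse_model_filename` and `KNOWN_FINE_TUNES` are hypothetical names, and the function assumes the `name.QUANT.format` pattern of the example above, which not every model file follows.

```python
import re

# Fine-tune suffixes mentioned in this section; real filenames may use others.
KNOWN_FINE_TUNES = {"chat", "instruct", "code"}


def parse_model_filename(filename):
    """Split a name like 'llama-2-7b-chat.Q5_K_M.gguf' into its parts."""
    try:
        stem, quant, file_format = filename.rsplit(".", 2)
    except ValueError:
        raise ValueError(f"Expected 'name.QUANT.format', got {filename!r}") from None

    # Model size: a number directly followed by 'b', e.g. '7b' or '70b'.
    size = re.search(r"(\d+(?:\.\d+)?)b\b", stem, flags=re.IGNORECASE)
    tokens = stem.lower().split("-")
    # Quantisation: 'Q' + bits, optional '_K', optional variant such as 'M' or '0'.
    quant_match = re.fullmatch(r"Q(\d+)(?:_(K))?(?:_(\w+))?", quant)

    return {
        "family": stem[: size.start()].rstrip("-") if size else stem,
        "size_billions": float(size.group(1)) if size else None,
        "fine_tune": next((t for t in tokens if t in KNOWN_FINE_TUNES), None),
        "bits": int(quant_match.group(1)) if quant_match else None,
        "k_quant": bool(quant_match and quant_match.group(2)),
        "variant": quant_match.group(3) if quant_match else None,
        "format": file_format,
    }


info = parse_model_filename("llama-2-7b-chat.Q5_K_M.gguf")
for key, value in info.items():
    print(f"{key}: {value}")
```

A `fine_tune` of `None` means no known suffix was found, which often (but not always) indicates a base model.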

Choosing the Right Model: Key Considerations

Your Hardware:
- RAM: Dictates the maximum model size you can load (e.g., a 7B model at Q5_K_M often needs ~8GB of RAM).
- GPU VRAM: If you have an NVIDIA GPU, VRAM allows for faster inference, especially for larger models or when offloading layers.
Your Use Case:
- Chatbot/Conversational: Look for chat or instruct fine-tunes.
- Code Generation: Seek out code models.
- Creative Writing/General Text: instruct or balanced base models.
Performance vs. Quality:
- Start with Q5_K_M or Q4_K_M for a good balance.
- If quality is paramount and you have resources, try Q6_K or Q8_0.
- If performance or limited resources are key, explore Q3_K_M or Q2_K.
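A quick back-of-the-envelope calculation can help when matching a model to your hardware. The sketch below multiplies the parameter count by the nominal bit count from the filename. Treat the result as a rough lower bound: the function name is hypothetical, the real formats store some extra scaling information per block of weights, and you also need memory for the context window and for the rest of your system.

```python
def estimate_weights_gb(params_billions, bits_per_weight):
    """Rough size of the stored weights in gigabytes: parameters x bits / 8."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal bits taken from the digit in each quantisation label (an approximation).
nominal_bits = {"Q2_K": 2, "Q3_K_M": 3, "Q4_K_M": 4, "Q5_K_M": 5, "Q6_K": 6, "Q8_0": 8}

for label, bits in nominal_bits.items():
    size_gb = estimate_weights_gb(7, bits)
    print(f"7B at {label}: at least {size_gb:.1f} GB for the weights alone")
```

If the estimate is already close to your available RAM, pick a smaller model or a lower quantisation level.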

In Summary

By understanding the components of a model filename in LM Studio, you can make informed decisions about:

Model Architecture & Capabilities
Performance Characteristics (Speed & Resource Usage)
Quality & Accuracy

Experiment with different quantisations to find the sweet spot for your system and tasks!

The AI-Powered Analytics Workflow

Think of data analysis as a journey with four key steps. You can ask your AI to help with the Python code for every single one.

Data Cleaning & Preparation: The most important step! Getting your data tidy and ready for analysis (e.g., handling missing values).
Exploratory Data Analysis (EDA): Getting to know your data. Asking the AI to generate summary statistics and basic plots to find initial patterns.
Analysis & Modelling: Answering your core research questions. This could be comparing groups, looking for relationships, or even making predictions.
Visualisation & Interpretation: Creating clear, publication-quality charts and graphs to communicate your findings effectively.

How to ‘Talk’ to an AI About Data

Getting good code from an AI depends largely on how you ask. The key is to describe your goal and your data, not to try to write the code yourself.

Prompt Example: Bad vs Good

Imagine you have a spreadsheet and you want to see what’s inside it.

A bad prompt:

"Write Python code to read a file."

This is too vague. The AI doesn’t know what kind of file, what you want to do with it, or what tools (libraries) to use.

A good prompt:

"I am a PhD student with no programming experience. I have a CSV file called `research_data.csv` that contains my experiment results.

Please write a simple Python script using the `pandas` library to:

1. Load the `research_data.csv` file.
2. Show me the first 5 rows so I can see what the columns and data look like."

This is great! It gives a role, context (research_data.csv), the specific tool (pandas), and a clear, step-by-step goal.
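For orientation, the script an AI returns for the good prompt might look something like the sketch below. The exact output will differ between models; the helper name `preview_csv` is hypothetical, and the script assumes `research_data.csv` sits in the same folder.

```python
from pathlib import Path

import pandas as pd


def preview_csv(path, n_rows=5):
    """Load a CSV file into a DataFrame and return its first n_rows rows."""
    df = pd.read_csv(path)
    return df.head(n_rows)


if __name__ == "__main__":
    csv_path = Path("research_data.csv")  # the file named in the prompt
    if csv_path.exists():
        print(preview_csv(csv_path))
    else:
        print(f"{csv_path} not found; put it in the same folder as this script.")
```

Notice how each line maps back to a numbered step in the prompt, which makes the result easy to check.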

Giving Context: Your Data’s “CV”

For the AI to give you correct code, you must tell it about your data’s structure. Think of it as giving the AI your data’s CV or “data dictionary”.

Example prompt context:

"My CSV file has the following columns:
- `participant_id`: A unique number for each person.
- `treatment_group`: Text, either 'Control' or 'Test'.
- `response_time_sec`: The participant's response time in seconds, which is a number.
- `score`: The participant's score on a test, from 1 to 100."

Providing this stops the AI from guessing and helps it write much more accurate code.
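If your data is already loaded in pandas, a first draft of this “CV” can be generated for you. The sketch below is one possible approach: `describe_columns` is a hypothetical helper, and the two rows in `sample` are invented purely to make the example runnable. You would still add the plain-language meaning of each column by hand.

```python
import pandas as pd


def describe_columns(df):
    """Build a plain-text list of column names, types and one example value each."""
    lines = ["My CSV file has the following columns:"]
    for name in df.columns:
        non_missing = df[name].dropna()
        example = non_missing.iloc[0] if len(non_missing) else "n/a"
        lines.append(f"- `{name}`: type {df[name].dtype}, e.g. {example}")
    return "\n".join(lines)


# Two invented rows using the column names from the example above.
sample = pd.DataFrame({
    "participant_id": [1, 2],
    "treatment_group": ["Control", "Test"],
    "response_time_sec": [1.52, 2.08],
    "score": [71, 64],
})

print(describe_columns(sample))
```

Check the example values before pasting them into a prompt: if a column holds anything sensitive, replace the value with a made-up one.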

Exercise 1: Cleaning Messy Data

Goal: Use the AI to write Python code to load and clean a messy dataset.

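Scenario: Below is the content of a file, synthetic_ashe-census_dataset.csv. Most of its columns hold categories stored as numeric codes, and at least one value (-99) looks like a placeholder code rather than a real category.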

Your Task:

Ask your AI assistant to write a Python script that uses pandas to:

Load this data.
Find and count the missing values in each column.
Generate a basic statistical summary of the cleaned data.

Exercise 1: The Data & Prompt

The data (synthetic_ashe-census_dataset.csv):

synth_ID,synth_sex,synth_ethpuk12_cen,synth_hlqpuk11_cen,synth_disability_cen,synth_lrespuk11_cen,synth_age_cen,synth_ecopuk_cen,synth_hours_cen,synth_health_cen,synth_mainlang_cen,synth_marstat_cen,synth_religion_cen,synth_carer_cen,synth_fmspuk11_cen,synth_dpcfamuk11_cen,synth_tenure,synth_pubpriv_ashe,synth_region,synth_occupation,synth_industry,synth_hbpay_ashe,synth_weight_ashe_cen,synth_weight_ashe
10000001,1,1,5,1,1,34,2,3,1,1,2,1,1,3,2,2,1,1,6,9,44.36,0.858283725,0.851008829
10000002,1,1,2,1,1,34,2,3,2,1,2,1,1,3,8,6,2,1,4,12,30.78,1.062945803,1.058440288
10000003,2,1,5,2,1,51,2,3,2,1,1,1,4,8,1,4,2,1,3,7,7.53,1.094525164,1.102092614
10000004,1,-99,5,1,1,36,2,3,1,1,1,2,1,3,1,5,2,1,9,1,37.03,0.858283725,0.85904896

Prompt Suggestion for your AI:

You are an expert data analyst who is helping a PhD student with no coding experience. I have the CSV data below.

Please write a complete Python script using the pandas library that:
1. Loads the data.
2. Identifies and counts any missing values.
3. Prints a statistical summary of the final, cleaned data.

Please add comments to the code to explain what each part does.
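One possible shape for the answer is sketched below, using the four sample rows shown above. It rests on one loud assumption: that -99 is a placeholder for a missing value. The dataset's documentation should confirm this before you rely on it, and your AI assistant will not know it unless you say so.

```python
import io

import pandas as pd

# The sample rows from this exercise, pasted in as text.
CSV_TEXT = """synth_ID,synth_sex,synth_ethpuk12_cen,synth_hlqpuk11_cen,synth_disability_cen,synth_lrespuk11_cen,synth_age_cen,synth_ecopuk_cen,synth_hours_cen,synth_health_cen,synth_mainlang_cen,synth_marstat_cen,synth_religion_cen,synth_carer_cen,synth_fmspuk11_cen,synth_dpcfamuk11_cen,synth_tenure,synth_pubpriv_ashe,synth_region,synth_occupation,synth_industry,synth_hbpay_ashe,synth_weight_ashe_cen,synth_weight_ashe
10000001,1,1,5,1,1,34,2,3,1,1,2,1,1,3,2,2,1,1,6,9,44.36,0.858283725,0.851008829
10000002,1,1,2,1,1,34,2,3,2,1,2,1,1,3,8,6,2,1,4,12,30.78,1.062945803,1.058440288
10000003,2,1,5,2,1,51,2,3,2,1,1,1,4,8,1,4,2,1,3,7,7.53,1.094525164,1.102092614
10000004,1,-99,5,1,1,36,2,3,1,1,1,2,1,3,1,5,2,1,9,1,37.03,0.858283725,0.85904896
"""

# 1. Load the data (io.StringIO lets pandas read the text as if it were a file).
raw = pd.read_csv(io.StringIO(CSV_TEXT))

# 2. Count missing values as pandas sees them: blank cells only.
print("Blank cells per column:")
print(raw.isna().sum())

# ASSUMPTION: -99 marks a missing value. Recode it so pandas treats it as missing.
cleaned = raw.mask(raw == -99)
missing_counts = cleaned.isna().sum()
print("Missing values after recoding -99:")
print(missing_counts[missing_counts > 0])

# 3. Statistical summary of the cleaned data.
print(cleaned.describe())
```

The first count finds nothing, because -99 is a perfectly valid number as far as pandas is concerned. Also note that the summary reports means for coded categories such as synth_region, which are not meaningful; this is exactly the kind of output you need to judge yourself.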

Exercise 2: From Question to Visualisation

Goal: To ask a research question and have the AI generate a plot to help answer it.

Scenario: Using the cleaned data from Exercise 1.

Your Research Question: “Does average gross hourly pay differ between education levels?”

Your Task: Ask your AI assistant to extend the previous script to:

Group the data by the synth_hlqpuk11_cen column (education level).
Calculate the average synth_hbpay_ashe (gross hourly pay) for each group.
Create a bar chart using matplotlib or seaborn to visualise the comparison. The chart must have a title and labelled axes.

Prompt Suggestion for your AI:

Extend the previous script to compare the average gross hourly pay (synth_hbpay_ashe) for each education level (synth_hlqpuk11_cen) in a bar chart with a title and labelled axes, using the descriptive labels (e.g., 'Degree', 'GCSEs', 'No qual').
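A response to this prompt might resemble the sketch below, which uses only the two relevant columns from the four sample rows in Exercise 1. The mapping from numeric codes to descriptive labels is not given in these slides, so `education_labels` is left as an empty placeholder to fill in from the dataset's documentation; until then the chart falls back to labels like "Code 5".

```python
import matplotlib

matplotlib.use("Agg")  # draw to a file, so no display window is needed

import matplotlib.pyplot as plt
import pandas as pd

# Education level and hourly pay for the four sample rows from Exercise 1.
df = pd.DataFrame({
    "synth_hlqpuk11_cen": [5, 2, 5, 5],
    "synth_hbpay_ashe": [44.36, 30.78, 7.53, 37.03],
})

# PLACEHOLDER: add code -> label pairs from the dataset's documentation.
education_labels = {}

# Group by education level and average the pay within each group.
mean_pay = df.groupby("synth_hlqpuk11_cen")["synth_hbpay_ashe"].mean()
mean_pay.index = [education_labels.get(code, f"Code {code}") for code in mean_pay.index]
print(mean_pay)

# Bar chart with a title and labelled axes.
fig, ax = plt.subplots()
ax.bar(mean_pay.index, mean_pay.values)
ax.set_title("Average gross hourly pay by education level")
ax.set_xlabel("Highest qualification (synth_hlqpuk11_cen)")
ax.set_ylabel("Mean synth_hbpay_ashe")
fig.savefig("pay_by_education.png")
```

With only four rows the averages mean very little, so treat this as a check that the code runs before pointing it at the full dataset.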

Crucial Caveats for Researchers

AI is powerful, but you are still the scientist. You are responsible for your analysis.

Privacy & Ethics: NEVER upload sensitive, confidential, or personally identifiable participant data to a public AI service. Anonymise your data first. Using a local AI model is a safer choice.
“Hallucinations”: The AI can invent facts, code, or statistical methods. ALWAYS ask it to explain its code and its reasoning. If it suggests a statistical test, double-check that the test is appropriate for your data and question.
You are the PI: Treat the AI as an assistant, not as the principal investigator. You must understand and be able to defend every step of your analysis.

Efficient LLM Usage: Context Management

Maximising Performance and Minimising Costs

The “context window” is the amount of text (input + output) an LLM can process at once. Managing it efficiently is crucial.

Understand Your Model’s Context Limit: Different models have varying context window sizes (e.g., 4k, 8k, 32k, 128k tokens). Be aware of your chosen model’s limit.
Summarisation/Extraction: Instead of feeding entire documents, summarise long texts or extract only relevant information before sending it to the LLM.
Iterative Processing: For very long tasks (e.g., analysing a book), break it down into smaller chunks and process them sequentially, maintaining relevant context from previous steps.
Prompt Engineering for Conciseness: Craft prompts that are clear and concise, avoiding unnecessary verbosity. Every token counts!
Selective Information Retrieval (RAG): Integrate Retrieval-Augmented Generation (RAG) to fetch only necessary information from a knowledge base, rather than stuffing everything into the context window.
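The iterative-processing idea can be sketched in a few lines. The function below splits a long text into overlapping chunks by word count. This is a simplification: models count tokens rather than words, and the word-to-token ratio varies by model, so `max_words` is only a rough proxy and `chunk_words` is a hypothetical helper.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into chunks of at most max_words words.

    Each chunk repeats the last `overlap` words of the previous one,
    so some context carries over between steps.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be at least 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the final words are already covered
    return chunks


# Made-up 500-word text, used only to demonstrate the splitting.
demo_text = " ".join(f"w{i}" for i in range(500))
for number, chunk in enumerate(chunk_words(demo_text), start=1):
    print(f"Chunk {number}: {len(chunk.split())} words")
```

Each chunk can then be sent to the model in turn, together with a short summary of what came before.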

Efficient LLM Usage: Reducing Environmental Impact

Towards Greener AI

Running LLMs, especially large ones, consumes significant energy. We can make choices to reduce this impact.

Choose Smaller, Capable Models: If a 7B model can achieve your task effectively, avoid using a 70B model. Smaller models consume less energy per inference.
Quantisation: As discussed, lower quantisation levels (e.g., Q4_K_M instead of Q8_0) shrink the model file by roughly 40% and reduce the amount of data moved through memory during inference, which tends to lower energy consumption.
Hardware Optimisation: Utilise hardware acceleration (GPUs) where available, as they are often more energy-efficient for parallel computations than CPUs for LLM inference. Ensure drivers and software are optimised.
Local versus Cloud: Running models locally on efficient hardware can sometimes be more energy-efficient than constantly sending data to and from cloud-based APIs, depending on the cloud provider’s infrastructure and your usage patterns.

Conclusion

AI dramatically lowers the barrier to entry for computational data analysis, empowering you to work with your data directly.
The most important skill is learning how to ask good questions and provide clear context.
Start small, experiment with your own (anonymised) data, and always critically evaluate the output.

Happy analysing!