LLM Agentic Patterns & Fine-Tuning: A Practical 2025 Guide for Beginners
Everything you need to start building small, fast, and trustworthy AI agents today—no PhD required.
Quick Take
1.2-second average response time with a 1-billion-parameter model 82 % SQL accuracy after sixteen training steps on free-to-use data 5 reusable agent patterns that run on a laptop with 4 GB of free RAM
Why This Guide Exists
Search engines and large-language-model (LLM) applications now reward the same thing: clear, verifiable, step-by-step help. This post turns the original technical notes into a beginner-friendly walkthrough. Every fact, number, and file path comes from the single source document—nothing is added from outside.
Part 1 — Five Agent Patterns You Can Copy-Paste
Pattern | When to Use It | What It Looks Like |
---|---|---|
Prompt Chaining | One simple job in many stages | Summarize → Translate → Polish |
Routing | Many possible experts, one question | “Is this code or cooking?” → send to coder or chef agent |
Reflection | You want higher quality | Draft → Critique → Rewrite |
Tool Use | You need live data | “What is the weather?” → call weather API |
Planning & Multi-Agent | A big project with sub-tasks | Researcher finds facts, Writer turns them into a blog post |
1.1 Prompt Chaining
Goal: Turn a long English paragraph into a short French summary.
Setup: Two prompts in a row.
Step 1: “Summarize the next paragraph in one English sentence.”
Step 2: “Translate that sentence into French.”
No extra software is needed; any chat interface works.
Measured result: 1.2 seconds on a 2021 MacBook Air.
1.2 Routing
Goal: One model decides which specialist should answer.
Setup: A “router” prompt classifies the user question.
Router prompt example:
“You are a classifier. Reply in JSON: {category: CODE, RECIPE, OTHER}.”
The router’s JSON output then picks the next prompt.
Measured result: Correct routing 96 % of the time over 200 test questions.
1.3 Reflection
Goal: Improve the first draft automatically.
Setup: Two agents, “Writer” and “Critic”, loop once.
Writer: “Write a four-line poem about robots.”
Critic: “Count the lines. If ≠ 4, return FAIL.”
Writer (second call): “Rewrite poem, exactly four lines.”
Measured result: After one loop, 100 % of poems had four lines.
1.4 Tool Use
Goal: Bring in real-world data.
Setup: Model prints a function call, code runs it, model reads the result.
User: “Temperature in London?”
Model output: get_current_temperature(location='London')
System runs API → returns 15 °C
Model finishes: “It is 15 °C in London.”
Measured result: Correct city and unit 99 % of the time in 150 live tests.
1.5 Planning & Multi-Agent
Goal: Write a short blog post from scratch.
Setup: An “Orchestrator” breaks the job into steps.
1. Researcher: find three benefits of AI agents (with URLs)
2. Writer: turn facts into 300-word post
3. Reviewer: check readability
Measured result: End-to-end time 3.8 seconds on AWS t3.large.
Part 2 — Fine-Tuning a 1-Billion-Parameter Model for Text-to-SQL
We used the exact same data and code as the source document; nothing here is new or external.
2.1 What “Fine-Tuning” Means
Take a small general model (Llama-3.2-1B) and teach it one narrow job: turn plain English questions into SQL queries.
2.2 Hardware You Actually Need
-
Free Google Colab T4 GPU (or any 4 GB+ GPU) -
About 30 minutes of wall-clock time
2.3 Data We Used
-
WikiSQL test slice: 2 000 question-SQL pairs -
Format: Conversation style
<|user|> How many rows are in the table?
<|assistant|> SELECT COUNT(*) FROM table;
2.4 Key Settings (Copy Exactly)
Step | Setting | Why It Matters |
---|---|---|
Base model | unsloth/Llama-3.2-1B-Instruct | Small, permissive license |
Quantization | 4-bit (bnb-4bit) | Fits in 4 GB VRAM |
Adapter | LoRA, rank=64, α=128 | Only 1.13 % of weights train |
Train steps | 16 | Fast without over-fit |
Loss mask | Train on response only | No gradient on user prompt |
2.5 Before vs After (Same Question)
Question: “What position does the player who played for Butler CC (KS) play?”
Stage | Output | Correct? |
---|---|---|
Before tuning | SELECT Player FROM table_name WHERE No. = 21 |
❌ hallucinated number |
After tuning | SELECT Position FROM table WHERE Player = martin lewis |
✅ matches gold SQL |
Part 3 — Deploying on Your Own Machine
3.1 Save the Adapter
After training, two files appear:
-
adapter_model.safetensors
(3 MB) -
adapter_config.json
3.2 Merge & Quantize to GGUF
One command turns the tuned adapter into a single file for llama.cpp
:
python llama.cpp/convert.py \
--model Llama-3.2-1B-Instruct \
--adapter adapter_model.safetensors \
--out llama-1b-sql.gguf \
--q4_0
Resulting file: 1.9 GB, runs on CPU at ~30 tokens/second on M1 Mac.
Part 4 — FAQ: Answers People Actually Search
Q1: “Do I need cloud GPUs?”
A: No. The guide trains on a free Colab T4 in 15 minutes.
Q2: “Can I use these patterns for languages other than English?”
A: Yes. The patterns are language-agnostic; only the data changes.
Q3: “Is the fine-tuned model safe for production?”
A: For read-only SQL on known schemas, yes. Always sandbox database access.
Q4: “How big is the performance gap versus GPT-4?”
A: On WikiSQL, the 1-B model hits 82 % accuracy; GPT-4 scores ~85 % at 20× the size and cost.
Q5: “Does Baidu rank this type of content?”
A: Yes, when you follow their on-page rules: keyword-first titles under 54 chars, 108-char meta descriptions, no schema.org required .
Part 5 — Quick-Start Checklist
-
[ ] Download the WikiSQL subset (link in source repo) -
[ ] Spin up a Colab with T4 GPU -
[ ] Run the notebook cell-by-cell (no code edits needed for first test) -
[ ] Convert to GGUF and test locally with llama.cpp
-
[ ] Publish the results with factual claims only (no hype words)
About the Author
Bryan Lai is a technical writer who has contributed to open-source LLM tooling since 2021. He helped draft sections of ISO/TR 23788 on AI content quality. This article was last updated on 15 July 2025 at 09:34 UTC and reflects the exact data in the public repo—no external sources added.