From Zero to Q: A Step-by-Step Guide to Training Large Language Models for a Niche Programming Language
How Morgan Stanley and Prime Intellect built a 59 % accurate Q-code generator and open-sourced every line of code.
Why bother with Q in the first place?
Q (and its companion database kdb+) is the silent workhorse of quantitative finance.
- A single line can scan billions of market ticks in milliseconds.
- Banks, hedge funds, and exchanges rely on it for real-time risk and back-testing.
- Yet Stack Overflow counts fewer than 200 answered Q questions, orders of magnitude fewer than for Python or Java.
General-purpose large language models know of Q, but when asked to write it they usually fail. The gap is not talent; it is data.
This post walks through the exact pipeline the authors used to close that gap, then shows how you can clone it for any under-represented language.
The four-stage pipeline in one glance
Phase | What it does | Takes | Produces | Typical wall-time on 8×H100 |
---|---|---|---|---|
1. Dataset | Turns Python problems into Q | LeetCode + Python solutions | 678 verified Q examples | 6–10 h |
2. Pre-training | Teaches general Q syntax | Permissive GitHub Q repos | Domain-adapted checkpoint | 1–10 h |
3. Fine-tuning | Teaches problem solving | Q-LeetCode train split | SFT checkpoint | 1–15 h |
4. Reinforcement Learning | Reduces careless errors | SFT checkpoint + unit tests | Final 1.5–32 B model | 2–6 h |
Phase 1 — Building a verifiable dataset
Starting point: no public Q benchmark exists
The team chose LeetCode because every problem already contains:
- a plain-English description
- a canonical Python solution
- multiple test cases with ground-truth outputs
Translating Python → Q becomes a supervised task whose correctness can be checked automatically.
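Concretely, "checked automatically" means executing the generated code together with its test harness in the q interpreter and keeping only submissions that exit cleanly. Below is a minimal sketch, assuming the harness ends with Q's exit call (0 on success, non-zero on any failure); the file layout and function name are illustrative, not the repo's exact code.

```python
import subprocess
import tempfile
from pathlib import Path

def verify_q_solution(solution_code: str, test_harness: str, timeout: int = 30) -> bool:
    """Run a candidate Q solution against an independently generated test
    harness inside the official q interpreter; True only if the harness
    exits with status 0 (all tests passed)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.q"
        # Solution first, then the harness that exercises it and exits 0/1.
        script.write_text(solution_code + "\n" + test_harness + "\n")
        try:
            result = subprocess.run(
                ["q", str(script)],          # path to the q binary
                capture_output=True,
                timeout=timeout,             # guard against non-terminating code
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0
```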
Bootstrap loop (Model-in-the-Loop)
1. Sample: pick 20 LeetCode problems and ask a teacher model (Qwen-2.5-32B-Instruct) to (a) write the Q code and (b) write a separate Q test harness.
2. Verify: run the harness in the official Q interpreter and accept only solutions that pass all tests.
3. Curriculum update: add the newly accepted problems to the training set, run 100 SFT steps, then repeat. (A minimal code sketch of this loop follows at the end of this phase.)
> Early mistake: letting the model generate code and tests together.
> Result: it learned to write trivial tests that always passed.
> Fix: enforce strict independence between code and test generation.
- After ~50 loops the success rate plateaued.
- Manual review removed false positives.
- Final tally: 542 train / 136 test problems spanning arrays, hash tables, dynamic programming, etc.
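Put together, the bootstrap loop looks roughly like this. It is a sketch rather than the released training code: the callables passed in (teacher-model sampling, independent test generation, interpreter verification, and the 100-step SFT update) are hypothetical hooks standing in for the components described above.

```python
import random

def bootstrap_dataset(problems, generate_solution, generate_tests,
                      verify, run_sft_steps, rounds=50, batch_size=20):
    """Model-in-the-loop bootstrap: grow a verified Q dataset round by round."""
    accepted = []                                   # verified (problem, q_solution) pairs
    for _ in range(rounds):                         # success rate plateaued after ~50 loops
        for problem in random.sample(problems, batch_size):
            q_code = generate_solution(problem)     # teacher writes the Q solution...
            q_tests = generate_tests(problem)       # ...and, separately, the test harness
            if verify(q_code, q_tests):             # accept only interpreter-verified code
                accepted.append((problem, q_code))
        run_sft_steps(accepted, steps=100)          # curriculum update, then repeat
    return accepted
```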
Phase 2 — Pre-training on raw Q code
Data sources & cleaning
- 14 open-source GitHub repositories (MIT or Apache-2.0).
- Kx Systems official docs and tutorials.
- Automated + human filtering removed non-Q files and low-quality snippets.
- Final volume: 1.6 M tokens, packed into 4096-token chunks.
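For reference, packing the cleaned corpus into fixed-length chunks can be as simple as tokenising everything and slicing. A sketch assuming the Qwen-2.5 tokenizer and a list of already-filtered .q file paths:

```python
from pathlib import Path
from transformers import AutoTokenizer

def chunk_q_corpus(q_files, tokenizer_name="Qwen/Qwen2.5-7B", chunk_len=4096):
    """Concatenate cleaned .q files and slice the token stream into
    fixed 4096-token chunks for continued pre-training."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    ids = []
    for path in q_files:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        ids.extend(tokenizer(text, add_special_tokens=False)["input_ids"])
    # Drop the trailing remainder so every chunk has exactly chunk_len tokens.
    return [ids[i:i + chunk_len]
            for i in range(0, len(ids) - chunk_len + 1, chunk_len)]
```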
Training recipe
Model size | GPUs | Steps | Learning rate | Early-stopping signal |
---|---|---|---|---|
1.5 B–7 B | 4×H100 | 800 | 1 × 10⁻⁵ | eval loss |
14 B–32 B | 8×H100 | 50 | 5 × 10⁻⁶ | eval loss |
> Observation: 14 B and 32 B overfit quickly; smaller models do not.
> Lesson: always monitor validation loss, not training loss.
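A minimal illustration of that lesson: track the best validation loss seen so far and stop once it stops improving, regardless of what the training loss is doing. This helper is illustrative, not the repo's exact stopping rule.

```python
class EarlyStopper:
    """Stop training when validation loss has not improved for
    `patience` consecutive evaluations."""
    def __init__(self, patience: int = 3):
        self.best = float("inf")
        self.patience = patience
        self.bad_evals = 0

    def should_stop(self, eval_loss: float) -> bool:
        if eval_loss < self.best:
            self.best = eval_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1   # training loss may still be falling here
        return self.bad_evals >= self.patience
```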
Phase 3 — Supervised fine-tuning (SFT)
Task formulation
Each problem yields 8 training samples:
Task | Input | Target |
---|---|---|
Description → Q | English text | Q solve function |
Python → Q | Python solve | Equivalent Q code |
Q → Python | Q solve | Equivalent Python code |
Test → Test | Python test | Q test harness |
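One way to materialise these pairs is a small expansion function per verified problem. The dict keys and prompt templates below are illustrative assumptions, not the released data format.

```python
def build_sft_samples(problem):
    """Expand one verified problem into the translation-style SFT pairs
    from the table above."""
    return [
        # Description -> Q
        {"prompt": "Solve in Q:\n" + problem["description"],
         "target": problem["q_solution"]},
        # Python -> Q
        {"prompt": "Translate this Python solution to Q:\n" + problem["python_solution"],
         "target": problem["q_solution"]},
        # Q -> Python
        {"prompt": "Translate this Q solution to Python:\n" + problem["q_solution"],
         "target": problem["python_solution"]},
        # Python test -> Q test harness
        {"prompt": "Translate this Python test into a Q test harness:\n" + problem["python_test"],
         "target": problem["q_test"]},
    ]
```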
Key hyper-parameters
- Batch size 8 (via gradient accumulation)
- Learning rate 2 × 10⁻⁵ (7 B) or 4 × 10⁻⁵ (32 B)
- Early stopping on validation loss (typically 600–1000 steps)
- Full-parameter tuning chosen over LoRA for downstream compatibility
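In Hugging Face terms, those hyper-parameters translate into something like the following; argument names follow the transformers Trainer, not necessarily the released scripts (eval_strategy was called evaluation_strategy in older releases).

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Illustrative full-parameter SFT settings for the 7 B model.
args = TrainingArguments(
    output_dir="./sft-7b",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,      # effective batch size 8
    learning_rate=2e-5,                 # use 4e-5 for the 32 B model
    bf16=True,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
```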
Result snapshot
- 7 B pass@1 jumps from 0 % to 44.9 %.
- 14 B surpasses GPT-4.1 (2.9 % pass@1) by a wide margin.
Phase 4 — Reinforcement learning with GRPO
Reward design
- Unit-test reward = fraction of passing tests
- Perfect bonus = +2 if all tests pass
- Combined reward = unit-test reward + perfect bonus (linear mix)
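The reward is simple enough to fit in a few lines; here is a sketch of that linear mix (the function name and signature are illustrative).

```python
def grpo_reward(passed: int, total: int, perfect_bonus: float = 2.0) -> float:
    """Unit-test reward (fraction of passing tests) plus a flat bonus
    when every test passes."""
    if total == 0:
        return 0.0
    test_reward = passed / total                    # in [0, 1]
    bonus = perfect_bonus if passed == total else 0.0
    return test_reward + bonus                      # maximum possible reward: 3.0
```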
Experimental grid
Variable | Values tested |
---|---|
Model size | 1.5 B, 3 B, 7 B, 14 B, 32 B |
Reasoning prompt | yes / no |
Sampling temperature | 0.6, 0.7, 0.8, 1.0, 1.2 |
Take-aways
- RL helps only above ~7 B parameters.
- 1.5 B actually degrades; its capacity is too low.
- The 32 B reasoning model achieves the best final score: 59 % pass@1, beating Claude Opus-4 by 29.5 %.
Reproducing the results on your machine
Prerequisites
- Ubuntu 20.04+ or macOS + Docker
- NVIDIA driver ≥ 525
- q interpreter (free personal edition)
- Python 3.8+
```bash
# 1. Clone
git clone https://github.com/morganstanley/q-llm-pipeline
cd q-llm-pipeline

# 2. Install
pip install -r requirements.txt

# 3. One-line end-to-end for the 7 B model
python run_full_pipeline.py \
  --model_size 7B \
  --use_lora \
  --dataset_download auto \
  --output_dir ./my_q_model
```
The script will:
- Download the Q-LeetCode dataset (if missing).
- Run pre-training → SFT → RL in sequence.
- Produce a Hugging Face-compatible checkpoint plus an HTML evaluation report.
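Once the run finishes, the checkpoint loads like any other Hugging Face causal LM. A quick smoke test (the prompt wording below is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./my_q_model")
model = AutoModelForCausalLM.from_pretrained("./my_q_model", device_map="auto")

prompt = "Write a Q function solve that returns the indices of two numbers summing to a target.\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Print only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```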
How to adapt the pipeline to another niche language
File | Change needed |
---|---|
build_dataset/convert.py | Replace Q grammar rules with the target language |
pretrain/filter.py | Update the regex to harvest target-language repos |
eval/test_runner.py | Swap the interpreter path |
rl/reward.py | Plug in any executable that returns pass/fail |
If the new language already has a unit-test framework (e.g., R's testthat), the whole RL loop is reusable.
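The key contract for eval/test_runner.py and rl/reward.py is simply "run a script with some interpreter and report pass/fail". A language-agnostic sketch (the function name is illustrative):

```python
import subprocess

def run_tests(interpreter_cmd, script_path, timeout=60):
    """Return True if the interpreter exits with status 0 on the test script.
    interpreter_cmd examples: ["q"], ["Rscript"], ["python3"], or your DSL's CLI."""
    try:
        result = subprocess.run(
            [*interpreter_cmd, str(script_path)],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```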
Real-world vs. LeetCode Q: know the gap
The released models are optimised for algorithmic puzzles, not production analytics.
Example comparison:
Real-world Q (tick database) | LeetCode-style Q (puzzle) |
---|---|
select avg price by sym from trade where date=2024.08.14 | solve:{[c] ... } // H-index |
If your use case is heavy SQL-like queries, continue fine-tuning on in-house data while keeping the provided checkpoints as a warm start.
Frequently asked questions
Do I need 8×A100?
- 7 B fits on a single RTX 4090 (24 GB) with QLoRA (see the sketch below).
- 32 B needs 8×A100 80 GB or cloud spot instances (tested on CoreWeave).
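A sketch of the single-GPU setup with QLoRA: 4-bit base weights plus low-rank adapters. The base-model name, rank, and target modules here are illustrative choices, not the repo's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantise base weights to 4 bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)         # only the adapters are trainable
model.print_trainable_parameters()
```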
Can I skip reinforcement learning?
Yes. The SFT checkpoint already beats GPT-4.1. RL adds the last ~10 % accuracy.
License summary
- Code: Apache 2.0
- Models: inherit the Qwen-2.5 license
- Dataset: MIT (problem texts © LeetCode, used under fair-use educational terms)
Closing thoughts
The project proves that any niche language can get its own competent code model in a weekend of GPU time, provided you:
- Build a tiny but verifiable dataset (bootstrap loop).
- Pre-train on every scrap of permissively licensed code you can find.
- Fine-tune and run RL with automated, objective rewards.
Download the repo, change three paths, and you have a blueprint for Fortran, COBOL, or your company’s internal DSL. Happy training!