From Zero to Q: A Step-by-Step Guide to Training Large Language Models for a Niche Programming Language

How Morgan Stanley and Prime Intellect built a 59 % accurate Q-code generator and open-sourced every line of code.


Why bother with Q in the first place?

Q (and its companion database kdb+) is the silent workhorse of quantitative finance.

  • A single line can scan billions of market ticks in milliseconds.
  • Banks, hedge funds, and exchanges rely on it for real-time risk and back-testing.
  • Yet Stack Overflow counts fewer than 200 answered Q questions—orders of magnitude fewer than for Python or Java.

General-purpose large language models know of Q, but when asked to write it they usually fail. The gap is not talent; it is data.
This post walks through the exact pipeline the authors used to close that gap, then shows how you can clone it for any under-represented language.


The four-stage pipeline in one glance

| Phase | What it does | Takes | Produces | Typical wall-time on 8×H100 |
|---|---|---|---|---|
| 1. Dataset | Turns Python problems into Q | LeetCode + Python solutions | 678 verified Q examples | 6–10 h |
| 2. Pre-training | Teaches general Q syntax | Permissive GitHub Q repos | Domain-adapted checkpoint | 1–10 h |
| 3. Fine-tuning | Teaches problem solving | Q-LeetCode train split | SFT checkpoint | 1–15 h |
| 4. Reinforcement Learning | Reduces careless errors | SFT checkpoint + unit tests | Final 1.5–32 B model | 2–6 h |

Phase 1 — Building a verifiable dataset

Starting point: no public Q benchmark exists

The team chose LeetCode because every problem already contains

  • a plain-English description
  • a canonical Python solution
  • multiple test cases with ground-truth outputs

Translating Python → Q becomes a supervised task whose correctness can be checked automatically.

Bootstrap loop (Model-in-the-Loop)

  1. Sample
    Pick 20 LeetCode problems.
    Ask a teacher model (Qwen-2.5-32B-Instruct) to
    a) write Q code and
    b) write a separate Q test harness.

  2. Verify
    Run the harness in the official Q interpreter.
    Accept only solutions that pass all tests.

  3. Curriculum update
    Add the newly accepted problems to the training set, run 100 SFT steps, then repeat.

Early mistake: letting the model generate code and tests together.
Result: it learned to write trivial tests that always passed.
Fix: enforce strict independence between code and test generation.
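
A minimal sketch of the verify step, assuming the `q` binary is on your PATH and that each harness ends by calling `exit` with a non-zero status when any test fails (the file layout and function name are illustrative, not the authors' exact script):

```python
import os
import subprocess
import tempfile

def passes_harness(solution_q, harness_q, timeout=10):
    """Run an independently generated Q test harness against a candidate
    solution in the official q interpreter; accept only a clean exit."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.q")
        with open(path, "w") as f:
            # The harness loads after the solution so it can call solve;
            # it is expected to `exit 0` on success and non-zero otherwise.
            f.write(solution_q + "\n" + harness_q + "\n")
        try:
            result = subprocess.run(["q", path], capture_output=True,
                                    timeout=timeout)
        except subprocess.TimeoutExpired:
            return False  # non-terminating candidates count as failures
        return result.returncode == 0
```

Rejecting on timeout doubles as a filter for solutions that loop forever.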

  • After ~50 loops the success rate plateaued.
  • Manual review removed false positives.
  • Final tally: 542 train / 136 test problems spanning arrays, hash tables, dynamic programming, etc.

Phase 2 — Pre-training on raw Q code

Data sources & cleaning

  • 14 open-source GitHub repositories (MIT or Apache-2.0).
  • Kx Systems official docs and tutorials.
  • Automated + human filtering removed non-Q files and low-quality snippets.
  • Final volume: 1.6 M tokens, packed into 4096-token chunks.
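
A sketch of the chunk-packing step. The tokenizer choice and the EOS-token file separator are assumptions; the post only pins down the 4096-token chunk size:

```python
from transformers import AutoTokenizer

def pack_into_chunks(files, tokenizer_name="Qwen/Qwen2.5-7B", chunk_len=4096):
    """Concatenate cleaned Q source files into one token stream and slice
    it into fixed-length pre-training chunks."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    ids = []
    for path in files:
        with open(path, encoding="utf-8") as f:
            ids.extend(tok.encode(f.read()))
        ids.append(tok.eos_token_id)  # mark file boundaries
    # Drop the trailing remainder so every sample is exactly chunk_len tokens.
    return [ids[i:i + chunk_len]
            for i in range(0, len(ids) - chunk_len + 1, chunk_len)]
```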

Training recipe

| Model size | GPUs | Steps | Learning rate | Early stop? |
|---|---|---|---|---|
| 1.5 B–7 B | 4×H100 | 800 | 1 × 10⁻⁵ | eval loss |
| 14 B–32 B | 8×H100 | 50 | 5 × 10⁻⁶ | eval loss |

Observation: 14 B and 32 B overfit quickly; smaller models do not.
Lesson: always monitor validation loss, not training loss.
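
One way to wire that up with Hugging Face's Trainer, shown here for the 14 B–32 B row of the recipe (the evaluation cadence and patience are illustrative, not the authors' settings):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Early-stop on eval loss, never training loss.
args = TrainingArguments(
    output_dir="./pretrain_q",
    max_steps=50,
    learning_rate=5e-6,
    eval_strategy="steps",        # `evaluation_strategy` in older releases
    eval_steps=10,                # illustrative cadence
    save_strategy="steps",
    save_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
```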


Phase 3 — Supervised fine-tuning (SFT)

Task formulation

Each problem yields eight training samples, spread across four task types:

| Task | Input | Target |
|---|---|---|
| Description → Q | English text | Q `solve` function |
| Python → Q | Python `solve` | Equivalent Q code |
| Q → Python | Q `solve` | Equivalent Python code |
| Test → Test | Python test | Q test harness |
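
A sketch of the expansion; the record field names and prompt templates are hypothetical (the post doesn't publish them), and only one sample per framing is shown even though the post counts eight per problem:

```python
def expand_problem(rec):
    """Turn one verified Q-LeetCode record into the four task framings
    above (field names such as rec["desc"] are hypothetical)."""
    return [
        {"prompt": "Write a Q solve function:\n" + rec["desc"],
         "target": rec["q_solution"]},
        {"prompt": "Translate this Python solve to Q:\n" + rec["py_solution"],
         "target": rec["q_solution"]},
        {"prompt": "Translate this Q solve to Python:\n" + rec["q_solution"],
         "target": rec["py_solution"]},
        {"prompt": "Port this Python test to a Q harness:\n" + rec["py_test"],
         "target": rec["q_test"]},
    ]
```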

Key hyper-parameters

  • Effective batch size 8 (via gradient accumulation)
  • Learning rate 2 × 10⁻⁵ (7 B) or 4 × 10⁻⁵ (32 B)
  • Early stopping on validation loss (~600–1000 steps)
  • Full-parameter tuning chosen over LoRA for downstream compatibility

Result snapshot

  • 7 B pass@1 jumps from 0 % to 44.9 %
  • 14 B surpasses GPT-4.1 (2.9 %) by a wide margin

Phase 4 — Reinforcement learning with GRPO (Group Relative Policy Optimization)

Reward design

  • Unit-test reward = fraction of passing tests
  • Perfect bonus = +2 if all tests pass
  • Combined reward = test reward + perfect bonus (linear mix)
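
As a function, the reward is a direct transcription of those three rules:

```python
def combined_reward(passed, total):
    """Fraction of passing unit tests, plus a +2 bonus when all pass."""
    if total == 0:
        return 0.0  # no tests, no signal
    test_reward = passed / total
    perfect_bonus = 2.0 if passed == total else 0.0
    return test_reward + perfect_bonus
```

Because the bonus is twice the maximum partial credit, a fully correct program out-scores any almost-correct one by a wide margin, which is exactly the pressure that reduces careless errors.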

Experimental grid

| Variable | Values tested |
|---|---|
| Model size | 1.5 B, 3 B, 7 B, 14 B, 32 B |
| Reasoning prompt | yes / no |
| Sampling temperature | 0.6, 0.7, 0.8, 1.0, 1.2 |
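
For reference, one grid point of rollout sampling, sketched with transformers' `generate` (the checkpoint path and prompt are placeholders, and real RL frameworks batch this step):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("./my_q_model")        # placeholder path
model = AutoModelForCausalLM.from_pretrained("./my_q_model")

inputs = tok("Write a Q solve function for the H-index problem.",
             return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,       # one of the grid values above
    max_new_tokens=512,
)
print(tok.decode(out[0], skip_special_tokens=True))
```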

Take-aways

  • RL helps only above ~7 B parameters.
  • 1.5 B actually degrades—capacity too low.
  • 32 B reasoning model achieves best final score: 59 % pass@1, beating Claude Opus-4 by 29.5 %.

Reproducing the results on your machine

Prerequisites

  • Ubuntu 20.04+ or macOS + Docker
  • NVIDIA driver ≥ 525
  • q interpreter (free personal edition)
  • Python 3.8+

```bash
# 1. Clone
git clone https://github.com/morganstanley/q-llm-pipeline
cd q-llm-pipeline

# 2. Install
pip install -r requirements.txt

# 3. One-line end-to-end for 7 B model
python run_full_pipeline.py \
  --model_size 7B \
  --use_lora \
  --dataset_download auto \
  --output_dir ./my_q_model
```

The script will

  1. Download the Q-LeetCode dataset (if missing).
  2. Run pre-training → SFT → RL in sequence.
  3. Produce a Hugging Face-compatible checkpoint plus an HTML evaluation report.

How to adapt the pipeline to another niche language

| File | Change needed |
|---|---|
| `build_dataset/convert.py` | Replace Q grammar rules with target language |
| `pretrain/filter.py` | Update regex to harvest target-language repos |
| `eval/test_runner.py` | Swap interpreter path |
| `rl/reward.py` | Plug in any executable that returns pass/fail |

If the new language already has a unit-test framework (e.g., R’s testthat), the whole RL loop is reusable.
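
In that case `rl/reward.py` can shrink to a generic pass/fail runner like this sketch (the function name is illustrative; any test command that exits non-zero on failure works):

```python
import subprocess

def pass_fail_reward(cmd, timeout=30):
    """Binary RL reward from any test runner that exits 0 on success."""
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if done.returncode == 0 else 0.0

# Example invocations (illustrative):
#   pass_fail_reward(["pytest", "-q", "tests/"])
#   pass_fail_reward(["Rscript", "-e", "testthat::test_file('tests.R')"])
```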


Real-world vs. LeetCode Q: know the gap

The released models are optimised for algorithmic puzzles, not production analytics.
Example comparison:

| Real-world Q (tick database) | LeetCode-style Q (puzzle) |
|---|---|
| `select avg price by sym from trade where date=2024.08.14` | `solve:{[c] ... } // H-index` |

If your use case is heavy SQL-like queries, continue fine-tuning on in-house data while keeping the provided checkpoints as a warm start.
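
A warm-start sketch under those assumptions (the checkpoint path and the data format are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the released puzzle-trained checkpoint as the starting point.
model = AutoModelForCausalLM.from_pretrained("./my_q_model")  # placeholder
tok = AutoTokenizer.from_pretrained("./my_q_model")

# In-house (description, Q query) pairs to keep fine-tuning on.
pairs = [
    ("Average trade price per symbol on 2024.08.14",
     "select avg price by sym from trade where date=2024.08.14"),
]
# Reuse the Phase-3 SFT recipe on these pairs; a lower learning rate
# (e.g. 1e-5) helps preserve what the warm start already knows.
```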


Frequently asked questions

Do I need 8×A100?

  • 7 B fits on 1×RTX 4090 24 GB with QLoRA.
  • 32 B needs 8×A100 80 GB or cloud spot instances (tested on CoreWeave).

Can I skip reinforcement learning?

Yes. The SFT checkpoint already beats GPT-4.1. RL adds the last ~10 % accuracy.

License summary

  • Code: Apache 2.0
  • Models: inherit Qwen-2.5 license
  • Dataset: MIT (problem texts © LeetCode under fair-use educational terms)

Closing thoughts

The project proves that any niche language can get its own competent code model in a weekend of GPU time, provided you:

  1. Build a tiny but verifiable dataset (bootstrap loop).
  2. Pre-train on every scrap of permissive code you can find.
  3. Fine-tune and RL with automated, objective rewards.

Download the repo, change three paths, and you have a blueprint for Fortran, COBOL, or your company’s internal DSL. Happy training!