From Zero to Q: A Step-by-Step Guide to Training Large Language Models for a Niche Programming Language

How Morgan Stanley and Prime Intellect built a 59 % accurate Q-code generator and open-sourced every line of code.


Why bother with Q in the first place?

Q (and its companion database kdb+) is the silent workhorse of quantitative finance.

  • A single line can scan billions of market ticks in milliseconds.
  • Banks, hedge funds, and exchanges rely on it for real-time risk and back-testing.
  • Yet Stack Overflow counts fewer than 200 answered Q questions—orders of magnitude fewer than for Python or Java.

General-purpose large language models know of Q, but when asked to write it they usually fail. The gap is not talent; it is data.
This post walks through the exact pipeline the authors used to close that gap, then shows how you can clone it for any under-represented language.


The four-stage pipeline in one glance

| Phase | What it does | Takes | Produces | Typical wall-time on 8×H100 |
|---|---|---|---|---|
| 1. Dataset | Turns Python problems into Q | LeetCode + Python solutions | 678 verified Q examples | 6–10 h |
| 2. Pre-training | Teaches general Q syntax | Permissive GitHub Q repos | Domain-adapted checkpoint | 1–10 h |
| 3. Fine-tuning | Teaches problem solving | Q-LeetCode train split | SFT checkpoint | 1–15 h |
| 4. Reinforcement Learning | Reduces careless errors | SFT checkpoint + unit tests | Final 1.5–32 B model | 2–6 h |

Phase 1 — Building a verifiable dataset

Starting point: no public Q benchmark exists

The team chose LeetCode because every problem already contains

  • a plain-English description
  • a canonical Python solution
  • multiple test cases with ground-truth outputs

Translating Python → Q becomes a supervised task whose correctness can be checked automatically.

Bootstrap loop (Model-in-the-Loop)

  1. Sample
    Pick 20 LeetCode problems.
    Ask a teacher model (Qwen-2.5-32B-Instruct) to
    a) write Q code and
    b) write a separate Q test harness.

  2. Verify
    Run the harness in the official Q interpreter.
    Accept only solutions that pass all tests.

  3. Curriculum update
    Add the newly accepted problems to the training set, run 100 SFT steps, then repeat.

Early mistake: letting the model generate code and tests together.
Result: it learned to write trivial tests that always passed.
Fix: enforce strict independence between code and test generation.
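
A minimal sketch of the verify step, assuming the `q` binary is on your PATH and that each harness ends by calling `exit` with a non-zero status when any test fails (the file layout and function name are illustrative, not the authors' exact script):

```python
import os
import subprocess
import tempfile

def passes_harness(solution_q, harness_q, timeout=10):
    """Run an independently generated Q test harness against a candidate
    solution in the official q interpreter; accept only a clean exit."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.q")
        with open(path, "w") as f:
            # The harness loads after the solution so it can call solve;
            # it is expected to `exit 0` on success and non-zero otherwise.
            f.write(solution_q + "\n" + harness_q + "\n")
        try:
            result = subprocess.run(["q", path], capture_output=True,
                                    timeout=timeout)
        except subprocess.TimeoutExpired:
            return False  # non-terminating candidates count as failures
        return result.returncode == 0
```

Rejecting on timeout doubles as a filter for solutions that loop forever.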

  • After ~50 loops the success rate plateaued.
  • Manual review removed false positives.
  • Final tally: 542 train / 136 test problems spanning arrays, hash tables, dynamic programming, etc.

Phase 2 — Pre-training on raw Q code

Data sources & cleaning

  • 14 open-source GitHub repositories (MIT or Apache-2.0).
  • Kx Systems official docs and tutorials.
  • Automated + human filtering removed non-Q files and low-quality snippets.
  • Final volume: 1.6 M tokens, packed into 4096-token chunks.
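
A sketch of the chunk-packing step. The tokenizer choice and the EOS-token file separator are assumptions; the post only pins down the 4096-token chunk size:

```python
from transformers import AutoTokenizer

def pack_into_chunks(files, tokenizer_name="Qwen/Qwen2.5-7B", chunk_len=4096):
    """Concatenate cleaned Q source files into one token stream and slice
    it into fixed-length pre-training chunks."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    ids = []
    for path in files:
        with open(path, encoding="utf-8") as f:
            ids.extend(tok.encode(f.read()))
        ids.append(tok.eos_token_id)  # mark file boundaries
    # Drop the trailing remainder so every sample is exactly chunk_len tokens.
    return [ids[i:i + chunk_len]
            for i in range(0, len(ids) - chunk_len + 1, chunk_len)]
```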

Training recipe

| Model size | GPUs | Steps | Learning rate | Early stop? |
|---|---|---|---|---|
| 1.5 B–7 B | 4×H100 | 800 | 1 × 10⁻⁵ | eval loss |
| 14 B–32 B | 8×H100 | 50 | 5 × 10⁻⁶ | eval loss |

Observation: 14 B and 32 B overfit quickly; smaller models do not.
Lesson: always monitor validation loss, not training loss.
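
One way to wire that up with Hugging Face's Trainer, shown here for the 14 B–32 B row of the recipe (the evaluation cadence and patience are illustrative, not the authors' settings):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Early-stop on eval loss, never training loss.
args = TrainingArguments(
    output_dir="./pretrain_q",
    max_steps=50,
    learning_rate=5e-6,
    eval_strategy="steps",        # `evaluation_strategy` in older releases
    eval_steps=10,                # illustrative cadence
    save_strategy="steps",
    save_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
```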


Phase 3 — Supervised fine-tuning (SFT)

Task formulation

Each problem yields eight training samples, spread across four task types:

| Task | Input | Target |
|---|---|---|
| Description → Q | English text | Q `solve` function |
| Python → Q | Python `solve` | Equivalent Q code |
| Q → Python | Q `solve` | Equivalent Python code |
| Test → Test | Python test | Q test harness |
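
A sketch of the expansion; the record field names and prompt templates are hypothetical (the post doesn't publish them), and only one sample per framing is shown even though the post counts eight per problem:

```python
def expand_problem(rec):
    """Turn one verified Q-LeetCode record into the four task framings
    above (field names such as rec["desc"] are hypothetical)."""
    return [
        {"prompt": "Write a Q solve function:\n" + rec["desc"],
         "target": rec["q_solution"]},
        {"prompt": "Translate this Python solve to Q:\n" + rec["py_solution"],
         "target": rec["q_solution"]},
        {"prompt": "Translate this Q solve to Python:\n" + rec["q_solution"],
         "target": rec["py_solution"]},
        {"prompt": "Port this Python test to a Q harness:\n" + rec["py_test"],
         "target": rec["q_test"]},
    ]
```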

Key hyper-parameters

  • Effective batch size 8 (via gradient accumulation)
  • Learning rate 2 × 10⁻⁵ (7 B) or 4 × 10⁻⁵ (32 B)
  • Early stopping on validation loss (~600–1000 steps)
  • Full-parameter tuning chosen over LoRA for downstream compatibility

Result snapshot

  • 7 B pass@1 jumps from 0 % to 44.9 %
  • 14 B surpasses GPT-4.1 (2.9 %) by a wide margin

Phase 4 — Reinforcement learning with GRPO (Group Relative Policy Optimization)

Reward design

  • Unit-test reward = fraction of passing tests
  • Perfect bonus = +2 if all tests pass
  • Combined reward = test reward + perfect bonus (linear mix)
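
As a function, the reward is a direct transcription of those three rules:

```python
def combined_reward(passed, total):
    """Fraction of passing unit tests, plus a +2 bonus when all pass."""
    if total == 0:
        return 0.0  # no tests, no signal
    test_reward = passed / total
    perfect_bonus = 2.0 if passed == total else 0.0
    return test_reward + perfect_bonus
```

Because the bonus is twice the maximum partial credit, a fully correct program out-scores any almost-correct one by a wide margin, which is exactly the pressure that reduces careless errors.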

Experimental grid

| Variable | Values tested |
|---|---|
| Model size | 1.5 B, 3 B, 7 B, 14 B, 32 B |
| Reasoning prompt | yes / no |
| Sampling temperature | 0.6, 0.7, 0.8, 1.0, 1.2 |
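
For reference, one grid point of rollout sampling, sketched with transformers' `generate` (the checkpoint path and prompt are placeholders, and real RL frameworks batch this step):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("./my_q_model")        # placeholder path
model = AutoModelForCausalLM.from_pretrained("./my_q_model")

inputs = tok("Write a Q solve function for the H-index problem.",
             return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,       # one of the grid values above
    max_new_tokens=512,
)
print(tok.decode(out[0], skip_special_tokens=True))
```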

Take-aways

  • RL helps only above ~7 B parameters.
  • 1.5 B actually degrades—capacity too low.
  • 32 B reasoning model achieves best final score: 59 % pass@1, beating Claude Opus-4 by 29.5 %.

Reproducing the results on your machine

Prerequisites

  • Ubuntu 20.04+ or macOS + Docker
  • NVIDIA driver ≥ 525
  • q interpreter (free personal edition)
  • Python 3.8+

```bash
# 1. Clone
git clone https://github.com/morganstanley/q-llm-pipeline
cd q-llm-pipeline

# 2. Install
pip install -r requirements.txt

# 3. One-line end-to-end for 7 B model
python run_full_pipeline.py \
  --model_size 7B \
  --use_lora \
  --dataset_download auto \
  --output_dir ./my_q_model
```

The script will

  1. Download the Q-LeetCode dataset (if missing).
  2. Run pre-training → SFT → RL in sequence.
  3. Produce a Hugging Face-compatible checkpoint plus an HTML evaluation report.

How to adapt the pipeline to another niche language

| File | Change needed |
|---|---|
| `build_dataset/convert.py` | Replace Q grammar rules with target language |
| `pretrain/filter.py` | Update regex to harvest target-language repos |
| `eval/test_runner.py` | Swap interpreter path |
| `rl/reward.py` | Plug in any executable that returns pass/fail |

If the new language already has a unit-test framework (e.g., R’s testthat), the whole RL loop is reusable.
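
In that case `rl/reward.py` can shrink to a generic pass/fail runner like this sketch (the function name is illustrative; any test command that exits non-zero on failure works):

```python
import subprocess

def pass_fail_reward(cmd, timeout=30):
    """Binary RL reward from any test runner that exits 0 on success."""
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if done.returncode == 0 else 0.0

# Example invocations (illustrative):
#   pass_fail_reward(["pytest", "-q", "tests/"])
#   pass_fail_reward(["Rscript", "-e", "testthat::test_file('tests.R')"])
```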


Real-world vs. LeetCode Q: know the gap

The released models are optimised for algorithmic puzzles, not production analytics.
Example comparison:

| Real-world Q (tick database) | LeetCode-style Q (puzzle) |
|---|---|
| `select avg price by sym from trade where date=2024.08.14` | `solve:{[c] ... } // H-index` |

If your use case is heavy SQL-like queries, continue fine-tuning on in-house data while keeping the provided checkpoints as a warm start.
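
A warm-start sketch under those assumptions (the checkpoint path and the data format are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the released puzzle-trained checkpoint as the starting point.
model = AutoModelForCausalLM.from_pretrained("./my_q_model")  # placeholder
tok = AutoTokenizer.from_pretrained("./my_q_model")

# In-house (description, Q query) pairs to keep fine-tuning on.
pairs = [
    ("Average trade price per symbol on 2024.08.14",
     "select avg price by sym from trade where date=2024.08.14"),
]
# Reuse the Phase-3 SFT recipe on these pairs; a lower learning rate
# (e.g. 1e-5) helps preserve what the warm start already knows.
```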


Frequently asked questions

Do I need 8×A100?

  • 7 B fits on 1×RTX 4090 24 GB with QLoRA.
  • 32 B needs 8×A100 80 GB or cloud spot instances (tested on CoreWeave).

Can I skip reinforcement learning?

Yes. The SFT checkpoint already beats GPT-4.1. RL adds the last ~10 % accuracy.

License summary

  • Code: Apache 2.0
  • Models: inherit Qwen-2.5 license
  • Dataset: MIT (problem texts © LeetCode under fair-use educational terms)

Closing thoughts

The project proves that any niche language can get its own competent code model in a weekend of GPU time, provided you:

  1. Build a tiny but verifiable dataset (bootstrap loop).
  2. Pre-train on every scrap of permissive code you can find.
  3. Fine-tune and RL with automated, objective rewards.

Download the repo, change three paths, and you have a blueprint for Fortran, COBOL, or your company’s internal DSL. Happy training!