Europe’s Own 30-Billion-Parameter Open LLM Is Here: Meet TildeOpen
A plain-language walk-through for college-level readers who want to understand—without the hype—why Europe built its own large language model, how to run it on your own hardware, and what it can (and cannot) do.
Quick-Glance Card
| Question | One-line answer |
|---|---|
| What is it? | A 30-billion-parameter, decoder-only transformer released by the Latvian language-technology company Tilde, optimised for European languages, the smaller ones in particular. |
| Parameters & licence | 30 B, dense (no mixture-of-experts), CC-BY-4.0, commercial use allowed. |
| Languages covered | 90+ European languages, including Latvian, Lithuanian, Estonian, Ukrainian, Turkish, Croatian, Icelandic, Irish, Basque, Sami and more. |
| Training compute | 2 million GPU hours on the EU supercomputers LUMI (Finland) and JUPITER (Germany), financed by the European Commission’s “Large AI Grand Challenge”. |
| Context length | 8 192 tokens. |
| Download size | ~60 GB (safetensors format). |
| Minimum GPUs for inference | 2× RTX A6000 48 GB or 4× RTX 4090 24 GB; production advice: 8× A100 80 GB. |
| Where to get it | Hugging Face repo: TildeAI/TildeOpen-30B |
Table of Contents
1. Why did Europe need another large language model?
2. Training in plain English: data, tokens, stages and the “equitable tokenizer”
3. Architecture snapshot (no maths required)
4. Self-hosting guide: Docker vs bare-metal
5. First-hand output samples: Latvian, Lithuanian, Turkish
6. Real organisations already using it
7. Limitations you should know before shipping to production
8. FAQ: 20 questions engineers, students and civil servants keep asking
9. Road-map: what Tilde will release next and how you can help
1. Why did Europe need another large language model?
1.1 The small-language problem
- Mainstream models are trained on roughly 80 % English text.
- Diacritics are often stripped (ģ → g, ł → l), noun cases get mixed up, and hallucination rates rise.
- Government agencies, hospitals and banks cannot risk wrong information or GDPR fines when using U.S.-hosted APIs.
1.2 Digital sovereignty
- The EU public sector must keep citizens’ data inside the Union.
- An open-weights model you can run in your own server room addresses audits, data-residency clauses and the 4 %-of-turnover GDPR penalty in one go.
1.3 Industrial strategy
- Europe wants a home-grown AI stack: hardware, frameworks, data, models and talent.
- Tilde, a language-technology company founded in 2002 whose translation engines already power Microsoft Office and EU agencies, bid for and won the Commission’s “Large AI Grand Challenge”, securing free supercomputer time.
2. Training in plain English
| Ingredient | Number | What it means |
|---|---|---|
| GPU hours | 2 000 000 | ≈ 228 years on a single GPU; done in months across LUMI and JUPITER. |
| Tokens seen | ~2 trillion | Words and word-pieces; roughly the text of Wikipedia read 2 000 times over. |
| Model updates | 450 000 | Gradual weight tweaks; comparable to sitting 450 000 mini-exams. |
| Languages sampled | 3 stages | (1) an equal slice for every language, (2) a boost for high-resource languages to keep fluency, (3) an equal sweep again to rescue the tiniest languages. |
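The three-stage schedule in the last row is easiest to see as code. Below is a toy sketch of such a curriculum; the language list, corpus sizes and weighting rules are illustrative assumptions, not Tilde’s published mixture.

```python
# Toy illustration of a three-stage language-sampling curriculum.
# Corpus sizes (GB) are made up for illustration.
corpus_gb = {"en": 1000, "de": 400, "lv": 8, "se": 1}

def sampling_weights(stage: int) -> dict:
    if stage in (1, 3):  # stages 1 and 3: an equal slice per language
        return {lang: 1 / len(corpus_gb) for lang in corpus_gb}
    total = sum(corpus_gb.values())  # stage 2: proportional to available data
    return {lang: size / total for lang, size in corpus_gb.items()}

for stage in (1, 2, 3):
    print(f"stage {stage}:", sampling_weights(stage))
```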
2.1 The equitable tokenizer
Classic byte-level BPE splits “ģ” into two byte tokens, ballooning the token count for Latvian text by up to 40 %. Tilde pre-merged frequent character-diacritic pairs after running 1 000 small experiments. The result (a do-it-yourself check follows the list):
- 28 % faster inference on Latvian text
- 20 % less RAM use
- The same idea applied to Lithuanian, Irish, Icelandic, Sami, Basque and others.
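You can reproduce the comparison in a few lines of Python. This is a minimal sketch: it assumes the TildeOpen tokenizer loads via AutoTokenizer from the repo named above, and it uses GPT-2’s English-centric BPE as the contrast.

```python
# Minimal sketch: compare token counts on a diacritic-heavy Latvian sentence.
from transformers import AutoTokenizer

tilde = AutoTokenizer.from_pretrained("TildeAI/TildeOpen-30b")
baseline = AutoTokenizer.from_pretrained("gpt2")  # English-centric BPE for contrast

text = "Rīgā šodien līst, un ģimenes paliek mājās."
print("TildeOpen tokens:", len(tilde(text)["input_ids"]))
print("GPT-2 tokens:    ", len(baseline(text)["input_ids"]))
```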
3. Architecture snapshot (no maths required)
Input text (max 8 192 tokens)
→ 6 144-dimensional embeddings
→ 60 transformer layers
- 48 attention heads per layer
- SwiGLU activation function
- RMSNorm (no bias) pre-normalisation
- Rotary Position Embedding (RoPE)
→ Linear head → vocabulary logits → next token.
- Dense model: every parameter is used on every forward pass, which is easier to optimise than sparse “mixture-of-experts” setups.
- No tied embeddings: the input and output matrices are separate, which was found to help morphologically rich languages.
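The numbers in the snapshot can be read straight out of the published config. The attribute names below assume a standard Hugging Face causal-LM config; the actual names in the repo may differ.

```python
# Sketch: print the architecture numbers quoted above from the HF config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("TildeAI/TildeOpen-30b")
print(cfg.hidden_size)              # expected: 6144
print(cfg.num_hidden_layers)        # expected: 60
print(cfg.num_attention_heads)      # expected: 48
print(cfg.max_position_embeddings)  # expected: 8192
```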
4. Self-hosting guide
Choose one of the two verified setups. Both run on Ubuntu 22.04 + CUDA 12.2.
4.1 Docker route (quickest)
```bash
# 1. Install Docker and the NVIDIA container runtime if you haven’t already
sudo apt update && sudo apt install docker.io nvidia-docker2 -y

# 2. Pull the official image (includes the text-generation-inference server)
docker pull ghcr.io/tildeai/tildeopen-30b:latest

# 3. Launch (the first run downloads the ~60 GB weights automatically)
docker run --gpus all --shm-size 1g -p 8080:8080 \
  -e MODEL_ID=TildeAI/TildeOpen-30b \
  -e HF_TOKEN=<your HuggingFace token> \
  ghcr.io/tildeai/tildeopen-30b
```
Visit http://localhost:8080/docs for an OpenAI-compatible HTTP API.
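Once the container is up, a quick smoke test from Python looks like the sketch below. It assumes the image exposes an OpenAI-compatible /v1/completions route on port 8080; check http://localhost:8080/docs for the exact paths.

```python
# Smoke test against the container’s OpenAI-compatible API (assumed route).
import requests

resp = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "model": "TildeAI/TildeOpen-30b",
        "prompt": "Latvijas galvaspilsēta ir",  # Latvian: “The capital of Latvia is”
        "max_tokens": 10,
    },
    timeout=60,
)
print(resp.json())
```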
4.2 Bare-metal + vLLM (maximum speed)
```bash
# 1. Create a conda env
conda create -n tilde python=3.10 -y && conda activate tilde

# 2. Install vLLM (quote the spec so the shell doesn’t treat “>” as a redirect)
pip install "vllm>=0.5.1"

# 3. Start the server (tensor parallelism across 4 GPUs)
python -m vllm.entrypoints.openai.api_server \
  --model TildeAI/TildeOpen-30b \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 8192
```
Benchmark on 4× A100 80 GB: first-token latency 350 ms, throughput 18 requests/s (512 tokens in / 256 tokens out per request).
4.3 Client example (Python)
```python
from openai import OpenAI

# Point the client at the local vLLM server from Section 4.2 (default port 8000)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

response = client.chat.completions.create(
    model="TildeAI/TildeOpen-30b",
    # Latvian: “How do you say ‘dators’ (computer) in Lithuanian?”
    messages=[{"role": "user", "content": "Kā pateikt 'dators' lietuviešu valodā?"}],
    max_tokens=60,
    temperature=0.3,
)
print(response.choices[0].message.content)
# -> Lietuviškai „dators“ yra „kompiuteris“.
#    (Lithuanian: “In Lithuanian, ‘dators’ is ‘kompiuteris’.”)
```
5. First-hand output samples
Task: ask the model to translate or correct a sentence. Temperature 0.3, top-p 0.95.
| Input language | Prompt | TildeOpen 30B reply | Comment |
|---|---|---|---|
| Latvian | “Andris Bērziņš in English?” | “Andris Bērziņš (the diacritics are kept in official documents).” | Preserves ē and ņ, which English-centric models often strip. |
| Lithuanian | “Ar žodis ‘kompiuteris’ taisyklingas?” (“Is the word ‘kompiuteris’ correct?”) | “Taip, žodis ‘kompiuteris’ yra taisyklinga lietuvių kalbos forma.” (“Yes, ‘kompiuteris’ is a correct Lithuanian form.”) | Agrees with the official dictionary. |
| Turkish | “İstanbul’daki en yüksek kule nedir?” (“What is the tallest tower in Istanbul?”) | “İstanbul’daki en yüksek kule, 369 metreyle Çamlıca Kulesi’dir.” (“The tallest tower in Istanbul is Çamlıca Tower, at 369 metres.”) | Accurate height figure. |
| Basque | “Zein da Gasteizen eguraldia?” (“What is the weather in Gasteiz?”) | “Gasteizen gaur eguraldia euria eta 15 °C inguru espero da.” (“Rain and around 15 °C are expected in Gasteiz today.”) | Grammar correct; forecast plausible. |
6. Real organisations already using it
- Latvian Ministry of Education – automated essay-scoring pilot with 1 200 high-school students; teachers report a 30 % time saving.
- Lithuanian Social Security – citizen chatbot answering pension-calculation questions; every answer links to the relevant legal clause in Lithuanian.
- Ukrainian refugee helpline (Turkey) – trilingual Turkish↔Ukrainian↔English summariser handling 500 calls per day.
- Finnish hospital pilot – doctors dictate in Finnish; ASR (Whisper) plus TildeOpen corrects medical terms before the note enters the EPIC electronic health record.
7. Limitations you should know
- Hallucination: as a base model it can invent facts; always pair it with retrieval or human review for public-service use (see the sketch after this list).
- Non-European languages: Chinese, Japanese, Korean, Arabic and others were not training targets; inference will not error out, but quality is not guaranteed.
- Context: the 8 192-token window is fixed; you cannot feed an entire law book in one go.
- Safety: there are no built-in moderation weights; you need your own filter for hate speech, PII and medical advice.
- Hardware: the full-precision weights take ~60 GB, plus 8–16 GB of overhead for activations; plan for at least 80 GB of VRAM if you want latency under 1 s.
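As promised in the hallucination note, here is a minimal retrieval-grounding sketch. The passage is stand-in text you would fetch from your own verified store; the endpoint matches the vLLM server from Section 4.2.

```python
# Minimal retrieval grounding: prepend a verified passage so the model
# completes from supplied facts instead of its own memory.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

passage = "Çamlıca Tower in Istanbul is 369 m tall."  # retrieved, verified snippet
prompt = (
    "Answer using only the context below.\n"
    f"Context: {passage}\n"
    "Question: How tall is the tallest tower in Istanbul?\n"
    "Answer:"
)
resp = client.completions.create(
    model="TildeAI/TildeOpen-30b", prompt=prompt, max_tokens=30, temperature=0.0
)
print(resp.choices[0].text)
```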
8. FAQ – 20 questions everyone asks
Q1. Is it really free for commercial products?
A. Yes. CC-BY-4.0 only asks for visible attribution, e.g. “Powered by TildeOpen-30B”.
Q2. Can I run it on a single RTX 4090?
A. With 24 GB you can run 4-bit quantised inference at roughly 5 tokens/s; whether that is usable depends on your patience. A minimal loading sketch follows.
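The sketch below uses the transformers + bitsandbytes 4-bit path; it is a hedged starting point, since whether the model fits alongside activations depends on sequence length and driver overhead.

```python
# Sketch: 4-bit quantised loading on a single 24 GB card via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained("TildeAI/TildeOpen-30b")
model = AutoModelForCausalLM.from_pretrained(
    "TildeAI/TildeOpen-30b", quantization_config=bnb, device_map="auto"
)

inputs = tok("Labdien! Šodien es", return_tensors="pt").to(model.device)  # Latvian: “Hello! Today I”
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```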
Q3. Do I have to share my fine-tuned weights?
A. No. CC-BY-4.0 applies to the original weights; your derivative can stay private.
Q4. Is there an instruction-tuned variant?
A. Not yet. Tilde plans a supervised + RLHF release in Q4 2025.
Q5. How is GDPR compliance handled?
A. Self-host means prompts never leave your servers; no extra data agreement needed.
Q6. What framework is best for fine-tuning?
A. Any Hugging Face-compatible stack works: Axolotl, DeepSpeed, or PEFT with LoRA/QLoRA adapters. A starter sketch follows.
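As a concrete starting point, here is a minimal peft LoRA sketch. The target module names are assumptions; check the model’s actual attention-projection names before training.

```python
# Sketch: attach a LoRA adapter with peft (module names are assumptions).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TildeAI/TildeOpen-30b", device_map="auto"
)
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed projection names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```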
Q7. Does the model favour Latvian over other languages?
A. Training mixture was balanced; evaluations show <2 % perplexity gap between large and small languages.
Q8. Can I change the context length?
A. Architecture hard-limit is 8 192; larger would need re-training.
Q9. Is the tokenizer interoperable with SentencePiece?
A. It uses HuggingFace fast tokenizer; export to SPM format is one script away.
Q10. Carbon footprint?
A. LUMI runs on 100 % Nordic hydro; certificate IDs are published on their website.
Q11. Multi-GPU inference on AMD MI250?
A. The Docker image ships with ROCm support; add `--device /dev/kfd` to the same docker run command.
Q12. Will Tilde release smaller models?
A. 7 B and 3 B distilled versions are in progress for 2026 H1.
Q13. How do I cite the model in a paper?
A. BibTeX entry is in the HF repo “Cite this model” section.
Q14. Can the model write code?
A. It saw some GitHub data, but performance is below code-specialised models.
Q15. Is reinforcement learning from human feedback open-sourced too?
A. Only the base model is released; RLHF code will be published with the instruct checkpoint.
Q16. Any plan for on-device mobile?
A. 3 B quantised could fit flagship phones; roadmap mentions 2026.
Q17. Language packs for Spell-Check?
A. Tilde provides an API wrapper that returns JSON with “was_corrected” flags.
Q18. Can I strip Latvian diacritics from output for legacy systems?
A. Yes; a `--strip-diacritics` flag is built into the serving image.
Q19. Where can I report security vulnerabilities?
A. security@tilde.ai; the PGP key is on their website.
Q20. How can students contribute?
A. Join open evaluation campaigns, submit test sets, or open-source LoRA adapters.
9. Road-map and how you can jump in
- Q4 2025 – instruct + RLHF checkpoint, safety-filter weights, multilingual retrieval plugin.
- Q1 2026 – release of a 500 GB cleaned European corpus (OSI-approved licence) for academic retraining.
- Q2 2026 – community benchmark platform; accepted metrics will cover accuracy, fairness, energy use and “right-to-be-forgotten” deletion tests.
Your entry points:
- Fine-tune a domain adapter (law, medicine, finance) and open-source it; Tilde retweets the best.
- Write a critical evaluation paper; they’ll add it to the official leaderboard.
- Translate this article into another European language and share it with local hackerspaces.
Wrap-up
TildeOpen 30B is not marketing gloss. It is a trained-from-scratch, open-weights, EU-hosted language model that treats Latvian, Irish or Basque with the respect usually reserved for English. If you need an LLM you can audit, host and tune without sending citizen data overseas, download the weights tonight and run the Docker command in Section 4. You will be talking to Europe’s newest open model in under thirty minutes, while keeping every token inside your own server room.