One Transformer, Three Modalities: Inside HyperCLOVA X 8B Omni (The Plain-English Walkthrough)
Main keywords: HyperCLOVA X 8B Omni, any-to-any multimodal, text-image-speech model, 8-billion-parameter model, Korean-first AI, OmniServe inference, open-weight license
Quick-glance answers (save you a scroll)
| Question | Short answer |
|---|---|
| What is it? | An 8-billion-parameter decoder-only model that reads & writes text, images and speech in a single forward pass. |
| Who should care? | Teams that need Korean/English multimodal AI but only have 3–4 A100s, not 40. |
| Is it really open? | Weights are downloadable. Commercial use is allowed under NAVER’s custom license (credit + no illegal use). |
| How big is the gap vs. GPT-4o? | GPT-4o is still ahead on general English, but this 8B beats many 30B-class models on Korean speech and vision benchmarks. |
| Can I run it tonight? | If you have 3×A100 80 GB and Docker, the provided OmniServe repo will get you an OpenAI-compatible endpoint in ≈30 min. |
1. Why “all-in-one” suddenly matters
Until recently you had two choices:
- Lego approach – keep an LLM, glue on a vision encoder, glue on a speech module.
  → Each piece forgets what it knew (catastrophic forgetting), and latency stacks.
- Pipelines – separate models for separate jobs.
  → More servers, more failure points, more cost.
NAVER Cloud’s bet: teach one transformer to interleave text tokens, image tokens and audio tokens inside the same sequence.
That’s HyperCLOVA X 8B Omni—Omni for omnimodal.
2. Architecture sketch (no PhD required)
```text
User input
 ├─ text  ──→ SentencePiece tokens
 ├─ image ──→ ViT features ──→ ① continuous embedding
 │                       └──→ ② TA-Tok discrete grid (27×27)
 └─ audio ──→ Whisper backbone ──→ ① continuous embedding
                             └──→ FSQ quantizer ──→ ② discrete tokens (25/s)

All tokens are concatenated → 36-layer transformer (4k hidden) → three optional heads
 ├─ text head  → next-token text
 ├─ image head → discrete codes → diffusion decoder → pixel PNG (S3 URL)
 └─ audio head → discrete codes → Unit-BigVGAN vocoder → 16 kHz WAV (S3 URL)
```
Key takeaway: one autoregressive training objective, shared weights, no separate “understand” vs “create” branches.
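If the single-sequence idea feels abstract, the sketch below shows roughly how the three modalities get flattened into one token stream. Everything here is illustrative: the embedder functions are random stand-ins, and the 4096 hidden size is read off the “4k hidden” note in the diagram rather than confirmed from the config.

```python
import torch

D_MODEL = 4096  # "4k hidden" from the sketch above; treat the exact value as an assumption

# Stand-in embedders (random tensors) just to show shapes. The real model uses
# SentencePiece + the LLM embedding table for text, a ViT / TA-Tok grid for
# images, and Whisper features + an FSQ quantizer for audio.
def embed_text(num_tokens: int) -> torch.Tensor:
    return torch.randn(num_tokens, D_MODEL)         # (T_text, d)

def embed_image_grid() -> torch.Tensor:
    return torch.randn(27 * 27, D_MODEL)            # 27×27 grid → 729 image tokens

def embed_audio(seconds: float) -> torch.Tensor:
    return torch.randn(int(25 * seconds), D_MODEL)  # 25 tokens per second of audio

def build_interleaved_sequence(text_len: int, audio_sec: float) -> torch.Tensor:
    """Flatten all modalities into ONE sequence for ONE shared transformer.

    The real model also inserts modality-boundary special tokens and positional
    information; this sketch only illustrates the shared-sequence idea.
    """
    parts = [embed_text(text_len), embed_image_grid(), embed_audio(audio_sec)]
    return torch.cat(parts, dim=0)

seq = build_interleaved_sequence(text_len=32, audio_sec=4.0)
print(seq.shape)  # torch.Size([861, 4096]) → 32 text + 729 image + 100 audio positions
```

After this concatenation, the same 36-layer decoder predicts the next token regardless of which modality it belongs to, which is exactly the “no separate understand vs create branches” point above.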
3. Training recipe (a three-course meal)
| Course | Token mix | Data size (tokens) | Goal |
|---|---|---|---|
| 1. Text pre-train | 100 % text | 1.2 T | solid Korean & English baseline |
| 2. Discrete multi-modal | 75 % image, 15 % audio, 10 % text | 302 B | treat new modalities as “extra vocab” |
| 3. Continuous alignment | 73 % vision, 15 % audio, 12 % text | 1.5 T | inject dense features for fine detail |
Tricks worth copying
- Freeze text embeddings during vocab expansion—text quality stays flat.
- Curriculum loss-masking: vision tokens get 0.5× loss for the first 1 T tokens to avoid drowning out text (a minimal sketch follows this list).
- Korean high-density OCR + cultural landmarks dataset (not in other open models).
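For the curriculum loss-masking trick, here is a minimal sketch, assuming a standard token-level cross-entropy and a boolean per-token modality mask. The 0.5× weight and the 1 T-token warm-up come from the bullet above; the function name and tensor shapes are mine.

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, targets, is_vision_token, tokens_seen,
                   warmup_tokens=1_000_000_000_000):
    """Token-level cross-entropy with down-weighted vision tokens early in training.

    logits:          (B, T, V) model outputs
    targets:         (B, T)    next-token ids
    is_vision_token: (B, T)    True where the target is an image token
    tokens_seen:     total training tokens consumed so far
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)

    # For roughly the first 1 T training tokens, vision tokens contribute only
    # 0.5x loss so the newly added modality does not drown out text quality.
    weight = torch.ones_like(per_token)
    if tokens_seen < warmup_tokens:
        weight[is_vision_token] = 0.5

    return (per_token * weight).sum() / weight.sum()

# Toy usage with random tensors
B, T, V = 2, 8, 32000
logits = torch.randn(B, T, V, requires_grad=True)
targets = torch.randint(V, (B, T))
vision_mask = torch.zeros(B, T, dtype=torch.bool)
vision_mask[:, 3:] = True  # pretend the tail of each row is image tokens
print(masked_lm_loss(logits, targets, vision_mask, tokens_seen=5e11))
```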
4. Post-training: turning a base model into a polite assistant
Four-stage supervised fine-tuning (SFT):
- Foundation chat – 50 % text, image captioning, ASR, TTS basics.
- Task explosion – 59 % vision tasks (chart QA, diagram reasoning, editing).
- Video & long context – 41 % video clips to teach temporal consistency.
- Intent chain – an internal <think> block so the model plans multi-step requests (see Appendix C in the paper).
Result: you can ask “Remove the dog from this photo and read me a Korean summary” in one sentence; the model plans two sub-tasks, calls its own tools, returns image URL + audio URL.
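If you already have the Section 7 endpoint running, a combined request can be sent through any OpenAI-compatible client. The sketch below mirrors the curl test later in this article (same base path and model name); the placeholder photo URL and the exact shape of the returned image/audio fields are assumptions to verify against a real response.

```python
# Sketch only: assumes the OmniServe endpoint from Section 7 is running locally.
# Endpoint path and model name mirror the curl test; verify the returned
# image/audio fields against an actual response from your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/b/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="track_b_model",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Remove the dog from this photo and read me a Korean summary."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/park.jpg"}},  # placeholder URL
        ],
    }],
    max_tokens=512,
)

# The assistant message should carry the edited-image URL in its content and,
# for spoken output, audio data as described in Section 8.
print(resp.choices[0].message)
```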
5. Benchmarks (numbers you can quote)
| Task | Data set | HyperCLOVA X 8B Omni | Best same-size rival | Delta |
|---|---|---|---|---|
| Korean knowledge | KMMLU-Pro | 64.9 % | 38.6 % (Qwen2.5-Omni-7B) | +26.3 pp |
| English math | GSM8K | 87.3 % | 87.0 % | tie |
| Vision QA (Eng) | LLaVA-W | 93.8 % | 88.5 % | +5.3 pp |
| Image editing | ImgEdit | 3.83 score | 1.30 (Janus-Pro-7B) | +195 % |
| Korean speech | KsponSpeech clean | WER 28.7 % | 34.9 % | -6.2 pp |
| Speech translation | Fleurs en→ko | ASR-BLEU 24.7 | 0.0 (rival fails) | ∞ |
“pp” = percentage points; lower WER is better.
Human eval (MOS, 30 listeners, Korean TTS)
HyperCLOVA X 8B Omni = 4.22 / 5
Next best rival = 3.40 / 5
→ gap is audible even on laptop speakers.
6. Hardware reality check
Minimum inference stack (taken from OmniServe docs):
| Component | GPU | VRAM | Note |
|---|---|---|---|
| Vision encoder | 1 × A100 80 GB | ≈ 8 GB | can time-share |
| LLM 8B | 1 × A100 80 GB | ≈ 16 GB | must be dedicated |
| Vision decoder | 1 × A100 80 GB | ≈ 16 GB | can time-share |
| Audio codec | shares either card | ≈ 4 GB | lightweight |
| Total | 3 × A100 80 GB | ≈ 48 GB | 4th card recommended for redundancy |
Latency numbers (single request, FP16, local SSD → S3)
- Text-only: 18 tokens/s
- 512×512 image gen: 3.4 s end-to-end
- 10 s Korean speech: 0.9 s synthesis
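These figures will vary with batch size and storage. To sanity-check the text-only number on your own cards, a rough streaming measurement like the one below is usually enough; it assumes the endpoint streams roughly one chunk per generated token, which is an approximation good for ballpark comparisons only.

```python
# Rough tokens/s check against the local OmniServe endpoint (see Section 7).
# Chunk count is used as a stand-in for token count; ballpark only.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/b/v1", api_key="not-needed")

start = time.perf_counter()
chunks = 0
stream = client.chat.completions.create(
    model="track_b_model",
    messages=[{"role": "user", "content": "Explain transformers in three sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
elapsed = time.perf_counter() - start

print(f"~{chunks / elapsed:.1f} tokens/s over {elapsed:.1f} s")
```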
7. Installation guide (copy-paste level)
Prerequisites
- Linux host with NVIDIA Driver ≥ 525 & CUDA ≥ 12.1
- Docker + Docker Compose
- S3-compatible bucket (NAVER Cloud, AWS, MinIO—doesn’t matter)
Step-by-step
```bash
# 1. Clone the official serving kit
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe

# 2. Python deps for model converter
pip install huggingface_hub safetensors torch openai easydict

# 3. Pull the 16 GB weights
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B \
  --local-dir ./models/HyperCLOVAX-SEED-Omni-8B

# 4. Convert to serving format (Track B = Omni)
python convert_model.py \
  --input ./models/HyperCLOVAX-SEED-Omni-8B \
  --output ./track_b \
  --track b

# 5. Configure S3
cp .env.example .env
# Edit:
# NCP_S3_ENDPOINT=https://s3.your-region.amazonaws.com
# NCP_S3_ACCESS_KEY=*****
# NCP_S3_SECRET_KEY=*****
# NCP_S3_BUCKET_NAME=my-omni-bucket

# 6. Build & run
docker compose --profile track-b build
docker compose --profile track-b up -d

# 7. Wait for "Model loaded" in logs
docker compose logs -f omni | grep -i loaded
```
Test (curl)
```bash
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "Describe this image"},
      {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
    ]}],
    "max_tokens": 256,
    "extra_body": {"chat_template_kwargs": {"skip_reasoning": true}}
  }'
```
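The same smoke test from Python, for teams that prefer scripting it: the JSON body mirrors the curl payload verbatim, and reading choices[0].message.content assumes the standard OpenAI-compatible response schema.

```python
# Same request as the curl test above, mirrored verbatim with requests.
import requests

payload = {
    "model": "track_b_model",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
    "max_tokens": 256,
    "extra_body": {"chat_template_kwargs": {"skip_reasoning": True}},
}

resp = requests.post(
    "http://localhost:8000/b/v1/chat/completions",
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```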
8. API patterns you will actually use
| Goal | Message structure | Extra notes |
|---|---|---|
| Image generation | Send "Draw a sunset over mountains" + a system prompt that forces a call to the t2i_model_generation tool | Tool must be listed in tools:[] |
| Image editing | Supply the original image URL + text "Remove the red car" + the same tool | Decoder supports near-native aspect ratio |
| Text-to-speech | Plain text prompt: "Say hello in a cheerful female voice" | Audio returned inside message.audio.data (base64) |
| Speech recognition | Send input_audio (base64 MP3) + text "What is being said?" | 30 s max, 16 kHz preferred |
| Video summary | Supply an MP4 URL as image_url + text "Summarize this video" | Internally sampled to 120 frames |
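To make the text-to-speech row concrete, here is a sketch that requests speech and writes the returned base64 audio to disk. The message.audio.data path comes straight from the table, and the 16 kHz WAV assumption comes from the architecture sketch in Section 2; both are worth confirming against a real response from your deployment.

```python
# Sketch for the text-to-speech row above: request speech, then decode the
# base64 audio from message.audio.data into a local file. Field names follow
# the table; verify them against an actual response.
import base64
import requests

payload = {
    "model": "track_b_model",
    "messages": [{
        "role": "user",
        "content": [{"type": "text",
                     "text": "Say hello in a cheerful female voice"}],
    }],
    "max_tokens": 256,
}

resp = requests.post("http://localhost:8000/b/v1/chat/completions",
                     json=payload, timeout=120)
resp.raise_for_status()

message = resp.json()["choices"][0]["message"]
audio_b64 = message["audio"]["data"]   # base64 audio, per the table above

with open("hello.wav", "wb") as f:     # 16 kHz WAV per Section 2
    f.write(base64.b64decode(audio_b64))
```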
9. Limitations you should plan for
- No video generation – the model only understands video; it cannot output MP4.
- Fixed semantic grid – images are tokenized to 27×27; very small text in photos may be lost.
- Audio tempo – 25 tokens/s means singing or music with strict BPM will drift.
- Knowledge cut-off – May 2025; later facts need a retrieval plug-in.
- License red lines – political campaigning, porn, deep-fake fraud, and weapons development are explicitly forbidden.
10. Real-world adoption ideas
| Sector | Combine these modalities | Business value |
|---|---|---|
| E-commerce | User photo → auto-generated product description + 15-second Korean voice-over | Faster listing, SEO-rich text, accessibility |
| Education | Math diagram upload → step-by-step Korean voice explanation | Private tutoring at scale |
| Tourism | Visitor snaps a palace photo → bilingual (En/Ko) audio story | Zero extra tour staff |
| Accessibility | Continuous camera feed → spoken Korean summaries for low-vision commuters | Compliance + inclusion |
| Game NPC | Designer concept art → in-game 2-D asset + auto-dialogue voice line | Shorter iteration loop |
11. Roadmap & takeaway
The authors spell it out: “8B is our path-finding checkpoint; scaling is next.”
If your workload is Korean-heavy and GPU-light, and you need everything in one box, HyperCLOVA X 8B Omni is today’s least-painful starter kit.
Download weights once; the same container handles text QA, image editing, speech translation and TTS without external plugins—no small feat for 8 billion parameters.
Keep an eye on the repo: NAVER has a history of dropping bigger checkpoints once the community shakes out the rough edges.
Install the serving kit tonight, benchmark against your current pipeline tomorrow, and you’ll know within a week whether “omni” is just marketing—or the real deal.
References & citation (straight from the paper)
- NAVER Cloud HyperCLOVA X Team. “HyperCLOVA X 8B Omni: An 8-Billion-Parameter Any-to-Any Multimodal Model.” arXiv submission 7126225, January 2026.
- PDF: HyperCLOVA_X_8B_Omni.pdf
- Model card: naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B (Hugging Face)
- Serving code: https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe

