Deep Dive into A.X K1: Architecture Design and Think-Fusion Evolution of a 519B MoE Model

Snippet: A.X K1 is a 519B-parameter Mixture-of-Experts (MoE) model by SK Telecom, activating only 33B parameters for efficient inference. It introduces the Think-Fusion training recipe, enabling a unified model to switch between high-speed “intuition” and deep “reasoning” modes, setting new benchmarks in Korean and multilingual AI performance.


In the pursuit of Artificial General Intelligence (AGI), the industry faces a constant tug-of-war: how to maintain massive model capacity without skyrocketing inference costs. The newly released A.X K1 technical report provides a definitive answer. By leveraging a sparse Mixture-of-Experts (MoE) architecture and the innovative Think-Fusion mechanism, A.X K1 achieves the knowledge depth of a 500B+ model with the agility of a 33B model.

Why 519B Total Parameters with Only 33B Active?

Traditional dense models activate every single parameter for every query, leading to immense computational waste. A.X K1 utilizes a Sparse Mixture-of-Experts (MoE) architecture, functioning like a massive specialized hospital with 193 “departments” (192 routed experts plus 1 shared expert). When a query arrives, the router activates only the experts most relevant to each token.

A.X K1 Technical Specifications

  • Total Parameters: 519B
  • Active Parameters: 33B
  • Architecture: 61 layers (1 dense + 60 MoE)
  • Expert Configuration: 192 routed experts + 1 shared expert
  • Activation Strategy: 8 routed experts + 1 shared expert per token
  • Context Length: 128K (131,072 tokens)
  • Vocabulary Size: 163,840 (optimized for code and multilingual text)

This design allows A.X K1 to house a 519B-parameter knowledge base while maintaining an inference footprint comparable to a mid-sized 30B model, dramatically lowering the cost of large-scale deployment.
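To make the routing mechanics concrete, here is a minimal PyTorch sketch of top-8 routing with an always-on shared expert. The expert counts match the specification table above, but the hidden sizes, gating function, and load-balancing details are simplified assumptions, not A.X K1's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative sparse MoE layer: 192 routed experts + 1 shared expert,
    with only the top-8 routed experts activated per token. Dimensions and
    gating details are simplified assumptions."""

    def __init__(self, d_model=1024, d_ff=2048, n_experts=192, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # The shared expert processes every token regardless of routing.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                              # x: [tokens, d_model]
        scores = F.softmax(self.router(x), dim=-1)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize gate weights

        out = self.shared_expert(x)                    # shared expert: always active
        for slot in range(self.top_k):                 # routed experts: sparse activation
            for eid in top_idx[:, slot].unique():
                mask = top_idx[:, slot] == eid
                out[mask] += top_w[mask, slot, None] * self.experts[eid](x[mask])
        return out

layer = SparseMoELayer()
print(layer(torch.randn(6, 1024)).shape)  # torch.Size([6, 1024])
```

Even in this toy version, the cost per token scales with the 9 activated experts rather than all 193, which is the whole economic argument for sparse MoE.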


Think-Fusion: Seamlessly Switching Between “Fast” and “Slow” Thinking

Human cognition operates in two modes: “System 1” (fast, intuitive) for simple tasks like saying “hello,” and “System 2” (slow, logical) for working through a complex calculus problem. The Think-Fusion training recipe integrates both modes into a single model.

  • Non-thinking Mode: Ideal for simple queries, providing low-latency, direct, and concise answers to save compute resources.
  • Thinking Mode: Targeted at complex logic, programming, or math. The model generates a comprehensive Chain-of-Thought (CoT), trading increased test-time compute for higher accuracy and deeper reasoning.

By allowing users to toggle between these modes, A.X K1 optimizes the “compute-to-value” ratio—you no longer need to waste high-end reasoning power on trivial chat interactions.
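In practice, this kind of toggle is usually exposed through the chat template. The sketch below assumes a Hugging Face-style interface with an `enable_thinking` flag; the repository id, flag name, and generation settings are illustrative assumptions, not confirmed details of A.X K1's API.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository id; check the official Hugging Face page for the real one.
MODEL_ID = "skt/A.X-K1"

# For illustration only: a 519B-parameter model requires multi-GPU serving
# (e.g., SGLang or vLLM) rather than a single from_pretrained call.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)

def ask(question: str, thinking: bool) -> str:
    # `enable_thinking` mirrors the toggle used by several hybrid-reasoning chat
    # templates; whether A.X K1 uses this exact kwarg is an assumption.
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=2048 if thinking else 256)
    return tokenizer.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)

print(ask("Hello! How are you?", thinking=False))               # System 1: fast, concise
print(ask("Prove that sqrt(2) is irrational.", thinking=True))  # System 2: full chain-of-thought
```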


Forged in 10 Trillion Tokens: The Pre-training Journey

The power of A.X K1 stems from its massive pre-training corpus of 10T (10 trillion) tokens. To handle this scale, the development team implemented a sophisticated data pipeline.

1. Multi-Stage Data Processing Workflow

  • Document Parsing: Using a proprietary Vision-Language Model (VLM), the team performed layout analysis and OCR on complex PDFs. This ensures that knowledge from academic papers and technical reports—often lost in standard text scraping—is accurately preserved.
  • Synthetic Data Generation: A dual-pipeline approach was used: one for reconstructing reasoning chains from seed data and another for synthesizing knowledge based on specific themes. This fills the gap where natural data lacks deep logical progression.
  • Curriculum Learning: Data was filtered through quality, domain, and difficulty tiers, allowing the model to learn in a structured manner from simple to complex concepts.
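As a toy illustration of the curriculum idea, the following sketch buckets documents by quality and difficulty scores and orders training from easy to hard. The fields, scores, and thresholds are placeholders rather than the team's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    quality: float     # e.g., classifier score in [0, 1]
    difficulty: float  # e.g., readability / reasoning-depth score in [0, 1]
    domain: str

def curriculum_order(docs: list[Document], min_quality: float = 0.5) -> list[Document]:
    """Filter by quality, then sort surviving documents from easy to hard.
    A real pipeline would also balance domains and stage data across epochs."""
    kept = [d for d in docs if d.quality >= min_quality]
    return sorted(kept, key=lambda d: d.difficulty)

corpus = [
    Document("Basic arithmetic drills ...", quality=0.9, difficulty=0.1, domain="math"),
    Document("Olympiad-level proof ...",    quality=0.8, difficulty=0.9, domain="math"),
    Document("Low-quality scraped page",    quality=0.2, difficulty=0.3, domain="web"),
]
for doc in curriculum_order(corpus):
    print(doc.domain, doc.difficulty)   # easy math first, hard math later; web page filtered out
```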

2. Compute-Optimal Training Strategy

Following Scaling Laws, the team optimized the training configuration for a fixed computational budget:

  • Hardware Array: Initially 1,024 NVIDIA H200 GPUs, later scaled to 1,536.
  • Total Compute: The pre-training run consumed roughly 75 days of sustained training on this cluster.
  • Dual Normalization: To prevent “Loss Spikes” common in MoE training, RMSNorm was applied both before and after the MLP in MoE layers, ensuring numerical stability at scale.
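The dual-normalization trick can be sketched as follows: RMSNorm applied both before and after the expert MLP inside the residual branch, so the second norm bounds activation magnitudes before they re-enter the residual stream. The exact placement and dimensions in A.X K1's layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

class DualNormMoEBlock(nn.Module):
    """Residual MoE block with RMSNorm before AND after the expert MLP.
    The post-MLP norm keeps activation magnitudes bounded, which is the
    stability property the article attributes to dual normalization
    (exact A.X K1 placement assumed, not confirmed)."""

    def __init__(self, d_model: int, moe_mlp: nn.Module):
        super().__init__()
        self.pre_norm = RMSNorm(d_model)
        self.post_norm = RMSNorm(d_model)
        self.moe_mlp = moe_mlp

    def forward(self, x):
        return x + self.post_norm(self.moe_mlp(self.pre_norm(x)))

# A plain MLP stands in for the sparse MoE layer in this usage example.
block = DualNormMoEBlock(512, nn.Sequential(nn.Linear(512, 2048), nn.SiLU(), nn.Linear(2048, 512)))
print(block(torch.randn(4, 16, 512)).shape)   # torch.Size([4, 16, 512])
```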

High-Performance Inference: Running the Giant

Deploying a 519B model requires more than just raw power; it requires architectural ingenuity:

  • MLA (Multi-head Latent Attention): This significantly reduces KV cache memory overhead, which is critical for maintaining performance across the 128K context window.
  • FP8 Precision: Using E4M3 for forward passes and E5M2 for backward passes, the team improved training throughput by roughly 10% without sacrificing stability.
  • Multi-Token Prediction (MTP): This supports speculative decoding, allowing the model to predict multiple future tokens simultaneously, drastically increasing generation speed.
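To see why MTP-based speculative decoding helps, consider the toy verify-and-accept loop below: a cheap draft head proposes several future tokens, and the main model keeps the longest agreeing prefix in a single verification pass. This is a conceptual sketch using greedy acceptance, not SK Telecom's implementation or the full rejection-sampling scheme.

```python
def speculative_step(draft_next_tokens, verify_fn, context):
    """One speculative decoding step.

    draft_next_tokens: list of k tokens proposed by the MTP/draft head.
    verify_fn(context): returns the main model's greedy next token for `context`.
    Returns the tokens actually committed this step (always >= 1 token)."""
    accepted = []
    for proposed in draft_next_tokens:
        target = verify_fn(context + accepted)   # in practice all k positions are
        if proposed == target:                   # scored in ONE batched forward pass
            accepted.append(proposed)            # draft agreed with the main model
        else:
            accepted.append(target)              # first mismatch: keep the main
            break                                # model's token and stop
    else:
        accepted.append(verify_fn(context + accepted))  # bonus token when all drafts match
    return accepted

# Toy example: the "main model" simply counts upward.
verify = lambda ctx: ctx[-1] + 1
print(speculative_step([11, 12, 99], verify, context=[10]))   # -> [11, 12, 13]
```

The speed-up comes from committing several tokens per expensive forward pass whenever the draft agrees with the main model, while correctness is preserved because every committed token is verified.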

FAQ: Common Questions About A.X K1

Q: Is A.X K1 fine-tuned from an existing model like Llama or Qwen?
No. A.X K1 was trained from scratch; it did not initialize its parameters from any pre-existing model, so its knowledge base was built independently rather than inherited from another model's training choices.

Q: Can a 519B model be cost-effective for enterprise use?
Absolutely. Because it is an MoE model, you only “pay” for the 33B active parameters during inference. When paired with high-efficiency frameworks like SGLang, the operating cost is comparable to mid-weight models while offering much higher peak intelligence.
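As a rough sketch of what serving could look like, the snippet below queries an SGLang (or any OpenAI-compatible) endpoint from Python. The model id, port, and prompt are placeholders, and the server itself must be launched separately with the published checkpoint.

```python
from openai import OpenAI

# Assumes an SGLang server (or another OpenAI-compatible backend) is already
# running and exposing its HTTP endpoint on localhost:30000.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="skt/A.X-K1",  # placeholder model id; use the one your server registers
    messages=[{"role": "user", "content": "Give a one-paragraph summary of how MoE routing reduces inference cost."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
```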

Q: Which languages does it support best?
While it is highly proficient in English, Code, Chinese, Japanese, and Spanish, A.X K1 holds a “Sovereign AI” advantage in Korean. It is specifically optimized to understand the nuances of Korean culture, law, and social context.


Conclusion

A.X K1 is more than just a large-scale model; it is a masterclass in efficiency. By proving that a 519B-parameter model can run with the speed of a 33B model through Sparse MoE, and by introducing the Think-Fusion control mechanism, SK Telecom has set a new standard for manageable, high-intelligence AI.

For developers and enterprises looking for a model that balances deep logical “System 2” reasoning with rapid “System 1” response, A.X K1 represents the current frontier of accessible LLM technology.


Technical Note:
This article is based on the A.X K1 Technical Report (2026). For implementation details and model weights, please visit the Official Hugging Face Repository.
