Tencent Hunyuan Compact Models: The Ultimate Hands-On Guide for Developers

1 month ago 高效码农

Tencent Hunyuan 0.5B/1.8B/4B/7B Compact Models: A Complete Hands-On Guide. From download to production deployment: no hype, just facts.

Quick answers to the three most-asked questions:

- "I only have one RTX 4090. Which model can I run?" 7B fits in 24 GB of VRAM; if you need even more headroom, use 4B or 1.8B.
- "Where do I download the files?" GitHub mirrors and Hugging Face hubs are both live; git clone and browser downloads both work.
- "How fast is 'fast'?" 7B on a single card with vLLM BF16 gives < 200 ms time-to-first-token; 4-bit quantization shaves another …
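The "7B fits in 24 GB" claim follows from simple arithmetic on weight storage. A back-of-the-envelope sketch (it counts only model weights and ignores KV cache, activations, and framework overhead, which is why the article recommends quantization for extra headroom):

```python
def weight_memory_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Approximate VRAM needed just for the model weights, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

# BF16 stores 2 bytes per parameter; a 4-bit quant stores 0.5 bytes per parameter.
bf16_7b = weight_memory_gb(7, 2.0)   # 14.0 GB of weights -> fits a 24 GB RTX 4090
int4_7b = weight_memory_gb(7, 0.5)   # 3.5 GB of weights -> large headroom for KV cache
```

The ~10 GB left over in the BF16 case is what the runtime actually needs for the KV cache and activations at realistic batch sizes, which is why the smaller 4B and 1.8B variants are suggested when more headroom is required.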

Efficient LLM Deployment on Ascend NPUs: Pangu Embedded & Pro MoE Guide

2 months ago 高效码农

Efficient LLM Deployment on Ascend NPUs: Pangu Embedded & Pangu Pro MoE. In this post, we explore two complementary solutions from Huawei's Pangu team, Pangu Embedded and Pangu Pro MoE, designed for low-latency and high-throughput inference on Ascend NPUs. Drawing exclusively on official technical reports, we translate and adapt the core concepts into clear, engaging English suitable for junior-college-level readers worldwide. We preserve every detail of system design, training methodology, and deployment best practices to deliver genuine, long-term value without clickbait or hype.

Table of Contents:
- Why Efficient Inference Matters
- Pangu Embedded: Fast & Slow Thinking with Metacognition
- Dual-System Framework
…