OpenAI Quietly Rolls Out Skills: Now Available in ChatGPT and Codex CLI

Summary: OpenAI has introduced a Skills feature to both ChatGPT and Codex CLI, modeled after Anthropic’s Skills mechanism. A “skill” is a folder containing a Markdown file and optional resources/scripts, enabling tasks like PDF processing, document handling, and plugin development. ChatGPT integrates skills via its Code Interpreter, while Codex CLI supports custom skill installation—both delivering practical, scalable AI capabilities.

If you follow AI tool advancements, you may have noticed a subtle but impactful update: OpenAI has quietly added “Skills” to ChatGPT and its open-source Codex CLI. First popularized …
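To make the “skill is just a folder” idea concrete, here is a minimal sketch that scaffolds one. The folder name, the SKILL.md filename, and the frontmatter fields are illustrative assumptions based on the description above (and Anthropic’s convention), not OpenAI’s documented layout.

```python
# Hypothetical skill scaffold: a folder with one Markdown file plus an
# optional script, mirroring the structure described in the article.
from pathlib import Path

skill = Path("skills/pdf-tools")  # assumed install location and name
(skill / "scripts").mkdir(parents=True, exist_ok=True)

# The Markdown file holds the instructions the model loads on demand.
(skill / "SKILL.md").write_text(
    "---\n"
    "name: pdf-tools\n"
    "description: Extract text and tables from PDF files.\n"
    "---\n"
    "When asked to process a PDF, run scripts/extract.py on the file.\n"
)

# Optional resource: a helper script bundled alongside the instructions.
(skill / "scripts" / "extract.py").write_text("print('extracting...')\n")
```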
Exploring DentalGPT: Revolutionizing Dental Diagnosis with Multimodal Complex Reasoning

DentalGPT is a specialized multimodal large language model (MLLM) designed for dentistry. By incorporating high-quality domain knowledge and reinforcement learning, it dramatically improves fine-grained visual understanding of dental images and diagnostic reasoning. Built on a dataset of over 120,000 dental images—the largest annotated collection to date—this 7B-parameter model outperforms many state-of-the-art general-purpose MLLMs in disease classification and dental visual question answering (VQA) tasks.

Why Dentistry Needs Advanced AI Assistance

As a dental professional or recent graduate, you know how demanding it is to interpret complex dental images—whether intraoral photographs or panoramic …
Running on a Budget, Yet Smarter—How “Money-Wise” Search Agents Break the Performance Ceiling

Keywords: budget-aware tool use, test-time scaling, search agent, BATS, Budget Tracker, cost-performance Pareto frontier

Opening: Three Quick Questions

1. Hand an agent 100 free search calls—will it actually use them?
2. If it stops at 30 and calls it a day, will more budget move the accuracy needle?
3. Can we teach the machine to check its wallet before every click?

A new joint study by Google, UCSB, and NYU says YES: “Simply letting the model see the remaining balance pushes accuracy up while keeping the tab unchanged—or even smaller.” …
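The study’s headline trick is easy to prototype: track spend and surface the remaining balance in every prompt. The sketch below is a generic illustration of that idea, not the paper’s BATS implementation; all names are made up.

```python
# Toy Budget Tracker: charge each search call and expose the balance as a
# prompt line, so the agent can reason about its remaining budget.
class BudgetTracker:
    def __init__(self, total_calls: int) -> None:
        self.total = total_calls
        self.used = 0

    @property
    def remaining(self) -> int:
        return self.total - self.used

    def charge(self) -> None:
        if self.remaining <= 0:
            raise RuntimeError("search budget exhausted")
        self.used += 1

    def prompt_hint(self) -> str:
        # The intervention the paper highlights: make the wallet visible.
        return f"[budget] {self.remaining}/{self.total} search calls left."

tracker = BudgetTracker(total_calls=100)
tracker.charge()              # one search call goes out
print(tracker.prompt_hint())  # -> "[budget] 99/100 search calls left."
```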
BEAVER: Adding a “Mathematical Guarantee” to AI Safety

Imagine this: you ask a large language model a question, and it could generate ten different answers. How do you precisely know its “confidence” in giving the correct one? The BEAVER framework provides, for the first time, a deterministic, mathematical answer to this critical question.

Here’s a tangible scenario: you instruct an LLM to generate a safe Bash command to list a directory. Most of the time, it might output ls -al. But is there a possibility, however small, that it could output a dangerous command like rm -rf /home? Before deploying …
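The quantity in play is the probability mass a model puts on a specific output. Here is a rough sketch of that computation, with a stub in place of a real model’s token scorer; BEAVER’s actual certification machinery is not reproduced here.

```python
# Illustrative only: P(output | prompt) as a product of per-token
# probabilities. `token_logprob` is a hypothetical stand-in for querying
# the LLM for log P(token | prefix).
import math

def token_logprob(prefix: str, token: str) -> float:
    return math.log(1e-6)  # fixed stub value so the sketch runs

def sequence_prob(prompt: str, tokens: list[str]) -> float:
    logp, prefix = 0.0, prompt
    for tok in tokens:
        logp += token_logprob(prefix, tok)
        prefix += tok
    return math.exp(logp)

# Probability the model emits one specific dangerous command verbatim.
p_bad = sequence_prob("Generate a safe Bash command to list a directory: ",
                      ["rm", " -rf", " /home"])
print(f"P(that exact command) = {p_bad:.1e}")
```

A deterministic guarantee of the kind BEAVER claims would then bound the total mass of all unsafe outputs, not just one sequence.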
MLE-Agent: Your Intelligent Companion for Seamless AI Engineering and Research

In today’s rapidly evolving landscape of machine learning and artificial intelligence, both seasoned researchers and aspiring engineers face a common challenge: how to efficiently and reliably transform innovative ideas into working solutions. From literature review and code implementation to debugging, optimization, and experiment management, each step can consume significant time and effort.

Allow me to introduce a powerful ally—MLE-Agent. This is not just another conceptual tool but a well-designed, comprehensive open-source assistant built to act as a “copilot” for machine learning engineers and researchers. It actively participates in your daily …
Qwen3-8B-Drama-Thinking: When AI Starts “Thinking” About Screenwriting

Core question: How does this model elevate AI scriptwriting from text generation to demonstrating creative thinking?

Qwen3-8B-Drama-Thinking is an 8-billion-parameter large language model specifically designed for screenwriting. Its breakthrough lies not in producing better scripts, but in visualizing the entire creative process on screen—wrapping three to four thousand tokens of reasoning chains within <think>…</think> tags that meticulously detail everything from thematic deconstruction and character psychology analysis to three-act structure planning. This isn’t mere text generation; it’s a “visualization” of the creative workflow.

1. Core Features: Why It’s a “Creative Thinking Partner”

Central …
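Downstream code has to separate the reasoning chain from the script itself. Here is a minimal parser for the <think>…</think> convention described above; the model’s exact output contract may differ from this sketch.

```python
# Split a generation into its reasoning chain and the script that follows.
import re

raw = "<think>Theme: betrayal. Act I plants the lie...</think>INT. KITCHEN - NIGHT"

m = re.search(r"<think>(.*?)</think>(.*)", raw, flags=re.DOTALL)
if m:
    reasoning, script = m.group(1).strip(), m.group(2).strip()
    print("reasoning:", reasoning)
    print("script:", script)
```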
Confucius Code Agent: An Open-Source AI Software Engineer Built for Industrial-Scale Codebases

Have you ever imagined having an indefatigable AI programming partner that can understand massive projects and help you fix complex bugs? Today, open-source AI coding assistants are proliferating, but when we throw them into real-world, industrial-scale codebases—often spanning millions of lines with intricately interconnected modules—they tend to “freeze.” They either get lost in lengthy context or act like amnesiacs, unable to learn from past experience.

Meanwhile, closed-source commercial tools like Cursor and Claude Code, while powerful, have internal mechanisms that are black boxes. You cannot customize them, auditing is …
InfinityStar: Unified Spacetime Autoregressive Modeling for Visual Generation

Introduction: What is InfinityStar and How Does It Address Challenges in Visual Generation?

This article aims to answer the core question: what is InfinityStar, how does it unify image and video generation tasks, and why does it improve efficiency and quality?

InfinityStar is a unified spacetime autoregressive framework designed for high-resolution image and dynamic video synthesis. It leverages recent advances in autoregressive modeling from both vision and language domains, using a purely discrete approach to jointly capture spatial and temporal dependencies in a single architecture.

Visual synthesis has seen remarkable advancements in …
Android Use: The AI Agent That Works Where Laptops Can’t

In today’s digital age, AI assistants can browse the web and operate desktop software. Yet a massive market gap remains: the workflows that happen on mobile devices, in places where a laptop can’t possibly go. Imagine a truck driver submitting paperwork from the cab, a delivery person scanning packages with a handheld device, or a field technician logging work orders on a tablet at a job site—these are the “last-meter” workflows that truly power the economy.

Today, we introduce a groundbreaking open-source project: Android Use. This is a library that …
Gemini 2.5 Flash Native Audio: When AI Voice Agents Cross the Threshold from “Functional” to “Actually Useful”

What fundamentally changed with Google’s latest Gemini 2.5 Flash Native Audio update? The model now executes complex business workflows with 71.5% multi-step accuracy, maintains 90% instruction adherence across long conversations, and preserves speaker intonation across 70+ languages—making production deployment viable for customer service, financial services, and real-time translation.

For years, the gap between AI voice demo videos and real-world deployment has been painfully obvious. Anyone who’s tested a “conversational AI” knows the familiar breaking points: “Sorry, I didn’t catch that,” awkward silence during …
Google Interactions API: The Unified Foundation for Gemini Models and Agents (2025 Guide)

Featured Snippet Answer (Perfect for Google’s Position 0)

Google Interactions API is a single RESTful endpoint (/interactions) that lets developers talk to both Gemini models (gemini-2.5-flash, gemini-3-pro-preview, etc.) and managed agents (deep-research-pro-preview-12-2025) using exactly the same interface. Launched in public beta in December 2025, it adds server-side conversation state, background execution, remote MCP tools, structured JSON outputs, and native streaming — everything modern agentic applications need that the classic generateContent endpoint couldn’t comfortably support.

Why I’m Excited About Interactions API (And You Should Be Too)

If you’ve …
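To give a feel for the shape of a call, here is a hedged sketch of hitting the single endpoint. The base URL, header, and JSON field names are assumptions for illustration; the public beta docs define the real schema.

```python
# Hypothetical request against the unified /interactions endpoint.
import requests

resp = requests.post(
    "https://generativelanguage.googleapis.com/v1beta/interactions",  # assumed URL
    headers={"x-goog-api-key": "YOUR_API_KEY"},
    json={
        "model": "gemini-2.5-flash",  # the same field would name a managed agent
        "input": "Summarize the latest agent research.",  # assumed field name
    },
    timeout=30,
)
print(resp.json())
```

The design point survives the hedging: one endpoint and one payload shape, whether the target is a model or a managed agent.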
Turn Chat into a Real Face: Inside RealVideo, the WebSocket Video-Calling Engine That Speaks Back

A plain-language walkthrough for college-level readers: how to install, tune, and deploy a live text → speech → lip-sync pipeline on two 80 GB GPUs, without writing a single line of extra code.

1. What Exactly Does RealVideo Do?

RealVideo is an open-source stack that lets you:

- Type a sentence in a browser.
- Hear an AI voice answer instantly.
- Watch a real photograph speak the answer with perfectly synced lip motion.

All three events happen in <500 ms inside one browser tab—no plug-ins, no After …
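Driving the pipeline from code follows the same path the browser takes: open a WebSocket, send text, consume the audio/video frames that stream back. The endpoint path and message schema below are assumptions, not RealVideo’s documented protocol.

```python
# Hedged client sketch for a text -> speech -> lip-sync server.
import asyncio
import json

import websockets  # pip install websockets

async def say(text: str) -> None:
    async with websockets.connect("ws://localhost:8000/chat") as ws:  # assumed path
        await ws.send(json.dumps({"type": "text", "data": text}))
        async for frame in ws:
            if isinstance(frame, str) and json.loads(frame).get("type") == "end":
                break
            # binary frames: hand synced audio/video chunks to the renderer

asyncio.run(say("Hello from the command line!"))
```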
GPT-5.2 Explained: How OpenAI’s New Model Redefines the Professional AI Assistant

Do you remember the feeling of having your days consumed by endless spreadsheets, lengthy reports, and complex code debugging? For knowledge workers, time is the most valuable currency. Now, a more powerful AI partner has arrived—one that not only understands your professional needs but can also match or even surpass industry experts in quality. This is OpenAI’s latest series of models: GPT-5.2.

Today, we’ll dive deep into every core upgrade of GPT-5.2. Let’s explore how this model, designed for “expert knowledge work” and “persistently running agents,” can actually save …
GLM-TTS: The New Open-Source Benchmark for Emotional Zero-Shot Chinese TTS

Core question most developers are asking in late 2025: is there finally a fully open-source TTS that can clone any voice with 3–10 seconds of audio, sound emotional, stream in real time, and handle Chinese polyphones accurately? The answer is yes — and it launched today.

On December 11, 2025, Zhipu AI open-sourced GLM-TTS: a production-ready, zero-shot, emotionally expressive text-to-speech system that is currently the strongest open-source Chinese TTS available.

Image credit: Official repository

Why GLM-TTS Changes Everything — In Four Bullet Points

- Zero-shot voice cloning: 3–10 s reference audio is …
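The cloning workflow reduces to “reference clip in, cloned speech out.” Below is a hedged sketch of that shape, where GLMTTS is a hypothetical wrapper rather than the repository’s actual API.

```python
# Hypothetical wrapper illustrating zero-shot cloning: condition synthesis
# on a 3-10 s reference clip of the target voice.
class GLMTTS:
    def synthesize(self, text: str, reference_wav: str) -> bytes:
        # A real implementation would load the open-source checkpoint here.
        return b"RIFF"  # placeholder WAV bytes

tts = GLMTTS()
audio = tts.synthesize(
    text="今天天气真好。",            # polyphone handling is a headline feature
    reference_wav="speaker_5s.wav",  # short reference clip of the voice
)
with open("cloned.wav", "wb") as f:
    f.write(audio)
```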
UniUGP: A Single Model That Understands, Imagines, and Drives Through the Long Tail

Why do today’s robot-cars still panic at the sight of a toppled motorcycle on a rainy night? Because they never rehearsed that scene. UniUGP fixes the rehearsal problem by turning every unlabeled video into a training partner and every language phrase into a safety hint.

1. What Exactly Is UniUGP?

UniUGP is a unified Understanding-Generation-Planning network for end-to-end autonomous driving. It consumes a short history of images plus a natural-language cue, then returns (a) a chain-of-thought explanation, (b) a physically valid future trajectory, and (c) a photo-realistic …
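The (a)/(b)/(c) contract above suggests a simple interface. This is a structural sketch only; the field names and the stub forward pass are invented for illustration.

```python
# Sketch of UniUGP's input/output contract: image history + language cue
# in; explanation, trajectory, and future video out.
from dataclasses import dataclass

@dataclass
class DrivingOutput:
    chain_of_thought: str                  # (a) textual explanation
    trajectory: list[tuple[float, float]]  # (b) future (x, y) waypoints
    future_video: bytes                    # (c) photo-realistic rollout

def drive(image_history: list[bytes], language_cue: str) -> DrivingOutput:
    # Stand-in for the unified understanding-generation-planning network.
    return DrivingOutput(
        chain_of_thought="Toppled motorcycle ahead; slow down, steer left.",
        trajectory=[(0.0, 0.0), (0.4, 1.2), (0.7, 2.5)],
        future_video=b"",
    )

out = drive(image_history=[b""], language_cue="wet road, obstacle ahead")
print(out.chain_of_thought)
```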
🚀 Breaking the Sound Barrier: An In-Depth Look at GLM-ASR-Nano-2512 and High-Performance Speech Recognition

Snippet/Abstract: GLM-ASR-Nano-2512 is an open-source speech recognition model by Zhipu AI with a compact 1.5B parameters. It achieves the lowest average error rate (4.10) among models of its class, excelling in complex acoustic environments, offering superior dialect support (e.g., Cantonese), and delivering robust performance on low-volume speech.

🌟 Introduction: The Next Generation of Acoustic-to-Text Conversion

In today’s fast-paced digital world, the need for accurate, real-time, and robust Automatic Speech Recognition (ASR) is paramount. From transcribing critical professional meetings to enabling hands-free navigation, the technology must perform flawlessly across diverse …
WhisperLiveKit: Ultra-Low-Latency Self-Hosted Speech-to-Text with Real-Time Speaker Identification

If you need a tool that converts speech to text in real time while distinguishing between different speakers, WhisperLiveKit (WLK for short) might be exactly what you’re looking for. This open-source solution specializes in ultra-low latency, self-hosted deployment, and supports real-time transcription and translation across multiple languages—making it ideal for meeting notes, accessibility tools, content creation, and more.

What Is WhisperLiveKit?

Simply put, WhisperLiveKit is a tool focused on real-time speech processing. It instantly converts spoken language into text and identifies who is speaking—a capability known technically as speaker diarization, and often called “speaker identification.” …
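A self-hosted server like this typically streams transcript events to clients over a WebSocket. The sketch below shows what consuming such a stream could look like; the URL and message shape are assumptions, not WLK’s documented contract.

```python
# Hedged consumer of a live transcript stream with speaker labels.
import asyncio
import json

import websockets  # pip install websockets

async def follow_transcript() -> None:
    async with websockets.connect("ws://localhost:8000/asr") as ws:  # assumed URL
        async for message in ws:
            # assumed shape: {"speaker": 1, "text": "...", "final": true}
            event = json.loads(message)
            if event.get("final"):
                print(f"Speaker {event['speaker']}: {event['text']}")

asyncio.run(follow_transcript())
```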
OneStory: Redefining Multi-Shot Video Generation with Adaptive Memory

Abstract: OneStory addresses the critical challenge of maintaining narrative coherence across discontinuous video shots by introducing an adaptive memory system. This framework achieves a 58.74% improvement in character consistency and supports minute-scale video generation through next-shot prediction and dynamic context compression. By reformulating multi-shot generation as an autoregressive task, it bridges the gap between single-scene video models and complex storytelling requirements.

What is Multi-Shot Video Generation?

Imagine watching a movie where scenes seamlessly transition between different locations and characters. Traditional AI video generators struggle with this “multi-shot” structure—sequences of non-contiguous clips that …
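“Next-shot prediction plus dynamic context compression” boils down to a loop: generate a shot, fold it into a bounded memory, condition the next shot on that memory. Here is a toy rendering of the loop; every function is a stand-in for the paper’s modules.

```python
# Toy autoregressive multi-shot loop with a compressed, bounded memory.
def compress(shot: str, memory: list[str], budget: int = 3) -> list[str]:
    # Dynamic context compression: keep at most `budget` shot summaries.
    return (memory + [f"summary({shot!r})"])[-budget:]

def generate_shot(beat: str, memory: list[str]) -> str:
    # Stand-in for the video model conditioned on the adaptive memory.
    return f"shot for {beat!r}, conditioned on {len(memory)} memories"

memory: list[str] = []
for beat in ["hero at home", "villain appears", "final duel"]:
    shot = generate_shot(beat, memory)  # next-shot prediction
    memory = compress(shot, memory)     # adaptive memory update
    print(shot)
```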
Google Launches Official MCP Support: Unlocking the Full Potential of AI Agents Across Services

The Evolution of AI: From Intelligent Models to Action-Oriented Agents

Artificial intelligence has undergone a remarkable transformation in recent years. With the introduction of advanced reasoning models like Gemini 3, we now possess unprecedented capabilities to learn, build, and plan. These sophisticated AI systems can process complex information and generate insightful responses. Yet a fundamental question remains: what truly transforms an intelligent model into a practical agent that can solve real-world problems on our behalf? The answer lies not just in raw intelligence, but in the ability …
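MCP itself is a JSON-RPC 2.0 protocol, so “official MCP support” ultimately means Google services answering messages like the one below. The method name follows the MCP spec; the tool name and arguments are invented for illustration.

```python
# What an MCP tool invocation looks like on the wire (JSON-RPC 2.0).
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",  # standard MCP method for invoking a tool
    "params": {
        "name": "maps_search",                     # hypothetical tool
        "arguments": {"query": "coffee near me"},  # hypothetical arguments
    },
}
print(json.dumps(request, indent=2))
```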
ChatGPT Memory System Exposed: How It Remembers 33 Facts About You Without a Database

When you ask ChatGPT what it knows about you, the response can be surprisingly personal. In one instance, it listed 33 distinct facts, ranging from a user’s name and career ambitions to their current fitness routine. This leads to a fundamental question: how does an AI model store, retrieve, and utilize this information so seamlessly?

After extensive experimentation and reverse engineering through direct interaction, a surprising discovery emerged. ChatGPT’s memory system is not the complex, vector-database-driven architecture many might assume. There is no RAG (Retrieval-Augmented Generation) …
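If the finding holds, the whole mechanism is closer to string concatenation than to retrieval: remembered facts ride along as plain text in the system prompt. A sketch of that architecture, with illustrative labels and formatting:

```python
# Memory-as-prompt: facts injected verbatim into the system prompt,
# with no vector store or retrieval step involved.
facts = [
    "User's name is Alex.",  # hypothetical remembered facts
    "User is training for a half marathon.",
    "User wants to move into ML engineering.",
]

system_prompt = (
    "You are ChatGPT.\n"
    "Facts the user has shared in past conversations:\n"
    + "\n".join(f"- {fact}" for fact in facts)
)
print(system_prompt)
```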