vLLM CLI: A User-Friendly Tool for Serving Large Language Models

If you’ve ever wanted to work with large language models (LLMs) but found the technical setup overwhelming, vLLM CLI might be exactly what you need. This powerful command-line interface tool simplifies serving LLMs using vLLM, offering both interactive and command-line modes to fit different user needs. Whether you’re new to working with AI models or an experienced developer, vLLM CLI provides features like configuration profiles, model management, and server monitoring to make your workflow smoother.

vLLM CLI Welcome Screen
Welcome screen showing GPU status and system overview

What Makes vLLM CLI Stand Out?

vLLM CLI comes packed with features designed to make serving large language models more accessible and efficient. Let’s take a closer look at what it offers:

Interactive Mode: Navigate with Ease

One of the most user-friendly aspects of vLLM CLI is its interactive mode. Instead of memorizing complex commands, you can use a menu-driven interface right in your terminal. This makes it easy to navigate through different options, whether you’re selecting a model to serve, adjusting settings, or checking server status. It’s like having a guided tour through the tool’s capabilities, perfect for those who prefer a more visual approach.

Command-Line Mode: Automate Your Workflow

For users who need to automate tasks or write scripts, command-line mode is a game-changer. You can run direct commands to serve models, check statuses, or stop servers—all without navigating through menus. This is especially useful for integrating vLLM into larger workflows or setting up recurring tasks.

Simple Model Management

Keeping track of your models doesn’t have to be a hassle. vLLM CLI automatically discovers and manages your local models, so you always know what’s available. You won’t have to waste time searching through folders or remembering file paths—everything is organized and accessible through the tool.

Serve Models from Anywhere

You don’t need to download models manually before serving them. vLLM CLI can pull models directly from HuggingFace Hub, so you can start working with a new model almost immediately, without a separate download step.

Customizable Configuration Profiles

Everyone’s needs are different, which is why vLLM CLI includes configuration profiles. These pre-set configurations cover common use cases, so you can get started quickly. If you have specific requirements, you can also create your own custom profiles to reuse later.

Keep an Eye on Your Servers

Wondering how your server is performing? vLLM CLI’s real-time monitoring feature shows you GPU usage, server status, and even streaming logs. This helps you spot issues early and ensure your models are running smoothly.

System Information at a Glance

Before you start serving a model, it’s important to know if your system can handle it. vLLM CLI provides detailed information about your GPU, memory, and CUDA compatibility, so you can avoid unexpected errors and choose the right models for your hardware.

Easy Log Viewing for Troubleshooting

If something goes wrong when starting a server, vLLM CLI makes it easy to view the complete log file. This helps you quickly identify and fix issues, whether it’s a compatibility problem or a configuration error.

What’s New in Version 0.2.2?

The latest update to vLLM CLI brings some useful improvements:

  • Model Manifest Support: You can now register custom models with vLLM CLI through a models_manifest.json file, which makes it easier to work with models that aren’t in standard locations.
  • New Documentation: There’s a new guide on serving custom models from your own directories, helping you get set up with non-standard model locations.
  • Bug Fixes: Several issues have been resolved, including problems with serving models from custom directories, and there are various improvements to the user interface.

For a full list of changes in all versions, check out the 👉CHANGELOG.md.

LoRA Adapter Support

LoRA Serving
Serve models with LoRA adapters – select base model and multiple LoRA adapters for serving

vLLM CLI now makes it easier to work with LoRA (Low-Rank Adaptation) adapters. These are lightweight sets of fine-tuned weights that adapt a base model to specific tasks without changing the original model. With vLLM CLI, you can select a base model and add multiple LoRA adapters, opening up more possibilities for customizing model behavior.

Better Model List Display

Model List Display
Comprehensive model listing showing HuggingFace models, LoRA adapters, and datasets with size information

Finding the right model is now simpler thanks to an improved model list display. The list shows all your HuggingFace models, LoRA adapters, and datasets, along with their sizes. This helps you quickly assess which models will work with your available storage and hardware.

Model Directory Management

Model Directory Management
Configure and manage custom model directories for automatic model discovery

You can now easily set up and manage custom directories where your models are stored. vLLM CLI will automatically discover models in these directories, so you don’t have to manually add them each time. This is great if you prefer to organize your models in specific locations.

How to Install vLLM CLI

Getting started with vLLM CLI is straightforward. Here’s what you need to do:

Before You Start

Make sure you have the following:

  • Python 3.11 or later
  • A CUDA-compatible GPU (recommended for best performance)
  • The vLLM package installed

Install from PyPI

The easiest way to install vLLM CLI is through PyPI, Python’s package repository. Just open your terminal and run:

pip install vllm-cli

This command will download and install the latest version of vLLM CLI and its required components.
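To make sure the installation worked, you can try the commands covered later in this post: running vllm-cli with no arguments opens the interactive interface, and vllm-cli info prints a summary of your GPU and environment.

# Quick sanity check after installing from PyPI
vllm-cli info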

Build from Source

If you prefer to build the tool from the source code (for example, if you want the latest development version), follow these steps:

  1. Clone the repository: This copies the source code to your local machine.

    git clone https://github.com/Chen-zexi/vllm-cli.git
    cd vllm-cli
    
  2. Activate your environment: Make sure you’re using the Python environment where you have vLLM installed.

  3. Install dependencies: These are additional packages that vLLM CLI needs to work.

    pip install -r requirements.txt
    pip install hf-model-tool
    
  4. Install in development mode: This lets you make changes to the code and test them without reinstalling.

    pip install -e .
    

Important Things to Know

Model Compatibility and Troubleshooting

Not all models will work the same way on every system. Here’s what you need to know:

⚠️ Model and GPU Compatibility: Whether a model works well (or at all) depends on several factors:

  • The specific model’s design and requirements
  • What your GPU can do (like its computing power and memory)
  • The version of vLLM you’re using and which features it supports

If you have trouble serving a model, try these steps:

  1. Check the server logs: vLLM provides detailed error messages that often tell you exactly what’s wrong—like missing requirements or settings that don’t work together.
  2. Look at the official vLLM documentation: The 👉vLLM docs have information about which models are supported and what settings they need.
  3. Check the model’s requirements: Some models need specific settings or a particular quantization format (a reduced-precision representation that shrinks memory use).

Managing Models with hf-model-tool

vLLM CLI uses a tool called 👉hf-model-tool to find and manage your local models. This tool, also developed by the creator of vLLM CLI, offers several helpful features:

  • It scans your HuggingFace cache and any custom directories to find all your models
  • It shows you detailed information about each model, like its size, type, and how it’s quantized
  • It shares settings with vLLM CLI, so you don’t have to configure things twice

Settings stay in sync—any model directories you set up in hf-model-tool will automatically be available in vLLM CLI, and vice versa. It’s worth checking out hf-model-tool for more advanced model management. You can even open it directly from within vLLM CLI.

# Upgrade hf-model-tool (it is installed automatically with vLLM CLI)
pip install --upgrade hf-model-tool

# Scan and manage your local models
hf-model-tool

How to Use vLLM CLI

vLLM CLI offers two main ways to use it: interactive mode (for manual control) and command-line mode (for automation).

Interactive Mode

To start the interactive interface, simply run:

vllm-cli

This opens a terminal interface with menus that you can navigate using your keyboard. It’s great for exploring the tool’s features or when you need to make quick adjustments.

Selecting Models (Including Remote Ones)

Model Selection
Model selection interface showing both local models and HuggingFace Hub auto-download option

The model selection screen shows you all the models available on your local machine and gives you the option to download and serve models directly from HuggingFace Hub. This means you can start using a model from the Hub without a separate manual download step.
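As a rough illustration, and assuming a HuggingFace repository ID can be passed wherever MODEL_NAME appears in the serve command described below (the model name here is only an example, not taken from the project’s docs), serving a Hub model from the command line might look like this:

# Hypothetical example: serve a model straight from the HuggingFace Hub,
# assuming a repository ID is accepted as the model name
vllm-cli serve Qwen/Qwen2.5-1.5B-Instruct --profile standard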

Quick Serve with Your Last Settings

Quick Serve
Quick serve feature automatically uses the last successful configuration

If you want to serve a model using the same settings you used last time, the quick serve feature makes it easy. It remembers your last successful configuration, so you can get started with just a few keystrokes.

Custom Configuration Options

Custom Configuration
Advanced configuration interface with categorized vLLM options and custom arguments

For more control, the advanced configuration interface lets you adjust specific settings for vLLM. These options are organized into categories, making it easier to find what you need. You can also add custom arguments if you have specific requirements.

Server Monitoring

Server Monitoring
Real-time server monitoring showing GPU utilization, server status, and streaming logs

Once a server is running, you can monitor it in real time. The monitoring screen shows you how much of your GPU is being used, whether the server is running properly, and even live logs. This helps you keep track of performance and spot any issues as they happen.

Command-Line Mode

For more automation, you can use vLLM CLI directly from the command line with these commands:

# Serve a model with default settings
vllm-cli serve MODEL_NAME

# Serve with a specific profile
vllm-cli serve MODEL_NAME --profile standard

# Serve with custom parameters
vllm-cli serve MODEL_NAME --quantization awq --tensor-parallel-size 2

# List available models
vllm-cli models

# Show system information
vllm-cli info

# Check active servers
vllm-cli status

# Stop a server
vllm-cli stop --port 8000

These commands let you perform common tasks without using the interactive menu. For example, you can serve a model with a single command, check which models are available, or stop a server when you’re done.
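Because vLLM CLI serves models through vLLM, a running server should also expose vLLM’s standard OpenAI-compatible HTTP API, so you can script requests against it. The sketch below assumes that API and the default port 8000 (the same port used by the stop command above); MODEL_NAME is a placeholder:

# Assumes the server exposes vLLM’s OpenAI-compatible API on port 8000
# List the models the server is exposing
curl http://localhost:8000/v1/models

# Send a simple chat completion request (MODEL_NAME is a placeholder)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MODEL_NAME", "messages": [{"role": "user", "content": "Hello!"}]}'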

Configuring vLLM CLI

vLLM CLI uses configuration files to remember your settings. Here’s where to find them and how to use them:

User Configuration Files

  • Main Config: ~/.config/vllm-cli/config.yaml – This file stores your main settings.
  • User Profiles: ~/.config/vllm-cli/user_profiles.json – This is where your custom configuration profiles are saved.
  • Cache: ~/.config/vllm-cli/cache.json – This file helps speed things up by remembering information like model lists.

Built-in Profiles

vLLM CLI comes with four pre-set profiles that cover common use cases. By default, vLLM runs on a single GPU, but all of these profiles can detect multiple GPUs and automatically enable tensor parallelism to use them all.

standard – Simple settings with smart defaults

This profile uses vLLM’s default configuration, which works well for most models and hardware setups. It’s a good starting point if you’re not sure which profile to use.

moe_optimized – Made for Mixture of Experts models

{
  "enable_expert_parallel": true
}

This profile is designed for Mixture of Experts (MoE) models, such as the MoE variants of Qwen, and enables expert parallelism, which helps these models run more efficiently.

high_throughput – Settings for maximum performance

{
  "max_model_len": 8192,
  "gpu_memory_utilization": 0.95,
  "enable_chunked_prefill": true,
  "max_num_batched_tokens": 8192,
  "trust_remote_code": true,
  "enable_prefix_caching": true
}

These settings are more aggressive, allowing the server to handle more requests at once. This profile is a good choice if you need to process a lot of queries quickly.

low_memory – For systems with limited memory

{
  "max_model_len": 4096,
  "gpu_memory_utilization": 0.70,
  "enable_chunked_prefill": false,
  "trust_remote_code": true,
  "quantization": "fp8"
}

This profile reduces memory usage with FP8 quantization (storing weights in 8-bit floating point to cut their memory footprint) and more conservative settings. It’s useful if you’re working with limited GPU memory.
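Each of these built-in profiles is selected by name with the --profile flag shown in the command-line examples above, for instance:

# Serve the same model with different built-in profiles
vllm-cli serve MODEL_NAME --profile high_throughput
vllm-cli serve MODEL_NAME --profile low_memory

If none of the built-in profiles fits, you can save your own profile (stored in the user_profiles.json file mentioned earlier) and, presumably, select it by name in the same way.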

Handling Errors and Viewing Logs

Error Handling
Interactive error recovery with log viewing options when server startup fails

If a server fails to start, vLLM CLI helps you figure out why. It offers interactive options to view logs, which contain detailed information about what went wrong. This makes it easier to fix the problem, whether it’s a configuration issue or a compatibility problem.

System Information

System Information
Comprehensive system information display showing GPU capabilities, memory, dependencies version, attention backends, and quantization support

Before serving a model, it’s helpful to know what your system can do. The system information display shows you:

  • What your GPU is capable of
  • How much memory you have
  • Which versions of important software (like dependencies) you’re using
  • Which attention backends are available
  • What types of quantization your system supports

This information helps you choose models that will work well with your hardware and avoid unnecessary errors.

How vLLM CLI is Built

Understanding the core components of vLLM CLI can help you use it more effectively. Here’s a breakdown of its main parts:

Core Components

  • CLI Module: Handles the commands you type and parses arguments.
  • Server Module: Manages the vLLM server process, starting and stopping it as needed.
  • Config Module: Takes care of configuration files and profiles.
  • Models Module: Finds models and extracts information about them.
  • UI Module: Creates the interactive terminal interface.
  • System Module: Provides tools to check GPU, memory, and other system details.
  • Validation Module: Makes sure your configurations are valid and will work.
  • Errors Module: Handles errors and provides helpful feedback.

Key Features

  • Automatic Model Discovery: Works with hf-model-tool to find all your models, no matter where they’re stored.
  • Profile System: Uses JSON-based configuration files with checks to ensure settings are valid.
  • Process Management: Keeps track of all running servers and cleans up properly when they’re stopped.
  • Caching: Remembers model lists and system information to make the tool faster.
  • Error Handling: Provides clear, helpful messages when something goes wrong, making it easier to fix issues.

Development

If you’re interested in how vLLM CLI is structured, here’s an overview of the project’s layout:

src/vllm_cli/
├── cli/           # Handles CLI commands
├── config/        # Manages configuration files
├── errors/        # Deals with error handling
├── models/        # Manages model discovery and information
├── server/        # Controls the vLLM server
├── system/        # Provides system information and utilities
├── ui/            # Creates the user interface
├── validation/    # Checks that configurations are valid
└── schemas/       # Contains JSON schemas for validation

This structure keeps different parts of the tool organized, making it easier to maintain and improve.

Environment Variables

You can customize how vLLM CLI works using these environment variables:

  • VLLM_CLI_ASCII_BOXES: Set this to use ASCII characters for boxes in the interface, which can help with compatibility on some systems.
  • VLLM_CLI_LOG_LEVEL: Change the logging level (DEBUG, INFO, WARNING, ERROR) to control how much detail is shown in logs.
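For example, to get more verbose logs from a single session, you can set the logging variable inline before launching the tool (standard shell syntax; the exact value VLLM_CLI_ASCII_BOXES expects, such as 1 or true, is an assumption):

# Run one interactive session with debug-level logging
VLLM_CLI_LOG_LEVEL=DEBUG vllm-cli

# Force ASCII box-drawing characters on terminals with limited Unicode support
VLLM_CLI_ASCII_BOXES=1 vllm-cli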

Requirements

System Requirements

  • Operating System: Linux (currently the best-supported platform)
  • GPU: NVIDIA GPU with CUDA support (At the moment, only NVIDIA GPUs are supported, but contributions to add support for other GPUs are welcome)

Python Dependencies

  • vLLM
  • PyTorch with CUDA support

When you install vLLM CLI, these additional packages are also installed:

  • hf-model-tool (for finding and managing models)
  • Rich (for creating the interactive terminal interface)
  • Inquirer (for interactive prompts)
  • psutil (for monitoring system resources)
  • PyYAML (for reading configuration files)

What’s Coming Next?

The development of vLLM CLI is ongoing, with plans for these improvements:

To-Do List

  • [ ] AMD GPU Support: Adding support for AMD GPUs (using ROCm) in addition to NVIDIA CUDA GPUs.
  • [ ] Local Model Support: Making it possible to load models from more types of local directories:

    • [ ] Oracle Cloud Infrastructure (OCI) Registry format
    • [ ] Ollama model format
    • [ ] Other local model formats

Future Enhancements

As the project grows, more features and improvements will be added based on user feedback and new needs in the field of large language models.

License

vLLM CLI is licensed under the MIT License, which means you can use, modify, and distribute it freely, as long as you include the original license and copyright notice.

Contributing

If you’d like to help improve vLLM CLI, contributions are welcome! You can open an issue to report a problem or suggest a feature, or submit a pull request with your own code changes. Every contribution helps make the tool better for everyone.

Whether you’re new to working with large language models or an experienced developer, vLLM CLI offers a user-friendly way to serve and manage LLMs. With its combination of interactive and command-line modes, automatic model discovery, and helpful monitoring features, it’s designed to make your workflow smoother and more efficient. Try it out today and see how it can simplify your work with large language models.