Core Questions Addressed in This Article
How to deploy DeepSeek-OCR for efficient PDF-to-Markdown conversion? How to build a custom trading environment and train reinforcement learning (RL) agents using Stable-Baselines3? This article details the practical steps, application scenarios, and troubleshooting methods for both technologies.
Part 1: DeepSeek-OCR – A Powerful Tool for PDF-to-Markdown Conversion
1.1 What Is DeepSeek-OCR, and Why Choose It?
Core Question: What problems does DeepSeek-OCR solve, and what advantages does it offer over other OCR tools?
DeepSeek-OCR is a robust OCR solution designed to accurately convert PDF documents into Markdown format while supporting image OCR recognition. Built on deep learning models and integrated with a FastAPI backend, it enables efficient batch processing and REST API services. It is ideal for scenarios requiring structured document processing—whether organizing academic papers, digitizing corporate contracts, or archiving personal materials—by enabling rapid content extraction and formatting.
Compared to traditional OCR tools, its key advantages include:
- Structured output in Markdown, preserving original document elements like headings, lists, and tables
- Flexible deployment options, with quick setup of production-grade services via Docker
- Multiple processing scripts tailored to different use cases (e.g., plain text extraction, rich text conversion with images)
- GPU acceleration for significantly improved efficiency when processing large volumes of documents
1.2 Quick Start: Two Easy Ways to Get Started
Core Question: How can beginners quickly start using DeepSeek-OCR? What ready-to-use methods are available?
DeepSeek-OCR offers two convenient entry points, catering to both developers who prefer script-based processing and those needing API integration for applications.
Method 1: Batch Processing Script
Ideal for converting a batch of PDF files at once:
- Place all PDFs to be processed in the project's data/ directory
- Ensure the DeepSeek-OCR API service is running (see Section 1.3 for setup instructions)
- Run the basic conversion script:

python pdf_to_markdown_processor.py

- After conversion, Markdown files with the -MD.md suffix will be generated in the data/ directory
Method 2: REST API Service
Best for integrating OCR functionality into your own applications:
- Build and start the Docker container following the steps in Section 1.3
- Call the API endpoints via HTTP requests (see Section 1.5 for examples)
- Retrieve conversion results in JSON format for seamless integration into business workflows
Reflection: In practice, I’ve found each method has its strengths. Batch scripts work best for one-off local processing, while the API service is better suited to providing OCR as a long-term capability. Choose based on file quantity and usage frequency—scripts are simpler for fewer than 10 files, while the API is more efficient for frequent processing or cross-system calls.
1.3 Environment Preparation: Hardware & Software Requirements
Core Question: What hardware and software are needed to run DeepSeek-OCR? How to verify if your environment meets these requirements?
DeepSeek-OCR relies on deep learning models and has specific hardware requirements. Verifying your environment in advance prevents issues during deployment.
Hardware Requirements
- GPU: NVIDIA graphics card supporting CUDA 11.8+ (e.g., RTX 3090, A100)
- GPU Memory: Minimum 12GB VRAM (the model itself occupies approximately 9GB)
- System RAM: Minimum 32GB (64GB+ recommended for smoother processing of large PDFs)
- Storage: 50GB+ free space (for storing models and containers)
Software Requirements
- Python: Version 3.8 or higher (required for running local scripts)
- Docker: Version 20.10 or higher, with GPU support configured
- Docker Compose: Version 2.0 or higher
- NVIDIA Container Toolkit: Ensures Docker can access the GPU
- CUDA Drivers: A version compatible with CUDA 11.8
Useful Commands to Verify Your Environment:
# Check if your GPU supports CUDA
nvidia-smi
# Check Docker version
docker --version
# Verify Docker can access the GPU
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
If nvidia-smi displays GPU information and the Docker command outputs GPU details successfully, your basic environment meets the requirements.
1.4 Docker Deployment: From Model Download to Service Startup
Core Question: How to quickly deploy DeepSeek-OCR via Docker? What are the complete steps?
Docker deployment is recommended as it simplifies environment configuration and ensures service consistency across different systems. Below are the detailed steps:
Step 1: Download Model Weights
The model is the core of OCR functionality and must first be downloaded locally:
# Create a directory to store the model
mkdir -p models
# Method 1: Download via Hugging Face CLI (recommended)
pip install huggingface_hub
huggingface-cli download deepseek-ai/DeepSeek-OCR --local-dir models/deepseek-ai/DeepSeek-OCR
# Method 2: Clone via Git
git clone https://huggingface.co/deepseek-ai/DeepSeek-OCR models/deepseek-ai/DeepSeek-OCR
Step 2: Build and Start the Container
Use the appropriate commands based on your operating system:
For Windows Users:
REM Build the Docker image
build.bat
REM Start the service
docker-compose up -d
REM Check logs to confirm successful startup
docker-compose logs -f deepseek-ocr
For Linux/macOS Users:
# Build the Docker image
docker-compose build
# Start the service
docker-compose up -d
# Check logs
docker-compose logs -f deepseek-ocr
Step 3: Verify the Service Is Running
After starting the service, confirm its status using the health check endpoint:
curl http://localhost:8000/health
A successful response will look like this, indicating the service is ready:
{
  "status": "healthy",
  "model_loaded": true,
  "model_path": "/app/models/deepseek-ai/DeepSeek-OCR",
  "cuda_available": true,
  "cuda_device_count": 1
}
Reflection: Model downloads can take time (depending on your network). I recommend starting when your network is stable—if the download is interrupted, you can resume by re-running the command. Additionally, the first time the container starts, the model loads, which may take a few minutes. The service is only fully available when “Server started” appears in the logs.
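Because model loading can take a few minutes, it helps to poll the health endpoint before sending any documents. Below is a minimal sketch built on the /health response shown above (the timeout and polling interval are arbitrary choices you can adjust to your hardware):

import time
import requests

def wait_until_ready(base_url="http://localhost:8000", timeout=600, interval=10):
    """Poll the /health endpoint until the model is loaded or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            health = requests.get(f"{base_url}/health", timeout=5).json()
            if health.get("model_loaded") and health.get("cuda_available"):
                print("Service is ready:", health)
                return True
        except requests.RequestException:
            pass  # Container may still be starting; keep polling
        time.sleep(interval)
    return False

if __name__ == "__main__":
    if not wait_until_ready():
        raise SystemExit("DeepSeek-OCR service did not become ready in time")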
1.5 Diverse PDF Processing Scripts: Choose the Right Tool for Your Needs
Core Question: What PDF processing scripts does DeepSeek-OCR provide? Which scenarios are they each suited for, and how to use them?
The project offers 5 distinct processing scripts, each optimized for specific needs. The table below clearly outlines their differences:
| Processor Script | Prompt Used | Post-Processing | Image Extraction | Output Suffix | Use Case |
|---|---|---|---|---|---|
| pdf_to_markdown_processor.py | Standard Markdown conversion prompt | ❌ | ❌ | -MD.md | Quick conversion of PDFs to structured Markdown |
| pdf_to_markdown_processor_enhanced.py | Standard Markdown conversion prompt | ✅ | ✅ | -MD.md | Conversion requiring image preservation and fine formatting |
| pdf_to_ocr_enhanced.py | Plain OCR extraction prompt | ✅ | ✅ | -OCR.md | Text-only extraction (formatting not required) |
| pdf_to_custom_prompt.py | Custom prompt (from YAML) | ❌ | ❌ | -CUSTOM.md | Testing custom prompt effectiveness |
| pdf_to_custom_prompt_enhanced.py | Custom prompt (from YAML) | ✅ | ✅ | -CUSTOM.md | Custom processing logic with image preservation |
Detailed Usage Guide
1. Basic Markdown Conversion (pdf_to_markdown_processor.py)
Suitable for quickly converting PDFs to formatted Markdown (no image extraction):
# Place the PDF in the data directory
cp your_document.pdf data/
# Run the conversion
python pdf_to_markdown_processor.py
# View results
ls data/*-MD.md
2. Enhanced Markdown Conversion (pdf_to_markdown_processor_enhanced.py)
Ideal for scenarios requiring image preservation and fine formatting (e.g., reports with charts):
# Run the conversion
python pdf_to_markdown_processor_enhanced.py
# View results (including extracted images)
ls data/*-MD.md
ls data/images/  # Extracted images are stored here
3. Plain Text OCR Extraction (pdf_to_ocr_enhanced.py)
Best for extracting text content without formatting (e.g., content retrieval):
python pdf_to_ocr_enhanced.py
cat data/your_document-OCR.md  # View plain text results
4. Custom Prompt Processing
For special needs (e.g., extracting only tables or converting to CSV). First, configure the prompt:
# Edit custom_prompt.yaml
prompt: '<image>\n<|grounding|>Extract only tables from the document and convert them to CSV format.'
Then run the corresponding script:
# Basic version (no post-processing)
python pdf_to_custom_prompt.py
# Enhanced version (with image extraction and formatting)
python pdf_to_custom_prompt_enhanced.py
Application Scenario Examples: Academic researchers processing large volumes of papers can use pdf_to_markdown_processor_enhanced.py to preserve formulas and chart positions; corporate archivists digitizing contracts in bulk can use pdf_to_ocr_enhanced.py to extract key clause text; data analysts needing table data from reports can use the custom prompt script to get CSV-formatted results directly.
1.6 API Interface Details: From Single-File to Batch Processing
Core Question: What API interfaces does DeepSeek-OCR provide? How to call these interfaces via code for document conversion?
The API service offers flexible interfaces supporting image, PDF, and batch file processing, making it easy to integrate into various applications.
Key Interface Descriptions
1. Health Check Interface
Confirms service status (no parameters required):
GET http://localhost:8000/health
2. Single Image Processing Interface
Processes image files (PNG, JPG, etc.):
curl -X POST "http://localhost:8000/ocr/image" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@your_image.jpg"
3. PDF Processing Interface
Processes PDF files and returns results for each page:
curl -X POST "http://localhost:8000/ocr/pdf" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@your_document.pdf"
4. Batch Processing Interface
Processes multiple files simultaneously (supports mixed images and PDFs):
curl -X POST "http://localhost:8000/ocr/batch" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@image1.jpg" \
  -F "files=@document.pdf" \
  -F "files=@image2.png"
Client Code Examples
Python Client:
import requests
class DeepSeekOCRClient:
    def __init__(self, base_url="http://localhost:8000"):
        self.base_url = base_url
    
    def process_pdf(self, pdf_path):
        with open(pdf_path, 'rb') as f:
            response = requests.post(
                f"{self.base_url}/ocr/pdf",
                files={"file": f}
            )
        return response.json()
# Usage Example
client = DeepSeekOCRClient()
result = client.process_pdf("document.pdf")
if result["success"]:
    for page_result in result["results"]:
        print(f"Page {page_result['page_count']}:")
        print(page_result["result"])
        print("---")
JavaScript Client (Browser Environment):
class DeepSeekOCR {
    constructor(baseUrl = 'http://localhost:8000') {
        this.baseUrl = baseUrl;
    }
    
    async processPDF(file) {
        const formData = new FormData();
        formData.append('file', file);
        
        const response = await fetch(`${this.baseUrl}/ocr/pdf`, {
            method: 'POST',
            body: formData
        });
        
        return await response.json();
    }
}
// Usage in Browser
const ocr = new DeepSeekOCR();
document.getElementById('fileInput').addEventListener('change', async (e) => {
    const file = e.target.files[0];
    const result = await ocr.processPDF(file);
    
    if (result.success) {
        result.results.forEach(page => {
            console.log(`Page ${page.page_count}:`, page.result);
        });
    }
});
Response Format Descriptions
Single Image Response:
{
  "success": true,
  "result": "# Title from Image\n\nThis is the text content extracted from the image...",
  "page_count": 1
}
PDF Response:
{
  "success": true,
  "results": [
    {
      "success": true,
      "result": "# Page 1 Content\n...",
      "page_count": 1
    },
    {
      "success": true,
      "result": "# Page 2 Content\n...",
      "page_count": 2
    }
  ],
  "total_pages": 2,
  "filename": "document.pdf"
}
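Based on the PDF response format above, the per-page results can be stitched into a single Markdown file. A minimal sketch (field names taken from the sample response; the page separator and helper name are arbitrary choices):

def save_pdf_result_as_markdown(pdf_response, output_path):
    """Join the per-page OCR results of a PDF response into one Markdown file."""
    if not pdf_response.get("success"):
        raise ValueError("OCR request failed")
    # Sort by page_count so pages appear in document order
    pages = sorted(pdf_response["results"], key=lambda p: p["page_count"])
    parts = [p["result"] for p in pages if p.get("success")]
    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n\n---\n\n".join(parts))  # "---" used as a page separator
    print(f"Wrote {len(parts)} of {pdf_response['total_pages']} pages to {output_path}")

# Usage example (with the Python client above)
# result = client.process_pdf("document.pdf")
# save_pdf_result_as_markdown(result, "document-MD.md")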
1.7 Configuration & Optimization: Tailor the Service to Your Needs
Core Question: How to adjust DeepSeek-OCR’s configuration parameters? How to optimize performance based on hardware conditions?
By modifying configuration parameters, you can balance service performance and resource usage to meet different scenario requirements.
Key Environment Variables
Modify the following environment variables in docker-compose.yml:
environment:
  - CUDA_VISIBLE_DEVICES=0  # Specify GPU device (use "0,1" for multiple GPUs)
  - MODEL_PATH=/app/models/deepseek-ai/DeepSeek-OCR  # Model storage path
  - MAX_CONCURRENCY=50  # Maximum number of concurrent requests
  - GPU_MEMORY_UTILIZATION=0.85  # GPU memory usage rate (0.1-1.0)
Performance Optimization Strategies
1. High Throughput Scenarios (e.g., batch processing large volumes of documents)
Suitable for systems with powerful GPUs (e.g., A100):
environment:
  - MAX_CONCURRENCY=100  # Increase concurrency
  - GPU_MEMORY_UTILIZATION=0.95  # Higher GPU memory usage
2. Memory-Constrained Scenarios (e.g., GPUs with ~12GB VRAM)
Reduce resource usage to avoid out-of-memory errors:
environment:
  - MAX_CONCURRENCY=10  # Decrease concurrency
  - GPU_MEMORY_UTILIZATION=0.7  # Lower GPU memory usage
3. Multi-GPU Deployment
For servers with multiple GPUs, split service instances:
services:
  deepseek-ocr-1:
    extends:
      file: docker-compose.yml
      service: deepseek-ocr
    environment:
      - CUDA_VISIBLE_DEVICES=0
    ports:
      - "8000:8000"
  
  deepseek-ocr-2:
    extends:
      file: docker-compose.yml
      service: deepseek-ocr
    environment:
      - CUDA_VISIBLE_DEVICES=1
    ports:
      - "8001:8000"
Reflection: Parameter adjustment must align with actual hardware and business loads. I once set MAX_CONCURRENCY to 50 when processing a 500-page PDF, which caused a GPU out-of-memory error. Later, reducing it to 20 and lowering GPU_MEMORY_UTILIZATION to 0.7 ensured speed while preventing crashes. I recommend starting with small-scale tests and gradually adjusting to find the optimal configuration.
1.8 Common Issues & Solutions
Core Question: What issues might occur when using DeepSeek-OCR? How to quickly troubleshoot and resolve them?
1. Out-of-Memory Errors
Symptom: The service crashes unexpectedly, with “Out of memory” messages in logs.
Solution:
# Edit docker-compose.yml to reduce resource usage
environment:
  - MAX_CONCURRENCY=10
  - GPU_MEMORY_UTILIZATION=0.7
Then restart the service: docker-compose restart deepseek-ocr
2. Model Loading Failures
Symptom: Health check shows model_loaded: false.
Solution:
# Check model directory structure
ls -la models/deepseek-ai/DeepSeek-OCR/
# Verify the model path is correct inside the container
docker-compose exec deepseek-ocr ls -la /app/models/deepseek-ai/DeepSeek-OCR/
Ensure the model files are complete; re-download the model if necessary.
3. CUDA-Related Errors
Symptom: Logs show “CUDA out of memory” or “CUDA device not found”.
Solution:
# Check if the GPU is available
nvidia-smi
# Confirm Docker can access the GPU
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
If the command fails, reinstall the NVIDIA Container Toolkit.
4. API Connection Failures
Symptom: API calls return “connection refused” or time out.
Solution:
# Check if the service is running
docker-compose ps
# View service logs to identify errors
docker-compose logs deepseek-ocr
# Restart the service
docker-compose restart deepseek-ocr
5. Prompt Parameter Error (Fixed)
Symptom: Service fails to start, with TypeError: tokenize_with_images() missing 1 required positional argument: 'prompt' in logs.
Solution:
This issue has been fixed with custom scripts included in the project. Ensure you use the latest Docker build:
docker-compose down
docker-compose build --no-cache
docker-compose up -d
Part 2: Building Reinforcement Learning Trading Agents with Stable-Baselines3
2.1 RL in Trading: Why Do We Need a Custom Environment?
Core Question: How is reinforcement learning applied to trading scenarios? Why is a custom trading environment necessary?
Reinforcement learning (RL) enables agents to learn optimal strategies through interaction with their environment, making it a natural fit for trading, where decisions are made continuously and outcomes depend on past actions. In trading, agents must decide whether to buy, sell, or hold based on market conditions, with the goal of maximizing returns—aligning directly with RL’s objective of maximizing cumulative reward.
However, generic RL environments cannot simulate real-world trading scenarios (e.g., capital management, price fluctuations, transaction fees). A custom environment accurately models these critical elements, allowing agents to learn strategies that are more relevant to practical applications. For example, we can incorporate transaction costs, capital limits, and price volatility rules into the environment, making the trained strategies more useful.
2.2 Building a Custom Trading Environment: Core Components & Implementation
Core Question: What components make up a complete trading environment? How to implement one with code?
A custom trading environment must define the state space, action space, reward mechanism, and state transition rules. Below is an implementation based on Gymnasium:
Core Environment Code
import numpy as np
import gymnasium as gym
from gymnasium import spaces
class TradingEnv(gym.Env):
    def __init__(self, max_steps=200):
        super().__init__()
        # Action space: 0 = Hold, 1 = Buy, 2 = Sell
        self.action_space = spaces.Discrete(3)
        # Observation space: 5 features (capital, shares held, current price, price trend, step progress)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(5,), dtype=np.float32)
        self.max_steps = max_steps
        self.reset()  # Initialize the environment
    
    def reset(self, seed=None, options=None):
        """Reset the environment to its initial state"""
        super().reset(seed=seed)
        self.current_step = 0
        self.balance = 1000.0  # Initial capital
        self.shares = 0  # Initial shares held
        self.price = 100.0  # Initial price
        self.price_history = [self.price]  # Price history
        return self._get_obs(), {}
    
    def _get_obs(self):
        """Generate the current observation state"""
        # Calculate price trend (average price of the last 5 steps)
        price_trend = np.mean(self.price_history[-5:]) if len(self.price_history) >=5 else self.price
        return np.array([
            self.balance / 1000.0,  # Capital (normalized)
            self.shares / 10.0,     # Shares held (normalized)
            self.price / 100.0,     # Current price (normalized)
            price_trend / 100.0,    # Price trend (normalized)
            self.current_step / self.max_steps  # Progress ratio
        ], dtype=np.float32)
    
    def step(self, action):
        """Execute an action and return the new state, reward, termination status, etc."""
        self.current_step += 1
        
        # Simulate price fluctuations (including trend and random noise)
        trend = 0.001 * np.sin(self.current_step / 20)  # Cyclical trend
        self.price *= (1 + trend + np.random.normal(0, 0.02))  # Add random volatility
        self.price = np.clip(self.price, 50, 200)  # Limit price range
        self.price_history.append(self.price)
        
        reward = 0  # Initial reward
        
        # Execute buy action (if sufficient capital)
        if action == 1 and self.balance >= self.price:
            shares_to_buy = int(self.balance / self.price)
            cost = shares_to_buy * self.price
            self.balance -= cost
            self.shares += shares_to_buy
            reward = -0.01  # Small penalty for buying (simulates transaction fees)
        
        # Execute sell action (if holding shares)
        elif action == 2 and self.shares > 0:
            revenue = self.shares * self.price
            self.balance += revenue
            self.shares = 0
            reward = 0.01  # Small reward for selling
        
        # Calculate total portfolio value (capital + share value) and update reward
        portfolio_value = self.balance + self.shares * self.price
        reward += (portfolio_value - 1000) / 1000  # Reward linked to portfolio growth
        
        # Determine if the episode has ended (reached max steps)
        terminated = self.current_step >= self.max_steps
        truncated = False
        
        return self._get_obs(), reward, terminated, truncated, {"portfolio": portfolio_value}
    
    def render(self):
        """Print current state (for debugging)"""
        print(f"Step: {self.current_step}, Balance: ${self.balance:.2f}, Shares: {self.shares}, Price: ${self.price:.2f}")
Breakdown of Core Environment Components
- Action Space: Discrete space with 3 actions (hold, buy, sell)
- Observation Space: Continuous space with 5 features (current capital, shares held, price, etc.)
- State Reset: The reset() method initializes the environment, ensuring consistent starting points for training/testing
- State Transition: The step() method defines price fluctuation rules and how actions affect the state
- Reward Mechanism: Rewards are linked to portfolio growth, with small penalties/rewards for buying/selling (simulating transaction costs)
Application Scenario: This environment can be used to test the effectiveness of different trading strategies. For example, by comparing the performance of different RL algorithms in this environment, you can identify strategies better suited for volatile or trending markets.
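Before training, it is worth sanity-checking the environment with a random policy (as the FAQ at the end also suggests): if a random agent reliably makes money, the reward design is probably too generous. A minimal sketch using the TradingEnv defined above (the seed value is arbitrary):

# Sanity check: roll out a random policy and watch the portfolio value
env = TradingEnv()
obs, _ = env.reset(seed=42)
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # Random hold/buy/sell decisions
    obs, reward, terminated, truncated, info = env.step(action)
print(f"Final portfolio value with a random policy: ${info['portfolio']:.2f}")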
2.3 Training & Evaluation: Comparing PPO and A2C Algorithms
Core Question: How to train trading agents with Stable-Baselines3? Which algorithm—PPO or A2C—is better suited for trading scenarios?
Stable-Baselines3 provides a range of pre-built RL algorithms that can be directly used and compared. Below are the steps to train and evaluate PPO (Proximal Policy Optimization) and A2C (Advantage Actor-Critic):
Step 1: Prepare the Environment and Callback Function
The callback function monitors training progress:
from stable_baselines3.common.callbacks import BaseCallback
import numpy as np
class ProgressCallback(BaseCallback):
    def __init__(self, check_freq=1000, verbose=1):
        super().__init__(verbose)
        self.check_freq = check_freq
        self.rewards = []  # Record average reward at each checkpoint
    
    def _on_step(self):
        # Record average reward every check_freq steps
        if self.n_calls % self.check_freq == 0:
            mean_reward = np.mean([ep_info["r"] for ep_info in self.model.ep_info_buffer])
            self.rewards.append(mean_reward)
            if self.verbose:
                print(f"Steps: {self.n_calls}, Mean Reward: {mean_reward:.2f}")
        return True
Initialize and validate the environment:
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
# Create and validate the environment
env = TradingEnv()
check_env(env, warn=True)  # Check if the environment complies with specifications
print("✓ Environment validation passed!")
# Wrap the environment (monitoring and normalization)
env = Monitor(env)  # Record training data
vec_env = DummyVecEnv([lambda: env])  # Vectorize environment (supports parallelism)
vec_env = VecNormalize(vec_env, norm_obs=True, norm_reward=True)  # Normalize observations and rewards
Step 2: Train Both Algorithms
from stable_baselines3 import PPO, A2C
# Define algorithms to train
algorithms = {
    "PPO": PPO("MlpPolicy", vec_env, verbose=0, learning_rate=3e-4, n_steps=2048),
    "A2C": A2C("MlpPolicy", vec_env, verbose=0, learning_rate=7e-4),
}
results = {}  # Store training results
# Train each algorithm
for name, model in algorithms.items():
    print(f"\nTraining {name}...")
    callback = ProgressCallback(check_freq=2000, verbose=0)  # Record every 2000 steps
    model.learn(total_timesteps=50000, callback=callback, progress_bar=True)  # Total training steps: 50,000
    results[name] = {"model": model, "rewards": callback.rewards}
    print(f"✓ {name} training completed!")
Step 3: Evaluate Algorithm Performance
from stable_baselines3.common.evaluation import evaluate_policy
# Create evaluation environment
eval_env = Monitor(TradingEnv())
# Evaluate each model
for name, data in results.items():
    mean_reward, std_reward = evaluate_policy(
        data["model"], eval_env, n_eval_episodes=20, deterministic=True
    )
    results[name]["eval_mean"] = mean_reward
    results[name]["eval_std"] = std_reward
    print(f"{name}: Mean Reward = {mean_reward:.2f} +/- {std_reward:.2f}")
Result Analysis
Evaluation allows comparison of the two algorithms:
- PPO is typically more stable, making it suitable for scenarios requiring fine-tuning
- A2C trains faster, ideal for rapid iteration testing
Reflection: In practice, I found that PPO converges more slowly but delivers better final performance, while A2C trains quickly but has more volatile results. This likely stems from the fact that trading decisions require considering long-term returns—PPO’s “proximal policy optimization” mechanism better balances exploration and exploitation, making it more suitable for such scenarios.
2.4 Result Visualization: From Learning Curves to Strategy Analysis
Core Question: How to use visualization to understand an RL agent’s learning process and decision-making strategy? What are the key visualization metrics?
Visualization is a critical tool for analyzing RL agent performance, helping to understand the learning process and decision logic.
Key Visualization Code
import matplotlib.pyplot as plt
# Create a 2x2 figure
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# 1. Training Progress Curve (Learning Curve)
ax = axes[0, 0]
for name, data in results.items():
    ax.plot(data["rewards"], label=name, linewidth=2)
ax.set_xlabel("Training Checkpoints (x1000 steps)")
ax.set_ylabel("Mean Episode Reward")
ax.set_title("Training Progress Comparison")
ax.legend()
ax.grid(True, alpha=0.3)
# 2. Evaluation Performance Bar Chart
ax = axes[0, 1]
names = list(results.keys())
means = [results[n]["eval_mean"] for n in names]
stds = [results[n]["eval_std"] for n in names]
ax.bar(names, means, yerr=stds, capsize=10, alpha=0.7, color=['#1f77b4', '#ff7f0e'])
ax.set_ylabel("Mean Reward")
ax.set_title("Evaluation Performance (20 Episodes)")
ax.grid(True, alpha=0.3, axis='y')
# 3. Portfolio Value Curve for the Best Model
ax = axes[1, 0]
# Identify the best model
best_model = max(results.items(), key=lambda x: x[1]["eval_mean"])[1]["model"]
obs = eval_env.reset()[0]
portfolio_values = [1000]  # Initial portfolio value
for _ in range(200):
    action, _ = best_model.predict(obs, deterministic=True)
    obs, reward, done, truncated, info = eval_env.step(action)
    portfolio_values.append(info.get("portfolio", portfolio_values[-1]))
    if done:
        break
ax.plot(portfolio_values, linewidth=2, color='green')
ax.axhline(y=1000, color='red', linestyle='--', label='Initial Value')
ax.set_xlabel("Steps")
ax.set_ylabel("Portfolio Value ($)")
ax.set_title(f"Best Model ({max(results.items(), key=lambda x: x[1]['eval_mean'])[0]}) Portfolio Performance")
ax.legend()
ax.grid(True, alpha=0.3)
# 4. Action Distribution of the Best Model
ax = axes[1, 1]
obs = eval_env.reset()[0]
actions = []
for _ in range(200):
    action, _ = best_model.predict(obs, deterministic=True)
    actions.append(action)
    obs, _, done, truncated, _ = eval_env.step(action)
    if done:
        break
action_names = ['Hold', 'Buy', 'Sell']
action_counts = [actions.count(i) for i in range(3)]
ax.pie(action_counts, labels=action_names, autopct='%1.1f%%', startangle=90, colors=['#ff9999', '#66b3ff', '#99ff99'])
ax.set_title("Best Model Action Distribution")
plt.tight_layout()
plt.savefig('trading_results.png', dpi=150, bbox_inches='tight')
plt.show()
Interpretation of Visualization Metrics
- Learning Curve: Shows how average reward changes during training, indicating whether the model is continuously learning (an upward curve means progress)
- Evaluation Performance: Compares the average rewards and volatility of different algorithms via a bar chart, showing at a glance which algorithm performs better
- Portfolio Value Curve: Shows how the best model’s portfolio value changes over a full episode, revealing whether the strategy generates stable profits
- Action Distribution: Uses a pie chart to show the model’s decision tendencies (e.g., whether it trades frequently or prefers long-term holding)
Application Value: These visualizations help improve strategies. For example, if the model’s “Buy” actions are overrepresented, it may be overtrading (increasing costs); if the portfolio curve is highly volatile, the reward mechanism may need adjustment to encourage more stable returns.
2.5 Model Saving & Loading: Deploying and Reusing Trained Models
Core Question: How to save a trained RL model? How to load and reuse these models in future applications?
Saving trained models avoids redundant training and facilitates subsequent testing and deployment. Below are methods for saving and loading:
# Identify the best model
best_name = max(results.items(), key=lambda x: x[1]["eval_mean"])[0]
best_model = results[best_name]["model"]
# Save the model and environment normalization parameters
best_model.save(f"best_trading_model_{best_name}")
vec_env.save("vec_normalize.pkl")
# Load the model (for future use)
from stable_baselines3 import PPO  # Select the corresponding class based on the best model type
loaded_model = PPO.load(f"best_trading_model_{best_name}")
print(f"✓ Best model ({best_name}) saved and loaded successfully!")
Application Scenarios: Saved models can be used for:
- Subsequent testing on new market data
- Comparison against models trained with other algorithms
- Integration into trading simulation systems to observe long-term performance
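When reusing the saved model, the VecNormalize statistics saved above (vec_normalize.pkl) should also be restored so that observations are normalized the same way as during training. A minimal sketch (assuming the file names used above and that best_name is still in scope from the save step):

from stable_baselines3 import PPO  # Use A2C instead if A2C was the best model
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Rebuild the environment and restore the saved normalization statistics
infer_env = DummyVecEnv([lambda: TradingEnv()])
infer_env = VecNormalize.load("vec_normalize.pkl", infer_env)
infer_env.training = False      # Freeze running statistics during inference
infer_env.norm_reward = False   # Report raw (unnormalized) rewards

loaded_model = PPO.load(f"best_trading_model_{best_name}", env=infer_env)

obs = infer_env.reset()
action, _ = loaded_model.predict(obs, deterministic=True)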
Practical Summary & Action Checklist
DeepSeek-OCR Practical Action Checklist
- Environment Preparation
  - Confirm the GPU has 12GB+ VRAM and install CUDA 11.8-compatible drivers
  - Install Docker, Docker Compose, and the NVIDIA Container Toolkit
- Deployment Steps
  - Clone the model: git clone https://huggingface.co/deepseek-ai/DeepSeek-OCR models/deepseek-ai/DeepSeek-OCR
  - Build the container: docker-compose build
  - Start the service: docker-compose up -d
  - Verify health: curl http://localhost:8000/health
- File Processing
  - Quick conversion: python pdf_to_markdown_processor.py (for simple needs)
  - Conversion with images: python pdf_to_markdown_processor_enhanced.py (for complex documents)
  - Custom processing: edit custom_prompt.yaml and run python pdf_to_custom_prompt_enhanced.py
- Performance Optimization
  - High load: MAX_CONCURRENCY=100 + GPU_MEMORY_UTILIZATION=0.95
  - Low memory: MAX_CONCURRENCY=10 + GPU_MEMORY_UTILIZATION=0.7
Reinforcement Learning Trading Agent Action Checklist
- Environment Construction
  - Implement the TradingEnv class, defining the action space, observation space, and reward mechanism
  - Validate the environment with check_env()
- Training Process
  - Define ProgressCallback to monitor training progress
  - Initialize the PPO and A2C models, setting parameters like the learning rate
  - Train the models: model.learn(total_timesteps=50000, callback=callback)
- Evaluation & Visualization
  - Evaluate model performance with evaluate_policy()
  - Plot learning curves, portfolio changes, and action distributions to analyze model performance
- Model Management
  - Save the best model: best_model.save("best_trading_model")
  - Load the model: loaded_model = PPO.load("best_trading_model")
One-page Summary
- DeepSeek-OCR: A deep learning-based tool for PDF-to-Markdown conversion, supporting Docker deployment and API calls. It offers multiple processing scripts for diverse needs, ideal for document digitization and content extraction.
- Key Advantages: Structured output, GPU acceleration, flexible deployment, custom prompt support.
- Reinforcement Learning Trading Agents: Built with Stable-Baselines3 and trained with the PPO and A2C algorithms in a custom trading environment to explore optimal trading strategies.
- Key Value: Simulates real trading scenarios, compares the decision-making behavior of different algorithms, and provides references for practical trading strategies.
- Key Steps: Environment preparation → Deployment/construction → Training/processing → Evaluation → Optimization → Saving/reuse.
Frequently Asked Questions (FAQ)
- What file formats does DeepSeek-OCR support?
  It supports common formats like PDF, JPG, and PNG. PDF processing supports multi-page documents, outputting results for each page.
- Can DeepSeek-OCR run without a GPU?
  Not recommended. The model requires significant computing resources—without a GPU, processing will be extremely slow, and the model may fail to load. A minimum of 12GB VRAM (NVIDIA GPU) is required.
- How to choose the right PDF processing script?
  Use pdf_to_markdown_processor.py for simple conversion; the enhanced version for images and fine formatting; pdf_to_ocr_enhanced.py for text-only extraction; and the custom prompt scripts for special needs.
- How does the reward mechanism affect strategy during trading agent training?
  The reward mechanism directly guides the model’s learning direction. For example, if rewards are linked to short-term gains, the model may trade frequently; linking rewards to long-term gains may encourage trend-following.
- Which algorithm—PPO or A2C—is better for trading scenarios?
  PPO is generally more stable, suitable for trading scenarios requiring long-term strategies; A2C trains faster, ideal for rapid testing of different environment settings.
- What to do if DeepSeek-OCR times out when processing large PDFs?
  Split large PDFs into smaller files, reduce concurrency by adjusting MAX_CONCURRENCY, or increase the service timeout.
- How to confirm if a trading environment is well-designed?
  First, run a random strategy in the environment to check if portfolio changes align with expectations. Then compare performance across algorithms—significant differences indicate an effective environment design.
- Can trained trading agents be used for live trading?
  Direct use in live trading is not recommended. Simulated environments differ from real markets—models require extensive backtesting and risk assessment before cautious application.
