Texo: The Ultimate Lightweight LaTeX OCR for Math Formula Recognition

高效码农

Texo: A Lightweight, Open-Source LaTeX OCR Model for Effortless Math Formula Recognition

Have you ever encountered a complex mathematical formula in a document or image and wished you could instantly convert it into editable LaTeX code? Students, researchers, and STEM professionals often need to extract mathematical expressions from images or handwritten notes. This is where LaTeX OCR (Optical Character Recognition) tools become invaluable. Today, we introduce Texo, a free, open-source, lightweight LaTeX OCR model. With only 20 million parameters, it recognizes printed, screen-captured, and handwritten formulas with accuracy close to that of models several times its size.

What is Texo and Why Should You Care?

Texo (pronounced /ˈtɛːkoʊ/) is a minimalist, free, and open-source LaTeX OCR model specifically designed to recognize mathematical formulas and scientific documentation, converting them into LaTeX code. While numerous OCR tools exist, many are either expensive, resource-heavy, or lack precision. Texo fills this gap: it’s not only free and open-source but also compact, fast, and even capable of running directly in your web browser.

For anyone regularly working with STEM (Science, Technology, Engineering, and Mathematics) content, Texo acts as a handy assistant. It enables you to extract formulas from images for quick editing or reuse, eliminating the need to manually type complex LaTeX code. Whether you’re a student taking notes or a researcher compiling literature, Texo can save you significant time and effort.

Key Features of Texo

Free and Open-Source: Use, modify, and distribute Texo freely without any cost.
Fast and Lightweight Inference: With only 20 million parameters, inference is quick and resource-efficient.
Trainable on Consumer-Grade GPUs: If you wish to customize the model, you can do so even on personal computer hardware.
Well-Organized Code as a Tutorial: The clear code structure serves as an excellent learning resource for deep learning projects.
Runs in Your Browser: Try Texo directly online via the Texo-web project.

Trying Texo is straightforward – visit the Online Demo to get started immediately.

How Does Texo Perform? Let the Data Speak

Performance evaluation is crucial in machine learning. Texo is a distilled version of the PPFormulaNet-S model, fine-tuned on the UniMER-1M dataset. Distillation lets it retain most of the larger model's accuracy with roughly a third of the parameters (20M versus 57M). The following table compares Texo against other leading models on the UniMER-Test benchmark:

Model	Params	Metric	SPE	CPE	SCE	HWE
UniMERNet-T	107M	BLEU	0.909	0.902	0.566	0.883
		Edit Distance	0.066	0.075	0.239	0.078
PPFormulaNet-S	57M	BLEU	0.8694	0.8071	–	–
Texo-distill	20M	BLEU	0.9014	0.8909	0.7034	0.8606
		Edit Distance	0.0780	0.1042	0.1941	0.0995
Texo-transfer	~20M	BLEU	0.8597	0.8334	0.5549	0.7973
		Edit Distance	0.0980	0.1306	0.2187	0.0999

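As the table shows, Texo-distill stays close to much larger models despite having far fewer parameters. Against UniMERNet-T (107M), its BLEU is within about 0.01 on SPE and CPE (0.9014 vs. 0.909 and 0.8909 vs. 0.902), about 0.02 lower on HWE, and markedly higher on SCE (0.7034 vs. 0.566). It also scores above the 57M PPFormulaNet-S on SPE and CPE. Its Edit Distance is somewhat worse than UniMERNet-T's on SPE, CPE, and HWE, and better on SCE. This is a major advantage for users with limited computational resources.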
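The comparison can be checked with a few lines of arithmetic. The sketch below copies the parameter counts and BLEU scores from the table above and computes the size ratio and per-column BLEU gaps; the function names are illustrative and not part of Texo.

```python
# Parameter counts (millions) and BLEU scores copied from the table above.
PARAMS_M = {"UniMERNet-T": 107, "PPFormulaNet-S": 57, "Texo-distill": 20}
BLEU = {
    "UniMERNet-T": {"SPE": 0.909, "CPE": 0.902, "SCE": 0.566, "HWE": 0.883},
    "Texo-distill": {"SPE": 0.9014, "CPE": 0.8909, "SCE": 0.7034, "HWE": 0.8606},
}


def size_ratio(big: str, small: str) -> float:
    """How many times more parameters `big` has than `small`."""
    return PARAMS_M[big] / PARAMS_M[small]


def bleu_gap(model: str, baseline: str) -> dict:
    """Per-column BLEU difference (model minus baseline), rounded to 4 places."""
    return {
        col: round(BLEU[model][col] - BLEU[baseline][col], 4)
        for col in BLEU[model]
    }


if __name__ == "__main__":
    print(f"UniMERNet-T is {size_ratio('UniMERNet-T', 'Texo-distill'):.2f}x larger")
    print(bleu_gap("Texo-distill", "UniMERNet-T"))
```

A positive gap means Texo-distill scores higher than the baseline in that column; a negative gap means it scores lower.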

Understanding the Performance Metrics

BLEU: A metric measuring textual similarity; a higher score indicates the recognition result is closer to the ground truth.
Edit Distance: The minimum number of edits required to convert the recognition result into the ground truth. The values in the table are normalized to the range 0 to 1; a lower value is better.
SPE, CPE, SCE, HWE: The four subsets of the UniMER-Test benchmark: Simple Printed Expressions, Complex Printed Expressions, Screen-Captured Expressions, and Handwritten Expressions.
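To make the Edit Distance metric concrete, here is a minimal sketch of a character-level Levenshtein distance, divided by the length of the longer string. The normalization and the character-level granularity are assumptions for illustration; the benchmark's own evaluation script may tokenize or normalize differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, truth: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed normalization)."""
    longest = max(len(pred), len(truth))
    return levenshtein(pred, truth) / longest if longest else 0.0


if __name__ == "__main__":
    # One wrong character in an 11-character formula.
    print(normalized_edit_distance(r"\frac{a}{c}", r"\frac{a}{b}"))
```

A perfect prediction scores 0.0, and a prediction that shares nothing with the ground truth scores 1.0.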

Texo’s performance across these metrics demonstrates its robustness in various scenarios.

Getting Started with Texo: A Step-by-Step Guide

Configuring Your Environment

First, you need a Python environment. Texo recommends uv, a modern Python package manager that simplifies environment setup, for dependency management.

git clone https://github.com/alephpi/Texo
cd Texo
uv sync

If you haven’t used uv before, this is a great opportunity to try it. Of course, you can also configure the environment manually using other tools, ensuring all dependencies are correctly installed.

Downloading the Model

Next, download the pre-trained model. The following command fetches the model files:

# Download only the model
python scripts/python/hf_hub.py pull

If you plan to start training from the provided checkpoints, you can download the additional resources:

# Download complete resources, including useful checkpoints
python scripts/python/hf_hub.py pull --with_useful_ckpts

Running Inference

Once your environment and model are ready, you can start using Texo for formula recognition. The project provides a demonstration notebook, demo.ipynb, to help you get started quickly. Simply open the notebook, follow the instructions to load your image, and run the model to see the recognition results.

The inference process is straightforward and intuitive, suitable for both beginners and advanced users.

How to Train Your Own Texo Model?

If you’re interested in customizing the model, Texo provides a complete training pipeline. Training a LaTeX OCR model might sound complex, but Texo’s code structure simplifies the process.

Training Requirements

Training deep learning models requires certain computational resources. Here are the resource requirements for training Texo:

My Setup: 50GB CPU RAM, A40/L40S GPU (46GB VRAM).
Recommended Setup: 50GB CPU RAM, 40GB GPU VRAM.
Minimal Setup: 20GB CPU RAM (with streaming data loading) and 16GB GPU VRAM (using gradient accumulation).

If your system meets these requirements, you can begin training.
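The minimal setup relies on gradient accumulation, which trades time for memory: a large batch is split into micro-batches that fit in VRAM, and their gradients are summed before a single optimizer step. The sketch below demonstrates the idea in plain Python on a one-parameter model. It is an illustration of the technique only, not Texo's training code, and how accumulation is configured in Texo is not shown here.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate gradients over equal-size micro-batches.

    Each micro-batch gradient is scaled by 1/steps so the sum matches
    the gradient of the full batch.
    """
    assert len(xs) % micro_batch_size == 0, "micro-batches must be equal-size"
    steps = len(xs) // micro_batch_size
    total = 0.0
    for s in range(steps):
        lo, hi = s * micro_batch_size, (s + 1) * micro_batch_size
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
    print(mse_grad(1.5, xs, ys))             # full batch of 4
    print(accumulated_grad(1.5, xs, ys, 2))  # two micro-batches of 2
```

Both calls produce the same gradient, which is why a 16GB GPU can emulate the effective batch size of a larger one at the cost of more forward and backward passes per update.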

Downloading the Dataset

Texo is trained on the UniMER-1M dataset, a large-scale dataset for mathematical formula recognition containing numerous annotated images and their corresponding LaTeX code.

You can download the dataset following the original instructions, or use the pre-arranged and normalized versions I’ve prepared.

These datasets have been preprocessed, including collecting and sorting all useful KaTeX commands, ensuring a more efficient training process.
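The article does not describe how the KaTeX commands were collected, so the following is only a rough sketch of what such a step could look like: scan the LaTeX labels with a regular expression, count each command, and sort by frequency. The regular expression, the sort order, and the function names are all assumptions, not the preprocessing actually used for Texo.

```python
import re
from collections import Counter

# A backslash followed by letters (\frac, \alpha) or by one other character (\{, \\).
COMMAND_RE = re.compile(r"\\(?:[A-Za-z]+|.)")


def count_commands(formulas):
    """Count every LaTeX command that appears in a collection of formula strings."""
    counts = Counter()
    for formula in formulas:
        counts.update(COMMAND_RE.findall(formula))
    return counts


def sorted_vocabulary(formulas):
    """Commands ordered by descending frequency, ties broken alphabetically."""
    counts = count_commands(formulas)
    return sorted(counts, key=lambda cmd: (-counts[cmd], cmd))


if __name__ == "__main__":
    labels = [r"\frac{a}{b} + \alpha", r"\alpha \cdot \beta", r"\{ x \}"]
    print(sorted_vocabulary(labels))
```

A frequency-sorted list like this makes it easy to spot rare or malformed commands before they end up in a tokenizer vocabulary.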

Launching Training

Texo uses the hydra tool to manage training configurations and experiments. This allows flexible hyperparameter adjustment and easy reproduction of results.

Here are some common training commands:

# Start training
python src/train.py

# Resume training from a checkpoint
python src/train.py training.resume_from_ckpt="<ckpt_path>"

# Debug mode
python src/train.py --config-dir="./config" --config-name="train_debug.yaml"

# Train on a Slurm cluster
python src/train.py --multirun --config-dir="./config" --config-name="train_slurm.yaml"

You can find more training configuration options in the config directory and adjust them according to your needs.
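To clarify the override syntax used in the resume command, the sketch below shows how a dotted key such as training.resume_from_ckpt maps onto a nested configuration. It is a plain-Python illustration of the idea, not Hydra's parser, which also handles types, lists, and config groups. The key max_epochs and the path outputs/last.ckpt are hypothetical.

```python
def apply_override(config: dict, override: str) -> dict:
    """Apply one 'a.b.c=value' override to a nested dict, in place.

    Simplified illustration: the value is always kept as a string.
    """
    key, _, raw = override.partition("=")
    value = raw.strip('"')
    *parents, leaf = key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})  # descend, creating levels as needed
    node[leaf] = value
    return config


if __name__ == "__main__":
    # Hypothetical config; the real keys live in Texo's config directory.
    cfg = {"training": {"resume_from_ckpt": None, "max_epochs": 10}}
    apply_override(cfg, 'training.resume_from_ckpt="outputs/last.ckpt"')
    print(cfg)
```

Only the named key changes; every other value keeps whatever the YAML file specified, which is what makes command-line overrides convenient for quick experiments.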

Viewing Training Logs

During training, all results are saved in the outputs directory. You can visualize the training progress using TensorBoard:

tensorboard --logdir outputs

This displays the logged metrics, such as the loss curves, BLEU score, and edit distance, helping you monitor model performance.

Visualizing the Training Process

The following charts from the training process provide an intuitive understanding of model convergence:

[Charts: Training Loss, Validation Loss, BLEU Score, Edit Distance, Learning Rate]

These charts illustrate the model’s performance during training, helping you decide when to stop training or adjust hyperparameters.
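One common way to turn "decide when to stop" into a rule is patience-based early stopping on the validation loss. The sketch below is a generic illustration with made-up loss values; the article does not state whether Texo's training pipeline uses early stopping, and the parameter names are assumptions.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation loss has failed to improve on the
    best value by more than `min_delta` for `patience` consecutive epochs.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


if __name__ == "__main__":
    # Made-up validation losses: improvement stalls after the third epoch.
    losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73]
    print(early_stop_epoch(losses, patience=3))
```

A larger patience tolerates noisy validation curves at the cost of a few extra epochs.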

Reproducing the Entire Project: Building Texo from Scratch

If you’re interested in the implementation details of Texo or want to reproduce the entire project from scratch, please refer to my technical notes. These notes document the model design, data preprocessing, and training strategy selections in detail, making them suitable for deep learning enthusiasts and researchers.

Frequently Asked Questions (FAQ)

What is Texo?

Texo is a free, open-source LaTeX OCR model specifically designed to recognize mathematical formulas in images and convert them into LaTeX code. It is lightweight, efficient, and user-friendly.

How do you pronounce Texo?

Texo is pronounced /ˈtɛːkoʊ/, similar to “teh-ko”.

What devices can run Texo?

Texo can run on most modern computers, including personal PCs and servers. It even supports running directly in a web browser without requiring any software installation.

How can I train a Texo model?

Training a Texo model requires downloading the dataset, configuring the environment, and running the training scripts. Please refer to the “How to Train Your Own Texo Model?” section above for detailed steps.

How well does Texo perform?

Texo delivers excellent performance across multiple evaluation metrics. It maintains accuracy comparable to much larger models despite a significant reduction in parameters. Please see the performance comparison table for detailed data.

Is Texo really free?

Yes, Texo is completely free and open-source, released under the AGPL-3.0 license.

How can I contribute code or report an issue?

You can submit issues or pull requests via the GitHub repository: https://github.com/alephpi/Texo.

Acknowledgements

The development of Texo relies on and draws inspiration from many excellent open-source projects. We extend our sincere gratitude to the contributors of the following projects:

transformers: Provided the model framework, decoder, and tokenizer.
UniMERNet: Provided the dataset and image processor.
Im2Markup: Provided LaTeX preprocessing methods.
KaTeX: Used for training the tokenizer and LaTeX parsing.
my-unimernet: Provided the image processor implementation.
PaddleOCR: Provided the model architecture and pre-training weights.
PaddleOCR2Pytorch and D-FINE: Provided the model encoder implementation.
Im2Markup, LaTeX-OCR, and TrOCR: Pioneers in the LaTeX OCR field.
MixTeX and TexTeller: These projects inspired the development of Texo.
Telecom Paris: For providing the GPU cluster support.

Thank you to all the developers who contributed to these projects. Without your work, Texo would not have been possible.

License

Texo is released under the AGPL-3.0 license. You are free to use, modify, and distribute the code, but if you distribute a derivative work, or let users interact with a modified version over a network, you must make its source code available under the same license.

Final Thoughts

Texo is a powerful and flexible LaTeX OCR tool that provides excellent support for learning, research, and practical applications. We hope this article has given you a comprehensive understanding of Texo and that you start using it to streamline your workflow. If you have any questions or suggestions, please feel free to engage with the project. Let’s work together to advance the development of open-source scientific computing tools.