Enigmata: Elevating Logical Reasoning in Large Language Models

In the ever-evolving landscape of artificial intelligence, large language models (LLMs) have made remarkable strides, excelling at tasks from mathematical computation to coding. Yet they often falter on logical reasoning puzzles that require no domain-specific expertise. To bridge this gap, researchers have introduced Enigmata, a comprehensive suite designed to strengthen the puzzle-solving abilities of LLMs.

I. The Enigmata Suite: A Closer Look

(A) Enigmata-Data: A Rich Repository of Puzzles

Enigmata-Data boasts an impressive collection of 36 distinct tasks across 7 major categories. These categories encompass Cryptographic Puzzles, Arithmetic Puzzles, Logic Puzzles, Grid Puzzles, Graph Puzzles, Search Puzzles, and Sequential Puzzles. Each task is supported by an automated generator and verifier.

This innovative design offers three significant advantages:

  • Unlimited Self-Verifying Puzzle Prompts: The generator can produce a vast number of puzzle instances, each paired with a verifier that checks correctness instantly. This makes the data a natural fit for the Reinforcement Learning with Verifiable Rewards (RLVR) framework and enables long chain-of-thought training.
  • Programmable Difficulty Control: Researchers can adjust puzzle difficulty by manipulating key variables, such as grid size and blank-cell count in Binario, creating puzzles at targeted difficulty levels and enabling experiments on how curriculum design affects reinforcement learning (see the sketch after this list).
  • Scalable Sample Generation: The generator can produce any number of samples per task, ensuring balanced task representation and enabling cross-task generalization studies.
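To make the generator/verifier pattern concrete, here is a minimal Python sketch for a Binario-style task. The names (`verify_binario`, `make_puzzle`) and the simplified rule set are illustrative assumptions, not Enigmata's actual API; the real generators also emit prompts and reference solutions.

```python
import random

def verify_binario(grid: list[list[int]]) -> bool:
    """Check a filled n x n Binario grid: each row/column is half 0s and
    half 1s, contains no run of three equal cells, and no line repeats."""
    n = len(grid)
    rows = [tuple(r) for r in grid]
    cols = [tuple(c) for c in zip(*grid)]
    for line in rows + cols:
        if line.count(0) != line.count(1):            # balance rule
            return False
        if any(line[i] == line[i + 1] == line[i + 2]  # no-triples rule
               for i in range(n - 2)):
            return False
    return len(set(rows)) == n and len(set(cols)) == n  # uniqueness rule

def make_puzzle(solution: list[list[int]], n_blanks: int, seed: int = 0):
    """Programmable difficulty: blanking more cells yields a harder instance."""
    rng = random.Random(seed)
    n = len(solution)
    puzzle = [row[:] for row in solution]
    for r, c in rng.sample([(r, c) for r in range(n) for c in range(n)],
                           n_blanks):
        puzzle[r][c] = None                           # None marks a blank cell
    return puzzle

solved = [[0, 1, 0, 1],
          [1, 0, 1, 0],
          [0, 1, 1, 0],
          [1, 0, 0, 1]]
assert verify_binario(solved)
puzzle = make_puzzle(solved, n_blanks=6)  # raise n_blanks to scale difficulty
```

Because the verifier is a pure, fast function, it can double as the reward oracle in the RLVR loop described later.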

When compared to other puzzle resources, Enigmata stands out as the only dataset that covers multiple task categories, offers scalability, provides automatic verification, and is publicly available.

(B) Enigmata-Eval: A Rigorous Benchmark

Built on Enigmata-Data, the Enigmata-Eval benchmark comprises 4,758 puzzle instances spanning difficulty levels from simple to complex. Stringent sampling measures were taken to ensure no data leakage between the training and evaluation sets.
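The write-up does not spell out the deduplication procedure, but one plausible leakage guard, sketched below, is to hash a canonical serialization of every training instance and filter evaluation candidates against those hashes (all names here are illustrative):

```python
import hashlib
import json

def canonical_key(instance: dict) -> str:
    """Hash an order-independent serialization of a puzzle instance."""
    blob = json.dumps(instance, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

# Toy data standing in for the training split and eval candidates.
train_instances = [{"task": "binario", "grid": [[0, None], [None, 1]]}]
eval_candidates = [{"task": "binario", "grid": [[0, None], [None, 1]]},
                   {"task": "binario", "grid": [[None, 1], [0, None]]}]

train_keys = {canonical_key(p) for p in train_instances}
eval_set = [p for p in eval_candidates if canonical_key(p) not in train_keys]
assert len(eval_set) == 1  # the overlapping instance was filtered out
```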

(C) Enigmata-Model: An Optimized Training Approach

Enigmata-Model outlines a systematic training methodology to empower models with superior puzzle-solving capabilities. The key steps involved are as follows:

  • Rejection Fine-Tuning (RFT): This stage establishes a robust foundation of reasoning patterns by combining high-quality mathematical problems with puzzle solutions. For the puzzle component, tasks and difficulty levels are sampled uniformly from the Enigmata dataset to ensure comprehensive coverage and a balanced distribution of reasoning patterns. The mathematical component uses carefully curated problems, kept at a 1:1 ratio with the puzzle component to foster well-rounded reasoning across domains.
  • Reinforcement Learning with Verifiable Puzzles: Training then uses VC-PPO, a variant of PPO. Each task's automated verifier instantly scores model responses, and those scores guide the policy updates. For tasks with generators, examples can be created at any difficulty level; tasks relying on fixed pools are sampled directly from those pools (a sketch of both steps follows this list).
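A hedged sketch of both steps: rejection sampling keeps only verifier-approved solutions for fine-tuning, and the same verifier then supplies the binary reward that drives RL updates. `model_generate` and `verify` are stand-ins for the real model API and Enigmata's per-task verifiers; VC-PPO itself is not reproduced here.

```python
def collect_rft_data(prompts, model_generate, verify, k: int = 8):
    """Rejection fine-tuning: sample up to k candidates per prompt and keep
    the first one the task verifier accepts."""
    accepted = []
    for prompt in prompts:
        for _ in range(k):
            response = model_generate(prompt)
            if verify(prompt, response):
                accepted.append({"prompt": prompt, "response": response})
                break
    return accepted

def puzzle_reward(prompt: str, response: str, verify) -> float:
    """Binary verifiable reward used to score rollouts during RL."""
    return 1.0 if verify(prompt, response) else 0.0

# Toy usage with stub model and verifier:
data = collect_rft_data(["2+2=?"],
                        model_generate=lambda p: "4",
                        verify=lambda p, r: r.strip() == "4")
assert data and puzzle_reward("2+2=?", "4", lambda p, r: True) == 1.0
```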

II. Experiments and Achievements

(A) Experimental Setup

To evaluate model performance, the researchers used several challenging reasoning benchmarks: Enigmata-Eval, ARC-AGI 1, ARC-AGI 2, and KOR-Bench. The mathematical benchmark AIME 2024 was also included to assess generalization. All models were trained from Qwen2.5-32B-Instruct, with the RFT-plus-RL pipeline yielding the Qwen2.5-32B-Enigmata model.

(B) Key Achievements

  • Superior Performance on Puzzle Reasoning Benchmarks: Qwen2.5-32B-Enigmata reached 32.8% accuracy on the Enigmata-Eval benchmark, surpassing models like o3-mini-high (25.8%) and o1 (29.0%). On ARC-AGI 2, where even frontier models score in the low single digits, it achieved 0.6%, edging out o3-mini-high (0.4%) while trailing Gemini 2.5 Pro (1.4%) and o4-mini-high (2.6%).
  • Impressive Generalization Abilities: Beyond excelling at puzzle reasoning, Qwen2.5-32B-Enigmata generalized robustly to mathematical reasoning. Its accuracy on AIME 2024 reached 60.6%, a significant improvement over the Qwen2.5-32B-Instruct starting model (16.6%). The model thus gains puzzle-solving skill while retaining its mathematical reasoning ability, without the trade-offs often seen in multi-task training.
  • Scalability Benefits on Larger Models: When applied to larger models such as Seed1.5-Thinking (20B activated parameters, 200B total), Enigmata's puzzle data delivered further gains. On AIME 2024-2025, BeyondAIME, and GPQA (Diamond), Seed1.5-Thinking-Enigmata improved over the original Seed1.5-Thinking across multiple metrics, suggesting that Enigmata's puzzle data can strengthen the general reasoning capabilities of larger models.

III. In-Depth Analysis

(A) Performance Across Puzzle Categories

In the detailed breakdown of Enigmata-Eval, Qwen2.5-32B-Enigmata performed exceptionally well on structured reasoning categories such as Cryptographic, Arithmetic, and Logic puzzles, with accuracies of 96.0%, 93.7%, and 90.2%, respectively. This indicates that the training approach effectively cultivates rule-based reasoning under explicit constraints and patterns. The model also remained competitive on Search tasks, outperforming most baselines. Spatial and sequential tasks, however, remain relatively challenging, pointing to avenues for future research.

(B) The Impact of Training Data Volume

The researchers investigated how training data volume affects performance. In the second stage of multi-stage training, even a small amount of Enigmata-Train data significantly improved Enigmata-Eval performance while better preserving first-stage knowledge and out-of-domain (OOD) performance. As the volume of Enigmata-Train data grew, in-domain Enigmata-Eval performance improved steadily; excessive Enigmata-Train data, however, caused catastrophic forgetting and slightly degraded OOD performance.

(C) The Role of Data Difficulty Control

Comparing difficulty-distribution strategies, the researchers found that a balanced 1:1:1 difficulty ratio gave the model more stable performance on complex reasoning. Moreover, Enigmata's simple difficulty control based on difficulty tags performed comparably to historical reward variation (HRV) on Enigmata-Eval while yielding better results on OOD benchmarks.
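As an illustration, a 1:1:1 mix can be drawn directly from the difficulty tags the generators attach. The sketch below assumes pools keyed by easy/medium/hard tags, which may not match the paper's exact labels:

```python
import random

def balanced_mix(pools: dict[str, list], per_level: int, seed: int = 0) -> list:
    """Draw an equal number of instances from each difficulty tag (1:1:1)."""
    rng = random.Random(seed)
    mix = []
    for level in ("easy", "medium", "hard"):
        mix.extend(rng.sample(pools[level], per_level))
    rng.shuffle(mix)
    return mix

# Toy usage: ten tagged instances per level, five drawn from each.
pools = {lvl: [f"{lvl}-{i}" for i in range(10)]
         for lvl in ("easy", "medium", "hard")}
train_set = balanced_mix(pools, per_level=5)
```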

(D) Comparing Multi-Task Training Methods

Experiments on multi-task training revealed distinct strengths for Mix-Training RL and Multi-stage RL. Mix-Training RL, which exposes the model to the full diversity of puzzles throughout training, improved generalization. Multi-stage RL, which introduces data in stages, enabled more effective learning of complex reasoning while maintaining performance on the initial tasks.
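The operational difference between the two schedules can be summarized in a few lines. The function names are illustrative, and real RL training would draw prompt batches rather than single items:

```python
import random

def mix_training_schedule(task_pools: dict[str, list], steps: int, seed=0):
    """Mix-Training RL: every step samples from the union of all task pools."""
    rng = random.Random(seed)
    union = [item for pool in task_pools.values() for item in pool]
    return [rng.choice(union) for _ in range(steps)]

def multi_stage_schedule(stages: list[list], steps_per_stage: int, seed=0):
    """Multi-stage RL: train on each stage's pool in turn."""
    rng = random.Random(seed)
    return [rng.choice(pool)
            for pool in stages
            for _ in range(steps_per_stage)]
```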

IV. Conclusion and Future Outlook

Enigmata represents a significant advancement in the field of AI by providing a comprehensive suite that enhances the logical reasoning capabilities of LLMs. Its Enigmata-Data offers scalable, difficulty-controllable, and automatically verifiable puzzles, while its training methodology seamlessly integrates with the RLVR paradigm. Experimental results highlight Enigmata’s ability to significantly boost model performance on puzzle reasoning tasks and its strong generalization capabilities. Particularly noteworthy is its positive impact on larger models, where it further elevates performance in mathematical and STEM reasoning domains.

Looking ahead, Enigmata can be applied to a broader range of models and training algorithms, unlocking its potential across diverse fields and tasks. As the suite evolves, it is well positioned to keep advancing logical reasoning in large language models.