Heretic: The Complete Guide to Automatically Removing Censorship from Language Models
In the rapidly evolving landscape of artificial intelligence, language models have become indispensable assistants in our work and daily lives. However, the built-in “safety alignment” mechanisms—what we commonly refer to as censorship functions—often limit models’ creativity and practical utility. Imagine asking an AI model a sensitive but legitimate question, only to receive a mechanical refusal to answer. This experience can be incredibly frustrating.
Enter Heretic, a tool that’s changing this status quo. It can automatically remove censorship mechanisms from language models without requiring expensive retraining. Whether you’re a researcher, developer, or simply an AI enthusiast, Heretic can help you unlock the full potential of language models.
Understanding Heretic: What It Is and Why It Matters
Heretic is a specialized tool designed for transformer-based language models. It employs directional ablation (known in academic circles as “abliteration”), combined with a parameter optimizer based on TPE (Tree-structured Parzen Estimators), to precisely remove model censorship mechanisms.
The most remarkable feature of this tool is its complete automation. Heretic automatically finds optimal ablation parameters by co-minimizing refusal counts and KL divergence from the original model. This means even if you have zero understanding of transformer internals, as long as you can run command-line programs, you can use Heretic to remove censorship from language models.
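To make the optimization target concrete, here is a minimal Python sketch of the kind of score such an optimizer might minimize. The function name and the equal weighting are illustrative assumptions, not Heretic’s actual objective, which is internal to the tool.

```python
# Hypothetical sketch of a combined objective: lower is better when the
# model refuses less AND drifts less from the original (lower KL divergence).
# The weighting here is an assumption for illustration only.

def ablation_score(refusal_count: int, total_prompts: int,
                   kl_divergence: float) -> float:
    """Score an ablation trial: fewer refusals and less drift both help."""
    refusal_rate = refusal_count / total_prompts
    return refusal_rate + kl_divergence

# A trial refusing 3/100 prompts with KL divergence 0.16 scores better
# (lower) than the unmodified model refusing 97/100 with KL 0.0:
censored = ablation_score(97, 100, 0.0)
ablated = ablation_score(3, 100, 0.16)
print(ablated < censored)  # True
```

A TPE optimizer repeatedly proposes parameter sets, evaluates a score like this, and concentrates its search where scores are low.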
The Problem with Language Model Censorship
Language model censorship mechanisms were originally designed to prevent models from generating harmful, biased, or inappropriate content. However, these safety measures often prove overly conservative, causing models to refuse answering even legitimate, constructive questions. For instance, when researchers try to study sensitive social phenomena, or when writers want to create stories involving complex moral issues, restricted models often fail to provide valuable assistance.
Heretic’s goal isn’t to create completely unconstrained AI, but to remove unnecessary restrictions while maintaining model intelligence and usefulness. Models processed by Heretic retain their original knowledge and capabilities while becoming more open and helpful when responding to various user queries.
Measuring Heretic’s Effectiveness: Real Data
To objectively evaluate Heretic’s performance, consider the results of tests on Google’s Gemma-3-12B model, comparing the original model, manually abliterated versions, and Heretic’s automated output.
The data clearly shows that Heretic’s automated processing achieves the same reduction in refusal rates (from 97% to 3%) as manual processing, while achieving significantly lower KL divergence. KL divergence measures the difference between processed and original models—lower values indicate better preservation of model capabilities. Heretic’s KL divergence of just 0.16 substantially outperforms other versions, demonstrating its ability to remove censorship while maximally preserving original model intelligence.
If you wish to verify these results, you can use Heretic’s built-in evaluation functionality:

`heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic`

Note that specific values may vary slightly depending on platform and hardware differences.
Model Compatibility: What Works with Heretic?
Heretic is compatible with most dense models, including multimodal models and various mixture-of-experts architectures. However, it currently does not support state space models (SSMs) and hybrid models, models with inhomogeneous layers, or certain novel attention systems.
You can find a collection of models processed using Heretic on the Hugging Face platform, forming what’s known as “The Bestiary”—a rich resource for research and applications.
Getting Started with Heretic: A Practical Guide
Using Heretic is straightforward and can be completed in just a few steps:
Environment Preparation
First, ensure your system meets these requirements:
- Python 3.10 or higher
- PyTorch 2.2 or higher, appropriate for your hardware
Installation and Execution
- Install the Heretic package: `pip install heretic-llm`
- Run Heretic on your chosen model: `heretic Qwen/Qwen3-4B-Instruct-2507`
You can replace the model name above with any model you wish to process. The entire process is fully automatic and requires no configuration.
Advanced Configuration
Although Heretic comes with sensible default configurations, it offers numerous parameters for advanced users. You can explore available options by:
- Running `heretic --help` to view command-line options
- Referencing the `config.default.toml` configuration file for detailed settings
Processing Time Considerations
Heretic performs system benchmarking at program start to determine optimal batch sizes, making the most of available hardware resources. Processing time varies by model size and hardware performance—on an RTX 3090, processing the Llama-3.1-8B model with default configuration takes approximately 45 minutes.
Post-Processing Options
After Heretic completes model processing, you can choose to:
- Save the model locally
- Upload it to the Hugging Face platform
- Chat with the model to test its performance
- Or perform all of the above simultaneously
The Technology Behind Heretic: How It Works
To understand Heretic’s operation, we need some background knowledge.
What is Directional Ablation?
Directional ablation is a technique for precisely modifying neural networks. It works by identifying “directions” within the model associated with specific behaviors (like refusal to answer), then selectively suppressing the influence of these directions.
Technically, Heretic performs the following operations for each supported transformer component (currently including attention out-projection and MLP down-projection):
- Identifies the relevant matrices in each transformer layer
- Orthogonalizes these matrices with respect to the relevant “refusal directions”
- Thereby suppresses the expression of these directions in matrix multiplication results
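The orthogonalization step above can be sketched with NumPy. This is a generic illustration of directional ablation (orthogonalizing a weight matrix against a single direction), not Heretic’s actual code; the matrix shapes and the `alpha` scaling factor (standing in for a per-layer ablation weight) are assumptions for the example.

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray,
                     alpha: float = 1.0) -> np.ndarray:
    """Remove a fraction alpha of the component of W's output that lies
    along the refusal direction r.

    W: (d_out, d_in) projection matrix (e.g. an attention out-projection)
    r: (d_out,) refusal direction
    """
    r = r / np.linalg.norm(r)  # ensure unit norm
    # W' = W - alpha * r (r^T W): subtract the rank-1 component along r
    return W - alpha * np.outer(r, r @ W)

# After full ablation (alpha = 1), the output of W' has no component
# along r, regardless of the input x:
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
r = rng.standard_normal(8)
W_abl = ablate_direction(W, r)
x = rng.standard_normal(8)
print(abs((r / np.linalg.norm(r)) @ (W_abl @ x)) < 1e-10)  # True
```

Because the modification is a rank-1 update to existing weights, no retraining is needed and the model’s size and architecture are unchanged.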
Calculating Refusal Directions
Refusal directions are determined by computing the difference-of-means between first-token residuals for “harmful” and “harmless” example prompts. Simply put, this means analyzing differences in the model’s internal representations when facing different types of questions, identifying patterns associated with refusal behavior.
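As a sketch, assuming the residuals are available as NumPy arrays (the function name and shapes here are illustrative, not Heretic’s API):

```python
import numpy as np

def refusal_direction(harmful_residuals: np.ndarray,
                      harmless_residuals: np.ndarray) -> np.ndarray:
    """Difference-of-means direction from first-token residuals.

    Each input: (n_prompts, d_model) hidden states at one layer for the
    first generated token. Returns a unit-norm direction.
    """
    diff = harmful_residuals.mean(axis=0) - harmless_residuals.mean(axis=0)
    return diff / np.linalg.norm(diff)

# Toy data: "harmful" residuals offset from "harmless" ones so that a
# clear mean-difference direction exists.
rng = np.random.default_rng(1)
harmful = rng.standard_normal((50, 16)) + 2.0
harmless = rng.standard_normal((50, 16))
r = refusal_direction(harmful, harmless)
print(r.shape)  # (16,)
```

Computing one such candidate direction per layer yields the set of directions the optimizer later chooses among.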
Heretic’s Technical Innovations
Compared to existing ablation techniques, Heretic introduces several key innovations:
Flexible Ablation Weight Kernel
Heretic uses a set of parameters (max_weight, max_weight_position, min_weight, and min_weight_position) to define how ablation weights vary across layers. This flexibility allows the tool to apply interventions of different strengths at different layers, optimizing the balance between compliance and quality.
The concept of non-constant ablation weights had been explored before, but Heretic takes it to new levels through automated optimization.
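As an illustration, the four parameters can be read as defining a weight curve over the model’s layers. The piecewise-linear kernel below is an assumed shape for demonstration purposes, not necessarily Heretic’s actual kernel.

```python
def ablation_weight(layer_frac: float,
                    max_weight: float, max_weight_position: float,
                    min_weight: float, min_weight_position: float) -> float:
    """Hypothetical per-layer ablation weight.

    layer_frac: layer index normalized to [0, 1]. The weight equals
    max_weight at max_weight_position and min_weight at
    min_weight_position, interpolating linearly in between and holding
    constant outside that interval.
    """
    lo, hi = sorted([(min_weight_position, min_weight),
                     (max_weight_position, max_weight)])
    (p0, w0), (p1, w1) = lo, hi
    if layer_frac <= p0:
        return w0
    if layer_frac >= p1:
        return w1
    t = (layer_frac - p0) / (p1 - p0)
    return w0 + t * (w1 - w0)

# Example: weight ramps from 0.2 at the first layer up to 1.0 at
# mid-depth, then stays at 1.0 for the remaining layers.
weights = [ablation_weight(l / 9, max_weight=1.0, max_weight_position=0.5,
                           min_weight=0.2, min_weight_position=0.0)
           for l in range(10)]
```

The optimizer can then tune all four parameters jointly instead of applying one constant weight everywhere.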
Continuous Refusal Direction Index
Heretic’s refusal direction index is a float rather than an integer. For non-integral values, the tool performs linear interpolation between the two nearest refusal direction vectors. This innovation unlocks a direction space far beyond traditional methods, enabling the optimization process to find better ablation directions than those belonging to any individual layer.
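A minimal sketch of this interpolation, assuming the per-layer candidate directions are stored as rows of a NumPy array (names, shapes, and the renormalization step are illustrative assumptions):

```python
import numpy as np

def select_direction(directions: np.ndarray, index: float) -> np.ndarray:
    """Pick a refusal direction for a fractional index.

    directions: (n_candidates, d_model), one candidate direction per layer.
    index: float in [0, n_candidates - 1]; non-integral values blend the
    two nearest directions linearly, then renormalize to unit length.
    """
    lo = int(np.floor(index))
    hi = min(lo + 1, len(directions) - 1)
    t = index - lo
    blended = (1 - t) * directions[lo] + t * directions[hi]
    return blended / np.linalg.norm(blended)

# Toy example: 4 orthogonal candidate directions; index 1.5 lands
# halfway between candidates 1 and 2.
dirs = np.eye(4)
r = select_direction(dirs, 1.5)
print(abs(r[1] - r[2]) < 1e-9)  # True
```

An integer index reproduces a single layer’s direction exactly, so the continuous space strictly contains the traditional discrete choice.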
Component-Specific Parameter Selection
Heretic selects ablation parameters separately for each component. Research has found that interventions on MLP components tend to damage models more than interventions on attention components. By using different ablation weights for different components, Heretic can squeeze out additional performance improvements.
Comparing Heretic with Alternative Approaches
Before Heretic, several public implementations of ablation techniques existed:
- AutoAbliteration
- abliterator.py
- wassname’s Abliterator
- ErisForge
- Removing refusals with HF Transformers
- deccp
It’s important to note that Heretic was written from scratch and doesn’t reuse code from these projects. It represents significant improvements in automation, processing effectiveness, and ease of use.
Technical Foundations and Development Context
Heretic’s development builds on solid academic and practical foundations:
Directional ablation technology was initially proposed by Arditi et al. in their 2024 research paper, establishing theoretical groundwork for subsequent work. Later, Maxime Labonne further popularized and refined this technology through articles and model cards. Jim Lai’s description of “projected abliteration” also provided important references for Heretic’s development.
This technological evolution exemplifies the open-source collaborative spirit of the AI community—each contributor builds upon previous work, collectively pushing technological boundaries.
Frequently Asked Questions
Does Heretic Make Models Dangerous?
Not inherently. Heretic’s purpose is to remove overly conservative censorship, not to create completely unconstrained models. Models processed by Heretic maintain their original knowledge and capabilities, while becoming more open and useful when answering sensitive questions. Practical testing shows that processed models generally maintain responsible behavior in most situations.
Is Using Heretic to Process Models Legal?
This depends on your local laws and specific use cases. Heretic itself is an open-source tool released under the AGPLv3 license. When using models processed with it, you should comply with the original models’ license terms and applicable legal regulations.
Does Heretic Affect Model Performance?
Heretic is designed to remove censorship while minimizing impact on model performance. Experimental results show that significantly reduced KL divergence indicates Heretic excels in this aspect—it removes censorship mechanisms while maximally preserving original model capabilities.
How Much Technical Knowledge Do I Need to Use Heretic?
Almost none. Heretic is designed for accessibility regardless of understanding of transformer internals. If you can run command-line programs, you can use Heretic.
How Long Does Heretic Take to Process Models?
Processing time depends on model size and your hardware performance. For 8B parameter models, it takes approximately 45 minutes on high-end GPUs. Larger models require correspondingly more time. Heretic automatically performs system benchmarking at start to optimize processing speed.
Which Models Can I Use with Heretic?
Heretic supports most dense models, including multimodal models and various mixture-of-experts architectures. However, it currently doesn’t support state space models, models with inhomogeneous layers, and certain novel attention systems. Check the latest documentation for specific compatibility information.
Can Processed Models Be Used Commercially?
This depends on the original models’ licenses. Before using any model, always check its license terms to ensure your use complies with them.
Practical Application Scenarios
Models processed by Heretic have broad applications across multiple domains:
Academic Research: Researchers can use models with unnecessary restrictions removed to explore sensitive but important social topics, such as historical event analysis and social phenomenon studies.
Content Creation: Writers and creators can leverage more open models to develop storylines involving complex moral issues, breaking through creative expression boundaries.
Technical Development: Developers can build more flexible and useful AI assistants that better meet users’ diverse needs.
Model Analysis: AI safety researchers can compare pre- and post-processing models to gain deeper understanding of safety alignment mechanisms and potential improvement directions.
Future Development Directions
As language model technology continues evolving, the balance between safety and utility will remain crucial. Heretic represents significant progress in this field—it provides a precise, automated method to adjust model openness.
Future development directions may include support for more model architectures, further improvements in processing efficiency, and more fine-grained control options allowing users to adjust model openness according to specific needs.
Licensing and Contributions
Heretic is released under the AGPLv3 license, meaning you can freely use, modify, and distribute it, but if you distribute modified versions, you must also open-source your changes.
The project welcomes community contributions—by submitting code to the project, you agree to license your contributions under the same terms.
Conclusion: The Future of Language Model Customization
Heretic represents a significant milestone in AI model optimization. Through fully automated methods, it addresses the problem of excessive language model censorship while setting new standards for preserving model capabilities.
Whether you’re a scholar seeking to overcome research limitations, a developer looking for more flexible AI assistants, or simply an AI technology enthusiast, Heretic deserves your attention. It demonstrates that AI safety and practicality need not be an either-or choice: with sophisticated technical means, we can find more balanced solutions.
As artificial intelligence increasingly integrates into our lives, tools like Heretic remind us that technology’s ultimate goal should be to enhance rather than limit human capabilities and creativity. By responsibly using these tools, we can unlock AI’s true potential and collectively build a more intelligent and open future.

