Steering Conceptual Bias in Language Models for Scientific Code Generation

Abstract

This work explores whether activating latent subspaces in large language models (LLMs) can guide scientific code generation toward a specific programming language. Five causal LLMs were evaluated on scientific coding prompts to quantify their baseline bias among four programming languages. A static neuron-attribution method, perturbing the most strongly activated MLP neuron for a “C++” or “CPP” token, proved brittle and generalized poorly across prompt styles and model scales. To address these limitations, a gradient-refined adaptive activation steering framework (G-ACT) was developed: per-prompt activation differences are clustered into a small set of steering directions, and lightweight per-layer probes are trained and refined online to select the appropriate steering vector. In LLaMA-3.2 3B, this approach reliably biases generation toward CPP, raising the average probe classification accuracy by 15% and improving probe classification accuracy in the early layers (0–6) by 61.5% relative to the standard ACT framework. For LLaMA-3.3 70B, targeted injections at key layers still improve language selection. These results demonstrate a scalable, interpretable, and efficient mechanism for concept-level control in practical agentic systems.


Introduction

Large language models (LLMs) have rapidly evolved into sophisticated natural language processors, enabling the development of agentic systems that autonomously orchestrate complex workflows. A particularly striking trend is the adoption of LLM-driven agents for automated code generation. By incorporating plugin architectures that expose external APIs, these agents extend beyond text synthesis: agents can invoke specialized tools and execute command-line operations. This expanded action space has already powered a variety of real-world applications – from LLM-based robotic controllers to automated scientific experimentation platforms – highlighting the remarkable success of these agents.

However, their application to scientific code generation remains largely unexplored. Scientific software predominantly relies on C++ (CPP), CUDA, and other low-level languages that are sparsely represented in most pretraining datasets. As a result, LLM-generated implementations often exhibit syntactic or semantic errors, leading to compilation failures or unstable behavior at runtime. In addition, current agents are heavily dependent on user-defined control primitives and meticulously engineered prompts, which can be misinterpreted and give rise to unpredictable execution paths. One possible solution is to augment function definitions and syntax using a retrieval-augmented generation (RAG) framework; however, Xiao et al. and Yona et al. report unexplained vulnerabilities linked to “attention sinks” or first-token bias, causing model behavior to diverge due to token repetition in long interactions, as is common in RAG. Another failure mode arises from the use of LLMs that have been trained or fine-tuned on undisclosed corpora and then subjected to opaque alignment procedures, especially for MoE models. Such processes can inadvertently skew the model’s output toward a particular programming language or coding style, further eroding its ability to generate correct, generalizable code across the various low-level languages prevalent in scientific computing, particularly in agentic applications with long, repeated interactions.

To investigate these bottlenecks, a curated benchmark of scientific coding challenges is introduced to reveal the implicit language preference of an LLM when presented with a given problem. Targeted probing techniques are then applied to identify subgraphs or subspaces in the model whose activation strongly correlates with that preference. In the privileged basis of the model, often termed native MLP activation axes, each coordinate encodes a distinct functional characteristic (here the coding language preference). This basis is a consequence of elementwise nonlinearities (e.g., ReLU, GELU) breaking rotational symmetry and the coordinate-wise biases introduced by optimizers such as Adam. Since these axes remain (approximately) disentangled, selectively amplifying or suppressing a single coordinate produces clear causal shifts in token probabilities. In this basis, an effective weight vector for any neuron can be derived and its direct influence on features can be decoded. The external perturbation of this neuron’s activation along the identified axis then reliably steers the model toward generating code in the desired programming language.

Related Work

Reverse engineering neural networks to drive or steer a specific behavior, known as mechanistic interpretability, is an emerging field. Its primary objective is to identify causal relationships within model activations, thereby revealing complex functional roles and enabling the targeted modulation or ablation of individual neurons. Supervised fine-tuning, weight-modulation techniques, and RLHF represent direct intervention strategies for model steering. Although effective, these methods impose substantial computational overhead and can inadvertently compromise the robustness and general performance of the model.

The method of using corrupted inputs as a baseline distribution to resample neuron activations, known as Activation Patching, has been widely used to achieve fine-grained control over model output. A typical causal attribution method quantifies the negative impact on model output of deleting specific neuron activations, a simpler approach than Activation Patching. Such methods require extensive model sweeps to evaluate the results of neuron modification, leading to millions of model evaluations. Studies have focused on suppressing hallucinations in LLMs across different modalities using activation patching. Recent studies using similar forms of neuron attribution have typically been applied to multiple-choice question benchmarks rather than real-world deployment scenarios. In contrast to previous work, this study presents an inference-time adaptive steering algorithm that incorporates gradient-based refinement, offering a more efficient and precise mechanism for steering large-scale models.

Methodology

Autoregressive decoder-only transformers convert embedded tokens into high-dimensional latent representations that are successively refined by the attention mechanism and deep MLP blocks for output generation. Residual connections maintain a persistent residual stream: each block’s input is combined with its output to preserve and enrich contextual features before the LM head decodes the final latent state into tokens. In particular, individual neurons operate directly on the residual stream, a central information pathway through the transformer layers. Neurons specialized in detecting specific features within this stream exhibit strong alignment between their weight vectors and the corresponding features, characterized by high cosine similarity. This property forms the basis for the methods explained in the following sections.

Neuron Attribution: Static Method

Building upon the previously mentioned alignment property between weights and features, the static method involves directly decoding the neuron weight vectors through the LM-head, thus converting the neuron weights into interpretable token-level probability distributions.

A transformer with L layers processes an input token sequence x₁,…,xₙ ∈ V (vocabulary V ) through a residual-stream architecture. Each token xᵢ is first embedded into a d-dimensional vector hᵢ⁽⁰⁾. In layer j ∈ {1,…,L}, the residual update at position i is given by

hᵢ⁽ʲ⁾ = hᵢ⁽ʲ⁻¹⁾ + f⁽ʲ⁾(h₁⁽ʲ⁻¹⁾,…,hₙ⁽ʲ⁻¹⁾),

where f⁽ʲ⁾ denotes the combined self-attention and feedforward operations of layer j. After L such updates, the final residuals hᵢ⁽ᴸ⁾ are passed through a learned linear head (LM-head or “output head” or “prediction head”) and softmax to produce next token probabilities:

P(xᵢ₊₁ | x₁:i) = softmax(W hᵢ⁽ᴸ⁾).

To identify neurons selectively responsive to a particular target token, the decoded probability corresponding to that token is normalized by the average decoded probability of the top k tokens for each neuron. Upon identifying candidate feature-selective neurons, their functional significance is validated by systematically modifying their activations. For pre-trained models, the role of individual neurons can be probed through precisely crafted prompts that elicit the desired features, followed by controlled variations in neuron outputs to measure corresponding changes in token probabilities. The neuron with the highest activation is identified and its activation value is increased by a fixed amount, while all other coordinates remain unchanged. The modified residual vector is passed unchanged through the remaining layers of the transformer. The subsequent steps provide a detailed description of this approach. Here, the residual stream dimension (final hidden state) is D and the vocabulary dimension is denoted as V.

Extraction from Transformer Layers

Iterate over each transformer layer, accessing the corresponding MLP components.

Computation of Effective Neuron Weights

Calculate effective neuron weights using an elementwise multiplication of up-projection weights W_up and the sigmoid activation (σ) of gate-projection weights W_gate:

W_eff = W_up ⊙ σ(W_gate).

Decoding via LM Head

Decode each row (w_eff) of W_eff using the LM head (W_LM ∈ R^{V×D}), mapping it into vocabulary logits:

Logits = W_LM w_eff + b_LM.

Compute probability distribution over the vocabulary using the softmax function:

P = softmax(Logits).

Normalized Activation Score Calculation

Determine probability P_t for the token of interest t:

P_t = P[t].

Calculate average probability P_avg over the top-k (here, k=100) tokens:

P_avg = (1/k) Σ_{i=1}^{k} P(i).

Obtain the normalized activation score (A_N):

A_N = P_t / P_avg.
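
The full static-attribution pass can be condensed into a short script. The sketch below assumes a LLaMA-style Hugging Face checkpoint whose MLP blocks expose `up_proj`/`gate_proj` and whose unembedding layer is `lm_head`; the model name, chunk size, and perturbation-free scoring loop are illustrative choices rather than the exact experimental configuration.

```python
# Minimal sketch of the static neuron-attribution score described above, assuming a
# LLaMA-style Hugging Face checkpoint. Model name, chunk size, and k are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

target_token_id = tok.encode("CPP", add_special_tokens=False)[0]  # token of interest t
k = 100                                                           # top-k used for P_avg
W_lm = model.lm_head.weight                                       # (V, D); LLaMA's head has no b_LM

best_per_layer = []  # (layer, neuron, A_N)
with torch.no_grad():
    for j, layer in enumerate(model.model.layers):
        W_up = layer.mlp.up_proj.weight        # (d_ff, D)
        W_gate = layer.mlp.gate_proj.weight    # (d_ff, D)
        W_eff = W_up * torch.sigmoid(W_gate)   # W_eff = W_up ⊙ σ(W_gate)

        A_N = torch.empty(W_eff.shape[0])
        for s in range(0, W_eff.shape[0], 512):                        # chunk neurons to bound memory
            probs = torch.softmax(W_eff[s:s + 512] @ W_lm.T, dim=-1)   # decode rows via the LM head
            P_t = probs[:, target_token_id]                            # P_t = P[t]
            P_avg = probs.topk(k, dim=-1).values.mean(dim=-1)          # mean of top-k probabilities
            A_N[s:s + 512] = P_t / P_avg                               # normalized activation score

        n = int(A_N.argmax())
        best_per_layer.append((j, n, float(A_N[n])))

# Layers whose most selective neuron scores highest for the target token.
for j, n, a in sorted(best_per_layer, key=lambda t: -t[2])[:5]:
    print(f"layer {j:2d}  neuron {n:5d}  A_N = {a:.1f}")
```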

Neuron Attribution: Adaptive Method

The algorithm presented above targets individual neuron activations directly, but in large-scale models the notion that a single neuron uniquely encodes a high-level concept is often invalid: neurons are commonly polysemantic, responding to many different features or tokens. To achieve consistent concept steering, one must, therefore, identify all neurons associated with a given concept – across every layer – and apply coordinated adjustments. Building on the Adaptive Activation Steering (ACT) framework, an adaptive, multi-neuron gradient-based algorithm is proposed that, at inference time, activates the entire set of concept-linked neurons.

The algorithm extends the original three-step ACT framework by introducing a lightweight probe-refinement stage. First, per-prompt style-difference vectors are extracted; second, these are clustered into a compact set of steering centroids; and third, a separate probe is trained at each layer to assign new activations to these clusters at inference time. In the refinement stage, probes are iteratively refined via gradient descent: autoregressive inference is performed under gradient tracking, and at each layer (a) the selected centroid is applied as a residual adjustment and (b) a cross-entropy loss comparing the probe’s prediction to the known cluster label is accumulated. Finally, the loss is propagated back exclusively through the probe parameters, keeping the weights of the base model fixed.
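
A condensed sketch of this clustering, probe-training, and refinement loop is given below. It assumes the per-prompt, per-layer difference vectors ∆_{i,l} have already been collected into a tensor `delta` of shape (n_layers, n_prompts, d) (an extraction sketch appears in the Adaptive Method Analysis section); the number of clusters K, the per-layer clustering granularity, the linear-probe architecture, and the optimizer settings are illustrative assumptions, not the exact experimental configuration.

```python
# Hedged sketch of the gradient-refined ACT (G-ACT) probe stage. `delta` is assumed to be a
# (n_layers, n_prompts, d) tensor of per-prompt style-difference vectors; K, the per-layer
# clustering, the linear probes, and the optimizer are illustrative choices.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

K = 5                                   # number of steering directions (assumption)
n_layers, n_prompts, d = delta.shape

# 1) Cluster the difference vectors into K steering centroids (here: per layer).
centroids, labels = [], []
for l in range(n_layers):
    km = KMeans(n_clusters=K, n_init=10).fit(delta[l].numpy())
    centroids.append(torch.tensor(km.cluster_centers_, dtype=torch.float32))  # (K, d)
    labels.append(torch.tensor(km.labels_, dtype=torch.long))                 # (n_prompts,)

# 2) One lightweight probe per layer maps an activation to one of the K clusters.
probes = nn.ModuleList(nn.Linear(d, K) for _ in range(n_layers))
opt = torch.optim.Adam(probes.parameters(), lr=1e-3)

# 3) Online refinement during steered generation: the probe picks a centroid, the centroid
#    is added to the residual stream, and a cross-entropy loss against the known cluster
#    label is accumulated. Only the probes receive gradients; the base model stays frozen.
def refine_step(layer_acts, true_labels):
    """layer_acts[l]: (batch, d) activations at layer l; true_labels[l]: (batch,) labels."""
    loss = torch.zeros(())
    for l in range(n_layers):
        logits = probes[l](layer_acts[l])              # probe prediction, (batch, K)
        chosen = centroids[l][logits.argmax(dim=-1)]   # selected steering vectors, (batch, d)
        # ...`chosen` would be injected into the frozen model's residual stream here...
        loss = loss + nn.functional.cross_entropy(logits, true_labels[l])
    opt.zero_grad()
    loss.backward()                                    # gradients flow only through the probes
    opt.step()
    return float(loss)
```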

Generating Dataset

A dataset of coding challenges was assembled to isolate an LLM’s unprompted programming-language selection behavior. Each example provides a problem statement (e.g., “Implement the Gauss-Seidel method” or “Generate the nth Fibonacci sequence”) together with a problem description that never mentions a programming language. The tasks span both general-purpose algorithms and domain-specific scientific routines, ensuring broad coverage. Each problem instance in this dataset represents a distinct scientific or coding scenario; consequently, even after randomly partitioning into training and test subsets (e.g., an 80%/20% split), every held-out example remains effectively out-of-distribution. Prompts are cast as direct questions to instruction-tuned models, and qualitative comparisons of their outputs are used to assess shifts in the chosen implementation language.
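
For illustration, the sketch below shows how such language-agnostic prompts and the random split might be constructed; the problem list and exact wording are placeholders, not the paper’s actual benchmark items.

```python
# Illustrative construction of language-agnostic prompts and a random 80%/20% split.
# The problem list and phrasing are placeholders, not the actual benchmark.
import random

problems = [
    "Implement the Gauss-Seidel method for solving a linear system.",
    "Generate the nth Fibonacci number.",
    "Integrate a stiff system of ODEs with an implicit scheme.",
    # ...remaining benchmark problems...
]

def make_prompt(problem: str) -> str:
    # Direct question to an instruction-tuned model; no programming language is named.
    return f"Write a program that solves the following task:\n{problem}"

random.seed(0)
random.shuffle(problems)
split = int(0.8 * len(problems))
train_set, test_set = problems[:split], problems[split:]
prompts = [make_prompt(p) for p in train_set]
```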

Results

In this study, five instruction-tuned LLMs were evaluated: Llama-3.2-3B-Instruct, Llama-3.3-70B-Instruct, Qwen2.5-Coder-32B-Instruct, Qwen2.5-14B-Instruct-1M, and QwQ-32B. Models were evaluated on the 84 benchmark questions at a sampling temperature (T) of 1.0. To verify statistical stability, each model–prompt pair was sampled both 100 and 25 times; observed performance differences between these two sampling sizes were under 1%. Consequently, 25 repetitions per prompt were deemed sufficient and used for all reported results.

Language Preferences

The programming-language preferences for Llama-3.2-3B-Instruct across 84 benchmark problems show that Python predominates, accounting for roughly 70–80% of outputs on nearly every task. Java appears most frequently on classical algorithmic challenges, peaking at about 30%. CPP is chosen only sporadically (<10%), typically for numerically intensive or performance-critical operations. Julia applies exclusively to a small subset of domain-specific scientific problems, also at low frequency (<10%).

The summary of results for the five models illustrates that variations in model scale, architectural design, and fine-tuning data collectively impart distinct, reproducible biases in each model’s code-generation behavior. In particular, smaller, distilled models are often fine-tuned on corpora focused on web development and other routine applications. Based on these findings, biasing code generation toward CPP for scientific computing requires targeted alignment, either through prompt engineering or domain-specific fine-tuning of the model.

Static Method Analysis

The procedure outlined in Sec. 3.1 was applied across all models. The activation maps for Llama-3.2-3B-Instruct and Qwen2.5-14B-Instruct-1M show that the neuron exhibiting the highest probability for the “CPP” token is located in layer 27 of the Llama model and layer 31 of the Qwen model.

To evaluate the causal role of the identified “CPP” neuron, Llama-3.2-3B-Instruct was rerun on the benchmark with artificially amplified activation of the same neuron (L27 N6859). The resulting shift in language selection frequencies demonstrates an increase in CPP output.
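
One way to realize this intervention is a forward pre-hook on the MLP’s down-projection, which exposes the post-activation neuron values in a LLaMA-style block. The sketch below reuses the `model` and `tok` objects from the static-attribution sketch; the perturbation strength is an arbitrary illustrative value.

```python
# Hedged sketch of the causal intervention: amplify one MLP neuron's activation
# (here L27 N6859) by a fixed offset during generation. Reuses `model` and `tok`
# from the earlier sketch; DELTA is an assumed perturbation strength.
import torch

LAYER, NEURON, DELTA = 27, 6859, 10.0

def amplify_neuron(module, args):
    hidden = args[0].clone()        # (batch, seq, d_ff) input to down_proj
    hidden[..., NEURON] += DELTA    # bump only the target coordinate
    return (hidden,) + args[1:]

handle = model.model.layers[LAYER].mlp.down_proj.register_forward_pre_hook(amplify_neuron)

inputs = tok("Implement the Gauss-Seidel method.", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=1.0)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unmodified model
```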

In the baseline configuration, Python was the dominant output language for Llama-3.2-3B-Instruct and CPP never exceeded 10%. After amplification, the red bars dominate almost all problems, often approaching or reaching 100% CPP output. Python and Java virtually disappear, and Julia is completely eliminated. In a converse experiment, the same algorithm was applied to amplify the neuron associated with the “Python” token, and the model was re-evaluated on the identical 84-task benchmark. As shown in Fig. 7, this single neuron drives the model to select Python for virtually every prompt; green bars reach nearly 100%, while CPP, Java, and Julia outputs drop to negligible levels. These complementary tests confirm that selective activation of individual MLP neurons exerts strong causal control over the model’s programming-language choice, effectively overriding its native bias.

Adaptive Method Analysis

To build the framework, the dataset of coding-problem prompts is first split (70%/30%) into disjoint training and test sets. For each prompt i in the training set, two forward passes through the frozen model (here Llama-3.2-3B-Instruct) are performed, one requesting a CPP solution and one requesting a Python solution, and the final-token activations h⁺_{i,l} and h⁻_{i,l} at each layer l are stored. The per-layer difference vectors ∆_{i,l} = h⁺_{i,l} − h⁻_{i,l} were then used to characterize the “style shift” between CPP and Python.
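
A sketch of this extraction step is shown below; the resulting `delta` tensor is the input assumed by the G-ACT probe sketch above. It reuses the `model`, `tok`, and `train_set` objects from the earlier sketches, and the prompt suffixes used to request a CPP or Python solution are illustrative.

```python
# Hedged sketch of collecting the per-layer style-difference vectors ∆_{i,l}: two forward
# passes per training prompt, storing the final-token hidden state of every layer.
# Reuses `model`, `tok`, and `train_set`; the prompt suffixes are illustrative.
import torch

@torch.no_grad()
def last_token_states(prompt: str) -> torch.Tensor:
    """Return an (n_layers, d) tensor of final-token hidden states."""
    inputs = tok(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: embeddings plus one entry per layer, each of shape (1, seq, d)
    return torch.stack([h[0, -1] for h in out.hidden_states[1:]])

per_prompt_deltas = []
for problem in train_set:
    h_pos = last_token_states(f"{problem}\nAnswer in C++.")      # h⁺_{i,l}
    h_neg = last_token_states(f"{problem}\nAnswer in Python.")   # h⁻_{i,l}
    per_prompt_deltas.append(h_pos - h_neg)                      # ∆_{i,l}

delta = torch.stack(per_prompt_deltas).transpose(0, 1)   # (n_layers, n_prompts, d)
layer_norms = delta.norm(dim=-1).mean(dim=-1)             # mean ‖∆‖₂ per layer
```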

The mean ℓ₂ norm of the difference vectors as a function of the layer index shows that the style signal is negligible in the earliest layers but grows steadily, reaching its maximum in the final blocks. While the full hidden-state differences (blue) carry the largest absolute style signal in late layers, the attention-head differences (orange), which are of much lower dimensionality, already exhibit a clear rise beginning around layer 5. Head outputs were therefore selected for probing because they strike a practical balance between signal strength, interpretability, and computational cost.

The results for the Llama-3.2-3B-Instruct model show that the gradient-refined ACT probes deliver strong improvements on Llama 3B but collapse when applied to Llama 70B. For the 3B model, gradient-based refinement increases the accuracy of the early layers (0–6) from 0% to 61.5% and the macro-F1 from 0 to 0.254. Averaged across all 28 layers, the accuracy increases from 0.405 to 0.556 (+15%) and the macro-F1 from 0.165 to 0.238 (+7.3%), demonstrating that the gradient-refined ACT probes more reliably recognize the correct steering mode, especially in layers that originally carried a smaller or noisier style signal. In contrast, on the much larger 70B model, where only hidden-state differences were used and the clusters were reduced to three, both standard ACT and gradient-ACT probes underperform: the mean accuracy is only 6.3% for the standard probes and 19.1% for the refined ones; the mean macro-F1 is 0.041 vs. 0.111. In other words, the head-state activations that carry a clear style signal in the 3B model become too diffuse or noisy in the 70B model, so clustering and probes trained on those attention-head features lose their discriminative power; hidden-state features can be used instead in such scenarios.

Furthermore, the probe logic comes with a non-trivial runtime cost. As shown in Table 1, on Llama-3B a single “no-probe” forward pass takes approx. 3.25 s, but inserting standard ACT injections inflates that to about 6.78 s (nearly double). Gradient-refined ACT falls in between (~4.55 s), as it must evaluate each layer’s lightweight classifier during generation but does not rerun the offline clustering. The 70B model shows the same pattern at scale: base generation costs about 27.8 s, standard ACT around 38.7 s, and gradient-refined ACT 32.6 s. In practice, each per-layer probe entails a small extra linear pass (and, when using gradients, a backward step), so total latency increases compared to unmodified inference. Gradient-ACT, however, skips the expensive nearest-centroid search at every layer. In standard ACT, the ∆_{i,l} of each prompt token must be compared (by L2 distance) with all K centroids to select which steering vector to add, an O(KD) operation per layer per token. In contrast, gradient-ACT uses a small linear probe (a single matrix multiplication) to predict the cluster, avoiding repeated distance computations. As a result, even though gradient-ACT still evaluates one extra linear layer per token, it ends up faster than a full K-means lookup at inference time, hence the intermediate runtime.
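
The difference between the two selection mechanisms can be seen in a few lines: standard ACT performs a nearest-centroid search over all K centroids, whereas gradient-ACT asks a small linear probe for the cluster index. Shapes and values in the sketch below are illustrative.

```python
# Toy comparison of the per-layer, per-token selection step. Shapes are illustrative.
import torch

K, D = 5, 3072
centroids_l = torch.randn(K, D)     # steering centroids of one layer
probe_l = torch.nn.Linear(D, K)     # trained per-layer probe (gradient-ACT)
h = torch.randn(D)                  # current token's activation at this layer

# Standard ACT: L2 distance to all K centroids, O(K*D) per layer per token.
idx_act = torch.cdist(h[None], centroids_l).argmin()

# Gradient-ACT: a single matrix multiply with the probe, then argmax.
idx_gact = probe_l(h).argmax()

steering_vector = centroids_l[idx_gact]   # added to the residual stream at this layer
```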

Conclusion

Five causal language models were evaluated on a suite of scientific coding prompts to quantify coding language bias. While smaller models preferred Java or Julia inconsistently, both variants of the LLaMA models exhibited clear separability in their internal activations when prompted to select one of four language options.

A static neuron-attribution approach, identifying individual MLP neurons (decoded through the LM head) correlated with the CPP style and manually perturbing them, is shown to produce CPP code in limited cases. However, this method proved fragile, with performance highly sensitive to the choice of neuron, prompt formulation, and model scale. To overcome these limitations, a gradient-refined version of ACT is introduced. Per-prompt “difference” vectors (CPP minus Python activations) are clustered into a small set of representative steering directions, and lightweight per-layer probes are trained and then further refined online via a cross-entropy loss during generation. In LLaMA-3.2 3B, the average probe classification accuracy increased, with the early layers (0–6) improving from 0% to 61.5% accuracy. Even on the larger LLaMA-3.3 70B model, where head-state signals become more diffuse, G-ACT still increases accuracy, demonstrating that targeted injections at key layers can reliably bias generation toward CPP despite overall weak activations.

Although adding per-layer probes incurs a modest runtime overhead (approximately 1.3–1.4× slower than base generation), practical deployments can accommodate this cost by steering only a subset of layers, caching subtokens, or accepting slightly longer latencies. Even imperfect probes suffice to steer the model output toward the desired style, yielding substantial qualitative gains in CPP code generation, and can be extended to other concepts of interest. While static neuron perturbation may only be viable in narrow scenarios, G-ACT provides a scalable, interpretable, and efficient mechanism for steering LLMs toward CPP (or any target subject), with acceptable inference costs. Beyond steering concepts, this approach embeds persistent transformation matrices that guarantee identical model behavior across different users, fostering a new paradigm of output reproducibility.