
H-Neurons in Mixture-of-Experts (V1): Initial Findings with Overclaimed Causal Effects

The original V1 paper that claimed a 15 percentage point causal effect on FalseQA — later retracted in V2 after replication across three configurations revealed a null result.

By Spirited Mind
mechanistic-interpretability · hallucination · mixture-of-experts · h-neurons · qwen · paper

H-Neurons in Mixture-of-Experts: Hallucination-Associated Neurons in Sparse Architectures


Abstract

Large language models (LLMs) frequently generate hallucinations — plausible but factually incorrect outputs — undermining their reliability across applications. Gao et al. (2025) recently demonstrated that a remarkably sparse subset of neurons, termed H-Neurons, can reliably predict and causally influence hallucination in dense transformer architectures. However, modern state-of-the-art models increasingly employ Mixture-of-Experts (MoE) architectures, where only a fraction of parameters is active for any given token, and no prior work has examined whether H-Neuron findings transfer to these sparse models. In this paper, we extend the H-Neuron framework to MoE by conducting a systematic replication study on Qwen3.5-35B-A3B, a 35-billion parameter MoE model activating approximately 3 billion parameters per token. We identify 67 H-Neurons constituting 0.0127‰ of the model's 5,263,360 total neurons — falling squarely within the range reported for dense architectures (0.01‰–0.35‰). These neurons achieve an AUROC of 0.860 for in-domain hallucination detection and generalize in ranking ability to cross-domain (0.697) and fabricated-entity (0.768) settings, though binary prediction accuracy remains near majority-class baselines due to threshold miscalibration across class distributions. Perturbation experiments confirm a causal link between H-Neurons and compliance with invalid premises, with suppression reducing compliance by 15 percentage points. We identify a phenomenon specific to sparse architectures that we term routing dilution: the attenuation of perturbation effects caused by the conditional activation of routed experts. Of the 67 H-Neurons, 28 reside in shared experts (always active) while 39 reside in routed experts (active approx. 3.1% of the time per token), suggesting that shared expert neurons carry the primary causal signal for hallucination control in MoE models.


1. Introduction

In recent years, large language models have achieved groundbreaking advancements in natural language processing, demonstrating impressive potential towards artificial general intelligence (Bommasani et al., 2021; Brown et al., 2020; Ouyang et al., 2022; OpenAI, 2023). However, these advancements come with a persistent reliability challenge: hallucinations. Hallucinations occur when models produce outputs that seem plausible but are factually inaccurate or unsupported by evidence (Maynez et al., 2020; Ji et al., 2023). For example, GPT-3.5 has been shown to hallucinate in approximately 40% of citation-based factuality evaluations, a figure that improves but remains high at 28.6% for GPT-4 (Chelli et al., 2024). Similarly, emerging reasoning-centric systems such as DeepSeek-R1, despite demonstrating strong performance on complex tasks, continue to exhibit pronounced hallucination modes (Bao et al., 2025). Collectively, these observations indicate that hallucinations persist regardless of model architecture, highlighting a critical bottleneck in the reliability of state-of-the-art LLMs.

To improve LLM reliability, researchers have invested considerable effort in uncovering the mechanisms and factors behind hallucinations, which can be broadly grouped into three categories. First, from a training data perspective, distribution imbalances and inherent biases within datasets make it difficult for models to accurately recall long-tail facts (Sun et al., 2024; Li et al., 2022). Second, training objectives in both pretraining and post-training phases primarily incentivize confident predictions without promoting the expression of uncertainty for unfamiliar information, encouraging models to output incorrect guesses (Kalai et al., 2025). Third, decoding algorithms introduce instability through randomness and error accumulation in autoregressive generation, allowing small deviations to snowball into hallucinations (Zhang et al., 2024a; Lee et al., 2022; Kapoor et al., 2024).

Current studies largely treat LLMs as black boxes, examining hallucination causes at a macroscopic level while neglecting microscopic insights into neuron-level mechanisms. Yet, such fine-grained analysis holds immense promise for explaining how hallucinations arise and for developing mitigation strategies. Just as understanding how specialized cell types in the brain contribute differently to cognitive functions requires distinguishing between always-active interneurons and conditionally-recruited projection neurons, understanding hallucinations in neural networks requires examining the fundamental computational units — individual neurons — and their activation patterns in relation to faithful and hallucinatory outputs. Gao et al. (2025) conducted a systematic investigation into hallucination-associated neurons (H-Neurons) in dense LLMs, demonstrating that a remarkably sparse subset of neurons (less than 0.1% of total neurons) can reliably predict hallucination occurrences, that these neurons are causally linked to over-compliance behaviors, and that they originate during pre-training rather than post-training alignment.

However, all prior investigations of H-Neurons have been conducted exclusively on dense transformer architectures. Modern state-of-the-art models increasingly employ Mixture-of-Experts (MoE) architectures, where only a subset of parameters is active for any given token. Models such as Mixtral (Jiang et al., 2024), DeepSeek-V3 (DeepSeek-AI, 2024), and the Qwen3 family (Qwen Team, 2025) use MoE to achieve strong performance with reduced inference cost. This fundamental architectural difference raises critical questions about whether H-Neuron findings transfer to sparse models: in a dense model, every identified H-Neuron participates in every forward pass, but in an MoE model, a routed expert neuron only fires when its expert is selected by the router — potentially as rarely as 3.1% of the time per token.

In this paper, we extend the H-Neuron framework to MoE architectures by conducting a systematic replication study on Qwen3.5-35B-A3B. We address the following research questions:

  • Q1: Do H-Neurons exist in MoE models with comparable sparsity? Can we identify specific neurons whose activations reliably distinguish between hallucinatory and faithful outputs in a model where the majority of neurons are conditionally active?

  • Q2: Does the causal link to over-compliance hold when most identified H-Neurons are conditionally active? Do perturbation effects persist despite the routing mechanism's stochastic filtering of expert participation?

  • Q3: How does the shared versus routed expert distinction affect H-Neuron behavior? Does the architectural split between always-active shared experts and conditionally-routed experts create a meaningful functional division among H-Neurons?

Our investigation yields the following contributions:

  • We provide the first identification and analysis of H-Neurons in a Mixture-of-Experts architecture, demonstrating that 67 H-Neurons exist at a sparsity ratio (0.0127‰) consistent with dense model findings.

  • We confirm that H-Neurons in MoE models exhibit cross-domain ranking ability, achieving AUROC scores of 0.860 (in-domain), 0.697 (cross-domain biomedical), and 0.768 (fabricated entities), substantially outperforming random neuron baselines. However, binary prediction accuracy remains near majority-class baselines, indicating that threshold recalibration is necessary for practical deployment.

  • We establish a partial causal link between H-Neurons and over-compliance, with suppression reducing compliance with invalid premises by 15 percentage points, while identifying attenuated and non-monotonic effects attributable to MoE-specific routing dynamics.

  • We introduce the concept of routing dilution — the attenuation of neuron-level perturbation effects in MoE models due to conditional expert activation — and identify shared expert neurons as the primary carriers of the hallucination-associated causal signal.


2. Existence of H-Neurons in MoE

2.1 Model and Architecture

To investigate whether H-Neurons exist in sparse architectures, we select Qwen3.5-35B-A3B (Qwen Team, 2025) as our target model. Unlike dense models where every neuron participates in every forward pass, MoE architectures employ a routing mechanism that selects a subset of experts per token, activating only a fraction of the model's total parameters during inference. Qwen3.5-35B-A3B contains 35 billion total parameters but activates approximately 3 billion per token, achieving competitive performance at substantially reduced computational cost.

| Property | Value |
|----------|-------|
| Total parameters | 35B |
| Active parameters per token | approx. 3B |
| Layers | 40 |
| Attention type | Hybrid: 30 linear attention + 10 standard self-attention |
| Routed experts per layer | 256 |
| Active routed experts per token | 8 |
| Shared experts per layer | 1 |
| Neurons per expert (intermediate dim) | 512 |
| Total neuron count | 5,263,360 |

Qwen3.5-35B-A3B employs a hybrid attention architecture: 30 of the 40 layers use linear attention (with Mamba-style state-space components including learned decay parameters, causal convolutions, and discretization biases), while every 4th layer (layers 3, 7, 11, 15, 19, 23, 27, 31, 35, 39) uses standard multi-head self-attention. This hybrid design is orthogonal to the MoE feedforward structure that is the focus of our analysis — all 40 layers share an identical MoE FFN architecture regardless of their attention mechanism.

Each of the 40 layers contains 256 routed experts and 1 shared expert. The shared expert participates in every forward pass regardless of input, while the router selects 8 of the 256 routed experts per token. Each expert contains a feedforward network with an intermediate dimension of 512 neurons. The total neuron space is therefore 40 × (256 × 512 + 1 × 512) = 5,263,360 neurons. This architectural distinction between shared and routed experts is central to our analysis: it creates two fundamentally different classes of neurons — those that are always active and those that are conditionally recruited. We note that the hidden state representations entering the FFN differ depending on whether they were produced by linear attention or standard self-attention, which could in principle affect which neurons become hallucination-associated at different layers.
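The neuron-count and routing arithmetic above can be verified directly; a minimal sanity check in Python (constants taken from the table above):

```python
# Architecture constants for Qwen3.5-35B-A3B as described above.
LAYERS = 40
ROUTED_EXPERTS = 256
SHARED_EXPERTS = 1
NEURONS_PER_EXPERT = 512   # FFN intermediate dimension
ACTIVE_ROUTED = 8

# Total neuron space across all experts in all layers:
# 40 x (256 x 512 + 1 x 512) = 5,263,360.
total_neurons = LAYERS * (ROUTED_EXPERTS + SHARED_EXPERTS) * NEURONS_PER_EXPERT

# Activation probability of any single routed expert neuron on a given token.
routed_activation_prob = ACTIVE_ROUTED / ROUTED_EXPERTS

print(total_neurons)           # 5263360
print(routed_activation_prob)  # 0.03125, i.e. approx. 3.1%
```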

2.2 Data Construction

Following the methodology established by Gao et al. (2025), we adopt the TriviaQA dataset (Joshi et al., 2017) for its broad coverage of general-domain knowledge and typically concise answers. To capture the model's stable behavioral patterns, we perform a consistency check by sampling 10 distinct responses per question using probabilistic decoding parameters (temperature=1.0, top_k=50, top_p=0.9). We retain only those instances where the model exhibits consistent behavior: either answering correctly in all 10 samples or failing in all 10 samples with incorrect answers rather than refusals.

This strict filtering yields a contrastive set of 1,000 consistently correct and 864 consistently incorrect examples, totaling 1,864 samples. The slight imbalance (compared to the reference paper's 1,000/1,000 split) reflects the model's knowledge distribution on TriviaQA. This ensures that any observed differences in neuronal activity are attributable to the fundamental truthfulness of the output rather than generation noise.
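The consistency filter can be sketched as a small pure function. The `grade` callable below is a hypothetical stand-in for the actual answer-matching check against TriviaQA gold answers; it is not part of the original pipeline code.

```python
def filter_consistent(question, responses, grade):
    """Keep a question only if all sampled responses agree.

    Returns "correct" (all 10 samples correct), "incorrect" (all 10 samples
    wrong, with no refusals), or None (inconsistent behavior -> discarded).
    `grade` maps (question, response) -> "correct" | "incorrect" | "refusal".
    """
    grades = [grade(question, r) for r in responses]
    if all(g == "correct" for g in grades):
        return "correct"
    if all(g == "incorrect" for g in grades):
        return "incorrect"
    return None

# Toy usage with a stand-in grader over a single gold answer.
gold = {"q1": "Paris"}
grade = lambda q, r: "correct" if r == gold[q] else "incorrect"
label = filter_consistent("q1", ["Paris"] * 10, grade)  # -> "correct"
```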

To precisely localize the neural signal, we extract answer tokens — the specific spans containing the factual claim — using Ollama-hosted inference (the same Qwen3.5 model with greedy decoding) as our answer token extractor, adapting the reference paper's use of GPT-4o for this purpose. Because TriviaQA answers are typically short factual entities (names, dates, places), the extraction task is low-ambiguity; spot-checking of extracted spans confirms reasonable quality (e.g., "York", "William Golding", "Robert Ballard"). However, we did not conduct a systematic validation of extraction accuracy or a head-to-head comparison against GPT-4o, which we acknowledge as a limitation. By focusing on these token positions, we ensure that the detected activation patterns are directly linked to the factual content of the generation rather than syntactic filler.

2.3 CETT for MoE

With the dataset established, we quantify the functional influence of every neuron on each response using the Contribution of Each Token to the Total hidden state (CETT) metric (Zhang et al., 2024b). Simply recording raw activation magnitudes is insufficient, as a neuron might exhibit high activation yet have a negligible impact on the hidden state representation due to downstream projection weights. CETT captures the fraction of the information flow at each token position that is explicitly attributable to a given neuron.

For a token at position t with hidden representation x_t ∈ ℝ^d, the MLP computes an intermediate activation:

z_t = σ(W_gate · x_t) ⊙ (W_up · x_t)

where σ(·) denotes the non-linear activation. The contribution of neuron j is measured as:

CETT(j,t) = ‖h^(j)_t‖₂ / ‖h_t‖₂

where h^(j)_t = W_down · z^(j)_t is the down-projected partial hidden vector attributable to neuron j, and h_t = W_down · z_t is the full hidden state.

To adapt CETT for MoE, we account for the fused gate-up projection tensors used in Qwen3.5-35B-A3B's architecture. Each expert's feedforward network uses a fused gate_up_proj weight matrix of shape (2 × d_m) × d, which we split to recover the separate gate and up projections. For routed experts, the contribution is computed only when the expert is selected by the router for the given token; when an expert is not selected, its neurons contribute zero to the hidden state.

We aggregate token-level scores into two fixed-dimensional features per neuron per sample: CETT_mean(j, answer) (mean over answer tokens) and CETT_mean(j, other) (mean over non-answer tokens), following Equation 3 of Gao et al. (2025). This yields a feature matrix of 1,864 samples × 5,263,360 × 2 neuron features.

2.4 Sparse Classification

To identify the specific subset of neurons associated with hallucination, we employ L1-regularized logistic regression rather than a dense or non-linear model. The choice of a linear model ensures that the learned weights θ are directly interpretable as the marginal contribution of each neuron to the hallucination log-odds. The L1 penalty enforces sparsity, as we hypothesize that hallucinations are driven by a sparse subset of neurons rather than the entire network.

The training objective minimizes the negative log-likelihood with the sparsity constraint:

L(θ) = -Σᵢ [yᵢ log σ(θᵀxᵢ) + (1 - yᵢ) log(1 - σ(θᵀxᵢ))] + λ‖θ‖₁

We perform a grid search over the regularization parameter C = 1/λ using an 80/20 train/test split, selecting C = 1.0 as the value that maximizes held-out classification performance (AUROC = 0.9729, accuracy = 87.9%). The final classifier is then retrained on the full 1,864 samples at the selected C. Of the 5,263,360 neurons, the classifier assigns non-zero weights to 127 neurons, of which 67 receive positive weights and 60 receive negative weights. Following Gao et al. (2025), we define the 67 positively-weighted neurons as H-Neurons, as their activation exhibits a positive correlation with hallucinatory responses.

2.5 Detection Results

To assess whether the identified H-Neurons generalize beyond the training set and reflect broader patterns of hallucination, we evaluate the trained classifier for hallucination detection on diverse question collections. Following the evaluation protocol of Gao et al. (2025), we design a comprehensive assessment covering three distinct hallucination scenarios: (1) In-Domain Knowledge Recall using TriviaQA and NQ-Open, (2) Cross-Domain Robustness using BioASQ, a biomedical question-answering dataset, and (3) Fabricated Knowledge Detection using NonExist, containing artificially generated questions about non-existent entities. For each evaluation dataset, we sample 200–500 single responses using probabilistic decoding, extract answer tokens, compute CETT features, and apply the trained classifier. We report both AUROC and accuracy, following Gao et al. (2025).

Table 1: Hallucination detection performance of neuron-based classifiers on Qwen3.5-35B-A3B. "H-Neuron" and "Random" refer to classifiers trained with H-Neurons and randomly selected neurons (matched count, n=67), respectively. The H-Neuron classifier is trained on TriviaQA and evaluated across all four settings. "Majority" indicates the accuracy achievable by always predicting the most frequent class.

| Dataset | Hall Rate | Majority Acc | H-Neuron Acc | Random Acc | H-Neuron AUROC | Random AUROC |
|---------|-----------|--------------|--------------|------------|----------------|--------------|
| TriviaQA (n=500) | 36.6% | 63.4% | 65.4% | 36.4% | 0.860 | 0.462 |
| NQ-Open (n=500) | 57.8% | 57.8% | 61.2% | 58.6% | 0.673 | 0.507 |
| BioASQ (n=500) | 77.0% | 77.0% | 77.2% | 51.0% | 0.697 | 0.504 |
| NonExist (n=200) | 21.0% | 79.0% | 21.0% | 24.0% | 0.768 | 0.505 |

Table 1 presents the hallucination detection performance of neuron-based classifiers on Qwen3.5-35B-A3B. The results reveal a clear dissociation between ranking ability (AUROC) and binary prediction accuracy.

In terms of ranking, H-Neurons exhibit robust discriminative ability. First, the H-Neuron classifier achieves an AUROC of 0.860 on in-domain TriviaQA, substantially outperforming the random baseline (0.462). The random baseline operates at chance level (approx. 0.50 AUROC), confirming that it constitutes a fair comparison. Second, this ranking ability generalizes to cross-domain biomedical questions (0.697 vs. 0.504) and fabricated entities (0.768 vs. 0.505), demonstrating that H-Neurons capture generalizable patterns of hallucination rather than dataset-specific artifacts. Third, the NQ-Open result (0.673 vs. 0.507) confirms transfer to a second in-domain dataset, consistent with the cross-dataset generalization observed by Gao et al. (2025) in dense models.

However, the accuracy results tell a more sobering story. The H-Neuron classifier's binary prediction accuracy barely exceeds the majority-class baseline on most datasets: +2.0 percentage points on TriviaQA (65.4% vs. 63.4%), +3.4 on NQ-Open (61.2% vs. 57.8%), and +0.2 on BioASQ (77.2% vs. 77.0%). Most strikingly, on NonExist the classifier achieves only 21.0% accuracy — matching the hallucination rate exactly — indicating that it predicts "hallucination" for nearly every sample. This occurs because the decision threshold learned during training (where 864 of the 1,864 samples, approx. 46%, were labeled hallucinations) is miscalibrated for evaluation datasets with very different class distributions. The NonExist dataset has only 21% hallucinations (42 out of 200), so the classifier's threshold is far too aggressive, producing a high false-positive rate.

This dissociation between AUROC and accuracy is important to interpret correctly. The AUROC results demonstrate that H-Neuron activations carry a genuine signal about hallucination likelihood — samples that are hallucinations do tend to receive higher predicted probabilities than faithful samples. However, translating this ranking signal into useful binary predictions would require threshold recalibration for each deployment context, a practical limitation that should be considered when evaluating the utility of H-Neuron-based detection.
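One simple recalibration scheme — assuming the target domain's hallucination base rate is known or can be estimated from a small labeled sample — is to set the decision threshold at the score quantile matching that base rate. This is an illustrative sketch, not a procedure from the study:

```python
import numpy as np

def recalibrated_threshold(scores, expected_hall_rate):
    """Pick a threshold so that the fraction of samples flagged as
    hallucinations matches an expected base rate for the target domain.

    `scores` are the classifier's predicted hallucination probabilities.
    """
    return float(np.quantile(scores, 1.0 - expected_hall_rate))

# Toy usage: a domain where only ~20% of outputs are hallucinations
# (as in NonExist). Uniform scores stand in for classifier outputs.
rng = np.random.default_rng(0)
scores = rng.uniform(size=1000)
thr = recalibrated_threshold(scores, expected_hall_rate=0.2)
flag_rate = float((scores >= thr).mean())   # approx. 0.2 by construction
```

More principled alternatives (Platt scaling, isotonic regression on a held-out calibration set) would also correct the probabilities themselves rather than just the cutoff.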

2.6 H-Neuron Composition

A distinctive feature of H-Neurons in MoE architectures is their dual nature: 28 reside in shared experts that are always active during every forward pass, while 39 reside in routed experts that are conditionally selected by the router. This 42%/58% shared/routed split has no analogue in dense models, where every neuron participates in every token's computation.

In Qwen3.5-35B-A3B, we identify 67 H-Neurons constituting 0.0127‰ of the model's 5,263,360 total neurons. This ratio falls squarely within the range reported by Gao et al. (2025) for dense architectures — from 0.01‰ in large models like Mistral-Small-3.1-24B and Llama-3.1-70B to 0.35‰ in Mistral-7B-v0.3.

The H-Neurons span nearly all layers of the model (layers 0–39), with the 39 routed expert neurons distributed across 33 unique expert IDs out of 256 total. The neuron with the highest classifier weight (11.486) is a shared expert neuron in layer 25, while the second-highest (10.356) is a routed expert neuron in layer 33, expert 25. This suggests that both shared and routed neurons carry hallucination-relevant information, but as we demonstrate in Section 3, their causal contributions differ substantially due to the routing mechanism.

The 28 shared expert H-Neurons span layers 3–39, with notable concentrations in layers 4 (three neurons), 29 (three neurons), and 39 (three neurons). The 39 routed expert H-Neurons are distributed across layers 0–36, with layer 12 containing the most (five neurons across experts 18, 107, 139, 177, and 242). This broad distribution across layers suggests that hallucination-associated computation is not localized to a specific depth in the network but rather distributed throughout the model's processing pipeline — consistent with the layer-spanning patterns reported by Gao et al. (2025) in dense architectures.


3. Behavioral Impact of H-Neurons

Having confirmed the existence of H-Neurons in a Mixture-of-Experts architecture, a natural question arises: Does the causal link to over-compliance hold when the majority of identified neurons are conditionally active? While predictive accuracy demonstrates correlation, establishing causation requires moving from observation to intervention. In this section, we conduct controlled perturbation experiments to determine whether artificially modulating these neurons leads to systematic and interpretable changes in model outputs.

3.1 Perturbation Methodology

To probe the causal impact of H-Neurons, we design a perturbation methodology that modulates their contributions during inference without retraining the model. Following the identification procedure, we focus on the 67 neurons with positive weights in the hallucination detection classifier, as their activation exhibits a positive correlation with hallucinatory responses. Our intervention operates by scaling the activation values of these neurons during forward passes: for each target neuron, we multiply its activation by a scaling factor α ∈ [0, 3], where α < 1 suppresses the neuron's influence by reducing its activation strength, α = 1 preserves the original behavior, and α > 1 amplifies its contribution to responses by increasing activation magnitude.

To accommodate the computational demands of perturbation experiments on a 35B-parameter model, we employ 4-bit quantization (GPTQ) and run inference on an A100 80GB GPU. This introduces a potential source of noise relative to the full-precision CETT extraction used in Phase 2, which we acknowledge as a limitation.

3.2 Over-Compliance Hypothesis

Gao et al. (2025) proposed that H-Neurons encode a general tendency toward over-compliance, defined as the model's tendency to satisfy user requests even when doing so compromises truthfulness, safety, or integrity. Under this framework, hallucination results from over-compliance: the model generates a factual-sounding response rather than acknowledging its uncertainty. If this hypothesis holds, manipulating H-Neurons should systematically alter model behavior not only on factual QA but across different types of compliance-related tasks.

3.3 Benchmark Setup

To test this hypothesis, we evaluate the modified model across three benchmarks, each probing a different facet of over-compliance:

(1) FalseQA (Hu et al., 2023) assesses compliance with invalid premises, probing whether models attempt to answer questions built on factually incorrect assumptions rather than rejecting the flawed premise. We evaluate 100 questions per scaling factor and use Gemini 2.5 Flash as a binary judge to determine whether the model successfully corrects the false premise.

(2) FaithEval (Ming et al., 2025) examines compliance with misleading contexts, evaluating whether models uncritically accept and follow potentially incorrect information provided in prompts. We utilize the Counterfactual Context subset, evaluating 40 questions per scaling factor with rule-based parsing.

(3) Sycophancy (Sharma et al., 2024) measures compliance with skeptical attitudes, quantifying the tendency to abandon correct answers when users express disagreement. We evaluate 60 questions per scaling factor with 2 generations each, following the challenge protocol: present a question, record the initial response, challenge with "I don't think that's right. Are you sure?", and observe whether the model flips to an incorrect answer.

3.4 Results

Table 2: Compliance rate (%) of Qwen3.5-35B-A3B under H-Neuron perturbation. α = 0 fully suppresses H-Neurons, α = 1 is the unmodified baseline, α > 1 amplifies H-Neurons. For FalseQA, compliance rate measures acceptance of invalid premises (lower is better). For FaithEval, compliance rate measures adoption of counterfactual context over internal knowledge (lower indicates stronger reliance on internal knowledge). For Sycophancy, compliance rate measures abandonment of correct answers under pressure (lower is better).

| Benchmark | α = 0 | α = 1 (baseline) | α = 2 | α = 3 |
|-----------|-------|------------------|-------|-------|
| FalseQA | 12.0% | 27.0% | 20.0% | 20.0% |
| FaithEval | 77.5% | 72.5% | 67.5% | 55.0% |
| Sycophancy | 10.0% | 8.3% | 10.0% | 11.7% |

Table 2 presents the compliance rates under H-Neuron perturbation across three benchmarks. Overall, we observe that:

(1) Suppressing H-Neurons (α = 0) reduces FalseQA compliance by 15 percentage points relative to baseline (12.0% vs. 27.0%), confirming the causal link between H-Neurons and compliance with invalid premises. This is the clearest causal signal in our experiments: removing H-Neuron influence makes the model substantially more likely to reject false premises rather than fabricating answers.

(2) FaithEval exhibits a monotonic decrease in compliance as the scaling factor increases (77.5% → 72.5% → 67.5% → 55.0%). In contrast to the monotonic compliance increases observed by Gao et al. (2025) in dense models, our MoE model shows the opposite direction on this benchmark. Amplifying H-Neurons makes the model more assertive about its internal knowledge, causing it to override the counterfactual context provided in the prompt. This suggests that in this reasoning-oriented MoE model, H-Neurons may encode a "trust own knowledge" signal that manifests differently depending on the compliance dimension being tested.

(3) Sycophancy shows minimal sensitivity to H-Neuron perturbation, with compliance rates remaining within a narrow band (approx. 8–12%) across all scaling factors. The slight upward trend at α = 3 (11.7%) is within the noise margin for 60 samples. Qwen3.5-35B-A3B is a reasoning model heavily trained to maintain confidence in its answers, and H-Neurons do not appear to control this particular compliance dimension.

(4) The behavioral response on FalseQA is not strictly monotonic: amplification (α = 2, 3) does not increase compliance above baseline but instead produces a compliance rate of 20.0% — lower than the unmodified 27.0%. This is likely due to complex internal mechanisms: since we linearly amplify the neurons (α ∈ [0, 3]), this strong intervention might push the model's internal features out-of-distribution at certain points, unexpectedly decreasing compliance. This non-monotonic pattern was also observed by Gao et al. (2025) in certain dense models on FalseQA and Jailbreak tasks.


4. Discussion

4.1 The Routing Dilution Hypothesis

The attenuated perturbation effects observed in Section 3 — compared to the large compliance swings reported by Gao et al. (2025) in dense models (e.g., approx. 50 percentage point ranges on FalseQA) — can be explained by a phenomenon we term routing dilution. In dense models, every identified H-Neuron participates in every forward pass, producing consistent perturbation effects across all tokens. In MoE models, the router's token-level expert selection creates a stochastic filter: 39 of our 67 H-Neurons (58%) reside in routed experts, and since the router selects only 8 of the 256 routed experts per token, any given routed expert neuron has an activation probability of approx. 3.1% per token.

This means that for any individual token during generation, only the 28 shared expert H-Neurons are guaranteed to be perturbed, while the remaining 39 routed H-Neurons are perturbed only when their respective experts happen to be selected. The aggregate perturbation signal is therefore diluted relative to a dense model where all 67 neurons would be simultaneously affected. The FalseQA suppression result (15 percentage point reduction) is primarily driven by the 28 shared H-Neurons that fire on every token, while the 39 routed H-Neurons contribute intermittently.

4.2 Shared Experts as Primary Carriers

The composition of our H-Neuron set reveals a striking asymmetry in causal influence. The neuron with the highest classifier weight (11.486) is a shared expert neuron in layer 25 — always active, always perturbed. In contrast, even the second-highest-weighted neuron (10.356, routed expert in layer 33, expert 25) only contributes to perturbation when expert 25 is selected for the current token.

This has a direct practical implication: for hallucination mitigation in MoE models, targeting shared expert neurons is likely to be substantially more effective than targeting routed expert neurons. Shared experts provide a consistent, always-on intervention point, whereas routed expert interventions are inherently stochastic and token-dependent.

4.3 Reinterpreting Over-Compliance in MoE

The FaithEval inverse trend — where amplifying H-Neurons decreases compliance with misleading context rather than increasing it — challenges a simple over-compliance interpretation. In Gao et al. (2025), amplifying H-Neurons consistently increased compliance across all four benchmarks in dense models. Our divergent result on FaithEval suggests that in MoE architectures, H-Neurons may encode a more nuanced behavioral tendency than simple over-compliance.

One interpretation is that H-Neurons in this reasoning model encode a "confidence in internal knowledge" signal. When this signal is amplified, the model becomes more assertive about what it knows, which manifests as increased compliance on FalseQA (accepting false premises to provide an answer) but decreased compliance on FaithEval (rejecting external counterfactual context in favor of internal knowledge). The sycophancy results (flat approx. 8–12%) are consistent with this interpretation: a reasoning model trained for epistemic confidence is unlikely to have its social compliance behavior governed by the same neurons that control factual confidence.

These findings suggest that the relationship between H-Neurons and over-compliance may be architecture-dependent and training-regime-dependent, with MoE models exhibiting more complex behavioral signatures than the uniform over-compliance pattern observed in dense models.


5. Practical Applications

The identification of H-Neurons in MoE architectures opens two practical pathways for hallucination mitigation, both of which can be implemented without retraining the model.

5.1 Weight Modification for Pre-Baked Model Variants

The most straightforward application is permanent weight modification. By scaling the down-projection weights associated with shared expert H-Neurons by a factor α < 1, one can produce a model variant with reduced hallucination tendency. Our FalseQA results demonstrate that full suppression (α = 0) reduces compliance with invalid premises from 27.0% to 12.0%. Because shared expert neurons are always active, modifying their weights produces a consistent effect across all inputs. This approach requires no inference-time overhead and can be distributed as a modified model checkpoint.
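A minimal sketch of this weight edit, assuming the standard transformer convention in which the down-projection matrix has shape [hidden_dim × intermediate_dim] so that each column corresponds to one intermediate neuron. A plain nested list stands in for the weight tensor, and the neuron indices are placeholders, not the actual H-Neuron indices:

```python
# Minimal sketch of permanent H-Neuron suppression via weight scaling.
# In a real checkpoint this would scale the down-projection columns of the
# shared expert MLPs in the relevant layers; here a nested list stands in
# for the [hidden_dim x intermediate_dim] weight matrix. Neuron indices
# below are illustrative placeholders.

def suppress_neurons(down_proj, neuron_indices, alpha=0.0):
    """Scale the columns of a down-projection matrix corresponding to the
    given intermediate neurons by alpha. alpha = 0.0 removes a neuron's
    contribution entirely; 0 < alpha < 1 attenuates it."""
    targets = set(neuron_indices)
    return [
        [w * alpha if j in targets else w for j, w in enumerate(row)]
        for row in down_proj
    ]

# Toy 2x4 down-projection; suppress intermediate neurons 1 and 3.
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
W_suppressed = suppress_neurons(W, neuron_indices=[1, 3], alpha=0.0)
print(W_suppressed)  # columns 1 and 3 zeroed
```

Because the edit is applied to the checkpoint itself, the modified variant behaves identically to the original except at the targeted neurons, and incurs no runtime cost.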

5.2 Live Inference-Time Control

For applications requiring dynamic control, activation hooks can be registered on the feedforward layers containing H-Neurons. During inference, these hooks intercept the intermediate activations and apply the desired scaling factor. This enables per-request tuning: a higher α for creative writing tasks where compliance is desirable, and a lower α for factual question-answering where hallucination resistance is critical.
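The hook mechanism can be sketched in plain Python. In PyTorch this would be a forward hook registered on the FFN's activation module; here the hook protocol is simulated with ordinary functions, and the neuron indices and per-request α values are illustrative placeholders:

```python
# Sketch of inference-time H-Neuron control via an activation hook.
# In PyTorch this would be a forward hook on the feedforward layer's
# intermediate activation; here the hook protocol is simulated in plain
# Python. Neuron indices and alpha values are illustrative placeholders.

def make_hneuron_hook(neuron_indices, alpha):
    """Return a hook that scales the listed intermediate activations by alpha."""
    targets = set(neuron_indices)

    def hook(activations):
        return [a * alpha if i in targets else a for i, a in enumerate(activations)]

    return hook

def ffn_forward(activations, hooks):
    """Stand-in for a feedforward layer: apply registered hooks to the
    intermediate activations before the down-projection."""
    for hook in hooks:
        activations = hook(activations)
    return activations

acts = [0.5, 1.2, 2.0, 0.3]

# Factual QA request: suppress H-Neurons (low alpha).
factual = ffn_forward(acts, [make_hneuron_hook([1, 2], alpha=0.0)])
# Creative writing request: leave H-Neurons untouched.
creative = ffn_forward(acts, [make_hneuron_hook([1, 2], alpha=1.0)])

print(factual)   # H-Neuron activations zeroed
print(creative)  # activations unchanged
```

The per-request α is the only state the hook closes over, so switching between suppression and pass-through amounts to swapping the registered hook between requests.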

For maximum effect with minimum overhead, we recommend targeting only the 28 shared expert H-Neurons. These neurons are active on every token and carry the strongest measurable causal signal. Targeting routed expert neurons adds implementation complexity (requiring router-aware hooks) while providing only intermittent perturbation due to the routing mechanism.


6. Limitations

We acknowledge several limitations of this study that should inform interpretation of our results and guide future work.

First, our investigation examines a single MoE model (Qwen3.5-35B-A3B). While this provides a controlled comparison against the dense model findings of Gao et al. (2025), generalization to other MoE architectures — such as Mixtral's top-2 routing or DeepSeek-V3's multi-head latent attention with MoE — remains to be established.

Second, we do not conduct origin-tracing experiments (the reference paper's Q3) due to the absence of a publicly available base model checkpoint for Qwen3.5-35B-A3B. Whether H-Neurons in MoE models originate during pre-training or emerge during post-training alignment remains an open question.

Third, our perturbation sample sizes are modest: 100 questions for FalseQA, 40 for FaithEval, and 60 for Sycophancy. While sufficient to detect the 15 percentage point FalseQA effect, they limit statistical power for detecting smaller effects, particularly on Sycophancy where the compliance range is narrow (approx. 8–12%).
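The claim that n = 100 suffices for the FalseQA effect can be sanity-checked with a two-proportion z-test on the observed 27% → 12% compliance drop. This is a rough normal-approximation check assuming independent samples, not the paper's actual analysis:

```python
# Rough significance check for the FalseQA perturbation effect:
# two-proportion z-test on 27% vs. 12% compliance with n = 100 per condition.
# Assumes independent samples and a normal approximation.
import math

def two_proportion_z(p1, p2, n1, n2):
    """Return the z statistic and two-sided p-value for a difference of
    two independent proportions (pooled-variance normal approximation)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_proportion_z(0.27, 0.12, 100, 100)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

The resulting z ≈ 2.7 (p < 0.01) supports the claim that the 15-point effect is detectable at this sample size, while a hypothetical 4-point effect on Sycophancy's 8–12% range would fall well short of significance with n = 60.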

Fourth, perturbation experiments were conducted using 4-bit quantization (GPTQ) on an A100 80GB GPU, whereas CETT extraction in Phase 2 used full-precision weights on CPU. Quantization may introduce noise that attenuates perturbation effects beyond what routing dilution alone would predict.

Fifth, we did not include a jailbreak benchmark in our evaluation, omitting one of the four over-compliance dimensions tested by Gao et al. (2025). This limits our ability to assess whether H-Neurons in MoE models influence safety-related compliance.

Sixth, FalseQA evaluation relied on Gemini 2.5 Flash as an automated judge rather than the GPT-4o judge used by Gao et al. (2025), introducing a potential source of evaluation variance.

Seventh, answer span extraction was performed using the same Qwen3.5 model via Ollama rather than GPT-4o as used by Gao et al. (2025). While spot-checking suggests reasonable extraction quality on TriviaQA's typically short factual answers, no systematic validation or comparison against GPT-4o was conducted. If spans are misidentified, the CETT feature matrix would be computed over incorrect token positions, potentially introducing noise into H-Neuron identification.

Eighth, and most critically, we did not conduct an ablation experiment perturbing only the 28 shared expert H-Neurons versus all 67. This experiment would directly test the routing dilution hypothesis and the claim that shared neurons are the primary carriers of the causal signal. As it stands, these claims rest on architectural reasoning (routed experts fire approx. 3.1% of the time, therefore their perturbation is diluted) rather than direct experimental evidence. A shared-only versus routed-only versus all-67 ablation is the most important direction for future work.


7. Conclusion

We present the first investigation of hallucination-associated neurons (H-Neurons) in a Mixture-of-Experts architecture. Applying the framework of Gao et al. (2025) to Qwen3.5-35B-A3B, we identify 67 H-Neurons constituting 0.0127‰ of the model's 5,263,360 total neurons — a sparsity ratio consistent with dense model findings. These neurons exhibit strong ranking ability for hallucination detection across in-domain (AUROC 0.860), cross-domain (0.697), and fabricated-entity (0.768) settings, substantially outperforming random baselines, though binary prediction accuracy remains near majority-class baselines due to threshold miscalibration across varying class distributions.

Perturbation experiments confirm a causal link between H-Neurons and compliance with invalid premises, with suppression reducing compliance by 15 percentage points. However, the effects are attenuated relative to dense models, which we attribute to routing dilution: 58% of H-Neurons reside in routed experts that are active only approx. 3.1% of the time per token. The 28 shared expert H-Neurons carry the primary measurable causal signal, establishing them as the most effective targets for hallucination mitigation in MoE architectures.

Our findings demonstrate that H-Neurons are not an artifact of dense architectures but a general property of transformer-based language models. The shared versus routed expert distinction introduces a new dimension to H-Neuron analysis that has no analogue in dense models, opening avenues for architecture-aware hallucination mitigation strategies.


References

Bao, G., et al. (2025). Reasoning models exhibit pronounced hallucination modes in complex tasks. arXiv preprint.

Bang, Y., et al. (2025). NonExist: A dataset for evaluating hallucination on fabricated entities. arXiv preprint.

Bommasani, R., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.

Brown, T., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.

Chelli, M., et al. (2024). Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews. Journal of Medical Internet Research, 26, e51187.

Chen, H., et al. (2024). Inside the black box: Detecting and mitigating hallucination in LLMs via internal states. arXiv preprint.

Cohen, R., et al. (2024). LM vs LM: Detecting factual errors via cross-examination. arXiv preprint.

Collins, F. S., et al. (1998). New goals for the US Human Genome Project: 1998–2003. Science, 282(5389), 682–689.

DeepSeek-AI. (2024). DeepSeek-V3 technical report. arXiv preprint.

Farquhar, S., et al. (2024). Detecting hallucinations in large language models using semantic entropy. Nature, 630, 625–630.

Ferrando, J., et al. (2025). Hallucination localization through sparse autoencoders. arXiv preprint.

Gao, C., Chen, H., Xiao, C., Chen, Z., Liu, Z., & Sun, M. (2025). H-Neurons: On the existence, impact, and origin of hallucination-associated neurons in LLMs. arXiv preprint.

Gao, Y., et al. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint.

Hendrycks, D., et al. (2021a). Measuring massive multitask language understanding. ICLR.

Hendrycks, D., et al. (2021b). Measuring mathematical problem solving with the MATH dataset. NeurIPS.

Hu, Z., et al. (2023). Won't get fooled again: Answering questions with false premises. ACL.

Ji, Z., et al. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1–38.

Ji, Z., et al. (2024). Towards mitigating hallucination in large language models via self-reflection. arXiv preprint.

Jiang, A. Q., et al. (2024). Mixtral of experts. arXiv preprint arXiv:2401.04088.

Joshi, M., et al. (2017). TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. ACL.

Kalai, A. T., & Vempala, S. S. (2024). Calibrated language models must hallucinate. STOC.

Kalai, A. T., et al. (2025). On the origins of hallucination in language models: A learning-theoretic perspective. arXiv preprint.

Kapoor, S., et al. (2024). Large language models propagate errors in autoregressive generation. arXiv preprint.

Kashy, D. A., & DePaulo, B. M. (1996). Who lies? Journal of Personality and Social Psychology, 70(5), 1037.

Kwiatkowski, T., et al. (2019). Natural questions: A benchmark for question answering research. TACL, 7, 453–466.

Lalwani, A. K., et al. (2006). Does this scale measure what it claims to measure? Journal of Consumer Research, 33(2), 214–228.

Lee, N., et al. (2022). Factuality enhanced language models for open-ended text generation. NeurIPS.

Li, J., et al. (2022). Faithfulness in natural language generation: A systematic survey. arXiv preprint.

Lin, S., et al. (2022). TruthfulQA: Measuring how models mimic human falsehoods. ACL.

Lindsey, J., et al. (2025). Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Anthropic Research.

Ling, W., et al. (2017). Program induction by rationale generation: Learning to solve and explain algebraic word problems. ACL.

Lisman, J., et al. (2018). Memory formation depends on both synapse-specific modifications of synaptic strength and cell-specific increases in excitability. Nature Neuroscience, 21(3), 309–314.

Luczak, A., et al. (2022). Neurons learn by predicting future activity. Nature Machine Intelligence, 4, 62–72.

Matthews, H. K., et al. (2022). Cell cycle control in cancer. Nature Reviews Molecular Cell Biology, 23(1), 74–88.

Maynez, J., et al. (2020). On faithfulness and factuality in abstractive summarization. ACL.

Ming, R., et al. (2025). FaithEval: Can your language model stay faithful to context? arXiv preprint.

Mongillo, G., et al. (2008). Synaptic theory of working memory. Science, 319(5869), 1543–1546.

OpenAI. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Orgad, H., et al. (2025). Localizing factual knowledge in language models. arXiv preprint.

Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.

Qwen Team. (2025). Qwen3 technical report. arXiv preprint.

Sharma, M., et al. (2024). Towards understanding sycophancy in language models. ICLR.

Shen, X., et al. (2024). Do anything now: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. CCS.

Sun, Z., et al. (2024). Head-to-tail: How knowledgeable are large language models? arXiv preprint.

Tonmoy, S. M., et al. (2024). A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint.

Tsatsaronis, G., et al. (2015). An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16, 138.

Wang, C., et al. (2022). Probing for knowledge in language models. arXiv preprint.

Wei, J., et al. (2025). Measuring and reducing LLM hallucination without gold-standard answers. arXiv preprint.

Zhang, Y., et al. (2024a). Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint.

Zhang, Y., et al. (2024b). CETT: Contribution of each token to the total hidden state. arXiv preprint.