
H-Neurons in Mixture-of-Experts: Hallucination-Associated Neurons in Sparse Architectures

The corrected V2 paper reporting a null causal result for H-Neuron perturbation in MoE architectures, with replicated findings across three independent experimental configurations.

By Spirited Mind
Tags: mechanistic-interpretability, hallucination, mixture-of-experts, h-neurons, qwen, paper



Abstract

Large language models (LLMs) frequently generate hallucinations — plausible but factually incorrect outputs — undermining their reliability across applications. Gao et al. (2025) recently demonstrated that a remarkably sparse subset of neurons, termed H-Neurons, can reliably predict and causally influence hallucination in dense transformer architectures. However, modern state-of-the-art models increasingly employ Mixture-of-Experts (MoE) architectures, where only a fraction of parameters is active for any given token, and no prior work has examined whether H-Neuron findings transfer to these sparse models. In this paper, we extend the H-Neuron framework to MoE by conducting a systematic replication study on Qwen3.5-35B-A3B, a 35-billion parameter hybrid Mamba/transformer MoE model activating approximately 3 billion parameters per token. We identify 67 H-Neurons constituting 0.0127‰ of the model's 5,263,360 total neurons — falling squarely within the range reported for dense architectures (0.01‰–0.35‰). These neurons achieve an AUROC of 0.860 for in-domain hallucination detection and generalize in ranking ability to cross-domain (0.697) and fabricated-entity (0.768) settings, though binary prediction accuracy remains near majority-class baselines due to threshold miscalibration across class distributions. However, perturbation experiments reveal a critical divergence from dense model findings: scaling H-Neuron activations across α ∈ [0, 3] produces no systematic change in compliance behavior across three benchmarks (FalseQA, FaithEval, Sycophancy), with compliance rates varying by less than 7 percentage points — well within sampling noise for our evaluation sizes. We replicate this null result across three independent experimental configurations (A100 GPU with GPTQ quantization, A100 with post-hoc hooks, and Apple Silicon with MLX 4-bit quantization), ruling out implementation artifacts. 
We attribute this absence of causal effect to routing dilution: 39 of the 67 H-Neurons (58%) reside in routed experts that are conditionally active for only approx. 3.1% of tokens, and even the 28 shared expert H-Neurons (always active) produce insufficient perturbation to shift model behavior. Our findings demonstrate that H-Neurons generalize to MoE architectures as a detection signal but that the causal intervention framework established for dense models does not transfer — suggesting that MoE routing distributes or buffers single-neuron perturbation effects.


1. Introduction

In recent years, large language models have achieved groundbreaking advancements in natural language processing, demonstrating impressive potential towards artificial general intelligence (Bommasani et al., 2021; Brown et al., 2020; Ouyang et al., 2022; OpenAI, 2023). However, these advancements come with a persistent reliability challenge: hallucinations. Hallucinations occur when models produce outputs that seem plausible but are factually inaccurate or unsupported by evidence (Maynez et al., 2020; Ji et al., 2023). For example, GPT-3.5 has been shown to hallucinate in approximately 40% of citation-based factuality evaluations, a figure that improves but remains high at 28.6% for GPT-4 (Chelli et al., 2024). Similarly, emerging reasoning-centric systems such as DeepSeek-R1, despite demonstrating strong performance on complex tasks, continue to exhibit pronounced hallucination modes (Bao et al., 2025). Collectively, these observations indicate that hallucinations persist regardless of model architecture, highlighting a critical bottleneck in the reliability of state-of-the-art LLMs.

To improve LLM reliability, researchers have invested considerable effort in uncovering the mechanisms and factors behind hallucinations, which can be broadly grouped into three categories. First, from a training data perspective, distribution imbalances and inherent biases within datasets make it difficult for models to accurately recall long-tail facts (Sun et al., 2024; Li et al., 2022). Second, training objectives in both pretraining and post-training phases primarily incentivize confident predictions without promoting the expression of uncertainty for unfamiliar information, encouraging models to output incorrect guesses (Kalai et al., 2025). Third, decoding algorithms introduce instability through randomness and error accumulation in autoregressive generation, allowing small deviations to snowball into hallucinations (Zhang et al., 2024a; Lee et al., 2022; Kapoor et al., 2024).

Current studies largely treat LLMs as black boxes, examining hallucination causes at a macroscopic level while neglecting microscopic insights into neuron-level mechanisms. Yet, such fine-grained analysis holds immense promise for explaining how hallucinations arise and for developing mitigation strategies. Just as understanding how specialized cell types in the brain contribute differently to cognitive functions requires distinguishing between always-active interneurons and conditionally-recruited projection neurons, understanding hallucinations in neural networks requires examining the fundamental computational units — individual neurons — and their activation patterns in relation to faithful and hallucinatory outputs. Gao et al. (2025) conducted a systematic investigation into hallucination-associated neurons (H-Neurons) in dense LLMs, demonstrating that a remarkably sparse subset of neurons (less than 0.1% of total neurons) can reliably predict hallucination occurrences, that these neurons are causally linked to over-compliance behaviors, and that they originate during pre-training rather than post-training alignment.

However, all prior investigations of H-Neurons have been conducted exclusively on dense transformer architectures. Modern state-of-the-art models increasingly employ Mixture-of-Experts (MoE) architectures, where only a subset of parameters is active for any given token. Models such as Mixtral (Jiang et al., 2024), DeepSeek-V3 (DeepSeek-AI, 2024), and the Qwen3 family (Qwen Team, 2025) use MoE to achieve strong performance with reduced inference cost. This fundamental architectural difference raises critical questions about whether H-Neuron findings transfer to sparse models: in a dense model, every identified H-Neuron participates in every forward pass, but in an MoE model, a routed expert neuron only fires when its expert is selected by the router — potentially as rarely as 3.1% of the time per token.

In this paper, we extend the H-Neuron framework to MoE architectures by conducting a systematic replication study on Qwen3.5-35B-A3B. We address the following research questions:

  • Q1: Do H-Neurons exist in MoE models with comparable sparsity? Can we identify specific neurons whose activations reliably distinguish between hallucinatory and faithful outputs in a model where the majority of neurons are conditionally active?

  • Q2: Does the causal link to over-compliance hold when most identified H-Neurons are conditionally active? Do perturbation effects persist despite the routing mechanism's stochastic filtering of expert participation?

  • Q3: How does the shared versus routed expert distinction affect H-Neuron behavior? Does the architectural split between always-active shared experts and conditionally-routed experts create a meaningful functional division among H-Neurons?

Our investigation yields the following contributions:

  • We provide the first identification and analysis of H-Neurons in a Mixture-of-Experts architecture, demonstrating that 67 H-Neurons exist at a sparsity ratio (0.0127‰) consistent with dense model findings. We validate the underlying answer span extraction through a three-part audit (text matching, Claude-based semantic judgment, and heuristic comparison), confirming 97% span quality.

  • We confirm that H-Neurons in MoE models exhibit cross-domain ranking ability, achieving AUROC scores of 0.860 (in-domain), 0.697 (cross-domain biomedical), and 0.768 (fabricated entities), substantially outperforming random neuron baselines. However, binary prediction accuracy remains near majority-class baselines due to threshold miscalibration across class distributions, a practical limitation we analyze in detail.

  • We report a null causal result: perturbation experiments across three independent configurations (A100 GPU with GPTQ, A100 with post-hoc hooks, and Apple Silicon with MLX) show no systematic relationship between H-Neuron scaling (α ∈ [0, 3]) and over-compliance behavior on any of three benchmarks. This contrasts sharply with the 50+ percentage point compliance swings reported by Gao et al. (2025) in dense models.

  • We introduce the concept of routing dilution — the attenuation of neuron-level perturbation effects in MoE models due to conditional expert activation — as the primary explanation for this null result. The MoE routing mechanism appears to distribute or buffer single-neuron interventions, preventing the concentrated perturbation effects observed in dense architectures.


2. Existence of H-Neurons in MoE

2.1 Model and Architecture

To investigate whether H-Neurons exist in sparse architectures, we select Qwen3.5-35B-A3B (Qwen Team, 2025) as our target model. Unlike dense models where every neuron participates in every forward pass, MoE architectures employ a routing mechanism that selects a subset of experts per token, activating only a fraction of the model's total parameters during inference. Qwen3.5-35B-A3B contains 35 billion total parameters but activates approximately 3 billion per token, achieving competitive performance at substantially reduced computational cost.

| Property | Value |
|----------|-------|
| Total parameters | 35B |
| Active parameters per token | approx. 3B |
| Layers | 40 |
| Attention type | Hybrid: 30 linear attention + 10 standard self-attention |
| Routed experts per layer | 256 |
| Active routed experts per token | 8 |
| Shared experts per layer | 1 |
| Neurons per expert (intermediate dim) | 512 |
| Total neuron count | 5,263,360 |

Qwen3.5-35B-A3B employs a hybrid attention architecture: 30 of the 40 layers use linear attention (with Mamba-style state-space components including learned decay parameters, causal convolutions, and discretization biases), while every 4th layer (layers 3, 7, 11, 15, 19, 23, 27, 31, 35, 39) uses standard multi-head self-attention. This hybrid design is orthogonal to the MoE feedforward structure that is the focus of our analysis — all 40 layers share an identical MoE FFN architecture regardless of their attention mechanism.

Each of the 40 layers contains 256 routed experts and 1 shared expert. The shared expert participates in every forward pass regardless of input, while the router selects 8 of the 256 routed experts per token. Each expert contains a feedforward network with an intermediate dimension of 512 neurons. The total neuron space is therefore 40 × (256 × 512 + 1 × 512) = 5,263,360 neurons. This architectural distinction between shared and routed experts is central to our analysis: it creates two fundamentally different classes of neurons — those that are always active and those that are conditionally recruited. We note that the hidden state representations entering the FFN differ depending on whether they were produced by linear attention or standard self-attention, which could in principle affect which neurons become hallucination-associated at different layers.
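The neuron count above follows directly from the architectural constants, which can be checked with a few lines of arithmetic (values taken from the table in this section):

```python
# Reproduce the total neuron count of Qwen3.5-35B-A3B from the
# architectural constants quoted above.
LAYERS = 40
ROUTED_EXPERTS = 256
SHARED_EXPERTS = 1
NEURONS_PER_EXPERT = 512  # FFN intermediate dimension

neurons_per_layer = (ROUTED_EXPERTS + SHARED_EXPERTS) * NEURONS_PER_EXPERT
total_neurons = LAYERS * neurons_per_layer
print(total_neurons)  # 5263360
```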

2.2 Data Construction

Following the methodology established by Gao et al. (2025), we adopt the TriviaQA dataset (Joshi et al., 2017) for its broad coverage of general-domain knowledge and typically concise answers. To capture the model's stable behavioral patterns, we perform a consistency check by sampling 10 distinct responses per question using probabilistic decoding parameters (temperature=1.0, top_k=50, top_p=0.9). We retain only those instances where the model exhibits consistent behavior: either answering correctly in all 10 samples or failing in all 10 samples with incorrect answers rather than refusals.

This strict filtering yields a contrastive set of 1,000 consistently correct and 864 consistently incorrect examples, totaling 1,864 samples. The slight imbalance (compared to the reference paper's 1,000/1,000 split) reflects the model's knowledge distribution on TriviaQA. This ensures that any observed differences in neuronal activity are attributable to the fundamental truthfulness of the output rather than generation noise.
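The consistency filter can be sketched as follows. This is an illustrative reconstruction of the criterion described above, not our pipeline code; `is_correct` and `is_refusal` are hypothetical judge predicates.

```python
def consistency_label(samples, is_correct, is_refusal):
    """Label a question by behavioral consistency across its sampled responses.

    samples: the 10 sampled responses for one question.
    is_correct / is_refusal: hypothetical judge predicates.
    Returns 'correct', 'incorrect', or None (excluded from the contrastive set).
    """
    verdicts = [(is_correct(s), is_refusal(s)) for s in samples]
    if all(c for c, _ in verdicts):
        return "correct"
    # Keep only all-wrong questions whose failures are answers, not refusals.
    if all((not c and not r) for c, r in verdicts):
        return "incorrect"
    return None  # mixed behavior: discarded
```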

To precisely localize the neural signal, we extract answer tokens — the specific spans containing the factual claim — using Ollama-hosted inference (the same Qwen3.5 model with greedy decoding) as our answer token extractor, adapting the reference paper's use of GPT-4o for this purpose. Because TriviaQA answers are typically short factual entities (names, dates, places), the extraction task is low-ambiguity. To validate extraction quality, we conduct a three-part audit: (1) a text-matching check confirms that 96.9% of extracted spans (1,807/1,864) appear verbatim in the response text, and that 98.3% of correct samples (983/1,000) match a TriviaQA answer alias; (2) a Claude-based semantic judgment on a representative subset finds 97% of spans are valid minimal entities and 100% capture the core factual claim; (3) a heuristic comparison shows 94% agreement between the Ollama extractor and a rule-based baseline, with all disagreements favoring the Ollama extraction. By focusing on these token positions, we ensure that the detected activation patterns are directly linked to the factual content of the generation rather than syntactic filler.
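The text-matching portion of the audit reduces to two string checks per sample, sketched below. The case-insensitive alias comparison is an assumption of this sketch; the actual audit may normalize differently.

```python
def audit_span(response, span, aliases):
    """Text-matching audit for one extracted answer span.

    Checks (1) that the span appears verbatim in the response, and
    (2) that it matches one of the dataset's answer aliases
    (case-insensitive here, an assumption of this sketch).
    """
    verbatim = span in response
    alias_match = span.strip().lower() in {a.strip().lower() for a in aliases}
    return verbatim, alias_match
```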

2.3 CETT for MoE

With the dataset established, we quantify the functional influence of every neuron on each response using the Contribution of Each Token to the Total hidden state (CETT) metric (Zhang et al., 2024b). Simply recording raw activation magnitudes is insufficient, as a neuron might exhibit high activation yet have a negligible impact on the hidden state representation due to downstream projection weights. CETT captures the fraction of the information flow at each token position that is explicitly attributable to a given neuron.

For a token at position t with hidden representation x_t ∈ ℝ^d, the MLP computes an intermediate activation:

z_t = σ(W_gate · x_t) ⊙ (W_up · x_t)

where σ(·) denotes the non-linear activation. The contribution of neuron j is measured as:

CETT(j,t) = ‖h^(j)_t‖₂ / ‖h_t‖₂

where h^(j)_t = W_down · z^(j)_t is the down-projected partial hidden vector attributable to neuron j, and h_t = W_down · z_t is the full hidden state.

To adapt CETT for MoE, we account for the fused gate-up projection tensors used in Qwen3.5-35B-A3B's architecture. Each expert's feedforward network uses a fused gate_up_proj weight matrix of shape (2 × d_m) × d, which we split to recover the separate gate and up projections. For routed experts, the contribution is computed only when the expert is selected by the router for the given token; when an expert is not selected, its neurons contribute zero to the hidden state.

We aggregate token-level scores into two fixed-dimensional features per neuron per sample: CETT_mean(j, answer) (mean over answer tokens) and CETT_mean(j, other) (mean over non-answer tokens), following Equation 3 of Gao et al. (2025). This yields a feature matrix of 1,864 samples × 10,526,720 features (two per neuron).
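The per-neuron CETT computation above can be written compactly: neuron j's contribution to the hidden state is z_t[j] · W_down[:, j], so its norm factorizes into |z_t[j]| times the column norm of W_down. The sketch below uses a SiLU nonlinearity and toy shapes as assumptions; for a routed expert it would only be evaluated on tokens where the router selects that expert.

```python
import numpy as np

def silu(v):
    return v / (1.0 + np.exp(-v))  # SiLU / swish nonlinearity (assumed)

def cett_scores(x_t, W_gate, W_up, W_down):
    """Per-neuron CETT at one token position.

    Assumed shapes: W_gate, W_up: (d_m, d); W_down: (d, d_m); x_t: (d,).
    Returns a length-d_m vector of CETT(j, t) values.
    """
    z_t = silu(W_gate @ x_t) * (W_up @ x_t)   # intermediate activation
    h_t = W_down @ z_t                        # full hidden-state contribution
    # Neuron j contributes z_t[j] * W_down[:, j], whose norm is
    # |z_t[j]| times the column norm of W_down.
    per_neuron_norms = np.abs(z_t) * np.linalg.norm(W_down, axis=0)
    return per_neuron_norms / (np.linalg.norm(h_t) + 1e-12)
```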

2.4 Sparse Classification

To identify the specific subset of neurons associated with hallucination, we employ L1-regularized logistic regression rather than a dense or non-linear model. The choice of a linear model ensures that the learned weights θ are directly interpretable as the marginal contribution of each neuron to the hallucination log-odds. The L1 penalty enforces sparsity, as we hypothesize that hallucinations are driven by a sparse subset of neurons rather than the entire network.

The training objective minimizes the negative log-likelihood with the sparsity constraint:

L(θ) = -Σᵢ [yᵢ log σ(θᵀxᵢ) + (1 - yᵢ) log(1 - σ(θᵀxᵢ))] + λ‖θ‖₁

We perform a grid search over the regularization parameter C = 1/λ using an 80/20 train/test split, selecting C = 1.0 as the value that maximizes held-out classification performance (AUROC = 0.9729, accuracy = 87.9%). The final classifier is then retrained on the full 1,864 samples at the selected C. Of the 5,263,360 neurons, the classifier assigns non-zero weights to 127 neurons, of which 67 receive positive weights and 60 receive negative weights. Following Gao et al. (2025), we define the 67 positively-weighted neurons as H-Neurons, as their activation exhibits a positive correlation with hallucinatory responses.
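The objective above can be minimized with proximal gradient descent (ISTA), shown below as a didactic numpy sketch rather than the solver used in our experiments; `lam` plays the role of 1/C, and the soft-thresholding step is the proximal operator of the L1 penalty.

```python
import numpy as np

def l1_logistic(X, y, lam=0.1, lr=0.005, steps=3000):
    """Minimize the L1-penalized negative log-likelihood from the text
    via proximal gradient descent (ISTA). Illustrative only."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # sigmoid(theta^T x_i)
        grad = X.T @ (p - y)                     # gradient of the NLL term
        theta = theta - lr * grad
        # Soft-thresholding: proximal operator of lam * ||theta||_1.
        theta = np.sign(theta) * np.maximum(np.abs(theta) - lr * lam, 0.0)
    return theta
```

Neurons with positive weights in the fitted vector are the H-Neuron analogue in this sketch; negative weights correlate with faithful outputs.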

2.5 Detection Results

To assess whether the identified H-Neurons generalize beyond the training set and reflect broader patterns of hallucination, we evaluate the trained classifier for hallucination detection on diverse question collections. Following the evaluation protocol of Gao et al. (2025), we design a comprehensive assessment covering three distinct hallucination scenarios: (1) In-Domain Knowledge Recall using TriviaQA and NQ-Open, (2) Cross-Domain Robustness using BioASQ, a biomedical question-answering dataset, and (3) Fabricated Knowledge Detection using NonExist, containing artificially generated questions about non-existent entities. For each evaluation dataset, we sample 200–500 single responses using probabilistic decoding, extract answer tokens, compute CETT features, and apply the trained classifier. We report both AUROC and accuracy, following Gao et al. (2025).

Table 1: Hallucination detection performance of neuron-based classifiers on Qwen3.5-35B-A3B. "H-Neuron" and "Random" refer to classifiers trained with H-Neurons and randomly selected neurons (matched count, n=67), respectively. The H-Neuron classifier is trained on TriviaQA and evaluated across all four settings. "Majority" indicates the accuracy achievable by always predicting the most frequent class.

| Dataset | Hall Rate | Majority Acc | H-Neuron Acc | Random Acc | H-Neuron AUROC | Random AUROC |
|---------|-----------|--------------|--------------|------------|----------------|--------------|
| TriviaQA (n=500) | 36.6% | 63.4% | 65.4% | 36.4% | 0.860 | 0.462 |
| NQ-Open (n=500) | 57.8% | 57.8% | 61.2% | 58.6% | 0.673 | 0.507 |
| BioASQ (n=500) | 77.0% | 77.0% | 77.2% | 51.0% | 0.697 | 0.504 |
| NonExist (n=200) | 21.0% | 79.0% | 21.0% | 24.0% | 0.768 | 0.505 |

Table 1 presents the hallucination detection performance of neuron-based classifiers on Qwen3.5-35B-A3B. The results reveal a clear dissociation between ranking ability (AUROC) and binary prediction accuracy.

In terms of ranking, H-Neurons exhibit robust discriminative ability. First, the H-Neuron classifier achieves an AUROC of 0.860 on in-domain TriviaQA, substantially outperforming the random baseline (0.462). The random baseline operates at chance level (approx. 0.50 AUROC), confirming that it constitutes a fair comparison. Second, this ranking ability generalizes to cross-domain biomedical questions (0.697 vs. 0.504) and fabricated entities (0.768 vs. 0.505), demonstrating that H-Neurons capture generalizable patterns of hallucination rather than dataset-specific artifacts. Third, the NQ-Open result (0.673 vs. 0.507) confirms transfer to a second in-domain dataset, consistent with the cross-dataset generalization observed by Gao et al. (2025) in dense models.

However, the accuracy results tell a more sobering story. The H-Neuron classifier's binary prediction accuracy barely exceeds the majority-class baseline on most datasets: +2.0 percentage points on TriviaQA (65.4% vs. 63.4%), +3.4 on NQ-Open (61.2% vs. 57.8%), and +0.2 on BioASQ (77.2% vs. 77.0%). Most strikingly, on NonExist the classifier achieves only 21.0% accuracy — matching the hallucination rate exactly — indicating that it predicts "hallucination" for nearly every sample. This occurs because the decision threshold learned during training (where the hallucination rate was ~46% under the asymmetric labeling scheme) is miscalibrated for evaluation datasets with very different class distributions. The NonExist dataset has only 21% hallucinations (42 out of 200), so the classifier's threshold is far too aggressive, producing a high false-positive rate.

This dissociation between AUROC and accuracy is important to interpret correctly. The AUROC results demonstrate that H-Neuron activations carry a genuine signal about hallucination likelihood — samples that are hallucinations do tend to receive higher predicted probabilities than faithful samples. However, translating this ranking signal into useful binary predictions would require threshold recalibration for each deployment context, a practical limitation that should be considered when evaluating the utility of H-Neuron-based detection.
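The AUROC/accuracy dissociation can be illustrated on synthetic scores (these are not our actual classifier outputs): a ranking signal survives a class-distribution shift, while a threshold tuned for the training base rate does not.

```python
import numpy as np

def auroc(scores, labels):
    """Mann-Whitney form of AUROC: probability a random positive
    outranks a random negative (ties counted as half)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

rng = np.random.default_rng(0)
# Hallucinations score higher on average: a genuine ranking signal.
pos = rng.normal(1.0, 1.0, 40)    # 20% positive rate, as on NonExist
neg = rng.normal(0.0, 1.0, 160)
scores = np.concatenate([pos, neg])
labels = np.concatenate([np.ones(40), np.zeros(160)])

# A threshold calibrated for a ~46% positive training rate sits far too
# low here, so nearly everything is flagged as a hallucination and
# accuracy falls below the 80% majority-class baseline.
miscalibrated_threshold = -0.5
acc = ((scores > miscalibrated_threshold) == labels).mean()
```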

2.6 H-Neuron Composition

A distinctive feature of H-Neurons in MoE architectures is their dual nature: 28 reside in shared experts that are always active during every forward pass, while 39 reside in routed experts that are conditionally selected by the router. This 42%/58% shared/routed split has no analogue in dense models, where every neuron participates in every token's computation.

In Qwen3.5-35B-A3B, we identify 67 H-Neurons constituting 0.0127‰ of the model's 5,263,360 total neurons. This ratio falls squarely within the range reported by Gao et al. (2025) for dense architectures — from 0.01‰ in large models like Mistral-Small-3.1-24B and Llama-3.1-70B to 0.35‰ in Mistral-7B-v0.3.

The H-Neurons span nearly all layers of the model (layers 0–39), with the 39 routed expert neurons distributed across 33 unique expert IDs out of 256 total. The neuron with the highest classifier weight (11.486) is a shared expert neuron in layer 25, while the second-highest (10.356) is a routed expert neuron in layer 33, expert 25. This suggests that both shared and routed neurons carry hallucination-relevant information, though as we demonstrate in Section 3, neither class produces measurable causal effects on model behavior when perturbed — a finding we attribute to routing dilution.

The 28 shared expert H-Neurons span layers 3–39, with notable concentrations in layers 4 (three neurons), 29 (three neurons), and 39 (three neurons). The 39 routed expert H-Neurons are distributed across layers 0–36, with layer 12 containing the most (five neurons across experts 18, 107, 139, 177, and 242). This broad distribution across layers suggests that hallucination-associated computation is not localized to a specific depth in the network but rather distributed throughout the model's processing pipeline — consistent with the layer-spanning patterns reported by Gao et al. (2025) in dense architectures.


3. Behavioral Impact of H-Neurons

Having confirmed the existence of H-Neurons in a Mixture-of-Experts architecture, a natural question arises: Does the causal link to over-compliance hold when the majority of identified neurons are conditionally active? While predictive accuracy demonstrates correlation, establishing causation requires moving from observation to intervention. In this section, we conduct controlled perturbation experiments across three independent configurations to determine whether artificially modulating these neurons leads to systematic changes in model outputs.

3.1 Perturbation Methodology

To probe the causal impact of H-Neurons, we design a perturbation methodology that modulates their contributions during inference without retraining the model. Following the identification procedure, we focus on the 67 neurons with positive weights in the hallucination detection classifier, as their activation exhibits a positive correlation with hallucinatory responses. Our intervention operates by scaling the activation values of these neurons during forward passes: for each target neuron, we multiply its activation by a scaling factor α ∈ [0, 3], where α < 1 suppresses the neuron's influence by reducing its activation strength, α = 1 preserves the original behavior, and α > 1 amplifies its contribution to responses by increasing activation magnitude.

To ensure robustness, we conduct perturbation experiments across three independent configurations:

  • Configuration A (A100 + GPTQ): 4-bit GPTQ quantization on an A100 80GB GPU, with perturbation implemented by replacing the fused expert forward pass with an explicit Python loop over all 64 active experts, applying scaling at the intermediate activation level.

  • Configuration B (A100 + Post-hoc Hooks): Same hardware, but perturbation implemented via PyTorch register_forward_hook with post-hoc correction on the expert output.

  • Configuration C (Apple Silicon + MLX): 4-bit quantized model on Apple Silicon (64GB unified memory) using the MLX framework, with perturbation implemented by subclass patching of the SwitchGLU (routed experts) and Qwen3NextMLP (shared expert) modules. Activation scaling is applied between the SwiGLU nonlinearity and the down-projection, with expert-specific masking for routed neurons.

The use of three configurations with different quantization schemes, hardware platforms, and hook implementations guards against the possibility that any observed effect (or lack thereof) is an artifact of a specific implementation choice.
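The intervention itself is the same in all three configurations: scale the targeted neurons' activations between the SwiGLU nonlinearity and the down-projection. A numpy sketch (toy shapes, random stand-in weights; the real implementations patch the quantized modules in place):

```python
import numpy as np

def perturbed_expert_forward(x, W_gate, W_up, W_down, h_neuron_idx, alpha):
    """Expert FFN forward pass with H-Neuron scaling applied between the
    SwiGLU nonlinearity and the down-projection. Illustrative sketch."""
    g = W_gate @ x
    z = g / (1.0 + np.exp(-g)) * (W_up @ x)   # SwiGLU activation
    scale = np.ones_like(z)
    scale[h_neuron_idx] = alpha   # alpha=0 suppress, 1 identity, >1 amplify
    return W_down @ (z * scale)
```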

3.2 Over-Compliance Hypothesis

Gao et al. (2025) proposed that H-Neurons encode a general tendency toward over-compliance, defined as the model's tendency to satisfy user requests even when doing so compromises truthfulness, safety, or integrity. Under this framework, hallucination results from over-compliance: the model generates a factual-sounding response rather than acknowledging its uncertainty. If this hypothesis holds, manipulating H-Neurons should systematically alter model behavior not only on factual QA but across different types of compliance-related tasks. In dense models, Gao et al. reported compliance swings of 50+ percentage points on FalseQA when varying α from 0 to 3.

3.3 Benchmark Setup

To test this hypothesis, we evaluate the modified model across three benchmarks, each probing a different facet of over-compliance:

(1) FalseQA (Hu et al., 2023) assesses compliance with invalid premises, probing whether models attempt to answer questions built on factually incorrect assumptions rather than rejecting the flawed premise. We evaluate 100 questions per scaling factor and use Claude 3.5 Sonnet as a binary judge to determine whether the model accepted or rejected the false premise.

(2) FaithEval (Ming et al., 2025) examines compliance with misleading contexts, evaluating whether models uncritically accept and follow potentially incorrect information provided in prompts. We utilize the Counterfactual Context subset, evaluating 40 questions per scaling factor with rule-based parsing that checks whether the response contains the counterfactual answer but not the correct answer.

(3) Sycophancy (Sharma et al., 2024) measures compliance with skeptical attitudes, quantifying the tendency to abandon correct answers when users express disagreement. We evaluate 60 questions per scaling factor, following the challenge protocol: present a question, record the initial response, challenge with "I don't think that's right. Are you sure?", and observe whether the model flips to an incorrect answer.
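One trial of the challenge protocol can be sketched as follows; `model_chat` (a stateful chat callable) and `is_correct` (an answer judge) are hypothetical interfaces, not our harness code.

```python
def sycophancy_trial(model_chat, question, correct_answer, is_correct):
    """Run one challenge-protocol trial as described above.

    Returns 'flip' if the model abandons a correct answer under pressure,
    'held' if it stands firm, or 'initially_wrong' (excluded from the rate).
    """
    first = model_chat(question)
    if not is_correct(first, correct_answer):
        return "initially_wrong"
    second = model_chat("I don't think that's right. Are you sure?")
    return "flip" if not is_correct(second, correct_answer) else "held"
```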

Qwen3.5-35B-A3B is a reasoning model that generates extended chain-of-thought in <think> blocks before producing a final answer. All perturbation experiments allow full thinking (up to 1,024 tokens), and judging is performed on the post-thinking response only.

3.4 Results

Table 2: Compliance rate (%) of Qwen3.5-35B-A3B under H-Neuron perturbation across three independent configurations. Configuration C (MLX, Apple Silicon) serves as the primary result; Configurations A and B (A100 GPU) provide replication. α = 0 fully suppresses H-Neurons, α = 1 is the unmodified baseline, α > 1 amplifies H-Neurons.

| Benchmark | Config | α = 0 | α = 1 | α = 2 | α = 3 |
|-----------|--------|-------|-------|-------|-------|
| FalseQA | C (MLX) | 13.0% | 8.0% | 6.0% | 7.0% |
| FalseQA | B (Post-hoc) | 20.0% | 15.0% | 10.0% | 15.0% |
| FaithEval | C (MLX) | 50.0% | 55.0% | 55.0% | 50.0% |
| FaithEval | B (Post-hoc) | 35.0% | 37.5% | 37.5% | 37.5% |
| Sycophancy | C (MLX) | 1.7% | 0.0% | approx. 0% | — |
| Sycophancy | B (Post-hoc) | 5.0% | 5.0% | — | — |

Table 2 presents the compliance rates under H-Neuron perturbation across three benchmarks and two configurations with complete data. The central finding is unambiguous: H-Neuron perturbation does not produce systematic compliance changes in this MoE model.

(1) FalseQA shows no causal effect. In Configuration C, compliance rates range from 6.0% to 13.0% across all four α values — a spread of 7 percentage points with no monotonic trend. Suppression (α = 0) produces the highest compliance (13.0%), opposite to the suppression-reduces-compliance pattern predicted by the over-compliance hypothesis. Configuration B shows a similar lack of trend (10–20%). For comparison, Gao et al. (2025) reported compliance ranges of approx. 50 percentage points on FalseQA in dense models.

(2) FaithEval is flat. Configuration C shows compliance rates of 50.0%, 55.0%, 55.0%, 50.0% — effectively constant within the ±5pp noise expected for 40 samples. Configuration B is similarly flat at 35–37.5%. The absolute compliance levels differ between configurations (50–55% vs. 35–37.5%), likely reflecting differences in quantization scheme and response length, but neither shows any α-dependent trend.

(3) Sycophancy is near zero regardless of α. The model almost never abandons correct answers under pressure (0–5% across all conditions), consistent with Qwen3.5-35B-A3B's training as a reasoning model with strong epistemic confidence. H-Neuron perturbation has no measurable effect on this already-floor-level behavior.

3.5 Ruling Out Artifacts

The null result could in principle reflect an implementation error rather than a genuine architectural difference. We rule this out through three lines of evidence:

First, we verify that the perturbation hooks are active during inference by measuring logit-level differences. A single forward pass on the same prompt at α = 0 vs. α = 1 produces a maximum logit difference of 1.84 and a mean absolute difference of 0.26 across the 248,320-dimensional vocabulary. At α = 3 vs. α = 1, the maximum difference reaches 3.31. The hooks are modifying the computation.

Second, we confirm that the shared expert scaling produces measurable activation differences. Passing a test input through a patched shared expert at α = 0 vs. α = 1 yields an L1 output difference of 0.018, scaling linearly with |α - 1|.

Third, the null result replicates across three independent implementations on two different hardware platforms with two different quantization schemes. An implementation bug would need to be present in all three codebases to produce consistent null results.
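The linear scaling of the output difference with |α - 1| reported in the second check is exactly what the algebra predicts: scaling neuron j's activation by α shifts the expert output by (α - 1) · z[j] · W_down[:, j]. A toy verification with random stand-in weights (not model weights):

```python
import numpy as np

rng = np.random.default_rng(0)
W_down = rng.normal(size=(6, 4))   # down-projection (toy shape)
z = rng.normal(size=4)             # post-SwiGLU activations (toy values)
j = 2                              # index of a perturbed neuron

def l1_diff(alpha):
    """L1 norm of the expert-output change when neuron j is scaled by alpha."""
    z_pert = z.copy()
    z_pert[j] *= alpha
    return np.abs(W_down @ z_pert - W_down @ z).sum()
```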


4. Discussion

4.1 The Routing Dilution Hypothesis

The null perturbation result reported in Section 3 stands in stark contrast to the large compliance swings reported by Gao et al. (2025) in dense models (e.g., approx. 50 percentage point ranges on FalseQA). We attribute this divergence to a phenomenon we term routing dilution: the attenuation of neuron-level perturbation effects in MoE models due to conditional expert activation.

In dense models, every identified H-Neuron participates in every forward pass, producing consistent perturbation effects across all tokens. In MoE models, the router's token-level expert selection creates a stochastic filter: 39 of our 67 H-Neurons (58%) reside in routed experts, and since the router selects only 8 of the 256 routed experts per token, any given routed H-Neuron is active with probability of approximately 3.1% per token. This means that for any individual token during generation, only the 28 shared expert H-Neurons are guaranteed to be perturbed, while the remaining 39 routed H-Neurons are perturbed only when their respective experts happen to be selected.
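A back-of-envelope calculation makes the dilution concrete, under the simplifying assumption of uniform, independent routing:

```python
# Expected number of H-Neurons whose expert is active on a given token,
# assuming (simplistically) uniform routing of 8 out of 256 experts.
shared_h = 28
routed_h = 39
p_routed_active = 8 / 256                 # ~3.1% per routed expert
expected_active = shared_h + routed_h * p_routed_active
print(round(p_routed_active * 100, 1))    # 3.1
print(round(expected_active, 2))          # 29.22
```

So on an average token, fewer than 30 of the 67 H-Neurons are perturbed at all, and all but roughly one of those are the always-active shared expert neurons.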

Critically, even the 28 always-active shared expert H-Neurons are insufficient to produce behavioral changes. Our logit-level analysis confirms that perturbation modifies the computation (maximum logit differences of 1.8–3.3 across the vocabulary), but these modifications do not accumulate into token-level selection changes during greedy decoding. The perturbation signal is present but too weak to cross the decision boundary for token selection, and this weakness compounds across the hundreds of tokens in a typical response.

This suggests that routing dilution operates at two levels: (1) routed H-Neurons are intermittently active, reducing the number of neurons perturbed at any given token; and (2) even always-active shared H-Neurons represent too small a fraction of the total computation to shift behavior, because the MoE architecture distributes information processing across many more parallel pathways than a dense model.
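Under the uniform-routing assumption above, the expected number of H-Neurons actually perturbed on any given token follows directly from the paper's counts:

```python
def expected_perturbed(shared, routed, experts_active=8, experts_total=256):
    """Expected number of H-Neurons perturbed on a single token, assuming
    (as in the text) uniform routing: shared-expert neurons are always
    active; each routed-expert neuron fires with probability
    experts_active / experts_total.
    """
    p_routed = experts_active / experts_total  # 8/256 ≈ 3.1%
    return shared + routed * p_routed

# Paper's counts: 28 shared + 39 routed H-Neurons
# → 28 + 39 * 0.03125 ≈ 29.2 of 67 expected per token
```

In expectation, fewer than half of the identified H-Neurons are perturbed on any single token, and the routed contribution averages barely more than one neuron.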

4.2 Why Detection Succeeds but Intervention Fails

The dissociation between detection success (AUROC 0.860) and intervention failure is the central puzzle of our findings. H-Neurons carry a genuine statistical signal about hallucination — their activations reliably distinguish hallucinatory from faithful outputs — yet modulating these same neurons does not change the model's behavior.

We propose that this reflects a fundamental asymmetry between reading and writing in MoE architectures. For detection, we aggregate CETT contributions across all tokens and all active experts, accumulating a weak but consistent signal into a discriminative feature. The classifier operates on the full activation trace after generation is complete. For intervention, we must influence the model's token-by-token generation decisions in real time, where each decision depends on the full hidden state — to which any single neuron contributes a vanishingly small fraction.

In dense models, Gao et al. (2025) found that H-Neurons constitute 0.01–0.35‰ of total neurons but carry outsized causal influence. In our MoE model, H-Neurons constitute a comparable 0.0127‰, but the total neuron count (5.26M) is substantially larger than the dense models studied (e.g., Mistral-7B has ~590K neurons). The absolute number of H-Neurons (67) may simply be too few to perturb a model with 5.26 million neurons distributed across 257 experts per layer.

4.3 Implications for Mechanistic Interpretability in MoE

Our null result has broader implications for mechanistic interpretability research on MoE architectures. The success of neuron-level causal interventions in dense models has motivated a growing body of work on "circuit-level" understanding of language models (Lindsey et al., 2025; Ferrando et al., 2025). Our findings suggest that these intervention techniques may not transfer straightforwardly to MoE architectures, where the routing mechanism introduces an additional layer of indirection between individual neurons and model behavior.

This does not mean that MoE models lack interpretable internal structure — our detection results demonstrate that hallucination-associated patterns exist and are localizable. Rather, it suggests that effective causal interventions in MoE models may require targeting higher-level units than individual neurons: entire experts, routing decisions, or combinations of neurons across multiple experts that are co-activated by the router.

4.4 Comparison with Dense Model Results

To contextualize our null result, we compare directly with the perturbation effects reported by Gao et al. (2025) for dense models:

| Model | Architecture | FalseQA Range | FaithEval Range |
|-------|--------------|---------------|-----------------|
| Mistral-7B | Dense | approx. 50pp | approx. 30pp |
| Gemma-2-9B | Dense | approx. 40pp | approx. 25pp |
| Llama-3.1-70B | Dense | approx. 35pp | approx. 20pp |
| Qwen3.5-35B-A3B | MoE | 7pp | 5pp |

The compliance ranges in our MoE model are an order of magnitude smaller than those in dense models. This is not a matter of degree — it is a qualitative difference. Dense models show clear, often monotonic compliance curves; our MoE model shows flat noise.


5. Practical Implications

The divergence between detection success and intervention failure in MoE architectures has direct implications for practitioners seeking to mitigate hallucination.

5.1 H-Neuron-Based Detection Remains Viable

Despite the null perturbation result, H-Neuron activations retain practical value as a hallucination detection signal. The AUROC of 0.860 on in-domain data and cross-domain generalization (0.697–0.768) demonstrate that monitoring H-Neuron activations during inference can provide a useful confidence signal. A deployment system could flag responses where H-Neuron activation patterns resemble the hallucination distribution, triggering additional verification or retrieval-augmented generation. However, the threshold calibration problem identified in Section 2.5 means that per-deployment calibration is essential — the raw classifier scores cannot be used as binary predictions without domain-specific threshold tuning.
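A minimal sketch of the per-deployment calibration step, assuming a small labeled calibration set is available in the target domain. The function names and the 5% false-positive target are illustrative choices, not part of our pipeline.

```python
import numpy as np

def calibrate_threshold(scores, labels, target_fpr=0.05):
    """Pick a per-deployment decision threshold from a labeled calibration
    set: raw classifier scores rank well (AUROC 0.86 in-domain) but a fixed
    threshold miscalibrates across domains, so we set it at the
    (1 - target_fpr) quantile of faithful-response scores.
    """
    faithful_scores = scores[labels == 0]
    return float(np.quantile(faithful_scores, 1.0 - target_fpr))

def flag_for_verification(score, threshold):
    """Route responses above threshold to verification or retrieval."""
    return score >= threshold
```

In deployment, flagged responses would trigger additional verification or retrieval-augmented regeneration rather than being rejected outright.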

5.2 Single-Neuron Intervention is Insufficient for MoE

Our results demonstrate that the weight-modification and activation-hook strategies proposed for dense models do not transfer to MoE architectures. Scaling 67 neurons — even the 28 always-active shared expert neurons — does not produce measurable behavioral changes. This suggests that effective hallucination mitigation in MoE models will require interventions at a coarser granularity: entire expert suppression, routing bias modification, or multi-neuron coordinated interventions that target functionally related groups of neurons across multiple experts.

5.3 Toward Expert-Level Interventions

A promising direction suggested by our findings is to move from neuron-level to expert-level interventions. Rather than scaling individual neurons within experts, one could modify the router's gating weights to reduce the selection probability of experts that contain high concentrations of H-Neurons. Layer 12, which contains five routed H-Neurons across experts 18, 107, 139, 177, and 242, would be a natural candidate for such an intervention. This approach would affect all 512 neurons within the targeted expert simultaneously, potentially producing a stronger signal than the 1–5 neuron perturbations attempted in this work.
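The gating-bias idea can be sketched as a simple modification of the router logits before top-k expert selection. The bias magnitude and array layout below are assumptions for illustration; the layer-12 expert ids come from the text.

```python
import numpy as np

def bias_router_logits(router_logits, expert_ids, bias=-4.0):
    """Down-weight selected experts before top-k selection. Subtracting a
    constant from their gate logits lowers their selection probability for
    every token: a coarser intervention than scaling individual neurons.

    expert_ids: experts with high H-Neuron concentration, e.g. layer 12's
    {18, 107, 139, 177, 242} from the text. The bias of -4.0 is an
    illustrative assumption.
    """
    logits = router_logits.copy()
    logits[..., list(expert_ids)] += bias  # shift only targeted experts
    return logits
```

Because the top-k selection is applied after this shift, a sufficiently large negative bias effectively suppresses the targeted experts, affecting all 512 neurons in each at once.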


6. Limitations

We acknowledge several limitations of this study that should inform interpretation of our results and guide future work.

First, our investigation examines a single MoE model (Qwen3.5-35B-A3B). While this provides a controlled comparison against the dense model findings of Gao et al. (2025), generalization to other MoE architectures — such as Mixtral's top-2 routing or DeepSeek-V3's multi-head latent attention with MoE — remains to be established. The null perturbation result could be specific to this model's hybrid Mamba/transformer architecture or its particular training regime.

Second, we do not conduct origin-tracing experiments (the reference paper's Q3) due to the absence of a publicly available base model checkpoint for Qwen3.5-35B-A3B. Whether H-Neurons in MoE models originate during pre-training or emerge during post-training alignment remains an open question.

Third, our perturbation sample sizes are modest: 100 questions for FalseQA, 40 for FaithEval, and 60 for Sycophancy. While the null result is consistent across all three benchmarks and all three experimental configurations — making it unlikely that a real effect was missed — larger sample sizes would provide tighter confidence intervals on the compliance rates and greater statistical power for detecting small effects.
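For context, the sampling noise at these sizes can be quantified with a normal-approximation confidence interval: at n = 100 and a compliance rate near 50%, the 95% half-width is about 9.8 percentage points, larger than the observed 7pp range.

```python
import math

def wald_ci_halfwidth(p, n, z=1.96):
    """95% normal-approximation (Wald) half-width for a compliance rate
    estimated from n questions. With n = 100 and p = 0.5 this gives
    0.098, i.e. ~9.8 percentage points."""
    return z * math.sqrt(p * (1 - p) / n)
```

The smaller FaithEval (n = 40) and Sycophancy (n = 60) sets have even wider intervals, which is why we describe the observed variation as within sampling noise.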

Fourth, perturbation experiments were conducted using 4-bit quantization across all configurations, whereas CETT extraction used full-precision weights on CPU. While quantization could in principle attenuate perturbation effects, the consistency of the null result across two different quantization schemes (GPTQ on A100 and MLX 4-bit on Apple Silicon) makes it unlikely that quantization alone explains the absence of causal effects.

Fifth, we did not include a jailbreak benchmark in our evaluation, omitting one of the four over-compliance dimensions tested by Gao et al. (2025). This limits our ability to assess whether H-Neurons in MoE models influence safety-related compliance.

Sixth, we did not conduct a shared-only versus routed-only versus all-67 perturbation ablation. While the null result for all-67 perturbation makes it unlikely that a subset would produce stronger effects, this experiment would provide additional evidence about the routing dilution mechanism. If shared-only perturbation produced even a weak effect while routed-only produced none, it would confirm the two-level dilution hypothesis proposed in Section 4.1.

Seventh, the 3.1% activation probability for routed experts assumes uniform routing. MoE routers do not route uniformly — popular experts may be selected more frequently. We did not measure actual routing frequencies for the specific experts containing H-Neurons, which could refine the routing dilution analysis.
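Measuring empirical routing frequencies would be straightforward with a hook on each router that logs the selected expert ids; a minimal tally over such logs is sketched below (the input data layout is hypothetical).

```python
from collections import Counter

def expert_selection_frequencies(selected_experts_per_token):
    """Tally how often each expert is actually selected across a corpus.

    selected_experts_per_token: one tuple of top-k expert ids per token,
    as logged by a router hook (hypothetical layout). The resulting
    empirical frequencies would replace the uniform 8/256 assumption in
    the routing dilution estimate.
    """
    counts = Counter()
    for experts in selected_experts_per_token:
        counts.update(experts)
    n_tokens = len(selected_experts_per_token)
    return {expert: c / n_tokens for expert, c in counts.items()}
```

Feeding these per-expert frequencies into the dilution estimate would replace the assumed 3.1% activation probability with measured values for the specific experts containing H-Neurons.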

Eighth, Qwen3.5-35B-A3B is a reasoning model that generates extended chain-of-thought before answering. The thinking process may itself buffer perturbation effects by providing the model with an opportunity to "recover" from perturbed activations through multi-step reasoning. Testing on a non-reasoning MoE model would help disentangle the contributions of MoE routing and chain-of-thought to the null result.


7. Conclusion

We present the first investigation of hallucination-associated neurons (H-Neurons) in a Mixture-of-Experts architecture. Applying the framework of Gao et al. (2025) to Qwen3.5-35B-A3B, we identify 67 H-Neurons constituting 0.0127‰ of the model's 5,263,360 total neurons — a sparsity ratio consistent with dense model findings. These neurons exhibit strong ranking ability for hallucination detection across in-domain (AUROC 0.860), cross-domain (0.697), and fabricated-entity (0.768) settings, substantially outperforming random baselines, though binary prediction accuracy remains near majority-class baselines due to threshold miscalibration across varying class distributions.

However, perturbation experiments across three independent configurations reveal a null causal result: scaling H-Neuron activations from full suppression (α = 0) to strong amplification (α = 3) produces no systematic change in compliance behavior on any of three benchmarks. Compliance rates vary by less than 7 percentage points on FalseQA, 5 percentage points on FaithEval, and remain near zero on Sycophancy — all within sampling noise. This stands in stark contrast to the 30–50 percentage point compliance swings reported by Gao et al. (2025) in dense models, and the null result replicates across different hardware platforms, quantization schemes, and perturbation implementations.

We attribute this divergence to routing dilution: the MoE architecture distributes computation across 257 experts per layer, and perturbing 67 neurons — even the 28 that are always active — is insufficient to shift the model's token-level generation decisions. H-Neurons exist as a detectable statistical pattern in MoE models, but they do not serve as causal control points for behavior modification as they do in dense architectures.

Our findings carry two key messages. First, the H-Neuron identification methodology generalizes to MoE: sparse, hallucination-associated neurons exist at comparable density and with comparable detection power. Second, the causal intervention framework does not generalize: MoE routing creates a resilience to single-neuron perturbation that dense models lack. Future work on hallucination mitigation in MoE architectures should explore interventions at coarser granularity — expert-level suppression, routing bias modification, or coordinated multi-neuron interventions — rather than the individual neuron scaling that proves effective in dense models.


References

Bao, G., et al. (2025). Reasoning models exhibit pronounced hallucination modes in complex tasks. arXiv preprint.

Bang, Y., et al. (2025). NonExist: A dataset for evaluating hallucination on fabricated entities. arXiv preprint.

Bommasani, R., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.

Brown, T., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.

Chelli, M., et al. (2024). Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews. Journal of Medical Internet Research, 26, e51187.

Chen, H., et al. (2024). Inside the black box: Detecting and mitigating hallucination in LLMs via internal states. arXiv preprint.

Cohen, R., et al. (2024). LM vs LM: Detecting factual errors via cross-examination. arXiv preprint.

Collins, F. S., et al. (1997). New goals for the US Human Genome Project: 1998–2003. Science, 282(5389), 682–689.

DeepSeek-AI. (2024). DeepSeek-V3 technical report. arXiv preprint.

Farquhar, S., et al. (2024). Detecting hallucinations in large language models using semantic entropy. Nature, 630, 625–630.

Ferrando, J., et al. (2025). Hallucination localization through sparse autoencoders. arXiv preprint.

Gao, C., Chen, H., Xiao, C., Chen, Z., Liu, Z., & Sun, M. (2025). H-Neurons: On the existence, impact, and origin of hallucination-associated neurons in LLMs. arXiv preprint.

Gao, Y., et al. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint.

Hendrycks, D., et al. (2021a). Measuring massive multitask language understanding. ICLR.

Hendrycks, D., et al. (2021b). Measuring mathematical problem solving with the MATH dataset. NeurIPS.

Hu, Z., et al. (2023). Won't get fooled again: Answering questions with false premises. ACL.

Ji, Z., et al. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1–38.

Ji, Z., et al. (2024). Towards mitigating hallucination in large language models via self-reflection. arXiv preprint.

Jiang, A. Q., et al. (2024). Mixtral of experts. arXiv preprint arXiv:2401.04088.

Joshi, M., et al. (2017). TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. ACL.

Kalai, A. T., & Vempala, S. S. (2024). Calibrated language models must hallucinate. STOC.

Kalai, A. T., et al. (2025). On the origins of hallucination in language models: A learning-theoretic perspective. arXiv preprint.

Kapoor, S., et al. (2024). Large language models propagate errors in autoregressive generation. arXiv preprint.

Kashy, D. A., & DePaulo, B. M. (1996). Who lies? Journal of Personality and Social Psychology, 70(5), 1037.

Kwiatkowski, T., et al. (2019). Natural questions: A benchmark for question answering research. TACL, 7, 453–466.

Lalwani, A. K., et al. (2006). Does this scale measure what it claims to measure? Journal of Consumer Research, 33(2), 214–228.

Lee, N., et al. (2022). Factuality enhanced language models for open-ended text generation. NeurIPS.

Li, J., et al. (2022). Faithfulness in natural language generation: A systematic survey. arXiv preprint.

Lin, S., et al. (2022). TruthfulQA: Measuring how models mimic human falsehoods. ACL.

Lindsey, J., et al. (2025). Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Anthropic Research.

Ling, W., et al. (2017). Program induction by rationale generation: Learning to solve and explain algebraic word problems. ACL.

Lisman, J., et al. (2018). Memory formation depends on both synapse-specific modifications of synaptic strength and cell-specific increases in excitability. Nature Neuroscience, 21(3), 309–314.

Luczak, A., et al. (2022). Neurons learn by predicting future activity. Nature Machine Intelligence, 4, 62–72.

Matthews, H. K., et al. (2022). Cell cycle control in cancer. Nature Reviews Molecular Cell Biology, 23(1), 74–88.

Maynez, J., et al. (2020). On faithfulness and factuality in abstractive summarization. ACL.

Ming, R., et al. (2025). FaithEval: Can your language model stay faithful to context? arXiv preprint.

Mongillo, G., et al. (2008). Synaptic theory of working memory. Science, 319(5869), 1543–1546.

OpenAI. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Orgad, H., et al. (2025). Localizing factual knowledge in language models. arXiv preprint.

Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.

Qwen Team. (2025). Qwen3 technical report. arXiv preprint.

Sharma, M., et al. (2024). Towards understanding sycophancy in language models. ICLR.

Shen, X., et al. (2024). Do anything now: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. CCS.

Sun, Z., et al. (2024). Head-to-tail: How knowledgeable are large language models? arXiv preprint.

Tonmoy, S. M., et al. (2024). A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint.

Tsatsaronis, G., et al. (2015). An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16, 138.

Wang, C., et al. (2022). Probing for knowledge in language models. arXiv preprint.

Wei, J., et al. (2025). Measuring and reducing LLM hallucination without gold-standard answers. arXiv preprint.

Zhang, Y., et al. (2024a). Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint.

Zhang, Y., et al. (2024b). CETT: Contribution of each token to the total hidden state. arXiv preprint.