
Hunting for Hallucination Neurons in a Mixture-of-Experts Model

A walkthrough of replicating the H-Neuron framework on Qwen3.5-35B-A3B — what worked, what didn't, and what we learned about the limits of neuron-level interpretability in sparse architectures.

By Spirited Mind
Tags: mechanistic-interpretability, hallucination, mixture-of-experts, h-neurons, qwen


This is the story of a replication study that didn't go the way we expected. We took a compelling result from dense transformer research — that a tiny handful of neurons can predict and causally control hallucination — and asked whether it holds in a Mixture-of-Experts model. The answer turned out to be half yes and half no, and the "no" part is more interesting than the "yes."

We're going to walk through this the way it actually happened: the original paper that inspired the work, the first attempt that overclaimed its results, and the corrected version that reported a genuine null finding. If you're a researcher or a student thinking about replication studies, this is a case study in how confirmation bias creeps in and how you catch it.


The Original Paper: H-Neurons in Dense Models

In early 2025, Gao et al. from Tsinghua University published a paper called "H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs." The core claim was striking: in dense transformer models like Mistral-7B, Gemma-3-27B, and Llama-3.3-70B, fewer than 0.1% of all neurons — sometimes as few as 0.01 per mille — can reliably predict whether the model is about to hallucinate.

They called these neurons H-Neurons, and their methodology had three parts:

Identification. Sample multiple responses per question from TriviaQA. Keep only questions where the model is perfectly consistent — always right or always wrong across 10 samples. Extract the answer tokens, compute a metric called CETT (Contribution of Each Token to the Total hidden state) that measures how much each neuron actually contributes to the output, and train an L1-regularized logistic regression classifier on the CETT features. The sparsity penalty drives most neuron weights to zero. The survivors with positive weights are your H-Neurons.

Causal Impact. Scale the activations of these neurons during inference. At alpha=0 you suppress them entirely. At alpha=1 you leave them alone. At alpha=2 or 3 you amplify them. Gao et al. found that across six dense models, amplifying H-Neurons systematically increased "over-compliance" — the model's tendency to accept false premises (FalseQA), follow misleading context (FaithEval), cave to skeptical pushback (Sycophancy), and comply with harmful instructions (Jailbreak). The compliance swings were dramatic: 50+ percentage points on FalseQA for some models.

Origin Tracing. They showed H-Neurons exist in base models before instruction tuning, suggesting they emerge during pretraining rather than alignment.

The results across six models were consistent. H-Neuron classifiers beat random baselines by 10-20 percentage points on accuracy across in-domain, cross-domain, and fabricated-entity benchmarks. The perturbation curves were monotonic and large. It was a clean story.

The natural question: does any of this hold in a Mixture-of-Experts model, where most neurons only fire 3% of the time?


Why MoE Changes Everything

Before we get into the experiment, you need to understand why Mixture-of-Experts is not just "a bigger dense model." The difference is architectural and it's fundamental to everything that follows.

In a dense transformer like Mistral-7B, every neuron in every feedforward layer fires on every token. If you identify 200 H-Neurons and scale their activations, all 200 are modified on every single token the model generates. The perturbation is total and consistent.

In Qwen3.5-35B-A3B — the MoE model we chose — each of the 40 layers has 256 routed experts and 1 shared expert. Each expert has 512 neurons. For any given token, a router selects 8 of the 256 experts. That means:

  • The shared expert's 512 neurons fire on every token (always active)
  • Each routed expert's 512 neurons fire only when selected — roughly 8/256 = 3.1% of the time
  • Total neuron count: 40 layers x (256 x 512 + 1 x 512) = 5,263,360 neurons
  • Active neurons per token: 40 x (8 x 512 + 1 x 512) = 184,320 — about 3.5% of total
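The arithmetic behind those bullet points is worth sanity-checking in a couple of lines:

```python
# Qwen3.5-35B-A3B bookkeeping, using the counts listed above
LAYERS, ROUTED, SHARED, TOPK, NEURONS = 40, 256, 1, 8, 512

total = LAYERS * (ROUTED + SHARED) * NEURONS    # every neuron in the model
active = LAYERS * (TOPK + SHARED) * NEURONS     # neurons firing on one token

assert total == 5_263_360
assert active == 184_320
print(f"{active / total:.1%} of neurons active per token")     # ~3.5%
print(f"{TOPK / ROUTED:.1%} firing rate for a routed neuron")  # ~3.1%
```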

There's another wrinkle. Qwen3.5-35B-A3B is a hybrid architecture: 30 of its 40 layers use linear attention with Mamba-style state-space components (learned decay, causal convolutions, discretization biases), while every 4th layer uses standard multi-head self-attention. The MoE feedforward structure is identical across all 40 layers, but the hidden states entering those feedforward blocks come from fundamentally different attention mechanisms depending on the layer. This matters for interpretability — and it was initially missed in the first version of the paper.

So if you find H-Neurons in this model, some will be in shared experts (always active, like dense model neurons) and some will be in routed experts (conditionally active, a new category with no analogue in dense models). The question is whether either type — or both together — can predict and control hallucination the way they do in dense architectures.


Phase 1: Building the Dataset

We followed Gao et al.'s protocol closely. The pipeline starts with TriviaQA — a large-scale QA dataset with short factual answers (names, dates, places) that makes it easy to check correctness.

Sampling. We ran Qwen3.5-35B-A3B through Ollama, generating 10 responses per question with probabilistic decoding (temperature=1.0, top_k=50, top_p=0.9). The goal is to find questions where the model is consistently right or consistently wrong — not questions where it sometimes gets lucky. This consistency filtering is critical: it ensures the activation patterns we measure reflect the model's stable knowledge state, not generation noise.

Filtering. After processing thousands of TriviaQA questions, we retained 1,000 consistently correct and 864 consistently incorrect examples — 1,864 total. The slight imbalance (vs. Gao et al.'s clean 1,000/1,000 split) reflects Qwen3.5's actual knowledge distribution. We kept it rather than artificially balancing, which is the right call — forcing balance would mean discarding valid data.
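The consistency filter itself is simple enough to sketch. This is our own minimal rendering of the logic, not code from either paper's repository; it assumes each question arrives with its 10 graded samples:

```python
def consistency_label(sample_grades):
    """Return 'correct' / 'incorrect' for perfectly consistent questions, else None.

    sample_grades: list of booleans, one per sampled response (True = correct).
    Questions with any disagreement across samples are dropped, so the
    activation patterns reflect stable knowledge, not sampling luck.
    """
    if all(sample_grades):
        return "correct"
    if not any(sample_grades):
        return "incorrect"
    return None  # mixed -> discard

# Hypothetical grade vectors for three questions, 10 samples each:
assert consistency_label([True] * 10) == "correct"
assert consistency_label([False] * 10) == "incorrect"
assert consistency_label([True] * 7 + [False] * 3) is None
```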

Answer Span Extraction. For each response, we need to identify exactly which tokens contain the factual claim. Gao et al. used GPT-4o for this. We used the same Qwen3.5 model via Ollama with greedy decoding — a pragmatic substitution since TriviaQA answers are typically short unambiguous entities.

To make sure this substitution didn't compromise data quality, we ran a three-part validation audit:

  1. Text matching: 96.9% of extracted spans (1,807/1,864) appear verbatim in the response. 98.3% of correct-answer spans match a TriviaQA answer alias.
  2. Claude semantic judgment: On 100 random samples, 97% were valid minimal entities and 100% captured the core factual claim.
  3. Heuristic comparison: 94% agreement between the Ollama extractor and a rule-based baseline. All 6 disagreements favored the Ollama extraction on manual inspection.
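The text-matching half of that audit is mechanical enough to sketch. The function name is ours, and the alias check is simplified to case-insensitive equality against a list that stands in for TriviaQA's answer aliases:

```python
def span_checks(response, span, aliases):
    """Two cheap validity checks for an extracted answer span (illustrative)."""
    verbatim = span.lower() in response.lower()                   # span appears verbatim
    alias_hit = any(span.lower() == a.lower() for a in aliases)   # matches a known alias
    return verbatim, alias_hit

v, a = span_checks(
    "The capital of Australia is Canberra.", "Canberra", ["Canberra", "Canberra, ACT"]
)
assert v and a
```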

This validation step wasn't in the first version of the paper. It was added after review feedback pointed out that substituting GPT-4o with a local model needed justification. The 97% quality rate is solid — comparable to what you'd expect from GPT-4o on this task.


Phase 2: Finding the Neurons

With 1,864 labeled samples and their answer spans in hand, we need to measure every neuron's contribution to the model's output and find the ones that distinguish hallucination from faithful responses.

CETT: Measuring What Neurons Actually Do

Raw activation magnitude is misleading. A neuron can fire strongly but contribute almost nothing to the output if its downstream projection weights are small. CETT (Contribution of Each Token to the Total hidden state) captures the actual information flow:

CETT(j, t) = ||h_j_t||_2 / ||h_t||_2

Where h_j_t is the partial hidden vector attributable to neuron j at token position t (after the down-projection), and h_t is the full hidden state. This gives you the fraction of the output that neuron j is responsible for.

For MoE, we had to adapt this. The key challenge is the fused gate_up_proj tensor — Qwen3.5's experts concatenate the gate and up projections into a single weight matrix of shape (2 x 512) x hidden_dim. We split this to recover separate gate and up projections, compute z = act(gate(x)) * up(x) for each expert, and weight the contribution by the router's gating weight for that expert on that token.

For routed experts, CETT is zero when the expert isn't selected. For the shared expert, it's computed on every token — same as a dense model.
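A minimal numpy sketch of the per-expert CETT computation described above, assuming a SiLU activation and the fused gate/up layout; this is our reconstruction, not the repository's code:

```python
import numpy as np

def expert_cett(x, h_t, gate_up, down, router_weight):
    """Per-neuron CETT within one expert on one token (a sketch).

    x:       (d_model,) input to the expert
    h_t:     (d_model,) full hidden state, the CETT denominator
    gate_up: (2*d_ff, d_model) fused gate/up projection, as in Qwen3.5 checkpoints
    down:    (d_model, d_ff) down projection
    router_weight: scalar gating weight for this expert on this token
    """
    d_ff = gate_up.shape[0] // 2
    gate, up = gate_up[:d_ff], gate_up[d_ff:]  # un-fuse gate_up_proj
    u = gate @ x
    z = (u / (1 + np.exp(-u))) * (up @ x)      # SiLU(gate(x)) * up(x)
    # Neuron j's rank-1 contribution after the down-projection is
    # router_weight * z_j * down[:, j]; its L2 norm factorises as below.
    contrib = np.abs(router_weight * z) * np.linalg.norm(down, axis=0)
    return contrib / np.linalg.norm(h_t)

rng = np.random.default_rng(0)
d_model, d_ff = 16, 8
x, h_t = rng.normal(size=d_model), rng.normal(size=d_model)
W, D = rng.normal(size=(2 * d_ff, d_model)), rng.normal(size=(d_model, d_ff))
cett = expert_cett(x, h_t, W, D, router_weight=0.25)
```

For a routed expert on a token where it isn't selected, `router_weight` is effectively zero and so is every CETT value, which is exactly the intermittency that matters later.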

We aggregate across tokens into two features per neuron per sample: mean CETT over answer tokens and mean CETT over non-answer tokens. This gives us a feature matrix of 1,864 samples by 5,263,360 neurons by 2 features. That's over 19 billion values. The CETT extraction ran on CPU at full precision to avoid quantization artifacts — a deliberate choice that cost compute time but preserved numerical fidelity.

Sparse Classification: L1 Logistic Regression

We train an L1-regularized logistic regression on the CETT features. The L1 penalty is the whole point — it forces most weights to exactly zero, automatically selecting the smallest set of neurons that can predict hallucination.
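The selection mechanism is easy to see on synthetic data. This sketch uses scikit-learn (we don't claim the repository used this exact API) on a tiny stand-in for the real 1,864-sample, ~10.5M-feature problem, where only the first 5 of 200 features carry signal:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, k = 400, 200, 5
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:k] = 3.0                  # only the first 5 features are informative
y = (X @ w_true + rng.normal(size=n) > 0).astype(int)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
nonzero = np.flatnonzero(clf.coef_[0])
print(f"{len(nonzero)} of {d} features survive the L1 penalty")
```

The surviving positive-weight features are the analogue of H-Neurons; shrinking C drives the count toward zero, exactly as in the grid search below.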

Grid search over the regularization parameter C (= 1/lambda):

| C | Accuracy | AUROC | Nonzero Weights | Positive Weights |
|---|----------|-------|-----------------|------------------|
| 0.001 | 76.8% | 0.500 | 0 | 0 |
| 0.005 | 76.8% | 0.500 | 0 | 0 |
| 0.01 | 76.8% | 0.500 | 0 | 0 |
| 0.05 | 72.7% | 0.836 | 6 | 0 |
| 0.1 | 74.3% | 0.881 | 9 | 1 |
| 0.5 | 85.8% | 0.969 | 42 | 16 |
| 1.0 | 87.9% | 0.973 | 110 | 57 |

Look at this table carefully. At C=0.001 through 0.01, the regularization is so strong that every weight is driven to zero — the classifier predicts the majority class for everything (76.8% = the proportion of correct samples). At C=0.05, six neurons break through and AUROC jumps to 0.836. By C=1.0, we have 110 nonzero weights and AUROC of 0.973.

We selected C=1.0 on the held-out split, then retrained on all 1,864 samples. The final classifier has 127 nonzero weights: 67 positive (H-Neurons) and 60 negative. The count difference between the grid search (110/57) and the final model (127/67) is expected — more training data means the regularization can afford to keep more features.

What We Found: 67 H-Neurons

67 neurons out of 5,263,360. That's 0.0127 per mille — squarely within the 0.01-0.35 per mille range Gao et al. reported for dense models. The sparsity finding replicates.

But the composition is new. Of the 67:

  • 28 are in shared experts (always active, like dense model neurons)
  • 39 are in routed experts (conditionally active, ~3.1% of the time per token)

This 42%/58% shared/routed split has no analogue in dense models. It means that for any given token during generation, only the 28 shared H-Neurons are guaranteed to be participating. The other 39 are only "on" when their specific expert happens to be selected by the router.

The highest-weighted H-Neuron sits in layer 25's shared expert (weight = 11.49). The second-highest is in layer 33, routed expert 25 (weight = 10.36). Both types carry strong signal. The neurons span nearly all layers (0-39), with notable concentrations at layers 4, 12, 29, and 39.

One finding that didn't make it into the paper but sits in the output data: routed H-Neurons are significantly concentrated in linear attention layers (35/39 = 89.7%, binomial p=0.040). Whether this reflects something about how Mamba-style attention interacts with the MoE routing, or is a statistical artifact of the small sample, remains an open question.


Phase 3: Do These Neurons Generalize?

Finding neurons that predict hallucination on the training set is necessary but not sufficient. The real test is whether they generalize — to new questions in the same domain, to entirely different domains, and to questions about things that don't exist.

We evaluated the H-Neuron classifier on four benchmarks, generating 200-500 fresh responses per dataset, extracting answer spans, computing CETT features, and running them through the trained classifier. We also trained a matched random baseline — a classifier using 67 randomly selected neurons — to confirm that the signal is specific to H-Neurons rather than an artifact of the method.

| Dataset | Hall Rate | Majority Acc | H-Neuron Acc | Random Acc | H-Neuron AUROC | Random AUROC |
|---------|-----------|--------------|--------------|------------|----------------|--------------|
| TriviaQA (n=500) | 36.6% | 63.4% | 65.4% | 36.4% | 0.860 | 0.462 |
| NQ-Open (n=500) | 57.8% | 57.8% | 61.2% | 58.6% | 0.673 | 0.507 |
| BioASQ (n=500) | 77.0% | 77.0% | 77.2% | 51.0% | 0.697 | 0.504 |
| NonExist (n=200) | 21.0% | 79.0% | 21.0% | 24.0% | 0.768 | 0.505 |

There are two stories in this table, and you need to read both.

The AUROC story is good. H-Neurons achieve 0.860 AUROC on in-domain TriviaQA, meaning they rank hallucinated samples higher than faithful ones 86% of the time. This generalizes to biomedical questions (0.697), fabricated entities (0.768), and a second in-domain set (0.673). The random baseline is at chance (~0.50) everywhere. The signal is real and it transfers across domains.

The accuracy story is sobering. On TriviaQA, the classifier beats majority-class guessing by 2 percentage points. On BioASQ, by 0.2 points. On NonExist, it gets 21% accuracy — it predicts "hallucination" for almost everything, because the decision threshold learned during training (where ~46% of samples were hallucinations) is wildly miscalibrated for a dataset with only 21% hallucinations.

This dissociation matters. AUROC measures ranking — "do hallucinated samples get higher scores than faithful ones?" Accuracy measures binary prediction — "can you actually flag the hallucinations?" The H-Neurons are good at the first task and nearly useless at the second without per-deployment threshold recalibration.
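The dissociation is easy to reproduce on synthetic scores. The sketch below (our construction, with made-up score distributions) builds a detector with good ranking, fixes its threshold at a training-like prevalence of ~46%, then evaluates at 21% prevalence as on NonExist:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(1)

def scores(y):
    # A detector with decent ranking: positives score ~1.5 higher on average
    return y * 1.5 + rng.normal(size=y.size)

# Training-like prevalence (~46% hallucinations) fixes the threshold...
y_train = (rng.random(5000) < 0.46).astype(int)
thr = np.median(scores(y_train))   # crude stand-in for the learned boundary

# ...but deployment prevalence is 21%
y_test = (rng.random(5000) < 0.21).astype(int)
s_test = scores(y_test)

auroc = roc_auc_score(y_test, s_test)             # ranking: unaffected by the shift
acc_fixed = accuracy_score(y_test, s_test > thr)  # fixed threshold: degraded
acc_recal = accuracy_score(y_test, s_test > np.quantile(s_test, 1 - 0.21))
print(auroc, acc_fixed, acc_recal)
```

The fixed-threshold accuracy lands below the 79% majority baseline even though AUROC stays high; recalibrating the threshold to the deployment prevalence recovers it. That is the H-Neuron situation in miniature.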

Gao et al.'s original paper reported only accuracy, and their numbers looked much better (70-96% across models and datasets). Part of this is because their dense models had stronger H-Neuron signals. Part of it is that accuracy can be misleading when class distributions shift between training and evaluation. Reporting both metrics — as the final version of our paper does — gives a more honest picture.

The bottom line for Phase 3: H-Neurons exist in MoE models with comparable sparsity and comparable detection power to dense models. The identification methodology generalizes. This is a genuine positive result.


Phase 4, Take 1: The Causal Claim That Wasn't

This is where the story gets honest.

The first version of the paper (V1) ran perturbation experiments on a single configuration: A100 GPU with GPTQ 4-bit quantization, using a Python loop that replaced the entire fused expert forward pass to inject activation scaling. The results looked like this:

| Benchmark | alpha=0 | alpha=1 | alpha=2 | alpha=3 |
|-----------|---------|---------|---------|---------|
| FalseQA | 12.0% | 27.0% | 20.0% | — |
| FaithEval | 77.5% | 72.5% | 67.5% | 55.0% |
| Sycophancy | 10.0% | 8.3% | 10.0% | 11.7% |

The V1 paper looked at FalseQA and said: suppression (alpha=0) gives 12%, baseline (alpha=1) gives 27%, that's a 15 percentage point causal effect. It looked at FaithEval and said: there's a monotonic decrease from 77.5% to 55% as we amplify H-Neurons — the model becomes less compliant with misleading context when H-Neurons are amplified, suggesting they encode "confidence in internal knowledge" rather than over-compliance.

The V1 paper built an entire discussion section around this reinterpretation. It proposed that H-Neurons in MoE models play a different functional role than in dense models — encoding epistemic confidence rather than compliance tendency. It was a creative interpretation. It was also wrong.

Here's what the V1 paper missed or chose not to see:

The FalseQA "effect" isn't monotonic. If amplifying H-Neurons causes more compliance, alpha=2 should be higher than alpha=1. It's not — it drops from 27% to 20%. The 15pp "effect" is the cherry-picked comparison between alpha=0 and alpha=1. Pick alpha=0 vs alpha=2 and you get 8pp. Pick alpha=1 vs alpha=2 and the effect reverses direction. On 100 samples, these swings are within the confidence interval of random variation.

The FaithEval "trend" is on 40 samples. Each percentage point is 0.4 samples. The entire "monotonic decrease" from 77.5% to 55% is a swing of 9 samples. With n=40, the 95% confidence interval for a proportion around 65% is roughly +/-15 percentage points. The whole "trend" fits inside the noise.

Sycophancy was flat and near zero. The paper acknowledged this but didn't grapple with what it means: if H-Neurons encode over-compliance, suppressing them should reduce sycophancy. It doesn't. The model almost never caves to pressure regardless of alpha, because Qwen3.5 is a reasoning model with strong epistemic confidence baked in during training.

Only one experimental configuration was tested. A single implementation on a single hardware platform. No replication.

Accuracy metrics were omitted. The V1 paper reported only AUROC for detection, hiding the fact that binary prediction accuracy was barely above majority-class baselines.

The hybrid Mamba/transformer architecture was not disclosed. The model was described as a standard MoE transformer, omitting the fact that 30 of 40 layers use linear attention with state-space components.

The V1 paper was reviewed and received a B+/A-. The reviewer flagged the non-monotonic FalseQA curve, the tiny FaithEval sample size, the missing accuracy metrics, the undisclosed architecture, and the absent shared-only perturbation ablation. The researcher wrote a reflection document acknowledging these issues. Then they went back to work.


Phase 4, Take 2: The Null Result

The V2 paper did something that takes real integrity: it retracted the causal claim.

Not because new data contradicted the old data — the original numbers didn't change. But because two additional experimental configurations made the null interpretation undeniable.

Configuration A was the original: A100 GPU, GPTQ 4-bit quantization, perturbation via a Python loop replacing the fused expert forward pass. This is the x-version-of-perturbation.py in the codebase.

Configuration B was a rewrite: same A100 GPU, but perturbation via PyTorch register_forward_hook with a post-hoc additive correction. Instead of replacing the entire fused expert kernel with a slow Python loop, the native CUDA kernel runs unmodified and a small correction is computed for only the affected H-Neurons. The math is algebraically identical — if the native output includes z_j * down_proj[:, j] and you want alpha * z_j * down_proj[:, j], you just add (alpha - 1) * z_j * down_proj[:, j] after the fact. This is perturbation.py in the codebase, and it's about 10x faster.
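That algebraic identity is worth verifying numerically. A minimal numpy check (the dimensions and H-Neuron indices here are made up; `down @ z` stands in for the native kernel's output):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 32, 64
down = rng.normal(size=(d_model, d_ff))
z = rng.normal(size=d_ff)        # expert's neuron activations for one token

alpha = 3.0
h_idx = [5, 17, 40]              # hypothetical H-Neuron indices in this expert

# Slow path (Config A style): rerun the expert with scaled activations
z_scaled = z.copy()
z_scaled[h_idx] *= alpha
h_slow = down @ z_scaled

# Fast path (Config B): run the native kernel untouched, then add the correction
h_native = down @ z
correction = (alpha - 1.0) * (down[:, h_idx] @ z[h_idx])
h_fast = h_native + correction

assert np.allclose(h_slow, h_fast)   # algebraically identical
```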

Configuration C was a completely independent implementation: Apple Silicon (64GB unified memory), MLX framework, 4-bit quantization, perturbation via subclass patching of the SwitchGLU and Qwen3NextMLP modules. Different hardware, different framework, different quantization scheme, different code.

Here's what all three showed:

| Benchmark | Config | alpha=0 | alpha=1 | alpha=2 | alpha=3 |
|-----------|--------|---------|---------|---------|---------|
| FalseQA | A (GPTQ) | 12.0% | 27.0% | 20.0% | — |
| FalseQA | B (Post-hoc) | 20.0% | 15.0% | 10.0% | 15.0% |
| FalseQA | C (MLX) | 13.0% | 8.0% | 6.0% | 7.0% |
| FaithEval | A (GPTQ) | 77.5% | 72.5% | 67.5% | 55.0% |
| FaithEval | B (Post-hoc) | 35.0% | 37.5% | 37.5% | 37.5% |
| FaithEval | C (MLX) | 50.0% | 55.0% | 55.0% | 50.0% |
| Sycophancy | B (Post-hoc) | 5.0% | 5.0% | — | — |
| Sycophancy | C (MLX) | 1.7% | 0.0% | ~0% | — |

Read this table row by row and ask yourself: is there a pattern?

FalseQA Config A goes 12, 27, 20. Config B goes 20, 15, 10, 15. Config C goes 13, 8, 6, 7. Three different "patterns," none monotonic, none consistent with each other. If there were a real causal effect, all three implementations — which are algebraically equivalent operations on the same model — should show the same direction. They don't.

FaithEval Config A goes 77.5, 72.5, 67.5, 55 — the "monotonic trend" from V1. Config B goes 35, 37.5, 37.5, 37.5 — dead flat. Config C goes 50, 55, 55, 50 — dead flat. The V1 "trend" was a fluke on 40 samples that didn't replicate. The absolute levels differ across configurations (likely quantization and response-length effects), but no configuration shows an alpha-dependent trend.

Sycophancy is near zero everywhere. The model doesn't cave to pressure regardless of what you do to its H-Neurons.

The V2 paper also verified the hooks were actually working. A forward pass at alpha=0 vs alpha=1 produces a maximum logit difference of 1.84 across the 248,320-dimensional vocabulary. At alpha=3 vs alpha=1, the max difference reaches 3.31. The perturbation is modifying the computation. It's just not modifying it enough to change which tokens get selected during generation.


Why Detection Works but Intervention Fails: Routing Dilution

This is the central puzzle of the entire study, and the V2 paper's most interesting contribution. The same 67 neurons that reliably detect hallucination (AUROC 0.860) have zero causal effect on hallucination behavior. How?

The V2 paper introduces a concept called routing dilution to explain this. The argument has two levels:

Level 1: Routed H-Neurons are intermittently active. 39 of the 67 H-Neurons live in routed experts. Each routed expert is selected for roughly 8/256 = 3.1% of tokens. So for any given token during generation, only the 28 shared H-Neurons are guaranteed to be perturbed. The other 39 are perturbed only when their expert happens to be selected — which is rare for any individual neuron.

Level 2: Even always-active shared H-Neurons are too few. 28 neurons out of 5,263,360 is 0.00053% of the model. In a dense model like Mistral-7B with ~590K neurons, the same 0.01 per mille ratio gives you neurons that each contribute a larger fraction of the total computation. In Qwen3.5 with 5.26M neurons distributed across 257 experts per layer, each neuron's contribution is diluted across far more parallel pathways.
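Putting the two levels together, the expected number of H-Neurons actually perturbed on any given token is small. The arithmetic below uses the paper's own counts and assumes uniform routing, which (as discussed later) is itself an approximation:

```python
shared_h, routed_h = 28, 39
p_routed = 8 / 256                       # per-expert selection rate, ~3.1%

expected_active = shared_h + routed_h * p_routed
print(expected_active)                   # ~29.2 of the 67 H-Neurons per token

total_neurons = 5_263_360
print(expected_active / total_neurons)   # ~5.6e-6 of the model's neurons
```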

Think of it this way. For detection, you're reading the full activation trace after generation is complete. You aggregate CETT contributions across all tokens and all active experts, accumulating a weak but consistent signal into a discriminative feature. The classifier has the luxury of looking at the whole picture.

For intervention, you're trying to influence token-by-token generation decisions in real time. Each decision depends on the full hidden state — to which any single neuron contributes a vanishingly small fraction. The perturbation signal is present (the logit differences prove it) but too weak to cross the decision boundary for token selection. And this weakness compounds across hundreds of tokens in a typical response.

To put numbers on it: Gao et al. reported compliance swings of 30-50 percentage points in dense models. We see swings of 5-7 percentage points — an order of magnitude smaller, and indistinguishable from sampling noise at our evaluation sizes. This isn't a matter of degree. It's a qualitative difference. Dense models show clear dose-response curves. Our MoE model shows flat noise.

| Model | Architecture | FalseQA Range | FaithEval Range |
|-------|--------------|---------------|-----------------|
| Mistral-7B | Dense | ~50pp | ~30pp |
| Gemma-2-9B | Dense | ~40pp | ~25pp |
| Llama-3.1-70B | Dense | ~35pp | ~20pp |
| Qwen3.5-35B-A3B | MoE | 7pp | 5pp |

The MoE routing mechanism appears to create a natural resilience to single-neuron perturbation that dense models lack. This has implications beyond hallucination research — it suggests that the entire paradigm of neuron-level causal intervention, which has been productive in dense models, may not transfer to MoE architectures without rethinking the granularity of intervention.


What the V1-to-V2 Journey Teaches Us

This study went through three stages: initial results that looked promising, a review that challenged the interpretation, and a corrected version that reported a null finding. That arc is worth examining because it illustrates patterns that show up constantly in ML research.

Confirmation bias is the default

The researcher expected to replicate Gao et al.'s causal findings. When the FalseQA numbers showed a 15pp swing between two alpha values, it was easy to see a "causal effect" rather than noise. When FaithEval showed a downward slope on 40 samples, it was easy to build a theory around it. The human tendency is to find the story in the data, especially when you have a prior about what the story should be.

The fix wasn't more sophisticated statistics. It was replication. Running the same experiment three different ways on two different hardware platforms made the null result impossible to explain away. If your "effect" produces a different pattern every time you measure it, it's not an effect.

Report the metrics that hurt

The V1 paper reported AUROC but not accuracy. AUROC looked good (0.860). Accuracy was embarrassing (65.4% vs 63.4% majority baseline on TriviaQA, 21% on NonExist). The V2 paper reports both, with a full paragraph explaining why the accuracy numbers are bad and what they mean. This is harder to write but more useful to the field.

The same principle applies to the null perturbation result. A paper that says "we found H-Neurons and they causally control hallucination in MoE" would be more publishable than one that says "we found H-Neurons but they don't do anything when you poke them." The V2 paper chose honesty over a cleaner narrative. The null result — properly documented and replicated — is arguably a bigger contribution than a positive result would have been, because it identifies a fundamental limitation of neuron-level intervention in MoE architectures.

Disclose your architecture

The V1 paper described Qwen3.5-35B-A3B as a standard MoE transformer. It's not. It's a hybrid Mamba/transformer with linear attention in 75% of its layers. This isn't a minor detail — it means the hidden states entering the MoE feedforward blocks come from fundamentally different computations depending on the layer. The V2 paper discloses this fully and even notes that routed H-Neurons are statistically concentrated in linear attention layers (p=0.040).

If you're working with a model, read the architecture paper. Read the config file. If there's something unusual, say so. Reviewers will find it, and the omission looks worse than the complexity.

The experiment you didn't run is the one that matters

The single most important missing experiment across both versions is the shared-only vs. routed-only perturbation ablation. The code supports it — load_h_neurons() accepts a neuron_type parameter, the compliance evaluation passes it through, the shell script run_b3.sh is set up for a three-way ablation. The infrastructure is built. The experiment was never run.

This matters because the routing dilution hypothesis makes a specific prediction: shared-only perturbation (28 always-active neurons) should produce a stronger effect than all-67 perturbation, because the routed neurons are mostly inactive and add noise to the intervention. If shared-only perturbation produced even a weak signal while routed-only produced nothing, it would confirm the two-level dilution mechanism. If shared-only also produced nothing, it would suggest the problem isn't routing but something deeper — maybe 28 neurons out of 5.26 million is simply too few to matter regardless of activation frequency.

The V2 paper correctly frames routing dilution as a hypothesis rather than a finding. But the ablation that could have tested it was within reach.


Where This Goes Next

The V2 paper suggests several directions, and having reviewed the code and data, I think some are more promising than others.

Expert-level intervention is the most obvious next step. Instead of scaling individual neurons within experts, modify the router's gating weights to reduce the selection probability of experts that contain high concentrations of H-Neurons. Layer 12 has five routed H-Neurons across experts 18, 107, 139, 177, and 242 — suppressing those entire experts (all 512 neurons each) would produce a much stronger signal than the 1-5 neuron perturbations attempted here.
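One plausible shape for that intervention, masking suppressed experts out of the router's top-k before the softmax, can be sketched in a few lines. This is our illustration, not code from the study; the function name and the top-k-then-softmax ordering are assumptions about how such a patch would be wired in:

```python
import numpy as np

def route_with_suppression(logits, topk=8, suppressed=()):
    """Top-k expert selection with some experts excluded (illustrative sketch).

    logits:     (n_experts,) router logits for one token.
    suppressed: expert indices to remove from selection entirely.
    Returns (selected indices, softmax weights over the selected experts).
    """
    masked = logits.copy()
    masked[list(suppressed)] = -np.inf   # suppressed experts can never win top-k
    idx = np.argsort(masked)[-topk:]
    w = np.exp(masked[idx] - masked[idx].max())
    return idx, w / w.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=256)
# e.g. suppress the layer-12 experts named above
idx, w = route_with_suppression(logits, topk=8, suppressed=(18, 107, 139, 177, 242))
assert not set(int(i) for i in idx) & {18, 107, 139, 177, 242}
```

The appeal is granularity: zeroing a gating weight silences all 512 neurons of an expert at once, a far larger perturbation than scaling a handful of neurons inside it.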

Testing on a non-reasoning MoE model would help disentangle two confounds. Qwen3.5-35B-A3B generates extended chain-of-thought in <think> blocks before answering. This thinking process may buffer perturbation effects — the model has hundreds of tokens of "recovery time" between the perturbed activations and the final answer. A non-reasoning MoE model (like Mixtral) would remove this confound.

Measuring actual routing frequencies for H-Neuron experts would refine the dilution analysis. The 3.1% activation probability assumes uniform routing, but MoE routers don't route uniformly. If the experts containing H-Neurons happen to be "popular" experts selected more frequently, the dilution effect would be weaker than estimated. If they're rarely-selected specialists, it would be stronger.

Cross-architecture replication on Mixtral (top-2 routing) or DeepSeek-V3 (multi-head latent attention with MoE) would test whether the null perturbation result is specific to Qwen3.5's hybrid Mamba architecture or general to MoE models.


The Bottom Line

We set out to answer three questions about H-Neurons in MoE models:

Do they exist? Yes. 67 H-Neurons at 0.0127 per mille — consistent with dense model findings. The identification methodology transfers cleanly to MoE.

Can they detect hallucination? Yes, with caveats. AUROC of 0.860 in-domain, 0.697-0.768 cross-domain. The ranking signal is real and generalizable. But binary prediction accuracy is near majority-class baselines without threshold recalibration — a practical limitation that matters for deployment.

Can they control hallucination? No. Three independent experimental configurations, two hardware platforms, two quantization schemes — all show the same thing: scaling H-Neuron activations from full suppression to 3x amplification produces no systematic change in compliance behavior. The MoE routing mechanism creates a resilience to single-neuron perturbation that dense models don't have.

The positive finding (detection works) and the negative finding (intervention doesn't) are both real contributions. But if we're being honest, the negative finding is the more important one. It tells us something fundamental about the difference between dense and sparse architectures: you can read the same signals from both, but you can only write to one of them at the neuron level. MoE models distribute their computation across too many parallel pathways for any individual neuron to serve as a meaningful control point.

For the field of mechanistic interpretability, this means the tools that work on dense models — activation patching, neuron-level steering, causal tracing through individual units — may need to be rethought for the MoE architectures that increasingly dominate the frontier. The atoms of intervention in MoE might not be neurons. They might be experts, routing decisions, or coordinated groups of neurons across multiple experts.

That's a harder problem. But at least now we know it's the right one to work on.