Back to Signal Feed
CodeTracked since May 20, 2026

Enable Optional Stage-2 SimHash Pattern Clustering in MoAI Classifier

Added an opt-in Stage-2 clustering upgrade for the MoAI harness pattern classifier (SPEC-V3R4-HARNESS-003): when `learning.classifier.stage_2_enabled` is enabled, pattern events are passed through a SimHash64 (FNV-1a) + Hamming-distance Union-Find clustering path, while Stage-1 classification output remains byte-identical by default (`stage_2_enabled: false`).

Embedding-Cluster ClassifierSimHashUnion-FindHamming distance

What Happened

  • Added an opt-in Stage-2 clustering upgrade for the MoAI harness pattern classifier (SPEC-V3R4-HARNESS-003): when `learning.classifier.stage_2_enabled` is enabled, pattern events are passed through a SimHash64 (FNV-1a) + Hamming-distance Union-Find clustering path, while Stage-1 classification output remains byte-identical by default (`stage_2_enabled: false`).
  • Added an opt-in Stage-2 clustering upgrade for the MoAI harness pattern classifier (SPEC-V3R4-HARNESS-003): when `learning.classifier.stage_2_enabled` is enabled, pattern events are passed through a SimHash64 (FNV-1a) + Hamming-distance Union-Find clustering path, while Stage-1 classification output remains byte-identical by default (`stage_2_enabled: false`).
  • 1 evidence item attached for review.

What is Different

Before

Scattered source updates, isolated context, and manual follow-up across multiple feeds.

Now

Introduced a concrete new capability: configuration-driven, opt-in Tier-2 pattern aggregation that computes 64-bit SimHash features, clusters matches by Hamming distance via Union-Find, logs merge audits, and applies a privacy guard using only 64-byte `PromptPreview` instead of raw `PromptContent`.

Why Track This

Why It Matters

Operators and model platform teams can reduce noisy duplicate pattern signals by turning on the new Stage-2 clustering mode, while keeping existing deployments unchanged by default because Stage-2 is off unless explicitly configured. This adds a controlled quality-improvement path for triage without forcing behavior changes in production, and it also lowers privacy risk by preventing full prompt content from being fed into clustering logic. The implementation already shows strong AC pass rate and 6.6ms/op benchmark margins, but rollout should continue to watch Tier assignment correctness and uncovered filesystem/error-path tests, plus remote CI (Windows + CodeQL) outcomes before broad rollout.

Impact

Operators and model platform teams can reduce noisy duplicate pattern signals by turning on the new Stage-2 clustering mode, while keeping existing deployments unchanged by default because Stage-2 is off unless explicitly configured. This adds a controlled quality-improvement path for triage without forcing behavior changes in production, and it also lowers privacy risk by preventing full prompt content from being fed into clustering logic. The implementation already shows strong AC pass rate and 6.6ms/op benchmark margins, but rollout should continue to watch Tier assignment correctness and uncovered filesystem/error-path tests, plus remote CI (Windows + CodeQL) outcomes before broad rollout.

What To Watch Next

  • Watch whether Embedding-Cluster Classifier becomes a repeated pattern.
  • Track follow-up changes around Evals and Benchmarks.
  • Compare future signals against this evidence trail.
  • Re-check risk flags: tier_field_merged_value_not_asserted, cluster_merge_audit_os_error_paths_uncovered.
Open Topic TimelineOpen Technical EventOpen Original Sourcetier_field_merged_value_not_asserted / cluster_merge_audit_os_error_paths_uncovered / with_defaults_zero_field_branches_not_directly_tested / remote_windows_ci_and_codeql_pending

Supporting Evidence