Track important changes in Evals and Benchmarks, including capabilities, product updates, adoption signals, risks, and evidence worth continued monitoring.
This change updates SublinearAdapter so solver calls use the native `mcp__ruflo-sublinear__solve` tool when reachable, with automatic fallback to the in-repo JS CG kernel when the native path is unavailable.
Why It Matters
Operators using `trader-portfolio-cg` can keep risk-solving workflows running during native-tool outages while automatically gaining the native sublinear path when the MCP tool is mounted, which reduces runtime disruption risk and preserves throughput in mixed environments. The PR also makes the backend choice auditable per solve result and ships parity-checked local benchmarks (1.61x–1.9x faster than Neumann locally, parity within tight error bounds). Teams should now watch whether CI with a mounted daemon actually validates the expected native path, and whether tool-mount detection or canary override flags ever misroute traffic to the wrong solver.
Final score: 81. Confidence: 94. 1 evidence item.
SublinearAdapter, mcp__ruflo-sublinear__solve, RUFLO_SUBLINEAR_NATIVE, SolveResult, trader-portfolio-cg, [email protected]
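The native-first dispatch with local fallback described in this entry can be sketched roughly as follows. This is an illustration in Python, not the PR's code (the adapter itself is described as using a JS kernel). `solve_with_fallback`, `conjugate_gradient`, and the `native_solve` callable are hypothetical names; only the backend tags `cg-sublinear-native` and `cg-local` are taken from the entry.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def conjugate_gradient(a: Matrix, b: Sequence[float], tol: float = 1e-10,
                       max_iter: int = 1000) -> Vector:
    """Plain conjugate gradient for a symmetric positive-definite system Ax = b."""
    n = len(b)
    x = [0.0] * n
    r = list(b)   # residual b - A x for the initial guess x = 0
    p = list(r)   # first search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = [sum(a[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x


def solve_with_fallback(a: Matrix, b: Sequence[float],
                        native_solve: Optional[Callable] = None) -> Tuple[Vector, str]:
    """Try the native solver first; otherwise use local CG. Returns (x, backend tag)."""
    if native_solve is not None:
        try:
            return list(native_solve(a, b)), "cg-sublinear-native"
        except Exception:
            # Native path unreachable or failed: fall through to the local kernel.
            pass
    return conjugate_gradient(a, b), "cg-local"
```

Returning the backend tag with every result is what makes the routing auditable: a caller can count how often `cg-local` appears when the native tool was expected to be mounted. A production adapter would presumably also record why the native call failed rather than swallowing the exception as this sketch does.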
Added an opt-in Stage-2 clustering upgrade for the MoAI harness pattern classifier (SPEC-V3R4-HARNESS-003): when `learning.classifier.stage_2_enabled` is enabled, pattern events are passed through a SimHash64 (FNV-1a) + Hamming-distance Union-Find clustering path, while Stage-1 classification output remains byte-identical by default (`stage_2_enabled: false`).
Why It Matters
Operators and model platform teams can reduce noisy duplicate pattern signals by turning on the new Stage-2 clustering mode, while existing deployments stay unchanged because Stage-2 is off unless explicitly configured. This adds a controlled quality-improvement path for triage without forcing behavior changes in production, and it lowers privacy risk by keeping full prompt content out of the clustering logic. The implementation already shows a strong AC pass rate and 6.6 ms/op benchmark margins, but before broad rollout teams should keep watching Tier assignment correctness, uncovered filesystem/error-path tests, and remote CI (Windows + CodeQL) outcomes.
Final score: 80. Confidence: 96. 1 evidence item.
Embedding-Cluster Classifier, SimHash, Union-Find, Hamming distance, Pattern clustering, PromptPreview, stage_2_enabled, hamming_threshold
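To make the clustering path in this entry concrete, here is a minimal Python sketch of 64-bit SimHash over FNV-1a token hashes, grouped with Union-Find when the Hamming distance is within a threshold. It is an illustration under assumptions, not the harness implementation: whitespace tokenization, the pairwise comparison loop, and all function names are hypothetical. The 64-byte truncation only mimics the idea of clustering on a short preview instead of the full prompt.

```python
from typing import List

FNV_OFFSET = 0xCBF29CE484222325  # FNV-1a 64-bit offset basis
FNV_PRIME = 0x100000001B3        # FNV-1a 64-bit prime
MASK64 = (1 << 64) - 1


def fnv1a_64(data: bytes) -> int:
    """FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h


def simhash64(data: bytes) -> int:
    """64-bit SimHash: per-bit vote across the FNV-1a hashes of all tokens."""
    weights = [0] * 64
    for token in data.split():
        h = fnv1a_64(token)
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def cluster(previews: List[str], threshold: int) -> List[int]:
    """Return one cluster root index per preview, merging pairs within threshold."""
    # Assumption: hash only the first 64 bytes, standing in for a short preview.
    hashes = [simhash64(p.encode("utf-8")[:64]) for p in previews]
    parent = list(range(len(previews)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(hashes)):
        for j in range(i + 1, len(hashes)):
            if hamming(hashes[i], hashes[j]) <= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(previews))]
```

The pairwise loop is quadratic in the number of events, which is fine for a sketch but would need bucketing or banding at scale. A larger threshold merges more aggressively, so the threshold setting is the main knob to watch for over-merging.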
This change adds a dedicated daily evaluation module so teams can benchmark browser-agent runs in a repeatable, task-card-driven way instead of using ad-hoc scripts, making quality drift easier to detect over time.
Why It Matters
Evaluation operators and model teams can now run the same daily browser-agent tasks through one CLI workflow and compare human-vs-agent outcomes side by side, so regressions in automation quality are easier to detect before they spread into broader evaluation or deployment workflows. The pull request adds a self-contained `browser_use/experiments/daily_task_eval` package with `A/B/C/D` presets, navigator options (including periodic replanning), and structured JSON/CSV outputs (`history`, `conversation`, summaries). The next things to watch are whether score variance stays stable across repeated runs and provider backends, and whether replan timing or preset/provider defaults add noise that skews trend interpretation.
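One way to monitor the score-variance watch point is to group repeated runs and compare their spread. The sketch below is a generic Python illustration; the record fields `task`, `preset`, and `score` are assumptions and not the module's actual JSON/CSV schema.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, Tuple


def score_spread(records: Iterable[dict]) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Group scores by (task, preset) and return (mean, population std dev)."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["task"], rec["preset"])].append(float(rec["score"]))
    return {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}
```

A group whose standard deviation is large relative to the day-over-day change in its mean is a hint that an apparent trend may just be run-to-run noise.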
The PR adds support for running evals directly against a deployed agent, so model quality checks can target the exact service endpoint that users interact with.
Why It Matters
Developers and operators can now validate agent behavior on the same deployed instance that serves requests, which makes release quality checks more representative of production outcomes and helps catch regressions before users are exposed. After rollout, monitor for endpoint-level issues (auth, network stability, and environment drift across staging/production) and for evaluation flakiness from rate limits or timeouts that could hide or overstate real failures.
Final score: 77. Confidence: 97. 1 evidence item.
open-swe, deployed agent, eval execution
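The flakiness concern in this entry suggests keeping endpoint problems separate from genuine wrong answers when scoring. The Python sketch below shows one possible shape for that. `call_agent` is a hypothetical stand-in for whatever client reaches the deployed agent, and the substring check is a placeholder grader, not the PR's evaluation logic.

```python
from typing import Callable, Dict, List


def run_case(call_agent: Callable[[str], str], prompt: str, expected: str,
             retries: int = 2) -> str:
    """Return 'pass', 'fail', or 'infra_error' for a single eval case."""
    for _ in range(retries + 1):
        try:
            answer = call_agent(prompt)
        except (TimeoutError, ConnectionError):
            continue  # transient endpoint problem: retry instead of counting a failure
        return "pass" if expected in answer else "fail"
    return "infra_error"  # every attempt hit an endpoint problem


def summarize(outcomes: List[str]) -> Dict[str, int]:
    """Count outcomes so infra errors are reported apart from real failures."""
    counts = {"pass": 0, "fail": 0, "infra_error": 0}
    for outcome in outcomes:
        counts[outcome] += 1
    return counts
```

Reporting `infra_error` as its own bucket keeps rate limits and timeouts from inflating the failure rate, and keeps silent skips from inflating the pass rate.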
Variant A changes only the plugin-mode prompt/description logic so the two overlapping test skills can be selected correctly, removing conflicting exclusion text and replacing it with explicit invocation rules and tighter "do not use" boundaries.
Why It Matters
Developers and operators using these skills in plugin mode should see the intended skill invoked more reliably, which reduces no-op or wrong-check responses during automated test analysis and helps keep review suggestions aligned with user intent. In practice, this targets the earlier failure pattern in which anti-pattern and smell requests often triggered neither skill or the sibling one. The fix removes the mutually exclusive guidance and adds explicit `INVOKE THIS SKILL` trigger phrasing. Continue monitoring `/evaluate` results and real user prompts for residual sibling misrouting or any drop in recall on edge prompts.
Final score: 76. Confidence: 92. 1 evidence item.
plugin mode, test-anti-patterns, test-smell-detection, skill activation prompts, mutual-exclusion clauses, DO NOT USE FOR, skill-validator
The PR converts Databricks AI skill ground-truth definitions from MCP-prescriptive response checks to tool-agnostic natural-language checks, preserving metadata and restructuring cases so the same benchmarks can compare MCP-enabled and CLI-only workflows consistently in A/B evals.
Why It Matters
Evaluators and model-engineering teams can now run A/B evaluations of Databricks skills on a common, tool-neutral basis, which should make behavioral regressions and real improvements easier to detect before review instead of being confused with MCP-vs-CLI prompt formatting differences. The run already shows a concrete CLI-side quality win for agent-bricks. MCP-only cases were intentionally removed, so watch for coverage gaps in MCP-only scenarios and whether dropped cases need equivalent replacements.
Final score: 75. Confidence: 90. 1 evidence item.
tool-agnostic ground truths, stf compare, outputs.response, MCP, CLI, databricks-agent-bricks, databricks-ai-functions
The announcement introduces a new multimodal evaluation path in Strands Evals: an MLLM-as-a-judge method for image-to-text tasks, so generated text can be checked against the source image content instead of being judged only by text-only criteria.
Why It Matters
Developers running visual AI systems can now automate checks that catch ungrounded or incorrect image-to-text outputs before users see them, reducing silent quality regressions in shopping, document-processing, and chart-analysis workflows. This matters because it moves evaluation from heuristic text-only checks to a multimodal judge path in Strands Evals, so mismatches like wrong invoice totals or off-image descriptions can be detected earlier. Watch for judge-score calibration drift, domain-specific hallucination tolerance, and sensitivity to prompt/template changes, because the evaluator itself becomes part of the quality-control loop.
Final score: 73. Confidence: 87. 1 evidence item.
Strands Evals, MLLM-as-a-judge, multimodal evaluation, image-to-text
This PR republished the 2026-05-17 startaitools Tier 2 deep-dive on honest performance benchmarking for pipelines that route through paid APIs, with a concrete method for deterministic test data generation and explicit API-access gating.
Why It Matters
Teams benchmarking paid-API or compiler integrations can compare results across machines and runs with fewer misleading fluctuations, reducing the chance of shipping the wrong optimization because of noisy measurements. Next, operators should monitor whether CI and local scripts correctly propagate `API_KEY` and `EXPLICIT_OPT_IN`, because gating mismatches can silently suppress workloads or produce less trustworthy comparisons when skip events are not consistently recorded.
Final score: 72. Confidence: 88. 1 evidence item.
seeded-RNG corpus generation, API_KEY, EXPLICIT_OPT_IN, paid API benchmark pipeline
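The two ideas in this entry, seeded corpus generation and explicit gating of paid-API runs, can be sketched in a few lines of Python. This is an illustration under assumptions, not the deep-dive's method: the vocabulary, the function names, and the convention that `EXPLICIT_OPT_IN` must equal `"1"` are all hypothetical. Only the variable names `API_KEY` and `EXPLICIT_OPT_IN` come from the entry.

```python
import random
from typing import List, Mapping, Optional

VOCAB = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]


def make_corpus(seed: int, n_docs: int, words_per_doc: int = 8) -> List[str]:
    """Deterministic synthetic corpus: the same seed yields the same documents."""
    rng = random.Random(seed)  # private generator, independent of global random state
    docs = []
    for _ in range(n_docs):
        # Index with rng.random() only, the method whose sequence Python documents as stable.
        words = [VOCAB[int(rng.random() * len(VOCAB))] for _ in range(words_per_doc)]
        docs.append(" ".join(words))
    return docs


def paid_api_skip_reason(env: Mapping[str, str]) -> Optional[str]:
    """Return None when the paid-API benchmark may run, else a reason to record."""
    if not env.get("API_KEY"):
        return "skipped: API_KEY not set"
    if env.get("EXPLICIT_OPT_IN") != "1":  # assumed convention: "1" means opted in
        return "skipped: EXPLICIT_OPT_IN not set to 1"
    return None
```

In use, a runner would call `paid_api_skip_reason(os.environ)` and write any non-`None` reason into the results file, so a skipped workload shows up as an explicit record and not as a missing row.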
A new open issue in databricks-solutions/ai-dev-kit asks maintainers to run evaluations on the experimental `ai-ml-engineering` branch, with explicit test scope across `databricks-agent-bricks`, `databricks-ai-functions`, `databricks-model-serving`, `databricks-mlflow-evaluation`, and `databricks-vector-search`.
Why It Matters
Developers and operators who depend on these Databricks AI components will have a clearer gate for adopting new experimental features, because the issue requests broad evaluation coverage before rollout, reducing the chance of exposing users to untested branch behavior. If executed, the evals should surface regressions and integration gaps across agent, serving, function, model-evaluation, and vector-search areas; the next watch points are whether results are completed, whether failures are consistent across repos, and whether follow-up fixes are triggered from the test findings.
Final score: 64. Confidence: 96. 1 evidence item.
databricks-solutions/ai-dev-kit, ai-ml-engineering, databricks-agent-bricks, databricks-ai-functions, databricks-model-serving, databricks-mlflow-evaluation, databricks-vector-search, evaluation
This change adds a dedicated daily evaluation module so teams can benchmark browser-agent runs in a repeatable, task-card-driven way instead of using ad-hoc scripts, making quality drift easier to detect over time.
Contribution
Introduces a concrete, reusable evaluation harness for browser-use by adding a standalone experiment module plus CLI orchestration (`init`, `run-agent`, `compare`) that executes predefined task cards, supports multiple model backends/presets, and produces comparable result artifacts.
This change updates SublinearAdapter so solver calls use the native `mcp__ruflo-sublinear__solve` tool when reachable, with automatic fallback to the in-repo JS CG kernel when the native path is unavailable.
Contribution
Implemented native dispatch integration for sublinear conjugate-gradient solving in `SublinearAdapter` (with preserved legacy alias compatibility), including backend-discovery fallback logic and explicit result tagging (`cg-sublinear-native` vs `cg-local`, `[email protected]` vs `local-js-cg`) for operator visibility.
The announcement introduces a new multimodal evaluation path in Strands Evals: an MLLM-as-a-judge method for image-to-text tasks, so generated text can be checked against the source image content instead of being judged only by text-only criteria.
Contribution
Adds a concrete multimodal evaluator capability for image-to-text pipelines, replacing text-only validation with an MLLM judge that verifies whether outputs are grounded in the visual input (for captions, document extraction, and chart summaries).
Added an opt-in Stage-2 clustering upgrade for the MoAI harness pattern classifier (SPEC-V3R4-HARNESS-003): when `learning.classifier.stage_2_enabled` is enabled, pattern events are passed through a SimHash64 (FNV-1a) + Hamming-distance Union-Find clustering path, while Stage-1 classification output remains byte-identical by default (`stage_2_enabled: false`).
Contribution
Introduced a concrete new capability: configuration-driven, opt-in Tier-2 pattern aggregation that computes 64-bit SimHash features, clusters matches by Hamming distance via Union-Find, logs merge audits, and applies a privacy guard using only 64-byte `PromptPreview` instead of raw `PromptContent`.
The PR converts Databricks AI skill ground-truth definitions from MCP-prescriptive response checks to tool-agnostic natural-language checks, preserving metadata and restructuring cases so the same benchmarks can compare MCP-enabled and CLI-only workflows consistently in A/B evals.
Contribution
Replaced MCP-specific expectation blocks in the eval cases with tool-independent natural-language response criteria and aligned per-skill suites for direct A/B comparison (MCP 44-tools vs CLI 0-tools), with metadata retained so evaluations remain traceable.
A new open issue in databricks-solutions/ai-dev-kit asks maintainers to run evaluations on the experimental `ai-ml-engineering` branch, with explicit test scope across `databricks-agent-bricks`, `databricks-ai-functions`, `databricks-model-serving`, `databricks-mlflow-evaluation`, and `databricks-vector-search`.
Contribution
Adds an explicit validation signal for an experimental branch: a coordinated eval request that ties the `ai-ml-engineering` changes to five related Databricks AI repositories and defines where quality and integration checks should be executed before broader use.
Variant A changes only the plugin-mode prompt/description logic so the two overlapping test skills can be selected correctly, removing conflicting exclusion text and replacing it with explicit invocation rules and tighter "do not use" boundaries.
Contribution
Implemented an activation-only rewrite for the two test-analysis skills in plugin mode: both skills are retained, cross-skill exclusion instructions were removed, USE-FOR triggers were expanded with explicit eval-grade phrasing, DO NOT USE FOR was narrowed to actual disallowed cases, and prompt wording from PR #653 was integrated to reduce wrong or empty skill selection.
This PR republished the 2026-05-17 startaitools Tier 2 deep-dive on honest performance benchmarking for pipelines that route through paid APIs, with a concrete method for deterministic test data generation and explicit API-access gating.
Contribution
Introduces a reproducible benchmark workflow: deterministic corpus generation and explicit opt-in/credential checks for paid-API calls, so benchmark results are less dependent on accidental randomness or implicit run conditions.
The PR adds support for running evals directly against a deployed agent, so model quality checks can target the exact service endpoint that users interact with.
Contribution
Introduces a deployed-agent evaluation path, enabling validation runs to execute against live deployment targets rather than only local test setups.