Benchmark
Tracked since May 19, 2026

Make AI skill ground truths tool-agnostic for MCP vs CLI A/B evaluation

The PR converts Databricks AI skill ground-truth definitions from MCP-prescriptive response checks to tool-agnostic natural-language checks, preserving metadata and restructuring cases so the same benchmarks can compare MCP-enabled and CLI-only workflows consistently in A/B evals.

tool-agnostic ground truths / stf compare / outputs.response / MCP

What Happened

  • The PR converts Databricks AI skill ground-truth definitions from MCP-prescriptive response checks to tool-agnostic natural-language checks, preserving metadata and restructuring cases so the same benchmarks can compare MCP-enabled and CLI-only workflows consistently in A/B evals.
  • 1 evidence item attached for review.

What is Different

Before

Scattered source updates, isolated context, and manual follow-up across multiple feeds.

Now

The PR replaces MCP-specific expectation blocks in the eval cases with tool-independent, natural-language response criteria, and aligns the per-skill suites for direct A/B comparison (MCP with 44 tools vs CLI with 0 tools). Metadata is retained so evaluations remain traceable.
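A minimal sketch of what such a conversion could look like. The case layout is hypothetical: only the `outputs.response` path appears in this item's tags, and the `expectations` block, the tool name, and the `to_tool_agnostic` helper are all assumptions made for illustration, not the PR's actual schema or code.

```python
from copy import deepcopy

# Hypothetical MCP-prescriptive case. Field names other than
# outputs.response are assumptions, not the PR's real schema.
mcp_case = {
    "id": "example-001",
    "metadata": {"skill": "example-skill", "source": "hand-written"},
    "inputs": {"prompt": "List the tables in the main schema."},
    "expectations": {
        # Tool-specific check: only an MCP-enabled agent can satisfy it.
        "expected_tool_calls": ["hypothetical_mcp_list_tables"],
    },
    "outputs": {"response": "Calls hypothetical_mcp_list_tables and prints the result."},
}


def to_tool_agnostic(case, criteria):
    """Return a copy of `case` whose checks no longer name any tool.

    The tool-specific `expectations` block is dropped, the response
    criterion is replaced with natural language, and metadata is kept.
    """
    converted = deepcopy(case)
    converted.pop("expectations", None)
    converted["outputs"] = {"response": criteria}
    return converted


agnostic_case = to_tool_agnostic(
    mcp_case,
    "The response lists the tables in the main schema.",
)
```

Because the converted case describes the expected outcome rather than the mechanism, the same case can be scored against an MCP-enabled run and a CLI-only run without favoring either.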

Why Track This

Why It Matters

Evaluators and model-engineering teams can now run A/B evaluations of Databricks skills on a common, tool-neutral basis. This should make behavioral regressions and real improvements easier to detect before review, rather than being obscured by MCP-vs-CLI prompt formatting differences. The run already shows a concrete CLI-side quality win for agent-bricks. MCP-only cases were intentionally removed, so watch for coverage gaps in MCP-only scenarios and whether dropped cases need equivalent replacements.


What To Watch Next

  • Watch whether tool-agnostic ground truths become a repeated pattern.
  • Track follow-up changes around Evals and Benchmarks.
  • Compare future signals against this evidence trail.
  • Re-check risk flags: reduced_coverage_for_mcp_only_scenarios, dropped_cases_masking_mcp_regressions.
Risk flags: reduced_coverage_for_mcp_only_scenarios / dropped_cases_masking_mcp_regressions / evaluation_results_not_committed_to_repo / single_commit_validation_gap
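The coverage risk flagged above (reduced_coverage_for_mcp_only_scenarios, dropped_cases_masking_mcp_regressions) could be audited by diffing case IDs between the old and new suites. The sketch below assumes each case carries an `id` and a `metadata.skill` field. Both field names, the sample data, and the `dropped_cases_by_skill` helper are hypothetical and do not come from the PR.

```python
from collections import defaultdict


def dropped_cases_by_skill(before, after):
    """Group IDs of cases present in `before` but absent from `after` by skill."""
    after_ids = {case["id"] for case in after}
    dropped = defaultdict(list)
    for case in before:
        if case["id"] not in after_ids:
            dropped[case["metadata"]["skill"]].append(case["id"])
    return {skill: sorted(ids) for skill, ids in dropped.items()}


# Hypothetical suites: case-b stands in for an MCP-only case that was removed.
before_suite = [
    {"id": "case-a", "metadata": {"skill": "skill-one"}},
    {"id": "case-b", "metadata": {"skill": "skill-one"}},
    {"id": "case-c", "metadata": {"skill": "skill-two"}},
]
after_suite = [
    {"id": "case-a", "metadata": {"skill": "skill-one"}},
    {"id": "case-c", "metadata": {"skill": "skill-two"}},
]

report = dropped_cases_by_skill(before_suite, after_suite)
print(report)  # {'skill-one': ['case-b']}
```

A per-skill report like this makes it easier to decide which dropped cases need a tool-agnostic replacement and which were genuinely MCP-specific.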

Supporting Evidence