Benchmark / Tracked since May 19, 2026

Make AI skill ground truths tool-agnostic for MCP vs CLI A/B evaluation

The PR converts Databricks AI skill ground-truth definitions from MCP-prescriptive response checks to tool-agnostic natural-language checks, preserving metadata and restructuring cases so the same benchmarks can compare MCP-enabled and CLI-only workflows consistently in A/B evals.

tool-agnostic ground truths / stf compare / outputs.response / MCP

What Happened

The PR converts Databricks AI skill ground-truth definitions from MCP-prescriptive response checks to tool-agnostic natural-language checks, preserving metadata and restructuring cases so the same benchmarks can compare MCP-enabled and CLI-only workflows consistently in A/B evals.
1 evidence item attached for review.

What is Different

Before

Scattered source updates, isolated context, and manual follow-up across multiple feeds.

Now

Replaced MCP-specific expectation blocks in the eval cases with tool-independent natural-language response criteria and aligned the per-skill suites for direct A/B comparison (MCP with 44 tools vs. CLI with 0 tools). Metadata is retained so evaluations remain traceable.
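As a rough illustration of this change, the sketch below contrasts a tool-prescriptive expectation with a tool-agnostic one. Only the `outputs.response` path comes from the PR summary. The case layout, the other field names, the tool name `mcp_create_endpoint`, the marker list, and the helper `is_tool_agnostic` are assumptions made up for this example, not the repository's actual schema.

```python
# Hypothetical "before" case: the expected response prescribes a specific MCP tool call.
before_case = {
    "id": "example-001",
    "metadata": {"skill": "example-skill"},
    "outputs": {"response": "Call the mcp_create_endpoint tool with name='demo'."},
}

# Hypothetical "after" case: same id and metadata, but the expectation describes
# the outcome in natural language, so MCP and CLI runs can both satisfy it.
after_case = {
    "id": "example-001",
    "metadata": {"skill": "example-skill"},
    "outputs": {"response": "Creates a serving endpoint named 'demo' and confirms it is ready."},
}

# Assumed markers that indicate a tool-specific expectation.
TOOL_MARKERS = ("mcp_", "mcp tool")


def is_tool_agnostic(case: dict) -> bool:
    """Return True when the expected response mentions none of the assumed tool markers."""
    response = case["outputs"]["response"].lower()
    return not any(marker in response for marker in TOOL_MARKERS)
```

A check like this could flag leftover tool-specific wording in a suite, though a real implementation would need the project's actual tool names and case format.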

Why Track This

Why It Matters

Evaluators and model-engineering teams can now run A/B evaluations of Databricks skills on a common, tool-neutral basis. This should make behavioral regressions and real improvements easier to detect before review, rather than being confused with MCP-vs-CLI prompt formatting differences. The run already shows a concrete CLI-side quality win for agent-bricks. MCP-only cases were intentionally removed, so watch for coverage gaps in MCP-only scenarios and check whether the dropped cases need equivalent replacements.

Impact

What To Watch Next

Watch whether tool-agnostic ground truths become a repeated pattern.
Track follow-up changes around Evals and Benchmarks.
Compare future signals against this evidence trail.
Re-check risk flags: reduced_coverage_for_mcp_only_scenarios, dropped_cases_masking_mcp_regressions.
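One way to re-check the coverage-related flags is to count which cases disappeared from each skill's suite between the old and new ground truths. The sketch below assumes each case is a dictionary with hypothetical `id` and `skill` keys; the function name `dropped_cases_by_skill` and the sample data are illustrative only and do not come from the PR.

```python
from collections import Counter


def dropped_cases_by_skill(before: list[dict], after: list[dict]) -> Counter:
    """Count cases present in `before` but missing from `after`, grouped by skill."""
    kept_ids = {case["id"] for case in after}
    return Counter(case["skill"] for case in before if case["id"] not in kept_ids)


# Made-up sample suites: one case from "skill-a" was removed.
before = [
    {"id": "a-1", "skill": "skill-a"},
    {"id": "a-2", "skill": "skill-a"},
    {"id": "b-1", "skill": "skill-b"},
]
after = [
    {"id": "a-1", "skill": "skill-a"},
    {"id": "b-1", "skill": "skill-b"},
]

dropped = dropped_cases_by_skill(before, after)
```

Skills with a non-zero count are the places where an MCP-only regression could go unnoticed until equivalent cases are added back.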

Risk flags: reduced_coverage_for_mcp_only_scenarios / dropped_cases_masking_mcp_regressions / evaluation_results_not_committed_to_repo / single_commit_validation_gap

Supporting Evidence

GITHUB PULL REQUEST / High Trust

databricks-solutions/ai-dev-kit PR #538: Add tool-agnostic A/B ground truths for ai-ml-engineering skills (issue #485)

A single commit rewrites `outputs.response` into tool-agnostic descriptions, adds handcrafted databricks-ai-functions ground truths, and removes MCP-only cases that cannot be mapped to CLI prompts.