Your WebMCP Tools Look Fine. But Will AI Agents Actually Use Them?
Test every tool against Gemini, GPT-4o, and Claude with auto-generated prompts: formal, casual, multilingual, adversarial. Identify the failures that manual testing misses. Fix them before they reach users.
Get Your Free Prompt Coverage Analysis
Our team will test how AI models understand your website's tools and walk you through coverage gaps — completely free.
Simulated data for illustration. Actual results vary by implementation.
The Gap Between “Defined” and “Works” Is Where Customers Disappear
You’ve implemented WebMCP. Your tools pass basic checks. But three invisible problems are waiting to surface in production.
The Semantic Gap
Your tool works when you test “Search for electronics.” But real users say “Find me a birthday present under $50.”
```js
navigator.modelContext.registerTool({
  name: "search_products",
  description: "Search products by category",
  inputSchema: {
    type: "object",
    properties: {
      category: { type: "string" },
      max_price: { type: "number" }
    }
  }
});
```
Real users say: "Find me a birthday present under $50."
Will the agent select the right tool? Extract correct parameters? Avoid hallucinating values?
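One way to shrink the semantic gap is to give the agent more to match against. The sketch below is an illustrative rewrite of the same tool (the exact wording and the `description` fields on each parameter are our suggestions, not a spec requirement): synonyms in the tool description and per-parameter hints help an agent map "birthday present under $50" onto `category` and `max_price`.

```javascript
// Illustrative richer definition for the same tool. The added synonyms and
// per-parameter descriptions are suggestions, not the only valid phrasing.
const searchProductsTool = {
  name: "search_products",
  description:
    "Search products by category. Also handles gift ideas, presents, " +
    "and shopping requests with a budget (e.g. 'under $50').",
  inputSchema: {
    type: "object",
    properties: {
      category: {
        type: "string",
        description: "Product category, e.g. 'electronics', 'toys', 'gifts'"
      },
      max_price: {
        type: "number",
        description: "Maximum price in USD, extracted from phrases like 'under $50'"
      }
    }
  }
};

// In the browser you would register it as before:
// navigator.modelContext.registerTool(searchProductsTool);
```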
The Model Problem
Same prompt. Three AI models. Three different interpretations. Two of them are wrong.
Prompt: “Book me a table for two this Friday at 7”
Gemini assumed AM. Users don’t choose their model — your tools must work with all of them.
The Scale Problem
12 tools × 50 prompts × 3 models = 1,800 test cases. At 2 min each, that's 60 hours of manual work.
And you’d redo it every time you change a description, after every model update, for every new tool.
Without automated testing, you’re deploying with hope. Prompt Coverage Testing replaces hope with data.
From Tool Definitions to Full Coverage Report in 60 Seconds
Four steps. Zero configuration. Paste your tools and let the system do the work.
Provide Your Tools
Paste your registerTool() definitions, enter a URL to auto-detect tools, or upload a tools.json file.
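If you go the `tools.json` route, a reasonable shape is an array of the same objects you would pass to `registerTool()`. This is a hypothetical sketch; the exact schema the uploader accepts may differ.

```javascript
// Hypothetical tools.json contents: an array of registerTool()-style
// definitions. The accepted schema may differ from this sketch.
const toolsJson = JSON.stringify(
  [
    {
      name: "search_products",
      description: "Search products by category",
      inputSchema: {
        type: "object",
        properties: {
          category: { type: "string" },
          max_price: { type: "number" }
        }
      }
    }
  ],
  null,
  2
);
```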
We Generate Prompts
50–200 contextually relevant prompts per tool across 8 categories: formal, casual, multilingual, adversarial, and more.
Test Across Models
Each prompt runs against Gemini, GPT-4o, and Claude simultaneously. 3× per model for stability scoring.
Analyze & Report
Get tool routing accuracy, parameter extraction quality, hallucination detection, and confidence scoring — all in 60 seconds.
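The stability scoring in steps 3 and 4 can be sketched as follows. This is an illustrative model, not the product's actual implementation, and the input/result shapes are hypothetical: each prompt runs three times per model, counts as "pass" only if every run succeeds, and is flagged "flaky" when results are mixed.

```javascript
// Illustrative stability scoring (hypothetical data shapes):
// results: { [promptId]: { [model]: boolean[] } } where each boolean is one run.
function scorePrompts(results) {
  const summary = {};
  for (const [prompt, byModel] of Object.entries(results)) {
    summary[prompt] = {};
    for (const [model, runs] of Object.entries(byModel)) {
      const passes = runs.filter(Boolean).length;
      summary[prompt][model] =
        passes === runs.length ? "pass" : passes === 0 ? "fail" : "flaky";
    }
  }
  return summary;
}

const scores = scorePrompts({
  "book a table at 7": {
    gemini: [true, false, true], // mixed runs => flagged flaky, not passing
    gpt4o: [true, true, true]    // all runs pass => pass
  }
});
```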
Industry prompt packs available: Travel, E-Commerce, Healthcare, Restaurants, Real Estate, SaaS. Each adds 50+ domain-specific prompts.
Your Coverage Report: Everything You Need, Nothing You Don’t
From 30,000-foot overview to individual prompt forensics — drill into exactly the level of detail you need.
Every Failure Gets a Diagnosis and a Fix
Not just “it broke.” We tell you WHY it broke, WHICH models are affected, and exactly HOW to fix it — with predicted success rates.
Expected: book_table (restaurant booking tool)
Actual: contact_form (general inquiry tool)
Root Cause Analysis
The description for book_table says "Book a table at the restaurant" but never mentions "reservation." Gemini treats "reservation" as semantically closer to "inquiry" than "booking." GPT-4o and Claude handle this correctly — this is a Gemini-specific behavior.
WebMCP spec recommends: "Include synonyms and related terms in descriptions to improve agent comprehension."
Recommended Fix
Before: "Book a table at the restaurant"
After: "Book a table at the restaurant. Also handles reservations, dining appointments, and table bookings for any party size."
Expected: { size: 2, date: "2026-02-13", time: "19:00" }
Actual: { size: 2, date: "2026-02-13", time: "07:00" }
Root Cause Analysis
The "time" parameter is typed as "string" with no format constraint. When the user says "at 7," Gemini defaults to 24-hour format and interprets 7 as 07:00 (AM). GPT-4o correctly infers evening context. Claude passes "7pm" as-is.
Recommended Fix
Before: time: { type: "string" }
After: time: { type: "string", description: "Time in 24h HH:MM. For restaurant bookings, assume PM for single-digit hours (7 = 19:00)." }
Expected: { cart_id: "..." }
Actual: { cart_id: "...", shipping_method: "express" }
Root Cause Analysis
Your checkout schema only defines cart_id. GPT-4o invented a shipping_method parameter that doesn’t exist. The user thinks they selected express shipping — but your tool doesn’t support it. Your backend may silently ignore or crash on the extra field.
Recommended Fix
Before: "Complete the purchase for items in the cart."
After: "Complete the purchase for items in the cart. Shipping method is determined automatically — do not pass a shipping parameter."
Three failures. Three different root causes. Three specific fixes with predicted success rates. Most fixes take under 30 seconds to apply.
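Alongside the description fix, a schema constraint can turn hallucinated parameters into an explicit validation error rather than input your backend silently ignores. The sketch below uses standard JSON Schema's `additionalProperties: false`; whether a given agent or runtime enforces it is implementation-dependent, so treat this as defense in depth, not a guarantee.

```javascript
// Illustrative hardened checkout definition. additionalProperties: false is
// standard JSON Schema; enforcement depends on the agent/runtime validating
// arguments against the schema.
const checkoutTool = {
  name: "checkout",
  description:
    "Complete the purchase for items in the cart. Shipping method is " +
    "determined automatically — do not pass a shipping parameter.",
  inputSchema: {
    type: "object",
    properties: {
      cart_id: { type: "string" }
    },
    required: ["cart_id"],
    additionalProperties: false // rejects invented fields like shipping_method
  }
};
```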
Same Prompt. Three Models. Three Different Interpretations.
Your users don’t choose their AI model. Chrome uses Gemini. ChatGPT uses GPT-4o. Claude uses Claude. Your tools need to work with all of them.
Insight: Gemini defaults to AM for ambiguous hours. Claude preserves natural language format. GPT-4o infers evening context from “dinner reservation” semantics.
Insight: Ambiguous prompts reveal the most about tool quality. Claude correctly identified insufficient information. Gemini guessed (sometimes right, sometimes wrong). GPT-4o confidently chose the wrong tool — a “pass” on GPT-4o doesn’t mean it works correctly.
| Tool | Gemini | GPT-4o | Claude | All |
|---|---|---|---|---|
| search_flights | 91% | 96% | 94% | 94% ✓ |
| book_flight | 84% | 92% | 88% | 88% ✓ |
| select_seat | 89% | 93% | 91% | 91% ✓ |
| checkout | 72% | 81% | 79% | 77% ⚠ |
| track_order | 68% | 74% | 82% | 75% ⚠ |
| contact_form | 58% | 64% | 66% | 63% ⚠ |
| ALL TOOLS | 82% | 91% | 88% | 87% |
Example insight: In this simulation, Gemini underperforms on checkout (72%) and contact_form (58%). Actual results vary by implementation.
This matrix tells you not just WHAT’s failing, but WHERE and FOR WHOM. A checkout tool at 72% on Gemini means 28% of Chrome users’ agents will struggle to complete a purchase.
Automated Coverage Testing on Every Deploy
Block deploys that reduce prompt coverage. Catch regressions in PR review. Integrate in under 5 minutes.
```yaml
# .github/workflows/webmcp.yml
name: Agent Readiness Check
on: [push, pull_request]
jobs:
  prompt-coverage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: webmcp/prompt-coverage-action@v2
        with:
          api-key: ${{ secrets.WEBMCP_API_KEY }}
          tools-file: ./webmcp/tools.json
          models: "gemini,gpt4o,claude"
          min-coverage: 85
          fail-on-regression: true
          max-new-failures: 0
          stability-runs: 3
          max-cost: "5.00"
          pr-mode: "smoke"
          pr-model: "gemini"
```
Flakiness Policy
Strict: all 3 stability runs must pass. Zero tolerance.
Best for: Release branches, production deploys
Standard: 2 of 3 runs must pass. Flaky prompts are logged but don't block.
Best for: Daily development
Relaxed: 1 of 3 runs must pass. Maximum developer velocity.
Best for: Feature branches, exploration
CI false-positive rate below 2% — comparable to traditional unit test suites.
Test. Fix. Improve. Repeat.
On average, customers improve from 73% to 89% coverage within two weeks. Here’s how the improvement cycle works.
Iteration 1: "Contact the company"
Iteration 2: "Contact the company for general inquiries"
Iteration 3: "Contact the company for general inquiries, questions, and partnership opportunities. Not for support issues (use support_request) or feedback (use feedback_form)."
A/B Testing Mode
Not sure which description variant will perform better? Use A/B testing mode to compare two description variants side-by-side before committing to a fix. Run both against the same prompt set and see which one scores higher — with statistical significance.
What “Good” Coverage Looks Like — Target Ranges
Illustrative coverage targets by industry vertical. Use these as starting guidelines for your optimization goals.
Coverage Tiers
Agents frequently select wrong tools. Users will notice failures.
Works for common requests. Breaks on natural language variations.
Most realistic scenarios handled. Some model-specific failures remain.
Production-ready. Handles natural language, multilingual, most edge cases.
Top-tier. Adversarial prompts handled. Full multi-model consistency.
| Industry | Before | After | Top 10% | Target |
|---|---|---|---|---|
| E-Commerce | 68% | 86% | 94% | 85%+ |
| Travel & Booking | 71% | 88% | 95% | 90%+ |
| Healthcare | 54% | 79% | 88% | 85%+ |
| Restaurants | 65% | 84% | 92% | 80%+ |
| SaaS Platforms | 72% | 89% | 96% | 85%+ |
| Real Estate | 61% | 82% | 90% | 80%+ |
| Overall Average | 73% | 89% | 94% | 85%+ |
Every failure = lost revenue. Zero tolerance for checkout misrouting.
Failures frustrate users but don’t directly cost revenue.
Lower stakes. Broader tolerance for natural language variation.
Transparent Testing Economics
You should know exactly what prompt coverage testing costs before you run it.
How Testing Is Metered
A “prompt test” = one prompt tested against one model, one time.
Estimated costs include LLM API usage. Pricing subject to change.
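The metering arithmetic is simple enough to sketch. The per-test rate below is derived from this page's own example (12 tools × 100 prompts × 3 models × 3 stability runs ≈ $2.70); it is an illustration, not a published price.

```javascript
// Illustrative cost estimator. The default per-test rate is back-derived from
// the page's example run (10,800 tests ≈ $2.70), not an official price.
function estimateCost(tools, promptsPerTool, models, stabilityRuns, perTest = 2.7 / 10800) {
  const tests = tools * promptsPerTool * models * stabilityRuns;
  return { tests, dollars: tests * perTest };
}
```

For example, `estimateCost(12, 100, 3, 3)` reproduces the 10,800-test, roughly $2.70 run described in the FAQ below.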
Estimated comparison (illustrative)
Budget Controls
Stop Guessing. Start Testing.
Your WebMCP tools are either ready for the real world — or they're not. Get a free expert analysis and find out.
Free during early access. No credit card required.
Questions Before You Test
How much does a test run cost?
Fully transparent. A typical test run for a medium site (12 tools, 100 prompts, 3 models, 3 stability runs) costs approximately $2.70. You see the estimated cost BEFORE you click "Run." Every plan includes monthly budget caps. For development, use Ollama with local models for zero-cost testing.
LLM outputs are non-deterministic. How do you avoid flaky results?
This is the #1 concern we designed for. Every prompt runs 3× per model for stability scoring. A prompt that passes 2/3 times is flagged as "flaky" — not "passing." Our CI integration includes three flakiness policies (strict, standard, relaxed). The result: false-positive rate below 2%.
Can we test against our own or local models?
Yes. Enterprise plans support custom model endpoints. For development, we support Ollama out of the box — test against Llama 3, Mistral, Qwen at zero cost. Results from local models are labeled separately since they may differ from commercial models.
Is this worth it for a site with only a few tools?
Absolutely — and arguably MORE useful. Even 2 tools × 50 prompts × 3 models × 3 stability runs = 900 test cases. Model-specific quirks affect every tool regardless. Small implementations often have the MOST to gain because each tool carries more weight.
How is this different from manually testing prompts in ChatGPT?
Testing in ChatGPT gives you ONE data point: one prompt, one model, one run. Prompt Coverage Testing gives you THOUSANDS: 200+ prompts × 3 models × 3 stability runs, with automated failure categorization, root cause analysis, fix recommendations, and regression tracking over time.
Does this work with declarative (HTML attribute) tools?
Works identically. Whether you define tools via the imperative API (registerTool) or via HTML attributes (toolname, tooldescription), we evaluate the same thing. Declarative tools often score LOWER initially because form labels tend to be shorter. Our recommendations include specific attribute improvements.