io.github.dbsectrainer/mcp-eval-runner
A standardized testing harness for MCP servers and agent workflows
# MCP Eval Runner
A standardized testing harness for MCP servers and agent workflows. Define test cases as YAML fixtures (steps → expected tool calls → expected outputs), run regression suites directly from your MCP client, and get pass/fail results with diffs — without leaving Claude Code or Cursor.
[Tool reference](#tools) | [Configuration](#configuration) | [Fixture format](#fixture-format) | [Contributing](#contributing) | [Troubleshooting](#troubleshooting) | [Design principles](#design-principles)
## Key features
- **YAML fixtures**: Test cases are plain files in version control — diffable, reviewable, and shareable.
- **Two execution modes**: Live mode spawns a real MCP server and calls tools via stdio; simulation mode runs assertions against `expected_output` without a server.
- **Composable assertions**: Combine `output_contains`, `output_not_contains`, `output_equals`, `output_matches`, `schema_match`, `tool_called`, and `latency_under` per step.
- **Step output piping**: Reference a previous step's output in downstream inputs via `{{steps.<step_id>.output}}`.
- **Regression reports**: Compare the current run to any past run and surface what changed.
- **Watch mode**: Automatically reruns the affected fixture when files change.
- **CI-ready**: Includes a GitHub Action for running evals on every config change.
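
As a sketch of how these features combine, the following hypothetical fixture chains two steps with output piping and layers several assertions on the second step (the tool names `fetch_doc` and `summarize` are illustrative, not part of this package, and the `latency_under` unit is assumed to be milliseconds):

```yaml
name: piping-demo
description: "Pipe one step's output into the next, with combined assertions"
steps:
  - id: fetch
    tool: fetch_doc                # hypothetical tool
    input: { url: "https://example.com" }
  - id: summarize
    tool: summarize                # hypothetical tool
    input:
      text: "{{steps.fetch.output}}"   # pipe the previous step's output
    expect:
      output_contains: "example"
      output_not_contains: "error"
      latency_under: 2000              # assumed milliseconds
```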
## Requirements
- Node.js v22.5.0 or newer.
- npm.
## Getting started
Add the following config to your MCP client:
```json
{
  "mcpServers": {
    "eval-runner": {
      "command": "npx",
      "args": ["-y", "mcp-eval-runner@latest"]
    }
  }
}
```
By default, eval fixtures are loaded from `./evals/` in the current working directory. To use a different path:
```json
{
  "mcpServers": {
    "eval-runner": {
      "command": "npx",
      "args": ["-y", "mcp-eval-runner@latest", "--fixtures=~/my-project/evals"]
    }
  }
}
```
### MCP Client configuration
The server works with any MCP client, including Amp, Claude Code, Cline, Cursor, VS Code, Windsurf, and Zed.
## Your first prompt
Create a file at `evals/smoke.yaml`. Use **live mode** (recommended) by including a `server` block:
```yaml
name: smoke
description: "Verify eval runner itself is working"
server:
command: node
args: ["dist/index.js"]
steps:
- id: list_check
description: "List available test cases"
tool: list_cases
input: {}
expect:
output_contains: "smoke"
```
Then enter the following in your MCP client:
```
Run the eval suite.
```
Your client should return a pass/fail result for the smoke test.
## Fixture format
Fixtures are YAML (or JSON) files placed in the fixtures directory. Each file defines one test case.
### Top-level fields
| Field | Required | Description |
| ------------- | -------- | ----------------------------------------------------------------------------------------- |
| `name` | Yes | Unique name for the test case |
| `description` | No | Human-readable description |
| `server` | No | Server config — if present, runs in **live mode**; if absent, runs in **simulation mode** |
| `steps` | Yes | Array of steps to execute |
### `server` block (live mode)
```yaml
server:
command: node # executable to spawn
args: ["dist/index.js"] # arguments
env: # optional environment variables
MY_VAR: "value"
```
When `server` is present the eval runner spawns the server as a child process, connects via MCP stdio transport, and calls each step's tool against the live server.
### `steps` array
Each step has the following fields:
| Field | Required | Description |
| ----------------- | -------- | ------------------------------------------------------------- |
| `id` | Yes | Unique identifier within the fixture (used for output piping) |
| `tool` | Yes | MCP tool name to call |
| `description` | No | Human-readable step description |
| `input` | No | Key-value map of arguments passed to the tool (default: `{}`) |
| `expected_output` | No | Literal string used as output in simulation mode |
| `expect` | No | Assertions evaluated against the step output |
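
For example, a single step might use all of these fields together (the tool name `say_hello` is illustrative):

```yaml
- id: greet
  description: "Check the greeting tool"
  tool: say_hello                  # hypothetical tool name
  input: { name: "Ada" }
  expected_output: "Hello, Ada!"   # used as the step output in simulation mode
  expect:
    output_contains: "Ada"
```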
### Execution modes
**Live mode** — fixture has a `server` block:
- The server is spawned and each step calls the named tool via MCP stdio.
- Assertions run against the real tool response.
- Errors from the server cause the step (and by default the case) to fail immediately.
**Simulation mode** — no `server` block:
- No server is started.
- Each step's output is taken from `expected_output`, and assertions in `expect` run against that value.
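
A minimal simulation-mode fixture therefore needs no `server` block at all; a sketch (the tool name `get_status` is illustrative, and since no server is spawned the tool is never actually called):

```yaml
name: sim-demo
description: "Runs without spawning a server"
steps:
  - id: static_check
    tool: get_status               # hypothetical tool; not called in simulation mode
    input: {}
    expected_output: "status: ok"
    expect:
      output_contains: "ok"
```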