io.github.dbsectrainer/mcp-eval-runner

A standardized testing harness for MCP servers and agent workflows


# MCP Eval Runner

A standardized testing harness for MCP servers and agent workflows. Define test cases as YAML fixtures (steps → expected tool calls → expected outputs), run regression suites directly from your MCP client, and get pass/fail results with diffs — without leaving Claude Code or Cursor.

[Tool reference](#tools) | [Configuration](#configuration) | [Fixture format](#fixture-format) | [Contributing](#contributing) | [Troubleshooting](#troubleshooting) | [Design principles](#design-principles)

## Key features

- **YAML fixtures**: Test cases are plain files in version control — diffable, reviewable, and shareable.
- **Two execution modes**: Live mode spawns a real MCP server and calls tools via stdio; simulation mode runs assertions against `expected_output` without a server.
- **Composable assertions**: Combine `output_contains`, `output_not_contains`, `output_equals`, `output_matches`, `schema_match`, `tool_called`, and `latency_under` per step.
- **Step output piping**: Reference a previous step's output in downstream inputs via `{{steps.<step_id>.output}}`.
- **Regression reports**: Compare the current run to any past run and surface what changed.
- **Watch mode**: Automatically reruns the affected fixture when files change.
- **CI-ready**: Includes a GitHub Action for running evals on every config change.
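
For example, output piping lets a later step consume an earlier step's result. A minimal sketch (the tool names here are illustrative, not part of this package):

```yaml
steps:
  - id: fetch
    tool: get_user # hypothetical tool
    input: { id: 42 }
  - id: format
    tool: render_profile # hypothetical tool
    input:
      user: "{{steps.fetch.output}}" # pipes the output of step `fetch` in
```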

## Requirements

- Node.js v22.5.0 or newer.
- npm.

## Getting started

Add the following config to your MCP client:

```json
{
  "mcpServers": {
    "eval-runner": {
      "command": "npx",
      "args": ["-y", "mcp-eval-runner@latest"]
    }
  }
}
```

By default, eval fixtures are loaded from `./evals/` in the current working directory. To use a different path:

```json
{
  "mcpServers": {
    "eval-runner": {
      "command": "npx",
      "args": ["-y", "mcp-eval-runner@latest", "--fixtures=~/my-project/evals"]
    }
  }
}
```

### MCP client configuration

The config above applies across supported clients: Amp, Claude Code, Cline, Cursor, VS Code, Windsurf, and Zed.

## Your first prompt

Create a file at `evals/smoke.yaml`. Use **live mode** (recommended) by including a `server` block:

```yaml
name: smoke
description: "Verify eval runner itself is working"
server:
  command: node
  args: ["dist/index.js"]
steps:
  - id: list_check
    description: "List available test cases"
    tool: list_cases
    input: {}
    expect:
      output_contains: "smoke"
```

Then enter the following in your MCP client:

```
Run the eval suite.
```

Your client should return a pass/fail result for the smoke test.

## Fixture format

Fixtures are YAML (or JSON) files placed in the fixtures directory. Each file defines one test case.

### Top-level fields

| Field         | Required | Description                                                                               |
| ------------- | -------- | ----------------------------------------------------------------------------------------- |
| `name`        | Yes      | Unique name for the test case                                                             |
| `description` | No       | Human-readable description                                                                |
| `server`      | No       | Server config — if present, runs in **live mode**; if absent, runs in **simulation mode** |
| `steps`       | Yes      | Array of steps to execute                                                                 |

### `server` block (live mode)

```yaml
server:
  command: node # executable to spawn
  args: ["dist/index.js"] # arguments
  env: # optional environment variables
    MY_VAR: "value"
```

When `server` is present the eval runner spawns the server as a child process, connects via MCP stdio transport, and calls each step's tool against the live server.

### `steps` array

Each step has the following fields:

| Field             | Required | Description                                                   |
| ----------------- | -------- | ------------------------------------------------------------- |
| `id`              | Yes      | Unique identifier within the fixture (used for output piping) |
| `tool`            | Yes      | MCP tool name to call                                         |
| `description`     | No       | Human-readable step description                               |
| `input`           | No       | Key-value map of arguments passed to the tool (default: `{}`) |
| `expected_output` | No       | Literal string used as output in simulation mode              |
| `expect`          | No       | Assertions evaluated against the step output                  |
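
Assertions under `expect` can be combined per step. A sketch, assuming `output_matches` takes a regular expression and `latency_under` takes milliseconds (the tool name is illustrative):

```yaml
steps:
  - id: search_check
    tool: search_docs # hypothetical tool
    input: { query: "install" }
    expect:
      output_contains: "npm"
      output_not_contains: "error"
      output_matches: "npm (install|i)" # assumed: regex string
      latency_under: 2000 # assumed: milliseconds
```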

### Execution modes

**Live mode** — fixture has a `server` block:

- The server is spawned and each step calls the named tool via MCP stdio.
- Assertions run against the real tool response.
- Errors from the server cause the step (and by default the case) to fail immediately.

**Simulation mode** — no `server` block:

- No server is started.
- Each step's output is taken from its `expected_output` field, and the assertions in `expect` run against that literal string.
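
A minimal simulation-mode fixture under these rules (the tool name is illustrative; since no server is spawned, the tool is recorded but never invoked):

```yaml
name: sim-example
description: "Runs without a server; output comes from expected_output"
steps:
  - id: greet
    tool: say_hello # not actually called in simulation mode
    input: { name: "Ada" }
    expected_output: "Hello, Ada!"
    expect:
      output_contains: "Ada"
```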