io.github.TickTockBent/charlotte

Renders web pages into structured, agent-readable representations using headless Chromium.

109MITdevtools

Install

Config snippet generator goes here (5 client tabs)

README

# Charlotte

**The Web, Readable.**

Your AI agent spends 60,000 tokens just to look at a web page. Charlotte does it in 336.

Charlotte is an MCP server that gives AI agents structured, token-efficient access to the web.
Instead of dumping the full accessibility tree on every call, Charlotte returns only what
the agent needs: a compact page summary on arrival, targeted queries for specific elements,
and full detail only when explicitly requested. The result is 25-182x less data per page
compared to [Playwright MCP](https://github.com/anthropics/playwright-mcp), saving thousands of dollars across production workloads.

## Why Charlotte?

Most browser MCP servers dump the entire accessibility tree on every call — a flat text blob that can exceed a million characters on content-heavy pages. Agents pay for all of it whether they need it or not.

Charlotte decomposes each page into a typed, structured representation — landmarks, headings, interactive elements, forms, content summaries — and lets agents control how much they receive with three detail levels. When an agent navigates to a new page, it gets a compact orientation (336 characters for Hacker News) instead of the full element dump (61,000+ characters). When it needs specifics, it asks for them.

### Benchmarks

Charlotte v0.5.1 vs Playwright MCP, measured by characters returned per tool call on real websites:

**Navigation** (first contact with a page):

| Site | Charlotte `navigate` | Playwright `browser_navigate` |
|:---|---:|---:|
| example.com | 612 | 817 |
| Wikipedia (AI article) | 7,667 | 1,040,636 |
| Hacker News | 336 | 61,230 |
| GitHub repo | 3,185 | 80,297 |

Charlotte's `navigate` returns minimal detail by default — landmarks, headings, and interactive element counts grouped by page region. Enough to orient, not enough to overwhelm. On Wikipedia, that's **135x smaller** than Playwright's response.

**Tool definition overhead** (invisible cost per API call):

| Profile | Tools | Def. tokens/call | Savings vs full |
|:---|---:|---:|---:|
| full | 42 | ~7,400 | — |
| browse (default) | 23 | ~3,900 | **~47%** |
| core | 7 | 1,677 | **~77%** |

Tool definitions are sent on every API round-trip. With the default `browse` profile, Charlotte carries ~47% less definition overhead than loading all tools. Over a 20-call browsing session, that's **~38% fewer total tokens**. See the [profile benchmark report](docs/charlotte-profile-benchmark-report.md) for full results.

**The workflow difference:** Playwright agents receive 61K+ characters every time they look at Hacker News, whether they're reading headlines or looking for a login button. Charlotte agents get 336 characters on arrival, call `find({ type: "link", text: "login" })` to get exactly what they need, and never pay for the rest.

## How It Works

Charlotte maintains a persistent headless Chromium session and acts as a translation layer between the visual web and the agent's text-native reasoning. Every page is decomposed into a structured representation:

```
┌─────────────┐     MCP Protocol     ┌──────────────────┐
│   AI Agent  │<────────────────────>│    Charlotte     │
└─────────────┘                      │                  │
                                     │  ┌────────────┐  │
                                     │  │  Renderer  │  │
                                     │  │  Pipeline  │  │
                                     │  └─────┬──────┘  │
                                     │        │         │
                                     │  ┌─────▼──────┐  │
                                     │  │  Headless  │  │
                                     │  │  Chromium  │  │
                                     │  └────────────┘  │
                                     └──────────────────┘
```

Agents receive landmarks, headings, interactive elements with typed metadata, bounding boxes, form structures, and content summaries — all derived from what the browser already knows about every page.

## Features

**Navigation** — `navigate`, `back`, `forward`, `reload`

**Observation** — `observe` (3 detail levels, structural tree view), `find` (spatial + semantic search, CSS selector mode), `screenshot` (with persistent artifact management), `screenshots`, `screenshot_get`, `screenshot_delete`, `diff` (structural comparison against snapshots)

**Interaction** — `click`, `click_at` (coordinate-based), `type`, `select`, `toggle`, `submit`, `scroll`, `hover`, `drag`, `key` (single/sequence with element targeting), `wait_for` (async condition polling), `upload` (file input), `dialog` (accept/dismiss JS dialogs)

**Monitoring** — `console` (all severity levels, filtering, timestamps), `requests` (full HTTP history, method/status/resource type filtering)

**Session Management** — `tabs`, `tab_open`, `tab_switch`, `tab_close`, `viewport` (device presets), `network` (throttling, URL blocking), `set_cookies`, `get_cookies`, `clear_cookies`, `set_headers`, `configure`

**Development Mode** — `dev_se