io.github.jghiringhelli/codeseeker

Graph-powered code intelligence with semantic search and knowledge graph for AI assistants

# CodeSeeker

**Four-layer hybrid search and knowledge graph for AI coding assistants.**  
BM25 + vector embeddings + RAPTOR directory summaries + graph expansion — fused into a single MCP tool that gives Claude, Copilot, and Cursor a real understanding of your codebase.

[![npm version](https://img.shields.io/npm/v/codeseeker.svg)](https://www.npmjs.com/package/codeseeker)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![TypeScript](https://img.shields.io/badge/TypeScript-100%25-blue.svg)](https://www.typescriptlang.org/)

Works with **Claude Code**, **GitHub Copilot** (VS Code 1.99+), **Cursor**, **Windsurf**, and **Claude Desktop**.  
Zero configuration — indexes on first use, stays in sync automatically.

## The Problem

AI assistants are powerful editors, but they navigate code like tourists:
- **Grep finds text, not meaning.** A request like `"find authentication logic"` becomes a text search that returns every file containing the string "auth".
- **File reads are isolated.** Claude sees a file but not its dependencies, callers, or the patterns your team established.
- **No memory of your project.** Every session starts from scratch.

CodeSeeker fixes this. It indexes your codebase once and gives AI assistants a queryable knowledge graph they can use on every turn.

## How It Works

A 4-stage pipeline runs on every query:

```
Query: "find JWT refresh token logic"
        │
        ▼  Stage 1 — Hybrid retrieval
   ┌─────────────────────────────────────────────────────┐
   │ BM25 (exact symbols, camelCase tokenized)           │
   │   +                                                 │
   │ Vector search (384-dim Xenova embeddings)           │
   │   ↓                                                 │
   │ Reciprocal Rank Fusion: score = Σ 1/(60 + rank_i)   │
   │ Top-30 results, including RAPTOR directory nodes    │
   └─────────────────────────────────────────────────────┘
        │
        ▼  Stage 2 — RAPTOR cascade (conditional)
   ┌─────────────────────────────────────────────────────┐
   │ IF best directory-summary score ≥ 0.5:              │
   │   → narrow results to that directory automatically  │
   │ ELSE: all 30 results pass through unchanged         │
   │ Effect: "what does auth/ do?" scopes to auth/       │
   │         "jwt.ts decode function" bypasses this      │
   └─────────────────────────────────────────────────────┘
        │
        ▼  Stage 3 — Scoring and deduplication
   ┌─────────────────────────────────────────────────────┐
   │ Dedup: keep highest-score chunk per file            │
   │ Source files:  +0.10  (definition sites matter)     │
   │ Test files:    −0.15  (prevent test dominance)      │
   │ Symbol boost:  +0.20  (query token in filename)     │
   │ Multi-chunk:   up to +0.30  (file has many hits)    │
   └─────────────────────────────────────────────────────┘
        │
        ▼  Stage 4 — Graph expansion
   ┌─────────────────────────────────────────────────────┐
   │ Top-10 results → follow IMPORTS/CALLS/EXTENDS edges │
   │ Structural neighbors scored at source × 0.7         │
   │ Avg graph connectivity: 20.8 edges/node             │
   └─────────────────────────────────────────────────────┘
        │
        ▼
   auth/jwt.ts (0.94), auth/refresh.ts (0.89), ...
```
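
The fusion step in Stage 1 can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not CodeSeeker's actual source: the function name `reciprocalRankFusion`, the sample file lists, and the 1-based ranking are assumptions made for illustration. Only the constant 60 and the top-30 cutoff come from the diagram.

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank) per item.
// Lists are ordered best-first; ranks are assumed to be 1-based.
function reciprocalRankFusion(
  lists: string[][],
  k = 60, // constant from the diagram
  topN = 30, // Stage 1 keeps the top 30 results
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
}

// Hypothetical BM25 and vector result lists for the query in the diagram.
const bm25Hits = ["auth/jwt.ts", "auth/refresh.ts", "tests/jwt.test.ts"];
const vectorHits = ["auth/jwt.ts", "auth/session.ts", "auth/refresh.ts"];

const fused = reciprocalRankFusion([bm25Hits, vectorHits]);
console.log(fused.map((r) => `${r.id} ${r.score.toFixed(4)}`));
```

A file that ranks well in both lists outscores one that ranks first in only one of them, which is the point of fusing the two retrievers. Raw RRF sums are small (at most 2/61 with two lists), so the 0–1 scores shown in the diagram presumably involve a rescaling step that this sketch does not attempt to reproduce.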

The knowledge graph is built from AST-parsed imports at index time. It's what powers `analyze dependencies`, dead-code detection, and graph expansion in every search.
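
As a rough illustration of Stage 4, the sketch below expands the top results along graph edges and scores each newly reached neighbor at 0.7 times the score of the result that led to it. The types, the function name `expandWithGraph`, and the sample edges are hypothetical. Two behaviors are assumptions rather than documented facts: edges are followed in both directions, and files already in the result set keep their original score.

```typescript
type EdgeKind = "IMPORTS" | "CALLS" | "EXTENDS";
interface Edge { from: string; to: string; kind: EdgeKind }
interface Scored { file: string; score: number }

function expandWithGraph(
  results: Scored[],
  edges: Edge[],
  topK = 10, // only the top-10 results are expanded
  decay = 0.7, // neighbors inherit source score × 0.7
): Scored[] {
  const byScore = (a: Scored, b: Scored) => b.score - a.score;
  const original = new Set(results.map((r) => r.file));
  const neighbors = new Map<string, number>();

  for (const seed of results.slice().sort(byScore).slice(0, topK)) {
    for (const edge of edges) {
      // Assumption: an edge connects its endpoints in both directions.
      let neighbor: string | undefined;
      if (edge.from === seed.file) neighbor = edge.to;
      else if (edge.to === seed.file) neighbor = edge.from;
      if (neighbor === undefined || original.has(neighbor)) continue;

      // Keep the best inherited score when several seeds reach the same file.
      const inherited = seed.score * decay;
      if (inherited > (neighbors.get(neighbor) ?? 0)) neighbors.set(neighbor, inherited);
    }
  }

  const added = Array.from(neighbors, ([file, score]) => ({ file, score }));
  return results.concat(added).sort(byScore);
}

// Hypothetical search results and edges.
const expanded = expandWithGraph(
  [
    { file: "auth/jwt.ts", score: 0.94 },
    { file: "auth/refresh.ts", score: 0.89 },
  ],
  [
    { from: "auth/refresh.ts", to: "auth/jwt.ts", kind: "IMPORTS" },
    { from: "auth/jwt.ts", to: "auth/keys.ts", kind: "IMPORTS" },
    { from: "middleware/requireAuth.ts", to: "auth/refresh.ts", kind: "CALLS" },
  ],
);
console.log(expanded.map((r) => `${r.file} ${r.score.toFixed(2)}`));
```

In this example `auth/keys.ts` and `middleware/requireAuth.ts` never matched the query text, yet they surface below the direct hits because they are structurally adjacent to them.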

## What Makes It Different

| Approach | Strengths | Limitations |
|----------|-----------|-------------|
| **Grep / ripgrep** | Fast, universal | No semantic understanding |
| **Vector search only** | Finds similar code | Misses structural relationships |
| **Serena** | Precise LSP symbol navigation, 30+ languages | No semantic search, no cross-file reasoning |
| **Codanna** | Fast symbol lookup, good call graphs | Semantic search needs JSDoc, so undocumented code gets no embeddings; no BM25 or RAPTOR; Windows support is experimental |
| **CodeSeeker** | BM25 + embedding fusion + RAPTOR + graph + coding standards + multi-language AST | Requires initial indexing (30s–5min) |

**What LSP tools can't do:**
- *"Find code that handles errors like this"* → semantic pattern search
- *"What validation approach does this project use?"* → auto-detected coding standards
- *"Show me everything related to authentication"* → graph traversal across indirect dependencies

**What vector-only search misses:**
- Direct import/export chains
- Class inheritance hierarchies
- Which files actually depend on which
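
To make the last point concrete, here is a small sketch of answering "which files depend on this one?" from an import map, something embedding similarity alone cannot answer. The function `transitiveDependents` and the file names are hypothetical and are not part of CodeSeeker's API.

```typescript
// importsOf maps each file to the files it imports directly.
function transitiveDependents(
  importsOf: Map<string, string[]>,
  target: string,
): string[] {
  // Invert the import edges: file -> files that import it.
  const importedBy = new Map<string, string[]>();
  importsOf.forEach((deps, file) => {
    for (const dep of deps) {
      const list = importedBy.get(dep) ?? [];
      list.push(file);
      importedBy.set(dep, list);
    }
  });

  // Breadth-first walk over the inverted edges; `seen` guards against cycles.
  const seen = new Set<string>();
  const queue: string[] = [target];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const dependent of importedBy.get(current) ?? []) {
      if (dependent !== target && !seen.has(dependent)) {
        seen.add(dependent);
        queue.push(dependent);
      }
    }
  }
  return Array.from(seen).sort();
}

// Hypothetical import graph.
const importsOf = new Map<string, string[]>([
  ["routes/login.ts", ["auth/session.ts"]],
  ["auth/session.ts", ["auth/jwt.ts"]],
  ["auth/refresh.ts", ["auth/jwt.ts"]],
  ["auth/jwt.ts", []],
]);

console.log(transitiveDependents(importsOf, "auth/jwt.ts"));
// → ["auth/refresh.ts", "auth/session.ts", "routes/login.ts"]
```

`routes/login.ts` never imports `auth/jwt.ts` directly, so it only shows up once the import chain is followed through `auth/session.ts`.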

## Installation

### Recommended: npx (no install needed)

The standard way to configure any MCP server — no global install required:

```json
{
  "mcpServers": {
    "codeseeker": {
      "command": "npx",
      "args": ["-y", "codeseeker", "serve", "--mcp"]
    }
  }
}
```

Add this to your MCP config file ([see below](#advanced-installation-options) for per-client locations) and restart your editor.
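
If your config file already lists other MCP servers, the `codeseeker` entry should be added alongside them rather than replacing the whole `mcpServers` object. A minimal sketch of that merge follows, assuming the config has already been parsed into an object. The `McpConfig` shape, the helper name `withCodeSeeker`, and the existing `other` server are illustrative assumptions; only the `codeseeker` entry itself comes from the snippet above.

```typescript
interface McpServer { command: string; args: string[] }
interface McpConfig {
  mcpServers?: Record<string, McpServer>;
  [key: string]: unknown; // any other client settings are preserved as-is
}

// Returns a copy of the config with the codeseeker server added or updated.
function withCodeSeeker(config: McpConfig): McpConfig {
  return {
    ...config,
    mcpServers: {
      ...(config.mcpServers ?? {}),
      codeseeker: {
        command: "npx",
        args: ["-y", "codeseeker", "serve", "--mcp"],
      },
    },
  };
}

// Hypothetical existing config with one other server.
const existingConfig: McpConfig = {
  mcpServers: { other: { command: "node", args: ["server.js"] } },
};

console.log(JSON.stringify(withCodeSeeker(existingConfig), null, 2));
```

The helper returns a new object and leaves the input untouched, so the result can be inspected before anything is written back to disk.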

### npm global install

```bash
npm install -g codeseeker
codesee