Discovery Engine

Find novel, statistically validated patterns in tabular data — hypothesis-free.


# Disco

**Find novel, statistically validated patterns in tabular data** — feature interactions, subgroup effects, and conditional relationships that correlation analysis and LLMs miss.

[![PyPI](https://img.shields.io/pypi/v/discovery-engine-api)](https://pypi.org/project/discovery-engine-api/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

Made by [Leap Laboratories](https://www.leap-labs.com).

---

## What it actually does

Most data analysis starts with a question. Disco starts with the data.

It does not need to be told which variables matter. It searches for combinations of feature conditions that significantly shift your target column, such as "patients aged 45–65 with low HDL *and* high CRP have 3× the readmission rate", without you having to hypothesise that interaction first.

Each pattern is:
- **Validated on a hold-out set**: the effect must reappear in data the search never saw, which lowers the risk that it is an artefact of overfitting
- **FDR-corrected**: p-values are included and adjusted for multiple testing
- **Checked against academic literature**: to help you understand what you've found and whether it is novel

The output is structured: conditions, effect sizes, p-values, citations, and a novelty classification for every pattern found.
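To make "FDR-corrected" concrete, here is a minimal sketch of the Benjamini-Hochberg procedure, one common way to control the false discovery rate. This README does not say which procedure Disco uses, so treat the function below (`benjamini_hochberg`, a name chosen for illustration) and the example p-values as assumptions, not a description of Disco's internals.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented p-values, for illustration only.
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(raw)

for p, q in zip(raw, adjusted):
    print(f"raw={p:.3f}  adjusted={q:.5f}")
```

With these invented inputs, four raw p-values fall below 0.05 but only two adjusted values do, which is why adjustment matters when many candidate patterns are tested at once.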

**Use it when you're asking:** "which variables matter most for X?", "are there patterns we're missing?", "where do I start with this data?", "how do A and B together affect C?"

**Not for:** summary statistics, visualisation, filtering, or SQL queries. Use pandas for those.

---

## Quickstart

```bash
pip install discovery-engine-api
```

Get an API key:

```bash
# Step 1: request verification code (no password, no card)
curl -X POST https://disco.leap-labs.com/api/signup \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com"}'

# Step 2: submit code from email → get key
curl -X POST https://disco.leap-labs.com/api/signup/verify \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com", "code": "123456"}'
# → {"key": "disco_...", "credits": 10, "tier": "free_tier"}
```

Or create a key at [disco.leap-labs.com/docs](https://disco.leap-labs.com/docs).

Run your first analysis:

```python
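import asyncio

from discovery import Engine


async def main():
    engine = Engine(api_key="disco_...")
    # discover() is awaited, so it has to run inside an async function.
    result = await engine.discover(
        file="data.csv",
        target_column="outcome",
    )

    # Print only novel patterns that are significant at the 5% level.
    for pattern in result.patterns:
        if pattern.p_value < 0.05 and pattern.novelty_type == "novel":
            print(f"{pattern.description} (p={pattern.p_value:.4f})")

    print(f"Explore: {result.report_url}")


# In a notebook, top-level await works and this wrapper can be skipped.
asyncio.run(main())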
```

Runs take 3–15 minutes. `discover()` polls automatically and logs progress — queue position, estimated wait, current pipeline step, and ETA. For background runs, see [Running asynchronously](#running-asynchronously).

→ [Full Python SDK reference](docs/python-sdk.md) · [Example notebook](notebooks/quickstart.ipynb)

---

## What you get back

Each `Pattern` in `result.patterns` looks like this (real output from a crop yield dataset):

```python
Pattern(
    description="When humidity is between 72–89% AND wind speed is below 12 km/h, "
                "crop yield increases by 34% above the dataset average",
    conditions=[
        {"type": "continuous", "feature": "humidity_pct",
         "min_value": 72.0, "max_value": 89.0},
        {"type": "continuous", "feature": "wind_speed_kmh",
         "min_value": 0.0, "max_value": 12.0},
    ],
    p_value=0.003,              # FDR-corrected
    novelty_type="novel",
    novelty_explanation="Published studies examine humidity and wind speed as independent "
                        "predictors, but this interaction effect — where low wind amplifies "
                        "the benefit of high humidity within a specific range — has not been "
                        "reported in the literature.",
    citations=[
        {"title": "Effects of relative humidity on cereal crop productivity",
         "authors": ["Zhang, L.", "Wang, H."], "year": "2021",
         "journal": "Journal of Agricultural Science"},
    ],
    target_change_direction="max",
    abs_target_change=0.34,     # 34% increase
    support_count=847,          # rows matching this pattern
    support_percentage=16.9,
)
```

Key things to notice:

- **Patterns are combinations of conditions** — humidity AND wind speed together, not just "more humidity is better"
- **Specific thresholds** — 72–89%, not a vague correlation
- **Novel vs confirmatory** — every pattern is classified; confirmatory ones validate known science, novel ones are what you came for
- **Citations** — shows what IS known, so you can see what's genuinely new
- **`report_url`** links to an interactive web report with all patterns visualised
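Because conditions come back as plain dictionaries, you can re-apply a pattern to your own rows, for example to inspect the matching subgroup. The sketch below is not part of the SDK: `matches` is a hypothetical helper, the rows are invented, it handles only the `"continuous"` condition type shown above, and it assumes both bounds are inclusive, which this README does not specify.

```python
def matches(row, conditions):
    """Return True if the row satisfies every condition (inclusive bounds assumed)."""
    for cond in conditions:
        if cond["type"] != "continuous":
            # Other condition types are not shown in this README, so fail loudly.
            raise ValueError(f"unhandled condition type: {cond['type']}")
        value = row[cond["feature"]]
        if not (cond["min_value"] <= value <= cond["max_value"]):
            return False
    return True


# Conditions copied from the example pattern above.
conditions = [
    {"type": "continuous", "feature": "humidity_pct",
     "min_value": 72.0, "max_value": 89.0},
    {"type": "continuous", "feature": "wind_speed_kmh",
     "min_value": 0.0, "max_value": 12.0},
]

# Hypothetical rows, invented for illustration only.
rows = [
    {"humidity_pct": 80.0, "wind_speed_kmh": 5.0},   # satisfies both conditions
    {"humidity_pct": 80.0, "wind_speed_kmh": 20.0},  # too windy
    {"humidity_pct": 60.0, "wind_speed_kmh": 5.0},   # too dry
]

matching = [row for row in rows if matches(row, conditions)]
print(f"{len(matching)} of {len(rows)} rows match")
```

On a real dataset, the number of matching rows should be close to the pattern's `support_count` if the inclusive-bounds assumption holds.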

`result.summary` gives an LLM-generated narrative overview:

```python
result.summary.overview
# "Disco identified 14 statistically significant patterns. 5 are novel.
#  The strongest driver is a previously unreported interaction between humidity
#  and wind speed at specific thresholds."

resu