Discovery Engine
Find novel, statistically validated patterns in tabular data — hypothesis-free.
# Disco
**Find novel, statistically validated patterns in tabular data** — feature interactions, subgroup effects, and conditional relationships that correlation analysis and LLMs miss.
[PyPI](https://pypi.org/project/discovery-engine-api/)
[License](LICENSE)
Made by [Leap Laboratories](https://www.leap-labs.com).
---
## What it actually does
Most data analysis starts with a question. Disco starts with the data.
It makes no prior assumptions about which features matter. Instead, it searches for combinations of feature conditions that significantly shift your target column (for example, "patients aged 45–65 with low HDL *and* high CRP have 3× the readmission rate"), so you don't need to hypothesise the interaction first.
Each pattern is:
- **Validated on a hold-out set** — increases the chance of generalisation
- **FDR-corrected** — p-values included, adjusted for multiple testing
- **Checked against academic literature** — to help you understand what you've found, and identify if it is novel.
The output is structured: conditions, effect sizes, p-values, citations, and a novelty classification for every pattern found.
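To make "FDR-corrected" concrete, here is a minimal sketch of the Benjamini-Hochberg procedure, one common way to adjust p-values for multiple testing. The source does not say which FDR method Disco uses, so treat the choice of Benjamini-Hochberg, the function name `benjamini_hochberg`, and the example p-values as assumptions made for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative raw p-values (made up, not Disco output)
raw = [0.001, 0.008, 0.039, 0.041, 0.20]
adj = benjamini_hochberg(raw)
significant = [p for p in adj if p < 0.05]
print(adj)
print(f"{len(significant)} of {len(raw)} remain significant at 0.05")
```

Four of the five raw values are below 0.05, but only two survive the adjustment, which is why corrected p-values matter when many candidate patterns are tested at once.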
**Use it when:** "which variables are most important with respect to X", "are there patterns we're missing?", "I don't know where to start with this data", "I need to understand how A and B affect C".
**Not for:** summary statistics, visualisation, filtering, or SQL queries. Use pandas for those.
---
## Quickstart
```bash
pip install discovery-engine-api
```
Get an API key:
```bash
# Step 1: request verification code (no password, no card)
curl -X POST https://disco.leap-labs.com/api/signup \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com"}'

# Step 2: submit code from email → get key
curl -X POST https://disco.leap-labs.com/api/signup/verify \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com", "code": "123456"}'
# → {"key": "disco_...", "credits": 10, "tier": "free_tier"}
```
Or create a key at [disco.leap-labs.com/docs](https://disco.leap-labs.com/docs).
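If you would rather stay in Python, the same two-step signup can be sketched with the standard library. The endpoints and JSON payloads are taken from the curl commands above. The helper names `build_post` and `send` are hypothetical, and the network calls are commented out so the sketch runs offline.

```python
import json
import urllib.request

BASE_URL = "https://disco.leap-labs.com/api"


def build_post(path, payload):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Step 1: request a verification code
signup_request = build_post("/signup", {"email": "you@example.com"})
# Step 2: submit the code from the email to receive a key
verify_request = build_post(
    "/signup/verify", {"email": "you@example.com", "code": "123456"}
)

# Uncomment to actually call the API:
# send(signup_request)
# api_key = send(verify_request)["key"]
```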
Run your first analysis:
```python
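import asyncio

from discovery import Engine


async def main():
    engine = Engine(api_key="disco_...")
    # discover() is a coroutine, so it must be awaited inside an async function
    result = await engine.discover(
        file="data.csv",
        target_column="outcome",
    )
    # Print only patterns that are both significant and novel
    for pattern in result.patterns:
        if pattern.p_value < 0.05 and pattern.novelty_type == "novel":
            print(f"{pattern.description} (p={pattern.p_value:.4f})")
    print(f"Explore: {result.report_url}")


asyncio.run(main())  # in a notebook, use `await main()` instead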
Runs take 3–15 minutes. `discover()` polls automatically and logs progress — queue position, estimated wait, current pipeline step, and ETA. For background runs, see [Running asynchronously](#running-asynchronously).
→ [Full Python SDK reference](docs/python-sdk.md) · [Example notebook](notebooks/quickstart.ipynb)
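For readers curious what "polls automatically" means in general, the following is a generic polling loop with a timeout. It is not the SDK's implementation: `poll_until_done`, `fake_fetch_status`, and the state names `"running"`, `"completed"`, and `"failed"` are all hypothetical stand-ins chosen for this sketch.

```python
import asyncio


async def poll_until_done(fetch_status, interval_s=5.0, timeout_s=1800.0):
    """Call fetch_status() repeatedly until it reports a terminal state."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while True:
        status = await fetch_status()
        if status["state"] in ("completed", "failed"):
            return status
        if loop.time() >= deadline:
            raise TimeoutError("run did not finish before the timeout")
        # Yield to the event loop between checks instead of busy-waiting
        await asyncio.sleep(interval_s)


# Stand-in for a real status call: "running" twice, then "completed"
_fake_states = iter(["running", "running", "completed"])


async def fake_fetch_status():
    return {"state": next(_fake_states)}


final = asyncio.run(poll_until_done(fake_fetch_status, interval_s=0.01))
print(final)
```

Sleeping with `asyncio.sleep` rather than `time.sleep` keeps the event loop free, so other coroutines can make progress while a run is in the queue.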
---
## What you get back
Each `Pattern` in `result.patterns` looks like this (real output from a crop yield dataset):
```python
Pattern(
    description="When humidity is between 72–89% AND wind speed is below 12 km/h, "
                "crop yield increases by 34% above the dataset average",
    conditions=[
        {"type": "continuous", "feature": "humidity_pct",
         "min_value": 72.0, "max_value": 89.0},
        {"type": "continuous", "feature": "wind_speed_kmh",
         "min_value": 0.0, "max_value": 12.0},
    ],
    p_value=0.003,  # FDR-corrected
    novelty_type="novel",
    novelty_explanation="Published studies examine humidity and wind speed as independent "
                        "predictors, but this interaction effect — where low wind amplifies "
                        "the benefit of high humidity within a specific range — has not been "
                        "reported in the literature.",
    citations=[
        {"title": "Effects of relative humidity on cereal crop productivity",
         "authors": ["Zhang, L.", "Wang, H."], "year": "2021",
         "journal": "Journal of Agricultural Science"},
    ],
    target_change_direction="max",
    abs_target_change=0.34,  # 34% increase
    support_count=847,  # rows matching this pattern
    support_percentage=16.9,
)
```
Key things to notice:
- **Patterns are combinations of conditions** — humidity AND wind speed together, not just "more humidity is better"
- **Specific thresholds** — 72–89%, not a vague correlation
- **Novel vs confirmatory** — every pattern is classified; confirmatory ones validate known science, novel ones are what you came for
- **Citations** — shows what IS known, so you can see what's genuinely new
- **`report_url`** links to an interactive web report with all patterns visualised
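Because conditions are plain dictionaries, you can re-apply a pattern to your own rows and check how many match. The sketch below handles only the `"continuous"` condition type shown in the example output. The helper `matches`, the inclusive treatment of `min_value` and `max_value`, and the sample rows are assumptions for illustration, not part of the SDK.

```python
def matches(row, conditions):
    """Return True if a row (dict) satisfies every continuous condition."""
    for cond in conditions:
        if cond["type"] != "continuous":
            raise ValueError(f"unsupported condition type: {cond['type']}")
        value = row[cond["feature"]]
        # Assumption: both bounds are inclusive
        if not (cond["min_value"] <= value <= cond["max_value"]):
            return False
    return True


conditions = [
    {"type": "continuous", "feature": "humidity_pct",
     "min_value": 72.0, "max_value": 89.0},
    {"type": "continuous", "feature": "wind_speed_kmh",
     "min_value": 0.0, "max_value": 12.0},
]

# Made-up rows for illustration only
rows = [
    {"humidity_pct": 80.0, "wind_speed_kmh": 5.0},   # both conditions hold
    {"humidity_pct": 80.0, "wind_speed_kmh": 20.0},  # too windy
    {"humidity_pct": 60.0, "wind_speed_kmh": 5.0},   # too dry
    {"humidity_pct": 75.0, "wind_speed_kmh": 11.0},  # both conditions hold
]

support = [row for row in rows if matches(row, conditions)]
support_percentage = 100.0 * len(support) / len(rows)
print(len(support), support_percentage)
```

The same logic translates directly to a pandas boolean mask if your data is already in a DataFrame.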
`result.summary` gives an LLM-generated narrative overview:
```python
result.summary.overview
# "Disco identified 14 statistically significant patterns. 5 are novel.
# The strongest driver is a previously unreported interaction between humidity
# and wind speed at specific thresholds."
resu