Disco

Find novel, statistically validated patterns in tabular data — feature interactions, subgroup effects, and conditional relationships that correlation analysis and LLMs miss.

Python · v1.0.1 · PyPI · License: MIT

Made by Leap Laboratories.


What it actually does

Most data analysis starts with a question. Disco starts with the data.

It searches for combinations of feature conditions that significantly shift your target column — things like "patients aged 45–65 with low HDL and high CRP have 3× the readmission rate" — without you needing to hypothesise those interactions first.

Each pattern is:

  • Validated on a hold-out set — improving the chance it generalises
  • FDR-corrected — p-values included, adjusted for multiple testing
  • Checked against academic literature — to help you interpret what you've found and see whether it is novel

The output is structured: conditions, effect sizes, p-values, citations, and a novelty classification for every pattern found.

Use it when you're asking: "which variables matter most for X?", "are there patterns we're missing?", "I don't know where to start with this data", "I need to understand how A and B affect C".

Not for: summary statistics, visualisation, filtering, or SQL queries — use pandas for those.


Quickstart

bash
pip install discovery-engine-api

Get an API key:

bash
# Step 1: request verification code (no password, no card)
curl -X POST https://disco.leap-labs.com/api/signup \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com"}'

# Step 2: submit code from email → get key
curl -X POST https://disco.leap-labs.com/api/signup/verify \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com", "code": "123456"}'
# → {"key": "disco_...", "credits": 10, "tier": "free_tier"}

Or create a key at disco.leap-labs.com/docs.

Run your first analysis:

python
from discovery import Engine

engine = Engine(api_key="disco_...")
result = await engine.discover(
    file="data.csv",
    target_column="outcome",
)

for pattern in result.patterns:
    if pattern.p_value < 0.05 and pattern.novelty_type == "novel":
        print(f"{pattern.description} (p={pattern.p_value:.4f})")

print(f"Explore: {result.report_url}")

Runs take 3–15 minutes. discover() polls automatically and logs progress — queue position, estimated wait, current pipeline step, and ETA. For background runs, see Running asynchronously.

Full Python SDK reference · Example notebook


What you get back

Each Pattern in result.patterns looks like this (real output from a crop yield dataset):

python
Pattern(
    description="When humidity is between 72–89% AND wind speed is below 12 km/h, "
                "crop yield increases by 34% above the dataset average",
    conditions=[
        {"type": "continuous", "feature": "humidity_pct",
         "min_value": 72.0, "max_value": 89.0},
        {"type": "continuous", "feature": "wind_speed_kmh",
         "min_value": 0.0, "max_value": 12.0},
    ],
    p_value=0.003,              # FDR-corrected
    novelty_type="novel",
    novelty_explanation="Published studies examine humidity and wind speed as independent "
                        "predictors, but this interaction effect — where low wind amplifies "
                        "the benefit of high humidity within a specific range — has not been "
                        "reported in the literature.",
    citations=[
        {"title": "Effects of relative humidity on cereal crop productivity",
         "authors": ["Zhang, L.", "Wang, H."], "year": "2021",
         "journal": "Journal of Agricultural Science"},
    ],
    target_change_direction="max",
    abs_target_change=0.34,     # 34% increase
    support_count=847,          # rows matching this pattern
    support_percentage=16.9,
)

Key things to notice:

  • Patterns are combinations of conditions — humidity AND wind speed together, not just "more humidity is better"
  • Specific thresholds — 72–89%, not a vague correlation
  • Novel vs confirmatory — every pattern is classified; confirmatory ones validate known science, novel ones are what you came for
  • Citations — shows what IS known, so you can see what's genuinely new
  • report_url links to an interactive web report with all patterns visualised
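Because each condition is a plain dict with explicit thresholds, you can re-apply a pattern to your own DataFrame and inspect the matching rows. A minimal sketch for the "continuous" condition type only (the helper `matches` is ours, not part of the SDK):

```python
import pandas as pd

def matches(df: pd.DataFrame, conditions: list) -> pd.Series:
    """Boolean mask of rows satisfying all of a pattern's conditions.

    Handles the "continuous" condition type shown above; other types
    would need their own branch.
    """
    mask = pd.Series(True, index=df.index)
    for cond in conditions:
        if cond["type"] == "continuous":
            mask &= df[cond["feature"]].between(cond["min_value"], cond["max_value"])
    return mask

# Toy data standing in for the crop-yield dataset
df = pd.DataFrame({"humidity_pct": [75, 90, 80], "wind_speed_kmh": [5, 5, 20]})
conditions = [
    {"type": "continuous", "feature": "humidity_pct", "min_value": 72.0, "max_value": 89.0},
    {"type": "continuous", "feature": "wind_speed_kmh", "min_value": 0.0, "max_value": 12.0},
]
print(matches(df, conditions).sum())  # → 1 (only the first row satisfies both)
```

On the original data, the mask's sum should line up with the pattern's support_count.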

The result.summary gives an LLM-generated narrative overview:

python
result.summary.overview
# "Disco identified 14 statistically significant patterns. 5 are novel.
#  The strongest driver is a previously unreported interaction between humidity
#  and wind speed at specific thresholds."

result.summary.key_insights
# ["Humidity × low wind speed at 72–89% humidity produces a 34% yield increase — novel.",
#  "Soil nitrogen above 45 mg/kg shows diminishing returns when phosphorus is below 12 mg/kg.",
#  ...]

How it works

Disco is a pipeline, not prompt engineering over data. It:

  1. Trains machine learning models on a subset of your data
  2. Uses interpretability techniques to extract learned patterns
  3. Validates every pattern on the held-out data with FDR correction (Benjamini-Hochberg)
  4. Checks surviving patterns against academic literature via semantic search

You cannot replicate this by writing pandas code or asking an LLM to look at a CSV. It finds structure that hypothesis-driven analysis misses because it doesn't start with hypotheses.
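Step 3 uses the standard Benjamini-Hochberg procedure. For reference, the rejection rule can be sketched in a few lines (an illustration, not the SDK's internals):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected at FDR level alpha (BH procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha ...
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k = rank
    # ... and reject the k smallest p-values
    return sorted(order[:k])

print(benjamini_hochberg([0.003, 0.04, 0.2]))  # → [0]: only the first survives
```

The p-values attached to each Pattern are reported after this adjustment, so filtering on `p_value < 0.05` already accounts for multiple testing.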


Preparing your data

Before running, exclude columns that would produce meaningless findings. Disco finds statistically real patterns — but if the input includes columns that are definitionally related to the target, the patterns will be tautological.

Exclude:

  1. Identifiers — row IDs, UUIDs, patient IDs, sample codes
  2. Data leakage — the target renamed or reformatted (e.g., diagnosis_text when the target is diagnosis_code)
  3. Tautological columns — alternative encodings of the same construct as the target. If target is serious, then serious_outcome, not_serious, death are all part of the same classification. If target is profit, then revenue and cost together compose it. If target is a survey index, the sub-items are tautological.

Full guidance with examples: SKILL.md
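A quick pre-flight pass catches the most mechanical cases before you upload. A heuristic sketch with pandas (`suspect_columns` is our name, not an SDK call; it only flags row-unique identifiers and exact duplicates of the target, so tautological encodings still need human judgement):

```python
import pandas as pd

def suspect_columns(df: pd.DataFrame, target: str) -> dict:
    """Flag columns worth a second look before a run (heuristic only)."""
    flags = {}
    for col in df.columns:
        if col == target:
            continue
        if df[col].nunique(dropna=True) == len(df):
            flags[col] = "identifier? (every value unique)"
        elif df[col].astype(str).equals(df[target].astype(str)):
            flags[col] = "leakage? (identical to target)"
    return flags

df = pd.DataFrame({
    "patient_id": [101, 102, 103, 104],      # identifier
    "diagnosis_code": ["A", "B", "A", "B"],  # target
    "diagnosis_text": ["A", "B", "A", "B"],  # target renamed
})
print(suspect_columns(df, "diagnosis_code"))
```

Anything flagged is a candidate for excluded_columns in the call to discover().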


Parameters

python
await engine.discover(
    file="data.csv",           # path, Path, or pd.DataFrame
    target_column="outcome",   # column to predict/explain
    depth_iterations=2,        # 2=default, higher=deeper (max: num_columns − 2)
    visibility="public",       # "public" (free) or "private" (costs credits)
    column_descriptions={      # improves pattern explanations and literature context
        "bmi": "Body mass index",
        "hdl": "HDL cholesterol in mg/dL",
    },
    excluded_columns=["id", "timestamp"],  # see "Preparing your data" above
    title="My dataset",
    description="...", # improves pattern explanations and literature context
)

Public runs are free but results are published. Set visibility="private" for private data — this costs credits.


Running asynchronously

Runs take 3–15 minutes. For agent workflows or scripts that do other work in parallel:

python
# Submit without waiting
run = await engine.run_async(file="data.csv", target_column="outcome", wait=False)
print(f"Submitted {run.run_id}, continuing...")

# ... do other things ...

result = await engine.wait_for_completion(run.run_id, timeout=1800)

For synchronous scripts and Jupyter notebooks:

python
result = engine.run(file="data.csv", target_column="outcome", wait=True)
# or: pip install discovery-engine-api[jupyter] for notebook compatibility

MCP server

Disco is available as an MCP server — no local install required.

json
{
  "mcpServers": {
    "discovery-engine": {
      "url": "https://disco.leap-labs.com/mcp",
      "env": { "DISCOVERY_API_KEY": "disco_..." }
    }
  }
}

Tools: discovery_estimate, discovery_upload, discovery_analyze, discovery_status, discovery_get_results, plus account management tools.

Full agent skill file · MCP docs


Pricing

| Plan | Cost |
| --- | --- |
| Public runs | Free — results and data are published |
| Private runs | 1 credit per MB |
| Free tier | 10 credits/month, no card required |
| Researcher | $49/month — 50 credits |
| Team | $199/month — 200 credits |

Additional credits can be purchased at $1 per credit.

Estimate before running:

python
estimate = await engine.estimate(file_size_mb=10.5, num_columns=25, depth_iterations=2, visibility="private")
# estimate["cost"]["credits"] → 21
# estimate["cost"]["free_alternative"] → True
# estimate["account"]["sufficient"] → True/False
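In a script, you can gate a private run on these fields (a sketch over the response shape shown above; the helper names are ours):

```python
def can_run_private(estimate: dict) -> bool:
    """True when the account has enough credits for this run."""
    return bool(estimate["account"]["sufficient"])

def has_free_fallback(estimate: dict) -> bool:
    """True when a free public run is available as an alternative."""
    return bool(estimate["cost"].get("free_alternative"))

est = {"cost": {"credits": 21, "free_alternative": True},
       "account": {"sufficient": False}}
print(can_run_private(est), has_free_fallback(est))  # → False True
```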
