This page documents the full methodology behind the Deep Research Benchmark.


What Are We Comparing?

Three fundamentally different approaches to AI-powered research:

  1. Perplexity Deep Research — a black-box hosted service (sonar-deep-research). You send one prompt, it autonomously plans, searches, reads, iterates and synthesises. ~50 sources, 2-5 minutes.
  2. Extended Search Pipeline — a custom 7-step pipeline where we control every stage. Each step is a separate LLM call in a separate sub-agent. Full transparency, full control.
  3. Single-Agent Pipeline — the same 7 steps, the same prompts, but executed in one continuous session without spawning sub-agents. Context grows, but there’s zero spawn overhead.

We also tested optimisations (cheap models, lightContext, minimal steps) that turned out to be dead ends.


The 7-Step Pipeline

Both Extended Search and Single-Agent share the same algorithm. The difference is orchestration (sub-agents vs single session), not the steps.

User query
     │
     ▼
┌──────────────────────────────┐
│ Step 1: PLANNING              │
│ Dimensions of Understanding   │
│     → Search Threads          │
│ LLM calls: 2                  │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Step 2: FIRST-PASS SEARCH     │
│ Top-10 results per thread     │
│ LLM call: 0 (mechanical)      │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Step 3: RELEVANCE FILTERING   │
│ LLM judges each URL           │
│ Min 2, max 5 per thread       │
│ LLM call: 1                   │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Step 4: DEEP READING          │
│ web_fetch + text extraction   │
│ Retry: 3× exponential backoff │
│ LLM call: 0 (mechanical)      │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Step 5: GAP ANALYSIS          │
│ Per-thread summary (N calls)  │
│ Cross-thread gaps (1 call)    │
│ Contradictions + blind spots  │
│ LLM calls: N + 1              │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Step 6: ITERATIVE SEARCH      │
│ Round 1: up to 5 follow-ups   │
│ Round 2: up to 2 (critical)   │
│ Drill-down top 2-3 objects    │
│ LLM call: 1 (prioritisation)  │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Step 7: SYNTHESIS             │
│ Draft (1 call)                │
│ Critical review (1 call)      │
│ Final revision (1 call)       │
│ LLM calls: 3                  │
└──────────────────────────────┘

Step 1: Planning — Dimensions & Threads

Two-level decomposition. The query is first broken into what to understand (dimensions), then into how to find it (search threads).

Dimensions of Understanding

A dimension is NOT a search query. NOT a section header. NOT “tell me about X”. It is a specific type of question about the topic. Dimensions must be heterogeneous — each one looks at the topic from a fundamentally different angle.

| Type | Purpose | Required? |
|------|---------|-----------|
| Factual | What exists? Catalog, inventory, prices | ✅ Always |
| Mechanistic | How does it work? Cause-effect, internals, specs | ✅ Always |
| Critical | Where is data contested? Studies contradict? Spec vs reality? | ✅ Always |
| Practical | Can I actually use/buy this? Availability, service, cost | Optional |
| Contextual | Historical roots, cultural significance, market trends | Optional |

Minimum viable set: 3 dimensions (factual + mechanistic + critical).

Key design choice (H2): Threads must be heterogeneous by type. Not all “find facts about X”. At least one thread targets criticism/contradictions, at least one targets mechanisms. This is what gave Extended Search an edge over Perplexity in our initial benchmark — Perplexity catalogued, but didn’t challenge.

Search Threads

Each dimension produces 1-2 search queries. The agent decides based on breadth:

  • 1 query if the dimension is narrow (e.g. “service centres in Ufa”)
  • 2 queries if broad (e.g. “neurobiology of chanting” → landscape query + specific evidence query)
  • Never more than 2 per dimension

Drill-down is NOT planned here. If specific products/models surface from the factual dimension — great, but we don’t pre-plan searches for “Bosch Serie 6 noise review”. That happens later, driven by gap analysis.

Language: determined by the query. English fallback only for academic dimensions where the query language yields insufficient results.

LLM calls: 2 (one for dimensions, one for threads). Output is structured JSON.

Example output for the kitchen task:

{
  "dimensions": [
    {"name": "Catalog", "type": "factual", "priority": "core",
     "question": "Which brands/models available in Ufa mid-range?"},
    {"name": "Specs", "type": "mechanistic", "priority": "core",
     "question": "How do mid-range hoods/ovens differ technically?"},
    {"name": "Reality check", "type": "critical", "priority": "core",
     "question": "Where do specs diverge from real-world performance?"},
    {"name": "Service", "type": "practical", "priority": "core",
     "question": "Authorised repair in Ufa, parts availability?"}
  ],
  "threads": [
    {"dimension": "Catalog", "query": "встраиваемая вытяжка индукция духовка комплект средний сегмент 2025 2026"},
    {"dimension": "Reality check", "query": "отзывы владельцев встраиваемой вытяжки шум реальный vs заявленный проблемы"},
    {"dimension": "Service", "query": "сервисный центр Уфа ремонт Bosch MAUNFELD Gorenje гарантия"}
  ]
}
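
The planning rules above (three required dimension types, at most two queries per dimension) can be checked mechanically before any search runs. The sketch below is a minimal illustration, assuming the plan arrives in the JSON shape shown; `validate_plan` and its messages are hypothetical names, not part of the pipeline.

```python
from collections import Counter

REQUIRED_TYPES = {"factual", "mechanistic", "critical"}
MAX_THREADS_PER_DIMENSION = 2


def validate_plan(plan: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the plan passes."""
    problems = []
    dimensions = plan.get("dimensions", [])
    threads = plan.get("threads", [])

    # Minimum viable set: factual + mechanistic + critical must all be present.
    present = {d["type"] for d in dimensions}
    for missing in sorted(REQUIRED_TYPES - present):
        problems.append(f"missing required dimension type: {missing}")

    # Each dimension gets 1-2 threads, never more than 2.
    names = {d["name"] for d in dimensions}
    per_dimension = Counter(t["dimension"] for t in threads)
    for name in sorted(names):
        count = per_dimension.get(name, 0)
        if count == 0:
            problems.append(f"dimension has no search thread: {name}")
        elif count > MAX_THREADS_PER_DIMENSION:
            problems.append(f"dimension has {count} threads (max 2): {name}")

    # Threads must point at a declared dimension.
    for name in sorted(set(per_dimension) - names):
        problems.append(f"thread references unknown dimension: {name}")
    return problems
```

Run against the kitchen example, a check like this would report that the Specs dimension has no thread, which may simply mean the example output is abbreviated.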

Step 2: First-Pass Search

Mechanical step, no LLM call. Each search thread is executed as a web_search(query), and the top 10 results per thread are collected.

  • Execution: parallel (with rate-limit fallback to sequential, 5-10 sec pauses)
  • Deduplication: by URL across all threads
  • Total: 3-5 threads × 10 results = 30-50 URLs before filtering
  • Output: list of URLs with titles and snippets
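
The collection and deduplication rules can be sketched as follows. This is a simplified, sequential illustration: the parallel execution and rate-limit fallback are left out, `search` is a stand-in for whatever implements web_search, and the result shape (a dict with a `url` key) is an assumption.

```python
from typing import Callable

TOP_K = 10  # top results kept per thread


def collect_results(
    queries: list[str],
    search: Callable[[str], list[dict]],
) -> list[dict]:
    """Run each thread's query and merge results, keeping the first hit per URL."""
    seen: set[str] = set()
    merged: list[dict] = []
    for query in queries:
        for result in search(query)[:TOP_K]:
            url = result["url"]
            if url in seen:
                continue  # already collected under an earlier thread
            seen.add(url)
            merged.append({**result, "thread": query})
    return merged
```

Keeping the first occurrence is only one possible tie-break; Step 3 is where a duplicated URL gets assigned to its most relevant thread.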

Step 3: Relevance Filtering

The LLM evaluates every URL against the dimensions and decides: read deeply or skip?

Selection rules:

  • Select URLs that directly help answer the thread’s goal
  • Prefer: academic papers, official docs, detailed reviews, primary sources
  • Deprioritise: SEO aggregators, listicles with no depth, paywalls, product cards without reviews
  • If the same URL appears in multiple threads → select once, under the most relevant thread

Constraints:

| Parameter | Value | Reason |
|-----------|-------|--------|
| Min per thread | 2 | If fewer → flag as gap |
| Max per thread | 5 | Hard ceiling |
| Max total | 20 | Budget control |

Why not read everything? Cost. Filtering is ~$0.01/URL; deep reading is ~$0.10/page. 50 pages = $5 in fetch + LLM processing.

LLM call: 1 prompt with all URLs + dimensions → filtered list with one-sentence justification per selection.
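
The numeric constraints can be applied after the LLM has made its selection. The sketch below assumes, purely for illustration, that each selected URL comes with a numeric relevance score; the page itself only mentions a one-sentence justification, so the score and the name `apply_limits` are hypothetical.

```python
MIN_PER_THREAD = 2
MAX_PER_THREAD = 5
MAX_TOTAL = 20


def apply_limits(selected: dict[str, list[tuple[str, float]]]):
    """Cap LLM selections per thread and overall; report threads below the minimum.

    `selected` maps thread name -> [(url, relevance score)], as judged by the LLM.
    """
    capped = []
    gaps = []
    for thread, picks in selected.items():
        ranked = sorted(picks, key=lambda p: p[1], reverse=True)[:MAX_PER_THREAD]
        if len(ranked) < MIN_PER_THREAD:
            gaps.append(thread)  # too few good sources: flag for gap analysis
        capped.extend((score, thread, url) for url, score in ranked)

    # Enforce the global budget by keeping the highest-scoring URLs overall.
    capped.sort(key=lambda item: item[0], reverse=True)
    kept = [(thread, url) for _, thread, url in capped[:MAX_TOTAL]]
    return kept, gaps
```

Note that in this simplified version the global cap can push a thread back below its minimum; a fuller implementation would probably reserve two slots per thread first.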


Step 4: Deep Reading

Mechanical step — no LLM call. web_fetch(url) for each filtered URL. HTML → text extraction.

  • Retry policy: up to 3 retries with exponential backoff (10s, 20s, 40s)
  • Min content threshold: 200 characters (below = skip; likely a login wall or JS-only page)
  • Failed pages: logged as gaps for Step 5
  • Output: raw text per page
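
A minimal sketch of the retry and threshold logic, under one reading of the policy: an initial attempt followed by up to three retries, waiting 10, 20 and 40 seconds. `fetch` is a placeholder for web_fetch plus text extraction, and `sleep` is injectable so the logic can be exercised without real delays; both parameter names are assumptions.

```python
import time
from typing import Callable, Optional

BACKOFF_SECONDS = (10, 20, 40)  # wait before retry 1, 2, 3
MIN_CONTENT_CHARS = 200


def fetch_with_retry(
    url: str,
    fetch: Callable[[str], str],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return extracted page text, or None if the page fails or is too thin."""
    delays = iter(BACKOFF_SECONDS)
    while True:
        try:
            text = fetch(url)
            break
        except Exception:
            delay = next(delays, None)
            if delay is None:
                return None  # retries exhausted: caller logs this URL as a gap
            sleep(delay)
    if len(text) < MIN_CONTENT_CHARS:
        return None  # likely a login wall or JS-only page
    return text
```

Returning None in both failure cases keeps the caller simple: anything that comes back empty is recorded as a gap for Step 5.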

Step 5: Gap Analysis

This is the step that separates our pipeline from Perplexity Deep Research. Instead of just summarising, the LLM explicitly identifies what’s missing, where sources contradict, and what blind spots exist.

Two-pass process:

Pass 1: Per-thread summary (N LLM calls, one per thread)

Each thread’s gathered content is summarised, structured by dimension:

  • 2-3 sentences of key findings per dimension
  • Specific data points found (numbers, names, dates — not general statements)
  • Contradictions: do any sources disagree?
  • What’s missing: what we hoped to find but didn’t

Pass 2: Cross-thread gap analysis (1 LLM call)

All summaries are combined. The LLM answers three questions:

  1. Uncovered dimensions — which dimensions have significant gaps? For each: what’s missing + suggested follow-up query
  2. Contradictions — where do sources explicitly disagree? Is this resolvable with more data, or a genuine disagreement?
  3. Blind spots — what important aspects did NO thread cover, even though they should have?

Output:

GAPS:
- {dimension}: {what's missing} → follow-up: "{query}"

CONTRADICTIONS:
- {topic}: {source A} vs {source B}

BLIND SPOTS:
- {thing we should have investigated but didn't}

VERDICT:
- Dimensions fully covered: N/total
- Critical gaps (must fill): [list]
- Optional gaps (nice to have): [list]

LLM calls: N + 1 (one summary per thread + one cross-thread analysis)
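
Because the GAPS section follows a fixed line pattern, the follow-up queries can be pulled out with a small parser before Step 6. This is a sketch that assumes the LLM keeps to the template exactly; `parse_gaps` is a hypothetical helper name.

```python
import re

GAP_LINE = re.compile(
    r'^- (?P<dimension>[^:]+): (?P<missing>.+?) → follow-up: "(?P<query>.+)"$'
)


def parse_gaps(report: str) -> list[dict]:
    """Extract gap entries from the GAPS section of a gap-analysis report."""
    gaps = []
    in_gaps = False
    for raw in report.splitlines():
        line = raw.strip()
        if line.endswith(":") and line[:-1].isupper():
            in_gaps = line == "GAPS:"  # section header such as GAPS: or VERDICT:
            continue
        if in_gaps:
            match = GAP_LINE.match(line)
            if match:
                gaps.append(match.groupdict())
    return gaps
```

Lines that do not match the pattern are silently skipped here; in practice it may be safer to log them, since a malformed line usually means the model drifted from the template.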


Step 6: Iterative Search

Targeted follow-up searches driven by gap analysis. Not random: only gaps that matter are pursued.

Prioritisation (1 LLM call): rank gaps by severity → assign to rounds:

| Parameter | Round 1 | Round 2 |
|-----------|---------|---------|
| Max follow-up queries | 5 | 2 |
| Max URLs per query | 2-3 | 1-2 |
| Trigger | Always (if gaps exist) | Only if a core dimension is STILL uncovered |
| Drill-down objects | Top 2-3 | |

Drill-down criteria: if specific objects (models, brands, people) surfaced in Round 1, the agent picks the top 2-3 for deep investigation based on:

  • Frequency of mention across sources
  • Contradictoriness of data around the object
  • Category leadership (best-selling, most discussed)

The rest go into the catalogue as references, not deep-dives.
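
One way to turn the three drill-down criteria into a ranking is sketched below. The page does not say how the criteria are weighed against each other, so the lexicographic ordering (frequency, then contradictions, then leadership) is an arbitrary illustrative choice, and `Candidate` and `pick_drilldown` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    mentions: int          # how many sources mention the object
    contradictions: int    # how many conflicting data points surround it
    category_leader: bool  # best-selling or most discussed in its category


def pick_drilldown(candidates: list[Candidate], limit: int = 3) -> list[str]:
    """Rank candidates by the three criteria and return the top names."""
    def score(c: Candidate) -> tuple[int, int, int]:
        # Lexicographic order is an assumption made for this sketch.
        return (c.mentions, c.contradictions, int(c.category_leader))

    ranked = sorted(candidates, key=score, reverse=True)
    return [c.name for c in ranked[:limit]]
```

Everything outside the returned list would go into the catalogue as a reference only.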

Stop condition: all core dimensions covered OR budget exhausted, whichever comes first.

Each round repeats search → relevance filter → deep reading, then updates the summaries.
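
The round limits and the stop condition can be expressed as a small planning function. This is a sketch under stated assumptions: gaps arrive already ranked by severity, each gap carries a `core` flag, and the budget is tracked as a single number; none of these representations are specified on this page.

```python
ROUND_LIMITS = {1: 5, 2: 2}  # max follow-up queries per round


def plan_round(round_no: int, gaps: list[dict], budget_left: float) -> list[str]:
    """Pick follow-up queries for a round, or return [] when the loop should stop.

    Each gap is {"dimension": str, "query": str, "core": bool}, already
    ranked by severity (most severe first).
    """
    if round_no not in ROUND_LIMITS or budget_left <= 0:
        return []  # at most two rounds, and never past the budget
    if round_no == 2:
        # Round 2 runs only if a core dimension is still uncovered.
        gaps = [g for g in gaps if g["core"]]
    return [g["query"] for g in gaps[: ROUND_LIMITS[round_no]]]
```

An empty return value doubles as the stop signal: no gaps, no remaining core gaps in Round 2, or no budget all end the loop.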


Step 7: Synthesis

Three-pass process.

Pass 1: Draft (1 LLM call)

The LLM writes a narrative research report from all gathered content. Not a list of sections per dimension — a flowing article that weaves findings together.

Rules for the draft:

  • Specific data over general statements (“48-65 dB” not “varies”)
  • If sources contradict → present both sides, don’t pick a winner
  • Flag weak evidence (“only one study”, “manufacturer claim without independent verification”)
  • Structure: opening (why it matters) → body (organised by logic, not by dimension) → gaps (what we couldn’t find) → key takeaway

Pass 2: Critical review (1 LLM call, isolated sub-agent)

A separate agent with NO access to previous steps’ history — only the draft and dimensions. Cross-model verification is possible (default: same model; can switch to a different one).

The reviewer answers:

  • Coverage check: is each dimension substantively addressed with data? (scored 1-3: token / adequate / deep)
  • Weak claims: which claims lack support?
  • Logical gaps: where does the argument jump?
  • Missing perspective: what would a skeptical expert say is missing?
  • Specific suggestions: not “make it better” — “add data on X”, “the claim about Y needs a source”

Pass 3: Final revision (1 LLM call)

The main agent receives the review and rewrites, addressing specific suggestions.

LLM calls: 3 (draft + review + revision)
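
The three passes, and in particular the reviewer's isolation, can be sketched as below. `writer` and `reviewer` are assumed to be plain prompt-in, text-out callables, and the prompt strings are placeholders rather than the prompts used in the benchmark.

```python
from typing import Callable

LLM = Callable[[str], str]  # prompt in, completion out


def synthesise(content: str, dimensions: list[str], writer: LLM, reviewer: LLM) -> str:
    """Draft, review in isolation, then revise: three LLM calls in total."""
    dims = "\n".join(f"- {d}" for d in dimensions)

    draft = writer(f"Write a narrative research report.\n\nSOURCES:\n{content}")

    # The reviewer sees only the draft and the dimensions, never the sources
    # or earlier pipeline history.
    review = reviewer(
        f"Review this draft against the dimensions.\n\nDIMENSIONS:\n{dims}\n\nDRAFT:\n{draft}"
    )

    return writer(
        f"Revise the draft, addressing each suggestion.\n\nDRAFT:\n{draft}\n\nREVIEW:\n{review}"
    )
```

Passing the same callable for `writer` and `reviewer` corresponds to the default same-model setup; passing a different one gives cross-model verification.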


Architecture Comparison

| Aspect | Perplexity Deep Research | Extended Search (multi-agent) | Single-Agent Pipeline |
|--------|--------------------------|-------------------------------|-----------------------|
| Orchestration | Black box | Sub-agents per step | One continuous session |
| Planning | Internal | Dimensions → Threads (2 LLM calls) | Same |
| Search | ~50 sources, parallel | 30-50 URLs, parallel, deduped | Same |
| Relevance filter | Internal | LLM judgment (1 call), 20 URL max | Same |
| Deep reading | Internal | web_fetch + 3× retry | Same |
| Gap analysis | ❌ None | ✅ Per-thread summary + cross-thread analysis (N+1 calls) | ✅ Same |
| Iteration | Internal, unknown depth | 2 rounds max, gap-driven, drill-down criteria | Same |
| Synthesis | Internal | Draft + isolated review + revision (3 calls) | Same |
| Total LLM calls | 1 (opaque) | N+10 (fully logged) | N+10 (fully logged) |
| Transparency | None | Full step logs, JSON state files | Full step logs |
| Context overhead | N/A | ~$0.50 per sub-agent spawn | Grows linearly |

Key difference: Perplexity may do something similar internally, but as a black box it exposes no gap analysis. Our pipeline’s advantage is the explicit “what are we missing?” step between search rounds, plus the isolated critical review of the final draft.


DRACO Evaluation Framework

Every experiment was scored using the DRACO benchmark:

| Dimension | Weight | What it measures |
|-----------|--------|------------------|
| Factual Accuracy | 50% | Are specific claims (prices, specs, model numbers) correct? Spot-checked by independent verification. |
| Breadth & Depth | 25% | Did it cover all aspects? Surface-level mentions vs thorough analysis. |
| Presentation | 15% | Is the output well-structured, readable, actionable? |
| Citation | 10% | Are claims backed by sources? Are the sources real and accessible? |

Scores were assigned by an independent review agent that had no access to pipeline internals — only the final output.
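
Combining the four criteria into one number is a weighted sum. The sketch below uses the weights from the table; the scoring scale is not stated on this page, so it only assumes that all four criteria are scored on the same scale, and the key names are hypothetical.

```python
DRACO_WEIGHTS = {
    "factual_accuracy": 0.50,
    "breadth_depth": 0.25,
    "presentation": 0.15,
    "citation": 0.10,
}


def draco_score(scores: dict[str, float]) -> float:
    """Weighted sum of the four criterion scores (all on the same scale)."""
    if set(scores) != set(DRACO_WEIGHTS):
        raise ValueError(f"expected exactly these criteria: {sorted(DRACO_WEIGHTS)}")
    return sum(DRACO_WEIGHTS[name] * value for name, value in scores.items())
```

For instance, criterion scores of 80, 60, 100 and 50 (in table order) combine to 40 + 15 + 15 + 5 = 75.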


Cost Measurement

Costs are measured by provider balance delta (balance before → balance after each experiment), not by token counting.

Token-based estimates systematically undercount real cost by 3-22× because they miss:

  • Sub-agent spawn overhead: each spawn copies full context (~50K tokens × number of sub-agents)
  • Context accumulation: in long sessions, each subsequent LLM call is more expensive than the last
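  • Cached tokens ≠ free: cheaper, but still billed

| Experiment | Token estimate | Real balance delta | Discrepancy |
|------------|----------------|--------------------|-------------|
| Exp 2 | $0.08 | $1.75 | 22× |
| Exp 3 | $0.21 | $2.07 | 10× |
| Exp 4 | $0.50-0.80 | $2.03 | 3-4× |
| Exp 5 | $0.30 | $1.26 | 4× |
| Exp 6 | $0.05-0.10 | $0.47 | 5-9× |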

Lesson: in these experiments, token-based cost formulas undercounted by 3-22×, in the worst cases by more than an order of magnitude. Measure real balance changes.
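
The discrepancy column is simply the real spend divided by the token-based estimate. A minimal sketch, with hypothetical function names and the dollar figures taken from the table:

```python
def real_cost(balance_before: float, balance_after: float) -> float:
    """Spend measured as the provider balance delta for one experiment."""
    return balance_before - balance_after


def discrepancy(token_estimate: float, measured_cost: float) -> float:
    """How many times the measured spend exceeds the token-based estimate."""
    return measured_cost / token_estimate


# (token estimate, real balance delta) in USD
ROWS = {
    "Exp 2": (0.08, 1.75),
    "Exp 3": (0.21, 2.07),
    "Exp 5": (0.30, 1.26),
}

for name, (estimate, delta) in ROWS.items():
    print(f"{name}: {discrepancy(estimate, delta):.1f}x")
```

Experiments whose estimate is a range (Exp 4, Exp 6) can be handled by calling `discrepancy` once for each end of the range.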