Methodology: How We Compare Research Architectures

This page documents the full methodology behind the Deep Research Benchmark.

What Are We Comparing?

Three fundamentally different approaches to AI-powered research:

Perplexity Deep Research — a black-box hosted service (sonar-deep-research). You send one prompt, it autonomously plans, searches, reads, iterates and synthesises. ~50 sources, 2-5 minutes.
Extended Search Pipeline — a custom 7-step pipeline where we control every stage. Each step is a separate LLM call in a separate sub-agent. Full transparency, full control.
Single-Agent Pipeline — the same 7 steps, the same prompts, but executed in one continuous session without spawning sub-agents. Context grows, but there’s zero spawn overhead.

We also tested optimisations (cheap models, lightContext, minimal steps) that turned out to be dead ends.

The 7-Step Pipeline

Both Extended Search and Single-Agent share the same algorithm. The difference is orchestration (sub-agents vs single session), not the steps.

User query
     │
     ▼
┌──────────────────────────────┐
│ Step 1: PLANNING              │
│ Dimensions of Understanding   │
│     → Search Threads          │
│ LLM calls: 2                 │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Step 2: FIRST-PASS SEARCH     │
│ Top-10 results per thread     │
│ LLM call: 0 (mechanical)      │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Step 3: RELEVANCE FILTERING   │
│ LLM judges each URL           │
│ Min 2, max 5 per thread       │
│ LLM call: 1                   │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Step 4: DEEP READING          │
│ web_fetch + text extraction   │
│ Retry: 3× exponential backoff │
│ LLM call: 0 (mechanical)      │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Step 5: GAP ANALYSIS          │
│ Per-thread summary (N calls)  │
│ Cross-thread gaps (1 call)    │
│ Contradictions + blind spots  │
│ LLM calls: N + 1              │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Step 6: ITERATIVE SEARCH      │
│ Round 1: up to 5 follow-ups   │
│ Round 2: up to 2 (critical)   │
│ Drill-down top 2-3 objects    │
│ LLM call: 1 (prioritisation)  │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Step 7: SYNTHESIS             │
│ Draft (1 call)                │
│ Critical review (1 call)      │
│ Final revision (1 call)       │
│ LLM calls: 3                  │
└──────────────────────────────┘

Step 1: Planning — Dimensions & Threads

Two-level decomposition. The query is first broken into what to understand (dimensions), then into how to find it (search threads).

Dimensions of Understanding

A dimension is NOT a search query. NOT a section header. NOT “tell me about X”. It is a specific type of question about the topic. Dimensions must be heterogeneous — each one looks at the topic from a fundamentally different angle.

Type	Purpose	Required?
Factual	What exists? Catalog, inventory, prices	✅ Always
Mechanistic	How does it work? Cause-effect, internals, specs	✅ Always
Critical	Where is data contested? Studies contradict? Spec vs reality?	✅ Always
Practical	Can I actually use/buy this? Availability, service, cost	Optional
Contextual	Historical roots, cultural significance, market trends	Optional

Minimum viable set: 3 dimensions (factual + mechanistic + critical).

Key design choice (H2): Threads must be heterogeneous by type. Not all “find facts about X”. At least one thread targets criticism/contradictions, at least one targets mechanisms. This is what gave Extended Search an edge over Perplexity in our initial benchmark — Perplexity catalogued, but didn’t challenge.

Search Threads

Each dimension produces 1-2 search queries. The agent decides based on breadth:

1 query if the dimension is narrow (e.g. “service centres in Ufa”)
2 queries if broad (e.g. “neurobiology of chanting” → landscape query + specific evidence query)
Never more than 2 per dimension

Drill-down is NOT planned here. If specific products/models surface from the factual dimension — great, but we don’t pre-plan searches for “Bosch Serie 6 noise review”. That happens later, driven by gap analysis.

Language: determined by the query. English fallback only for academic dimensions where the query language yields insufficient results.

LLM calls: 2 (one for dimensions, one for threads). Output is structured JSON.

Example output for the kitchen task:

{
  "dimensions": [
    {"name": "Catalog", "type": "factual", "priority": "core",
     "question": "Which brands/models available in Ufa mid-range?"},
    {"name": "Specs", "type": "mechanistic", "priority": "core",
     "question": "How do mid-range hoods/ovens differ technically?"},
    {"name": "Reality check", "type": "critical", "priority": "core",
     "question": "Where do specs diverge from real-world performance?"},
    {"name": "Service", "type": "practical", "priority": "core",
     "question": "Authorised repair in Ufa, parts availability?"}
  ],
  "threads": [
    {"dimension": "Catalog", "query": "встраиваемая вытяжка индукция духовка комплект средний сегмент 2025 2026"},
    {"dimension": "Reality check", "query": "отзывы владельцев встраиваемой вытяжки шум реальный vs заявленный проблемы"},
    {"dimension": "Service", "query": "сервисный центр Уфа ремонт Bosch MAUNFELD Gorenje гарантия"}
  ]
}
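
The planning rules above (three required dimension types, at most two queries per dimension) can be checked mechanically before any search runs. The following is a minimal sketch, not the benchmark's actual code: `validate_plan` and its message strings are hypothetical, and the input shape is assumed to match the example JSON above.

```python
REQUIRED_TYPES = {"factual", "mechanistic", "critical"}
MAX_QUERIES_PER_DIMENSION = 2


def validate_plan(plan: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the plan passes."""
    problems = []
    dimensions = plan.get("dimensions", [])
    threads = plan.get("threads", [])

    # Minimum viable set: factual + mechanistic + critical.
    present = {d["type"] for d in dimensions}
    for missing in sorted(REQUIRED_TYPES - present):
        problems.append(f"missing required dimension type: {missing}")

    # Every thread must point at a known dimension, at most 2 per dimension.
    names = {d["name"] for d in dimensions}
    counts: dict[str, int] = {}
    for thread in threads:
        name = thread["dimension"]
        if name not in names:
            problems.append(f"thread references unknown dimension: {name}")
        counts[name] = counts.get(name, 0) + 1
    for name, count in counts.items():
        if count > MAX_QUERIES_PER_DIMENSION:
            problems.append(f"too many queries for dimension {name}: {count}")
    return problems
```

A plan that fails validation could be sent back to the planning call with the violations attached, though whether the real pipeline does this is not stated here.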

Step 2: First-Pass Search

Mechanical step — no LLM call. Each search thread is executed as a web_search(query). Top 10 results per thread are collected.

Execution: parallel (with rate-limit fallback to sequential, 5-10 sec pauses)
Deduplication: by URL across all threads
Total: 3-5 threads × 10 results = 30-50 URLs before filtering
Output: list of URLs with titles and snippets
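
As an illustration of the collection and deduplication described above, here is a small sketch. It assumes the search tool is passed in as a plain callable returning dicts with `url`, `title` and `snippet` keys; that shape, and the name `first_pass`, are assumptions rather than the pipeline's real interface. It runs sequentially for simplicity, so the parallel execution and rate-limit pauses are left out.

```python
from typing import Callable

# Hypothetical search signature: query -> list of {"url", "title", "snippet"}.
SearchFn = Callable[[str], list[dict]]


def first_pass(threads: list[dict], search: SearchFn, top_n: int = 10) -> list[dict]:
    """Run every thread's query and keep each URL once, in first-seen order."""
    seen: set[str] = set()
    results: list[dict] = []
    for thread in threads:
        for hit in search(thread["query"])[:top_n]:
            if hit["url"] in seen:
                continue  # already collected under an earlier thread
            seen.add(hit["url"])
            # Remember which dimension surfaced the URL first.
            results.append({**hit, "dimension": thread["dimension"]})
    return results
```

Keeping the first thread that surfaced a URL is only a placeholder here; Step 3 is where a shared URL is assigned to its most relevant thread.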

Step 3: Relevance Filtering

The LLM evaluates every URL against the dimensions and decides: read deeply or skip?

Selection rules:

Select URLs that directly help answer the thread’s goal
Prefer: academic papers, official docs, detailed reviews, primary sources
Deprioritise: SEO aggregators, listicles with no depth, paywalls, product cards without reviews
If the same URL appears in multiple threads → select once, under the most relevant thread

Constraints:

Parameter	Value	Reason
Min per thread	2	If fewer → flag the thread as a gap
Max per thread	5	Hard ceiling
Max total	20	Budget control

Why not read everything? Cost. Filtering is ~$0.01/URL; deep reading is ~$0.10/page. Reading all 50 pages would cost about $5 (50 × $0.10) in fetch + LLM processing, against about $0.50 to filter the same 50 URLs.

LLM call: 1 prompt with all URLs + dimensions → filtered list with one-sentence justification per selection.
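
The numeric constraints in the table can be enforced in code after the LLM returns its selection, so a verbose model cannot blow the budget. This is a sketch under assumptions: the selection is taken to be a list of `{"thread", "url"}` dicts in the LLM's order of preference, and `apply_limits` is a hypothetical helper name.

```python
MIN_PER_THREAD = 2
MAX_PER_THREAD = 5
MAX_TOTAL = 20


def apply_limits(selected: list[dict], thread_names: list[str]) -> tuple[list[dict], list[str]]:
    """Trim the LLM's selection to the limits and report under-covered threads."""
    kept: list[dict] = []
    per_thread = {name: 0 for name in thread_names}
    for item in selected:
        if len(kept) >= MAX_TOTAL:
            break  # overall budget reached
        thread = item["thread"]
        if per_thread.get(thread, 0) >= MAX_PER_THREAD:
            continue  # hard ceiling for this thread
        per_thread[thread] = per_thread.get(thread, 0) + 1
        kept.append(item)
    # Threads below the minimum are flagged as gaps for Step 5.
    gaps = [name for name in thread_names if per_thread[name] < MIN_PER_THREAD]
    return kept, gaps
```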

Step 4: Deep Reading

Mechanical step — no LLM call. web_fetch(url) for each filtered URL. HTML → text extraction.

Retry policy: 3 attempts, exponential backoff (10s, 20s, 40s)
Min content threshold: 200 characters (below = skip; likely a login wall or JS-only page)
Failed pages: logged as gaps for Step 5
Output: raw text per page
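
The retry policy and content threshold might look like the sketch below. The fetcher is injected as a callable because the real `web_fetch` interface is not shown here, and `fetch_with_retry` is a hypothetical name. One reading is assumed: three attempts in total, with the listed delay applied after each failed attempt and no wait after the last one, since nothing follows it.

```python
import time
from typing import Callable, Optional

BACKOFF_SECONDS = (10, 20, 40)  # one entry per attempt
MIN_CONTENT_CHARS = 200


def fetch_with_retry(
    url: str,
    fetch: Callable[[str], str],
    delays: tuple = BACKOFF_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return page text, or None if every attempt fails or the page is too short."""
    for attempt, delay in enumerate(delays, start=1):
        try:
            text = fetch(url)
        except Exception:
            if attempt < len(delays):
                sleep(delay)  # back off before the next attempt
            continue
        # Very short pages are likely login walls or JS-only shells.
        return text if len(text) >= MIN_CONTENT_CHARS else None
    return None  # caller logs this URL as a gap for Step 5
```

Passing `sleep` as a parameter keeps the function testable without real waiting.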

Step 5: Gap Analysis

This is the step that separates our pipeline from Perplexity Deep Research. Instead of just summarising, the LLM explicitly identifies what’s missing, where sources contradict, and what blind spots exist.

Two-pass process:

Pass 1: Per-thread summary (N LLM calls, one per thread)

Each thread’s gathered content is summarised, structured by dimension:

2-3 sentences of key findings per dimension
Specific data points found (numbers, names, dates — not general statements)
Contradictions: do any sources disagree?
What’s missing: what we hoped to find but didn’t

Pass 2: Cross-thread gap analysis (1 LLM call)

All summaries are combined. The LLM answers three questions:

Uncovered dimensions — which dimensions have significant gaps? For each: what’s missing + suggested follow-up query
Contradictions — where do sources explicitly disagree? Is this resolvable with more data, or a genuine disagreement?
Blind spots — what important aspects did NO thread cover, even though they should have?

Output:

GAPS:
- {dimension}: {what's missing} → follow-up: "{query}"

CONTRADICTIONS:
- {topic}: {source A} vs {source B}

BLIND SPOTS:
- {thing we should have investigated but didn't}

VERDICT:
- Dimensions fully covered: N/total
- Critical gaps (must fill): [list]
- Optional gaps (nice to have): [list]

LLM calls: N + 1 (one summary per thread + one cross-thread analysis)
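
Because the gap report follows a fixed text layout, the follow-up queries for Step 6 can be pulled out with a small parser. The sketch below assumes the LLM reproduces the template literally, including the `→ follow-up: "..."` suffix; `parse_gaps` is a hypothetical helper and the sample report is invented for illustration.

```python
import re

# Matches: - {dimension}: {what's missing} → follow-up: "{query}"
GAP_LINE = re.compile(
    r'^-\s*(?P<dimension>[^:]+):\s*(?P<missing>.+?)\s*→\s*follow-up:\s*"(?P<query>[^"]+)"\s*$'
)


def parse_gaps(report: str) -> list[dict]:
    """Extract dimension, missing item and follow-up query from the GAPS section."""
    gaps = []
    in_gaps = False
    for line in report.splitlines():
        stripped = line.strip()
        if stripped.endswith(":") and not stripped.startswith("-"):
            in_gaps = stripped == "GAPS:"  # a new section header starts here
            continue
        if in_gaps:
            match = GAP_LINE.match(stripped)
            if match:
                gaps.append(match.groupdict())
    return gaps


SAMPLE = """GAPS:
- Service: no parts availability data → follow-up: "MAUNFELD spare parts Ufa"

CONTRADICTIONS:
- Noise: spec sheet vs owner reviews
"""
```

Lines that do not match the template are silently skipped in this sketch; a stricter version could log them instead.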

Step 6: Iterative Search

Targeted follow-up searches driven by gap analysis. Not random — only gaps that matter.

Prioritisation (1 LLM call): rank gaps by severity → assign to rounds:

Parameter	Round 1	Round 2
Max follow-up queries	5	2
Max URLs per query	2-3	1-2
Trigger	Always (if gaps exist)	Only if a core dimension is STILL uncovered
Drill-down objects	Top 2-3	—

Drill-down criteria: if specific objects (models, brands, people) surfaced in Round 1, the agent picks the top 2-3 for deep investigation based on:

Frequency of mention across sources
Contradictoriness of data around the object
Category leadership (best-selling, most discussed)

The rest go into the catalogue as references, not deep-dives.

Stop condition: all core dimensions covered OR budget exhausted, whichever comes first.

Each round repeats search → relevance filter → deep reading, then updates the summaries.
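
The round limits and drill-down selection can be sketched as two small functions. Several things here are assumptions: gaps are taken to carry a numeric `severity` and a `core` flag, candidate objects carry `mentions`, `contradictions` and `leader` fields, and the three drill-down criteria are applied as a simple lexicographic sort in the order they are listed. The actual prioritisation is done by an LLM call, so this only illustrates the limits.

```python
ROUND_1_MAX_QUERIES = 5
ROUND_2_MAX_QUERIES = 2
MAX_DRILL_DOWN = 3


def plan_round(gaps: list[dict], round_no: int) -> list[dict]:
    """Pick follow-up gaps for a round from the gaps still open at that point."""
    ranked = sorted(gaps, key=lambda g: g["severity"], reverse=True)
    if round_no == 1:
        return ranked[:ROUND_1_MAX_QUERIES]
    # Round 2 only targets core dimensions that are still uncovered.
    return [g for g in ranked if g["core"]][:ROUND_2_MAX_QUERIES]


def pick_drill_down(objects: list[dict], limit: int = MAX_DRILL_DOWN) -> list[str]:
    """Rank objects by mentions, then contradictions, then category leadership."""
    ranked = sorted(
        objects,
        key=lambda o: (o["mentions"], o["contradictions"], o["leader"]),
        reverse=True,
    )
    return [o["name"] for o in ranked[:limit]]
```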

Step 7: Synthesis

Three-pass process.

Pass 1: Draft (1 LLM call)

The LLM writes a narrative research report from all gathered content. Not a list of sections per dimension — a flowing article that weaves findings together.

Rules for the draft:

Specific data over general statements (“48-65 dB” not “varies”)
If sources contradict → present both sides, don’t pick a winner
Flag weak evidence (“only one study”, “manufacturer claim without independent verification”)
Structure: opening (why it matters) → body (organised by logic, not by dimension) → gaps (what we couldn’t find) → key takeaway

Pass 2: Critical review (1 LLM call, isolated sub-agent)

A separate agent with NO access to previous steps’ history — only the draft and dimensions. Cross-model verification is possible (default: same model; can switch to a different one).

The reviewer answers:

Coverage check: is each dimension substantively addressed with data? (scored 1-3: token / adequate / deep)
Weak claims: which claims lack support?
Logical gaps: where does the argument jump?
Missing perspective: what would a skeptical expert say is missing?
Specific suggestions: not “make it better” — “add data on X”, “the claim about Y needs a source”

Pass 3: Final revision (1 LLM call)

The main agent receives the review and rewrites, addressing specific suggestions.

LLM calls: 3 (draft + review + revision)
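
The isolation of the reviewer is the part of this step that is easiest to get wrong, so here is a minimal sketch of the three calls. The `writer` and `reviewer` callables stand in for LLM calls and are assumptions, as are the prompt wordings and the name `synthesise`. The point it illustrates is that the reviewer's prompt is built only from the draft and the dimensions, never from the gathered content.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out


def synthesise(content: str, dimensions: list[str], writer: LLM, reviewer: LLM) -> dict:
    """Draft, isolated review, final revision: three LLM calls in total."""
    # Pass 1: the writer sees all gathered content.
    draft = writer("Write a narrative research report from these notes:\n\n" + content)

    # Pass 2: the reviewer sees ONLY the draft and the dimensions.
    review = reviewer(
        "Review this draft against the dimensions below. Score coverage 1-3, "
        "list weak claims, logical gaps and specific suggestions.\n\n"
        "DIMENSIONS:\n" + "\n".join(dimensions) + "\n\nDRAFT:\n" + draft
    )

    # Pass 3: the writer revises with the review in hand.
    final = writer(
        "Revise the draft, addressing each suggestion in the review.\n\n"
        "REVIEW:\n" + review + "\n\nDRAFT:\n" + draft + "\n\nNOTES:\n" + content
    )
    return {"draft": draft, "review": review, "final": final}
```

Passing a different callable as `reviewer` is one way to get the cross-model verification mentioned above.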

Architecture Comparison

Aspect	Perplexity Deep Research	Extended Search (multi-agent)	Single-Agent Pipeline
Orchestration	Black box	Sub-agents per step	One continuous session
Planning	Internal	Dimensions → Threads (2 LLM calls)	Same
Search	~50 sources, parallel	30-50 URLs, parallel, deduped	Same
Relevance filter	Internal	LLM judgment (1 call), 20 URL max	Same
Deep reading	Internal	web_fetch + 3× retry	Same
Gap analysis	❌ Not exposed	✅ Per-thread summary + cross-thread analysis (N+1 calls)	✅ Same
Iteration	Internal, unknown depth	2 rounds max, gap-driven, drill-down criteria	Same
Synthesis	Internal	Draft + isolated review + revision (3 calls)	Same
Total LLM calls	1 (opaque)	N+10 (fully logged)	N+10 (fully logged)
Transparency	None	Full step logs, JSON state files	Full step logs
Context overhead	N/A	~$0.50 per sub-agent spawn	Grows linearly

Key difference: Perplexity does something similar internally, but doesn’t expose gap analysis. Our pipeline’s advantage is the explicit “what are we missing?” step between search rounds, and the isolated critical review of the final draft.

DRACO Evaluation Framework

Every experiment was scored using the DRACO benchmark:

Dimension	Weight	What it measures
Factual Accuracy	50%	Are specific claims (prices, specs, model numbers) correct? Spot-checked by independent verification.
Breadth & Depth	25%	Did it cover all aspects? Surface-level mentions vs thorough analysis.
Presentation	15%	Is the output well-structured, readable, actionable?
Citation	10%	Are claims backed by sources? Are the sources real and accessible?

Scores were assigned by an independent review agent that had no access to pipeline internals — only the final output.
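
For reference, the weighted total implied by the table can be computed as below. The weights come from the table; the 0-100 scale for each dimension and the function name `draco_score` are assumptions, since the scoring scale is not stated here.

```python
# Weights in percent, taken from the table above.
DRACO_WEIGHTS = {
    "factual_accuracy": 50,
    "breadth_depth": 25,
    "presentation": 15,
    "citation": 10,
}


def draco_score(scores: dict) -> float:
    """Weighted average of per-dimension scores (assumed to be on a 0-100 scale)."""
    missing = set(DRACO_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(DRACO_WEIGHTS[name] * scores[name] for name in DRACO_WEIGHTS) / 100
```

With half the weight on factual accuracy, a report that is well presented but wrong on specifics cannot score above 50 on this scale.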

Cost Measurement

Costs are measured by provider balance delta (balance before → balance after each experiment), not by token counting.

Token-based estimates systematically undercount real cost by 3-22× because they miss:

Sub-agent spawn overhead: each spawn copies full context (~50K tokens × number of sub-agents)
Context accumulation: in long sessions, each subsequent LLM call is more expensive than the last
Cached tokens ≠ free: cheaper, but still billed

Experiment	Token estimate	Real balance delta	Discrepancy
Exp 2	$0.08	$1.75	22×
Exp 3	$0.21	$2.07	10×
Exp 4	$0.50-0.80	$2.03	3-4×
Exp 5	$0.30	$1.26	4×
Exp 6	$0.05-0.10	$0.47	5-9×

Lesson: in these experiments, token-based cost formulas undercounted real cost by 3-22×, in the worst case by more than an order of magnitude. Measure real balance changes.
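
The discrepancy column is simply the real balance delta divided by the token-based estimate. The sketch below reproduces it for the three experiments that have a single-point estimate; the helper names are hypothetical, and the figures are the ones in the table above.

```python
def balance_delta(balance_before: float, balance_after: float) -> float:
    """Real cost of an experiment: how much the provider balance dropped."""
    return balance_before - balance_after


def discrepancy(token_estimate: float, real_cost: float) -> float:
    """How many times larger the real cost is than the token-based estimate."""
    return real_cost / token_estimate


# (token estimate, real balance delta) in USD, from the table above.
EXPERIMENTS = {
    "Exp 2": (0.08, 1.75),
    "Exp 3": (0.21, 2.07),
    "Exp 5": (0.30, 1.26),
}

for name, (estimate, real) in EXPERIMENTS.items():
    print(f"{name}: {discrepancy(estimate, real):.1f}x")
```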
