# 🍳 Kitchen Appliances Research Benchmark

**Purpose:** Compare research architectures on an appliance-selection task  
**Date:** 2026-05-12  
**Experiments:** 6

---

## 📁 Structure

```
kitchen-appliances/
├── LEADERBOARD.md          # Final results & comparison
├── INDEX.md                # This file
├── kitchen-exp-01-perplexity/
│   ├── report.md           # Perplexity Deep Research output
│   ├── metrics.json        # Cost & timing data
│   └── REVIEW.md           # DRACO quality review
├── kitchen-exp-02-extended-old/
│   ├── FINAL-REPORT.md     # Multi-agent pipeline (old)
│   ├── REVIEW.md           # DRACO quality review
│   └── steps/              # Pipeline step outputs
├── kitchen-exp-03-extended-lightcontext/
│   ├── FINAL-REPORT.md     # With lightContext=true
│   ├── REVIEW.md           # DRACO quality review
│   └── steps/
├── kitchen-exp-04-extended-optimized/
│   ├── FINAL-REPORT.md     # With cheap models
│   ├── REVIEW.md           # DRACO quality review
│   └── steps/
├── kitchen-exp-05-single-agent/
│   ├── FINAL-REPORT.md     # Single-agent pipeline
│   ├── REVIEW.md           # DRACO quality review
│   └── steps/
└── kitchen-exp-06-minimal/
    ├── FINAL-REPORT.md     # Minimal pipeline (failed)
    ├── REVIEW.md           # DRACO quality review
    └── metrics.json
```

---

## 🏆 Quick Results

| Rank | Method | Cost | Quality | Verdict |
|------|--------|------|---------|---------|
| 1 | **Perplexity Deep Research** | $1.34 | 9.5/10 | ✅ Best |
| 2 | **Single-Agent Pipeline** | $1.26 | 8.8/10 | ✅ Best value |
| 3 | Extended (old) | $1.75 | 8.2/10 | Good |
| 4 | Extended + cheap | $2.03 | 7.9/10 | Good |
| 5 | Extended + lightContext | $2.07 | 7.8/10 | Good |
| - | Minimal Pipeline | $0.47 | 3.15/10 | ❌ Failed |

---

## 📊 Key Metrics

### Cost Comparison

| Architecture | Avg Cost | Reason |
|--------------|----------|--------|
| Perplexity DR | $1.34 | Fixed price |
| Single-Agent | $1.26 | No subagent overhead |
| Multi-Agent | ~$1.95 ($1.75 to $2.07) | Subagent spawns |
| Minimal | $0.47 | Skipped steps |

### Quality Comparison

| Architecture | Avg Quality | Reason |
|--------------|-------------|--------|
| Perplexity DR | 9.5/10 | Optimised pipeline |
| Single-Agent | 8.8/10 | Full pipeline |
| Multi-Agent | 8/10 | Good but not better |
| Minimal | 3.15/10 | No verification |

---

## 🔑 Key Insights

### 1. Perplexity Deep Research = Default Choice
- Low cost for the top score ($1.34, only $0.08 more than Single-Agent)
- Fastest (3.1 min)
- Highest quality (9.5/10)

### 2. Single-Agent = Best for Custom Pipelines
- Cheapest option that did not fail ($1.26)
- Full control
- High quality (8.8/10)

### 3. Multi-Agent = Only for Complex Logic
- Costs more ($1.75 to $2.07)
- No quality gain over Single-Agent (7.8 to 8.2 vs. 8.8/10)
- Use only when isolation is needed

### 4. Minimal Pipeline = Do Not Use
- ~63% cheaper than Single-Agent ($0.47 vs. $1.26)
- ~65% quality loss (3.15/10)
- Too unreliable to base decisions on

### 5. Optimisations Don't Work
- lightContext: +18% cost vs. Extended (old) ($2.07 vs. $1.75)
- Cheap models: +16% cost vs. Extended (old) ($2.03 vs. $1.75)
- Neither saves money
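
The percentages in these insights can be reproduced from the costs in the Quick Results table. The sketch below is a minimal illustration; the helper name `pct_change` is hypothetical, and the choice of baseline (Extended (old) for the optimisations, Single-Agent for Minimal) is an assumption, since the benchmark does not state which baseline it used.

```python
def pct_change(value: float, baseline: float) -> float:
    """Percent change of `value` relative to `baseline` (positive means higher)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100


# Costs come from the Quick Results table; the baselines are assumed.
print(f"lightContext vs Extended (old): {pct_change(2.07, 1.75):+.1f}%")
print(f"Minimal vs Single-Agent:        {pct_change(0.47, 1.26):+.1f}%")
```

Stating the baseline next to each percentage keeps the comparison unambiguous when several experiments share one table.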

---

## 📚 Methodology

### DRACO Evaluation

**4 dimensions:**
1. **Factual Accuracy** (50%): Correct facts, no hallucinations
2. **Breadth & Depth** (25%): Coverage, alternatives, depth
3. **Presentation** (15%): Structure, clarity, actionability
4. **Citation** (10%): Sources, dates, links

**Formula:**
```
Score = (Accuracy × 0.5) + (Breadth × 0.25) + (Presentation × 0.15) + (Citation × 0.10)
```
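
As a minimal sketch, the formula above can be applied in code as follows. The weights come from the formula; the dictionary keys, the function name `draco_score`, and the example scores are illustrative assumptions and are not taken from any `REVIEW.md` in this benchmark.

```python
# Weights from the DRACO formula above; key names are illustrative.
WEIGHTS = {
    "accuracy": 0.50,
    "breadth": 0.25,
    "presentation": 0.15,
    "citation": 0.10,
}


def draco_score(scores: dict[str, float]) -> float:
    """Return the weighted score for per-dimension scores on a 0-10 scale."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())


# Hypothetical per-dimension scores, for illustration only.
example = {"accuracy": 9.0, "breadth": 8.0, "presentation": 7.0, "citation": 6.0}
print(round(draco_score(example), 2))  # 4.5 + 2.0 + 1.05 + 0.6 = 8.15
```

Because Factual Accuracy carries half the weight, a one-point drop there costs as much as a two-point drop in Breadth & Depth.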

**Reference:** [DRACO Benchmark](https://research.perplexity.ai/articles/evaluating-deep-research-performance-in-the-wild-with-the-draco-benchmark)

---

## 🎯 Recommendations

### For Standard Research Tasks
→ **Use Perplexity Deep Research**

### For Custom Research Pipelines
→ **Use Single-Agent**

### For Multi-Step Complex Logic
→ **Use Multi-Agent** (only if needed)

### Never Use
→ **Minimal Pipeline** (quality too low)

---

## 📐 Token Tracking (for future experiments)

Add to each pipeline step:

```json
{
  "step": 1,
  "name": "Planning",
  "tokens": {
    "input_estimate": "~4K",
    "output_actual": 870,
    "cost": {
      "before": 6.83,
      "after": 6.82,
      "delta": 0.01
    }
  }
}
```

**Rules:**
- Input: rough estimate of the prompt length (`wc -c` / 4)
- Output: exact token count of the response (if the API reports it), otherwise `wc -c` / 4
- Cost: z.ai balance before and after each step
- Context growth: log the cumulative context size
- Count per step, not after the fact
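
A minimal sketch of how one step record could be built under these rules. The function names are hypothetical, and the balances are passed in by the caller because how the z.ai balance is fetched is not described here. Unlike the JSON example, which stores the input estimate as a string (`"~4K"`), this sketch stores it as an integer.

```python
import json

CHARS_PER_TOKEN = 4  # rule of thumb from the rules above (wc -c / 4)


def estimate_tokens(text: str) -> int:
    """Approximate the token count as UTF-8 byte length / 4, like `wc -c` / 4."""
    return len(text.encode("utf-8")) // CHARS_PER_TOKEN


def step_record(step: int, name: str, prompt: str, output_tokens: int,
                balance_before: float, balance_after: float) -> dict:
    """Build one per-step record in the same shape as the JSON example above."""
    return {
        "step": step,
        "name": name,
        "tokens": {
            "input_estimate": estimate_tokens(prompt),
            "output_actual": output_tokens,
            "cost": {
                "before": balance_before,
                "after": balance_after,
                # Round to hide floating-point noise in the subtraction.
                "delta": round(balance_before - balance_after, 4),
            },
        },
    }


# Hypothetical 16,000-character prompt; balances match the JSON example.
record = step_record(1, "Planning", "x" * 16000, 870, 6.83, 6.82)
print(json.dumps(record, indent=2))
```

Writing the record immediately after each step, rather than reconstructing it later, is what makes the balance delta attributable to that step.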

**Why this matters:**
- Only the balance before and after gives the real cost
- Token counts help explain the cost structure
- Context growth explains why Single-Agent costs more than it appears to

---

## 📅 Timeline

| Date | Event |
|------|-------|
| 2026-05-12 | Exp 1-5 completed |
| 2026-05-12 | Exp 6 (Minimal) failed |
| 2026-05-12 | DRACO reviews completed |
| 2026-05-12 | Final leaderboard published |

---

*Benchmark completed: 2026-05-12*
