# 🍳 Kitchen Appliances Benchmark — Final Leaderboard

**Benchmark:** Research quality comparison for kitchen appliance selection  
**Date:** 2026-05-12  
**Method:** DRACO (Deep Research Accuracy, Completeness, Objectivity)  
**Experiments:** 6

---

## 🏆 Final Results

| Rank | Experiment | Cost | Time | Quality | Verdict |
|------|------------|------|------|---------|---------|
| 🥇 | **Perplexity Deep Research** | **$1.34** | 3.1 min | **9.5/10** | Production-ready |
| 🥈 | **Single-Agent Pipeline** | **$1.26** | 8 min | **8.8/10** | Production-ready |
| 3 | Extended (old) | $1.75 | 12 min | **8.2/10** | Good, minor gaps |
| 4 | Extended + cheap models | $2.03 | 4 min | **7.9/10** | Good, minor gaps |
| 5 | Extended + lightContext | $2.07 | 9.2 min | **7.8/10** | Good, minor gaps |
| ❌ | **Minimal Pipeline** | **$0.47** | 3 min | **3.15/10** | Unacceptable |

---

## 📊 Quality Scores (DRACO)

| Experiment | Accuracy (50%) | Breadth (25%) | Presentation (15%) | Citation (10%) | **TOTAL** |
|------------|----------------|---------------|--------------------|-----------------|-----------|
| Exp 1: Perplexity DR | 9.5/10 | 9/10 | 10/10 | 9/10 | **9.5/10** |
| Exp 5: Single-Agent | 9/10 | 9/10 | 9/10 | 8/10 | **8.8/10** |
| Exp 2: Extended (old) | 8/10 | 8/10 | 9/10 | 7/10 | **8.2/10** |
| Exp 4: Extended + cheap | 8/10 | 7/10 | 9/10 | 8/10 | **7.9/10** |
| Exp 3: Extended + lightContext | 8/10 | 7/10 | 9/10 | 8/10 | **7.8/10** |
| Exp 6: Minimal | 6/10 | 6/10 | 9/10 | 3/10 | **3.15/10** |

---

## 💰 Cost Efficiency

| Experiment | Cost | Quality | Cost per Quality Point | Rank |
|------------|------|---------|------------------------|------|
| **Perplexity DR** | $1.34 | 9.5 | $0.14 | 🥇 |
| Single-Agent | $1.26 | 8.8 | $0.14 | 🥈 |
| Extended (old) | $1.75 | 8.2 | $0.21 | 3 |
| Extended + cheap | $2.03 | 7.9 | $0.26 | 4 |
| Extended + lightContext | $2.07 | 7.8 | $0.27 | 5 |
| Minimal | $0.47 | 3.15 | $0.15 | ❌ |

**Insight:** Perplexity Deep Research = best value ($0.14/quality point)

---

## 🎯 Key Findings

### 1. Perplexity Deep Research = WINNER

**Why:**
- Fastest (3.1 min)
- Highest quality (9.5/10)
- Best cost efficiency ($0.14/point)
- No architecture complexity

**Best for:** Standard research tasks

---

### 2. Single-Agent Pipeline = RUNNER-UP

**Why:**
- Cheapest ($1.26)
- High quality (8.8/10)
- Full control over process
- No subagent overhead

**Best for:** Research with integrations, custom logic

---

### 3. Minimal Pipeline = FAILURE

**Why failed:**
- No fact verification (prices +41% error)
- No citations (0 sources)
- No dates for prices
- Quality dropped 65% for 63% cost savings

**Lesson:** Skipping steps = losing quality

---

## 📈 Architecture Insights

### Cost Drivers

| Factor | Impact | Evidence |
|--------|--------|----------|
| Subagent spawn | **~$0.50 each** | Exp 5 vs Multi-agent |
| LLM tokens (long session) | ~$1.00+ | Exp 5 context growth |
| Perplexity queries | ~$0.01 each | All experiments |
| lightContext | **+18% cost** | Exp 3 vs Exp 2 |
| Cheap models | **+2% cost** | Exp 4 vs Exp 3 |

### Quality Drivers

| Factor | Impact | Evidence |
|--------|--------|----------|
| Reading pages | +accuracy | Exp 6 missing |
| Citations | +traceability | Exp 6 missing |
| Planning depth | +coverage | Exp 1-5 vs Exp 6 |
| Iterative search | +completeness | Exp 6 missing |

---

## 🔧 Recommendations

### For Standard Research
→ **Use Perplexity Deep Research** ($1.34, 3 min, 9.5/10)

### For Custom Research Pipelines
→ **Use Single-Agent** ($1.26, 8 min, 8.8/10)

### For Budget Constraints
→ **Still use Perplexity DR** — $0.47 minimal = 3.15/10 (unusable)

### For Complex Multi-step Logic
→ **Use Multi-Agent** ($2.00+, 8/10) — only if needed

---

## 📁 Experiment Files

| Experiment | Location |
|------------|----------|
| Exp 1: Perplexity DR | `kitchen-exp-01-perplexity/` |
| Exp 2: Extended (old) | `kitchen-exp-02-extended-old/` |
| Exp 3: Extended + lightContext | `kitchen-exp-03-extended-lightcontext/` |
| Exp 4: Extended + cheap | `kitchen-exp-04-extended-optimized/` |
| Exp 5: Single-Agent | `kitchen-exp-05-single-agent/` |
| Exp 6: Minimal | `kitchen-exp-06-minimal/` |

---

## 📚 References

- [DRACO Benchmark](https://research.perplexity.ai/articles/evaluating-deep-research-performance-in-the-wild-with-the-draco-benchmark) — Perplexity, 2026
- [Rigorous Bench](https://arxiv.org/abs/2501.18528) — Multidimensional evaluation framework

---

*Benchmark completed: 2026-05-12*
