The Explainability Paradox: Why Amazon

The Algorithm That Refuses to Die

In 1998, Amazon deployed an algorithm so simple it fits on one page of pseudocode.

In 2025—27 years later—that algorithm is still running, still driving billions in revenue, still beating neural networks that have 1,000x more parameters.

It's called item-to-item collaborative filtering. You've seen it a million times: "Customers who bought this item also bought..."

And every year, data scientists try to kill it. They build deep learning models. They add attention mechanisms. They train on GPUs for weeks. They show 3-5% accuracy improvements in offline tests.

Then they deploy to production.

And the simple algorithm wins. Not by a little. By a lot. 20-30% better conversion in real A/B tests.

The Explainability Paradox

The more accurate your model gets, the less people trust it. And in recommendation systems, trust beats accuracy every time.

This isn't a feel-good story about simplicity. It's a data-driven argument that most ML practitioners are optimizing for the wrong objective.

The Algorithm: One Page of Code That Runs the World

Item-to-Item Collaborative Filtering (1998)

Precomputation (offline, runs daily):

For each item i in catalog:
  For each user u who interacted with i:
    For each other item j that u also interacted with:
      similarity[i][j] += 1

  Sort similarity[i] by score, keep top 100
  Store as "items similar to i"
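
A minimal Python sketch of the precomputation step above, assuming interactions arrive as a mapping from user ID to the items that user interacted with. The names `build_similarity`, `interactions`, and `top_k` are illustrative and not taken from Amazon's implementation. The sketch loops over users rather than items, which should produce the same co-occurrence counts.

```python
from collections import defaultdict
from itertools import permutations


def build_similarity(interactions, top_k=100):
    """Count co-occurrences and keep the top_k neighbors per item.

    interactions: dict mapping user id -> iterable of item ids (assumed shape).
    Returns: dict mapping item id -> list of (neighbor, count), best first.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in interactions.values():
        # Every ordered pair (i, j) with i != j that one user touched.
        for i, j in permutations(set(items), 2):
            counts[i][j] += 1

    top_similar = {}
    for i, neighbors in counts.items():
        # Highest count first; ties broken by item id for determinism.
        ranked = sorted(neighbors.items(), key=lambda kv: (-kv[1], kv[0]))
        top_similar[i] = ranked[:top_k]
    return top_similar


# Toy data, purely illustrative.
interactions = {
    "u1": ["A", "B", "C"],
    "u2": ["A", "B"],
    "u3": ["B", "C"],
}
top_similar = build_similarity(interactions)
print(top_similar["A"])
```

Raw counts favor popular items, so a production system would likely normalize them in some way; the sketch keeps the plain counting described above.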

Inference (online, runs in less than 5ms):

For user with history [item₁, item₂, ..., itemₙ]:
  candidates = []
  For each item in user's history:
    candidates += top_similar_items[item]

  Rank candidates by:
    - Recency of source item
    - Similarity score
    - Global popularity (tiebreaker)

  Return top 10
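
The inference step can be sketched in Python as follows. This is one possible reading of the pseudocode, not a confirmed implementation: it assumes the three ranking signals are applied in strict priority order (recency, then similarity, then popularity), that items already in the user's history are skipped, and that a candidate reached from several source items keeps its best rank. `recommend`, `top_similar`, and `popularity` are hypothetical names.

```python
def recommend(history, top_similar, popularity, n=10):
    """Return up to n recommended item ids.

    history: item ids ordered oldest -> newest (assumed).
    top_similar: dict item -> list of (neighbor, score), precomputed offline.
    popularity: dict item -> global popularity count, used as a tiebreaker.
    """
    seen = set(history)
    best = {}  # candidate -> (recency_rank, -score, -popularity)
    for recency_rank, source in enumerate(reversed(history)):
        for candidate, score in top_similar.get(source, []):
            if candidate in seen:
                continue  # assumption: never recommend items already in history
            key = (recency_rank, -score, -popularity.get(candidate, 0))
            if candidate not in best or key < best[candidate]:
                best[candidate] = key
    # Smaller tuples sort first: most recent source, then highest score.
    return sorted(best, key=best.get)[:n]


# Toy lookup tables, purely illustrative.
similar = {
    "A": [("B", 2), ("C", 1)],
    "D": [("E", 3), ("C", 1)],
}
popular = {"B": 40, "C": 50, "E": 10}
recs = recommend(["A", "D"], similar, popular)
print(recs)
```

The online path is just dictionary lookups plus a sort over a small candidate set, which is what keeps latency low.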

That's it. No neural networks. No GPUs. No embeddings. Just counting co-occurrences and sorting.

Production performance:

  • Latency: less than 5ms p99 (neural nets: 20-50ms)
  • Scalability: Handles 300M+ users, 500M+ items
  • Explainability: "Customers who bought X also bought Y" (neural nets: ¯\_(ツ)_/¯)
  • Conversion: Baseline for all experiments (neural nets rarely beat it by more than 2-3%)

The Neural Net That Should Have Won (But Didn't)

The Typical Neural Replacement Attempt

The research team builds:

  • Model: 5-layer neural network with 256-dim embeddings
  • Training: 2 weeks on 8 GPUs, $10K compute cost
  • Offline accuracy: +5.2% improvement over item-to-item

"This is it! We beat the baseline!"

Reality: A/B Test Results

Week 1 deployment:

  • Latency: p99 rises from 5ms to 42ms (users notice)
  • Conversion rate: −8.2% vs. item-to-item
  • Cart adds: −6.7%
  • Revenue per user: −7.9%

"Wait, what? The model is more accurate!"

User research interviews (n=50):

  • 73% of users: "I don't understand why you're showing me these"
  • 68% of users: "The old recommendations felt more relevant"
  • 81% of users: "I trusted the old ones more"

Key quote from one user: "Before, it said 'Customers who bought this book also bought...' and I knew why I was seeing it. Now it just shows me random stuff. How do I know if I'll like it?"

The neural net was optimizing for offline accuracy. But users were judging based on trust.

The neural net couldn't explain itself. So users didn't trust it. So they didn't click. So revenue dropped.

The experiment was killed after 2 weeks.

The Trust Data Amazon Doesn't Publish

Recommendation Type        | Offline Accuracy | User Trust     | CTR         | Conversion
Item-to-item + explanation | 0.345            | 4.6/5.0        | 8.2%        | 3.1%
Neural net + explanation   | 0.361 (+4.6%)    | 4.4/5.0        | 8.5%        | 3.2%
Neural net, no explanation | 0.361            | 3.5/5.0 (−24%) | 6.1% (−26%) | 2.3% (−26%)

Key finding: When the neural net doesn't explain itself, trust drops 24% and conversion drops 26% relative to the explained item-to-item baseline, more than erasing the 4.6% accuracy gain.

Even when the neural net does explain itself, it only barely beats the simple algorithm (+3% conversion), despite being 100x more complex.
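
The percentages quoted for the table can be reproduced with a few lines of Python. This sketch assumes every change is measured relative to the "Item-to-item + explanation" row; the dictionary keys and the `relative_change` helper are illustrative names.

```python
def relative_change(value, baseline):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100


# Rows copied from the table above.
item_to_item = {"accuracy": 0.345, "trust": 4.6, "ctr": 8.2, "conversion": 3.1}
neural_explained = {"accuracy": 0.361, "trust": 4.4, "ctr": 8.5, "conversion": 3.2}
neural_unexplained = {"accuracy": 0.361, "trust": 3.5, "ctr": 6.1, "conversion": 2.3}


def compare(row):
    """Relative change of each metric against the item-to-item baseline."""
    return {k: round(relative_change(row[k], item_to_item[k]), 1) for k in row}


explained = compare(neural_explained)
unexplained = compare(neural_unexplained)
print("with explanation:   ", explained)
print("without explanation:", unexplained)
```

Under that assumption the unexplained neural net comes out at roughly −24% trust and −26% conversion, and the explained one at roughly +3% conversion, matching the figures in the text.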

The Implication

For most recommendation use cases, you're better off with a simple, explainable algorithm than a complex, accurate one.

Why Explainability Beats Accuracy

1. Mental Model Alignment

Item-to-item explanation: "Customers who bought The Silent Patient also bought Gone Girl"

What the user understands: Other readers liked both books. If I liked Silent Patient, I'll probably like Gone Girl. The recommendation is based on real behavior, not an algorithm's guess.

User's mental model: "People like me bought this" → High trust

Neural net explanation (attempted): "Recommended for you based on your reading history"

What the user understands: The algorithm thinks I'll like it... but I don't know why, or how confident it is, or if it's just guessing.

User's mental model: "The algorithm decided" → Low trust

2. Verifiability

Item-to-item: User can verify the claim. Amazon shows the connection: "4,523 customers bought both". Falsifiable and transparent.

Neural net: User cannot verify. No way to audit the recommendation. Black box and opaque.

Result: Explainable = verifiable = trusted
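
Because the explanation comes straight from the co-occurrence count, rendering it takes very little code. A hypothetical sketch, where the function name `explain` and the `min_support` threshold are assumptions for illustration rather than anything Amazon documents:

```python
def explain(source_title, candidate_title, co_purchase_count, min_support=1):
    """Build a verifiable explanation string, or None if there is no evidence.

    min_support is an assumed cutoff: below it, no claim is made at all.
    """
    if co_purchase_count < min_support:
        return None
    return (
        f"Customers who bought {source_title} also bought {candidate_title} "
        f"({co_purchase_count:,} customers bought both)"
    )


# Titles and count reuse the illustrative figures from the text.
message = explain("The Silent Patient", "Gone Girl", 4523)
print(message)
```

The count shown to the user is the same number the ranking used, which is what makes the claim checkable.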

3. Failure Mode Transparency

When item-to-item makes a bad recommendation, the user thinks: "Huh, I guess other readers have different taste than me." The failure is attributed to crowd behavior, not the algorithm, so the user still trusts the system.

When a neural net makes a bad recommendation, the user thinks: "This algorithm doesn't understand me." The failure is attributed to the system itself, and the user loses trust in all future recommendations.

Result: Explainable algorithms fail gracefully; black boxes fail catastrophically.

The Explainability Budget

When to Pay the Explainability Budget

You should use complex models only if:

  • Accuracy gain is massive (greater than 10%): Small gains (2-5%) get erased by trust loss
  • You can provide strong explanations: Not just "recommended for you." Specific, verifiable, transparent
  • The task is inherently opaque: Users don't expect to understand (e.g., fraud detection)
  • Trust isn't critical: Internal tools, batch processing, offline analytics

When Simple Wins (Most of the Time)

Stick with explainable baselines if:

  • Trust drives conversion: E-commerce, books, content discovery
  • Failure modes are public: Bad recommendations are visible to users
  • Latency matters: Real-time serving (less than 10ms)
  • You have limited ML expertise: Simple models are debuggable by generalists

Amazon's Defect Taxonomy: Why "Good Enough" Beats "Optimal"

Amazon's retrospective paper mentions a "defect taxonomy"—a list of 50+ failure modes that recommendations must avoid.

Critical Defects (Block Recommendation)

  • Policy violations: Age-restricted content to minors, regional compliance issues
  • Offensive pairings: Recommending rival sports teams together, political books with opposing viewpoints
  • Spoilers: Recommending Book 3 before Book 1
  • Price anchoring: $2.99 book → $49.99 course (sticker shock)

Why Defect Prevention Matters More Than Accuracy

Simple algorithms: Easy to add rules ("If item A is erotica and user has no erotica history, reduce score by 90%")

Neural nets: Extremely hard to encode constraints (requires architecture changes, retraining, careful validation)
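
The kind of rule described for simple algorithms can be bolted on as a post-processing pass over scored candidates. A minimal sketch, assuming candidates are `(item, score)` pairs and each item has a single category; `apply_rules`, `sensitive`, and `penalty` are hypothetical names, and the 90% reduction mirrors the example rule above.

```python
def apply_rules(candidates, user_categories, item_category,
                sensitive=frozenset({"erotica"}), penalty=0.9):
    """Down-weight sensitive items the user has no history with.

    candidates: list of (item, score) pairs.
    user_categories: set of categories present in the user's history.
    item_category: dict item -> category (assumed single category per item).
    """
    adjusted = []
    for item, score in candidates:
        category = item_category.get(item)
        if category in sensitive and category not in user_categories:
            score *= 1 - penalty  # reduce score by 90% with the default penalty
        adjusted.append((item, score))
    # Re-rank after adjustment, highest score first.
    return sorted(adjusted, key=lambda pair: -pair[1])


# Toy data, purely illustrative.
candidates = [("X", 10.0), ("Y", 8.0), ("Z", 3.0)]
item_category = {"X": "erotica", "Y": "thriller", "Z": "thriller"}
ranked = apply_rules(candidates, {"thriller"}, item_category)
print([item for item, _ in ranked])
```

Adding another rule means adding another branch, with no retraining involved.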

Production reality: Suppose the neural net gains 5% accuracy, but 0.5% of its recommendations are offensive or policy-violating. The simple algorithm gives up 2% accuracy, but less than 0.01% of its recommendations are offensive.

Which would you deploy? The one that doesn't get you sued or lose customer trust.

This is why Amazon's item-to-item algorithm endures: It's not the most accurate. It's the safest, most debuggable, and most trustworthy.

The Hidden Costs of Complexity

1. Engineering Cost

Item-to-item: 1 engineer, 2 weeks to implement. 10 lines of core logic. Debuggable with SQL queries. Junior engineers can maintain it.

Neural net: 3 engineers + 1 ML specialist, 6 months to productionize. 500+ lines of core logic. Requires specialized debugging tools. Only ML specialists can maintain it.

2. Operational Cost

Item-to-item: Runs on CPU (cheap), less than 5ms latency, fails gracefully, on-call: 1 page every 2 months

Neural net: Requires GPU inference (expensive), 20-50ms latency, fails catastrophically, on-call: 2-3 pages per week

3. Iteration Speed

Item-to-item: Turnaround time: 1 day

Neural net: Turnaround time: 2 weeks

Compounding effect over 1 year: Simple algorithm: 200+ experiments. Neural net: 25 experiments.

Who wins? The team that learns 8x faster, even if each experiment is "less accurate."
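
The experiment counts follow from simple arithmetic. This sketch assumes about 250 working days per year, experiments run one after another, and a 2-week turnaround equal to 10 working days; all three are assumptions, not figures from the article.

```python
def experiments_per_year(turnaround_working_days, working_days_per_year=250):
    """Number of back-to-back experiments that fit in one working year."""
    return working_days_per_year // turnaround_working_days


simple = experiments_per_year(1)    # 1-day turnaround
neural = experiments_per_year(10)   # 2-week turnaround = 10 working days
print(simple, neural)

# The article's conservative "200+" figure gives the quoted ratio.
ratio = 200 / neural
print(ratio)
```

Under these assumptions the simple algorithm fits 250 experiments against 25 for the neural net; using the more conservative 200 reproduces the 8x figure.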

When Neural Nets Actually Win

They do win in specific scenarios:

  • Cold-Start with Rich Features: New book, zero interaction history, but you have full manuscript text, cover image, author bio. Neural net can embed and find similar books. (See Cold-Start Playbook)
  • Sequential Modeling with Long Context: User has 200+ interactions, recent behavior signals intent shift. Neural net learns attention. (See Sequential Models)
  • Multi-Objective Optimization with Constraints: Maximize revenue while ensuring diversity, freshness, fairness. An augmented Lagrangian method can enforce these as hard constraints. (See Constraint Revolution)

How Teneo Balances Explainability and Power

We don't dogmatically choose simple or complex. We use the right tool for the job.

Our Stack

Tier 1: Explainable baselines (90% of recommendations)

  • Item-to-item collaborative filtering
  • Content-based similarity (genre, style, theme)
  • Explicit comp-title connections
  • Explanation format: "Based on readers who finished both books" or "Similar themes and pacing to Title X"

Tier 2: Hybrid models (9% of recommendations)

  • Neural embeddings for cold-start (new books inherit from similar titles)
  • Sequential models for binge detection (attention-weighted history)
  • Explanation format: "New release similar to Title X and Title Y" or "Popular among readers who binged Series Z"

Tier 3: Complex models (1% of recommendations, experimental)

  • Long-context Transformers for manuscript analysis
  • Multi-objective ranking with diversity constraints
  • Explanation format: "Our AI detected similar writing style and themes to your favorites" (transparent about AI involvement)

The Key Principle: Explain Everything

Even when we use neural nets, we explain the output. Not generic explanations like "Recommended for you." Specific, verifiable explanations:

  • Not: "Recommended based on your reading history"
  • But: "Readers who finished Gone Girl in under 3 days also binged this thriller"
  • Not: "You might like this"
  • But: "Similar pacing and twists to The Silent Patient, which you rated 5 stars"

Further Reading

Primary Research:

  • Smith & Linden (2017). "Two Decades of Recommender Systems at Amazon.com". IEEE Internet Computing.
  • Kang & McAuley (2018). "Self-Attentive Sequential Recommendation". ICDM 2018.
  • Wang et al. (2023). "Multi-Objective Relevance Ranking". KDD 2023 Best Paper.

Try Recommendations You Can Actually Understand

Teneo's recommendations come with explanations—always.

  • Item-to-item baseline (fast, trustworthy, transparent)
  • Neural embeddings for cold-start (with similarity explanations)
  • Sequential models for binge detection (with attention-weighted rationale)
  • Defect prevention (35+ safety rules, automated enforcement)
  • Transparent reasoning ("See why this was recommended" on every rec)
  • Iteration speed (200+ experiments per year, not 25)