April 21, 2026
Taste, Trust, and Truth: How to Evaluate AI-Powered Discovery

Once a discovery system starts speaking, accuracy alone is not enough. You have to evaluate fit, trustworthiness, and product impact together.
Why the old evaluation model is not enough
For years, discovery systems were largely silent. Search results appeared. Recommendation rows updated. Ranking models made decisions behind the scenes. The product either surfaced something compelling or it did not, but the system itself was mostly quiet.
That is changing quickly.
AI-powered discovery experiences can now talk. They can explain why something was recommended. They can ask follow-up questions. They can compare options, refine preferences, and guide a user through a decision. This makes discovery more useful, but it also changes the evaluation problem.
Once the system starts speaking, correctness is no longer the only issue. Trust becomes visible.
A traditional recommendation row can be irrelevant. An AI-powered recommendation can be irrelevant, misleading, overconfident, or inappropriate, and it can present those failures in fluent prose. That is a different risk profile.
This is why I think evaluation in AI-powered discovery has to expand beyond the classic metrics.
Clicks matter. Starts matter. Completions matter. Conversion matters. Retention matters. Those signals remain essential. But they are no longer sufficient. If the system persuades a user to click through on a weak or misrepresented recommendation, short-term engagement can conceal long-term trust damage.
In discovery, that is a dangerous trade.
Truth comes first
When I think about evaluating AI-powered discovery, I keep coming back to three words: taste, trust, and truth.
Truth is the easiest place to begin, and the place teams often underestimate.
Is the item real? Is it available to the user? Is it described accurately? If the assistant says a show is a comedy-drama, has a strong female lead, or runs under two hours, is that actually correct? If it explains why something is being recommended, is that explanation grounded in real attributes and real signals, or is the model improvising a plausible story?
This sounds basic, but it is foundational.
The fastest way to undermine an AI-powered discovery experience is to let it be eloquently wrong.
Taste is not the same as relevance
Taste is more complex.
Discovery is not only about factual correctness. It is about fit. Did the system understand the user's request? Did it surface something that matches the intended mood, context, or constraint? Did it provide enough range to support exploration without becoming generic? Did it preserve diversity, novelty, and surprise?
Taste is hard because there is rarely one correct answer. There are better and worse answers, and the difference often sits in nuance. That is why human judgment remains so important. In a discovery product, a recommendation can be technically relevant and still feel wrong.
Trust sits across everything
Trust sits across both truth and taste.
Does the assistant speak with appropriate confidence? Does it admit uncertainty when it should? Does it handle sensitive cases with care? Does it know when to ask a clarifying question and when to stop talking? Does it recover cleanly when the user says, "No, not like that"? Does it respect age, context, tone, and user control?
An AI discovery product earns trust when it is consistently useful and appropriately honest. It loses trust when it overclaims, fabricates, or becomes friction masquerading as intelligence.
A practical evaluation framework
To make this practical, I think teams should evaluate AI-powered discovery in four layers.
1. Factual Integrity: The first layer is grounding and factual integrity. This is the truth layer. Every generated recommendation, explanation, and comparison should be testable against catalog reality, policy constraints, and product state. If the experience is not grounded, nothing else matters. (A rough grounding-check sketch follows this list.)
2. User Fit: The second layer is user fit. This is the taste layer. Here the question is not just "Was the item relevant?" It is "Was this a good answer for this user, in this moment, given the request they actually made?" Fit should include relevance, nuance, diversity, novelty, and context sensitivity.
3. Interaction Trust: The third layer is interaction trust. This is where tone, calibration, clarity, and controllability matter. Was the assistant helpful without being verbose? Did it clarify only when needed? Did it let the user steer the experience without friction? Did it recover well when misunderstood?
4. Product Outcome: The fourth layer is product outcome. After all of the above, did the experience actually improve the business and user metrics we care about? Did users discover more satisfying content? Did they reformulate less? Did they abandon less? Did downstream engagement improve? Did trust signals hold over time?
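To make the truth layer concrete, here is a rough sketch of what an automated grounding check could look like. Everything in it is illustrative: the CatalogItem fields, the claim format, and the check_claims helper are assumptions, not a description of any real catalog or production pipeline.

```python
from dataclasses import dataclass

# Hypothetical catalog record; the fields are illustrative assumptions.
@dataclass
class CatalogItem:
    title: str
    genres: set[str]
    runtime_minutes: int
    available: bool

def check_claims(item: CatalogItem, claims: dict) -> list[str]:
    """Compare claims extracted from a generated explanation against catalog truth.

    Returns human-readable grounding failures; an empty list means fully grounded.
    """
    failures = []
    if not item.available:
        failures.append(f"'{item.title}' is not available to this user")
    if "genre" in claims and claims["genre"].lower() not in {g.lower() for g in item.genres}:
        failures.append(f"claimed genre '{claims['genre']}' is not in catalog genres {item.genres}")
    if "max_runtime_minutes" in claims and item.runtime_minutes > claims["max_runtime_minutes"]:
        failures.append(
            f"claimed to run under {claims['max_runtime_minutes']} minutes; "
            f"catalog says {item.runtime_minutes} minutes"
        )
    return failures

# Example: the assistant called a 131-minute drama a comedy that runs under two hours.
item = CatalogItem("Example Title", genres={"Drama"}, runtime_minutes=131, available=True)
print(check_claims(item, {"genre": "comedy", "max_runtime_minutes": 120}))
```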
All four layers matter. Too many evaluation programs jump directly to the fourth and ignore the first three.
That is how teams end up with a system that looks good in a dashboard and fragile in the real world.
How to operationalize the work
Operationally, this means evaluation needs to become more deliberate.
First, I would separate retrieval quality from response quality. If the candidate set is weak, a brilliant explanation does not save the experience. If the candidate set is strong but the assistant misrepresents it, the surface still fails. These are different failure modes and they should be measured separately.
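As a sketch of what "measured separately" could mean, the snippet below scores the retrieval layer (did the candidate set contain the items judges marked relevant?) independently of the response layer (did the generated claims about those candidates survive a grounding check?). The function names and the simple recall metric are assumptions for illustration.

```python
def retrieval_recall(candidates: list[str], relevant: set[str]) -> float:
    """Retrieval layer: share of judge-labeled relevant items present in the candidate set."""
    if not relevant:
        return 0.0
    return len(set(candidates) & relevant) / len(relevant)

def response_faithfulness(claim_checks: list[list[str]]) -> float:
    """Response layer: share of recommended items whose generated claims had no grounding failures."""
    if not claim_checks:
        return 0.0
    return sum(1 for failures in claim_checks if not failures) / len(claim_checks)

# A weak candidate set and a misrepresented strong one fail for different reasons,
# so the two scores are tracked separately.
print(retrieval_recall(["a", "b", "c"], relevant={"c", "d"}))   # 0.5
print(response_faithfulness([[], ["wrong genre"], []]))         # ~0.67
```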
Second, I would build a strong set of representative scenarios, not just generic prompts. In discovery, the hard cases matter: ambiguous requests, cold-start users, group viewing, kids and family contexts, long-tail content, multilingual queries, unavailable titles, contradictory preferences, and vague mood-based prompts. Systems that look strong on happy-path prompts often break on exactly the requests where users most need help.
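One lightweight way to keep those hard cases from drifting out of the test set is a versioned scenario list where every prompt is tagged with the situations it is meant to stress. The prompts and tags below are invented examples; the tags simply mirror the cases listed above.

```python
# Illustrative scenario set; prompts and tags are invented examples, not a real test suite.
SCENARIOS = [
    {"prompt": "something funny but not silly for tonight", "tags": ["vague_mood", "ambiguous"]},
    {"prompt": "a movie my kids and I can watch together", "tags": ["kids_family", "group_viewing"]},
    {"prompt": "that new Korean thriller everyone is talking about", "tags": ["multilingual", "long_tail"]},
    {"prompt": "short, uplifting, nothing about hospitals", "tags": ["constraints", "contradictory_preferences"]},
    {"prompt": "is The Example Show available here?", "tags": ["unavailable_title", "cold_start"]},
]

def scenarios_with(tag: str) -> list[dict]:
    """Pull every scenario that stresses a given failure mode."""
    return [s for s in SCENARIOS if tag in s["tags"]]

print(len(scenarios_with("kids_family")))  # 1
```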
Third, I would use human evaluation with a real rubric. That rubric should ask judges to score factual accuracy, fit to intent, usefulness of explanation, diversity of options, tone, and trustworthiness. Not every aspect of discovery can be reduced to a scalar model metric without losing what actually matters.
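A rubric like that can be as small as a fixed set of dimensions with an anchored scale, so different judges score the same things the same way. The dimensions below follow the list in this paragraph; the 1-to-5 scale and the weights are assumptions, and the per-dimension scores are the part worth reading even if a roll-up number is kept for tracking.

```python
# Hypothetical human-evaluation rubric; dimensions follow the text above, weights are assumed.
RUBRIC = {
    "factual_accuracy":     {"scale": (1, 5), "weight": 0.30},
    "fit_to_intent":        {"scale": (1, 5), "weight": 0.25},
    "explanation_useful":   {"scale": (1, 5), "weight": 0.15},
    "diversity_of_options": {"scale": (1, 5), "weight": 0.10},
    "tone":                 {"scale": (1, 5), "weight": 0.10},
    "trustworthiness":      {"scale": (1, 5), "weight": 0.10},
}

def weighted_score(judgement: dict[str, int]) -> float:
    """Roll per-dimension judge scores into one tracking number (the detail still matters more)."""
    return sum(RUBRIC[dim]["weight"] * score for dim, score in judgement.items())

print(weighted_score({
    "factual_accuracy": 5, "fit_to_intent": 4, "explanation_useful": 4,
    "diversity_of_options": 3, "tone": 5, "trustworthiness": 4,
}))  # ~4.3
```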
Fourth, I would review transcripts and sessions qualitatively. One of the most valuable things a team can do is read where the assistant got close but missed. Was the failure in understanding the request? In catalog grounding? In overconfidence? In asking too many questions? These patterns are product strategy signals, not just QA notes.
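To make those patterns legible over time, reviewed sessions can be tagged against a small failure taxonomy and tallied, so recurring weaknesses show up as counts rather than anecdotes. The category names below mirror the questions in this paragraph; the session structure is an assumption.

```python
from collections import Counter

# Hypothetical failure taxonomy for transcript reviews; names mirror the questions above.
FAILURE_CATEGORIES = {
    "misread_request",
    "catalog_grounding",
    "overconfidence",
    "excessive_clarification",
    "poor_recovery",
}

def summarize_reviews(reviewed_sessions: list[dict]) -> Counter:
    """Count tagged failure categories across manually reviewed sessions."""
    counts = Counter()
    for session in reviewed_sessions:
        counts.update(tag for tag in session.get("failure_tags", []) if tag in FAILURE_CATEGORIES)
    return counts

print(summarize_reviews([
    {"id": "s1", "failure_tags": ["catalog_grounding"]},
    {"id": "s2", "failure_tags": ["overconfidence", "misread_request"]},
    {"id": "s3", "failure_tags": []},
]))
```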
Fifth, I would run online experiments with real guardrails. It is not enough to measure engagement lift. You need to monitor reformulation behavior, dead ends, fallback usage, dissatisfaction signals, and any indication that the assistant is steering users into narrower or less trustworthy experiences over time.
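As one illustration of a guardrail that goes beyond engagement lift, the sketch below flags an experiment arm when trust-adjacent signals regress past a threshold, even if clicks improve. The metric names and limits are invented for the example.

```python
# Hypothetical guardrail check for an online experiment; metric names and limits are assumptions.
GUARDRAILS = {
    "reformulation_rate":      0.05,  # max tolerated relative increase vs control
    "dead_end_rate":           0.05,
    "fallback_usage":          0.10,
    "dissatisfaction_reports": 0.02,
}

def guardrail_breaches(control: dict[str, float], treatment: dict[str, float]) -> list[str]:
    """Return guardrail metrics where the treatment arm regressed beyond its allowed limit."""
    breaches = []
    for metric, max_relative_increase in GUARDRAILS.items():
        if control[metric] == 0:
            continue  # no stable baseline to compare against
        relative_change = (treatment[metric] - control[metric]) / control[metric]
        if relative_change > max_relative_increase:
            breaches.append(f"{metric} up {relative_change:.1%} (limit {max_relative_increase:.0%})")
    return breaches

print(guardrail_breaches(
    control={"reformulation_rate": 0.20, "dead_end_rate": 0.08,
             "fallback_usage": 0.05, "dissatisfaction_reports": 0.010},
    treatment={"reformulation_rate": 0.23, "dead_end_rate": 0.08,
               "fallback_usage": 0.05, "dissatisfaction_reports": 0.012},
))
```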
Common mistakes
There are also a few common evaluation mistakes I would avoid.
One is treating eloquence as quality. Fluent language can make a weak recommendation feel stronger than it is. Teams can become overly impressed by how the system sounds and underinvest in whether it is actually right.
Another is over-indexing on click-through rate. In AI-powered discovery, a persuasive explanation may increase clicks without increasing satisfaction. If the system sounds good but sends users into poor-fit content, the metric will lie to you before the user tells you the truth.
A third is ignoring catalog hygiene. Many AI issues in discovery are not really model issues. They are metadata issues, availability issues, knowledge issues, or instrumentation issues. Conversation surfaces make those weaknesses more obvious, but they do not create them.
A fourth is letting the model evaluate itself. Self-grading can be useful as a diagnostic tool, but it is not a substitute for human judgment and real product outcomes, especially in domains where fit and trust matter.
The teams that win will measure what users actually feel
What good teams will do, in my view, is treat evaluation as part of product design, not as a final QA step.
They will maintain a living set of gold-standard scenarios. They will create failure taxonomies. They will test for both correctness and user confidence. They will distinguish between being helpful and being impressive. And they will remember that discovery is one of the few places where a product can feel deeply personal even when the surface area looks small.
In content discovery, a great recommendation is not only one that gets the click. It is one that makes the user think, "This system gets me." An AI-powered experience can strengthen that feeling. It can also damage it much faster than traditional interfaces because the system is now making explicit claims.
Once a product begins to explain itself, it begins to make promises. Those promises need to be evaluated with more rigor, not less.
The teams that win with AI-powered discovery will not be the ones with the most eloquent assistant. They will be the ones that build a system users learn to rely on because it is not just smart sounding.
It is grounded, tasteful, and trustworthy.
This article reflects my personal perspectives on product management, AI, and content discovery. It does not represent the official position of my employer or any affiliated organization.