Generative AI · Strategy Case Study · Retail · Women's Fashion E-commerce

Customer Feedback Intelligence

Turning thousands of unstructured customer reviews into same-day operational decisions — what to fix, how urgent it is, and who should act.

MIT Professional Education · Applied AI & Data Science
01 / The Challenge

It isn't volume.
It's ambiguity.

A retailer's reviews are full of mixed signals — praise and problems in the same breath. Traditional sentiment analysis flattens them to "neutral," and the most fixable issues stay buried in the text.

0%
of mid-rated reviews carry contrasting positive & negative signals
0%
still recommend the product — while reporting a real problem

I love this dress — but the brand runs large. I'd order a size down.

+ love the design
– sizing runs large
A 3-class model reads this as neutral — and routes it nowhere. The sizing signal, and the customer, are lost.
→ 01

Fragmentation

Separate NLP models for sentiment, classification, summarisation — each with its own data and upkeep.

→ 02

Complexity

Cost and latency compound when review volume triples during seasonal peaks.

→ 03

Delay

Insight arrives in periodic reports, days after the signal — too late to act on.

02 / The Insight

Same rating.
Different customers.

Clustering on behaviour — not stars — reveals four customer types that traditional review metrics fail to distinguish. A single response strategy can't serve them all.

The highest-value insight isn't from angry customers. It's from the ones who still recommend — while telling you exactly what to fix.

03 / The Solution

One review in.
Five decisions out.

A single GenAI pass replaces a fragmented stack of models — and returns outputs teams can act on, not just analyse.

Input · raw review
"I love this top and was so happy when it came out in black. I'm typically an 8 on top and sized up to a 10, and it still feels slightly small across the shoulders and arms. It also requires an under tank because it's very sheer, even with the lining. Worth it though!"
★ 4 / 5 · Recommends: Yes
Output · decision-ready
01
Sentiment

Four-label, incl. mixed-negative — the nuance 3-class systems miss.

02
Category

Product division & department, for buying and merchandising.

03
Summary

The customer's core message in ≤25 words.

04
Message

A brand-toned, specific reply ready for the customer.

05
Insight + urgency

Root cause, owning team, action, and a four-tier urgency tag.
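The five outputs above can be read as one structured record per review. The sketch below is an illustration only, not the case study's actual schema: the field names, the sentiment labels other than "mixed-negative", the urgency tier names, and the example values are all assumptions. It shows how a single model response, once JSON-decoded, might be checked before anyone acts on it.

```python
from dataclasses import dataclass

# Assumed label sets: the page confirms only that there are four sentiment
# labels (one of them "mixed-negative") and four urgency tiers.
SENTIMENT_LABELS = {"positive", "mixed-positive", "mixed-negative", "negative"}
URGENCY_TIERS = {"critical", "high", "medium", "low"}
MAX_SUMMARY_WORDS = 25  # the "≤25 words" limit stated for the summary


@dataclass(frozen=True)
class ReviewDecision:
    sentiment: str    # 01 Sentiment
    division: str     # 02 Category: product division
    department: str   # 02 Category: department
    summary: str      # 03 Summary
    reply: str        # 04 Message
    root_cause: str   # 05 Insight
    owning_team: str  # 05 Insight
    action: str       # 05 Insight
    urgency: str      # 05 Urgency tag


def parse_decision(raw: dict) -> ReviewDecision:
    """Validate one JSON-decoded model response.

    Raises TypeError on missing or unexpected keys and ValueError on
    out-of-range labels or an over-long summary.
    """
    decision = ReviewDecision(**raw)
    if decision.sentiment not in SENTIMENT_LABELS:
        raise ValueError(f"unknown sentiment label: {decision.sentiment!r}")
    if decision.urgency not in URGENCY_TIERS:
        raise ValueError(f"unknown urgency tier: {decision.urgency!r}")
    if len(decision.summary.split()) > MAX_SUMMARY_WORDS:
        raise ValueError("summary exceeds 25 words")
    return decision


# Hand-written example for the sample review above (not real model output).
example = {
    "sentiment": "mixed-positive",
    "division": "General",
    "department": "Tops",
    "summary": "Loves the top in black, but it runs small in shoulders and arms and is sheer.",
    "reply": "Thank you for the detailed fit notes. We're glad the black top was worth it.",
    "root_cause": "Runs small across shoulders and arms; fabric is sheer despite lining",
    "owning_team": "merchandising",
    "action": "Review size guidance and lining weight for this style",
    "urgency": "medium",
}
decision = parse_decision(example)
```

Rejecting malformed responses at this point keeps a bad label from being routed to a team as if it were a decision.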

04 / The Operating Model

Every signal routed to
the team that owns it.

The real shift isn't the model but the redesigned flow around it. Each review is tagged by urgency and sent to the right team with a concrete action.

Human-in-the-loop, by design. A validation gate sits before any automation — CX teams confirm sentiment, urgency and tone, logging every correction as a labelled example. The system proposes; people decide. Trust is earned before scale.
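A minimal sketch of that gate and the routing step, under stated assumptions: the class and function names, the dictionary keys, and the team and tier values are hypothetical and do not come from the case study. The model's proposal passes through a reviewer, any change to sentiment, urgency or reply is logged as a labelled example, and only the confirmed record is routed.

```python
from dataclasses import dataclass, field

# Fields the CX reviewer confirms before anything is automated. "reply"
# stands in for the tone check; the key names are assumptions.
GATED_FIELDS = ("sentiment", "urgency", "reply")


@dataclass
class ValidationGate:
    # Every reviewer correction is kept as a labelled example.
    corrections: list = field(default_factory=list)

    def confirm(self, review_text: str, proposed: dict, reviewer_edits: dict) -> dict:
        """Apply the reviewer's edits to the proposal and log each change."""
        for name in GATED_FIELDS:
            if name in reviewer_edits and reviewer_edits[name] != proposed[name]:
                self.corrections.append({
                    "review_text": review_text,
                    "field": name,
                    "proposed": proposed[name],
                    "confirmed": reviewer_edits[name],
                })
        # The reviewer's values win: the system proposes, people decide.
        return {**proposed, **reviewer_edits}


def queue_for(decision: dict) -> tuple:
    """Return the (owning team, urgency tier) queue for a confirmed decision."""
    return (decision["owning_team"], decision["urgency"])


gate = ValidationGate()
proposed = {
    "sentiment": "mixed-positive",
    "urgency": "low",
    "reply": "Thanks for the detailed fit notes!",
    "owning_team": "merchandising",
}
# The reviewer accepts sentiment and reply but raises the urgency.
confirmed = gate.confirm("(review text)", proposed, {"urgency": "medium"})
queue = queue_for(confirmed)
```

Keeping the correction log separate from the routed record means the labelled examples accumulate even when the reviewer changes only one field.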

05 / The Value

From reactive support to proactive decisions.

Decision latency, collapsed.

Feedback-to-insight drops from days or weeks to same-day — early enough to act before issues scale.

Configuration-driven scalability.

New categories, departments, or classification schemes can be introduced through prompt and workflow configuration.

Brand voice, at scale.

Every reply reflects the customer's specific experience while holding a consistent tone across thousands of responses.

A cost structure that invites experiments.

Testing a new category is effectively free — no budget-approval cycle to learn something.

Processing economics
Per 1,000 reviews: $0.24
Full dataset (~23K reviews): ~$5
Annual API ceiling: <$100
The real investment is operational — embedding the workflow into teams — not the technology.
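The arithmetic behind these figures can be checked in a few lines. This is a sketch assuming a flat per-review rate; the page's "~23K" is taken as exactly 23,000 and the "<$100" ceiling as a $100 budget, so the results are approximations of the rounded numbers shown.

```python
COST_PER_1000_USD = 0.24   # stated on the page
DATASET_REVIEWS = 23_000   # the page says "~23K"; exact count is an assumption
ANNUAL_CEILING_USD = 100   # the page says "<$100"


def cost_usd(n_reviews: int) -> float:
    """Processing cost at a flat per-review rate (an assumption)."""
    return n_reviews / 1000 * COST_PER_1000_USD


# Cost of one pass over the full dataset.
dataset_cost = cost_usd(DATASET_REVIEWS)

# How many reviews a year the ceiling would cover at the same rate.
reviews_under_ceiling = int(ANNUAL_CEILING_USD / COST_PER_1000_USD * 1000)

print(f"Full dataset: ${dataset_cost:.2f}")
print(f"Reviews covered by ${ANNUAL_CEILING_USD}: {reviews_under_ceiling:,}")
```

At this rate the full dataset costs roughly $5.50, in line with the "~$5" shown, and a $100 budget would cover a little over 416,000 reviews a year.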
48% Fit & sizing
24% Fabric / material
18% Design
9% Quality

The system's own output names the primary cost driver: fit & sizing — the single highest-leverage fix in apparel retail.

06 / Under the Hood

Built for production, not demos.

Six prompt configurations — zero-shot, few-shot and chain-of-thought, each in baseline and business-aware versions — were scored against two LLM-as-judge rubrics.

Few-shot won on consistency, not peak score. Chain-of-thought reached the highest single scores but proved unstable under strict, business-aligned evaluation — the wrong trade-off for high-volume production. Choosing the reliable option over the flashy one is the call that matters.
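The selection rule described above, lowest variance first and peak score second, can be sketched as follows. The six configuration names follow from the three strategies and two versions named on the page; the scores are made-up placeholders that only exercise the rule and are not the case study's results, and the tie-break on mean score is an assumption.

```python
from itertools import product
from statistics import mean, pstdev

STRATEGIES = ("zero-shot", "few-shot", "chain-of-thought")
VERSIONS = ("baseline", "business-aware")

# Three strategies x two versions = six prompt configurations.
CONFIGS = [f"{s} / {v}" for s, v in product(STRATEGIES, VERSIONS)]


def pick_most_consistent(scores: dict) -> str:
    """Return the configuration with the lowest score standard deviation.

    Ties are broken by the higher mean score (an assumption; the page
    says only that consistency was preferred over peak score).
    """
    return min(
        scores,
        key=lambda name: (pstdev(scores[name]), -mean(scores[name])),
    )


# Placeholder judge scores, NOT results from the case study.
placeholder_scores = {
    "few-shot / business-aware": [4, 4, 4, 4],
    "chain-of-thought / business-aware": [5, 3, 5, 2],
}
choice = pick_most_consistent(placeholder_scores)
```

With these placeholders the chain-of-thought run has the highest single score, yet the few-shot run is selected because its scores do not vary.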

0%
Recommendation-prediction accuracy on a held-out task the model was never optimised for.
0%
Recall on dissatisfaction. No genuine negative signal was missed.
SD 0
Lowest variance of any configuration tested — the basis for the production choice.