Architectural design of an Agent-Assisted Product Discovery Pipeline
A customer submits a need in natural language. The system finds, vets, prices, and lists the product. What does the AI layer actually look like — is it a single model, an ensemble, a fine-tuned LLM? A design-first analysis of each sub-problem and the approach it warrants.
From the user’s side this is a perfectly reasonable expectation; from the developer’s side it is a nightmare. The premise is straightforward to state and genuinely difficult to execute: a customer submits a free-text need (‘I need heavy-duty blenders for a small restaurant kitchen, budget around 80,000 KES for three units’) and the platform finds, vets, prices, and lists matching products for that customer without a sourcing agent in the loop.
This is the design question: what does the AI layer that makes this work actually look like? Is it a fine-tuned deep learning model? A classical ML pipeline? A prompted LLM? The answer is that it is none of these in isolation — and the interesting design work is in understanding why each sub-problem has a different character that warrants a different approach.
Decomposing the problem
The first mistake when building a system like this is treating it as a single ML problem. It is not. “Find me a product from a natural language description” is at least five distinct computational problems arranged in sequence, and conflating them leads to architectures that are simultaneously over-engineered for the easy parts and under-engineered for the hard ones.
The pipeline decomposes as follows:
- Intent and specification extraction — unstructured customer input → structured product specification
- Catalogue retrieval — structured spec → candidate products from supplier catalogues
- Vetting and ranking — candidate set → vetted, ranked shortlist
- Pricing — vetted product → landed cost estimate
- Listing generation — product + pricing → customer-facing content
Each of these has a different input type, a different output type, and a different error mode. Treating them as one problem produces a system that is hard to debug, hard to improve, and likely to fail in ways that are impossible to attribute.
Stage 1: Specification extraction
The customer’s input is natural language. The pipeline needs a structured object — something like a product category, a set of key attributes and constraints, a budget signal if present, and a quantity. For example:
{
  "category": "commercial blender",
  "attributes": {
    "use_context": "restaurant kitchen",
    "duty_cycle": "heavy-duty",
    "quantity": 3
  },
  "budget_kes": 80000,
  "budget_per_unit_kes": 26667
}
This is a named entity recognition (NER) and slot-filling problem. The classical ML approach would be a fine-tuned sequence labelling model. The more practical choice today is a general-purpose LLM with a well-constructed extraction prompt and structured output mode (JSON schema enforcement).
The reasons are practical. A fine-tuned NER model requires labelled training data you almost certainly do not have at the volume needed, degrades on out-of-distribution phrasing, and needs retraining when product categories expand. A prompted LLM generalises to novel categories, handles multilingual input (Swahili/English code-switching is common in this market), and can be improved by refining a prompt rather than retraining a model.
The tradeoff is latency and cost per call, but extraction is a single call with a short prompt and bounded output — it is not the bottleneck in a pipeline that involves supplier API calls.
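Whatever model produces the JSON, the pipeline should validate it before passing it downstream. The following is a minimal sketch of that validation step, assuming the field layout of the example above; `ProductSpec` and `parse_spec` are hypothetical names, and the LLM call itself is deliberately not shown. Note that the per-unit budget is derived in code rather than taken from the model.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProductSpec:
    """Structured specification handed to the retrieval stage (illustrative)."""
    category: str
    quantity: int
    attributes: dict = field(default_factory=dict)
    budget_kes: Optional[float] = None

    @property
    def budget_per_unit_kes(self) -> Optional[int]:
        # Computed here instead of trusting arithmetic done by the model.
        if self.budget_kes is None:
            return None
        return round(self.budget_kes / self.quantity)


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_spec(raw: str) -> ProductSpec:
    """Parse and validate the JSON string returned by the extraction call."""
    data = json.loads(raw)

    category = data.get("category")
    if not isinstance(category, str) or not category.strip():
        raise ValueError("category must be a non-empty string")

    attributes = dict(data.get("attributes") or {})
    quantity = attributes.pop("quantity", 1)  # default of 1 is an assumption
    if not isinstance(quantity, int) or isinstance(quantity, bool) or quantity < 1:
        raise ValueError("quantity must be a positive integer")

    budget = data.get("budget_kes")
    if budget is not None and (not _is_number(budget) or budget <= 0):
        raise ValueError("budget_kes must be a positive number when present")

    return ProductSpec(category.strip(), quantity, attributes, budget)
```

A validation failure here is attributable to Stage 1 alone, which is the point of keeping extraction as its own stage.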
The design principle here: use a general model where the problem space is open-ended and the training data requirement would be prohibitive. Reserve custom-trained models for where you have data and the problem is closed enough to benefit from specialisation.
Stage 2: Catalogue retrieval
Given a structured specification, the system needs to retrieve candidate products from supplier catalogues — in practice, sources like 1688 or Alibaba, where catalogue data is in Chinese. This is a retrieval problem.
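The architecture is: embed the specification, retrieve nearest neighbours from a pre-indexed catalogue embedding store, return the top-$k$ candidates. Formally, given a query embedding $\mathbf{q}$ and a catalogue of item embeddings $\{\mathbf{c}_1, \dots, \mathbf{c}_N\}$, retrieval is:
$$\operatorname*{top\text{-}k}_{i \in \{1, \dots, N\}} \; \mathrm{sim}(\mathbf{q}, \mathbf{c}_i)$$
where $\mathrm{sim}$ is a similarity function over the embedding space, such as cosine similarity.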
The critical design choice is the embedding model. Catalogue entries are in Chinese; customer specifications are in English or Swahili. A monolingual English embedding model will fail on cross-lingual retrieval. A multilingual model — LASER, mE5, or a multilingual variant of a sentence transformer — maps semantically equivalent text in different languages to nearby points in the embedding space, enabling the cross-lingual match without translation as a preprocessing step.
This stage uses no model you train. It is inference-only: a pre-trained multilingual embedding model producing vectors, an approximate nearest neighbour index (FAISS with an IVF structure for sub-linear search at catalogue scale), and a retrieval call. The “AI” here is entirely in the embedding model weights, which you inherit rather than learn.
The index is rebuilt on a cadence (nightly, or triggered by catalogue updates). Query embeddings are computed at runtime. Retrieval latency at catalogue sizes in the tens of millions of items can be kept well under 100 ms with a properly configured IVF index.
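To make the retrieval step concrete, here is a brute-force sketch of top-k nearest-neighbour search by cosine similarity. It is an exact, linear-time stand-in for the approximate index, written in plain Python so it runs anywhere; the function names and the toy three-dimensional vectors are assumptions for illustration, and real embeddings would come from the multilingual model.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def normalise(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def top_k(
    query: Sequence[float],
    catalogue: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k catalogue items most similar to the query, best first."""
    q = normalise(query)
    scored = (
        (item_id, sum(a * b for a, b in zip(q, normalise(vec))))
        for item_id, vec in catalogue.items()
    )
    # nlargest keeps only k items in memory while scanning the catalogue.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy vectors standing in for real embeddings.
catalogue = {
    "blender_a": [0.9, 0.1, 0.0],
    "kettle_b": [0.0, 1.0, 0.0],
    "blender_c": [0.8, 0.3, 0.1],
}
results = top_k([1.0, 0.2, 0.0], catalogue, k=2)
```

In production the catalogue vectors would be normalised once at index-build time, and the linear scan would be replaced by the approximate index; the contract (query vector in, ranked item ids out) stays the same.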
Stage 3: Vetting and ranking
Retrieval returns a candidate set. Not all candidates are suitable — some match the category but not the key constraints, some come from suppliers with poor fulfilment histories, some are priced as outliers relative to the category distribution. Vetting reduces the candidate set to a shortlist; ranking orders it.
This stage involves two different kinds of judgement that should not be conflated:
Specification matching: does this product actually satisfy the customer’s stated constraints? This is a semantic comparison between the structured specification from Stage 1 and the product attributes. As in Stage 1, the right tool is an LLM call, this time used as a reranker: present the spec and the candidate product side by side and ask for a match score with reasoning. This is sometimes called an LLM-as-judge pattern. The output is a scalar score and optionally a brief explanation of where the product meets or fails the spec.
Supplier and price credibility: is this supplier reliable? Is this price reasonable? These questions are answered by structured signals, not language understanding: supplier transaction volume, rating, return rate, price relative to the category distribution. These signals feed a scoring function, which can be as simple as a weighted linear combination or as involved as a gradient-boosted classifier trained on historical sourcing outcomes (which suppliers led to successful deliveries, which price points were consistent with actual product quality).
The weights are initially set by judgement and refined over time as sourcing outcome data accumulates. As that data grows, the scoring function can be replaced with a learned model — but the architecture does not need to wait for that data to start functioning. Rule-based scoring with sensible priors is a legitimate v1.
The design principle: decompose vetting into a semantic component (LLM) and a structured-signal component (scoring model or rules). Do not ask an LLM to judge supplier credibility from unstructured text when you have structured metrics available. Do not ask a scoring model to judge specification compliance when the match requires language understanding.
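A rule-based v1 of this split might look like the sketch below. The weights, the saturation point for transaction volume, the median-based price check, and the 0.7 match threshold are all placeholder assumptions, not tuned values; `match_score` is assumed to arrive from the LLM matching call, which is not shown.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Candidate:
    product_id: str
    match_score: float        # 0..1, from the LLM spec-matching call (not shown)
    supplier_rating: float    # 0..5
    return_rate: float        # fraction of orders returned, 0..1
    transaction_volume: int   # completed orders
    price_cny: float


# Hypothetical starting weights, set by judgement and meant to be revised.
WEIGHTS = {"rating": 0.4, "returns": 0.3, "volume": 0.2, "price": 0.1}
VOLUME_SATURATION = 1000  # assumed: beyond this, more orders add no credit


def _clamp(x: float) -> float:
    return min(max(x, 0.0), 1.0)


def credibility_score(c: Candidate, category_prices: List[float]) -> float:
    """Weighted linear combination of structured signals, in the range 0..1."""
    mid = median(category_prices)
    if mid <= 0:
        raise ValueError("category prices must be positive")
    signals = {
        "rating": _clamp(c.supplier_rating / 5.0),
        "returns": 1.0 - _clamp(c.return_rate),
        "volume": _clamp(c.transaction_volume / VOLUME_SATURATION),
        # Penalise prices far from the category median in either direction.
        "price": _clamp(1.0 - abs(c.price_cny - mid) / mid),
    }
    return sum(WEIGHTS[name] * value for name, value in signals.items())


def shortlist(
    candidates: List[Candidate],
    category_prices: List[float],
    min_match: float = 0.7,
) -> List[Candidate]:
    """Filter on the semantic score, then rank by the structured score."""
    matching = [c for c in candidates if c.match_score >= min_match]
    return sorted(
        matching,
        key=lambda c: credibility_score(c, category_prices),
        reverse=True,
    )
```

Keeping the two scores separate means `credibility_score` can later be swapped for a learned model without touching the matching logic.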
Stage 4: Pricing
Landed cost in a cross-border import context is a function of the supplier’s CNY price, sea or air freight (a function of volume and/or weight, as appropriate), import duty (a function of HS code classification), local handling, and margin. Given accurate inputs, this is deterministic arithmetic — not a learned model.
The uncertainty is in two places. First, freight estimation when dimensional weight data is missing from the supplier listing: here a regression model trained on historical shipments of products in the same category can impute a reasonable estimate. Second, HS code classification from a product description, which determines the duty rate. This is a text classification problem with a fixed output space (HS codes) and sufficient public training data. A classifier fine-tuned to map product descriptions to HS headings is tractable and useful here, because manual HS classification is error-prone and consequential: misclassification leads to underpaid or overpaid duty at customs.
Pricing is therefore almost entirely deterministic, with learned components only where inputs are genuinely uncertain. The design mistake to avoid is using an LLM to estimate price — LLMs are unreliable on arithmetic and have no grounding in current exchange rates or freight costs.
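The deterministic core can be sketched in a few lines. Everything numeric below is a placeholder rather than a real rate, and the choice to assess duty on goods plus freight (with insurance omitted) is a simplifying assumption; the actual duty base and any additional levies depend on the destination's customs rules.

```python
from dataclasses import dataclass


@dataclass
class LandedCostInputs:
    unit_price_cny: float
    quantity: int
    cny_to_kes: float     # exchange rate, fetched from a live source in practice
    freight_kes: float    # total freight for the consignment (known or imputed)
    duty_rate: float      # looked up from the HS code classification
    handling_kes: float   # local handling, flat fee here for simplicity
    margin_rate: float    # platform margin applied to total cost


def landed_cost_kes(x: LandedCostInputs) -> float:
    """Landed cost for the whole order, in KES, rounded to two decimals."""
    goods = x.unit_price_cny * x.quantity * x.cny_to_kes
    # Assumption: duty is assessed on goods plus freight (insurance omitted).
    customs_value = goods + x.freight_kes
    duty = customs_value * x.duty_rate
    cost = customs_value + duty + x.handling_kes
    return round(cost * (1.0 + x.margin_rate), 2)


# Placeholder inputs for illustration only; none of these are real rates.
quote = landed_cost_kes(
    LandedCostInputs(
        unit_price_cny=400.0,
        quantity=3,
        cny_to_kes=18.0,
        freight_kes=6000.0,
        duty_rate=0.25,
        handling_kes=1500.0,
        margin_rate=0.10,
    )
)
```

With these placeholder inputs the goods come to 21,600 KES, the dutiable value to 27,600, duty to 6,900, total cost to 36,000, and the quoted price to 39,600 KES. The learned components only ever supply `freight_kes` and `duty_rate`; they never produce the final number.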
Stage 5: Listing generation
The final stage takes the vetted product, its pricing, and the supplier’s raw catalogue data (title, description, images) and produces a customer-facing listing: a clean product title, a translated and edited description, curated image selection, and formatted pricing.
Title and description are LLM generation tasks — translate from Chinese if necessary, rewrite in the platform’s voice, highlight attributes that match the customer’s stated need. Image selection is a ranking task: given a set of supplier images, prefer clean product shots over lifestyle or warehouse images. A simple image classifier trained on labelled examples of “good” versus “poor” product images handles this adequately.
What the AI layer actually is
Stepping back, the AI layer in this pipeline is not a single model. It is:
- A prompted general-purpose LLM for extraction, specification matching, and listing copy
- A pre-trained multilingual embedding model for cross-lingual retrieval, running inference only
- A structured scoring function (rule-based initially, learned over time) for supplier and price vetting
- Deterministic pricing logic with narrow ML components for input estimation (freight, HS codes)
- A lightweight image classifier for listing image curation
The architecture is a directed pipeline with learned components at specific stages, not an end-to-end trained system. This is the right design for several reasons. First, each stage is independently debuggable — when the system returns a wrong product, you can trace the failure to a specific stage rather than attributing it to an opaque model. Second, each stage can be improved independently — a better embedding model improves retrieval without touching the vetting logic. Third, the system is operational before you have training data, because the stages that require data (the scorer, the freight imputer) have sensible rule-based defaults.
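The debuggability claim is easiest to see in code. Below is a minimal sketch of a stage runner that records each stage's output and, on failure, reports which stage failed; `StageError` and `run_pipeline` are hypothetical names, and the lambdas are toy stand-ins for the real stages.

```python
from typing import Any, Callable, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


class StageError(Exception):
    """Wraps a failure with the name of the stage that raised it."""

    def __init__(self, stage: str, cause: Exception):
        super().__init__(f"stage '{stage}' failed: {cause!r}")
        self.stage = stage
        self.cause = cause


def run_pipeline(stages: List[Stage], payload: Any) -> Tuple[Any, List[Tuple[str, Any]]]:
    """Run stages in order, returning the final output and a per-stage trace."""
    trace: List[Tuple[str, Any]] = []
    for name, fn in stages:
        try:
            payload = fn(payload)
        except Exception as exc:
            # Attribute the failure to a specific stage instead of the whole run.
            raise StageError(name, exc) from exc
        trace.append((name, payload))
    return payload, trace


# Toy stand-ins for the real stages, for illustration only.
stages: List[Stage] = [
    ("extract", lambda text: {"category": text.strip().lower()}),
    ("retrieve", lambda spec: [spec["category"] + "-1", spec["category"] + "-2"]),
    ("vet", lambda items: items[:1]),
]
result, trace = run_pipeline(stages, " Blender ")
```

Because every stage has a name and a recorded output, a wrong final answer can be traced back through `trace` to the first stage whose output looks wrong, and any stage can be replaced by swapping a single entry in the list.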
The temptation in a project like this is to reach for a single powerful model and ask it to do everything. An LLM can, in principle, take a customer query written in natural language—for example, “I need three heavy-duty blenders for a small restaurant kitchen, budget around 80,000 KES”—and return a product recommendation in one call. That elegance, however, comes with steep trade-offs. The same model will confidently hallucinate products that do not exist, invent plausible but entirely fictitious prices, and offer no transparent chain of reasoning to help you understand why it made a particular suggestion. Debugging such a failure means tweaking prompts blindly or resampling outputs, neither of which guarantees a fix or prevents regressions on previously working cases.
The pipeline approach sacrifices one-call elegance in favour of reliability, debuggability, and long-term maintainability. The decomposition follows a natural information flow. First, intent and specification extraction turns unstructured customer input into a structured product specification (e.g., “heavy-duty blender, three units, ~80,000 KES total”). Second, catalogue retrieval uses that spec to pull candidate products from supplier catalogues. Third, vetting and ranking checks each candidate against the specification and the supplier’s credibility signals, then produces a shortlist. Fourth, a pricing engine computes a landed cost estimate for each vetted product. Finally, listing generation assembles product details and pricing into clean, customer-facing content. Each component can be tested in isolation, logged with structured errors, and improved independently as real usage data accumulates: retrieval indices can be tuned, vetting logic updated, pricing rules swapped, or listing templates refined without retraining everything else. The result is a system that may be less magical but is far more fit for production, especially where correctness and traceability matter more than compactness.
One design principle governs everything described above: match the tool to the sub-problem, not to the project’s desire for a single coherent architecture.