Scaling AI Applications with LLMs - Part 2: Empowering Choco AI to Learn

In Part 1: Lessons Learned of our series on scaling AI applications with LLMs, we shared our journey of building Choco AI. If you haven’t had a chance to read that post yet, we highly recommend starting there for the full backstory of our experiences and insights.
In this post, we’re diving into the next phase of our journey: how we built a system that empowers Choco AI to learn and improve over time, with a focus on LLM-based components.
In the literature, there are broadly two approaches to improving the performance of LLMs on specific tasks: fine-tuning the model, or adapting the prompt through prompt engineering techniques and in-context learning.
In our first iterations of Choco AI, we primarily relied on prompt-based improvement mechanisms such as few-shot learning—where task-specific examples are included in the prompt to enhance system quality. Our engineering efforts focused on building systems that continuously generate and inject relevant, dynamically retrieved labeled examples, along with personalized, contextual information—such as domain-specific instructions, prior orders, distributor details, and customer information—directly into the prompt. These techniques refined Choco AI’s ability to process and extract information efficiently and allowed it to improve over time. In the next sections, we will explore two examples in more detail, starting with few-shot learning for information extraction.
Few-Shot Learning for Information Extraction
One of the most effective mechanisms in our system has been the use of custom-labeled examples in the prompt to leverage LLMs’ few-shot learning abilities. Few-shot learning refers to providing examples in the prompt that demonstrate the desired task or response format, helping the model understand and replicate the pattern in its output. It is similar to how humans can learn a new concept from just a few examples: you demonstrate a task a few times until they can do it themselves.
One such example is our few-shot learning approach for information extraction. Order formats can vary significantly from one customer to another. Some restaurants place orders using shorthand text messages with abbreviations, while others rely on third-party tools that send semi-structured orders as PDF attachments via email. We also discussed examples of voicemail orders in our first blog post, where speech recognition introduces another layer of complexity.
Each order goes through a dedicated processing step that extracts key details–such as product names, quantities, and the restaurant name–and converts them into the structured JSON format required for downstream tasks. We refer to this critical step as the “information extraction” task within our AI system.
Naturally, this step introduces a variety of potential errors: the system might extract data from the wrong column, misinterpret ambiguous information or fail to recognize certain abbreviations and product identifiers, just to name a few. To address these challenges, we label a few common order formats for each of our distributors' customers and retrieve them at inference when new orders arrive. These labeled examples are then passed into the prompt, allowing our LLMs to dynamically adapt and optimally process every incoming order, from every customer.
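As a rough illustration, here is a minimal sketch of how retrieved labeled examples can be injected as few-shot demonstrations for the extraction step. The schema, field names, and helper structure are simplified assumptions for illustration, not our production setup:

```python
import json

# Illustrative system instruction and target schema (simplified assumptions).
SYSTEM_PROMPT = (
    "You extract order details from raw order messages and return JSON with "
    "the keys: restaurant_name and items (product_name, quantity, unit)."
)

def build_extraction_messages(raw_order: str, labeled_examples: list[dict]) -> list[dict]:
    """Assemble a chat-style prompt with few-shot examples retrieved for this customer."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    # Each labeled example demonstrates the mapping from a raw order to structured JSON.
    for example in labeled_examples:
        messages.append({"role": "user", "content": example["raw_order"]})
        messages.append({"role": "assistant", "content": json.dumps(example["extraction"])})
    # Finally, the incoming order the model should extract.
    messages.append({"role": "user", "content": raw_order})
    return messages
```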
To streamline the labeling process, we developed a simple interface that enables our customer success teams to annotate orders directly within our order inspection tool. When a new order format or failure mode is identified, we can immediately add a new label, ensuring our system continuously improves its information extraction capabilities.

Scaling Few-Shot Learning with Semantic Retrieval
While few-shot learning generally improves information extraction, its effectiveness depends on retrieving the right labeled examples for the right incoming orders. Our primary approach has been metadata-based matching. While this method works well in many cases, it has its limitations when faced with variations in order layouts, ambiguous sender information, or unseen formats.
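For illustration, a baseline metadata lookup could be as simple as the sketch below; the keys and fields are assumptions, and the point is that an exact-match lookup fails precisely when the metadata is ambiguous or the format is new:

```python
def retrieve_examples_by_metadata(order_metadata: dict, example_store: dict) -> list[dict]:
    """Look up labeled examples keyed on sender/customer metadata (illustrative keys)."""
    key = (
        order_metadata.get("distributor_id"),
        order_metadata.get("customer_id"),
        order_metadata.get("channel"),  # e.g. "email_pdf", "sms", "voicemail"
    )
    # An exact-match lookup returns nothing for unseen senders or new formats.
    return example_store.get(key, [])
```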
To address these limitations, we experimented with semantic retrieval using embeddings. Order formats often share structural similarities. For example, PDF orders from different restaurants that use the same tool may have the same layout. Another example is when distributors share a predefined order template with their customers. We hypothesized that we could use or fine-tune an embeddings model to successfully retrieve labeled samples with similar order templates, even when the information such as ordered products is different. Early experiments proved the feasibility of this approach.
By leveraging an embeddings-based retrieval system, we index labeled examples as vector representations in a database. At inference, we compute the embedding of the incoming order and perform a semantic search to retrieve labeled examples with similar structure. If the retrieved candidates meet a similarity threshold, they are used as few-shot examples in the prompt, improving adaptability without requiring manual labeling for every possible format.
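In code, the retrieval step looks roughly like the sketch below; the embedding function, similarity threshold, and in-memory index are stand-ins rather than our actual stack:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.85  # illustrative value; in practice tuned empirically

def embed(text: str) -> np.ndarray:
    """Placeholder for any text-embedding model (off-the-shelf or fine-tuned)."""
    raise NotImplementedError

def retrieve_similar_examples(raw_order: str, index: list[dict], top_k: int = 3) -> list[dict]:
    """Return labeled examples whose order layout is closest to the incoming order."""
    query = embed(raw_order)
    scored = []
    for entry in index:  # entry: {"embedding": np.ndarray, "raw_order": str, "extraction": dict}
        vec = entry["embedding"]
        similarity = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        scored.append((similarity, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Only candidates above the threshold are used as few-shot examples in the prompt.
    return [entry for similarity, entry in scored[:top_k] if similarity >= SIMILARITY_THRESHOLD]
```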

Dynamic Context Injection
While retrieving relevant examples is crucial for improving information extraction, there are cases where even the best-matched few-shot examples are insufficient. Although we believe LLMs are extremely capable, we have learnt that the overall context is what matters the most. We inject supplementary information to provide the LLM with richer context for improved extractions. Consider the voicemail transcription component we outlined in the first blog post, which suffers from mistranscriptions like “heartache” 💔 vs. “haddock”.
Voicemail transcriptions in Choco AI are used for downstream tasks like information extraction. They’re also shown to the user to help with order correction so, unlike other intermediate components in the data science pipeline, there is little room for errors.
To mitigate mistranscriptions, we iterated on two solutions. The first was a simple and generic LLM prompt:
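We're not reproducing the exact prompt here, but a minimal sketch of what such a generic correction prompt could look like (the wording is illustrative, including the final conditional referenced below):

```python
CORRECTION_PROMPT = """\
You are given the raw transcription of a voicemail in which a restaurant places an order.
The transcription may contain phonetic errors. Correct only words that are clearly
mistranscribed, keep everything else unchanged, and do not add information.
If the last sentence is unrelated to the order (for example, an advertisement-like phrase),
remove it.

Transcription:
{transcription}
"""
```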

Whisper–one of the ASR models we use–famously hallucinates sentences at the end of the audio (“Go to ama.uk for more info”), likely because its autoregressive decoding reacts poorly to stretches of silence. Adding that conditional to the prompt removed those hallucinations completely.
Keeping the prompt generic allows the LLM to use its internal knowledge and abstract reasoning capabilities to come to conclusions on its own. This initial iteration worked remarkably well, in some cases correcting every mistranscription – much to the team's surprise! It caught ambiguous, phonetic mistranscriptions like Salade masculine -> Salade mâche and Détrapes -> Échalotes.

However, as discussed in the section “Using LLMs for Narrow and Specific Tasks” in our first blog post, this generic “catch-all” prompt initially led to performance improvements but struggled with correcting brand names and uncommon products. Additionally, restaurants often order in a colloquial manner—for example, in the UK, 'blue milk' refers to milk identified by its bottle cap or carton color.
To capture further improvements, we dynamically injected additional context:
- The distributor’s assortment category (e.g., “This is a fish distributor”), stored as metadata.
- A list of previously ordered products for that restaurant.
- If the restaurant is a new customer, a list of the distributor’s most popular products.
Since restaurants often place repeat orders, we injected only the last few weeks of order history. Distributor catalogs have unique naming conventions; we initially processed product names by removing stop words and units, but found that passing the names as-is helps the most, especially for correcting units and brand names. If a restaurant referred to a product by its pack size (e.g. “two cases of …”) and the ASR model mistranscribed the product name, the pack size can hint at which product in the catalog the restaurant meant.
Context-for-context is important! We also explain to the LLM what to do with the information provided:
- If a mistranscription is obvious (i.e. clearly a typo), correct it directly.
- If not, check the order history for the correct product names that match the referenced products.
- Not all products in the transcription may be in the order history, so do not force a match between the two.
Steering the LLM in this way helps it make corrections where they are obvious and where it has been given the right context. This approach to transcription correction is conceptually similar to how ML engineers approach feature engineering: enriching input data to enhance model performance.
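Here is a condensed sketch of how such a context block might be assembled before being injected into the correction prompt; the function name and structure are illustrative assumptions:

```python
def build_correction_context(
    distributor_category: str,      # e.g. "fish distributor", stored as metadata
    order_history: list[str],       # product names from the last few weeks, passed as-is
    popular_products: list[str],    # fallback for new customers
) -> str:
    """Assemble supplementary context and steering instructions for transcription correction."""
    lines = [f"The distributor is a {distributor_category}."]
    if order_history:
        lines.append("Products this restaurant ordered recently:")
        lines.extend(f"- {name}" for name in order_history)
    else:
        lines.append("This restaurant is a new customer. The distributor's most popular products:")
        lines.extend(f"- {name}" for name in popular_products)
    lines.append(
        "If a mistranscription is obvious, correct it directly. Otherwise, check the product "
        "list for names that match the referenced products. Do not force a match: not every "
        "product in the transcription has to appear in the list."
    )
    return "\n".join(lines)
```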
Balancing In-Context Learning, Dynamic Context Injection and Fine-Tuning
Prompt-based improvement techniques provide a structured and predictable way to improve system performance while avoiding the risks of fine-tuning, which can introduce unintended regressions. By dynamically injecting relevant context into prompts, we can make targeted, surgical improvements and maintain control over when, how, and what context is added. In contrast, modifying the base prompt to fix a specific failure mode can introduce unexpected regressions in areas that previously worked—an inherent challenge with LLMs. This adaptability has consistently led to higher accuracy while allowing seamless switching between LLMs as newer models become available.
Providing richer context also reduces reliance on large, expensive models. A smaller model can achieve comparable performance if given the right contextual information, reducing costs and increasing agility. Additionally, it creates opportunities to build user-friendly interfaces for our Customer Success team, enabling them to independently contribute to system improvements without deep ML expertise.
Fine-Tuning: A Slower but Complementary Approach
Our in-context learning system has evolved to dynamically generate custom prompts for different orders in a deterministic and automated way. Fine-tuning, on the other hand, is often slower and less precise. Adding examples to a training dataset does not guarantee a model will generalize correctly or prevent similar mistakes in the future. Additionally, changes in output schema require retraining, slowing deployment.
Despite its challenges, fine-tuning remains valuable, particularly for tasks requiring deeper generalization beyond what in-context learning can provide. We found that prioritizing prompt-based methods allowed us to iterate faster while still benefiting from fine-tuning in the longer run. Avoiding dependence on a single custom LLM has also given us the freedom to quickly experiment with and integrate the latest LLMs into our system, which has also proven to be a sound business decision given the rapid advancements of the last several years.
Towards Autonomous Systems
The promise of AI–and particularly Machine Learning–is to build systems that can learn and generalize to unseen scenarios. The system we are working towards is one that can handle complexity and novelty with minimal human intervention. The labeled examples we generate for few-shot learning hold immense value beyond the applications we outlined in this article. They serve as:
1. A foundation for fine-tuning.
2. A dataset for evaluating and refining our LLM components.
3. A scalable method for creating new labeled examples.
4. A resource to help LLMs autonomously identify errors and judge outputs.
Scaling our system hinges on these labeling strategies, which not only enhance reliability but also ensure continuous improvement and adaptability. Moving towards fully autonomous AI systems is not about eliminating human involvement. It's about designing systems that continuously learn from human expertise while minimizing manual intervention. By carefully structuring our AI system with the learnings we shared in the previous article, we ensure that every human interaction, whether through labeling, corrections, or metadata input, feeds directly into a feedback loop that improves the system.
This evolving dataset not only enhances Choco AI’s immediate performance but also lays the groundwork for autonomy. As this collection of human-annotated data grows, it becomes more than just a tool for incremental improvements; it transforms into the foundation upon which autonomous AI systems can be built.