Scaling AI Applications with LLMs
Part 1: Lessons Learned

Two years ago, Choco reached a pivotal moment with the release of OpenAI's ChatGPT and its API models. We took this as an opportunity to pause and reflect on how these advancements might impact our work. We held an intensive, multi-day internal hackathon that marked the beginning of a transformative journey.
The output laid the foundation for Choco AI, an exciting product developed to digitize and automate the order intake process for food and beverage distributors. Within six months of its launch, Choco AI had not only been embraced by our customers but had also become the main driver of our new revenue growth.
Its success was driven by its ability to automate complex workflows, reduce manual effort, save time, and provide highly accurate, customer-specific predictions.
We scaled to hundreds of new customers while continuously improving the quality of our AI system through strategic, technological, and operational decisions. A key focus was effectively utilizing Large Language Models (LLMs) in a production environment at scale. Choco AI has been live for over a year, and as the industry begins to explore and establish best practices for building and scaling AI applications with LLMs, we wanted to share key lessons from our journey.
This article is the first in a series where we share our insights from building and scaling Choco AI. It highlights the key decisions we made early on, framed as questions others can ask themselves when starting their own AI application projects. Future articles will examine specific technical and design challenges in depth.
What is Choco AI?
Choco AI is a system that automates and streamlines the ordering process between restaurants and food and beverage distributors by collecting unstructured orders (via email, voicemail, SMS, WhatsApp, etc.) and integrating them directly into distributors’ ERP systems. In one case, Choco AI helped a distributor reduce manual order entry time by 60%, allowing them to process 50% more orders daily without additional staffing. We use AI to automate and streamline a previously manual workflow that required human input at multiple steps due to the unstructured and varied ways restaurant orders are placed and received by distributors.
During our research and development phase, we worked closely with a couple of distributors on the first version. We analyzed their existing order intake processes, where orders arrived through various channels such as SMS, WhatsApp, voicemail, email, and—particularly in Germany—fax. Employees would manually interpret these orders and enter them into their ERP system, often switching between landline phones, smartphones, and printouts. Choco AI consolidates this process by directing all incoming orders into a unified interface. On a single screen, incoming orders are displayed on the left, while our system’s predicted order details are displayed on the right. Distributors can quickly review and edit mispredicted entries before clicking “Accept order” to seamlessly sync orders to their ERP system.

Choco AI is a composite AI system that performs multiple tasks to process these orders accurately. Some formats, like voicemails or scribbled handwritten notes, lack context and are thus particularly challenging to process. For example, a customer might request “2 kilos of tomatoes,” but with 40 tomato brands and various packaging options in the distributor’s catalog, identifying the exact product can be complex. Choco AI uniquely solves this by leveraging personalized data and contextual understanding to accurately identify and predict the exact product the customer expects, similar to how experienced order desk employees develop an intuitive understanding of their customers' preferences and behaviors over time. For instance, when Andrea requests 2 kilos of tomatoes, the employee knows she actually means canned San Marzano tomatoes, packaged in 500g cases, which amounts to 4 units.
Our system integrates multiple LLMs and ML models, each tailored for specific tasks, to generate these predictions. LLMs extract text from diverse order formats and identify relevant details. Our custom ML models then match this information to the correct products in the distributor’s catalog. For example, in processing a voicemail order, the system transcribes the audio to recognize the caller (e.g., "Andrea"), detects keywords like "2 kilos of tomatoes," and cross-references this data with the distributor’s catalog to pinpoint the exact product and quantity. This modular approach ensures high accuracy, with Choco AI currently achieving over 95% correct predictions.
Key Lessons Learned
Use LLMs for Narrow and Specific Tasks
AI systems are software systems. Adhering to established software engineering principles, such as modularity and the single responsibility principle, remains both valid and critical. Breaking problems down into the smallest possible components simplifies assessment, maintenance, and scaling, and isolating each part makes it easier to reason about potential solutions and to evaluate whether an LLM is the right tool in the first place. This approach also keeps the system robust and easier to debug.
Although LLMs are powerful, overloading a single prompt with multiple tasks can result in overly complex, difficult-to-maintain solutions.
During our first hackathon, it was tempting to rely on a single LLM call to handle what traditionally required multiple systems. And in many cases, it worked surprisingly well. We would write prompts like:
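The original hackathon prompts aren't reproduced here, but they looked roughly like the following deliberately overloaded, do-everything instruction (the wording and fields below are an illustrative reconstruction, not our actual prompt):

```
You receive a raw customer order (an email, a WhatsApp message, or a voicemail transcript).
In one pass:
1. Clean up and, if needed, translate the text.
2. Identify the customer and the requested delivery date.
3. Extract every ordered product with its quantity and unit.
4. Match each product to the distributor's catalog.
Return everything as a single JSON object.
```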

Using LLMs in hackathon projects or proofs of concept (POCs), where they only need to perform well for a specific use case, can be effective. However, as you move beyond the POC stage, it becomes clear that a catch-all prompt does not scale.
Instead, separating tasks into focused steps allowed each part to be independently optimized. In Choco AI’s case, processing a voicemail order could have been handled in one step with a multimodal model. However, we chose to break this into multiple distinct model calls, each with a specific, testable responsibility—such as transcription, correction, and information extraction.
This modular approach allowed us to track exactly how each component contributed to the overall output, making it easier to pinpoint and address any issues that arose. By structuring tasks this way, we optimized each part independently—a concept we explore further in the next sections.
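As a rough illustration of what that decomposition can look like in code (the function names, signatures, and data shapes here are ours for the sake of the sketch, not Choco AI's production interfaces), the voicemail path becomes a chain of narrow, independently testable steps:

```python
# Illustrative sketch of a modular voicemail pipeline; all names are hypothetical.
from dataclasses import dataclass


@dataclass
class OrderLine:
    raw_text: str      # e.g. "2 kilos of tomatoes"
    quantity: float
    unit: str


def transcribe(audio_bytes: bytes) -> str:
    """Speech-to-text only; evaluated in isolation (e.g. with WER)."""
    ...


def correct_transcript(transcript: str) -> str:
    """LLM call that fixes obvious mishearings using domain vocabulary."""
    ...


def extract_order_lines(transcript: str) -> list[OrderLine]:
    """LLM call that pulls out products, quantities, and units."""
    ...


def match_to_catalog(lines: list[OrderLine], customer_id: str) -> list[dict]:
    """Custom ML model that maps each line to a concrete catalog product."""
    ...


def process_voicemail(audio_bytes: bytes, customer_id: str) -> list[dict]:
    # Each step can be swapped, evaluated, and debugged on its own.
    transcript = correct_transcript(transcribe(audio_bytes))
    lines = extract_order_lines(transcript)
    return match_to_catalog(lines, customer_id)
```

Because each step has a single responsibility, a regression in, say, transcription quality shows up in that step's metrics rather than as an unexplained drop in end-to-end accuracy.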
Evaluate Everything
Thorough evaluation is the cornerstone of a successful AI system deployment. Whether referred to as tests, validations, evals or evaluations, having robust test datasets for each AI or ML task has been non-negotiable for us. These datasets, along with rigorous evaluation metrics, provide detailed insights into how individual components and the overall system perform under production-like conditions. For example, evaluating the transcription task for voicemail orders using metrics such as word error rate (WER) gives us a way to assess how well our system can process voicemail orders overall. Much like traditional software requires rigorous testing before deployment, AI systems should not be released without comprehensive evaluation.
A comprehensive evaluation pipeline also speeds up innovation. Teams can quickly test hypotheses, iterate on models, and evaluate how these changes may affect the quality of the overall system. For instance, when a new speech-to-text model gets released, we can seamlessly evaluate it using our large dataset—composed of diverse audio samples of orders spanning different speakers, languages, accents, background-noise conditions, and durations. The evaluation pipeline then tests its performance on the audio transcription task in isolation, as well as on immediate downstream tasks (such as information extraction) based on those samples, against a set of metrics. That way, we confirm that it offers measurable benefits without undermining the system's overall reliability, empowering AI engineers to deploy changes confidently.
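To make the isolated transcription evaluation concrete, here is a minimal sketch assuming labeled (audio, reference transcript) pairs and the open-source `jiwer` package for WER; the dataset layout and the promotion gate are illustrative assumptions, not our actual pipeline:

```python
# Minimal transcription-evaluation sketch; dataset format and gate are illustrative.
import jiwer


def evaluate_transcriber(transcribe_fn, dataset):
    """dataset: iterable of (audio_bytes, reference_transcript) pairs."""
    references, hypotheses = [], []
    for audio, reference in dataset:
        references.append(reference)
        hypotheses.append(transcribe_fn(audio))
    # Corpus-level word error rate: lower is better.
    return jiwer.wer(references, hypotheses)


# Compare a candidate speech-to-text model against the current one
# before promoting it (the "no regression" gate described above).
# current_wer = evaluate_transcriber(current_model.transcribe, labeled_voicemails)
# candidate_wer = evaluate_transcriber(new_model.transcribe, labeled_voicemails)
# promote = candidate_wer <= current_wer
```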
When GPT-4o was released, we had it running in production within a week. We identified a specific task around handling very large orders where we believed it could deliver an improvement. Within two days, our team spun up a notebook to test it on a few anecdotal examples, ran evaluations against a test dataset, and confirmed through end-to-end tests that it was an overall improvement.
Human Labeling Is a Critical Part of the System
Reliable ground truth data is fundamental to the quality of any AI model, especially for tasks that require human-level reasoning. To create such datasets, expert knowledge and judgment are often necessary.
In our voicemail order example, this involves addressing complexities such as ambiguous speech, regional accents, and colloquial phrases. Initially, we outsourced the task to an external agency, but the lack of domain expertise often led to unreliable results. Today, we rely more on our internal Customer Success teams to provide us with ground truth data, as they work hand-in-hand with our customers and thus have a deep understanding of their specific operations and needs. This close collaboration helps resolve ambiguities effectively.
Investing in the design and development of user-friendly internal tools and systems to streamline the labeling process was key. For example, our AI engineers can seamlessly add audio samples to a tool for transcription. Once complete, those transcripts are stored in a database, from where they can be easily retrieved for evaluation purposes.
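As a simplified illustration of that flow (the table name and schema here are hypothetical), the labeled transcripts can live in an ordinary database table and be pulled back out as an evaluation set:

```python
# Hypothetical sketch of storing and retrieving labeled transcripts; schema is illustrative.
import sqlite3

conn = sqlite3.connect("labels.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS voicemail_labels (
        sample_id  TEXT PRIMARY KEY,  -- reference to the stored audio sample
        transcript TEXT NOT NULL,     -- human-verified ground truth
        labeled_by TEXT,              -- e.g. Customer Success team member
        labeled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")


def save_label(sample_id: str, transcript: str, labeled_by: str) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO voicemail_labels (sample_id, transcript, labeled_by) "
        "VALUES (?, ?, ?)",
        (sample_id, transcript, labeled_by),
    )
    conn.commit()


def load_eval_set() -> list[tuple[str, str]]:
    """Return (sample_id, ground-truth transcript) pairs for the evaluation pipeline."""
    return list(conn.execute("SELECT sample_id, transcript FROM voicemail_labels"))
```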

Over time, we’ve manually labeled tens of thousands of examples for different tasks while maintaining stringent data privacy practices. This high volume is essential for creating robust evaluation datasets and training models to handle edge cases.
By investing in this foundational step from the start, we’ve ensured that our iterative process is built on reliable data. While online evaluations can provide insights, they’re often too slow or costly to rely on exclusively. Instead, our upfront investment in accurate labeling has accelerated our ability to experiment, iterate, and improve the system’s overall performance.
We’ve refrained from using LLMs to replace human judgment and labeling in this area so far. While automation might seem appealing, those LLMs themselves require careful oversight, maintenance and, you guessed it, evaluations with labeled data. Make sure you focus on making your AI system do its primary task well before expanding towards a machine learning flywheel through complex automations, such as generating synthetic data or using LLMs as evaluators.
Model Base Performance vs. Learning
During the development of Choco AI, we relied on a simple and straightforward framework. We first broke down the overall workflow into three distinct components: text extraction, information extraction and matching. We then followed the principle mentioned above and broke each component into smaller tasks. For each of these, we asked ourselves the following questions:
- How can we solve each task?
- How do we improve the performance of each solution?
- Does it require a human or can it be automated?

To track progress, we measure two critical aspects: the ‘Day-0 performance’—the system’s baseline accuracy upon deployment—and the ‘learning curve,’ which reflects how effectively it adapts based on feedback. Choco AI’s promise is to automate the order intake process. As such, we want to minimize the edits required per order in the review flow. This is measured on a customer level.
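As a simplified illustration of the second metric (the record fields below are ours, not our production schema), edits per order can be averaged per customer and per week, with a falling trend indicating the learning curve:

```python
# Illustrative metric sketch; the order record fields are hypothetical.
from collections import defaultdict
from statistics import mean


def edits_per_order_by_customer(orders):
    """orders: iterable of dicts like {'customer_id': ..., 'week': ..., 'num_edits': ...}."""
    buckets = defaultdict(list)
    for order in orders:
        buckets[(order["customer_id"], order["week"])].append(order["num_edits"])
    # A falling average over successive weeks for a given customer means the
    # system is learning from that customer's corrections.
    return {key: mean(edits) for key, edits in buckets.items()}
```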

A first version for any solution must include some mechanism for continuous improvement, ideally one that is automated. Our evaluation metrics, both operational and business-focused, are designed to address these dual goals. From the outset, we are transparent with customers, acknowledging that the system may produce errors initially but will learn and improve through corrections over time. This is why we designed our product with a human review interface for a customer’s initial orders. Borrowing the tomato example from earlier, this review UI teaches Choco AI which tomatoes Andrea expects from the 40 different varieties the distributor offers. If our system predicts the wrong tomato, the customer can make that correction in the UI, which Choco AI will automatically learn from so that it is more likely to get it right the next time around.
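As a deliberately simplified sketch of that feedback loop (the storage and lookup shown here are illustrative; the next article covers our actual in-context learning approach), a correction can be remembered per customer and preferred over the model’s raw prediction the next time the same phrase appears:

```python
# Simplified illustration of learning from review-UI corrections; not the production design.
corrections: dict[tuple[str, str], str] = {}  # (customer_id, normalized phrase) -> catalog product


def record_correction(customer_id: str, phrase: str, chosen_product: str) -> None:
    corrections[(customer_id, phrase.lower().strip())] = chosen_product


def match_with_memory(customer_id: str, phrase: str, model_prediction: str) -> str:
    # Prefer what this customer has previously confirmed over the model's raw prediction.
    return corrections.get((customer_id, phrase.lower().strip()), model_prediction)


# record_correction("andrea", "2 kilos of tomatoes", "San Marzano canned, 500g case")
# match_with_memory("andrea", "2 kilos of tomatoes", model_prediction="Beef tomatoes, 1kg")
```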
Designing your product so that user interactions generate training data is ideal, but not always feasible. For example, voicemail orders can lead to transcription errors with ambiguous words (e.g., mishearing “haddock” as “heartache” 💔). Downstream models may also misinterpret units or quantities. Because we treat these steps as “internals” of our AI system, we avoid exposing them directly to our users. Instead, we’ve built internal tools for flagging errors, allowing humans to identify problems, provide corrections and teach the AI how to improve.
Having multiple improvement mechanisms—whether through users, our Customer Success team, or internal tooling—was crucial as we scaled Choco AI, onboarding new distributors weekly. Without these systems, each failure could have turned into a support ticket for the AI Engineering team. By combining a productized feedback system (our review UI), internal labeling and training interfaces, and thorough observability, we built “self-service” error resolution and improvement mechanisms into the product, minimizing disruptions.
———————————
Looking Ahead
Building Choco AI has provided us with invaluable insights into designing and scaling complex LLM-based AI systems. With its modular design, comprehensive evaluations, and seamless user interactions to generate data for learning, we’ve laid a solid foundation—one which we’ll leverage to continuously improve the product’s performance and to position ourselves to tackle more ambitious challenges in the future.
In our next article, we’ll highlight how we actually used data generated through labeling and user interactions to continuously improve the performance of our system. We’ll touch on why we mainly rely on various in-context learning approaches, as opposed to fine-tuning, and how we designed our system around dynamically providing LLMs with personalized context and memory to perform or improve upon a task.