End of the Road for Lambda: Choco's Journey to Kubernetes

by Nishant Kumar, Alper Cugun-Gscheidel - Choco Platform Engineering

08.07.2024

Choco is a company that creates a service to digitize the food supply chain, make ordering more efficient, and reduce waste at every stage of the process. Choco has long prided itself on being a serverless engineering organization, but we moved away from that architecture at the end of 2023. Kubernetes has proven to simplify things and to let us control our own destiny.

Here we’ll detail some of the context that led to this transition and what we think are the factors that led to a successful outcome.

End of the Road for Lambda

The decision to migrate from serverless to a container-based architecture was prompted by a series of challenges we encountered with Lambda and DynamoDB. While Lambda initially offered the allure of low-maintenance serverless computing, we found ourselves grappling with its limitations:

  • Escalating costs (in AWS but also in Datadog) that scaled at least linearly with our high request volumes
  • Latency issues around cold starts (more than 100ms for Node functions) that make Lambda fundamentally unsuitable for interactive applications
  • Limits on memory and execution time that necessitated awkward workarounds
  • Lack of a standardized application framework and a substandard developer experience
  • A small and stagnant ecosystem

As we evaluated our options, we assessed that we had invested a lot in Lambda already but that it did not make sense to invest more. To add to that, the custom code and workarounds that we employed were challenging to understand and maintain. Our developers had created some 700 Lambda functions over the years across dozens of ‘services’ which were non-trivial to manage. We knew we were at a dead end.

Coming up with a Migration Strategy

Moving away from Lambda after so much investment went into it was not a decision we made overnight. We waited for the right moment and for the right key people to be in place so that we could execute our transition to Kubernetes with confidence.

The strategy to move away consisted of the standard components:

  • Diagnosis: Are we sure that Lambda is unworkable for us?
  • Guiding prescription: What do we think the solution will be?
  • Coherent action: Can we make a plan for executing the solution?

Diagnosis

We talked to people to figure out who else had dealt with serverless (for example: https://tier.engineering/Our-journey-to-faas-with-knative) and had maybe already gone through a similar transition. We wanted to know what it would take to be successful with Lambda.

[Image: Our pro/con analysis of Serverless vs Kubernetes]

From our conversations with others we learned that Lambda would work better if:

  1. You use fat Lambdas and pack lots of functions into a single artifact (see the sketch below)
  2. You write your Lambdas in Rust, which makes cold starts irrelevant
  3. You get rid of the custom middleware code
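
For illustration, here is a minimal sketch of the "fat Lambda" pattern in TypeScript: one deployable artifact exposing many logical functions behind a single dispatching handler. The routes and handler names are hypothetical, not code we actually ran.

```typescript
// A minimal "fat Lambda": one artifact, many logical functions,
// dispatched by method and path. Routes are purely illustrative.
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

type RouteHandler = (event: APIGatewayProxyEvent) => Promise<APIGatewayProxyResult>;

// All "functions" live in one artifact instead of one Lambda each.
const routes: Record<string, RouteHandler> = {
  'GET /orders': async () => ({ statusCode: 200, body: JSON.stringify([]) }),
  'POST /orders': async (event) => ({ statusCode: 201, body: event.body ?? '' }),
};

// Single entry point; API Gateway proxies every path to this handler.
export const handler: RouteHandler = async (event) => {
  const route = routes[`${event.httpMethod} ${event.path}`];
  return route ? route(event) : { statusCode: 404, body: 'Not found' };
};
```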

Making these changes would be a big investment and after that we’d still be in the same billing model, have a poor local development experience and no ecosystem to speak of. We decided we were better off investing that effort into something that has legs.

Prescription

We briefly considered alternatives to Kubernetes but quickly found that there is no real alternative in the industry at the moment. Nomad is relatively niche, and something like ECS would be convenient but would most likely also be a dead end. Kubernetes has wide adoption, which pays off in every possible respect: mindshare, talent, education resources, tooling, ecosystem, etc.

We did choose Amazon's EKS service to provide Kubernetes for us: right now, we do not want to put in the operational effort ourselves, and we consider the lock-in manageable.

Going for the most established project in the room has been our guiding principle for other parts of the new platform as well. Alongside Kubernetes we chose Postgres (provided by RDS) to replace DynamoDB and we picked Nest.js as the web application framework. When it comes to ecosystems, bigger is better.

Coherent Action

[Image: Some initial sketches of the architecture, including a Hono (❤️) based translation layer]

Being crystal clear about the problem being solved and the chosen solution made it much easier to spring into action quickly. When we saw that we had the right talent on our Platform team and enough breathing room to bootstrap a cluster, we kicked it off.

Using Amazon EKS Blueprints for CDK gave us fully bootstrapped EKS clusters, equipped with the operational Kubernetes add-ons essential for deploying and managing workloads. Because we already had expertise managing infrastructure with AWS CDK, this sped up our time to production immensely. For deployments we used Helm, which is widespread and integrates smoothly with CDK as well.
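
To give a flavor of what that looks like, here is a hedged sketch of bootstrapping a cluster with EKS Blueprints for CDK; the cluster name, region, and add-on selection are illustrative, not our production configuration.

```typescript
// Illustrative EKS Blueprints for CDK setup; names and add-ons are examples only.
import * as cdk from 'aws-cdk-lib';
import * as blueprints from '@aws-quickstart/eks-blueprints';

const app = new cdk.App();

blueprints.EksBlueprint.builder()
  .region('eu-central-1') // hypothetical region
  .addOns(
    new blueprints.addons.VpcCniAddOn(),                     // pod networking
    new blueprints.addons.MetricsServerAddOn(),              // resource metrics for autoscaling
    new blueprints.addons.AwsLoadBalancerControllerAddOn(),  // ingress via ALB/NLB
  )
  .build(app, 'eks-blueprint-cluster'); // synthesizes a fully bootstrapped cluster stack
```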

The tooling we chose was extremely convenient and a big factor in us being able to go from start to production-ready EKS clusters within a couple of months. At the same time we also chose the application framework and provisioned the new database instances. What looked like a big lift turned out to be a relatively linear effort.

The New Old Developer Experience

The result sees us coming back to a very classic developer experience directly in line with our engineering strategy to “Reduce tech real-estate, simplify existing solutions […]”. We are innovating in so many areas of our product that we don’t have a lot of time to deal with poorly-supported or half-baked tools.

Nest.js provides us the standard way of developing services that any web framework probably would. We wanted to stay with TypeScript as our programming language of choice. Nest has a particular way of doing things, but developers pick it up quickly, and once you have the hang of it, developing things is an order of magnitude (!) faster than it was with Lambda.
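
For a sense of that standard way of doing things, here is a minimal, self-contained Nest.js service; the module and controller below are a hypothetical example, not one of our services.

```typescript
// A complete, minimal Nest.js service; names are illustrative only.
import { Controller, Get, Module } from '@nestjs/common';
import { NestFactory } from '@nestjs/core';

@Controller('health')
class HealthController {
  // GET /health — a trivial endpoint showing the declarative routing style.
  @Get()
  check(): { status: string } {
    return { status: 'ok' };
  }
}

@Module({ controllers: [HealthController] })
class AppModule {}

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  await app.listen(3000); // containerized services expose a plain HTTP port
}
bootstrap();
```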

Containerizing our backend services has simplified development, execution and testing across different environments. Our applications now are portable, scalable and have much clearer interfaces with their environments.

Kubernetes has proven to be not nearly as scary as it was made out to be. We feel that by keeping things lean and standardized, we were able to get off the ground really quickly. When we need it later on, there will be plenty of room to grow in both scale and complexity.

We can say without any exaggeration that this change has ushered in a new era of engineering at Choco.

[Image: Boring technology is great for teams that like to ship.]

Onwards and Upwards

As happy as we are with the change we made, we are only at the beginning of this journey. From the outset we knew that the biggest challenge was not going to be the technology. It was going to be building up a body of knowledge similar to the one we had amassed during years of working on Lambda and serverless architecture. Additionally, the move to provisioned infrastructure necessitates building operational muscle where previously we needed none.

Education

Our education program consisted of running each team through our architecture choices and the developer experience we had put together. Most developers know how to do container-based development, but how to ship and operate services varies from company to company. The program was well received and was a big part of taking product teams with us on this journey.

That notwithstanding, we have caught ourselves saying: “Our engineers should be more interested in learning Kubernetes!” The reality is that, along with their many other responsibilities, people won’t learn new things without a need to do so. The best way to learn is to get people to use it.

Migration

[Image: Moving one service to Kubernetes, showing structurally lower and less jagged p90 latency. The divergence is because the remaining couple of Lambdas spend most of their time in cold starts.]

That brings us to our migration strategy to the new cluster. It consists of several phases:

  • New services are to be developed on Kubernetes
  • We rewrite a large part of our existing flow (for business reasons) and target Kubernetes
  • Existing services are rewritten opportunistically and move either whole or piece by piece
  • Remaining services are shrink-wrapped into containers

This is working well for us so far. Prior experiences have shown us that it is nearly impossible to align the capacities and priorities of product teams along any kind of migration roadmap. Some teams have an appetite to experiment and be on the bleeding edge, while other teams are dealing with mission-critical legacy services. Some services are migrated by a single engineer in a week, while others need to be painstakingly decomposed and de-distributed. There is no one-size-fits-all solution.

[Image: Migration of Order Service from Lambda to Kubernetes]

The work we put into migrating compounds. With every step, our engineering organization gains more knowledge of the new architecture and confidence in developing for it. At the same time, our knowledge of the old platform atrophies from lack of attention. That development puts a natural expiration date on our old stack.

Stability

As a final point, we as the Platform team have a lot of learning to do ourselves. More development and more traffic mean strong requirements to ensure that the new platform is stable, performant and offers a good developer experience. It helps to have people on the team with the experience to deliver on this while we invest in growing everybody into the new architecture.

Conclusion

All in all, I think we can say that our transition from serverless to provisioned infrastructure has been a success, and I would recommend that others not get drawn too deep into the serverless billing and lock-in model in the first place. Even if you have, there’s nothing stopping you from getting out. Moving to proven technologies is nothing to be afraid of.