Step Functions at Choco: Introduction

Software Engineers Alex and Oscar share how we use Step Functions to build a new file-processing service.

02/20/2023

Before we get started, please note that this article assumes some familiarity with AWS Step Functions. As this is the first part of a series, keep a lookout for future articles that expand on this topic.

At Choco, we built an invite system so that suppliers and buyers can invite each other onto the platform. When a supplier invites a buyer, we prepare the onboarding before the “invitee” signs in for the first time. This ensures they can connect with the “inviter” and start ordering straight away. This onboarding preparation relies on us creating several entities in our database.

The problem we needed to solve

In the past, creating entities has been a manual process. Our amazing Operations teams first gather the necessary data points from existing suppliers. They then use an internal web app to create entities by entering those data points in sign-up forms. Although this process works, it doesn't scale well and is error-prone. So our Operations team asked if there was any way to automate the task.

Our internal tool teams tackled this new challenge. Our calculations showed that an end-to-end flow could save our Operations team up to 75 hours of work a month. We decided to approach this task by designing a new service from scratch. The main steps in the service workflow are: receive a list of data points, run a series of checks, create entities from the data points that pass the checks, and share the results.

Input data points are shared as CSV files, where each row contains the data needed to create the entities for one new user.

To build our service MVP we chose to use AWS Step Functions. The product combines several features that made it well-suited for the job.

What are AWS Step Functions?

Step Functions let you build a State machine that defines a workflow. A State machine is a set of States and unidirectional edges that transition between them. Each State represents a unit of work within the State machine. States fall into two categories: one defines control flow for the State machine; the other, whose members we know as Tasks, does the work. Every State has a defined input and output. When transitioning along an edge from one State to another, the output of the State that was just executed becomes the input of the subsequent State.
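To make the model concrete, here is a minimal sketch of a toy State machine runner in plain Python. It is not the Step Functions runtime: the "Handler", "Condition", "Then" and "Else" keys, the State names and the row shape are all made up for illustration. It only shows Tasks doing work, a control-flow State picking the next edge, and each output becoming the next input.

```python
from typing import Any, Dict


def run(machine: Dict[str, Any], data: Any) -> Any:
    """Walk the machine from StartAt, feeding each State's output to the next State."""
    name = machine["StartAt"]
    while True:
        state = machine["States"][name]
        if state["Type"] == "Choice":
            # Control-flow State: picks the next edge, data passes through unchanged.
            name = state["Then"] if state["Condition"](data) else state["Else"]
            continue
        # Task State: does work, and its output becomes the next State's input.
        data = state["Handler"](data)
        if state.get("End"):
            return data
        name = state["Next"]


# Hypothetical workflow: validate a row, then create entities or report a failure.
machine = {
    "StartAt": "ValidateRow",
    "States": {
        "ValidateRow": {
            "Type": "Task",
            "Handler": lambda row: {**row, "valid": "@" in row["email"]},
            "Next": "IsValid",
        },
        "IsValid": {
            "Type": "Choice",
            "Condition": lambda row: row["valid"],
            "Then": "CreateEntities",
            "Else": "ReportFailure",
        },
        "CreateEntities": {
            "Type": "Task",
            "Handler": lambda row: {**row, "status": "created"},
            "End": True,
        },
        "ReportFailure": {
            "Type": "Task",
            "Handler": lambda row: {**row, "status": "rejected"},
            "End": True,
        },
    },
}

print(run(machine, {"email": "buyer@example.com"}))
```

In the real service the handlers would be Lambda functions and the routing would be declared in the State machine definition rather than written as Python callables.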

How we benefit from Step Functions

Step Functions offer a data layer in the form of a State. Although it is scoped to the execution of one workflow, we can store and use the results of any task to determine what to do next. This was ideal for us, as we can use the State to track the execution status of different row operations without relying on a separate tool.

Step Function tasks support integrations with many other AWS services. So it was trivial to connect our new service to the existing resources it needed, like Lambda functions (to process a data point) and SNS topics (to communicate outcomes).

Step Functions also provide highly configurable constructs for control flow, with options for parallel operations or conditional logic. They can also be easily expanded should the scope of the service change. This is great for our use case, as we need to support different outcomes for a given input, and are likely to add functionality in further iterations.
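As an illustration of these three points together, the sketch below writes a small definition in Amazon States Language as a Python dictionary. The account ID, function name, topic name, State names and the "status" field are placeholders we made up for this example; the "Resource" values are the service integration identifiers for Lambda invoke and SNS publish. This is not our production definition.

```python
import json
from typing import Any, Dict, List

# Placeholder ARNs: hypothetical values for illustration only.
PROCESS_ROW_FN = "arn:aws:lambda:eu-west-1:123456789012:function:process-row"
RESULTS_TOPIC = "arn:aws:sns:eu-west-1:123456789012:file-processing-results"

definition: Dict[str, Any] = {
    "Comment": "Process one data point and publish the outcome",
    "StartAt": "ProcessRow",
    "States": {
        "ProcessRow": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": PROCESS_ROW_FN, "Payload.$": "$.row"},
            # Keep only the status from the Lambda response, stored under $.result.
            "ResultSelector": {"status.$": "$.Payload.status"},
            "ResultPath": "$.result",
            "Next": "CheckOutcome",
        },
        "CheckOutcome": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.result.status",
                    "StringEquals": "created",
                    "Next": "NotifySuccess",
                }
            ],
            "Default": "NotifyFailure",
        },
        "NotifySuccess": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": RESULTS_TOPIC, "Message": "Row processed"},
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": RESULTS_TOPIC, "Message": "Row rejected"},
            "End": True,
        },
    },
}


def transition_targets(state: Dict[str, Any]) -> List[str]:
    """Collect every State name this State can transition to."""
    targets = [state[key] for key in ("Next", "Default") if key in state]
    targets += [choice["Next"] for choice in state.get("Choices", [])]
    return targets


# Sanity check: every edge should point at a State that exists.
dangling = [
    target
    for state in definition["States"].values()
    for target in transition_targets(state)
    if target not in definition["States"]
]

print(json.dumps(definition, indent=2))
```

Adding a new outcome later means adding one more entry to "Choices" and one more State, which is the kind of expansion we had in mind.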

Contrasting AWS Step Functions with Queues and Lambdas

AWS Lambda is often combined with SQS Queues to define workflows. These workflows tend to be a non-branching series of actions. Systems designed with these components are cost-effective, easy to maintain, and reliable. We use them all the time at Choco.

So why did we choose AWS Step Functions when we already have a set of tools for defining workflows? It’s a matter of complexity. Step Functions are a better option as soon as a workflow has a significant amount of branching and/or control flow. In this case, the abstractions and developer experience they provide are superior. For simple and non-branching flows, Queues and Lambdas work just fine.

Developing AWS Step Functions

We chose the AWS CDK to define our Step Function and get the benefits of infrastructure as code. This allowed us to quickly spin up instances of our service in test accounts during development and testing. In the case of Step Functions, we also found that it made our infrastructure code more readable, as we could abstract common patterns.

Learnings and methodology

As mentioned, we found Step Functions great for dealing with complex workflows. That said, it's a complex tool itself. Let’s take a look at the method we used that helped us keep our code clean.

State management

A Step Function contains two layers of information:

The workflow: which State invokes which. We see this as the infrastructure layer. It persists between workflow executions.
The data: input and output data for the different States. We see this as the data layer. It is scoped to a single workflow execution.

Both grow in complexity as more States are created, for example when more branching is introduced. To organize our data layer, we defined contracts for each Task in our flow. These enforce the input and output shape at every step. Contracts allowed us to keep the “data” State uniform throughout the workflow.
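A contract of this kind could look like the following sketch. The field names, the allowed status values and the handler are hypothetical; the point is only that every Task parses its input into one shape and returns one shape.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class TaskInput:
    """Hypothetical contract for what a Task receives."""
    file_key: str
    row_index: int


@dataclass(frozen=True)
class TaskOutput:
    """Hypothetical contract for what a Task returns."""
    row_index: int
    status: str
    error: Optional[str] = None

    def __post_init__(self) -> None:
        if self.status not in ("ok", "failed"):
            raise ValueError(f"unknown status: {self.status!r}")
        if self.status == "failed" and not self.error:
            raise ValueError("a failed Task must explain its error")


def handler(event: Dict[str, Any]) -> Dict[str, Any]:
    """Example Task handler that honours both contracts."""
    # Raises TypeError if the event has missing or unexpected keys.
    task_input = TaskInput(**event)
    output = TaskOutput(row_index=task_input.row_index, status="ok")
    return asdict(output)


print(handler({"file_key": "invites.csv", "row_index": 3}))
```

Because every Task returns the same fields, a control-flow State placed after any of them can branch on "status" without knowing which Task ran.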

Another complication arises from a Step Function limitation: the maximum size of the data layer. The State is limited to 256 KB, and executions that exceed this limit fail.
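A quick way to reason about this limit is to measure the serialized State before handing it on. The sketch below assumes the limit applies to the UTF-8 encoded JSON payload; the helper names and the 80% headroom factor are our own choices for illustration, and the service may serialize slightly differently.

```python
import json
from typing import Any

# 256 KB expressed in bytes.
MAX_STATE_BYTES = 256 * 1024


def state_size_bytes(state: Any) -> int:
    """Approximate size of the State once serialized as compact UTF-8 JSON."""
    text = json.dumps(state, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def fits_in_state(state: Any, headroom: float = 0.8) -> bool:
    """True if the State stays under a safety margin below the hard limit."""
    return state_size_bytes(state) <= MAX_STATE_BYTES * headroom


small = {"file": "invites.csv", "rows": [{"email": "buyer@example.com"}]}
large = {"blob": "x" * MAX_STATE_BYTES}

print(state_size_bytes(small), fits_in_state(small))
print(state_size_bytes(large), fits_in_state(large))
```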

To keep the State size to a minimum, we employed three different methods:

Cleanup: every State uses ResultPath to overwrite part of its input, which keeps the size of the resulting objects in check. This way we ensure only the required data passes to the next tasks.
Leverage S3: as suggested by the AWS documentation, you can't rely on loading all contents into the State when using files as the base of a data layer. In our case, we constructed a flow that maps over a fixed number of rows from the input file at a time. One of our reasons for selecting Step Functions was to reduce our reliance on external tools for a data layer, but this seemed like an acceptable compromise to design a scalable system.
Control flow: when the amount of information to load into the data layer is substantial, we can use certain States to split it. Map States, for example, act as sub-Step Functions with their own State size limits.
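The first two methods can be sketched in plain Python. Here apply_result_path is a simplified stand-in for a ResultPath of the form "$.key" (the real feature supports more path forms), csv_text stands for file contents already read from S3, and the bucket, key, column names and batch size are hypothetical.

```python
import csv
import io
from typing import Any, Dict, Iterator, List


def apply_result_path(state_input: Dict[str, Any], result: Any, key: str) -> Dict[str, Any]:
    """Simplified ResultPath "$.<key>": replace one top-level field of the input."""
    return {**state_input, key: result}


def row_batches(csv_text: str, batch_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield the file's rows a fixed number at a time."""
    batch: List[Dict[str, str]] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


csv_text = (
    "name,email\n"
    "Ada,ada@example.com\n"
    "Bob,bob@example.com\n"
    "Cy,cy@example.com\n"
)

# The State carries a pointer to the file and a small summary, never the rows.
state: Dict[str, Any] = {"file": {"bucket": "uploads", "key": "invites.csv"}, "lastBatch": None}

for batch in row_batches(csv_text, batch_size=2):
    summary = {"processed": len(batch)}
    # Overwrite the previous summary instead of accumulating one per batch.
    state = apply_result_path(state, summary, "lastBatch")

print(state)
```

The State stays the same size no matter how many rows the file has, since each batch overwrites the summary of the previous one.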

Business logic in the infra

We could leverage some handy built-in features by moving as much logic as possible into the Step Function States rather than keeping it in the handler code. For Lambdas that make calls to external services, we were able to define retry policies at the task level for a certain range of errors. This is a strong argument for keeping the work that each task does granular. Doing so lets the infrastructure layer deal with every possible outcome of a task.
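A task-level retry policy could look like the sketch below, written in Amazon States Language as a Python dictionary. The interval, attempt count and backoff rate are illustrative values rather than the ones we run with, and the listed errors are an assumption about which transient Lambda errors are worth retrying. The retry_delays helper is ours and only shows the waits such a policy implies.

```python
from typing import Any, Dict, List

# Hypothetical policy for transient Lambda errors.
RETRY_TRANSIENT_LAMBDA_ERRORS: Dict[str, Any] = {
    "ErrorEquals": [
        "Lambda.ServiceException",
        "Lambda.AWSLambdaException",
        "Lambda.SdkClientException",
        "Lambda.TooManyRequestsException",
    ],
    "IntervalSeconds": 2,
    "MaxAttempts": 3,
    "BackoffRate": 2.0,
}


def with_retry(task: Dict[str, Any], *retriers: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a Task State with the given retriers attached."""
    return {**task, "Retry": list(retriers)}


def retry_delays(retrier: Dict[str, Any]) -> List[float]:
    """Seconds waited before each retry: the interval grows by BackoffRate each time."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


call_external_service = with_retry(
    {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": "create-entities", "Payload.$": "$"},
        "End": True,
    },
    RETRY_TRANSIENT_LAMBDA_ERRORS,
)

print(retry_delays(RETRY_TRANSIENT_LAMBDA_ERRORS))
```

The handler code stays free of retry loops; if a task did several unrelated things, one retry policy could not fit all of them, which is why granularity matters.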

Abstract the infra code

Using the CDK to build our infrastructure allowed us to abstract reused code. We packaged it as Step Function fragments or any combination of chainable States. We defined common patterns for Lambda Invoke with specific retries. Having a consistent output shape allows us to move Choice States into some of these fragments.

Creating those intermediary building blocks made our infra code cleaner. We're also better able to accommodate different file processing requirements down the line.
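The idea behind such a fragment can be sketched without the CDK: a function that returns a reusable group of States. Plain dictionaries stand in here for the CDK constructs we actually use, and the function name, State names, ARNs and the "status" field are hypothetical.

```python
from typing import Any, Dict


def invoke_with_check(
    prefix: str, function_arn: str, on_success: str, on_failure: str
) -> Dict[str, Any]:
    """Reusable fragment: invoke a Lambda with retries, then branch on its status."""
    invoke, check = f"{prefix}Invoke", f"{prefix}Check"
    return {
        invoke: {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": function_arn, "Payload.$": "$"},
            # Pass on only the Lambda's own response, which follows the shared shape.
            "OutputPath": "$.Payload",
            "Retry": [
                {
                    "ErrorEquals": ["Lambda.ServiceException"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": check,
        },
        check: {
            "Type": "Choice",
            "Choices": [{"Variable": "$.status", "StringEquals": "ok", "Next": on_success}],
            "Default": on_failure,
        },
    }


# Chain two fragments: the first one's success edge enters the second fragment.
states: Dict[str, Any] = {
    **invoke_with_check(
        "Validate",
        "arn:aws:lambda:eu-west-1:123456789012:function:validate-row",
        on_success="CreateInvoke",
        on_failure="Failed",
    ),
    **invoke_with_check(
        "Create",
        "arn:aws:lambda:eu-west-1:123456789012:function:create-entities",
        on_success="Done",
        on_failure="Failed",
    ),
    "Done": {"Type": "Succeed"},
    "Failed": {"Type": "Fail", "Error": "RowRejected"},
}

definition = {"StartAt": "ValidateInvoke", "States": states}
print(sorted(states))
```

The Choice State can live inside the fragment only because both Lambdas return the same "status" field, which is where the contracts pay off.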

A case against Step Functions

Although the tool is versatile, it needs a lot of boilerplate. So, for simpler flows that don't need much branching, we recommend other options, like the Topic-Queue-Lambda pattern.

It's also difficult to test a workflow or partial workflow execution. We relied heavily on running Step Functions locally. But maintaining mocks for every task in every workflow under test is a significant amount of work.

Overall, Step Functions is a great tool for organizing complex workflows and lends itself well to file processing. We will explore testing options for Step Functions in more detail later in this series, so stay tuned!
