beingsde.com | Master System Design

# Multi-step Processes

Learn about multi-step processes and how to handle them in your system design with distributed transactions, sagas, workflow systems, and durable execution.

Updated Jun 11, 2026

Real production systems must survive failures, retries, and long-running operations spanning hours or days. Often they take the form of multi-step processes or sagas, which coordinate multiple services and systems. This is a continual source of operational and design challenges for engineers, and there are a variety of solutions.

## The Problem

Building reliable multi-step processes in distributed systems is startlingly hard. While clean systems like databases often get to deal with a single "write" or "read", real applications often need to coordinate dozens of (flaky) services and systems to do the user's bidding, and doing this quickly and reliably is a common challenge.

Jimmy Bogard has a great talk about this titled "Six Little Lines of Fail", with the premise that distributed systems make even a simple sequence of steps surprisingly hard (if you haven't had to deal with systems like this before, it's a great watch).

Consider an e-commerce order fulfillment workflow: charge payment, reserve inventory, create a shipping label, wait for a warehouse worker to pick the item, and send the confirmation email. Each step involves calling different services or waiting on humans, any of which might fail or time out. Some steps require us to call out to external systems (like a payment gateway) and wait for them to complete. During the orchestration, your server might crash or be redeployed. And maybe you want to change the ordering or nature of the steps! The messy complexity of business needs and real-world infrastructure quickly breaks down our otherwise pure flow chart of steps.

Figure: Order Fulfillment Nightmare

There are, of course, patches we can organically make to processes like this.
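Before turning to those patches, a rough Python sketch of the order fulfillment flow above may help show what they are up against. Everything here is hypothetical: the step functions, `CrashError`, and the `ledger` list are stand-ins for real services and their side effects, and the warehouse pick step is omitted for brevity. The sketch assumes the only recovery strategy is to restart the whole flow from the top.

```python
class CrashError(Exception):
    """Hypothetical stand-in for the server dying mid-workflow."""


def charge_payment(order_id, ledger):
    ledger.append(("charge", order_id))


def reserve_inventory(order_id, ledger):
    ledger.append(("reserve", order_id))


def create_shipping_label(order_id, ledger, crash=False):
    if crash:
        # Simulate a crash or deploy before the label is created.
        raise CrashError("server restarted")
    ledger.append(("label", order_id))


def send_confirmation(order_id, ledger):
    ledger.append(("email", order_id))


def fulfill_order(order_id, ledger, crash_at_label=False):
    # Progress lives only in this call stack, so a crash forgets it.
    charge_payment(order_id, ledger)
    reserve_inventory(order_id, ledger)
    create_shipping_label(order_id, ledger, crash=crash_at_label)
    send_confirmation(order_id, ledger)


def run_with_naive_retry(order_id):
    ledger = []  # records every side effect that reached a "service"
    try:
        fulfill_order(order_id, ledger, crash_at_label=True)
    except CrashError:
        # Nothing recorded which steps finished, so start from the top.
        fulfill_order(order_id, ledger)
    return [step for step, _ in ledger]


if __name__ == "__main__":
    print(run_with_naive_retry("order-1"))
```

In this sketch the customer is charged twice and inventory is reserved twice, because the restart has no record of which steps already succeeded. Remembering progress across crashes is the gap the approaches in this article aim to close.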
We can fortify each service to handle failures, use delay queues and hooks to handle waits and human tasks, and so on, but each of these patches makes the system more complex and brittle. We interweave system-level concerns (crashes, retries, failures) with business-level concerns (what happens if we can't find the item?). Not a great design!

Workflow systems, event-driven sagas, and durable execution are the solutions to this problem, and they show up in many system design interviews, particularly when there is a lot of state and a lot of failure handling. AI system design interviews lean on this even harder, since agent pipelines are long chains of exactly these flaky, stateful steps. Interviewers love to ask about this because it dominates the on-call rotation for many production teams and gets at the heart of what makes many distributed systems hard to build well.

In this article, we'll cover what these solutions are, how they work, and how to use them in your interviews.

Problem breakdowns that use the Multi-step Processes pattern: Uber Payment System

## Solutions

### Single Server Primitives

### The Saga Pattern

### Event-Driven Choreography

### Workflow Orchestration

#### Durable Execution Engines

#### How Durable Execution Works

#### Managed Workflow Systems

#### Implementations

## When to Use in Interviews

### Common interview scenarios

### When NOT to use it in an interview

## Common Deep Dives

### "What happens if the process running your saga crashes partway through?"

### "How will you handle updates to the workflow?"

#### Workflow Versioning

#### Workflow Migrations

### "How do we keep the workflow state size in check?"

### "How do we deal with external events?"

### "How can we ensure X step runs exactly once?"

## Conclusion

