Operating Software
Featured articles
Introducing “Delayed”: Resilient Background Jobs on Rails
In the past 24 hours, a Ruby on Rails application at Betterment performed somewhere on the order of 10 million asynchronous tasks. While many of these tasks merely sent a transactional email, or fired off an iOS or Android push notification, plenty involved the actual movement of money—deposits, withdrawals, transfers, rollovers, you name it—while others kept Betterment’s information systems up-to-date—syncing customers’ linked account information, logging events to downstream data consumers, the list goes on.

What all of these tasks had in common (aside from being, well, really important to our business) is that they were executed via a database-backed job-execution framework called Delayed, a newly-open-sourced library that we’re excited to announce… right now, as part of this blog post!

And, yes, you heard that right. We run millions of these so-called “background jobs” daily using a SQL-backed queue—not Redis, or RabbitMQ, or Kafka, or, um, you get the point—and we’ve very intentionally made this choice, for reasons that will soon be explained! But first, let’s back up a little and answer a few basic questions.

Why Background Jobs?

In other words, what purpose do these background jobs serve? And how does running millions of them per day help us?

Well, when building web applications, we (as web application developers) strive to build pages that respond quickly and reliably to web requests. One might say that this is the primary goal of any webapp—to provide a set of HTTP endpoints that reliably handle all the success and failure cases within a specified amount of time, and that don’t topple over under high-traffic conditions. This is made possible, at least in part, by the ability to perform units of work asynchronously. In our case, via background jobs. At Betterment, we rely on said jobs extensively, to limit the amount of work performed during the “critical path” of each web request, and also to perform scheduled tasks at regular intervals. Our reliance on background jobs even allows us to guarantee the eventual consistency of our distributed systems, but more on that later. First, let’s take a look at the underlying framework we use for enqueuing and executing said jobs.

Frameworks Galore!

And, boy howdy, are there plenty of available frameworks for doing this kind of thing! Ruby on Rails developers have the choice of resque, sidekiq, que, good_job, delayed_job, and now... delayed, Betterment’s own flavor of job queue!

Thankfully, Rails provides an abstraction layer on top of these, in the form of the Active Job framework. This, in theory, means that all jobs can be written in more or less the same way, regardless of the job-execution backend. Write some jobs, pick a queue backend with a few desirable features (priorities, queues, etc.), run some job worker processes, and we’re off to the races! Sounds simple enough!

Unfortunately, if it were so simple we wouldn’t be here, several paragraphs into a blog post on the topic. In practice, deciding on a job queue is more complicated than that. Quite a bit more complicated, because each backend framework provides its own set of trade-offs and guarantees, many of which will have far-reaching implications in our codebase. So we’ll need to consider carefully!
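To make that concrete, here is a minimal sketch of what writing and enqueuing a job looks like through Active Job; the job and mailer names are invented for illustration, but perform and perform_later are the standard Rails interface that each of these backends plugs into.

class SendReceiptJob < ApplicationJob
  queue_as :default

  # Executed later, by whichever queue backend the app has configured.
  def perform(user_id)
    user = User.find(user_id)
    ReceiptMailer.with(user: user).receipt_email.deliver_now # hypothetical mailer
  end
end

# Elsewhere, e.g. in a controller action:
SendReceiptJob.perform_later(user.id)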
How To Choose A Job Framework

The delayed rubygem is a fork of both delayed_job and delayed_job_active_record, with several targeted changes and additions, including numerous performance & scalability optimizations that we’ll cover towards the end of this post. But first, in order to explain how Betterment arrived where we did, we must explain what it is that we need our job queue to be capable of, starting with the jobs themselves.

You see, a background job essentially represents a tiny contract. Each consists of some action being taken for / by / on behalf of / in the interest of one or more of our customers, and that must be completed within an appropriate amount of time. Betterment’s engineers decided, therefore, that it was critical to our mission that we be capable of handling each and every contract as reliably as possible. In other words, every job we attempt to enqueue must, eventually, reach some form of resolution.

Of course, job “resolution” doesn’t necessarily mean success. Plenty of jobs may complete in failure, or simply fail to complete, and may require some form of automated or manual intervention. But the point is that jobs are never simply dropped, or silently deleted, or lost to the cyber-aether, at any point, from the moment we enqueue them to their eventual resolution. This general property—the ability to enqueue jobs safely and ensure their eventual resolution—is the core feature that we have optimized for. Let’s call it resilience.

Optimizing For Resilience

Now, you might be thinking, shouldn’t all of these ActiveJob backends be, at the very least, safe to use? Isn’t “resilience” a basic feature of every backend, except maybe the test/development ones? And, yeah, it’s a fair question. As the author of this post, my tactful attempt at an answer is that, well, not all queue backends optimize for the specific kind of end-to-end resilience that we look for. Namely, the guarantee of at-least-once execution.

Granted, having “exactly-once” semantics would be preferable, but if we cannot be sure that our jobs run at least once, then we must ask ourselves: how would we know if something didn’t run at all? What kind of monitoring would be necessary to detect such a failure, across all the features of our app, and all the types of jobs it might try to run? These questions open up an entirely different can of worms, one that we would prefer remained firmly sealed.

Remember, jobs are contracts. A web request was made, code was executed, and by enqueuing a job, we said we'd eventually do something. Not doing it would be... bad. Not even knowing we didn't do it... very bad. So, at the very least, we need the guarantee of at-least-once execution.

Building on at-least-once guarantees

If we know for sure that we’ll fully execute all jobs at least once, then we can write our jobs in such a way that makes the at-least-once approach reliable and resilient to failure. Specifically, we’ll want to make our jobs idempotent—basically, safely retryable, or resumable—and that is on us as application developers to ensure on a case-by-case basis. Once we solve this very solvable idempotency problem, we’re on track for the same net result as an “exactly-once” approach, even if it takes a couple of extra attempts to get there.
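What might an idempotent job look like in practice? Here is one minimal sketch, with an invented model and state methods: the job consults durable state before acting, so a retried run simply becomes a no-op.

class SettleDepositJob < ApplicationJob
  def perform(deposit_id)
    deposit = Deposit.find(deposit_id) # hypothetical model

    deposit.with_lock do
      # Re-check under a row lock so a concurrent retry can't double-settle.
      deposit.settle! unless deposit.settled?
    end
  end
end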
Furthermore, this combination of at-least-once execution and idempotency can then be used in a distributed systems context, to ensure the eventual consistency of changes across multiple apps and databases. Whenever a change occurs in one system, we can enqueue idempotent jobs notifying the other systems, and retry them until they succeed, or until we are left with stuck jobs that must be addressed operationally. We still have to concern ourselves with other distributed systems pitfalls like event ordering, but we don’t have to worry about messages or events disappearing without a trace due to infrastructure blips.

So, suffice it to say, at-least-once semantics are crucial in more ways than one, and not all ActiveJob backends provide them. Redis-based queues, for example, can only be as durable (the “D” in “ACID”) as the underlying datastore, and most Redis deployments intentionally trade off some durability for speed and availability. Plus, even when running in the most durable mode, Redis-based ActiveJob backends tend to dequeue jobs before they are executed, meaning that if a worker process crashes at the wrong moment, or is terminated during a code deployment, the job is lost. These frameworks have recently begun to move away from this LPOP-based approach, in favor of using RPOPLPUSH (to atomically move jobs to a queue that can then be monitored for orphaned jobs), but outside of Sidekiq Pro, this strategy doesn’t yet seem to be broadly available.

And these job execution guarantees aren’t the only area where a background queue might fail to be resilient. Another big resilience failure happens far earlier, during the enqueue step.

Enqueues and Transactions

See, there’s a major “gotcha” that may not be obvious from the list of ActiveJob backends. Specifically, it’s that some queues rely on an app’s primary database connection—they are “database-backed,” against the app’s own database—whereas others rely on a separate datastore, like Redis. And therein lies the rub, because whether or not our job queue is colocated with our application data will greatly inform the way that we write any job-adjacent code.

More precisely, when we make use of database transactions (which, when we use ActiveRecord, we assuredly do whether we realize it or not), a database-backed queue will ensure that enqueued jobs will either commit or roll back with the rest of our ActiveRecord-based changes. This is extremely convenient, to say the least, since most jobs are enqueued as part of operations that persist other changes to our database, and we can in turn rely on the all-or-nothing nature of transactions to ensure that neither the job nor the data mutation is persisted without the other.

Meanwhile, if our queue existed in a separate datastore, our enqueues would be completely unaware of the transaction, and we’d run the risk of enqueuing a job that acts on data that was never committed, or (even worse) we’d fail to enqueue a job even when the rest of the transactional data was committed. This would fundamentally undermine our at-least-once execution guarantees!

We already use ACID-compliant datastores to solve these precise kinds of data persistence issues, so with the exception of really, really high volume operations (where a lot of noise and data loss can—or must—be tolerated), there’s really no reason not to enqueue jobs co-transactionally with other data changes. And this is precisely why, at Betterment, we start each application off with a database-backed queue, co-located with the rest of the app’s data, with the guarantee of at-least-once job execution.
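Here is a rough sketch of what that buys us in practice; the models are hypothetical, but the key point is that with a database-backed queue the enqueue is just another row insert inside the surrounding transaction.

ActiveRecord::Base.transaction do
  deposit = account.deposits.create!(amount_cents: 10_000) # hypothetical models
  SettleDepositJob.perform_later(deposit.id)

  # With a database-backed queue, the enqueue above commits or rolls back
  # together with the deposit row. With a queue in a separate datastore,
  # the job would be enqueued immediately, even if this transaction later
  # rolls back.
end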
By the way, this is a topic I could talk about endlessly, so I’ll leave it there for now. If you’re interested in hearing me say even more about resilient data persistence and job execution, feel free to check out Can I break this?, a talk I gave at RailsConf 2021!

But in addition to the resiliency guarantees outlined above, we’ve also given a lot of attention to the operability and the scalability of our queue. Let’s cover operability first.

Maintaining a Queue in the Long Run

Operating a queue means being able to respond to errors and recover from failures, and also being generally able to tell when things are falling behind. (Essentially, it means keeping our on-call engineers happy.) We do this in two ways: with dashboards, and with alerts.

Our dashboards come in a few parts. Firstly, we host a private fork of delayed_job_web, a web UI that allows us to see the state of our queues in real time and drill down to specific jobs. We’ve extended the gem with information on “erroring” jobs (jobs that are in the process of retrying but have not yet permanently failed), as well as the ability to filter by additional fields such as job name, priority, and the owning team (which we store in an additional column).

We also maintain two other dashboards in our cloud monitoring service, Datadog. These are powered by instrumentation and continuous monitoring features that we have added directly to the delayed gem itself. When jobs run, they emit ActiveSupport::Notifications events that we subscribe to and then forward along to a StatsD emitter, typically as “distribution” or “increment” metrics. Additionally, we’ve included a continuous monitoring process that runs aggregate queries, tagged and grouped by queue and priority, and that emits similar notifications that become “gauge” metrics. Once all of these metrics make it to Datadog, we’re able to display a comprehensive timeboard that graphs things like average job runtime, throughput, time spent waiting in the queue, error rates, pickup query performance, and even some top 10 lists of slowest and most erroring jobs.
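As a rough illustration of that flow (and only an illustration: the event name, payload keys, and StatsD client here are assumptions rather than the delayed gem's actual instrumentation), a subscriber might look something like this:

# Assumes a StatsD client along the lines of the statsd-instrument gem.
ActiveSupport::Notifications.subscribe("delayed.job.completed") do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)

  StatsD.distribution(
    "jobs.runtime",
    event.duration, # milliseconds
    tags: ["queue:#{event.payload[:queue]}", "priority:#{event.payload[:priority]}"]
  )
end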
On the alerting side, we have Datadog monitors in place for overall queue statistics, like max age SLA violations, so that we can alert and page ourselves when queues aren’t working off jobs quickly enough. Our SLAs are actually defined on a per-priority basis, and we’ve added a feature to the delayed gem called “named priorities” that allows us to define priority-specific configs. These represent integer ranges (entirely orthogonal to queues), and default to “interactive” (0-9), “user visible” (10-19), “eventual” (20-29), and “reporting” (30+), with default alerting thresholds focused on retry attempts and runtime.

There are plenty of other features that we’ve built that haven’t made it into the delayed gem quite yet. These include the ability for apps to share a job queue but run separate workers (i.e. multi-tenancy), team-level job ownership annotations, resumable bulk orchestration and batch enqueuing of millions of jobs at once, forward-scheduled job throttling, and also the ability to encrypt the inputs to jobs so that they aren’t visible in plaintext in the database. Any of these might be the topic for a future post, and might someday make their way upstream into a public release!

But Does It Scale?

As we've grown, we've had to push at the limits of what a database-backed queue can accomplish. We’ve baked several improvements into the delayed gem, including a highly optimized, SKIP LOCKED-based pickup query, multithreaded workers, and a novel “max percent of max age” metric that we use to automatically scale our worker pool up to ~3x its baseline size when queues need additional concurrency.

Eventually, we could explore ways of feeding jobs through to higher-performance queues downstream, far away from the database-backed workers. We already do something like this for some jobs with our journaled gem, which uses AWS Kinesis to funnel event payloads out to our data warehouse (while at the same time benefiting from the same at-least-once delivery guarantees as our other jobs!). Perhaps we’d want to generalize the approach even further.

But the reality of even a fully "scaled up" queue solution is that, if it is doing anything particularly interesting, it is likely to be database-bound. A Redis-based queue will still introduce DB pressure if its jobs execute anything involving ActiveRecord models, and solutions must exist to throttle or rate limit these jobs. So even if your queue lives in an entirely separate datastore, it can be effectively coupled to your DB's IOPS and CPU limitations.

So does the delayed approach scale? To answer that question, I’ll leave you with one last takeaway. A nice property that we’ve observed at Betterment, and that might apply to you as well, is that the number of jobs tends to scale proportionally with the number of customers and accounts. This means that when we naturally hit vertical scaling limits, we could, for example, shard or partition our job table alongside our users table. Then, instead of operating one giant queue, we’d have broken things down into a number of smaller queues, each with its own worker pool, emitting metrics that can be aggregated with almost the same observability story we have today. But we’re getting into pretty uncharted territory here, and, as always, your mileage may vary!

Try it out!

If you’ve read this far, we’d encourage you to take the leap and test out the delayed gem for yourself! Again, it combines both DelayedJob and its ActiveRecord backend, and should be more or less compatible with Rails apps that already use ActiveJob or DelayedJob. Of course, it may require a bit of tuning on your part, and we’d love to hear how it goes! We’ve also built an equivalent library in Java, which may also see a public release at some point. (To any Java devs reading this: let us know if that interests you!) Already tried it out? Any features you’d like to see added? Let us know what you think!
WebValve – The Magic You Need for HTTP Integration
Struggling with HTTP integrations locally? Use WebValve to define HTTP service fakes and toggle between real and fake services in non-production environments.

When I started at Betterment (the company) five years ago, Betterment (the platform) was a monolithic Java application. As good companies tend to do, it began growing—not just in terms of users, but in terms of capabilities. And our platform needed to grow along with it. At the time, our application had no established patterns or tooling for the kinds of third-party integrations that customers were increasingly expecting from fintech products (e.g., like how Venmo connects to your bank to directly deposit and withdraw money). We were also feeling the classic pain points of a growing team contributing to a single application. To keep the momentum going, we needed to transition towards a service-oriented architecture that would allow the engineers of different business units to run in parallel against their specific business goals, creating even more demand for repeatable solutions to service integration.

This brought up another problem (and the starting point for this blog post): in order to ensure tight feedback loops, we strongly believed that our devs should be able to do their work on a modern, modestly-specced laptop without internet connectivity. That meant no guaranteed connection to a cloud service mesh. And unfortunately, it’s not possible to run a local service mesh on a laptop without it melting. In short, our devs needed to be able to run individual services in isolation; by default the services were set to communicate with one another, meaning an engineer would have to run all of them locally in order to work on any one service.

To solve this problem, we developed WebValve—a tool that allows us to define and register fake implementations of HTTP services and toggle between real and fake services in non-production environments. I’m going to walk you through how we got there.

Start with the test

Here’s a look at what a test would look like to see if a deposit from a bank was initiated: The five lines of code at the bottom are the meat of the test. Easy, right? Not quite. Notice the two WebMock stub_request calls at the top. The second one has the syntax you’d expect to execute the test itself. But take a look at the first one—notice the 100+ lines of (omitted) code. Without getting into the gory details, this essentially requires us, for every test we write, to stub a request for user data—and with differences across minor things like ID values, we can’t share these stubs between tests. In short, it’s a sloppy feature spec. So how do we narrow this feature spec down to something like this? Through the magic of libraries.

First things first—defining our view of the problem space. The success of projects like these doesn’t come down to the code itself—it comes down to the ‘design’ of the solution based on its specific needs. In this case, it meant paring the conditions down to making it work using just Rails. Those conditions come to life in four major principles, which guide how we engage with the problem space for our shift to a service-oriented architecture:

- We use HTTP & REST to communicate with collaborator services
- We define the boundaries and limit the testing of integrations with contract tests
- We don't share code across service boundaries
- Engineers must remain nimble, and building features must remain enjoyable

A little bit of color on each, starting with HTTP and REST.
For APIs that we build for ourselves (e.g. internal services), we have full control over how we build them, so using HTTP and REST is no issue. We have a strong preference to use a single integration pattern for both internal and external service integrations; this reduces cognitive overhead for devs. When we’re communicating with external services, we have less control, but HTTP is the protocol of the web and REST has been around since 2000—the dawn of modern web applications—so the majority of integrations we build will use them. REST is semantic, evolvable, limber, and very familiar to us as Rails developers—a natural ‘other side of the coin’ for HTTP to make up the lingua franca of the web.

Secondly, we need to define the boundaries in terms of ‘contracts.’ Contracts are a point of exchange between the consumption side (the app) and the producer side (the collaborator service). The contract defines the expectations of input and output for the exchange. They’re an alternative to the kind of high-level systems integration tests that would include a critical mass of components that would render the test slow and non-repeatable.

Thirdly, we don't want to have shared code across service boundaries. Shared code between services creates shared ownership, and shared ownership leads to undesirable coupling. We want the API provider to own and version their APIs, and we want the API consumer to own their integration with each version of a collaborator service's API. If we were willing to accept tight coupling between our services, specifically in their API contracts, we'd be well-served by a tool like Pact. With Pact, you create a contract file based on the consumer's expectations of an API and you share it with the provider. The contract files themselves are about the syntax and structure of requests and responses rather than the interpretation. There's a human conversation and negotiation to be had about these contracts, and you can fool yourself into thinking you don't need to have that conversation if you've got a file that guarantees that you and your collaborator service are speaking the same language; you may be speaking the same words, but you might not infer the same meaning. Pact's docs encourage these human conversations, but as a tool it doesn't require them. By avoiding shared code between services, we force ourselves to have a conversation about every API we build with the consumers of those APIs.

Finally, these tests’ effectiveness is directly related to how we can apply them to reality, so we need to keep things simple—we want to be able to test and build features without connections to other features. We want them to be able to work without an internet connection, and if we do want to integrate with a real service in local development, we should be able to do that—meaning we should be able to test and integrate locally at will, without having to rely on cumbersome, extra-connected services (think Docker, Kubernetes; anything that pairs cloud features with the local environment). Straightforward tests are easy to write, read, and maintain. That keeps us moving fast and not breaking things.

So, to recap, there are four principles that will drive our solution:

- Service interactions happen over HTTP & REST
- Contract tests ensure that service interactions behave as expected
- Providing an API contract requires no shared code
- Building features remains fast and fun

Okay, okay, but how?
So we’ve established that we don’t want to hit external services in tests, which we can do through WebMock or similar libraries. The challenge becomes: how do we replicate the integration environment without the integration environment? Through fakes. We’ll fake the integration by using Sinatra to build a rack app that quacks like the real thing. In the rack app, we define the routes we care about for the things we normally would have stubbed in the tests. From here, we do the things we couldn’t do before—pull real parameters out of the requests and feed them back into the fake response to make it more realistic. Additionally, we can use things like ActiveRecord to make these fake responses even more realistic based on the data stored in our actual database.

So what does the fake look like? It's a class with a route defined for each URL we care about faking. We can use WebMock to wire the fake to requests that match a certain pattern. If we receive a request for a URL we didn't define, it will 404. Simple. However, this doesn’t allow us to solve all the things we were working for. What’s missing?

First, an idiomatic setup stance. We want to be able to define fakes in a single place, so when we add a new one, we can easily find it and change it. In the same vein, we want to be able to answer similar questions about registering fakes in one spot. Finally, convention over configuration—if we can load, register, and wire up a fake based on its name, for example, that would be handy.

Secondly, it’s missing environment-specific behavior, which in this case translates into the ability to toggle the library on and off, and to separately toggle the connection to specific collaborator services on and off. We need to be able to have the library active when running tests or doing local development, but we do not want to have it running in a production environment—if it remains active in a real environment, it might affect real customer accounts, which we cannot afford. But there will also be times when we're running in a local development environment and we want to communicate with a real collaborator service to do some true integration testing.

Thirdly, we want to be able to autoload our fakes. If they’re in our codebase, we should be able to iterate on the fakes without having to restart our server; the behavior isn’t always right the first time, and restarting is tedious and it's not the Rails Way.

Finally, to bolt this onto an IRL application, we need the ability to define fakes incrementally and migrate them into existing integrations that we have, one by one.

Okay, brass tacks. No existing library allows us to integrate this way and map HTTP requests to in-process fakes for integration and development. Hence, WebValve.

TL;DR—WebValve is an open-source gem that uses Sinatra and WebMock to provide fake HTTP service behavior. The special sauce is that it works for more than just your tests. It allows you to run your fakes in your dev environment as well, providing functionality akin to real environments, with the toggles we need to access the real thing when we need to.

Let’s run it through the gauntlet to show how it works and how it solves for all our requirements. First we add the gem to our Gemfile and run bundle install. With the gem installed, we can use the generator rails g webvalve:install to bootstrap a default config file where we can register our fakes. Then we can generate a fake for our "trading" collaborator service using rails generate webvalve:fake_service Trading. This gives us a class in a conventional location that inherits from WebValve::FakeService. This looks very similar to a Sinatra app, and that's because it is one—with some additional magic baked in.
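As a rough sketch (the route and payload are invented for illustration rather than taken from a real Trading API), a fleshed-out fake might look something like this:

class FakeTrading < WebValve::FakeService
  # Sinatra-style route returning a canned JSON payload; real params from the
  # request can be folded into the response to make it more realistic.
  get '/api/v1/users/:id' do
    content_type :json
    { id: params[:id], name: 'Fake User' }.to_json
  end
end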
To make this fake work, all we have to do is define the conventionally-named environment variable, TRADING_API_URL. That tells WebValve what requests to intercept and route to this fake. By inheriting from this WebValve class, we gain the ability to toggle the fake behavior on or off based on another conventionally-named environment variable, in this case TRADING_ENABLED.

So let’s take our feature spec. First, we configure our test suite to use WebValve with the RSpec config helper require 'webvalve/rspec'. Then, we look at the user API call—we define a new route for it in FakeTrading. Then we flesh out that fake route by scooping our JSON out of the test file and probably making it a little more dynamic when we drop it into the fake. Then we do the same for the deposit API call. And now our test, which doesn't care about the specifics of either of those API calls, is much clearer. It looks just like our ideal spec from before.

We leverage all the power of WebMock and Sinatra through our conventions and the teeniest bit of configuration to provide all the same functionality as before, but we can write cleaner tests, we get the ability to use these fakes in local development instead of the real services—and we can enable a real service integration without missing a beat. We’ve achieved our goal—we’ve allowed for all the functionality of integration without the threats of actual integration.

Check it out on GitHub. This article is part of Engineering at Betterment.
Focusing on What Matters: Using SLOs to Pursue User Happiness
Proper reliability is the greatest operational requirement for any service. If the service doesn’t work as intended, no user (or engineer) will be happy. This is where SLOs come in.

The umbrella term “observability” covers all manner of subjects, from basic telemetry to logging, to making claims about longer-term performance in the shape of service level objectives (SLOs) and occasionally service level agreements (SLAs). Here I’d like to discuss some philosophical approaches to defining SLOs, explain how they help with prioritization, and outline the tooling currently available to Betterment Engineers to make this process a little easier.

What is an SLO?

At a high level, a service level objective is a way of measuring the performance of, correctness of, validity of, or efficacy of some component of a service over time, by comparing the functionality of specific service level indicators (metrics of some kind) against a target goal. For example:

99.9% of requests complete with a 2xx, 3xx or 4xx HTTP code within 2000ms over a 30 day period

The service level indicator (SLI) in this example is a request completing with a status code of 2xx, 3xx or 4xx and with a response time of at most 2000ms. The SLO is the target percentage, 99.9%. We reach our SLO goal if, during a 30 day period, 99.9% of all requests completed with one of those status codes and within that range of latency. If our service didn’t succeed at that goal, the violation overflow — called an “error budget” — shows us by how much we fell short. With a goal of 99.9%, we have 40 minutes and 19 seconds of downtime available to us every 28 days. Check out more error budget math here.
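As a quick aside, here's the back-of-the-envelope arithmetic behind that figure, just to show where the number comes from:

window_seconds = 28 * 24 * 60 * 60        # a 28-day window is 2,419,200 seconds
error_budget   = window_seconds * 0.001   # the 0.1% we're allowed to miss
error_budget                              # => 2419.2 seconds
"#{(error_budget / 60).floor}m #{(error_budget % 60).round}s"  # => "40m 19s"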
If we fail to meet our goals, it’s worthwhile to step back and understand why. Was the error budget consumed by real failures? Did we notice a number of false positives? Maybe we need to reevaluate the metrics we’re collecting, or perhaps we’re okay with setting a lower target goal because there are other targets that will be more important to our customers.

It’s all about the customer

This is where the philosophy of defining and keeping track of SLOs comes into play. It starts with our users - Betterment users - and trying to provide them with a certain quality of service. Any error budget we set should account for our fiduciary responsibilities, and should guarantee that we do not cause an irresponsible impact to our customers. We also assume that there is a baseline degree of software quality baked in, so error budgets should help us prioritize positive impact opportunities that go beyond these baselines.

Sometimes there are a few layers of indirection between a service and a Betterment customer, and it takes a bit of creativity to understand what aspects of the service directly affect them. For example, an engineer on a backend or data-engineering team provides services that a user-facing component consumes indirectly. Or perhaps the users for a service are Betterment engineers, and it’s really unclear how that work affects the people who use our company’s products. It isn’t that much of a stretch to claim that an engineer’s level of happiness does have some effect on the level of service they’re capable of providing a Betterment customer!

Let’s say we’ve defined some SLOs and notice they are falling behind over time. We might take a look at the metrics we’re using (the SLIs), the failures that chipped away at our target goal, and, if necessary, re-evaluate the relevancy of what we’re measuring. Do error rates for this particular endpoint directly reflect an experience of a user in some way - be it a customer, a customer-facing API, or a Betterment engineer? Have we violated our error budget every month for the past three months? Has there been an increase in Customer Service requests to resolve problems related to this specific aspect of our service? Perhaps it is time to dedicate a sprint or two to understanding what’s causing degradation of service. Or perhaps we notice that what we’re measuring is becoming increasingly irrelevant to a customer experience, and we can get rid of the SLO entirely!

Benefits of measuring the right things, and staying on target

The goal of an SLO-based approach to engineering is to provide data points with which to have a reasonable conversation about priorities (a point that Alex Hidalgo drives home in his book Implementing Service Level Objectives). In the case of services not performing well over time, the conversation might be “focus on improving reliability for service XYZ.” But what happens if our users are super happy, our SLOs are exceptionally well-defined and well-achieved, and we’re ahead of our roadmap? Do we try to get that extra 9 in our target - or do we use the time to take some creative risks with the product (feature-flagged, of course)? Sometimes it’s not in our best interest to be too focused on performance, and we can instead “use up our error budget” by rolling out some new A/B test, or upgrading a library we’ve been putting off for a while, or testing out a new language in a user-facing component that we might not otherwise have had the chance to explore.

The tools to get us there

Let’s dive into some tooling that the SRE team at Betterment has built to help Betterment engineers easily start to measure things.

Collecting the SLIs and Creating the SLOs

The SRE team has a web-app and CLI called coach that we use to manage continuous integration (CI) and continuous delivery (CD), among other things. We’ve talked about Coach in the past here and here. At a high level, the Coach CLI generates a lot of yaml files that are used in all sorts of places to help manage operational complexity and cloud resources for consumer-facing web-apps. In the case of service level indicators (basically metrics collection), the Coach CLI provides commands that generate yaml files to be stored in GitHub alongside application code. At deploy time, the Coach web-app consumes these files and idempotently creates Datadog monitors, which can be used as SLIs (service level indicators) to inform SLOs, or as standalone alerts that need immediate triage every time they're triggered.

In addition to Coach explicitly providing a config-driven interface for monitors, we’ve also written a couple of handy runtime-specific methods that result in automatic instrumentation for Rails or Java endpoints. I’ll discuss these more below. We also maintain a separate repository for SLO definitions. We left this outside of application code so that teams can modify SLO target goals and details without having to redeploy the application itself. It also made visibility easier in terms of sharing and communicating different teams’ SLO definitions across the org.
Monitors in code

Engineers can choose either StatsD or Micrometer to measure complicated experiences with custom metrics, and there are various approaches to turning those metrics directly into monitors within Datadog. We use Coach CLI-driven yaml files to support metric or APM monitor types directly in the code base. Those are stored in a file named .coach/datadog_monitors.yml and look like this:

monitors:
  - type: metric
    metric: "coach.ci_notification_sent.completed.95percentile"
    name: "coach.ci_notification_sent.completed.95percentile SLO"
    aggregate: max
    owner: sre
    alert_time_aggr: on_average
    alert_period: last_5m
    alert_comparison: above
    alert_threshold: 5500
  - type: apm
    name: "Pull Requests API endpoint violating SLO"
    resource_name: api::v1::pullrequestscontroller_show
    max_response_time: 900ms
    service_name: coach
    page: false
    slack: false

It wasn’t simple to make this abstraction intuitive between a Datadog monitor configuration and a user interface. But this kind of explicit, attribute-heavy approach helped us get this tooling off the ground while we developed (and continue to develop) in-code annotation approaches. The APM monitor type was simple enough to turn into both a Java annotation and a tiny domain specific language (DSL) for Rails controllers, giving us nice symmetry across our platforms.

class DeploysController < ApplicationController
  owner "sre", max_response_time: "10000ms", only: [:index], slack: false
end

This owner method for Rails apps results in all logs, error reports, and metrics being tagged with the team’s name, and at deploy time it's aggregated by a Coach CLI command and turned into latency monitors with reasonable defaults for optional parameters; essentially doing the same thing as our config-driven approach but from within the code itself.

For Java apps we have a similar interface (with reasonable defaults as well) in a tidy little annotation.

@Sla
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
public @interface Sla {
  @AliasFor(annotation = Sla.class)
  long amount() default 25_000;

  @AliasFor(annotation = Sla.class)
  ChronoUnit unit() default ChronoUnit.MILLIS;

  @AliasFor(annotation = Sla.class)
  String service() default "custody-web";

  @AliasFor(annotation = Sla.class)
  String slackChannelName() default "java-team-alerts";

  @AliasFor(annotation = Sla.class)
  boolean shouldPage() default false;

  @AliasFor(annotation = Sla.class)
  String owner() default "java-team";
}

Then usage is just as simple as adding the annotation to the controller:

@WebController("/api/stuff/v1/service_we_care_about")
public class ServiceWeCareAboutController {

  @PostMapping("/search")
  @CustodySla(amount = 500)
  public SearchResponse search(@RequestBody @Valid SearchRequest request) {...}
}

At deploy time, these annotations are scanned and converted into monitors along with the config-driven definitions, just like our Ruby implementation.

SLOs in code

Now that we have our metrics flowing, our engineers can define SLOs. If an engineer has a monitor tied to metrics or APM, then they just need to plug the monitor ID directly into our SLO yaml interface:

- last_updated_date: "2021-02-18"
  approval_date: "2021-03-02"
  next_revisit_date: "2021-03-15"
  category: latency
  type: monitor
  description: This SLO covers latency for our CI notifications system - whether it's the github context updates on your PRs or the slack notifications you receive.
  tags:
    - team:sre
  thresholds:
    - target: 99.5
      timeframe: 30d
      warning_target: 99.99
  monitor_ids:
    - 30842606

The interface supports metrics directly as well (mirroring Datadog’s SLO types) so an engineer can reference any metric directly in their SLO definition, as seen here:

# availability
- last_updated_date: "2021-02-16"
  approval_date: "2021-03-02"
  next_revisit_date: "2021-03-15"
  category: availability
  tags:
    - team:sre
  thresholds:
    - target: 99.9
      timeframe: 30d
      warning_target: 99.99
  type: metric
  description: 99.9% of manual deploys will complete successfully over a 30day period.
  query:
    # (total_events - bad_events) over total_events == good_events/total_events
    numerator: sum:trace.rack.request.hits{service:coach,env:production,resource_name:deployscontroller_create}.as_count()-sum:trace.rack.request.errors{service:coach,env:production,resource_name:deployscontroller_create}.as_count()
    denominator: sum:trace.rack.request.hits{service:coach,resource_name:deployscontroller_create}.as_count()

We love having these SLOs defined in GitHub because we can track who's changing them, how they're changing, and get review from peers. It's not quite the interactive experience of the Datadog UI, but it's fairly straightforward to fiddle in the UI and then extract the resulting configuration and add it to our config file.

Notifications

When we merge our SLO templates into this repository, Coach will manage creating SLO resources in Datadog and accompanying SLO alerts (that ping slack channels of our choice) if and when our SLOs violate their target goals. This is the slightly nicer part of SLOs versus simple monitors - we aren’t going to be pinged for every latency failure or error rate spike. We’ll only be notified if, over 7 days or 30 days or even longer, they exceed the target goal we’ve defined for our service. We can also set a “warning threshold” if we want to be notified earlier when we’re using up our error budget.

Fewer alerts means the alerts should be something to take note of, and possibly take action on. This is a great way to get a good signal while reducing unnecessary noise. If, for example, our user research says we should aim for 99.5% uptime, that’s 3h 21m 36s of downtime available per 28 days. That’s a lot of time we can reasonably not react to failures. If we aren’t alerting on those 3 hours of errors, and instead just once if we exceed that limit, then we can direct our attention toward new product features, platform improvements, or learning and development.

The last part of defining our SLOs is including a date when we plan to revisit that SLO specification. Coach will send us a message when that date rolls around to encourage us to take a deeper look at our measurements and possibly reevaluate our goals around measuring this part of our service.

What if SLOs don’t make sense yet?

It’s definitely the case that a team might not be at the level of operational maturity where defining product or user-specific service level objectives is in the cards. Maybe their on-call is really busy, maybe there are a lot of manual interventions needed to keep their services running, maybe they’re still putting out fires and building out their team’s systems. Whatever the case may be, this shouldn’t deter them from collecting data. They can define what is called an “aspirational” SLO - basically an SLO for an important component in their system - to start collecting data over time.
They don’t need to define an error budget policy, and they don’t need to take action when they fail their aspirational SLO. Just keep an eye on it. Another option is to start tracking the level of operational complexity for their systems. Perhaps they can set goals around "Bug Tracker Inbox Zero" or "Failed Background Jobs Zero" within a certain time frame, a week or a month for example. Or they can define some SLOs around the types of on-call tasks that their team tackles each week. These aren’t necessarily true-to-form SLOs, but engineers can use this framework and the tooling provided to collect data around how their systems are operating and have conversations on prioritization based on what they discover, beginning to build a culture of observability and accountability.

Conclusion

Betterment is at a point in its growth where prioritization has become more difficult and more important. Our systems are generally stable, and feature development is paramount to business success. But so is reliability and performance. Proper reliability is the greatest operational requirement for any service.² If the service doesn’t work as intended, no user (or engineer) will be happy. This is where SLOs come in. SLOs should align with business objectives and needs, which will help Product and Engineering Managers understand the direct business impact of engineering efforts. SLOs will ensure that we have a solid understanding of the state of our services in terms of reliability, and they empower us to focus on user happiness. If our SLOs don’t align directly with business objectives and needs, they should align indirectly via tracking operational complexity and maturity.

So, how do we choose where to spend our time? SLOs (service level objectives) - including managing their error budgets - will permit us - our product engineering teams - to have the right conversations and make the right decisions about prioritization and resourcing so that we can balance our efforts spent on reliability and new product features, helping to ensure the long term happiness and confidence of our users (and engineers).

² Alex Hidalgo, Implementing Service Level Objectives