Focusing on What Matters: Using SLOs to Pursue User Happiness
Aug 4, 2021

Proper reliability is the greatest operational requirement for any service. If the service doesn't work as intended, no user (or engineer) will be happy. This is where SLOs come in.

The umbrella term "observability" covers all manner of subjects, from basic telemetry and logging to making claims about longer-term performance in the shape of service level objectives (SLOs) and, occasionally, service level agreements (SLAs). Here I'd like to discuss some philosophical approaches to defining SLOs, explain how they help with prioritization, and outline the tooling currently available to Betterment Engineers to make this process a little easier.

What is an SLO?

At a high level, a service level objective is a way of measuring the performance, correctness, validity, or efficacy of some component of a service over time, by comparing specific service level indicators (metrics of some kind) against a target goal. For example:

99.9% of requests complete with a 2xx, 3xx or 4xx HTTP code within 2000ms over a 30 day period

The service level indicator (SLI) in this example is a request completing with a status code of 2xx, 3xx or 4xx and with a response time of at most 2000ms. The SLO is the target percentage, 99.9%. We meet our SLO if, during a 30 day period, 99.9% of all requests completed with one of those status codes and within that latency bound. If our service falls short of that goal, the amount by which we missed is measured against our "error budget": the amount of failure the target allows. With a goal of 99.9%, we have 40 minutes and 19 seconds of downtime available to us every 28 days. Check out more error budget math here.
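To double-check that arithmetic, here is a quick back-of-the-envelope sketch (a standalone illustration, not part of any Betterment tooling) that converts an SLO target into its error budget for a 28-day window:

```ruby
# Back-of-the-envelope error budget math (illustrative only): an SLO target
# leaves (100% - target) of each window as allowable downtime.
def error_budget(target_pct, window_days: 28)
  budget_seconds = (1 - target_pct / 100.0) * window_days * 24 * 60 * 60
  minutes, seconds = budget_seconds.divmod(60)
  format("%dm %ds of downtime per %d days", minutes.to_i, seconds.round, window_days)
end

error_budget(99.9) # => "40m 19s of downtime per 28 days"
error_budget(99.5) # => "201m 36s of downtime per 28 days" (i.e. 3h 21m 36s)
```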
If we fail to meet our goals, it's worthwhile to step back and understand why. Was the error budget consumed by real failures? Did we notice a number of false positives? Maybe we need to reevaluate the metrics we're collecting, or perhaps we're okay with setting a lower target goal because there are other targets that will be more important to our customers.

It's all about the customer

This is where the philosophy of defining and keeping track of SLOs comes into play. It starts with our users - Betterment users - and trying to provide them with a certain quality of service. Any error budget we set should account for our fiduciary responsibilities and should guarantee that we do not cause an irresponsible impact to our customers. We also assume that there is a baseline degree of software quality baked in, so error budgets should help us prioritize positive impact opportunities that go beyond those baselines.

Sometimes there are a few layers of indirection between a service and a Betterment customer, and it takes a bit of creativity to understand which aspects of the service directly affect them. For example, an engineer on a backend or data engineering team provides services that a user-facing component consumes indirectly. Or perhaps the users of a service are Betterment engineers, and it's less obvious how that work affects the people who use our company's products. It isn't much of a stretch to claim that an engineer's level of happiness has some effect on the level of service they're capable of providing a Betterment customer!

Let's say we've defined some SLOs and notice they are falling behind over time. We might take a look at the metrics we're using (the SLIs), the failures that chipped away at our target goal, and, if necessary, re-evaluate the relevancy of what we're measuring. Do error rates for this particular endpoint directly reflect the experience of a user in some way, be it a customer, a customer-facing API, or a Betterment engineer? Have we violated our error budget every month for the past three months? Has there been an increase in Customer Service requests to resolve problems related to this specific aspect of our service? Perhaps it's time to dedicate a sprint or two to understanding what's causing the degradation of service. Or perhaps we notice that what we're measuring has become increasingly irrelevant to the customer experience, and we can get rid of the SLO entirely!

Benefits of measuring the right things, and staying on target

The goal of an SLO-based approach to engineering is to provide data points with which to have a reasonable conversation about priorities (a point that Alex Hidalgo drives home in his book Implementing Service Level Objectives). In the case of services not performing well over time, the conversation might be "focus on improving reliability for service XYZ." But what happens if our users are super happy, our SLOs are exceptionally well-defined and well-achieved, and we're ahead of our roadmap? Do we try to get that extra 9 in our target, or do we use the time to take some creative risks with the product (feature-flagged, of course)? Sometimes it's not in our best interest to be too focused on performance, and we can instead "use up our error budget" by rolling out a new A/B test, upgrading a library we've been putting off for a while, or trying out a new language in a user-facing component that we might not otherwise have had the chance to explore.

The tools to get us there

Let's dive into some tooling that the SRE team at Betterment has built to help Betterment engineers easily start to measure things.

Collecting the SLIs and Creating the SLOs

The SRE team has a web-app and CLI called Coach that we use to manage continuous integration (CI) and continuous delivery (CD), among other things. We've talked about Coach in the past here and here. At a high level, the Coach CLI generates a lot of yaml files that are used in all sorts of places to help manage operational complexity and cloud resources for consumer-facing web-apps.

In the case of service level indicators (basically, metrics collection), the Coach CLI provides commands that generate yaml files to be stored in GitHub alongside application code. At deploy time, the Coach web-app consumes these files and idempotently creates Datadog monitors, which can be used as SLIs to inform SLOs, or as standalone alerts that need immediate triage every time they're triggered.

In addition to Coach explicitly providing a config-driven interface for monitors, we've also written a couple of handy runtime-specific methods that provide automatic instrumentation for Rails or Java endpoints. I'll discuss these more below.

We also manage a separate repository for SLO definitions. We keep this outside of application code so that teams can modify SLO target goals and details without having to redeploy the application itself. It also makes visibility easier in terms of sharing and communicating different teams' SLO definitions across the org.
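To make the deploy-time step a bit more concrete, here is a rough, hypothetical sketch of what "consume the yaml files and idempotently create Datadog monitors" can look like. None of this is Coach's actual code: the DatadogStub class stands in for calls to Datadog's monitors API, the to_query mapping (and the fallback owner) is heavily simplified, and the real .coach/datadog_monitors.yml format is shown in full in the next section.

```ruby
require "yaml"

# Stand-in for Datadog's monitors API: create or update a monitor by name, so
# repeated deploys converge on the same set of monitors.
class DatadogStub
  def initialize
    @monitors = {}
  end

  def upsert(name:, query:, message:)
    action = @monitors.key?(name) ? "updated" : "created"
    @monitors[name] = { query: query, message: message }
    puts "#{action} monitor #{name.inspect} -> #{query}"
  end
end

# Heavily simplified translation from the yaml attributes to a Datadog-style
# monitor query; the real mapping has many more knobs.
def to_query(defn)
  case defn.fetch("type")
  when "metric"
    comparison = defn["alert_comparison"] == "above" ? ">" : "<"
    "avg(#{defn['alert_period']}):#{defn['aggregate']}:#{defn['metric']} #{comparison} #{defn['alert_threshold']}"
  when "apm"
    seconds = defn.fetch("max_response_time").to_f / 1000
    "avg(last_5m):avg:trace.rack.request.duration{service:#{defn['service_name']},resource_name:#{defn['resource_name']}} > #{seconds}"
  end
end

datadog = DatadogStub.new
YAML.load_file(".coach/datadog_monitors.yml").fetch("monitors").each do |defn|
  datadog.upsert(
    name: defn.fetch("name"),
    query: to_query(defn),
    message: "@slack-#{defn.fetch('owner', 'sre')}-alerts" # arbitrary default owner
  )
end
```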
Monitors in code

Engineers can choose either StatsD or Micrometer to measure complicated experiences with custom metrics, and there are various approaches to turning those metrics directly into monitors within Datadog. We use Coach CLI driven yaml files to support metric or APM monitor types directly in the code base. Those are stored in a file named .coach/datadog_monitors.yml and look like this:

```yaml
monitors:
  - type: metric
    metric: "coach.ci_notification_sent.completed.95percentile"
    name: "coach.ci_notification_sent.completed.95percentile SLO"
    aggregate: max
    owner: sre
    alert_time_aggr: on_average
    alert_period: last_5m
    alert_comparison: above
    alert_threshold: 5500

  - type: apm
    name: "Pull Requests API endpoint violating SLO"
    resource_name: api::v1::pullrequestscontroller_show
    max_response_time: 900ms
    service_name: coach
    page: false
    slack: false
```

It wasn't simple to design an intuitive abstraction between a Datadog monitor configuration and a user interface. But this kind of explicit, attribute-heavy approach helped us get this tooling off the ground while we developed (and continue to develop) in-code annotation approaches.

The APM monitor type was simple enough to turn into both a Java annotation and a tiny domain specific language (DSL) for Rails controllers, giving us nice symmetry across our platforms.

This owner method for Rails apps results in all logs, error reports, and metrics being tagged with the team's name. At deploy time it's aggregated by a Coach CLI command and turned into latency monitors with reasonable defaults for optional parameters, essentially doing the same thing as our config-driven approach but from within the code itself:

```ruby
class DeploysController < ApplicationController
  owner "sre", max_response_time: "10000ms", only: [:index], slack: false
end
```

For Java apps we have a similar interface (with reasonable defaults as well) in a tidy little annotation:

```java
@Sla
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
public @interface Sla {
  @AliasFor(annotation = Sla.class)
  long amount() default 25_000;

  @AliasFor(annotation = Sla.class)
  ChronoUnit unit() default ChronoUnit.MILLIS;

  @AliasFor(annotation = Sla.class)
  String service() default "custody-web";

  @AliasFor(annotation = Sla.class)
  String slackChannelName() default "java-team-alerts";

  @AliasFor(annotation = Sla.class)
  boolean shouldPage() default false;

  @AliasFor(annotation = Sla.class)
  String owner() default "java-team";
}
```

Then usage is just as simple as adding the annotation to the controller:

```java
@WebController("/api/stuff/v1/service_we_care_about")
public class ServiceWeCareAboutController {

  @PostMapping("/search")
  @CustodySla(amount = 500)
  public SearchResponse search(@RequestBody @Valid SearchRequest request) {...}
}
```

At deploy time, these annotations are scanned and converted into monitors along with the config-driven definitions, just like our Ruby implementation.
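For intuition about what a DSL like the Ruby one above can be doing under the hood, here is a minimal sketch of how an owner-style class method could record metadata for deploy-time tooling to pick up. This is not Betterment's implementation; OwnerDsl, MonitorRegistry, and every attribute name below are hypothetical.

```ruby
# Hypothetical sketch: an `owner` class method that records monitor metadata
# in an in-memory registry, which deploy-time tooling could translate into
# latency monitors (one per controller action).
module OwnerDsl
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    def owner(team, max_response_time: "2000ms", only: nil, slack: true)
      MonitorRegistry.register(
        controller: name,
        team: team,
        max_response_time: max_response_time,
        actions: Array(only),
        slack: slack
      )
    end
  end
end

# Tiny registry the deploy step would read to generate monitor definitions.
class MonitorRegistry
  def self.register(entry)
    entries << entry
  end

  def self.entries
    @entries ||= []
  end
end

class DeploysController
  include OwnerDsl
  owner "sre", max_response_time: "10000ms", only: [:index], slack: false
end

MonitorRegistry.entries.first.fetch(:team) # => "sre"
```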
SLOs in code

Now that we have our metrics flowing, our engineers can define SLOs. If an engineer has a monitor tied to metrics or APM, they just need to plug the monitor ID directly into our SLO yaml interface:

```yaml
- last_updated_date: "2021-02-18"
  approval_date: "2021-03-02"
  next_revisit_date: "2021-03-15"
  category: latency
  type: monitor
  description: This SLO covers latency for our CI notifications system - whether it's the github context updates on your PRs or the slack notifications you receive.
  tags:
    - team:sre
  thresholds:
    - target: 99.5
      timeframe: 30d
      warning_target: 99.99
  monitor_ids:
    - 30842606
```

The interface supports metrics directly as well (mirroring Datadog's SLO types), so an engineer can reference any metric directly in their SLO definition, as seen here:

```yaml
# availability
- last_updated_date: "2021-02-16"
  approval_date: "2021-03-02"
  next_revisit_date: "2021-03-15"
  category: availability
  tags:
    - team:sre
  thresholds:
    - target: 99.9
      timeframe: 30d
      warning_target: 99.99
  type: metric
  description: 99.9% of manual deploys will complete successfully over a 30 day period.
  query:
    # (total_events - bad_events) over total_events == good_events/total_events
    numerator: sum:trace.rack.request.hits{service:coach,env:production,resource_name:deployscontroller_create}.as_count()-sum:trace.rack.request.errors{service:coach,env:production,resource_name:deployscontroller_create}.as_count()
    denominator: sum:trace.rack.request.hits{service:coach,resource_name:deployscontroller_create}.as_count()
```

We love having these SLOs defined in GitHub because we can track who's changing them, how they're changing, and get review from peers. It's not quite the interactive experience of the Datadog UI, but it's fairly straightforward to fiddle in the UI and then extract the resulting configuration and add it to our config file.

Notifications

When we merge our SLO templates into this repository, Coach manages creating the SLO resources in Datadog, along with accompanying SLO alerts (pinging Slack channels of our choice) if and when our SLOs violate their target goals. This is the slightly nicer part of SLOs versus simple monitors: we aren't going to be pinged for every latency failure or error rate spike. We'll only be notified if, over 7 days or 30 days or even longer, our service fails to meet the target goal we've defined. We can also set a "warning threshold" if we want to be notified earlier, while we're still using up our error budget.

Fewer alerts means the alerts we do get should be something to take note of, and possibly take action on. This is a great way to get a good signal while reducing unnecessary noise. If, for example, our user research says we should aim for 99.5% uptime, that's 3h 21m 36s of downtime available per 28 days. That's a lot of time during which we can reasonably choose not to react to failures. If we aren't alerting on those 3 hours of errors, and are instead notified just once if we exceed that limit, then we can direct our attention toward new product features, platform improvements, or learning and development.
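To see how a target and a warning threshold interact, here is a small standalone sketch (not Coach's or Datadog's actual evaluation logic) that classifies an SLO's status from good/total event counts over a window; the numbers mirror the 99.5% example above.

```ruby
# Illustrative only: classify an SLO from event counts over its time window.
# :breached -> the SLO target itself was violated
# :warning  -> still within target, but past the warning threshold
# :ok       -> comfortably within budget
def slo_status(good_events:, total_events:, target: 99.5, warning_target: 99.99)
  sli = 100.0 * good_events / total_events
  error_budget = 100.0 - target # allowable failure, in percentage points
  budget_spent = 100.0 - sli    # failure actually observed so far

  status =
    if sli < target
      :breached
    elsif sli < warning_target
      :warning
    else
      :ok
    end

  {
    sli: sli.round(3),
    budget_spent_pct: (100.0 * budget_spent / error_budget).round(1),
    status: status
  }
end

# 99.7% of requests were good over the window, against a 99.5% target:
slo_status(good_events: 99_700, total_events: 100_000)
# => { sli: 99.7, budget_spent_pct: 60.0, status: :warning }
```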
The last part of defining our SLOs is including a date when we plan to revisit that SLO specification. Coach will send us a message when that date rolls around to encourage us to take a deeper look at our measurements and possibly reevaluate our goals around measuring this part of our service.

What if SLOs don't make sense yet?

It's definitely the case that a team might not be at the level of operational maturity where defining product- or user-specific service level objectives is in the cards. Maybe their on-call is really busy, maybe there are a lot of manual interventions needed to keep their services running, maybe they're still putting out fires and building out their team's systems. Whatever the case may be, this shouldn't deter them from collecting data. They can define what is called an "aspirational" SLO - basically an SLO for an important component in their system - to start collecting data over time. They don't need to define an error budget policy, and they don't need to take action when they fail their aspirational SLO. Just keep an eye on it.

Another option is to start tracking the level of operational complexity of their systems. Perhaps they can set goals around "Bug Tracker Inbox Zero" or "Failed Background Jobs Zero" within a certain time frame, say a week or a month. Or they can define some SLOs around the types of on-call tasks that their team tackles each week. These aren't necessarily true-to-form SLOs, but engineers can use this framework and the tooling provided to collect data about how their systems are operating and have conversations on prioritization based on what they discover, beginning to build a culture of observability and accountability.

Conclusion

Betterment is at a point in its growth where prioritization has become more difficult and more important. Our systems are generally stable, and feature development is paramount to business success. But so are reliability and performance. Proper reliability is the greatest operational requirement for any service.² If the service doesn't work as intended, no user (or engineer) will be happy. This is where SLOs come in.

SLOs should align with business objectives and needs, which will help Product and Engineering Managers understand the direct business impact of engineering efforts. SLOs ensure that we have a solid understanding of the state of our services in terms of reliability, and they empower us to focus on user happiness. If our SLOs don't align directly with business objectives and needs, they should align indirectly via tracking operational complexity and maturity.

So, how do we choose where to spend our time? SLOs, including the management of their error budgets, permit our product engineering teams to have the right conversations and make the right decisions about prioritization and resourcing, so that we can balance efforts spent on reliability against new product features, helping to ensure the long-term happiness and confidence of our users (and engineers).

² Alex Hidalgo, Implementing Service Level Objectives