Engineering at Betterment
High quality code. Beautiful, practical design. Innovative problem solving. Explore our engineering community and nerd out with us on all things tech.
Recent articles
End-to-end-ish tests using fake HTTP in Flutter
We write tests in order to prove our features work as intended and we run those tests consistently to prove that our features don't stop working as intended.

Writing end-to-end tests is pretty expensive. Typically, they use real devices or sometimes a simulator/emulator and real backend services. That usually means that they end up being pretty slow, and they tend to be somewhat flaky. That isn't to say that they're not worth it for some teams or for a subset of the features in your app. However, I'm here to tell you (or maybe just remind you) that tests and test coverage aren't the goal in and of themselves. We write tests in order to prove our features work as intended and we run those tests consistently to prove that our features don't stop working as intended. That means that our goal when writing tests should be to figure out how to achieve our target level of confidence that our features work as intended as affordably as possible.

It's a spectrum. On one end is 100% test coverage using all the different kinds of tests: solitary unit tests, sociable more-integrated tests, and end-to-end tests; all features, fully covered, no exceptions. On the other end of the spectrum there are no tests at all; YOLO, just ship it. At Betterment, we definitely prefer to be closer to the 100% coverage end of the spectrum, but we know that in practice that's not really a feasible end state if we want to ship changes quickly and deliver rapid feedback to our engineers about their proposed changes. So what do we do? Well, we aim to find an affordable, maintainable spot on that testing spectrum a la Justin Searls' advice. We focus on writing expressive, fast, and reliable solitary unit tests, some sociable integrated tests of related units, and some "end-to-end-ish" tests. It's that last bucket of tests that's the most interesting, and it's what the rest of this post will focus on.

What are "end-to-end-ish" tests?

They're an answer to the question "how can we approximate end-to-end tests for a fraction of the cost?" In Flutter, the way to write end-to-end tests is with flutter_driver and the integration_test package. These tests use the same widgetTester API that regular widget tests use, but they are designed to run on a simulator, emulator, or preferably a real device. These tests are pretty easy to write (just as easy as regular widget tests) but hard-ish to debug and very slow to run. Where a widget test will run in a fraction of a second to a second, one of these integration tests will take many seconds. We love the idea of these tests, the level of confidence they'd give us that our app works as intended, and how they'd eliminate manual QA testing, but we loathe the cost of running them, both in terms of time and actual $$$ of CI execution.

So, we decided that we really only want to write these flutter_driver end-to-end tests for a tiny subset of our features, almost like a "smoke testing" suite that would signal us if something was seriously wrong with our app. That might include a single happy-path test apiece for features like log-in and sign-up. But that leaves us with a pretty large gap where it's way too easy for us to accidentally create a feature that depends on some Provider that's not provided, and our app blows up at runtime in a user's hands. Yuck!

Enter end-to-end-ish tests (patent pending 😉). These tests are as close to end-to-end tests as we can get without actually running on a real device using flutter_driver.
They look just like widget tests (because they are just widget tests) but they boot up our whole app, run all the real initialization code, and rely on all our real injected dependencies, with a few key exceptions (more on that next). This gives us the confidence that all our code is configured properly, all our dependencies are provided, our navigation works, and the user can tap on whatever and see what they'd expect to see. You can read more about this approach here.

"With a few key exceptions"

If the first important distinction of end-to-end-ish tests is that they don't run on a real device with flutter_driver, the second important distinction is that they don't rely on a real backend API. That is, most apps rely on one (or sometimes a few) backend APIs, typically powered by HTTP. Our app is one of those apps. So, the second major difference is that we inject a fake HTTP configuration into our network stack so that we can run nearly all of our code for real but cut out the other unreliable and costly dependency.

The last important hurdle is native plugins. Because widget tests aren't typically run on a real device or a simulator/emulator, they run in a context in which we should assume the underlying platform doesn't support using real plugins. This means that we have to also inject fake implementations of any plugins we use. What I mean by fake plugins is really simple. When we set up a new plugin, we wrap it in a class that we inject into our app. Making a fake implementation of that plugin is typically as easy as making another class, prefixing its name with Fake, and having it implement the public contract of the regular plugin class with suitably real but not quite real behavior. It's a standard test double, and it does the trick. It's definitely a bummer that we can't exercise that real plugin code, but when you think about it, that plugin code is tested in the plugin's test suite. A lot of the time, the plugin code is integration tested as well because the benefits outweigh the costs for many plugins; e.g. the shared preferences plugin can use a single integration test to provide certainty that it works as intended. Ultimately, using fake plugins works well and makes this a satisfyingly functional testing solution.

About that fake HTTP thing

One of the most interesting bits of this solution is the way we inject a fake HTTP configuration into our network stack. Before building anything ourselves, we did some research to figure out what the community had already done. Unfortunately, our google-fu was bad and we didn't find anything until after we went and implemented something ourselves. Points for trying though, right? Eventually, we found nock. It's similar to libraries for other platforms that allow you to define fake responses for HTTP requests using a nice API and then inject those fake responses into your HTTP client. It relies on the dart:io HttpOverrides feature. It actually configures the current Zone's HTTP client builder to return its special client so that any code in your project that finds its way to using the dart:io HTTP client to make a request will end up routed right into the fake responses. It's clever and great. I highly recommend using it. We, however, are not using it.

How we wrote our own fake HTTP Client Adapter

As I said, we didn't find nock until after we wrote our own solution. Fortunately, it was a fun experience and it really took very little time!
This also meant that we ended up with an API that fit our exact needs rather than having to reframe our approach to fit what nock was able to offer us. The solution we came up with is called charlatan and it's open-source and available on pub.dev. Both libraries are great, and each is designed for a specific challenge; check both of them out and decide which one works for your needs.

Here's how we use our API to set up a fake HTTP client for our tests (a rough sketch of this usage appears at the end of this article). The idea is to construct an instance of the Charlatan class and then use its methods like whenGet to configure it with the fake responses we want to see when we make requests to the configured URLs. We've also created an extension method withDefaults that allows us to configure a bunch of common, default responses so that we don't have to specify those in each and every test case. This is useful for API calls that always behave the same way, like POSTs that return no body, and to provide a working foundation of responses. When a test case cares about the specifics of a response, it can override that default. The last important step is to make sure to convert the Charlatan instance into an adapter and pass that into our HTTP client so that the client will use it to fulfill requests.

Here's a peek inside of the Charlatan API: it's just collecting fake responses and organizing them so that they're easy to access later. The internals are pretty tiny. We provide a class that exposes the developer-friendly configuration API for fake responses, and we implement the HttpClientAdapter interface provided by dio. In our app we use dio and not dart:io's built-in HTTP client, mostly due to preference and slight feature set differences. For the most part, the code collects fake responses and then smartly spits them back out when requested. The key functionality (Ahem! Magic ✨) is only a few lines of code. We use the uri package to support matching templated URLs rather than requiring developers to pass in exactly matching strings for requests their tests will make. We store fake responses with a URI template, a status code, and a body. If we find a match, we return it; if we don't, then we throw a helpful exception to guide the developer on how to fix the issue.

Takeaways

Testing software is important, but it's not trivial to write a balanced test suite for your app's needs. Sometimes, it's a good idea to think outside the box in order to strike the right balance of test coverage, confidence, and maintainability. That's what we do here at Betterment, come join us!
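The article's original snippets are not reproduced in this excerpt, so here is a minimal sketch of the usage described above, written against the public charlatan and dio packages. It is illustrative rather than copied from our codebase, and details such as the response-builder signature and the adapter conversion method may differ between charlatan versions:

import 'package:charlatan/charlatan.dart';
import 'package:dio/dio.dart';
import 'package:flutter_test/flutter_test.dart';

void main() {
  test('fetching a user is served by the fake HTTP client', () async {
    // Collect fake responses; templated URLs are supported.
    final charlatan = Charlatan()
      ..whenGet('/users/{id}', (request) => {'id': 1, 'name': 'Jane'});

    // Convert the Charlatan instance into an adapter and hand it to dio,
    // so every request the app makes is fulfilled by the fake responses.
    final dio = Dio(BaseOptions(baseUrl: 'https://example.test'))
      ..httpClientAdapter = charlatan.toFakeHttpClientAdapter();

    final response = await dio.get('/users/1');
    expect(response.data, {'id': 1, 'name': 'Jane'});
  });
}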
Stability through Randomness
We only recently enabled test randomization and as a result found that some of our tests were failing. Through fixing the tests, we learned lessons that could help others have a less painful migration themselves.

If you're writing tests for your Flutter application, it's safe to assume that your goal is to build a robust, reliable piece of software that you can be confident in. However, if your tests aren't run in random order, you may have a false sense of confidence that the assertions you're making in them are actually accurate. By default, running flutter test will run your tests in the order they're written within your test file. In other words, the following test file will always exit successfully, despite the fact that there are obvious issues with how it's set up. In this case, our second test is relying on the side effects of the first test. Since the first test will always run before the second test, we're not privy to this dependency. However, in more complex testing scenarios, this dependency won't be as obvious.

In order to avoid test inter-dependency issues, we can instead run our tests in a random order (per file) by passing the --test-randomize-ordering-seed flag to flutter test. The flag takes a seed that can be one of two things: either a 32 bit unsigned integer or the word "random". To ensure true randomness, always pass "random" as the seed. The benefit of having the option to pass an integer as a seed becomes apparent once you come across a test that fails when run in an order other than that which it was defined. The test runner will print the seed it chose at the beginning of test execution, and you can reliably use that seed to reproduce the failure and be confident in your fix once the test begins passing.

If you have been using the randomization flag since the inception of your codebase, you're in a fantastic position and can be confident in your tests! If you haven't, there's no better time to start than now. Of course, introducing the flag may cause some tests to begin failing. Whether you choose to skip those tests while you work on fixing them so the rest of your team can keep chugging away, or address the issues immediately, the following tips should help you quickly identify where the issues are coming from and how to resolve them.

Tip 1: Assume every test within a test file will run first

The first snippet above highlights the anti-pattern of assuming a consistent test execution order. We can rewrite this test so that each test would pass if it were run first. Hopefully it's easy to look past the trivial nature of using an int and imagine how this might apply to a more complex test case.

Tip 2: Keep all initialization & configuration code inside of setUp() methods

While it may be tempting to set up certain test objects directly in your main function, this can cause sneaky issues to crop up, especially when mocking or using mutable objects.

Don't: Did you know that even when run sequentially, this will print A,B,D,C,E? This is because code in the body of the main function and the bodies of groups only runs once, and it does so immediately. This can introduce sneaky testing bugs that may not surface until the tests themselves run in random order.
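The article's original snippets aren't reproduced here, so the following is a minimal sketch, assuming only the standard flutter_test APIs, that reproduces the two orderings described: a "don't" version whose main and group bodies do work immediately, and a "do" version that moves that work into setUp:

// DON'T: code in the main() and group() bodies runs once, immediately,
// before any test body runs, so this prints A, B, D, C, E.
import 'package:flutter_test/flutter_test.dart';

void main() {
  print('A');
  group('ordering', () {
    print('B');
    test('first', () => print('C'));
    print('D');
    test('second', () => print('E'));
  });
}

// DO: move per-test work into setUp(), which runs before every test,
// so this prints A, B, C, A, D, E (A prints once per test).
import 'package:flutter_test/flutter_test.dart';

void main() {
  setUp(() => print('A'));
  group('ordering', () {
    test('first', () {
      print('B');
      print('C');
    });
    test('second', () {
      print('D');
      print('E');
    });
  });
}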
This will correctly print A,B,C,A,D,E (A prints twice because setUp is run before each test).

Tip 3: Scope test objects as closely as possible to the tests that need them

In the same way that we prefer to keep shared state as low in the Widget tree as possible, keep your test objects close to the tests that utilize them. Not only does this increase test readability (each set up method will set up only the dependencies needed for the tests below it and within the same scope in the testing tree), but it also reduces the scope for potential problems. By keeping test dependencies tightly scoped to where they're used (see the sketch at the end of this article), we avoid the possibility that a test will be added or changed in such a way that impacts the tests previously consuming the dependency. Instead, when a new test is introduced that requires that dependency, the decision can be made to share it in such a way that its state gets reset prior to each test, or to not share it at all and have each test create and set up the dependency itself. Keep in mind, descriptive group names go a long way in adding clarity to what dependencies that bucket relies upon. For example, a group named "when a user is logged in" tells me that the group of tests relies upon a user in the authenticated state. If I add another group named "when a user is logged out", I would expect both groups to have setUp() methods that correctly create or set up the user model to have the correct authentication state.

Following the above tips should put you well on your way to fixing existing problems in your test suite or otherwise preventing them altogether! So if you haven't already, make sure to enable test randomization in your Flutter codebase today!
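As a rough illustration of tip 3 (a sketch with hypothetical names, not the article's own snippet): each group declares and sets up only the dependencies its own tests need, and the group names signal the state being arranged.

import 'package:flutter_test/flutter_test.dart';

// A hypothetical dependency, defined here only to make the sketch runnable.
class FakeSession {
  FakeSession({required this.loggedIn});
  final bool loggedIn;
}

void main() {
  group('when a user is logged in', () {
    late FakeSession session;

    setUp(() {
      // Created fresh for every test in this group, and only this group.
      session = FakeSession(loggedIn: true);
    });

    test('the session reports an authenticated user', () {
      expect(session.loggedIn, isTrue);
    });
  });

  group('when a user is logged out', () {
    late FakeSession session;

    setUp(() {
      session = FakeSession(loggedIn: false);
    });

    test('the session reports an unauthenticated user', () {
      expect(session.loggedIn, isFalse);
    });
  });
}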
Finding a Middle Ground Between Screen and UI Testing in Flutter
We outline the struggles we had testing our flutter app, our approaches to those challenges, and the solutions we arrived at to solve those problems.

Flutter provides good solutions for both screen testing and UI testing, but what about the middle ground? With integration testing being a key level of the testing pyramid, we needed to find a way to test how features in our app interacted without the overhead involved with setting up UI tests. I'm going to take you through our testing journey from a limited native automated testing suite and heavy dependence on manual testing, to trying flutter's integration testing solutions, to ultimately deciding to build out our own framework to increase confidence in the integration of our components.

The beginning of our Flutter testing journey

Up until early 2020, our mobile app was entirely native with separate android and iOS codebases. At the onset of our migration to flutter, the major testing pain point was that a large amount of manual regression testing was required in order to approve each release. This manual testing was tedious and time consuming for engineers, whose time is expensive. Alongside this manual testing pain, the automated testing in the existing iOS and android codebases was inconsistent. iOS had a larger unit testing suite than android did, but neither had integration tests. iOS also had some tests that were flaky, causing CI builds to fail unexpectedly. As we transitioned to flutter, we made unit/screen testing and code testability a high priority, pushing for thorough coverage. That said, we still relied heavily on the manual testing checklist to ensure the user experience was as expected. This led us to pursue an integration testing solution for flutter.

In planning out integration testing, we had a few key requirements for our integration testing suite:

- Easily runnable in CI upon each commit
- An API that would be familiar to developers who are used to writing flutter screen tests
- The ability to test the integration between features within the system without needing to set up the entire app

The Flutter integration testing landscape

At the very beginning of our transition to flutter, we started trying to write integration tests for our features using flutter's solution at the time: flutter_driver. The benefit we found in flutter_driver was that we could run it in our production-like environment against preset test users. This meant there was minimal test environment setup. We ran into quite a few issues with flutter_driver, though. Firstly, there wasn't a true entry point we could launch the app into, because our app is add-to-app, meaning that the flutter code is embedded into our iOS and Android native applications rather than being a pure flutter app runnable from a main.dart entry point. Second, flutter_driver is geared more toward UI/E2E testing than integration testing, meaning we'd need to run an instance of the app on a device, navigate to a flow we wanted to test, and then test the flow. Also, the flutter_driver API worked differently than the screen testing API and was generally more difficult to use. Finally, flutter_driver is not built to run a suite of tests or to run easily in CI. While possible to run in CI, it would be incredibly costly to run on each commit since the tests need to run on actual devices. These barriers led us to not pursue flutter_driver tests as our solution.
We then pivoted to investigating Flutter's newer replacement for flutter_driver: integration_test. Unfortunately, integration_test was very similar to flutter_driver, in that it took the same UI/E2E approach, which meant that it had the same benefits and drawbacks that flutter_driver had. The one additional advantage of integration_test is that it uses the same API as screen tests do, so writing tests with it feels more familiar for developers experienced with writing screen tests. Regardless, given that it has the same problems that flutter_driver does, we decided not to pursue integration_test as our framework.

Our custom solution to integration testing

After trying flutter's solutions fruitlessly, we decided to build out a solution of our own. Before we dive into how we built it, let's revisit our requirements from above:

- Easily runnable in CI upon each commit
- An API that would be familiar to developers who are used to writing flutter screen tests
- The ability to test the integration between features within the system without needing to set up the entire app

Given those requirements, we took a step back to make a few overarching design decisions.

First, we needed to decide what pieces of code we were interested in testing and which parts we were fine with stubbing. Because we didn't want to run the whole app with these tests, in order to keep the tests lightweight enough to run on each commit, we decided to stub out a few problem areas. The first was our flutter/native boundary. With our app being add-to-app and utilizing plugins, we didn't want to have to run anything native in our testing. We stubbed out the plugins by writing lightweight wrappers around them and then providing them to the app at a high level that we could easily override with fakes for the purpose of integration testing. The add-to-app boundary was similar. The second area we wanted to stub out was the network. In order to do this, we built out a fake http client that allows us to configure network responses for given requests. We chose to fake the http client since it is the very edge of our network layer. Faking it left as much of our code as possible under test.

The next thing we needed to decide was what user experiences we actually wanted to test with our integration tests. Because integration tests are more expensive to write and maintain than screen tests, we wanted to make sure the flows we were testing were the most impactful. Knowing this, we decided to focus on "happy paths" of flows. Happy paths are non-exceptional flows (flows not based on bad user state or input). On top of being less impactful, sad paths usually give feedback on the same screen as the input, meaning those sad path cases are usually better tested at the screen test level anyway.

From here, we set out to break down responsibilities of the components of our integration tests. We wanted to have a test harness that we could use to set up the app under test and the world that the app would run in; however, we knew this configuration code would be mildly complicated and something that would be in flux. We also wanted a consistent framework by which we could write these tests. In order to ensure changes to our test harness didn't have far-reaching effects on the underlying framework, we decided to split out the testing framework into an independent package that is completely agnostic to how our app operates. This keeps the tests feeling familiar to normal screen tests since the exposed interface is very similar to how widget tests are written.
The remaining test harness code was put in our normal codebase where it can be iterated on freely. The other separation we wanted to make was between the screen interactions and the tests themselves. For this we used a modified version of Very Good Ventures' robot testing pattern, which allows us to reuse screen interactions across multiple tests while also making our tests very readable, even from a non-engineering perspective.

In order to fulfill two of our main requirements, being able to run as part of our normal test suite in CI and having a familiar API, we knew we'd need to build our framework on top of flutter's existing screen test framework. Being able to integrate (ba dum tss) these new tests into our existing test suite is excellent because it means that we get quick feedback when code breaks while developing. The last of our requirements was to be able to launch into a specific feature rather than having to navigate through the whole app. We were able to do this by having our app widget that handles dependency setup take a child, then pumping the app widget wrapped around whatever feature widget we wanted to test. With all these decisions made, we arrived at a well-defined integration testing framework that isolated our concerns and fulfilled our testing requirements.

The Nitty Gritty Details

In order to describe how our integration tests work, let's start by describing an example app that we may want to test. Let's imagine a simple social network app, igrastam, that has an activity feed screen, a profile screen, a flow for updating your profile information, and a flow for posting images. For this example, we'll say we're most interested in testing the profile information edit flows to start.

First, how would we want to make a test harness for this app? We know it has some sort of network interactions for fetching profile info and posts as well as for posting images and editing a profile. For that, our app has a thin wrapper around the http package called HttpClient. We may also have some interactions with native code through a plugin such as image_cropper. In order to have control over that plugin, this app has also made a thin wrapper service for that. This leaves our app depending on two thin, injectable wrappers: HttpClient and the image cropper service. Given that this is approximately what the app looks like, the test harness needs to grant control of the HttpClient and the ImageCropperService. We can do that by just passing our own fake versions into the app.

Awesome, now that we have an app and a harness we can use to test it, how are the tests actually written? Let's start out by exploring that robot testing technique I mentioned earlier. Say that we want to start by testing the profile edit flow. One path through this flow contains a screen for changing your name and byline; then it bounces out to picking and cropping a profile image, then allows you to choose a preset border to put on your profile picture. For the screen for changing your name and byline, we can build a robot to interact with the screen (a sketch of such a robot, together with a test that uses it, follows below). By using this pattern, we are able to reuse test code pertaining to this screen across many tests. It also keeps the test file clean of WidgetTester interaction, making the tests read more like a series of human actions rather than a series of code instructions. Okay, we've got an app, a test harness, and robots to interact with the screens. Let's put it all together now into an actual test.
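The article's own snippets are not reproduced in this excerpt; the following is a minimal, self-contained sketch of the robot pattern and a test that uses it. The screen, keys, and robot are hypothetical stand-ins, and the comment marks where the real harness (with its fake HttpClient and ImageCropperService) would be pumped instead of a bare MaterialApp:

import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

// A stand-in for the hypothetical profile edit screen described above.
class EditProfileScreen extends StatelessWidget {
  const EditProfileScreen({super.key});

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      body: Column(
        children: [
          TextField(key: const Key('name-field')),
          TextField(key: const Key('byline-field')),
        ],
      ),
    );
  }
}

// A robot wraps all WidgetTester interaction for one screen,
// so tests read as a series of human actions rather than code.
class EditProfileRobot {
  EditProfileRobot(this.tester);

  final WidgetTester tester;

  Future<void> enterName(String name) async {
    await tester.enterText(find.byKey(const Key('name-field')), name);
    await tester.pump();
  }

  Future<void> enterByline(String byline) async {
    await tester.enterText(find.byKey(const Key('byline-field')), byline);
    await tester.pump();
  }

  void expectName(String name) {
    expect(find.text(name), findsOneWidget);
  }
}

void main() {
  testWidgets('user can edit their name and byline', (tester) async {
    // In the real suite this would be the app/harness widget, with fake
    // HttpClient and ImageCropperService implementations passed in.
    await tester.pumpWidget(const MaterialApp(home: EditProfileScreen()));

    final robot = EditProfileRobot(tester);
    await robot.enterName('Freida Robot');
    await robot.enterByline('Perfectly generic byline');
    robot.expectName('Freida Robot');
  });
}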
The tests end up looking incredibly simple once all of these things are in place (which was the goal!). This test would go on to have a few more steps detailing the interactions on the subsequent screens. With that, we've been able to test the integration of all the components for a given flow, all written in a widget-test-like style, without needing to build out the entire app. This test could be added into our suite of other tests and run with each commit.

Back to the bigger picture

Integration testing in flutter can be daunting due to how heavy the flutter_driver/integration_test solutions are with their UI testing strategies. We were able to overcome this and begin filling out the middle level of our testing pyramid by adding structure on top of the widget testing API that allows us to test full flows from start to finish. When pursuing this ourselves, we found it valuable to evaluate our testing strategy deficits, identify clear-cut boundaries around what code we wanted to test, and establish standards around what flows through the app should be tested. By going down the path of integration testing, we've been able to increase confidence in everyday changes as well as map out a plan for eliminating our manual test cases.
Why (And How) Betterment Is Using Julia
Betterment is using Julia to solve our own version of the "two-language problem."

At Betterment, we're using Julia to power the projections and recommendations we provide to help our customers achieve their financial goals. We've found it to be a great solution to our own version of the "two-language problem"–the idea that the language in which it is most convenient to write a program is not necessarily the language in which it makes the most sense to run that program. We're excited to share the approach we took to incorporating it into our stack and the challenges we encountered along the way.

Working behind the scenes, the members of our Quantitative Investing team bring our customers the projections and recommendations they rely on for keeping their goals on track. These hard-working and talented individuals spend a large portion of their time developing models, researching new investment ideas and maintaining our research libraries. While they're not engineers, their jobs definitely involve a good amount of coding. Historically, the team has written code mostly in a research environment, implementing proof-of-concept models that are later translated into production code with help from the engineering team. Recently, however, we've invested significant resources in modernizing this research pipeline by converting our codebase from R to Julia, and we're now able to ship updates to our quantitative models quicker, and with less risk of errors being introduced in translation. Currently, Julia powers all the projections shown inside our app, as well as a lot of the advice we provide to our customers. The Julia library we built for this purpose serves around 18 million requests per day, and very efficiently at that.

Examples of projections and recommendations at Betterment. Does not reflect any actual portfolio and is not a guarantee of performance.

Why Julia?

At QCon London 2019, Steve Klabnik gave a great talk on how the developers of the Rust programming language view tradeoffs in programming language design. The whole talk is worth a watch, but one idea that really resonated with us is that programming language design—and programming language choice—is a reflection of what the end-users of that language value, and not a reflection of the objective superiority of one language over another. Julia is a newer language that looked like a perfect fit for the investing team for a number of reasons:

Speed. If you've heard one thing about Julia, it's probably about its blazingly fast performance. For us, speed is important as we need to be able to provide real-time advice to our customers by incorporating their most up-to-date financial scenario in our projections and recommendations. It is also important in our research code, where the iterative nature of research means we often have to re-run financial simulations or models multiple times with slight tweaks.

Dynamicism. While speed of execution is important, we also require a dynamic language that allows us to test out new ideas and prototype rapidly. Julia ticks the box for this requirement as well by using a just-in-time compiler that accommodates both interactive and non-interactive workflows well. Julia also has a very rich type system, where researchers can build prototypes without type declarations and then later refactor the code where needed with type declarations for dispatch or clarity.
In either case, Julia is usually able to generate performant compiled code that we can run in production.

Relevant ecosystem. While the nascency of Julia as a language means that the community and ecosystem are much smaller than those of other languages, we found that the code and community oversample on the type of libraries that we care about. Julia has excellent support for technical computing and mathematical modelling.

Given these reasons, Julia is the perfect language to serve as a solution to the "two-language problem". This concept is oft-quoted in Julian circles and is perfectly exemplified by the previous workflow of our team: Investing Subject Matter Experts (SMEs) write domain-specific code that's solely meant to serve as research code, and that code then has to be translated into some more performant language for use in production. Julia solves this issue by making it very simple to take a piece of research code and refactor it for production use.

Our approach

We decided to build our Julia codebase inside a monorepo, with separate packages for each conceptual project we might work on, such as interest rate models, projections, social security amount calculations and so on. This works well from a development perspective, but we soon faced the question of how best to integrate this code with our production code, which is mostly developed in Ruby. We identified two viable alternatives:

1. Build a thin web service that will accept HTTP requests, call the underlying Julia functions, and then return an HTTP response.
2. Compile the Julia code into a shared library, and call it directly from Ruby using FFI.

Option 1 is a very common pattern, and actually quite similar to what had been the status quo at Betterment, as most of the projections and recommendation code existed in a JavaScript service. It may be surprising then to learn that we actually went with Option 2. We were deeply attracted to the idea of being able to fully integration-test our projections and recommendations working within our actual app (i.e. without the complication of a service boundary). Additionally, we wanted an integration that we could spin up quickly and with low ongoing cost; there's some fixed cost to getting an FFI embed working right—but once you do, it's an exceedingly low-cost integration to maintain. Fully-fledged services require infrastructure to run and are (ideally) supported by a full team of engineers. That said, we recognize the attractive properties of the more well-trodden Option 1 path and believe it could be the right solution in a lot of scenarios (and may become the right solution for us as our usage of Julia continues to evolve).

Implementation

Given how new Julia is, there was minimal literature on true interoperability with other programming languages (particularly high-level languages–Ruby, Python, etc). But we saw that the right building blocks existed to do what we wanted and proceeded with the confidence that it was theoretically possible. As mentioned earlier, Julia is a just-in-time compiled language, but it's possible to compile Julia code ahead-of-time using PackageCompiler.jl. We built an additional package into our monorepo whose sole purpose was to expose an API for our Ruby application, as well as compile that exposed code into a C shared library.
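The article's original snippets are not reproduced in this excerpt; to give a flavor of what that exposed code can look like, here is an illustrative sketch in the same spirit (the names and details are assumptions, not Betterment's code): a pure Julia insertion sort, plus a C-callable wrapper that writes into caller-allocated memory and returns a status code instead of throwing.

# A simple Julia function that sorts an array of numbers in place
# using the insertion sort algorithm.
function insertion_sort!(v::Vector{Float64})
    for i in 2:length(v)
        key = v[i]
        j = i - 1
        while j >= 1 && v[j] > key
            v[j + 1] = v[j]
            j -= 1
        end
        v[j + 1] = key
    end
    return v
end

# A C-callable wrapper suitable for export from a PackageCompiler.jl
# shared library. The caller allocates `result`; we copy the sorted
# values into it and return 0 on success, 1 on any exception.
Base.@ccallable function sort_doubles(input::Ptr{Cdouble}, result::Ptr{Cdouble}, len::Cint)::Cint
    try
        v = copy(unsafe_wrap(Array, input, len))
        insertion_sort!(v)
        GC.@preserve v unsafe_copyto!(result, pointer(v), len)
        return Cint(0)
    catch
        return Cint(1)
    end
end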
The code in this package is the glue between our pure Julia functions and the lower-level library interface—it's responsible for defining the functions that will be exported by the shared library and doing any necessary conversions on input/output. The sketch above gives the flavor: we take a simple Julia function (there, one that sorts an array of numbers using the insertion sort algorithm) and wrap it so it can be exposed in a shared library. In the wrapper, we've simplified memory management by requiring the caller to allocate memory for the result, and implemented primitive exception handling (see Challenges & Pitfalls below).

On the Ruby end, we built a gem which wraps our Julia library and attaches to it using Ruby-FFI. The gem includes a tiny Julia project with the API library as its only dependency. Upon gem installation, we fetch the Julia source and compile it as a native extension. Attaching to our example function with Ruby-FFI is straightforward. From here, we could begin using our function, but it wouldn't be entirely pleasant to work with–converting an input array to a pointer and processing the result would require some tedious boilerplate. Luckily, we can use Ruby's powerful metaprogramming abilities to abstract all that away–creating a declarative way to wrap an arbitrary Julia function, which results in a familiar and easy-to-use interface for Ruby developers, one where the fact that the underlying implementation is in Julia has been completely abstracted away.

Challenges & Pitfalls

Debugging an FFI integration can be challenging; any misconfiguration is likely to result in the dreaded segmentation fault–the cause of which can be difficult to hunt down. Here are a few notes for practitioners about some nuanced issues we ran into, that will hopefully save you some headaches down the line:

The Julia runtime has to be initialized before calling the shared library. When loading the dynamic library (whether through Ruby-FFI or some other invocation of `dlopen`), make sure to pass the flags `RTLD_LAZY` and `RTLD_GLOBAL` (`ffi_lib_flags :lazy, :global` in Ruby-FFI).

If embedding your Julia library into a multi-threaded application, you'll need additional tooling to only initialize and make calls into the Julia library from a single thread, as multiple calls to `jl_init` will error. We use a multi-threaded web server for our production application, and so when we make a call into the Julia shared library, we push that call onto a queue where it gets picked up and performed by a single executor thread, which then communicates the result back to the calling thread using a promise object.

Memory management–if you'll be passing anything other than primitive types back from Julia to Ruby (e.g. pointers to more complex objects), you'll need to take care to ensure the memory containing the data you're passing back isn't cleared by the Julia garbage collector prior to being read on the Ruby side. Different approaches are possible. Perhaps the simplest is to have the Ruby side allocate the memory into which the Julia function should write its result (and pass the Julia function a pointer to that memory). Alternatively, if you want to actually pass complex objects out, you'll have to ensure Julia holds a reference to the objects beyond the life of the function, in order to keep them from being garbage collected.
And then you'll probably want to expose a way for Ruby to instruct Julia to clean up that reference (i.e. free the memory) when it's done with it (Ruby-FFI has good support for triggering a callback when an object goes out-of-scope on the Ruby side).

Exception handling–conveying unhandled exceptions across the FFI boundary is generally not possible. This means any unhandled exception occurring in your Julia code will result in a segmentation fault. To avoid this, you'll probably want to implement catch-all exception handling in the functions your shared library exposes, so that any exceptions that occur are caught and some context about the error is returned to the caller (minimally, a boolean indicator of success/failure).

Tooling

To simplify development, we use a lot of tooling and infrastructure developed both in-house and by the Julia community. Since one of the draws of using Julia in the first place is the performance of the code, we make sure to benchmark our code during every pull request for potential performance regressions using the BenchmarkTools.jl package. To facilitate versioning and sharing of our Julia packages internally (e.g. to share a version of the Ruby-API package with the Ruby gem which wraps it) we also maintain a private package registry. The registry is a separate GitHub repository, and we use tooling from the Registrator.jl package to register new versions. To process registration events, we maintain a registry server on an EC2 instance provisioned through Terraform, so updates to the configuration are as easy as running a single `terraform apply` command.

Once a new registration event is received, the registry server opens a pull request to the Julia registry. There, we have built automated testing that resolves the version of the package being tested, looks up any reverse dependencies of that package, resolves the compatibility bounds of those packages to see if the newly registered version could lead to a breaking change, and, if so, runs the full test suites of the reverse dependencies. By doing this, we can ensure that when we release a patch or minor version of one of our packages, it won't break any packages that depend on it at registration time. If it would, the user is instead forced either to fix the changes that led to a downstream breakage, or to modify the registration to be a major version increase.

Takeaways

Though our venture into the Julia world is still relatively young compared to most of the other code at Betterment, we have found Julia to be a perfect fit in solving our two-language problem within the Investing team. Getting the infrastructure into a production-ready format took a bit of tweaking, but we are now starting to realize a lot of the benefits we hoped for when setting out on this journey, including faster development of production-ready models, and a clear separation of responsibilities between the SMEs on the Investing team, who are best suited for designing and specifying the models, and the engineering team, who have the knowledge of how to scale that code into a production-grade library. The switch to Julia has allowed us not only to optimize and speed up our code by multiple orders of magnitude, but has also given us the environment and ecosystem to explore ideas that would simply not be possible in our previous implementations.
Introducing “Delayed”: Resilient Background Jobs on Rails
In the past 24 hours, a Ruby on Rails application at Betterment performed somewhere on the order of 10 million asynchronous tasks.

While many of these tasks merely sent a transactional email, or fired off an iOS or Android push notification, plenty involved the actual movement of money—deposits, withdrawals, transfers, rollovers, you name it—while others kept Betterment's information systems up-to-date—syncing customers' linked account information, logging events to downstream data consumers, the list goes on. What all of these tasks had in common (aside from being, well, really important to our business) is that they were executed via a database-backed job-execution framework called Delayed, a newly open-sourced library that we're excited to announce… right now, as part of this blog post!

And, yes, you heard that right. We run millions of these so-called "background jobs" daily using a SQL-backed queue—not Redis, or RabbitMQ, or Kafka, or, um, you get the point—and we've very intentionally made this choice, for reasons that will soon be explained! But first, let's back up a little and answer a few basic questions.

Why Background Jobs?

In other words, what purpose do these background jobs serve? And how does running millions of them per day help us? Well, when building web applications, we (as web application developers) strive to build pages that respond quickly and reliably to web requests. One might say that this is the primary goal of any webapp—to provide a set of HTTP endpoints that reliably handle all the success and failure cases within a specified amount of time, and that don't topple over under high-traffic conditions. This is made possible, at least in part, by the ability to perform units of work asynchronously. In our case, via background jobs. At Betterment, we rely on said jobs extensively, to limit the amount of work performed during the "critical path" of each web request, and also to perform scheduled tasks at regular intervals. Our reliance on background jobs even allows us to guarantee the eventual consistency of our distributed systems, but more on that later. First, let's take a look at the underlying framework we use for enqueuing and executing said jobs.

Frameworks Galore!

And, boy howdy, are there plenty of available frameworks for doing this kind of thing! Ruby on Rails developers have the choice of resque, sidekiq, que, good_job, delayed_job, and now... delayed, Betterment's own flavor of job queue! Thankfully, Rails provides an abstraction layer on top of these, in the form of the Active Job framework. This, in theory, means that all jobs can be written in more or less the same way, regardless of the job-execution backend. Write some jobs (a minimal example follows below), pick a queue backend with a few desirable features (priorities, queues, etc), run some job worker processes, and we're off to the races! Sounds simple enough!

Unfortunately, if it were so simple we wouldn't be here, several paragraphs into a blog post on the topic. In practice, deciding on a job queue is more complicated than that. Quite a bit more complicated, because each backend framework provides its own set of trade-offs and guarantees, many of which will have far-reaching implications in our codebase. So we'll need to consider carefully!
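For readers who haven't written one before, a minimal Active Job job looks something like this. This is a generic sketch, not code from the delayed gem; the class names, mailer, and model are made up for illustration:

# app/jobs/send_receipt_job.rb
class SendReceiptJob < ApplicationJob
  queue_as :default

  # Keep the work idempotent so the job is safe to retry.
  def perform(order_id)
    order = Order.find(order_id)
    ReceiptMailer.receipt(order).deliver_now unless order.receipt_sent?
  end
end

# Enqueued from anywhere in the app (e.g. a controller or model callback):
SendReceiptJob.perform_later(order.id)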
How To Choose A Job Framework

The delayed rubygem is a fork of both delayed_job and delayed_job_active_record, with several targeted changes and additions, including numerous performance & scalability optimizations that we'll cover towards the end of this post. But first, in order to explain how Betterment arrived where we did, we must explain what it is that we need our job queue to be capable of, starting with the jobs themselves.

You see, a background job essentially represents a tiny contract. Each consists of some action being taken for / by / on behalf of / in the interest of one or more of our customers, and that must be completed within an appropriate amount of time. Betterment's engineers decided, therefore, that it was critical to our mission that we be capable of handling each and every contract as reliably as possible. In other words, every job we attempt to enqueue must, eventually, reach some form of resolution. Of course, job "resolution" doesn't necessarily mean success. Plenty of jobs may complete in failure, or simply fail to complete, and may require some form of automated or manual intervention. But the point is that jobs are never simply dropped, or silently deleted, or lost to the cyber-aether, at any point, from the moment we enqueue them to their eventual resolution. This general property—the ability to enqueue jobs safely and ensure their eventual resolution—is the core feature that we have optimized for. Let's call it resilience.

Optimizing For Resilience

Now, you might be thinking, shouldn't all of these ActiveJob backends be, at the very least, safe to use? Isn't "resilience" a basic feature of every backend, except maybe the test/development ones? And, yeah, it's a fair question. As the author of this post, my tactful attempt at an answer is that, well, not all queue backends optimize for the specific kind of end-to-end resilience that we look for. Namely, the guarantee of at-least-once execution.

Granted, having "exactly-once" semantics would be preferable, but if we cannot be sure that our jobs run at least once, then we must ask ourselves: how would we know if something didn't run at all? What kind of monitoring would be necessary to detect such a failure, across all the features of our app, and all the types of jobs it might try to run? These questions open up an entirely different can of worms, one that we would prefer remained firmly sealed. Remember, jobs are contracts. A web request was made, code was executed, and by enqueuing a job, we said we'd eventually do something. Not doing it would be... bad. Not even knowing we didn't do it... very bad. So, at the very least, we need the guarantee of at-least-once execution.

Building on at-least-once guarantees

If we know for sure that we'll fully execute all jobs at least once, then we can write our jobs in such a way that makes the at-least-once approach reliable and resilient to failure. Specifically, we'll want to make our jobs idempotent—basically, safely retryable, or resumable—and that is on us as application developers to ensure on a case-by-case basis. Once we solve this very solvable idempotency problem, we're on track for the same net result as an "exactly-once" approach, even if it takes a couple extra attempts to get there. Furthermore, this combination of at-least-once execution and idempotency can then be used in a distributed systems context, to ensure the eventual consistency of changes across multiple apps and databases.
Whenever a change occurs in one system, we can enqueue idempotent jobs notifying the other systems, and retry them until they succeed, or until we are left with stuck jobs that must be addressed operationally. We still concern ourselves with other distributed systems pitfalls like event ordering, but we don't have to worry about messages or events disappearing without a trace due to infrastructure blips. So, suffice it to say, at-least-once semantics are crucial in more ways than one, and not all ActiveJob backends provide them.

Redis-based queues, for example, can only be as durable (the "D" in "ACID") as the underlying datastore, and most Redis deployments intentionally trade off some durability for speed and availability. Plus, even when running in the most durable mode, Redis-based ActiveJob backends tend to dequeue jobs before they are executed, meaning that if a worker process crashes at the wrong moment, or is terminated during a code deployment, the job is lost. These frameworks have recently begun to move away from this LPOP-based approach, in favor of using RPOPLPUSH (to atomically move jobs to a queue that can then be monitored for orphaned jobs), but outside of Sidekiq Pro, this strategy doesn't yet seem to be broadly available. And these job execution guarantees aren't the only area where a background queue might fail to be resilient. Another big resilience failure happens far earlier, during the enqueue step.

Enqueues and Transactions

See, there's a major "gotcha" that may not be obvious from the list of ActiveJob backends. Specifically, it's that some queues rely on an app's primary database connection—they are "database-backed," against the app's own database—whereas others rely on a separate datastore, like Redis. And therein lies the rub, because whether or not our job queue is colocated with our application data will greatly inform the way that we write any job-adjacent code.

More precisely, when we make use of database transactions (which, when we use ActiveRecord, we assuredly do whether we realize it or not), a database-backed queue will ensure that enqueued jobs either commit or roll back with the rest of our ActiveRecord-based changes. This is extremely convenient, to say the least, since most jobs are enqueued as part of operations that persist other changes to our database, and we can in turn rely on the all-or-nothing nature of transactions to ensure that neither the job nor the data mutation is persisted without the other. Meanwhile, if our queue existed in a separate datastore, our enqueues would be completely unaware of the transaction, and we'd run the risk of enqueuing a job that acts on data that was never committed, or (even worse) we'd fail to enqueue a job even when the rest of the transactional data was committed. This would fundamentally undermine our at-least-once execution guarantees!

We already use ACID-compliant datastores to solve these precise kinds of data persistence issues, so with the exception of really, really high-volume operations (where a lot of noise and data loss can—or must—be tolerated), there's really no reason not to enqueue jobs co-transactionally with other data changes. And this is precisely why, at Betterment, we start each application off with a database-backed queue, co-located with the rest of the app's data, with the guarantee of at-least-once job execution. By the way, this is a topic I could talk about endlessly, so I'll leave it there for now.
If you're interested in hearing me say even more about resilient data persistence and job execution, feel free to check out Can I break this?, a talk I gave at RailsConf 2021! But in addition to the resiliency guarantees outlined above, we've also given a lot of attention to the operability and the scalability of our queue. Let's cover operability first.

Maintaining a Queue in the Long Run

Operating a queue means being able to respond to errors and recover from failures, and also being generally able to tell when things are falling behind. (Essentially, it means keeping our on-call engineers happy.) We do this in two ways: with dashboards, and with alerts.

Our dashboards come in a few parts. Firstly, we host a private fork of delayed_job_web, a web UI that allows us to see the state of our queues in real time and drill down to specific jobs. We've extended the gem with information on "erroring" jobs (jobs that are in the process of retrying but have not yet permanently failed), as well as the ability to filter by additional fields such as job name, priority, and the owning team (which we store in an additional column). We also maintain two other dashboards in our cloud monitoring service, DataDog. These are powered by instrumentation and continuous monitoring features that we have added directly to the delayed gem itself. When jobs run, they emit ActiveSupport::Notification events that we subscribe to and then forward along to a StatsD emitter, typically as "distribution" or "increment" metrics. Additionally, we've included a continuous monitoring process that runs aggregate queries, tagged and grouped by queue and priority, and that emits similar notifications that become "gauge" metrics. Once all of these metrics make it to DataDog, we're able to display a comprehensive timeboard that graphs things like average job runtime, throughput, time spent waiting in the queue, error rates, pickup query performance, and even some top-10 lists of slowest and most erroring jobs.

On the alerting side, we have DataDog monitors in place for overall queue statistics, like max age SLA violations, so that we can alert and page ourselves when queues aren't working off jobs quickly enough. Our SLAs are actually defined on a per-priority basis, and we've added a feature to the delayed gem called "named priorities" that allows us to define priority-specific configs. These represent integer ranges (entirely orthogonal to queues), and default to "interactive" (0-9), "user visible" (10-19), "eventual" (20-29), and "reporting" (30+), with default alerting thresholds focused on retry attempts and runtime.

There are plenty of other features that we've built that haven't made it into the delayed gem quite yet. These include the ability for apps to share a job queue but run separate workers (i.e. multi-tenancy), team-level job ownership annotations, resumable bulk orchestration and batch enqueuing of millions of jobs at once, forward-scheduled job throttling, and also the ability to encrypt the inputs to jobs so that they aren't visible in plaintext in the database. Any of these might be the topic for a future post, and might someday make their way upstream into a public release!

But Does It Scale?

As we've grown, we've had to push at the limits of what a database-backed queue can accomplish.
We've baked several improvements into the delayed gem, including a highly optimized, SKIP LOCKED-based pickup query, multithreaded workers, and a novel "max percent of max age" metric that we use to automatically scale our worker pool up to ~3x its baseline size when queues need additional concurrency. Eventually, we could explore ways of feeding jobs through to higher-performance queues downstream, far away from the database-backed workers. We already do something like this for some jobs with our journaled gem, which uses AWS Kinesis to funnel event payloads out to our data warehouse (while at the same time benefiting from the same at-least-once delivery guarantees as our other jobs!). Perhaps we'd want to generalize the approach even further.

But the reality of even a fully "scaled up" queue solution is that, if it is doing anything particularly interesting, it is likely to be database-bound. A Redis-based queue will still introduce DB pressure if its jobs execute anything involving ActiveRecord models, and solutions must exist to throttle or rate limit these jobs. So even if your queue lives in an entirely separate datastore, it can be effectively coupled to your DB's IOPS and CPU limitations.

So does the delayed approach scale? To answer that question, I'll leave you with one last takeaway. A nice property that we've observed at Betterment, and that might apply to you as well, is that the number of jobs tends to scale proportionally with the number of customers and accounts. This means that when we naturally hit vertical scaling limits, we could, for example, shard or partition our job table alongside our users table. Then, instead of operating one giant queue, we'll have broken things down to a number of smaller queues, each with their own worker pools, emitting metrics that can be aggregated with almost the same observability story we have today. But we're getting into pretty uncharted territory here, and, as always, your mileage may vary!

Try it out!

If you've read this far, we'd encourage you to take the leap and test out the delayed gem for yourself! Again, it combines both DelayedJob and its ActiveRecord backend, and should be more or less compatible with Rails apps that already use ActiveJob or DelayedJob. Of course, it may require a bit of tuning on your part, and we'd love to hear how it goes! We've also built an equivalent library in Java, which may also see a public release at some point. (To any Java devs reading this: let us know if that interests you!) Already tried it out? Any features you'd like to see added? Let us know what you think!
Focusing on What Matters: Using SLOs to Pursue User Happiness
Proper reliability is the greatest operational requirement for any service. If the service doesn't work as intended, no user (or engineer) will be happy. This is where SLOs come in.

The umbrella term "observability" covers all manner of subjects, from basic telemetry to logging, to making claims about longer-term performance in the shape of service level objectives (SLOs) and occasionally service level agreements (SLAs). Here I'd like to discuss some philosophical approaches to defining SLOs, explain how they help with prioritization, and outline the tooling currently available to Betterment Engineers to make this process a little easier.

What is an SLO?

At a high level, a service level objective is a way of measuring the performance of, correctness of, validity of, or efficacy of some component of a service over time, by comparing the functionality of specific service level indicators (metrics of some kind) against a target goal. For example:

99.9% of requests complete with a 2xx, 3xx or 4xx HTTP code within 2000ms over a 30 day period

The service level indicator (SLI) in this example is a request completing with a status code of 2xx, 3xx or 4xx and with a response time of at most 2000ms. The SLO is the target percentage, 99.9%. We reach our SLO goal if, during a 30 day period, 99.9% of all requests completed with one of those status codes and within that range of latency. If our service didn't succeed at that goal, the violation overflow — called an "error budget" — shows us by how much we fell short. With a goal of 99.9%, we have 40 minutes and 19 seconds of downtime available to us every 28 days. Check out more error budget math here.

If we fail to meet our goals, it's worthwhile to step back and understand why. Was the error budget consumed by real failures? Did we notice a number of false positives? Maybe we need to reevaluate the metrics we're collecting, or perhaps we're okay with setting a lower target goal because there are other targets that will be more important to our customers.

It's all about the customer

This is where the philosophy of defining and keeping track of SLOs comes into play. It starts with our users - Betterment users - and trying to provide them with a certain quality of service. Any error budget we set should account for our fiduciary responsibilities, and should guarantee that we do not cause an irresponsible impact to our customers. We also assume that there is a baseline degree of software quality baked in, so error budgets should help us prioritize positive impact opportunities that go beyond these baselines.

Sometimes there are a few layers of indirection between a service and a Betterment customer, and it takes a bit of creativity to understand what aspects of the service directly affect them. For example, an engineer on a backend or data-engineering team provides services that a user-facing component consumes indirectly. Or perhaps the users for a service are Betterment engineers, and it's really unclear how that work affects the people who use our company's products. It isn't that much of a stretch to claim that an engineer's level of happiness does have some effect on the level of service they're capable of providing a Betterment customer!

Let's say we've defined some SLOs and notice they are falling behind over time.
We might take a look at the metrics we’re using (the SLIs), the failures that chipped away at our target goal, and, if necessary, re-evaluate the relevancy of what we’re measuring. Do error rates for this particular endpoint directly reflect an experience of a user in some way - be it a customer, a customer-facing API, or a Betterment engineer? Have we violated our error budget every month for the past three months? Has there been an increase in Customer Service requests to resolve problems related to this specific aspect of our service? Perhaps it is time to dedicate a sprint or two to understanding what’s causing degradation of service. Or perhaps we notice that what we’re measuring is becoming increasingly irrelevant to a customer experience, and we can get rid of the SLO entirely! Benefits of measuring the right things, and staying on target The goal of an SLO-based approach to engineering is to provide data points with which to have a reasonable conversation about priorities (a point that Alex Hidalgo drives home in his book Implementing Service Level Objectives). In the case of services not performing well over time, the conversation might be “focus on improving reliability for service XYZ.” But what happens if our users are super happy, our SLOs are exceptionally well-defined and well-achieved, and we’re ahead of our roadmap? Do we try to get that extra 9 in our target - or do we use the time to take some creative risks with the product (feature-flagged, of course)? Sometimes it’s not in our best interest to be too focused on performance, and we can instead “use up our error budget” by rolling out some new A/B test, or upgrading a library we’ve been putting off for a while, or testing out a new language in a user-facing component that we might not otherwise have had the chance to explore. The tools to get us there Let’s dive into some tooling that the SRE team at Betterment has built to help Betterment engineers easily start to measure things. Collecting the SLIs and Creating the SLOs The SRE team has a web-app and CLI called Coach that we use to manage continuous integration (CI) and continuous delivery (CD), among other things. We’ve talked about Coach in the past here and here. At a high level, the Coach CLI generates a lot of yaml files that are used in all sorts of places to help manage operational complexity and cloud resources for consumer-facing web-apps. In the case of service level indicators (basically metrics collection), the Coach CLI provides commands that generate yaml files to be stored in GitHub alongside application code. At deploy time, the Coach web-app consumes these files and idempotently creates Datadog monitors, which can be used as SLIs (service level indicators) to inform SLOs, or as standalone alerts that need immediate triage every time they're triggered. In addition to Coach explicitly providing a config-driven interface for monitors, we’ve also written a couple of handy runtime-specific methods that result in automatic instrumentation for Rails or Java endpoints. I’ll discuss these more below. We also manage a separate repository for SLO definitions. We left this outside of application code so that teams can modify SLO target goals and details without having to redeploy the application itself. It also made visibility easier in terms of sharing and communicating different teams’ SLO definitions across the org.
Monitors in code Engineers can choose either StatsD or Micrometer to measure complicated experiences with custom metrics, and there are various approaches to turning those metrics directly into monitors within Datadog. We use Coach CLI-driven yaml files to support metric or APM monitor types directly in the code base. Those are stored in a file named .coach/datadog_monitors.yml and look like this:

  monitors:
    - type: metric
      metric: "coach.ci_notification_sent.completed.95percentile"
      name: "coach.ci_notification_sent.completed.95percentile SLO"
      aggregate: max
      owner: sre
      alert_time_aggr: on_average
      alert_period: last_5m
      alert_comparison: above
      alert_threshold: 5500
    - type: apm
      name: "Pull Requests API endpoint violating SLO"
      resource_name: api::v1::pullrequestscontroller_show
      max_response_time: 900ms
      service_name: coach
      page: false
      slack: false

It wasn’t simple to make this abstraction intuitive between a Datadog monitor configuration and a user interface. But this kind of explicit, attribute-heavy approach helped us get this tooling off the ground while we developed (and continue to develop) in-code annotation approaches. The APM monitor type was simple enough to turn into both a Java annotation and a tiny domain specific language (DSL) for Rails controllers, giving us nice symmetry across our platforms. This owner method for Rails apps results in all logs, error reports, and metrics being tagged with the team’s name, and at deploy time it's aggregated by a Coach CLI command and turned into latency monitors with reasonable defaults for optional parameters; essentially doing the same thing as our config-driven approach but from within the code itself:

  class DeploysController < ApplicationController
    owner "sre", max_response_time: "10000ms", only: [:index], slack: false
  end

For Java apps we have a similar interface (with reasonable defaults as well) in a tidy little annotation.

  @Sla
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.METHOD)
  public @interface Sla {
    @AliasFor(annotation = Sla.class)
    long amount() default 25_000;

    @AliasFor(annotation = Sla.class)
    ChronoUnit unit() default ChronoUnit.MILLIS;

    @AliasFor(annotation = Sla.class)
    String service() default "custody-web";

    @AliasFor(annotation = Sla.class)
    String slackChannelName() default "java-team-alerts";

    @AliasFor(annotation = Sla.class)
    boolean shouldPage() default false;

    @AliasFor(annotation = Sla.class)
    String owner() default "java-team";
  }

Then usage is just as simple as adding the annotation to the controller:

  @WebController("/api/stuff/v1/service_we_care_about")
  public class ServiceWeCareAboutController {
    @PostMapping("/search")
    @CustodySla(amount = 500)
    public SearchResponse search(@RequestBody @Valid SearchRequest request) {...}
  }

At deploy time, these annotations are scanned and converted into monitors along with the config-driven definitions, just like our Ruby implementation. SLOs in code Now that we have our metrics flowing, our engineers can define SLOs. If an engineer has a monitor tied to metrics or APM, then they just need to plug in the monitor ID directly into our SLO yaml interface.

  - last_updated_date: "2021-02-18"
    approval_date: "2021-03-02"
    next_revisit_date: "2021-03-15"
    category: latency
    type: monitor
    description: This SLO covers latency for our CI notifications system - whether it's the github context updates on your PRs or the slack notifications you receive.
    tags:
      - team:sre
    thresholds:
      - target: 99.5
        timeframe: 30d
        warning_target: 99.99
    monitor_ids:
      - 30842606

The interface supports metrics directly as well (mirroring Datadog’s SLO types) so an engineer can reference any metric directly in their SLO definition, as seen here:

  # availability
  - last_updated_date: "2021-02-16"
    approval_date: "2021-03-02"
    next_revisit_date: "2021-03-15"
    category: availability
    tags:
      - team:sre
    thresholds:
      - target: 99.9
        timeframe: 30d
        warning_target: 99.99
    type: metric
    description: 99.9% of manual deploys will complete successfully over a 30day period.
    query:
      # (total_events - bad_events) over total_events == good_events/total_events
      numerator: sum:trace.rack.request.hits{service:coach,env:production,resource_name:deployscontroller_create}.as_count()-sum:trace.rack.request.errors{service:coach,env:production,resource_name:deployscontroller_create}.as_count()
      denominator: sum:trace.rack.request.hits{service:coach,resource_name:deployscontroller_create}.as_count()

We love having these SLOs defined in GitHub because we can track who's changing them, how they're changing, and get review from peers. It's not quite the interactive experience of the Datadog UI, but it's fairly straightforward to fiddle in the UI and then extract the resulting configuration and add it to our config file. Notifications When we merge our SLO templates into this repository, Coach will manage creating SLO resources in Datadog and accompanying SLO alerts (that ping slack channels of our choice) if and when our SLOs violate their target goals. This is the slightly nicer part of SLOs versus simple monitors - we aren’t going to be pinged for every latency failure or error rate spike. We’ll only be notified if, over 7 days or 30 days or even longer, they exceed the target goal we’ve defined for our service. We can also set a “warning threshold” if we want to be notified earlier when we’re using up our error budget. Fewer alerts means the alerts should be something to take note of, and possibly take action on. This is a great way to get a good signal while reducing unnecessary noise. If, for example, our user research says we should aim for 99.5% uptime, that’s 3h 21m 36s of downtime available per 28 days. That’s a lot of time we can reasonably not react to failures. If we aren’t alerting on those 3 hours of errors, and instead alert just once if we exceed that limit, then we can direct our attention toward new product features, platform improvements, or learning and development. The last part of defining our SLOs is including a date when we plan to revisit that SLO specification. Coach will send us a message when that date rolls around to encourage us to take a deeper look at our measurements and possibly reevaluate our goals around measuring this part of our service. What if SLOs don’t make sense yet? It’s definitely the case that a team might not be at the level of operational maturity where defining product or user-specific service level objectives is in the cards. Maybe their on-call is really busy, maybe there are a lot of manual interventions needed to keep their services running, maybe they’re still putting out fires and building out their team’s systems. Whatever the case may be, this shouldn’t deter them from collecting data. They can define what is called an “aspirational” SLO - basically an SLO for an important component in their system - to start collecting data over time.
They don’t need to define an error budget policy, and they don’t need to take action when they fail their aspirational SLO. Just keep an eye on it. Another option is to start tracking the level of operational complexity for their systems. Perhaps they can set goals around "Bug Tracker Inbox Zero" or "Failed Background Jobs Zero" within a certain time frame, a week or a month for example. Or they can define some SLOs around types of on-call tasks that their team tackles each week. These aren’t necessarily true-to-form SLOs but engineers can use the framework and tooling provided to collect data around how their systems are operating and have conversations on prioritization based on what they discover, beginning to build a culture of observability and accountability. Conclusion Betterment is at a point in its growth where prioritization has become more difficult and more important. Our systems are generally stable, and feature development is paramount to business success. But so are reliability and performance. Proper reliability is the greatest operational requirement for any service.² If the service doesn’t work as intended, no user (or engineer) will be happy. This is where SLOs come in. SLOs should align with business objectives and needs, which will help Product and Engineering Managers understand the direct business impact of engineering efforts. SLOs will ensure that we have a solid understanding of the state of our services in terms of reliability, and they empower us to focus on user happiness. If our SLOs don’t align directly with business objectives and needs, they should align indirectly via tracking operational complexity and maturity. So, how do we choose where to spend our time? SLOs (service level objectives) - including managing their error budgets - will permit us - our product engineering teams - to have the right conversations and make the right decisions about prioritization and resourcing so that we can balance our efforts spent on reliability and new product features, helping to ensure the long-term happiness and confidence of our users (and engineers). ² Alex Hidalgo, Implementing Service Level Objectives -
Finding and Preventing Rails Authorization Bugs
Finding and Preventing Rails Authorization Bugs true This article walks through finding and fixing common Rails authorization bugs. At Betterment, we build public facing applications without an authorization framework by following three principles, discussed in another blog post. Those three principles are: Authorization through Impossibility Authorization through Navigability Authorization through Application Boundaries This post will explore the first two principles and provide examples of common patterns that can lead to vulnerabilities as well as guidance for how to fix them. We will also cover the custom tools we’ve built to help avoid these patterns before they can lead to vulnerabilities. If you’d like, you can skip ahead to the tools before continuing on to the rest of this post. Authorization through Impossibility This principle might feel intuitive, but it’s worth reiterating that at Betterment we never build endpoints that allow users to access another user’s data. There is no /api/socialsecuritynumbers endpoint because it is a prime target for third-party abuse and developer error. Similarly, even our authorized endpoints never allow one user to peer into another user’s object graph. This principle keeps us from ever having the opportunity to make some of the mistakes addressed in our next section. We acknowledge that many applications out there can’t make the same design decisions about users’ data, but as a general principle we recommend reducing the ways in which that data can be accessed. If an application absolutely needs to be able to show certain data, consider structuring the endpoint in a way such that a client can’t even attempt to request another user’s data. Authorization through Navigability Rule #1: Authorization should happen in the controller and should emerge naturally from table relationships originating from the authenticated user, i.e. the “trust root chain”. This rule is applicable for all controller actions and is a critical component of our security story. If you remember nothing else, remember this. What is a “trust root chain”? It’s a term we’ve co-opted from ssl certificate lingo, and it’s meant to imply a chain of ownership from the authenticated user to a target resource. We can enforce access rules by using the affordances of our relational data without the need for any additional “permission” framework. Note that association does not imply authorization, and the onus is on the developer to ensure that associations are used properly. Consider the following controller: So long as a user is authenticated, they can perform the show action on any document (including documents belonging to others!) provided they know or can guess its ID - not great! This becomes even more dangerous if the Documents table uses sequential ids, as that would make it easy for an attacker to start combing through the entire table. This is why Betterment has a rule requiring UUIDs for all new tables. This type of bug is typically referred to as an Insecure Direct Object Reference vulnerability. In short, these bugs allow attackers to access data directly using its unique identifiers – even if that data belongs to someone else – because the application fails to take authorization into account. We can use our database relationships to ensure that users can only see their own documents. 
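Since the post’s original code samples aren’t reproduced in this excerpt, here is a hypothetical sketch of both versions being described: the unscoped lookup above and the trust-root-chained rewrite discussed next (class names and associations are assumed for illustration).

  # Unscoped: any authenticated user can load any document by guessing an ID.
  class DocumentsController < ApplicationController
    def show
      @document = Document.find(params[:document_id])
    end
  end

  # Scoped through the trust root chain: the lookup goes through the
  # authenticated user's own association, so another user's ID raises a 404.
  class DocumentsController < ApplicationController
    def show
      @document = current_user.documents.find(params[:document_id])
    end
  end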
Assuming a User has many Documents, we would change our controller to the following: Now any document_id that doesn’t exist in the user’s object graph will raise a 404 and we’ve provided authorization for this endpoint without a framework - easy peasy. Rule #2: Controllers should pass ActiveRecord models, rather than ids, into the model layer. As a corollary to Rule #1, we should ensure that all authorization happens in the controller by disallowing model initialization with *_id attributes. This rule speaks to the broader goal of authorization being obvious in our code. We want to minimize the hops and jumps required to figure out what we’re granting access to, so we make sure that it all happens in the controller. Consider a controller that links attachments to a given document. Let’s assume that a User has many Attachments that can be attached to a Document they own. Take a minute and review this controller - what jumps out to you? At first glance, it looks like the developer has taken the right steps to adhere to Rule #1 via the document method, and we’re using strong params. Is that enough? Unfortunately, it’s not. There’s actually a critical security bug here that allows the client to specify any attachment_id, even if they don’t own that attachment - eek! Here’s a simple way to resolve our bug: Now before we create a new AttachmentLink, we verify that the attachment_id specified actually belongs to the user and our code will raise a 404 otherwise - perfect! By keeping the authorization up front in the controller and out of the model, we’ve made it easier to reason about. If we buried the authorization within the model, it would be difficult to ensure that the trust-root chain is being enforced – especially if the model is used by multiple controllers that handle authorization inconsistently. Reading the AttachmentLink model code, it would be clear that it takes an attachment_id but whether authorization has been handled or not would remain a bit of a mystery. Automatically Detecting Vulnerabilities At Betterment, we strive to make it easy for engineers to do the right thing – especially when it comes to security practices. Given the formulaic patterns of these bugs, we decided static analysis would be a worthwhile endeavor. Static analysis can help not only with finding existing instances of these vulnerabilities, but also with preventing new ones from being introduced. By automating detection of these “low hanging fruit” vulnerabilities, we can free up engineering effort during security reviews and focus on more interesting and complex issues. We decided to lean on RuboCop for this work. As a Rails shop, we already make heavy use of RuboCop. We like it because it’s easy to introduce to a codebase, violations break builds in clear and actionable ways, and disabling specific checks requires engineers to comment their code in a way that makes it easy to surface during code review. Keeping rules #1 and #2 in mind, we’ve created two cops: Betterment/UnscopedFind and Betterment/AuthorizationInController; these will flag any models being retrieved and created in potentially unsafe ways, respectively. At a high level, these cops track user input (via params.permit et al.) and raise offenses if any of these values get passed into methods that could lead to a vulnerability (e.g. model initialization, find calls, etc.). You can find these cops here. We’ve been using these cops for over a year now and have had a lot of success with them.
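If you want to try these cops in your own app, wiring them up is a standard RuboCop configuration exercise. The following is a minimal sketch (the gem and require names are taken from the Betterlint repository linked above, but check its README for the exact, current instructions):

  # Gemfile
  gem 'betterlint'

  # .rubocop.yml
  require:
    - betterlint

  Betterment/UnscopedFind:
    Enabled: true
  Betterment/AuthorizationInController:
    Enabled: true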
In addition to these two, the Betterlint repository contains other custom cops we’ve written to enforce certain patterns -- both security related as well as more general ones. We use these cops in conjunction with the default RuboCop configurations for all of our Ruby projects. Let’s run the first cop, Betterment/UnscopedFind, against DocumentsController from above:

  $ rubocop app/controllers/documents_controller.rb
  Inspecting 1 file
  C

  Offenses:

  app/controllers/documents_controller.rb:3:17: C: Betterment/UnscopedFind: Records are being retrieved directly using user input. Please query for the associated record in a way that enforces authorization (e.g. "trust-root chaining").
  INSTEAD OF THIS:
  Post.find(params[:post_id])
  DO THIS:
  current_user.posts.find(params[:post_id])
  See here for more information on this error: https://github.com/Betterment/betterlint/blob/main/README.md#bettermentunscopedfind
  @document = Document.find(params[:document_id])
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  1 file inspected, 1 offense detected

The cop successfully located the vulnerability. If we attempted to deploy this code, RuboCop would fail the build, preventing the code from going out while letting reviewers know exactly why. Now let’s try running Betterment/AuthorizationInController on the AttachmentLink example from earlier:

  $ rubocop app/controllers/documents/attachments_controller.rb
  Inspecting 1 file
  C

  Offenses:

  app/controllers/documents/attachments_controller.rb:3:24: C: Betterment/AuthorizationInController: Model created/updated using unsafe parameters. Please query for the associated record in a way that enforces authorization (e.g. "trust-root chaining"), and then pass the resulting object into your model instead of the unsafe parameter.
  INSTEAD OF THIS:
  post_parameters = params.permit(:album_id, :caption)
  Post.new(post_parameters)
  DO THIS:
  album = current_user.albums.find(params[:album_id])
  post_parameters = params.permit(:caption).merge(album: album)
  Post.new(post_parameters)
  See here for more information on this error: https://github.com/Betterment/betterlint/blob/main/README.md#bettermentauthorizationincontroller
  AttachmentLink.new(create_params.merge(document: document)).save!
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  1 file inspected, 1 offense detected

The model initialization was flagged because it was seen using create_params, which contains user input. Like with the other cop, this would fail the build and prevent the code from making it to production. You may have noticed that unlike the previous example, the vulnerable code doesn’t directly reference a params.permit call or any of the parameter names, but the code was still flagged. This is because both of the cops keep a little bit of state to ensure they have the appropriate context necessary when analyzing potentially unsafe function calls. We also made sure when developing these cops that we tested them with real code samples and not just contrived scenarios that no developer would actually ever attempt. False Positives With any type of static analysis, there are bound to be false positives. When working on these cops, we narrowed down false positives to two scenarios: the flagged code could be considered insecure only in other contexts (e.g. the application or models in question don’t have a concept of “private” data), or the flagged code isn’t actually insecure (e.g.
the initialization happens to take a parameter whose name ends in _id but it doesn’t refer to a unique identifier for any objects). In both these cases, the developer should feel empowered to either rewrite the line in question or locally disable the cop, both of which will prevent the code from being flagged. Normally we’d consider opting out of security analysis to be an unsafe thing to do, but we actually like the way RuboCop handles this because it can help reduce some code review effort; the first solution eliminates the vulnerable-looking pattern (even if it wasn’t a vulnerability to begin with) while the second one signals to reviewers that they should confirm this code is actually safe (making it easy to pinpoint areas of focus). Testing & Code Review Strategies RuboCop and Rails tooling can only get us so far in mitigating authorization bugs. The remainder falls on the shoulders of the developer and their peers to be cognizant of the choices they are making when shipping new application controllers. In light of that, we’ll cover some helpful strategies for keeping authorization front of mind. Testing When writing request specs for a controller action, write a negative test case to prove that attempts to circumvent your authorization measures return a 404. For example, consider a request spec for our Documents::AttachmentsController: These test cases are an inexpensive way to prove to yourself and your reviewers that you’ve considered the authorization context of your controller action and accounted for it properly. Like all of our tests, this functions both as regression prevention and as documentation of your intent. Code Review Our last line of defense is code review. Security is the responsibility of every engineer, and it’s critical that our reviewers keep authorization and security in mind when reviewing code. A few simple questions can facilitate effective security review of a PR that touches a controller action: Who is the authenticated user? What resource is the authenticated user operating on? Is the authenticated user authorized to operate on the resource in accordance with Rule #1? What parameters is the authenticated user submitting? Where are we authorizing the user’s access to those parameters? Do all associations navigated in the controller properly signify authorization? Getting in the habit of asking these questions during code review should lead to more frequent conversations about security and data access. Our hope is that linking out to this post and its associated Rules will reinforce a strong security posture in our application development. In Summary Unlike authentication, authorization is context-specific and difficult to “abstract away” from the leaf nodes of application code. This means that we as application developers need to consider authorization with every controller we write or change. We’ve explored two new rules to encourage best practices when it comes to authorization in our application controllers: Authorization should happen in the controller and should emerge naturally from table relationships originating from the authenticated user, i.e. the “trust root chain”. Controllers should pass ActiveRecord models, rather than ids, into the model layer. We’ve also covered how our custom cops can help developers avoid antipatterns, resulting in safer and easier-to-read code. Keep these in mind when writing or reviewing application code that an authenticated user will utilize and remember that authorization should be clear and obvious. -
Using Targeted Universalism To Build Inclusive Features
Using Targeted Universalism To Build Inclusive Features true The best products are inclusive at every stage of the design and engineering process. Here's how we turned a request for more inclusion into a feature all Betterment customers can benefit from. Earlier this year, a coworker asked me how difficult it would be to add a preferred name option into our product. They showed me how we were getting quite a few requests from trans customers to quit deadnaming them. The simplest questions tend to be the hardest to answer. For me, simple questions bring to mind this interesting concept called The Illusion Of Explanatory Depth, which is when “people feel they understand complex phenomena with far greater precision, coherence, and depth than they really do.” Simple questions tend to shed light on subjects shrouded in this illusion and force you to confront your lack of knowledge. Asking for someone’s name is simple, but full of assumptions. Deadnaming is when, intentionally or not, you refer to a trans person by the name they used before transitioning. For many trans folks like myself, this is the name assigned at birth which means all legal and government issued IDs and documents use this non-affirming name. According to Healthline, because legal name changes are “expensive, inaccessible, and not completely effective at eliminating deadnaming”, institutions like Betterment can and should make changes to support our trans customers. This simple question from our trans customers “Can you quit deadnaming me?” was a sign that our original understanding of our customers' names was not quite right, and we were lacking knowledge around how names are commonly used. Now, our work involved dispelling our previous understanding of what a name is. How to turn simple questions into solutions. At Betterment, we’re required by the government to have a record of a customer’s legal first name, but that shouldn’t prevent us from letting customers share their preferred or chosen first name, and then using that name in the appropriate places. This was a wonderful opportunity to practice targeted universalism: a concept that explains how building features specifically for a marginalized audience not only benefit the people in that marginalized group, but also people outside of it, which increases its broad impact. From a design standpoint, executing a preferred name feature was pretty straightforward—we needed to provide a user with a way to share their preferred name with us, and then start using it. The lead designer for this project, Crys, did a lovely job of incorporating compassionate design into how we show the user which legal name we have on file for them, without confronting that user with their deadname every time they go to change their settings. They accomplished that by hiding the user’s legal name in a dropdown accordion that is toggled closed by default. Crys also built out a delightful flow that shows the user why we require their legal name, that answers a few common questions, and allows them to edit their preferred first name in the future if needed. With a solid plan for gathering user input, we pivoted to the bigger question: Where should we use a customer’s preferred first name? From an engineering standpoint, this question revealed a few hurdles that we needed to clear up. First, I needed to provide a translation of my own understanding of legal first names and preferred first names to our codebase. 
The first step in this translation was to deprecate our not-very-descriptively named #first_name method and push engineers to start using two new, descriptive methods called #legal_first_name and #common_first_name (#common_first_name is essentially a defaulting method that falls back to #legal_first_name if #preferred_first_name is not present for that user). To do this, I used a tool built by our own Betterment engineer, Nathan, called Uncruft, which not only gave engineers a warning whenever they tried to use the old #first_name method but also created a list of all the places in our code where we were currently using that old method. This was essentially a map for us engineers to be able to reference and go update those old usages in our codebase whenever we wanted. This new map leads us to our second task: addressing those deprecated usages. At first glance the places where we used #first_name in-app seemed minimal—emails, in-app greetings, tax documents. But once we looked under the surface, #first_name was sprinkled nearly everywhere in our codebase. I identified the most visible spots where we address a user and changed them, but for less visible changes I took this new map and delegated cross-squad ownership of each usage. Then, a group of engineers from each squad began tackling each deprecation one by one. In order to help these engineers, we provided guidelines around where it was necessary to use a legal first name, but in general we pushed to use a customer’s preferred first name wherever possible. From a high-level view I essentially split this large engineering lift into two different streams of work. There was the feature work stream which involved: Storing the user’s new name information. Building out the user interface. Updating the most visible spots in our application. Modifying our integration with SimonData in order to bulk update our outgoing emails, and Changing how we share a user’s name with our customer service (CX) team through a Zendesk integration, as well as in our internal CX application. Then there was the foundational work stream, which involved mapping out and addressing every single deprecation. Thanks to Uncruft, once I generated that initial map of deprecations the large foundational work stream could then be further split into smaller brooks of work that could be tackled by different squads at different times. Enabling preferred first names moves us towards a more inclusive product. Once this feature went live, it was extremely rewarding to see our targeted universalism approach reveal its benefits. Our trans customers got the solution they needed, which makes this work crucial for that fact alone—but because of that, our cis customers also received a feature that delighted them. Ultimately, we now know that if people are given a tool to personalize their experience within our product, folks of many different backgrounds will use it. -
Guidelines for Testing Rails Applications
Guidelines for Testing Rails Applications true Discusses the different responsibilities of model, request, and system specs, and other high level guidelines for writing specs using RSpec & Capybara. Testing our Rails applications allows us to build features more quickly and confidently by proving that code does what we think it should, catching regression bugs, and serving as documentation for our code. We write our tests, called “specs” (short for specification) with RSpec and Capybara. Though there are many types of specs, in our workflow we focus on only three: model specs, request specs, and system specs. This blog post discusses the different responsibilities of these types of specs, and other related high level guidelines for specs. Model Specs Model specs test business logic. This includes validations, instance and class method inputs and outputs, Active Record callbacks, and other model behaviors. They are very specific, testing a small portion of the system (the model under test), and cover a wide range of corner cases in that area. They should generally give you confidence that a particular model will do exactly what you intended it to do across a range of possible circumstances. Make sure that the bulk of the logic you’re testing in a model spec is in the method you’re exercising (unless the underlying methods are private). This leads to less test setup and fewer tests per model to establish confidence that the code is behaving as expected. Model specs have a live database connection, but we like to think of our model specs as unit tests. We lean towards testing with a bit of mocking and minimal touches to the database. We need to be economical about what we insert into the database (and how often) to avoid slowing down the test suite too much over time. Don’t persist a model unless you have to. For a basic example, you generally won’t need to save a record to the database to test a validation. Also, model factories shouldn’t by default save associated models that aren’t required for that model’s persistence. At the same time, requiring a lot of mocks is generally a sign that the method under test either is doing too many different things, or the model is too highly coupled to other models in the codebase. Heavy mocking can make tests harder to read, harder to maintain, and provide less assurance that code is working as expected. We try to avoid testing declarations directly in model specs - we’ll talk more about that in a future blog post on testing model behavior, not testing declarations. Below is a model spec skeleton with some common test cases: System Specs System specs are like integration tests. They test the beginning to end workflow of a particular feature, verifying that the different components of an application interact with each other as intended. There is no need to test corner cases or very specific business logic in system specs (those assertions belong in model specs). We find that there is a lot of value in structuring a system spec as an intuitively sensible user story - with realistic user motivations and behavior, sometimes including the user making mistakes, correcting them, and ultimately being successful. There is a focus on asserting that the end user sees what we expect them to see. System specs are more performance intensive than the other spec types, so in most cases we lean towards fewer system specs that do more things, going against the convention that tests should be very granular with one assertion per test. 
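The post’s example specs aren’t reproduced in this excerpt, but a rough sketch of the kind of multi-step system spec being described might look like the following (feature, path, and copy are invented; it assumes FactoryBot factories and a sign_in test helper):

  RSpec.describe 'Funding an account', type: :system do
    it 'lets a customer recover from a mistake and complete a deposit' do
      user = create(:user)   # hypothetical factory
      sign_in user           # hypothetical auth helper

      visit new_deposit_path

      # Error path: submitting without an amount shows a validation message.
      click_button 'Make deposit'
      expect(page).to have_content("Amount can't be blank")

      # Happy path: correcting the mistake completes the flow.
      fill_in 'Amount', with: '100'
      click_button 'Make deposit'
      expect(page).to have_content('Your $100 deposit is on its way')
    end
  end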
One system spec that asserts the happy path will be sufficient for most features. Besides the performance benefits, reading a single system spec from beginning to end ends up being good high-level documentation of how the software is used. In the end, we want to verify the plumbing of user input and business logic output through as few large specs per feature as we can get away with. If there is significant conditional behavior in the view layer and you are looking to make your system spec leaner, you may want to extract that conditional behavior to a presenter resource model and test that separately in a model spec so that you don’t need to worry about testing it in a system spec. We use SitePrism to abstract away bespoke page interactions and CSS selectors. It helps to make specs more readable and easier to fix if they break because of a UI or CSS change. We’ll dive more into system spec best practices in a future blog post. Below is an example system spec. Note that the error path and two common success paths are exercised in the same spec. Request Specs Request specs test the traditional responsibilities of the controller. These include authentication, view rendering, selecting an HTTP response code, redirecting, and setting cookies. It’s also OK to assert that the database was changed in some way in a request spec, but like system specs, there is no need for detailed assertions around object state or business logic. When controllers are thin and models are tested heavily, there should be no need to duplicate business logic test cases from a model spec in a request spec. Request specs are not mandatory if the controller code paths are exercised in a system spec and they are not doing something different from the average controller in your app. For example, a controller that has different authorization restrictions because the actions it is performing are more dangerous might require additional testing. The main exception to these guidelines is when your controller is an API controller serving data to another app. In that case, your request spec becomes like your system spec, and you should assert that the response body is correct for important use cases. API boundary tests are even allowed to be duplicative with underlying model specs if the behavior is explicitly important and apparent to the consuming application. Request specs for APIs are owned by the consuming app’s team to ensure that the invariants that they expect to hold are not broken. Below is an example request spec. We like to extract standard assertions such as ones relating to authentication into shared examples. More on shared examples in the section below. Why don’t we use Controller Specs? Controller specs are notably absent from our guide. We used to use controller specs instead of request specs. This was mainly because they were faster to run than request specs. However, in modern versions of Rails, that has changed. Under the covers, request specs are just a thin wrapper around Rails integration tests. In Rails 5+, integration tests have been made to run very fast. Rails is so confident in the improvements they’ve made to integration tests that they’ve removed controller tests from Rails core in Rails 5.1. Additionally, request specs are much more realistic than controller specs since they actually exercise the full request / response lifecycle – routing, middleware, etc – whereas controller specs circumvent much of that process.
Given the changes in Rails and the limitations of controller specs, we’ve changed our stance. We no longer write controller specs. All of the things that we were testing in controller specs can instead be tested by some combination of system specs, model specs, and request specs. Why don’t we use Feature Specs? Feature specs are also absent from our guide. System specs were added to Rails 5.1 core and it is the core team’s preferred way to test client-side interactions. In addition, the RSpec team recommends using system specs instead of feature specs. In system specs, each test is wrapped in a database transaction because it’s run within a Rails process, which means we don’t need to use the DatabaseCleaner gem anymore. This makes the tests run faster, and removes the need for having any special tables that don’t get cleaned out. Optimal Testing Because we use these three different categories of specs, it’s important to keep in mind what each type of spec is for to avoid over-testing. Don’t write the same test three times - for example, it is unnecessary to have a model spec, request spec, and a system spec that are all running assertions on the business logic responsibilities of the model. Over-testing takes more development time, can add additional work when refactoring or adding new features, slows down the overall test suite, and sets the wrong example for others when referencing existing tests. Think critically about what each type of spec is intended to be doing while writing specs. If you’re significantly exercising behavior not in the layer you’re writing a test for, you might be putting the test in the wrong place. Testing requires striking a fine balance - we don’t want to under-test either. Too little testing doesn’t give any confidence in system behavior and does not protect against regressions. Every situation is different and if you are unsure what the appropriate test coverage is for a particular feature, start a discussion with your team! Other Testing Recommendations Consider shared examples for last-mile regression coverage and repeated patterns. Examples include request authorization and common validation/error handling: Each spec’s description begins with an action verb, not a helping verb like “should,” “will” or something similar. -
WebValve – The Magic You Need for HTTP Integration
WebValve – The Magic You Need for HTTP Integration true Struggling with HTTP integrations locally? Use WebValve to define HTTP service fakes and toggle between real and fake services in non-production environments. When I started at Betterment (the company) five years ago, Betterment (the platform) was a monolithic Java application. As good companies tend to do, it began growing—not just in terms of users, but in terms of capabilities. And our platform needed to grow along with it. At the time, our application had no established patterns or tooling for the kinds of third-party integrations that customers were increasingly expecting from fintech products (e.g., how Venmo connects to your bank to directly deposit and withdraw money). We were also feeling the classic pain points of a growing team contributing to a single application. To keep the momentum going, we needed to transition towards a service-oriented architecture that would allow the engineers of different business units to run in parallel against their specific business goals, creating even more demand for repeatable solutions to service integration. This brought up another problem (and the starting point for this blog post): in order to ensure tight feedback loops, we strongly believed that our devs should be able to do their work on a modern, modestly-specced laptop without internet connectivity. That meant no guaranteed connection to a cloud service mesh. And unfortunately, it’s not possible to run a local service mesh on a laptop without it melting. In short, our devs needed to be able to run individual services in isolation; by default they were set to communicate with one another, meaning an engineer would have to run all of the services locally in order to work on any one service. To solve this problem, we developed WebValve—a tool that allows us to define and register fake implementations of HTTP services and toggle between real and fake services in non-production environments. I’m going to walk you through how we got there. Start with the test Here’s a look at what a test would look like to see if a deposit from a bank was initiated: The five lines of code on the bottom are the meat of the test. Easy, right? Not quite. Notice the two WebMock stub_request calls at the top. The second one has the syntax you’d expect to execute the test itself. But take a look at the first one—notice the 100+ lines of (omitted) code. Without getting into the gory details, this essentially requires us, for every test we write, to stub a request for user data—with differences across minor things like ID values, we can’t share these stubs between tests. In short it’s a sloppy feature spec. So how do we narrow this feature spec down to something like this? Through the magic of libraries. First things first—defining our view of the problem space. The success of projects like these doesn’t come down to the code itself—it comes down to the ‘design’ of the solution based on its specific needs. In this case, it meant paring the conditions down to making it work using just Rails. Those come to life in four major principles, which guide how we engage with the problem space for our shift to a service-oriented architecture: We use HTTP & REST to communicate with collaborator services. We define the boundaries and limit the testing of integrations with contract tests. We don't share code across service boundaries. Engineers must remain nimble and building features must remain enjoyable. A little bit of color on each, starting with HTTP and REST.
For APIs that we build for ourselves (e.g. internal services) we have full control over how we build them, so using HTTP and REST is no issue. We have a strong preference to use a single integration pattern for both internal and external service integrations; this reduces cognitive overhead for devs. When we’re communicating with external services, we have less control, but HTTP is the protocol of the web and REST has been around since 2000—the dawn of modern web applications— so the majority of integrations we build will use them. REST is semantic, evolvable, limber, and very familiar to us as Rails developers —a natural ‘other side of the coin’ for HTTP to make up the lingua franca of the web. Secondly, we need to define the boundaries in terms of ‘contracts.’ Contracts are a point of exchange between the consumption side (the app) and producer side (the collaborator service). The contract defines the expectations of input and output for the exchange. They’re an alternative to the kind of high-level systems integration tests that would include a critical mass of components that would render the test slow and non-repeatable. Thirdly, we don't want to have shared code across service boundaries. Shared code between services creates shared ownership, and shared ownership leads to undesirable coupling. We want the API provider to own and version their APIs, and we want the API consumer to own their integration with each version of a collaborator service's API. If we were willing to accept tight coupling between our services, specifically in their API contracts, we'd be well-served by a tool like Pact. With Pact, you create a contract file based on the consumer's expectations of an API and you share it with the provider. The contract files themselves are about the syntax and structure of requests and responses rather than the interpretation. There's a human conversation and negotiation to be had about these contracts, and you can fool yourself into thinking you don't need to have that conversation if you've got a file that guarantees that you and your collaborator service are speaking the same language; you may be speaking the same words, but you might not infer the same meaning. Pact's docs encourage these human conversations, but as a tool it doesn't require them. By avoiding shared code between services, we force ourselves to have a conversation about every API we build with the consumers of those APIs. Finally, these tests’ effectiveness is directly related to how we can apply them to reality, so we need to be simple—we want to be able to test and build features without connections to other features. We want them to be able to work without an internet connection, and if we do want to integrate with a real service in local development, we should be able to do that—meaning we should be able to test and integrate locally at will, without having to rely on cumbersome, extra-connected services (think Docker, Kubernetes; anything that pairs cloud features with the local environment.) Straightforward tests are easy to write, read, and maintain. That keeps us moving fast and not breaking things. So, to recap, there are four principles that will drive our solution: Service interactions happen over HTTP & REST Contract tests ensure that service interactions behave as expected Providing an API contract requires no shared code Building features remains fast and fun Okay, okay, but how? 
So we’ve established that we don’t want to hit external services in tests, which we can do through WebMock or similar libraries. The challenge becomes: how do we replicate the integration environment without the integration environment? Through fakes. We’ll fake the integration by using Sinatra to build a rack app that quacks like the real thing. In the rack app, we define the routes we care about for the things we normally would have stubbed in the tests. From here, we do the things we couldn’t do before—pull real parameters out of the requests and feed them back into the fake response to make it more realistic. Additionally, we can use things like ActiveRecord to make these fake responses even more realistic based on the data stored in our actual database. So what does the fake look like? It's a class with a route defined for each URL we care about faking. We can use WebMock to wire the fake to requests that match a certain pattern. If we receive a request for a URL we didn't define, it will 404. Simple. However, this doesn’t allow us to solve all the things we were working for. What’s missing? First, an idiomatic setup stance. We want to be able to define fakes in a single place, so when we add a new one, we can easily find it and change it. In the same vein, we want to be able to answer similar questions about registering fakes in one spot. Finally, convention over configuration—if we can load, register, and wire-up a fake based on its name, for example, that would be handy. Secondly, it’s missing environment-specific behavior, which in this case, translates into the ability to toggle the library on and off and separately toggle the connection to specific collaborator services on and off. We need to be able to have the library active when running tests or doing local development, but do not want to have it running in a production environment—if it remains active in a real environment, it might affect real customer accounts, which we cannot afford. But, there will also be times when we're running in a local development environment and we want to communicate with a real collaborator service to do some true integration testing. Thirdly, we want to be able to autoload our fakes. If they’re in our codebase, we should be able to iterate on the fakes without having to restart our server; the behavior isn’t always right the first time, and restarting is tedious and it's not the Rails Way. Finally, to bolt this on to an IRL application, we need the ability to define fakes incrementally and migrate them into existing integrations that we have, one by one. Okay brass tacks. No existing library allows us to integrate this way and map HTTP requests to in-process fakes for integration and development. Hence, WebValve. TL;DR—WebValve is an open-source gem that uses Sinatra and WebMock to provide fake HTTP service behavior. The special sauce is that it works for more than just your tests. It allows you to run your fakes in your dev environment as well, providing functionality akin to real environments with the toggles we need to access the real thing when we need to. Let’s run it through the gauntlet to show how it works and how it solves for all our requirements. First we add the gem to our Gemfile and run bundle install. With the gem installed, we can use the generator rails g webvalve:install to bootstrap a default config file where we can register our fakes. Then we can generate a fake for our "trading" collaborator service using rails generate webvalve:fake_service Trading. 
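The generated fake isn’t shown in this excerpt, but a rough sketch of what it ends up looking like is below (the file location, route paths, and payloads are invented for illustration; in practice they would mirror whatever the real trading service exposes):

  # app/webvalve/fake_trading.rb
  class FakeTrading < WebValve::FakeService
    # Routes are plain Sinatra; define one for each endpoint you would
    # otherwise have stubbed with WebMock in individual tests.
    get '/api/v1/users/:id' do
      content_type :json
      { id: params[:id], name: 'Fake User' }.to_json
    end

    post '/api/v1/deposits' do
      content_type :json
      { status: 'initiated' }.to_json
    end
  end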
This gives us a class in a conventional location that inherits from WebValve::FakeService. This looks very similar to a Sinatra app, and that's because it is one—with some additional magic baked in. To make this fake work, all we have to do is define the conventionally-named environment variable, TRADING_API_URL. That tells WebValve what requests to intercept and route to this fake. By inheriting from this WebValve class, we gain the ability to toggle the fake behavior on or off based on another conventionally-named environment variable, in this case TRADING_ENABLED. So let’s take our feature spec. First, we configure our test suite to use WebValve with the RSpec config helper require 'webvalve/rspec'. Then, we look at the user API call—we define a new route for user in FakeTrading. Then we flesh out that fake route by scooping out our JSON from the test file and probably making it a little more dynamic when we drop it into the fake. Then we do the same for the deposit API call. And now our test, which doesn't care about the specifics of either of those API calls, is much clearer. It looks just like our ideal spec from before: We leverage all the power of WebMock and Sinatra through our conventions and the teeniest configuration to provide all the same functionality as before, but we can write cleaner tests, we get the ability to use these fakes in local development instead of the real services—and we can enable a real service integration without missing a beat. We’ve achieved our goal—we’ve allowed for all the functionality of integration without the threats of actual integration. Check it out on GitHub. This article is part of Engineering at Betterment. -
Building for Better: Gender Inclusion at Betterment
Building for Better: Gender Inclusion at Betterment true Betterment sits at the intersection of two industries with large, historical gender gaps. We’re working to change that—for ourselves and our industries. Since our founding, we’ve maintained a commitment to consistently build a better company and product for our customers and our customers-to-be. Part of that commitment includes reflecting the diversity of those customers. Betterment sits at the intersection of finance and technology—two industries with large, historical diversity gaps, including women and underrepresented populations. We’re far from perfect, but this is what we’re doing to embrace the International Women’s Day charge and work toward better gender balance at Betterment and in our world. Building Diversity And Inclusion At Betterment Change starts at the heart of the matter. For Betterment, this means working to build a company of passionate individuals who reflect our customers and bring new and different perspectives to our work. Our internal Diversity and Inclusion Committee holds regular meetings to discuss current events and topics, highlights recognition months (like Black History and Women’s History Months), and celebrates the many backgrounds and experiences of our employees. We’ve also developed a partnership with Peoplism. According to Caitlin Tudor-Savin, HR Business Partner, “This is more than a check-the-box activity, more than a one-off meeting with an attendance sheet. By partnering with Peoplism and building a long-term, action-oriented plan, we’re working to create real change in a sustainable fashion.” One next step we’re excited about is an examination of our mentorship program to make sure that everyone at Betterment has access to mentors. The big idea: By building empathy and connection among ourselves, we can create an inclusive environment that cultivates innovative ideas and a better product for our customers. Engaging The Tech Community At Large At Betterment, we’re working to create change in the tech industry and bring women into our space. By hosting meetups for Women Who Code, a non-profit organization that empowers women through technology, we’re working to engage this community directly. Rather than getting together to hear presentations, meetups are designed to have a group-led dynamic. Members break out and solve problems together, sharing and honing skills, while building community and support. This also fosters conversation, natural networking, and the chance for women to get their foot in the door. Jesse Harrelson, a Betterment Software Engineer, not only leads the meetups we host; they also found their path to Betterment through Women Who Code. “Consistency is key,” said Jesse. “Our Women Who Code meetups become a way to track your progression. It’s exciting to see how I’ve developed since I first started attending meetups, and how some of our long-time attendees have grown as engineers and as professionals.” Building A Community Of Our Own In 2018, our Women of Betterment group had an idea. They’d attended a number of networking and connection events, and the events never felt quite right. Too often, the events involved forced networking and stodgy PowerPoint presentations, with takeaways amounting to little more than a free glass of wine. Enter the SHARE (Support, Hire, Aspire, Relate, Empower) Series.
Co-founder Emily Knutsen wanted “to build a network of diverse individuals and foster deeper connections among women in our community.” Through the SHARE Series, we hope to empower future leaders in our industry to reach their goals and develop important professional connections. While the series focuses on programming for women and those who identify as women, it is inclusive of everyone in our community who wishes to be an ally and support our mission. We developed the SHARE Series to create an authentic and conversational environment, one where attendees help guide the conversations and future event themes. Meetings thus far have included a panel discussion on breaking into tech from the corporate world and a small-group financial discussion led by financial experts from Betterment and beyond. “We’re excited that organizations are already reaching out to collaborate,” Emily said. “We’ve gotten such an enthusiastic response about designing future events around issues that women (and everyone!) face, such as salary negotiations.” Getting Involved Want to join us as we work to build a more inclusive and dynamic community? Our next SHARE Series event features CBS News Business Analyst and CFP® professional Jill Schlesinger, as we celebrate her new book, The Dumb Things Smart People Do with Their Money: Thirteen Ways to Right Your Financial Wrongs. You can also register to attend our Women Who Code meetups, and join engineers from all over New York as we grow, solve, and connect with one another. -
CI/CD: Standardizing the Interface
CI/CD: Standardizing the Interface true Meet our CI/CD platform, Coach, and learn how we increased consistent adoption of Continuous Integration (CI) across our engineering organization. And why that's important. This is the second part of a series of posts about our new CI/CD platform, Coach. Part I explores several design choices we made in building out our notifications pipeline and describes how those choices are emblematic of our overarching engineering principles here at Betterment. Today I’d like to talk about how we increased consistent adoption of Continuous Integration (CI) across our engineering organization, and why. Our Principles in Action: Standardizing the Interface At Betterment, we want to empower our engineers to do their best work. CI plays an important role in all of our teams’ workflows. Over time, a handful of these teams formed deviating opinions on what kind of acceptance criteria they had for CI. While we love the concern that our engineers show toward solving these problems, these deviations became problematic for applications of the same runtime that should abide by the same set of rules; for example, all Ruby apps should run RSpec and Rubocop, not just some of them. In building a platform as a service (PaaS), we realized that in order to mitigate the problem of nurturing pets vs herding cattle we would need to identify a firm set of acceptance criteria for different runtimes. In the first post of this series we mention one of our principles, Standardize the Pipeline. In this post, we’ll explore that principle and dive into how we committed 5,000-line configuration files to our repositories with confidence by standardizing CI for different runtimes, automating configuration generation in code, and testing the process that generates that configuration. What’s so good about making everything the same? Our goals in standardizing the CI interface were to: Make it easier to distribute new CI features more quickly across the organization. Onboard new applications more quickly. Ensure the same set of acceptance criteria is in place for all codebases in the org. For example, by assuming that any Java library will run the PMD linter and unit tests in a certain way, we can bootstrap a new repository with very little effort. Allow folks outside of the SRE team to contribute to CI. In general, our CI platform categorizes projects into applications and libraries and divides those up further by language runtime. Combined, we call this a project_type. When we make improvements to one project type’s base configuration, we can flip a switch and turn it on for everyone in the org at once. This lets us distribute changes across the org quickly. How we managed to actually execute on this will become clearer in the next section, but for the sake of hand-wavy expediency, we have a way to run a few commands and distribute CI changes to every project in a matter of minutes. How did we do it? Because we use CircleCI for our CI pipelines, we knew we would have to define our workflows using their DSL inside a .circleci/config.yml file at the root of a project’s repository. With this blank slate in front of us we were able to iterate quickly by manually adding different jobs and steps to that file. We would receive immediate feedback in the CircleCI interface when those jobs ran, and this feedback loop helped us iterate even faster. Soon we were solving for our acceptance criteria requirements left and right — that Java app needs the PMD linter!
This Ruby app needs to run integration tests! And then we reached the point where manual changes were hindering our productivity. The .circleci/config.yml file was getting longer than a thousand lines fast, partly because we didn’t want to use any YAML shortcuts to hide away what was being run, and partly because there were no higher-level mechanisms available at the time for re-use when writing YAML (e.g. CircleCI’s orbs). Defining the system Our solution to this problem was to build a system, a Coach CLI for our Coach app, designed according to 12-factor CLI conventions. This system’s primary goal is to create .circleci/config.yml files for repositories to encapsulate the necessary configuration for a project’s CI pipeline. The CLI reads a small project-level configuration definition file (coach.yml) located in a project’s directory and extrapolates information to create the much larger repo-level CircleCI specific configuration file (.circleci/config.yml), which we were previously editing ourselves. To clarify the hierarchy of how we thought about CI, here are the high level terms and components of our Coach CLI system: There are projects. Each project needs a configuration definition file (coach.yml) that declares its project_type. We support wordpress_app, java_library, java_app, ruby_gem, ruby_app, and javascript_library for now. There are repos; each repo has one or more projects of any type. There needs to be a way to set up a new project. There needs to be a way to idempotently generate the CircleCI configuration (.circleci/config.yml) for all the projects in a repo at once. Each project needs to be built, tested, and linted. We realized that the dependency graph of repository → projects → project jobs was complicated enough that we would need to recreate the entire .circleci/config.yml file whenever we needed to update it, instead of just modifying the YAML file in place. This was one reason for automating the process, but the downsides of human-managed software were another. Manual updates to this file allow the configuration for infrequently-modified projects to drift. And leaving it up to engineers to own their own configuration lets folks modify the file in an unsupported way which could break their CI process. And then we’re back to square one. We decided to create that large file by ostensibly concatenating smaller components together. Each of those smaller components would be the output of specific functions, and each of those functions would be written in code and be tested. The end result was a lot of small files that look a little like this: https://gist.github.com/agirlnamedsophia/4b4a11acbe5a78022ecba62cb99aa85a Every time we make a change to the Coach CLI codebase we are confident that the thousands of lines of YAML that are idempotently generated as a result of the coach update ci command will work as expected because they’re already tested in isolation, in unit tests. We also have a few heftier integration tests to confirm our expectations. And no one needs to manually edit the .circleci/config.yml file again. Defining the Interface In order to generate the .circleci/config.yml that details which jobs to run and what code to execute we first needed to determine what our acceptance criteria were. For each project type we knew we would need to support: static code analysis, unit tests, integration tests, build steps, and test reports. We define the specific jobs a project will run during CI by looking at the project_type value inside a project’s coach.yml.
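As a rough illustration of that idea (the keys, values, and mapping below are hypothetical, not Coach's actual schema or internals), a small declarative coach.yml can be read in Ruby and used to decide which jobs to generate for a project:

require 'yaml'

# A toy coach.yml for a Ruby app; real projects would declare whatever
# characteristics their CI flow needs (parallelism, databases, etc.).
config = YAML.safe_load(<<~YML)
  name: coach
  project_type: ruby_app
  parallelism: 4
  databases:
    - postgresql
YML

# Map the declared project_type to the jobs its acceptance criteria require.
jobs =
  case config.fetch('project_type')
  when 'ruby_app' then %w[rspec rubocop brakeman]
  when 'java_app' then %w[unit_tests integration_tests pmd]
  else []
  end

puts "Generating #{jobs.join(', ')} jobs for #{config['name']} " \
     "with parallelism #{config['parallelism']}"

Because the generator is just Ruby, each of these pieces can be unit tested on its own before any YAML ever lands in .circleci/config.yml.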
If the value for project_type is ruby_app then the .circleci/config.yml generator will follow certain conventions for Ruby programs, like including a job to run tests with RSpec or including a job to run static analysis commands like Rubocop and Brakeman. For Java apps and libraries we run integration and unit tests by default as well as PMD as part of our static code analysis. Here’s an example configuration section for a single job, the linter job for our Coach repository: https://gist.github.com/agirlnamedsophia/4b4a11acbe5a78022ecba62cb99aa85a And here’s an example of the Ruby code that helps generate that result: https://gist.github.com/agirlnamedsophia/a96f3a79239988298207b7ec72e2ed04 For each job that is defined in the .circleci/config.yml file, according to the project type’s list of acceptance criteria, we include additional steps to handle notifications and test reporting. By knowing that the Coach app is a ruby_app we know how many jobs will need to be run and when. By writing that YAML inside of Ruby classes we can grow and expand our pipeline as needed, trusting that our tests confirm the YAML looks how we expect it to look. If our acceptance criteria change, because everything is written in code, adding a new job involves a simple code change and a few tests, and that’s it. We’ll go into contributing to our platform in more detail below. Onboarding a new project One of the main reasons for standardizing the interface and automating the configuration generation was to onboard new applications more quickly. To set up a new app all you need to do is be in the directory for your project and then run coach create project --type $project_type:

-> % coach create project --type ruby_app
'coach.yml' configuration file added -- update it based on your project's needs

When you run that, the CLI creates the small coach.yml configuration definition file discussed earlier. Here’s what an example Ruby app’s coach.yml looks like: https://gist.github.com/agirlnamedsophia/2f966ab69ba1c7895ce312aec511aa6b The CLI will refer back to a project’s coach.yml to decide what kind of CircleCI DSL needs to be written to the .circleci/config.yml file to wire up the right jobs to run at the right time. Though our contract with projects of different types is standardized, we permit some level of customization. The coach.yml file allows our users to define certain characteristics of their CI flow that vary and require more domain knowledge about a specific project: like the level of test parallelism their application test suite requires, or the list of databases required for tests to run, or an attribute composed of a matrix of Ruby versions and Gemfiles to run the whole test suite against. Using this declarative configuration is more extensible and more user friendly and doesn’t break the contract we’ve put in place for projects that use our CI platform. Contributing to CI Before, if you wanted to add an additional linter or CI tool to our pipeline, it would require adding a few lines of untested bash code to an existing Jenkins job, or adding a new job to a precarious graph of jobs, and crossing your fingers that it would “just work.” The addition couldn’t be tested and it was often only available to one project or one repository at a time. It couldn’t scale out to the rest of the org with ease. Now, updating CI requires opening a PR to make the change.
We encourage all engineers who want to add to their own CI pipeline to make changes on a branch from our Coach repository, where all the configuration generation magic happens, verify its effectiveness for their use-case, and open a pull request. If it’s a reasonable addition to CI, our thought is that everyone should benefit. By having these changes in version control, each addition to the CI pipeline goes through code review and requires tests be written. We therefore have the added benefit of knowing that updates to CI have been tested and are deemed valid and working before they’re distributed, and we can prevent folks from removing a feature without considering the impact it may have. When a PR is merged, our team takes care of redistributing the new version of the library so engineers can update their configuration. CI is now a mechanism for instantly sharing the benefits of discoveries made in isolated exploration with everyone. Putting it all together Our configuration generator is doing a lot more than just taping together jobs in a workflow — we evaluate dependency graphs and only run certain jobs that have upstream changes or are triggered themselves. We built our Coach CLI into the Docker images we use in CircleCI and so those Coach CLI commands are available to us from inside the .circleci/config.yml file. The CLI handles notifications, artifact generation, and deployment triggers. As we stated in our requirements for Coach in the first post, we believe there should be one way to test code, and one way to deploy it. To get there we had to make all of our Java apps respond to the same set of commands, and all of our Ruby apps do the same. Our CLI and the accompanying conventions make that possible. Where before it could take weeks of both product engineering and SRE time to set up CI for an application or service within a complex ecosystem of bash scripts and Jenkins jobs and application configuration, now it takes minutes. Where before it could take days or weeks to add a new step to a CI pipeline, now it takes hours of simple code review. We think engineers should focus on what they care about the most, shipping great features quickly and reliably. And we think we made it a little easier for them (and us) to do just that. What’s Next? Now that we’ve wrangled our CI process and encoded the best practices into a tool, we’re ready to tackle our Continuous Deployment pipeline. We’re excited to see how the model of projects and project types that we built for CI will evolve to help us templatize our Kubernetes deployments. Stay tuned.
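One detail worth picturing from the section above is the "only run jobs with upstream changes" behavior. Here is a very rough sketch of that idea (hypothetical paths and logic, not the actual Coach implementation), which simply compares the files changed on a branch against each project's directory:

# Determine which projects in a repo actually need CI for this change.
PROJECT_DIRS = %w[apps/web apps/api gems/coach_cli].freeze

changed_files = `git diff --name-only origin/master...HEAD`.split("\n")

changed_projects = PROJECT_DIRS.select do |dir|
  changed_files.any? { |path| path.start_with?("#{dir}/") }
end

puts "Projects needing CI: #{changed_projects.join(', ')}"
-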
CI/CD: Shortening the Feedback Loop
CI/CD: Shortening the Feedback Loop true As we improve and scale our CD platform, shortening the feedback loop with notifications was a small, effective, and important piece. Continuous Delivery (CD) at scale is hard to get right. At Betterment, we define CD as the process of making every small change to our system shippable as soon as it’s been built and tested. It’s part of the CI/CD (continuous integration and continuous delivery) process. We’ve been doing CD at Betterment for a long time, but it had grown to be quite a cumbersome process over the last few years because our infrastructure and tools hadn’t evolved to meet the needs of our growing engineering team. We reinvented our Site Reliability Engineering (SRE) team last fall with our sights set on building software to help developers move faster, be happier, and feel empowered. The focus of our work has been on delivering a platform as a service to make sense of the complex process of CD. Coach is the beginning of that platform. Think of something like Heroku, but for engineers here at Betterment. We wanted to build a thoughtfully composed platform based on the tried and true principles of 12-factor apps. In order to build this, we needed to do two overhauls: 1) Build a new CI pipeline and 2) Build a new CD pipeline. Continuous Integration — Our Principles For years, we used Jenkins, an open-source tool for automation, and a mess of scripts to provide CI/CD to our engineers. Jenkins is a powerful tool and well-used in the industry, but we decided to cut it because the way that we were using it was wrong, we weren’t pleased with its feature set, and there was too much technical debt to overcome. Tests were flaky and we didn’t know if it was our Jenkins setup, the tests themselves, or both. Dozens of engineers contribute to our biggest repository every day and as the code base and engineering team have grown, the complexity of our CI story has increased and our existing pipeline couldn’t keep up. There were task forces cobbled together to drive up reliability of the test suite, to stamp out flakes, to rewrite, and to refactor. This put a band-aid on the problem for a short while. It wasn’t enough. We decided to start fresh with CircleCI, an alternative to Jenkins that comes with a lot more opinions, far fewer rough edges, and a lot more stability built-in. We built a tool (Coach) to make the way that we build and test code conventional across all of our apps, regardless of language, application owner, or business unit. As an added bonus, since our CI process itself was defined in code, if we ever need to switch platforms again, it would be much easier. Coach was designed and built with these principles: Standardize the pipeline — there should be one way to test code, and one way to deploy it Test code often — code should be tested as often as it’s committed Build artifacts often — code should be built as often as it’s tested so that it can be deployed at any time Be environment agnostic — artifacts should be built in an environment-agnostic way with maximum portability Give consistent feedback — the CI output should be consistent no matter the language runtime Shorten the feedback loop — engineers should receive actionable feedback as soon as possible Standardizing CI was critical to our growth as an organization for a number of reasons.
It ensures that new features can be shipped more quickly, it allows new services to adopt our standardized CI strategy with ease, and it lets us recover faster in the face of disaster — a hurricane causing a power outage at one of our data centers. Our goal was to replace the old way of building and testing our applications (what we called the “Old World”) and start fresh with these principles in mind (what we deemed the “New World”). Using our new platform to build and test code would allow our engineers to receive automated feedback sooner so they could iterate faster. One of our primary aims in building this platform was to increase developer velocity, so we needed to eliminate any friction from commit to deploy. Friction here refers to ambiguity of CI results and the uncertainty of knowing where your code is in the CI/CD process. Shortening the feedback loop was one of the first steps we took in building out our new platform, and we’re excited to share the story of how we designed that solution. Our Principles in Action: Shortening the Feedback Loop The feedback loop in the Old World run by Jenkins was one of the biggest hurdles to overcome. Engineers never really knew where their code was in the pipeline. We use Slack, like a lot of other companies, so that part of the messaging story wouldn’t change, but there were bugs we needed to fix and design flaws we needed to update. How much feedback should we give? When do we want to give feedback? How detailed should our messages be? These were some of the questions we asked ourselves during this part of the design phase. What our Engineers Needed For pull requests, developers would commit code and push it up to GitHub and then eventually they would receive a Slack message that said “BAD” for every test suite that failed, or “GOOD” if everything passed, or nothing at all in the case of a Jenkins agent getting stuck and hanging forever. The notifications were slightly more nuanced than good/bad, but you get the idea. We valued sending Slack messages to our engineers, as that’s how the company communicates most effectively, but we didn’t like the rate of communication or the content of those messages. We knew both of those would need to change. As for merges into master, the way we sent Slack messages to communicate to engineering teams (as opposed to just individuals) was limited because of how our CI/CD process was constructed. The entire CI and CD process happened as a series of interwoven Jenkins freestyle jobs. We never got the logic quite right around determining whose code was being deployed — the deploy logic was contingent on a pretty rough shell script called “inside a Jenkins job.” The best we had was a Slack message that was sent roughly five minutes before a deploy began, tagging a good estimation of contributors but often missing someone if their Github email address was different from their Slack email address. More critically, the one-off script solution wasn’t stored in source control, therefore it wasn’t tested. We had no idea when it failed or missed tagging some contributors. We liked notifying engineers when a deploy began, but we needed to be more accurate about who we were notifying. What our SRE Team Needed Our design and UX was informed by what our engineers using our platform needed, but Coach was built based on our needs. What did we need? Well-tested code stored in version control that could easily be changed and developed. 
All of the code that handles changesets and messaging logic in the New World is written in one central location, and it’s tested in isolation. Our CI/CD process invokes this code when it needs to, and it works great. We can be confident that the right people are notified at the right time because we wrote code that does that and we tested it. It’s no longer just a script that sometimes works and sometimes doesn’t. Because it’s in source control and it runs through its own CI process, we can also easily roll out changes to notifications without breaking things. We wanted to build our platform around what our engineers would need to know, when they need to know it, and how often. And so one of the first components we built out was this new communication pipeline. Next we’ll explore in more detail some of our design choices regarding the content of our messages and the rate at which we send them. Make sure our engineers don’t mute their Slack notifications In leaving the Old World of inconsistent and contextually sparse communication we looked at our blank canvas and initially thought “every time the tests pass, send a notification! That will reduce friction!” So we tried that. If we merged code into a tracked branch — a branch that multiple engineers contribute to, like master — for one of our biggest repos, which contained 20 apps and 20 test suites, we would be notified at every transition: every Rubocop failure, every flaky occurrence of a feature test. We quickly realized it was too much. We sat back and thought really hard about what we would want, considering we were dogfooding our own pipeline. How often did we want to be notified by the notification system when our tests that tested the code that built the notification system succeeded? Sheesh, that’s a mouthful. Our Slack bot could barely keep up! We decided it was necessary to be told only once when everything ran successfully. However, for failures, we didn’t want to sit around for five minutes crossing our fingers hoping that everything was successful only to be told that we could have known three minutes earlier that we’d forgotten a newline at the end of one of our files. Additionally, in CircleCI where we can easily parallelize our test suites, we realized we wouldn’t want to notify someone for every chunk of the test suite that failed, just the first time a failure happened for the suite. We came up with a few rules to design this part of the system: let the author know as soon as possible when something is red, but don’t overdo it for redundant failures within the same job (e.g. if unit tests ran on 20 containers and 18 of them saw failures, only notify once); only notify once about all the green things; and give as much context as possible without being overwhelming: be concise but clear. Next we’ll explore the changes we made in content. What to say when things fail This is what engineers would see in the Old World when tests failed for an open pull request: Among other deficiencies, there’s only one link and it takes us to a Jenkins job. There’s no context to orient us quickly to what the notification is for. After considering what we were currently sending our engineers, we realized that 1) context and 2) status were the most important things to communicate, which were the aspects of our old messaging that were suffering the most. Here’s what we came up with: Thanks Coach bot! Right away we know what’s happened. A PR build failed.
It failed for a specific GitHub branch (“what-to-say-when-things-fail-branch”), in a specific repo (“Betterment/coach”), for a specific PR (#430), for a specific job in the test suite (“coach_cli — lint (Gemfile)”). We can click on any of these links and know exactly where they go based on the logo of the service. Messages about failures are now actionable and full of context, prompting the engineer to participate in CI, to go directly to their failures or to their PR. And this bounty of information helps a lot if the engineer has multiple PRs open and needs to quickly switch context. The messaging that happened for failures when you merged a pull request into master was a little different in that it included mentions for the relevant contributors (maybe all of them, if we were lucky!): The New World is cleaner, easier to grok, and more immediately helpful: The link title to GitHub is the commit diff itself, and it takes you to the compare URL for that changeset. The CircleCI info includes the title of the job that failed (“coach_cli — lint (Gemfile)”), the build number (“#11389”) to reference for context in case there are multiple occurrences of the failure in multiple workflows, a link to the top-level “Workflow”, and @s for each contributor. What to say when things succeed We didn’t change the frequency of messaging for success — we got that right the first time around. You got one notification message when everything succeeded and you still do. But in the Old World there wasn’t enough context to make the message immediately useful. Another disappointment we had with the old messaging was that it didn’t make us feel very good when our tests passed. It was just a moment in time that came and went: In the New World we wanted to proclaim loudly (or as loudly as you can proclaim in a Slack message) that the pull request was successful in CI: Tada! We did it! We wanted to maintain the same format as the new failure messages for consistency and ease of reading. The links to the various services we use are in the same order as our new failure messages, but the link to CircleCI only goes to the workflow that shows the graph of all the tests and jobs that ran. It’s delightful and easy to parse and has just the right amount of information. What’s next? We have big dreams for the future of this platform with more and more engineers using our product. Shortening the feedback loop with notifications is only one small, but rather important, part of our CD platform. In the next post of this series on CD, we’ll explore how we committed 5,000-line configuration files to our repositories with confidence by standardizing CI for different runtimes, automating config generation in code, and testing that code generation. We believe in a world where shipping code, even in really large codebases with lots of contributors, should be done dozens of times a day. Where engineers can experience feedback about their code with delight and simplicity. We’re building that at Betterment.
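To make the notification rules described above a bit more concrete, here is a rough sketch of the throttling idea (hypothetical class and method names, not Coach's actual implementation):

# Notify on the first failure within a job, and only once when everything
# in a workflow has gone green.
class BuildNotifier
  def initialize(slack)
    @slack = slack                      # any object that responds to #post
    @notified_failures = Hash.new(false)
  end

  # Called for each failed container/job result as it arrives.
  def record_failure(workflow_id, job_name)
    key = [workflow_id, job_name]
    return if @notified_failures[key]   # skip redundant failures in the same job

    @notified_failures[key] = true
    @slack.post("#{job_name} failed: check the workflow for details")
  end

  # Called once, after every job in the workflow has succeeded.
  def record_success(workflow_id)
    @slack.post("All green for workflow #{workflow_id}")
  end
end
-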
Shh… It’s a Secret: Managing Secrets at Betterment
Shh… It’s a Secret: Managing Secrets at Betterment true Opinionated secrets management that helps us sleep at night. Secrets management is one of those things that is talked about quite frequently, but there seems to be little consensus on how to actually go about it. In order to understand our journey, we first have to establish what secrets management means (and doesn’t mean) to us. What is Secrets Management? Secrets management is the process of ensuring passwords, API keys, certificates, etc. are kept secure at every stage of the software development lifecycle. Secrets management does NOT mean attempting to write our own crypto libraries or cipher algorithms. Rolling your own crypto isn’t a great idea. Suffice it to say, crypto will not be the focus of this post. There’s such a wide spectrum of secrets management implementations out there ranging from powerful solutions that require a significant amount of operational overhead, like HashiCorp Vault, to solutions that require little to no operational overhead, like a .env file. No matter where they fall on that spectrum, each of these solutions has tradeoffs in its approach. Understanding these tradeoffs is what helped our Engineering team at Betterment decide on a solution that made the most sense for our applications. In this post, we’ll be sharing that journey. How it used to work We started out using Ansible Vault. One thing we liked about Ansible Vault is that it allows you to encrypt a whole file or just a string. We valued the ability to encrypt just the secret values themselves and leave the variable name in plain-text. We believe this is important so that we can quickly tell which secrets an app is dependent on just by opening the file. So the string option was appealing to us, but that workflow didn’t have the best editing experience as it required multiple steps in order to encrypt a value, insert it into the correct file, and then export it into the environment like the 12-factor app methodology tells us we should. At the time, we also couldn’t find a way to federate permissions with Ansible Vault in a way that didn’t hinder our workflow by causing a bottleneck for developers. To assist us in expediting this workflow, we had an alias in our bash_profiles that allowed us to run a shortcut at the command line to encrypt the secret value from our clipboard and then insert that secret value in the appropriate Ansible variables file for the appropriate environment. alias prod-encrypt="pbpaste | ansible-vault encrypt_string --vault-password-file=~/ansible-vault/production.key" This wasn’t the worst setup, but it didn’t scale well as we grew. As we created more applications and hired more engineers, this workflow became a bit much for our small SRE team to manage and introduced some key-person risk, also known as the Bus Factor. We needed a workflow with less of a bottleneck, but allowing every developer access to all the secrets across the organization was not an acceptable answer. We needed a solution that not only maintained our security posture throughout the software development lifecycle, but also enforced our opinions about how secrets should be managed across environments. Decisions, decisions… While researching our options, we happened upon a tool called sops. Maintained and open-sourced by Mozilla, sops is a command line utility written in Go that facilitates slick encryption and decryption workflows by using your terminal’s default editor.
Sops encrypts and decrypts your secret values using your cloud provider’s Key Management Service (AWS KMS, GCP KMS, Azure Key Vault) and PGP as a backup in the event those services are not available. It leaves the variable name in plain-text while only encrypting the secret value itself and supports YAML, JSON, or binary format. We use the YAML format because of its readability and terseness. See a demo of how it works. We think this tool works well with the way we think about secrets management. Secrets are code. Code defines how your application behaves. Secrets also define how your application behaves. So if you can encrypt them safely, you can ship your secrets with your code and have a single change management workflow. GitHub pull request reviews do software change management right. YAML does human readable key/value storage right. AWS KMS does anchored encryption right. AWS Regions do resilience right. PGP does irreversible encryption better than anything else readily available and is broadly supported. In sops, we’ve found a tool that combines all of these things enabling a workflow that makes secrets management easier. Who’s allowed to do what? Sops is a great tool by itself, but operations security is hard. Key handling and authorization policy design is tricky to get right and sops doesn’t do it all for us. To help us with that, we took things a step further and wrote a wrapper around sops we call sopsorific. Sopsorific, also written in Go, makes a few assumptions about application environments. Most teams need to deploy to multiple environments: production, staging, feature branches, sales demos, etc. Sopsorific uses the term “ecosystem” to describe this concept, as well as collectively describe a suite of apps that make up a working Betterment system. Some ecosystems are ephemeral and some are durable, but there is only one true production ecosystem holding sensitive PII (Personally Identifiable Information) and that ecosystem must be held to a higher standard of access control than all others. To capture that idea, we introduced a concept we call “security zones” into sopsorific. There are only two security zones per GitHub repository — sensitive, and non-sensitive — even if there are multiple apps in a repository. In the case of mono-repos, if an app in that repository shouldn’t have its secrets visible to all engineers who work in that repository, then the app belongs in a different repository. With sopsorific, secrets for the non-sensitive zone can be made accessible to a broader subset of the app team than sensitive zone secrets, helping to eliminate some of the bottleneck issues we experienced with our previous workflow. By default, sopsorific wants to be configured with a production (sensitive zone) secrets file and a default (non-sensitive zone) secrets file. The default file makes it easy to spin up new non-sensitive one-off ecosystems without having to redefine every secret in every ecosystem. It should “just work” unless there are secrets that have different values than already configured in the default file. In that case, we would just need to define the secrets that have different values in a separate secrets file like devin_test.yml below, where devin_test is the name of the ecosystem.
Here’s an example of the basic directory structure:

.sops.yaml
app/
|_ deployment_secrets/
   |_ sensitive/
      |_ production.yml
   |_ nonsensitive/
      |_ default.yml
      |_ devin_test.yml

The security zone concept allows a more granular access control policy as we can federate decrypt permissions on a per application and per security zone basis by granting or revoking access to KMS keys with AWS Identity and Access Management (IAM) roles. Sopsorific bootstraps these KMS keys and IAM roles for a given application. It generates a secret-editor role that privileged humans can assume to manage the secrets and an application role for the application to assume at runtime to decrypt the secrets. Following the principle of least privilege, our engineering team leads are app owners of the specific applications they maintain. App owners have permissions to assume the secret-editor role for sensitive ecosystems of their specific application. Non app owners have the ability to assume the secret-editor role for non-sensitive ecosystems only. How it works now Now that we know who can do what, let’s talk about how they can do what they can do. Explaining how we use sopsorific is best done by exploring how our secrets management workflow plays out for each stage of the software development lifecycle. Development Engineers have permissions to assume the secret-editor role for the security zones they have access to. Secret-editor roles are named after their corresponding IAM role which includes the security zone and the name of the GitHub repository. For example, secret_editor_sensitive_coach, where coach is the name of the repository. We use a little command line utility to assume the role and are dropped into a secret-editor session where we use sops to add or edit secrets in our editor in the same way we add or edit code in a feature branch. (Screenshot: assuming a secret-editor role.) The sops command will open and decrypt the secrets in the editor and, if changed, encrypt them and save them back to the file’s original location. All of these steps, apart from the editing, are transparent to the engineer editing the secret. Any changes are then reviewed in a pull request along with the rest of the code. Editing a file is as simple as:

sops deployment_secrets/sensitive/production.yml

Testing We built a series of validations into sopsorific to further enforce our opinions about secrets management. Some of these are: Secrets are unguessable — Short strings like “password” are not really secrets and this check enforces strings that have at least 128 bits of entropy, expressed in unpadded base64. Each ecosystem defines a comprehensive set of secrets — The 12-factor app methodology reminds us that all environments should resemble production as closely as possible. When a secret is added to production, we have a check that makes sure that same secret is also added to all other ecosystems so that they continue to function properly. All crypto keys match — There are checks to ensure the multi-region KMS key ARNs and backup PGP key fingerprint in the sops config file match the intended security zones. These validations are run as a step in our Continuous Integration suite. Running these checks is a completely offline operation and doesn’t require access to the KMS keys making it trivially secure.
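Sopsorific itself is written in Go, but the “secrets are unguessable” check is easy to picture with a few lines of Ruby (an illustration of the idea only, not sopsorific's actual code):

require 'base64'
require 'securerandom'

# 128 bits of entropy is 16 random bytes, which is 22 characters of
# unpadded base64 (every 3 bytes encode to 4 characters).
def unguessable?(value)
  value.is_a?(String) &&
    value.match?(%r{\A[A-Za-z0-9+/]{22,}\z}) &&
    Base64.decode64(value).bytesize >= 16
end

unguessable?('password')
# => false
unguessable?(Base64.strict_encode64(SecureRandom.random_bytes(16)).delete('='))
# => true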
Developers can also run these validations locally:

sopsorific check

Deployment The application server is configured with the instance profile generated by sopsorific so that it can assume the IAM role that it needs to decrypt the secrets at runtime. Then, we configure our init system, upstart, to execute the process wrapped in the sopsorific run command. sopsorific run is another custom command we built to make our usage of sops seamless. When the app starts up, the decrypted secrets will be available as environment variables only to the process running the application instead of being available system wide. This makes our secrets less likely to unintentionally leak and our security team a little happier. Here’s a simplified version of our upstart configuration:

start on starting web-app
stop on stopping web-app
respawn
exec su -s /bin/bash -l -c '\
  cd /var/www/web-app; \
  exec "$0" "$@"' web-app-owner -- sopsorific run 'bundle exec puma -C config/puma.rb' >> /var/log/upstart.log 2>&1

Operations The 12-factor app methodology reminds us that sometimes developers need to be able to run one-off admin tasks by starting up a console on a live running server. This can be accomplished by establishing a secure session on the server and running what you would normally run to get a console, wrapped in the sopsorific run command. For our Ruby on Rails apps, that looks like this:

sopsorific run 'bundle exec rails c'

What did we learn? We learned many things along the way. One of those things was that having an opinionated tool to help us manage secrets made sure we didn’t accidentally leave around low-entropy secrets from when we were developing or testing out a feature. Having a tool to protect ourselves from ourselves is vital to our workflow. Another thing we learned was that some vendors provide secrets with lower entropy than we’d like for API tokens or access keys and they don’t provide the option to choose stronger secrets. As a result, we had to build features into sopsorific so that vendor-provided secrets that don’t meet its standards by default can still be explicitly accepted by its checks. In the process of adopting sops and building sopsorific, we discovered the welcoming community and thoughtful maintainers of sops. We had the pleasure of contributing a few changes to sops, and that left us feeling like we left the community a little bit better than we found it. In doing all of these things, we’ve reduced bottlenecks for developers so they can focus more on shipping features and less on managing secrets.
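The behavior of sopsorific run, scoping decrypted secrets to just the application process, can be pictured with a tiny Ruby sketch (the secret source and values are hypothetical; the real tool is written in Go and decrypts via sops):

# Decrypt secrets, then replace the current process with the app, passing the
# secrets as environment variables visible only to that process.
secrets = { 'DATABASE_PASSWORD' => 'decrypted-by-sops' } # stand-in for the sops step
exec(secrets, 'bundle', 'exec', 'puma', '-C', 'config/puma.rb')
-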
How We Develop Design Components in Rails
How We Develop Design Components in Rails true Learn how we use Rails components to keep our code D.R.Y. (Don’t Repeat Yourself) and to implement UX design changes effectively and uniformly. A little over a year ago, we rebranded our entire site, and we've even written about why we did it. We were able to achieve a polished and consistent visual identity under a tight deadline which was pretty great, but when we had our project retrospective, we realized there was a pain point that still loomed over us. We still lacked a good way to share markup across all our apps. We repeated multiple styles and page elements throughout the app to make the experience consistent, but we didn’t have a great way to reuse the common elements. We used Rails partials in an effort to keep the code DRY (Don’t Repeat Yourself) while sharing the same chunks of code and that got us pretty far, but it had its limitations. There were aspects of the page elements (our shared chunks) that needed to change based on their context or the page where they were being rendered. Since these contexts change, we found ourselves either altering the partials or copying and pasting their code into new views where additional context-specific code could be added. This resulted in app code (the content-specific code) becoming entangled with “system” (the base HTML) code. Aside from partials, there was corresponding styling, or CSS, that was being copied and sometimes changed when these shared partials were altered. This meant when the designs were changed, we needed to find all of the places this code was used to update it. Not only was this frustrating, but it was inefficient. To find a solution, we drew inspiration from the component approach used by modern design systems and JavaScript frameworks. A component is a reusable code building block. Pages are built from a collection of components that are shared across pages, but can be expanded upon or manipulated in the context of the page they’re on. To implement our component system, we created our internal gem, Style Closet. This system brings a few other advantages and solves a few other problems, too: We’re able to make global changes in a pretty painless way. If we need to change our brand colors, let’s say, we can just change the CSS in Style Closet instead of scraping our codebase and making sure we catch it everywhere. Reusable parts of code remove the burden from engineers for things like CSS and allow time to focus on and tackle other problems. Engineers and designers can be confident they’re using something that’s been tested and validated across browsers. We’re able to write tests specific to the component without worrying about the use-case or increasing testing time for our apps. Every component is on brand and consistent with every other app, feels polished and high quality, and requires less effort to implement. It allows room for future growth which will inevitably happen. The need for new elements in our views is not going to simply vanish because we rebranded, so this makes us more prepared for the future. How does it work? Below is an example of one of our components, the flash. A flash message/warning is something you may use throughout your app in different colors and with different text, but you want it to look consistent. In our view, or the page where we write our HTML, we would write the following to render what you see above: Here’s a breakdown of how that one line translates into what you see on the page.
The component consists of 3 parts: structure, behavior and appearance. The view (the structure): a familiar html.erb file that looks very similar to what would exist without a component but is a little more flexible since it doesn’t have its content hard-coded in. These views can also leverage Rails’ view yield functionality when needed. Here’s the view partial from Style Closet: You can see how the component.message is passed into the dedicated space/slot keeping this code flexible for reuse. A Ruby class (the behavior aside from any JavaScript): the class holds the “props” the component allows to be passed in as well as any methods needed for the view, similar to a presenter model. The props are a fancier attr_accessor with the bonus of being able to assign defaults. Additionally, all components can take a block, which is typically the content for the component. This allows the view to be reusable. CSS (the appearance): In this example, we use it to set things like the color, alignment and the border. A note on behavior: Currently, if we need to add some JS behavior, we use unobtrusive JavaScript or UJS sprinkles. When we add new components or make changes, we update the gem (as well as the docs site associated with Style Closet) and simply release the new version. As we develop and experiment with new types of components, we test these bigger changes out in the real world by putting them behind a feature flag using our open source split testing framework, Test Track. What does the future hold? We’ve used UJS sprinkles in similar fashion to the rest of the Rails world over the years, but that has its limitations as we begin to design more complex behaviors and elements of our apps. Currently we’re focusing on building more intricate and interactive components using React. A bonus of Style Closet is how well it’s able to host these React components since they can simply be incorporated into a view by being wrapped in a Style Closet component. This allows us to continue composing a UI with self contained building blocks. We’re always iterating on our solutions, so if you’re interested in expanding on or solving these types of problems with us, check out our career page! Additional information Since we introduced our internal Rails component code, a fantastic open-source project emerged, Komponent, as well as a really great and in-depth blog post on component systems in Rails from Evil Martians.
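To make the three-part structure described above a little more concrete, here is a rough sketch of what a component's Ruby class and its usage might look like (hypothetical names, not Style Closet's actual API):

# The behavior: a small class that holds the component's "props" and any
# view logic, similar to a presenter.
class FlashComponent
  attr_accessor :type, :message

  def initialize(type: :notice, message: nil)
    @type = type
    @message = message
  end

  # Used by the html.erb partial to pick a modifier class for the CSS.
  def css_class
    "flash flash--#{type}"
  end
end

# In a view, rendering might look something like this, with the block
# providing the content for the component's dedicated slot:
#
#   <%= render 'style_closet/flash', component: FlashComponent.new(type: :warning) do %>
#     Your session is about to expire.
#   <% end %>
-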
Reflecting on Our Engineering Apprenticeship Program
Reflecting on Our Engineering Apprenticeship Program true Betterment piloted an Apprentice Program to add junior talent to our engineering organization in 2017, and it couldn’t have been more successful or rewarding for all of us. One year later, we’ve asked them to reflect on their experiences. In Spring of 2017, Betterment’s Diversity & Inclusion Steering Committee partnered with our Engineering Team to bring on two developers with non-traditional backgrounds. We hired Jesse Harrelson (Betterment for Advisors Team) and Fidel Severino (Retail Team) for a 90 day Apprentice Program. Following their apprenticeship, they joined us as full-time Junior Engineers. I’m Jesse, a recruiter here at Betterment, and I had the immense pleasure of working closely with these two. It’s been an incredible journey, so I sat down with them to hear first hand about their experiences. Tell us a bit about your life before Betterment. Jesse Harrelson: I was born and raised in Wyoming and spent a lot of time exploring the outdoors. I moved to Nashville to study songwriting and music business, and started a small label through which I released my band’s album. I moved to New York after getting an opportunity at Sony and worked for a year producing video content. Fidel Severino: I’m originally from the Dominican Republic and moved to the United States at age 15. After graduation from Manhattan Center for Science and Mathematics High School, I completed a semester at Lehman College before unfortunate family circumstances required me to go back to the Dominican Republic. When I returned to the United States, I worked in the retail sector for a few years. While working, I would take any available time for courses on websites like Codecademy and Team Treehouse. Can we talk about why you decided to become an Engineer? Jesse Harrelson: Coding became a hobby for me when I would make websites for my bands in Nashville, but after meeting up with more and more people in tech in the city, I knew it was something I wanted to do as a career. I found coding super similar from a composition and structure perspective, which allowed me to tap into the creative side of coding. I started applying to every bootcamp scholarship I could find and received a full scholarship to Flatiron School. I made the jump to start becoming an engineer. Fidel Severino: While working, I would take any available time for courses on websites like Codecademy and Team Treehouse. I have always been interested in technology. I was one of those kids who “broke” their toys in order to find out how they worked. I’ve always had a curious mind. My interactions with technology prior to learning about programming had always been as a consumer. I cherished the opportunity and the challenge that comes with building with code. The feeling of solving a bug you’ve been stuck on for a while is satisfaction at its best. Those bootcamps changed all of our lives! You learned how to be talented, dynamic engineers and we reap the benefit. Let’s talk about why you chose Betterment. Jesse Harrelson: I first heard of Betterment by attending the Women Who Code — Algorithms meetup hosted at HQ. Paddy, who hosts the meetups, let us know that Betterment was launching an apprenticeship program and after the meetup I asked how I could get involved and applied for the program. I was also applying for another different apprenticeship program but throughout the transparent, straightforward interview process, the Betterment apprenticeship quickly became my first choice. 
Fidel Severino: The opportunity to join Betterment’s Apprenticeship program came via the Flatiron School. One of the main reasons I was ecstatic to join Betterment was how I felt throughout the recruiting process. At no point did I feel the pressure that’s normally associated with landing a job. Keep in mind, this was an opportunity unlike any other I had up to this point in my life, but once I got to talking with the interviewers, the conversation just flowed. The way the final interview was set up made me rave about it to pretty much everyone I knew. Here was a company that wasn’t solely focused on the traditional Computer Science education when hiring an apprentice/junior engineer. The interview was centered around how well you communicate, work with others, and problem solve. I had a blast pair programming with 3 engineers who, I’m glad to say, are now my co-workers! We are so lucky to have you! What would you say has been the most rewarding part of your experience so far? Jesse Harrelson: The direct mentorship during my apprenticeship and exposure to a large production codebase. Prior to Betterment, I only had experience with super small codebases that I built myself or with friends. Working with Betterment’s applications gave me a hands-on understanding of concepts that are hard to reproduce on a smaller, personal application level. Being surrounded by a bunch of smart, helpful people has also been super amazing and helped me grow as an engineer. Fidel Severino: Oh man! There’s so many things I would love to list here. However, you asked for the most rewarding, and I would have to say without a doubt — the mentorship. As someone with only self-taught and Bootcamp experience, I didn’t know how much I didn’t know. I had two exceptional mentors who went above and beyond and removed any blocks preventing me from accomplishing tasks. On a related note, the entire company has a collaborative culture that is contagious. You want to help others whenever you can, and I’ve received plenty of help from others who aren’t even directly on my team. What’s kept you here? Fidel Severino: The people. The collaborative environment. The culture of learning. The unlimited supply of iced coffee. Great office dogs. All of the above! Jesse Harrelson: Seriously though, it was the combination of all that plus so many other things. Getting to work with talented, smart people who want to make a difference. This article is part of Engineering at Betterment. -
Building Better Software Faster with Shared Principles
Building Better Software Faster with Shared Principles true Betterment’s playbook for extending the golden hour of startup innovation at scale. Betterment’s promise to customers rests on our ability to execute. To fulfill that promise, we need to deliver the best product and tools available and then improve them indefinitely, which, when you think about it, sounds incredibly ambitious or even foolhardy. For a problem space as large as ours, we can’t fulfill that promise with a single two pizza team. But a scaled engineering org presents other challenges that could just as easily put the goal out of reach. Centralizing architectural decision-making would kill ownership and autonomy, and ensure your best people leave or never join in the first place. On the other hand, shared-nothing teams can lead to information silos, wheel-reinventing, and integration nightmares when an initiative is too big for a squad to deliver alone. To meet those challenges, we believe it’s essential to share more than languages, libraries, and context-free best practices. We can collectively build and share a body of interrelated principles driven by insights that our industry as a whole hasn’t yet realized or is just beginning to understand. Those principles can form chains of reasoning that allow us to run fearlessly, in parallel, and arrive at coherent solutions better than the sum of their parts. I gave a talk about Betterment’s engineering principles at a Rails at Scale meetup earlier last year and promised to share them after our diligent legal team finished reviewing. (Legal helpfully reviewed these principles months ago, but then I had my first child, and, as you can imagine, priorities shifted.) Without any further ado, here are Betterment’s Engineering Principles. You can also watch my Rails at Scale talk to learn why we developed them and how we maintain them. Parting Thoughts on Our Principles Our principles aren’t permanent as-written. Our principles are a living document in an actual git repository that we’ll continue to add to and revise as we learn and grow. Our principles derive from and are matched to Betterment’s collective experience and context. We don’t expect these principles to appeal to everybody. But we do believe strongly that there’s more to agree about than our industry has been able to establish so far. Consider these principles, along with our current and future open source work, part of our contribution to that conversation. What are the principles that your team share? -
Supporting Face ID on the iPhone X
Supporting Face ID on the iPhone X true We look at how Betterment's mobile engineering team developed Face ID for the latest phones, like iPhone X. Helping people do what’s best with their money requires providing them with responsible security measures to protect their private financial data. In Betterment’s mobile apps, this means including trustworthy but convenient local authentication options for resuming active login sessions. Three years ago, in 2014, we implemented Touch ID support as an alternative to using PIN entry in our iOS app. Today, on its first day, we’re thrilled to announce that the Betterment iOS app fully supports Apple’s new Face ID technology on the iPhone X. Trusting the Secure Enclave While we’re certainly proud of shipping this feature quickly, a lot of credit is due to Apple for how seriously the company takes device security and data privacy as a whole. The hardware feature of the Secure Enclave included on iPhones since the 5S makes for a readily trustworthy connection to the device and its operating system. From an application’s perspective, this relationship between a biometric scanner and the Secure Enclave is simplified to a boolean response. When requested through the Local Authentication framework, the biometry evaluation either succeeds or fails separate from any given state of an application. The “reply” completion closure of evaluatePolicy(_:localizedReason:reply:) This made testing from the iOS Simulator a viable option for gaining a reasonable degree of certainty that our application would behave as expected when running on a device, thus allowing us to prepare a build in advance of having a device to test on. LABiometryType Since we’ve been securely using Touch ID for years, adapting our existing implementation to include Face ID was a relatively minor change. Thanks primarily to the simple addition of the LABiometryType enum newly available in iOS 11, it’s easy for our application to determine which biometry feature, if any, is available on a given device. This is such a minor change, in fact, that we were able to reuse all of our same view controllers that we had built for Touch ID with only a handful of string values that are now determined at runtime. One challenge we have that most existing iOS apps share is the need to still support older iOS versions. For this reason, we chose to wrap LABiometryType behind our own BiometryType enum. This allows us to encapsulate both the need to use an iOS 11 compiler flag and the need to call canEvaluatePolicy(_:error:) on an instance of LAContext before accessing its biometryType property into a single calculated property: See the Gist. NSFaceIDUsageDescription The other difference with Face ID is the new NSFaceIDUsageDescription privacy string that should be included in the application’s Info.plist file. This is a departure from Touch ID which does not require a separate privacy permission, and which uses the localizedReason string parameter when showing its evaluation prompt. Touch ID evaluation prompt displaying the localized reason While Face ID does not seem to make use of that localizedReason string during evaluation, without the privacy string the iPhone X will run the application’s Local Authentication feature in compatibility mode. This informs the user that the application should work with Face ID but may do so imperfectly.
Face ID permissions prompt without (left) and with (right) an NSFaceIDUsageDescription string included in the Info.plist. This compatibility mode prompt is undesirable enough on its own, but it also clued us in to the need to check for potential security concerns opened up by this forwards-compatibility-by-default from Apple. Thankfully, the changes to the Local Authentication framework were done in such a way that we determined there wasn’t a security risk, but it did leave a problematic user experience: a potentially inescapable screen when selecting “Don’t Allow” on the privacy permission prompt. Since we believe strongly in our users’ right to say “no”, resolving this design issue was the primary reason we prioritized shipping this update. Ship It If your mobile iOS app also displays sensitive information and uses Touch ID for biometry-based local authentication, join us in making the easy adaptation to delight your users with full support for Face ID on the iPhone X. -
From 1 to N: Distributed Data Processing with Airflow
From 1 to N: Distributed Data Processing with Airflow Betterment has built a highly available data processing platform to power new product features and backend processing needs using Airflow. Betterment’s data platform is unique in that it not only supports offline needs such as analytics, but also powers our consumer-facing product. Features such as Time Weighted Returns and Betterment for Business balances rely on our data platform working throughout the day. Additionally, we have regulatory obligations to report complex data to third parties daily, making data engineering a mission-critical part of what we do at Betterment. We originally ran our data platform on a single machine in 2015 when we ingested far less data with fewer consumer-facing requirements. However, recent customer and data growth, coupled with new business requirements, now requires us to scale horizontally with high availability. Transitioning from Luigi to Airflow Our single-server approach used Luigi, a Python module created to orchestrate long-running batch jobs with dependencies. While we could achieve high availability with Luigi, it’s now 2017 and the data engineering landscape has shifted. We turned to Airflow because it has emerged as a full-featured workflow management framework better suited to orchestrate frequent tasks throughout the day. To migrate to Airflow, we’re deprecating our Luigi solution on two fronts: cross-database replication and task orchestration. We’re using Amazon’s Database Migration Service (DMS) to replace our Luigi-implemented replication solution and re-building all other Luigi workflows in Airflow. We’ll dive into each of these pieces below to explain how Airflow mediated this transition. Cross-Database Replication with DMS We used Luigi to extract and load source data from multiple internal databases into our Redshift data warehouse on an ongoing basis. We recently adopted Amazon’s DMS for continuous cross-database replication to Redshift, moving away from our internally-built solution. The only downside of DMS is that we are not aware of how recent the source data in Redshift is. For example, a task computing all of a prior day’s activity executed at midnight would be inaccurate if Redshift were missing data from DMS at midnight due to lag. In Luigi, we knew when the data was pulled and only then would we trigger a task. However, in Airflow we reversed our thinking to embrace DMS, using Airflow’s sensor operators to wait for rows to be pushed from DMS before carrying on with dependent tasks. High Availability in Airflow While Airflow doesn’t claim to be highly available out of the box, we built an infrastructure to get as close as possible. We’re running Airflow’s database on Amazon’s Relational Database Service and using Amazon’s Elasticache for Redis queuing. Both of these solutions come with high availability and automatic failover as add-ons that Amazon provides. Additionally, we always deploy multiple baseline Airflow workers so that if one fails, we can use automated deploys to stand up any part of the Airflow cluster on new hardware. There is still one single point of failure left in our Airflow architecture, though: the scheduler. While we may implement a hot-standby backup in the future, we simply accept it as a known risk and set our monitoring system to notify a team member of any deviations.
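Returning to the replication-lag problem above, here is a minimal sketch of the sensor pattern: a DAG that waits for DMS-replicated rows to land in Redshift before computing the prior day’s activity. It is an illustration only, not Betterment’s code; the import paths assume a recent Airflow 2.x installation, and the connection ID, table, and watermark column names are hypothetical.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.sensors.base import BaseSensorOperator


class RedshiftRowsSensor(BaseSensorOperator):
    """Succeeds once the replicated table has rows at or beyond the data interval end."""

    def __init__(self, table, watermark_column, **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.watermark_column = watermark_column

    def poke(self, context):
        # Redshift speaks the Postgres wire protocol, so a Postgres hook suffices for this sketch.
        hook = PostgresHook(postgres_conn_id="redshift")
        latest = hook.get_first(
            f"SELECT MAX({self.watermark_column}) FROM {self.table}"
        )[0]
        # Assumes the watermark column is timezone-aware.
        return latest is not None and latest >= context["data_interval_end"]


with DAG(
    dag_id="daily_activity_report",
    start_date=datetime(2017, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    wait_for_replication = RedshiftRowsSensor(
        task_id="wait_for_dms_rows",
        table="transactions",
        watermark_column="replicated_at",
        poke_interval=60,
        timeout=60 * 60,
        mode="reschedule",  # release the worker slot between pokes
    )

    compute_report = PythonOperator(
        task_id="compute_prior_day_activity",
        python_callable=lambda: None,  # placeholder for the real computation
    )

    wait_for_replication >> compute_report

The dependent task only runs once the sensor confirms the replicated data has caught up, which is the inversion of control described above.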
Cost-Effective Scalability Since our processing needs fluctuate throughout the day, we were paying for computing power we didn’t actually need during non-peak times on a single machine, as our Luigi server’s load showed. Distributed workers used with Amazon’s Auto Scaling Groups allow us to automatically add and remove workers based on outstanding tasks in our queues. Effectively, this means maintaining only a baseline level of workers throughout the day and scaling up during peaks when our workload increases. Airflow queues allow us to designate certain tasks to run on particular hardware (e.g., CPU-optimized) to further reduce costs. We found just a few hardware-type queues to be effective. For instance, tasks that saturate CPU are best run on a compute-optimized worker with concurrency set to the number of cores. Non-CPU-intensive tasks (e.g., polling a database) can run with higher concurrency per CPU core to save overall resources. Extending Airflow Code Airflow tasks that pass data to each other can run on different machines, presenting a new challenge versus running everything on a single machine. For example, one Airflow task may write a file, and a subsequent task running on another machine may need to email that file. To implement this pattern, we use Amazon S3 as a persistent storage tier. Fortunately, Airflow already maintains a wide selection of hooks to work with remote sources such as S3. While S3 is great for production, it’s a little difficult to work with in development and testing where we prefer to use the local filesystem. We implemented a “local fallback” mixin for Airflow-maintained hooks that uses the local filesystem for development and testing, deferring to the actual hook’s remote functionality only in production. Development & Deployment We mimic our production cluster as closely as possible for development & testing to identify any issues that may arise with multiple workers. This is why we adopted Docker to run a production-like Airflow cluster from the ground up on our development machines. We use containers to simulate multiple physical worker machines that connect to officially maintained local Redis and PostgreSQL containers. Development and testing also require us to stand up the Airflow database with predefined objects such as connections and pools for the code under test to function properly. To solve this programmatically, we adopted Alembic database migrations to manage these objects through code, allowing us to keep our development, testing, and production Airflow databases consistent. Graceful Worker Shutdown Upon each deploy, we use Ansible to launch new worker instances and terminate existing workers. But what happens when our workers are busy with other work during a deploy? We don’t want to terminate workers while they’re finishing something up and instead want them to terminate after the work is done (not accepting new work in the interim). Fortunately, Celery supports this shutdown behavior and will stop accepting new work after receiving an initial TERM signal, letting old work finish up. We use Upstart to define all Airflow services and simply wrap the TERM behavior in our worker’s post-stop script, sending the TERM signal first, waiting until we see the Celery process stop, and then finally powering off the machine. Conclusion The path to building a highly available data processing service was not straightforward, requiring us to build a few specific but critical additions to Airflow.
Investing the time to run Airflow as a cluster versus a single machine allows us to run work in a more elastic manner, saving costs and using optimized hardware for particular jobs. Implementing a local fallback for remote hooks made our code much more testable and easier to work with locally, while still allowing us to run with Airflow-maintained functionality in production. While migrating from Luigi to Airflow is not yet complete, Airflow has already offered us a solid foundation. We look forward to continuing to build upon Airflow and contributing back to the community. -
A Functional Approach to Penny-Precise Allocation
A Functional Approach to Penny-Precise Allocation How we solved the problem of allocating a sum of money proportionally across multiple buckets by leaning on functional programming. An easy trap to fall into as an object-oriented developer is to get too caught up in the idea that everything has to be an object. I work in Ruby, for example, where the first thing you learn is that everything is an object. Some problems, however, are better solved by taking a functional approach. For instance, at Betterment, we faced the challenge of allocating a sum of money proportionally across multiple buckets. In this post, I’ll share how we solved the problem by leaning on functional programming to allocate money precisely across proportional buckets. The Problem Proportional allocation comes up often throughout our codebase, but it’s easiest to explain using a fictional example: Suppose your paychecks are $1000 each, and you always allocate them to your different savings accounts as follows: College savings fund: $310 Buy a car fund: $350 Buy a house fund: $200 Emergency fund: $140 Now suppose you’re an awesome employee and received a bonus of $1234.56. You want to allocate your bonus proportionally in the same way you allocate your regular paychecks. How much money do you put in each account? You may be thinking, isn’t this a simple math problem? Let’s say it is. To get each amount, take the ratio of the contribution from your normal paycheck to the total of your normal paycheck, and multiply that by your bonus. So, your college savings fund would get: (310/1000)*1234.56 = 382.7136 We can do the same for your other three accounts, but you may have noticed a problem. We can’t split a penny into fractions, so we can’t give your college savings fund the exact proportional amount. More generally, how do we take an inflow of money and allocate it to weighted buckets in a fair, penny-precise way? The Mathematical Solution: Integer Allocation We chose to tackle the problem by working with integers instead of decimal numbers in order to avoid rounding. This is easy to do with money — we can just work in cents instead of dollars. Next, we settled on an algorithm which pays out buckets fairly, and guarantees that the total payments exactly sum to the desired payout. This algorithm is called the Largest Remainder Method. Multiply the inflow (or the payout in the example above) by each weight (where the weights are the integer amounts of the buckets, so the contributions to each account in our example above), and divide each of these products by the sum of the bucket weights, finding the integer quotient and integer remainder. Find the number of pennies that will be left over to allocate by taking the inflow minus the total of the integer quotients. Sort the remainders in descending order and allocate any leftover pennies to the buckets in this order. The idea here is that the quotients represent the amounts we should give each bucket aside from the leftover pennies. Then we figure out which bucket deserves the leftover pennies. Let’s walk through this process for our example: Remember that we’re working in cents, so our inflow is 123456 and we need to allocate it across bucket weights of [31000, 35000, 20000, 14000]. We find each integer quotient and remainder by multiplying the inflow by the weight and dividing by the total weight.
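Before continuing the walkthrough, here is the whole algorithm as an illustrative function, written in Python rather than the Ruby used in our codebase; it is a sketch of the Largest Remainder Method, not the Allocator module described later.

def allocate(inflow, weights):
    """Split an integer inflow (in cents) across integer weights, penny-precisely."""
    total_weight = sum(weights)
    quotients_and_remainders = [divmod(inflow * weight, total_weight) for weight in weights]
    allocations = [quotient for quotient, _ in quotients_and_remainders]
    leftover_pennies = inflow - sum(allocations)
    # Buckets with the largest remainders are the most deserving of the leftover pennies.
    most_deserving = sorted(
        range(len(weights)),
        key=lambda i: quotients_and_remainders[i][1],
        reverse=True,
    )
    for i in most_deserving[:leftover_pennies]:
        allocations[i] += 1
    return allocations


print(allocate(123456, [31000, 35000, 20000, 14000]))  # [38271, 43210, 24691, 17284]

Note that Python's sorted is stable, so ties between remainders resolve deterministically here, unlike the sort_by behavior noted below.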
We took advantage of the divmod method in Ruby to grab the integer quotient and remainder in one shot, like so: buckets.map do |bucket| (inflow * bucket).divmod(total_bucket_weight) end This gives us 123456*31000/100000, 123456*35000/100000, 123456*20000/100000, and 123456*14000/100000. The integer quotients with their respective remainders are [38271, 36000], [43209, 60000], [24691, 20000], [17283, 84000]. Next, we find the leftover pennies by taking the inflow minus the total of the integer quotients, which is 123456 - (38271 + 43209 + 24691 + 17283) = 2. Finally, we sort our buckets in descending remainder order (because the buckets with the highest remainders are most deserving of extra pennies) and allocate the leftover pennies we have in this order. It’s worth noting that in our case, we’re using Ruby’s sort_by method, which gives us a nondeterministic order in the case where remainders are equal. In this case, our fourth bucket and second bucket, respectively, are most deserving. Our final allocations are therefore [38271, 43210, 24691, 17284]. This means that your college savings fund gets $382.71, your car fund gets $432.10, your house fund gets $246.91, and your emergency fund gets $172.84. The Code Solution: Make It Functional Given that we often have to manage penny allocations across a person’s goals throughout our codebase, the last thing we’d want is to have to bake penny-pushing logic throughout our domain logic. Therefore, we decided to extract our allocation code into a module function. Then, we took it even further. Our allocation code doesn’t need to care that we’re looking to allocate money, just that we’re looking to allocate integers. What we ended up with was a black-box ‘Allocator’ module, with a public module function to which you could pass two arguments: an inflow and an array of weightings. Our Ruby code looks like this. The takeaway The biggest lesson to learn from this experience is that, as an engineer, you should not be afraid to take a functional approach when it makes sense. In this case, we were able to extract a solution to a complicated problem and keep our OO domain-specific logic clean. -
How We Engineered Betterment’s Tax-Coordinated Portfolio™
How We Engineered Betterment’s Tax-Coordinated Portfolio™ For our latest tax-efficiency feature, Tax Coordination, Betterment’s solver-based portfolio management system enabled us to manage and test our most complex algorithms. Tax efficiency is a key consideration of Betterment’s portfolio management philosophy. With our new Tax Coordination feature, we’re continuing the mission to help our customers’ portfolios become as tax efficient as possible. While new products can often be achieved using our existing engineering abstractions, our Tax-Coordinated Portfolio (TCP) brought the engineering team a new level of complexity that required us to rethink how parts of our portfolio management system were built. Here’s how we did it. A Primer on Tax Coordination Betterment’s TCP feature is our very own, fully automated version of an investment strategy known as asset location. If you’re not familiar with asset location, it is a strategy designed to optimize after-tax returns by placing tax-inefficient securities into more tax-advantaged accounts, such as 401(k)s and Individual Retirement Accounts (IRAs). Before we built TCP, Betterment customers had each account managed as a separate, standalone portfolio. For example, customers could set up a Roth IRA with a portfolio of 90% stocks and 10% bonds to save for retirement. Separately, they could set up a taxable retirement account invested likewise in 90% stocks and 10% bonds. Now, Betterment customers can turn on TCP in their accounts, and their holdings in multiple investment accounts will be managed as a single portfolio allocation, but rearranged in such a way that the holdings across those accounts seek to maximize the overall portfolio’s after-tax returns. To illustrate, let’s suppose you’re a Betterment customer with three different accounts: a Roth IRA, a traditional IRA, and a taxable retirement account. Let’s say that each account holds $50,000, for a total of $150,000 in investments. Now assume that the $50,000 in each account is invested into a portfolio of 70% stocks and 30% bonds. For reference, consider the diagram. The circles represent various asset classes, and the bar shows the allocation for all the accounts, if added together. Each account has a 70/30 allocation, and the accounts will add up to 70/30 in the aggregate, but we can do better when it comes to maximizing after-tax returns. We can maintain the aggregate 70/30 asset allocation, but use the available balances of $50,000 each to rearrange the securities in such a way that places the most tax-efficient holdings into a taxable account, and the most tax-inefficient ones into IRAs. Here’s a simple animation solely for illustrative purposes: Asset Location in Action The result is the same 70/30 allocation overall, except TCP has now redistributed the assets unevenly, to reduce future taxes. How We Modeled the Problem The fundamental questions the engineering team tried to answer were: How do we get our customers to this optimal state, and how do we maintain it in the presence of daily account activity? We could have attempted to construct a procedural-style heuristic solution to this, but the complexity of the problem led us to believe this approach would be hard to implement and challenging to maintain. Instead, we opted to model our problem as a linear program. This made the problem provably solvable and quick to compute—on the order of milliseconds per customer. Let’s consider a hypothetical customer account example. Meet Joe Joe is a hypothetical Betterment customer.
When he signed up for Betterment, he opened a Roth IRA account. As an avid saver, Joe quickly reached his annual Roth IRA contribution limit of $5,500. Wanting to save more for his retirement, he decided to open up a Betterment taxable account, which he funded with an additional $11,000. Note that the contribution limits mentioned in this example are as of the time this article was published. Limits are subject to change from year to year, so please defer to IRS guidelines for current IRA and 401(k) limits. Joe isn’t one to take huge risks, so he opted for a moderate asset allocation of 50% stocks and 50% bonds in both his Roth IRA and taxable accounts. To make things simple, let’s assume that both portfolios are only invested in two asset classes: U.S. total market stocks and emerging markets bonds. In his taxable account, Joe holds $5,500 worth of U.S. total market stocks in VTI (Vanguard Total Stock Market ETF), and $5,500 worth of emerging markets bonds in VWOB (Vanguard Emerging Markets Bond ETF). Let’s say that his Roth IRA holds $2,750 of VTI, and $2,750 of VWOB. To summarize Joe’s holdings: his taxable account holds $5,500 of VTI and $5,500 of VWOB ($11,000 in total), his Roth IRA holds $2,750 of VTI and $2,750 of VWOB ($5,500 in total), and across both accounts he holds $8,250 of VTI and $8,250 of VWOB ($16,500 overall). To begin to construct our model for an optimal asset location strategy, we need to consider the relative value of each fund in both accounts. A number of factors are used to determine this, but most importantly each fund’s tax efficiency and expected returns. Let’s assume we already know that VTI has a higher expected value in Joe’s taxable account, and that VWOB has a higher expected value in his Roth IRA. To be more concrete about this, let’s define some variables, each representing the expected value of holding a particular fund in a particular account. For example, we represent the expected value of holding VTI in Joe’s taxable account as value(VTI, Taxable), which we’ve defined to be 0.07. More generally, let value(F, A) be the expected value of holding fund F in account A. Circling back to the original problem, we want to rearrange the holdings in Joe’s accounts in a way that’s maximally valuable in the future. Linear programs try to optimize the value of an objective function. In this example, we want to maximize the expected value of the holdings in Joe’s accounts. The overall value of Joe’s holdings is a function of the specific funds in which he has investments. Let’s define that objective function: V = value(VTI, Taxable) * balance(VTI, Taxable) + value(VWOB, Taxable) * balance(VWOB, Taxable) + value(VTI, Roth) * balance(VTI, Roth) + value(VWOB, Roth) * balance(VWOB, Roth). You’ll notice the familiar value(F, A) terms measuring the expected value of holding each fund in each account, but you’ll also notice variables of the form balance(F, A). Precisely, this variable represents the balance of fund F in account A. These are our decision variables—variables that we’re trying to solve for. Let’s plug in some balances to see what the expected value of V is with Joe’s current holdings: V=0.07*5500+0.04*5500+0.06*2750+0.05*2750=907.5 Certainly, we can do better. We cannot just assign arbitrarily large values to the decision variables due to two restrictions which cannot be violated: Joe must maintain $11,000 in his taxable account and $5,500 in his Roth IRA. We cannot assign Joe more money than he already has, nor can we move money between his Roth IRA and taxable accounts. Joe’s overall portfolio must also maintain its allocation of 50% stocks and 50% bonds—the risk profile he selected. We don’t want to invest all of his money into a single fund.
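Before formalizing those restrictions, here is Joe’s example written out as a small linear program. This sketch uses the open-source PuLP library purely for illustration (it is not the solver or framework described in this article), and it enforces the 50/50 allocation exactly rather than with the drift terms introduced below.

from pulp import LpMaximize, LpProblem, LpStatus, LpVariable, lpSum, value

# value(F, A): the expected value of holding fund F in account A, from the example above.
expected_value = {
    ("VTI", "Taxable"): 0.07,
    ("VWOB", "Taxable"): 0.04,
    ("VTI", "Roth"): 0.06,
    ("VWOB", "Roth"): 0.05,
}

# balance(F, A): the decision variables, i.e., dollars of fund F held in account A.
balance = {
    key: LpVariable(f"balance_{key[0]}_{key[1]}", lowBound=0) for key in expected_value
}

problem = LpProblem("joes_asset_location", LpMaximize)

# Objective: maximize V, the expected value of Joe's holdings.
problem += lpSum(expected_value[key] * balance[key] for key in balance)

# Restriction 1: account totals are fixed; money cannot move between accounts.
problem += balance[("VTI", "Taxable")] + balance[("VWOB", "Taxable")] == 11000
problem += balance[("VTI", "Roth")] + balance[("VWOB", "Roth")] == 5500

# Restriction 2: the aggregate portfolio stays at 50% stocks and 50% bonds.
problem += balance[("VTI", "Taxable")] + balance[("VTI", "Roth")] == 8250
problem += balance[("VWOB", "Taxable")] + balance[("VWOB", "Roth")] == 8250

problem.solve()
print(LpStatus[problem.status])  # Optimal
for key, variable in balance.items():
    print(key, value(variable))  # VTI fills the taxable account; VWOB takes the rest
print("V =", value(problem.objective))  # 962.5, up from 907.5

Even this toy model recovers the intuition described above: VWOB migrates into the Roth IRA, while VTI concentrates in the taxable account.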
Mathematically, it’s straightforward to represent the first restriction as two linear constraints: balance(VTI, Taxable) + balance(VWOB, Taxable) = 11,000 and balance(VTI, Roth) + balance(VWOB, Roth) = 5,500. Simply put, we’ve asserted that the sum of the balances of every fund in Joe’s taxable account must remain at $11,000. Similarly, the sum of the balances of every fund in his Roth IRA must remain at $5,500. The second restriction—maintaining the portfolio allocation of 50% stocks and 50% bonds—might seem straightforward, but there’s a catch. You might guess that you can express it as follows: balance(VTI, Taxable) + balance(VTI, Roth) = 8,250 and balance(VWOB, Taxable) + balance(VWOB, Roth) = 8,250. The above statements assert that the sum of the balances of VTI across Joe’s accounts must be equal to half of his total balance. Similarly, we’re also asserting that the sum of the balances of VWOB across Joe’s accounts must be equal to the remaining half of his total balance. While this will certainly work for this particular example, enforcing that the portfolio allocation is exactly on target when determining optimality turns out to be too restrictive. In certain scenarios, it’s undesirable to buy or to sell a specific fund because of tax consequences. These restrictions require us to allow for some portfolio drift—some deviation from the target allocation. We made the decision to maximize the expected after-tax value of a customer’s holdings after having achieved the minimum possible drift. To accomplish this, we need to define new decision variables and add them to our objective function: drift_above(AC) is the dollar amount above the target balance in asset class AC, and drift_below(AC) is the dollar amount below the target balance in asset class AC. For instance, drift_above(EmergingMarketsBonds) is the dollar amount above the target balance in emerging markets bonds—the asset class to which VWOB belongs. We still want to maximize our objective function V. However, with the introduction of the drift terms, we want every dollar allocated toward a single fund to incur a penalty if it pushes that fund’s asset class below or above its target balance. To do this, we can relate the balance(F, A) terms with the drift terms using linear constraints. For U.S. total market stocks, we assert that the sum of the balances in funds in that asset class (in this case, only VTI), plus the net drift in that asset class (drift_below minus drift_above), must be equal to the target balance of that asset class in the portfolio (which, in this case, is 50% of Joe’s total holdings). Similarly, we’ve also done this for emerging markets bonds. This way, if we can’t achieve perfect allocation, we have a buffer that we can fill—albeit at a penalty. Now that we have our objective function and constraints set up, we just need to solve these equations using a mathematical programming solver. For Joe, the optimal solution is to hold all $8,250 of VTI in his taxable account, fill the remaining $2,750 of the taxable account with VWOB, and hold his entire $5,500 Roth IRA in VWOB, raising V from 907.5 to 962.5. Managing Engineering Complexity Reaching the optimal balances would require our system to buy and sell securities in Joe’s investment accounts. It’s not always free for Joe to go from his current holdings to optimal ones because buying and selling securities can have tax consequences. For example, if our system sold something at a short-term capital gain in Joe’s taxable account, or bought a security in his Roth IRA that was sold at a loss in the last 30 days (triggering the wash-sale rule), we would be negatively impacting his after-tax return. In the simple example above with two accounts and two funds, there are a total of four constraints. Our production model is orders of magnitude more complex, and considers each Betterment customer’s individual tax lots, which introduces hundreds of individual constraints to our model.
Generating the constraints that ultimately determine buying and selling decisions can often involve tricky business logic that examines a variety of data in our system. In addition, we knew that as our work on TCP progressed, we were going to need to iterate on our mathematical model. Before diving head first into the code, we made it a priority to be cognizant of the engineering challenges we would face. As a result, we wanted to make sure that the software we built respected four key principles, which are: Isolation from third-party solver APIs. Ability to keep pace with changes to the mathematical model, e.g., adding, removing, and changing the constraints and the objective function must be quick and painless. Separation of concerns between how we accessed data in our system and the business logic defining algorithmic behavior. Easy and comprehensive testing. We built our own internal framework for modeling mathematical programs that was not tied to our trading system’s domain-specific business logic. This gave us the flexibility to switch easily between a variety of third-party mathematical programming solvers. Our business logic that generates the model knows only about objects defined by our framework, and not about third-party APIs. To incorporate a third-party solver into our system, we built a translation layer that received our system-generated constraints and objective function as inputs, and utilized those inputs to solve the model using a third-party API. Switching between third-party solvers simply meant switching implementations of the interface below. We wanted that same level of flexibility in changing our mathematical model. Changing the objective function and adding new constraints needed to be easy to do. We did this by providing well-defined interfaces that give engineers access to core system data needed to generate our model. This means that an engineer implementing a change to the model would only need to worry about implementing algorithmic behavior, and not about how to retrieve the data needed to do that. To add a new set of constraints, engineers simply provide an implementation of a TradingConstraintGenerator. Each TradingConstraintGenerator knows about all of the system-related data it needs to generate constraints. Through dependency injection, the new generator is included among the set of generators used to generate constraints. The sample code below illustrates how we generated the constraints for our model. With hundreds of constraints and hundreds of thousands of unique tax profiles across our customer base, we needed to be confident that our system made the right decisions in the right situations. For us, that meant having clear, readable tests that were a joy to write. Below is a test written in Groovy, which sets up fixture data that mimics the exact situation in our “Meet Joe” example. We not only had unit tests such as the one above to test simple scenarios where a human could calculate the outcome, but we also ran the optimizer in a simulated production-like environment, through hundreds of thousands of scenarios that closely resembled real ones. During testing, we often ran into scenarios where our model had no feasible solution—usually due to a bug we had introduced. As soon as the bug was fixed, we wanted to ensure that we had automated tests to handle a similar issue in the future.
However, with so many sources of input affecting the optimized result, writing tests to cover these cases was very labor-intensive. Instead, we automated the test setup by building tools that could snapshot our input data as of the time the error occurred. The input data was serialized and automatically fed back into our test fixtures. Striving for Simplicity At Betterment, we aim to build products that help our customers reach their financial goals. Building new products can often be done using our existing engineering abstractions. However, TCP brought a new level of complexity that required us to rethink the way parts of our trading system were built. Modeling and implementing our portfolio management algorithms using linear programming was not easy, but it ultimately resulted in the simplest possible system needed to reliably pursue optimal after-tax returns. To learn more about engineering at Betterment, visit the engineering page on the Betterment Resource Center. All return examples and return figures mentioned above are for illustrative purposes only. For much more on our TCP research, including additional considerations on the suitability of TCP to your circumstances, please see our white paper. See full disclosure for our estimates and Tax Coordination in general. -
The Evolution of the Betterment Engineering Interview
The Evolution of the Betterment Engineering Interview Betterment’s engineering interview now includes a pair programming experience where candidates are tested on their collaboration and technical skills. Building and maintaining the world’s largest independent robo-advisor requires a talented team of human engineers. This means we must continuously iterate on our recruiting process to remain competitive in attracting and hiring top talent. As our team has grown impressively from five to more than 50 engineers in just the last three years, we’ve significantly improved our ability to make clear hiring decisions and shortened our total hiring timeline. Back in the Day Here’s how our interview process once looked: Resumé review Initial phone screen Technical phone screen Onsite: Day 1 Technical interview (computer science fundamentals) Technical interview (modelling and app design) Hiring manager interview Onsite: Day 2 Product and design interview Company founder interview Company executive interview While this process helped in growing our engineering team, it began showing some cracks along the way. The main recurring issue was that hiring managers were left uncertain as to whether a candidate truly possessed the technical aptitude and skills to justify making them an employment offer. While we tried to construct computer science and data modelling problems that led to informative interviews, watching candidates solve these problems still wasn’t getting to the heart of whether they’d be successful engineers once at Betterment. In addition to problems arising from the types of questions asked, we saw that one of our primary interview tools, the whiteboard, was actually getting in the way; many candidates struggled to communicate their solutions using a whiteboard in an interview setting. The last straw for using whiteboards came from feedback provided by Betterment’s Women in Technology group. When I sat down with them to solicit feedback on our entire hiring process, they pointed to the whiteboard problem-solving dynamics (one to two engineers sitting, observing, and judging the candidate standing at a whiteboard) as unnatural and awkward. It was clear this part of the interviewing process needed to go. We decided to allow candidates the choice of using a whiteboard if they wished, but it would no longer be the default method for presenting one’s skills. If we did away with the whiteboard, then what would we use? The most obvious alternative was a computer, but then many of our engineers expressed concerns with this method, having had bad experiences with computer-based interviews in the past. After spirited internal discussions, we landed on a simple principle: We should provide candidates the most natural setting possible to demonstrate their abilities. As such, our technical interviews switched from whiteboards to computers. Within the boundaries of that principle, we considered multiple interview formats, including take-home and online assessments, and several variations of pair programming interviews. In the end, we landed on our own flavor of a pair programming interview.
Today: A Better Interview Here’s our revised interview process: Resumé review Initial phone screen Technical phone screen Onsite: Technical interview 1 Ask the candidate to describe a recent technical challenge in detail Set up the candidate’s laptop Introduce the pair programming problem and explore the problem Pair programming (optional, time permitting) Technical interview 2 Pair programming Technical interview 3 Pair programming Ask-Me-Anything session Product and design interview Hiring manager interview Company executive interview While an interview setting may not offer pair programming in its purest sense, our interviewers truly participate in the process of writing software with the candidates. Instead of simply instructing and watching candidates as they program, interviewers can now work with them on a real-world problem, and they take turns in control of the keyboard. This approach puts candidates at ease, and feels closer to typical pair programming than one might expect. As a result, in addition to learning how well a candidate can write code, we learn how well they collaborate. We also split the main programming portion of our original interview into separate sections with different interviewers. It’s nice to give candidates a short break in between interviews, but the main reason for the separation is to evaluate the handoff. We like to evaluate how well a candidate explains the design decisions and progress from one interviewer to the next. Other Improvements We also streamlined our question-asking process and hiring timeline, and added an opportunity for candidates to speak with non-interviewers. Questions Interviews are now more prescriptive regarding non-technical questions. Instead of multiple interviewers asking a candidate about the same questions based on their resumé, we prescribe topics based on the most important core competencies of successful (Betterment) engineers. Each interviewer knows which competencies (e.g., software craftsmanship) to evaluate. Sample questions, not scripts, are provided, and interviewers are encouraged to tailor the competency questions to the candidates based on their backgrounds. Timeline Another change is that the entire onsite interview is completed in a single day. This can make scheduling difficult, but in a city as competitive as New York is for engineering talent, we’ve found it valuable to get to the final offer stage as quickly as possible. Discussion Finally, we’ve added an Ask-Me-Anything (AMA) session—another idea provided by our Women in Technology group. While we encourage candidates to ask questions of everyone they meet, the AMA provides an opportunity to meet with a Betterment engineer who has zero input on whether or not to hire them. Those “interviewers” don’t fill out a scorecard, and our hiring managers are forbidden from discussing candidates with them. Ship It Our first run of this new process took place in November 2015. Since then, the team has met several times to gather feedback and implement tweaks, but the broad strokes have remained unchanged. As of July 2016, all full-stack, mobile, and site-reliability engineering roles have adopted this new approach. We’re continually evaluating whether to adopt this process for other roles, as well. Our hiring managers now report that they have a much clearer understanding of what each candidate brings to the table. In addition, we’ve consistently received high marks from candidates and interviewers alike, who prefer our revamped approach. 
While we didn’t run a scientifically valid split-test for the new process versus the old (it would’ve taken years to reach statistical significance), our hiring metrics have improved across the board. We’re happy with the changes to our process, and we feel that it does a great job of fully and honestly evaluating a candidate’s abilities, which helps Betterment to continue growing its talented team. For more information about working at Betterment, please visit our Careers page. -
Women Who Code: An Engineering Q&A with Venmo
Women Who Code: An Engineering Q&A with Venmo Betterment recently hosted a Women in Tech meetup with Venmo developer Cassidy Williams, who spoke about impostor syndrome. Growing up, I watched my dad work as an electrical engineer. Every time I went with him on Take Your Child to Work Day, it became more and more clear that I wanted to be an engineer, too. In 2012, I graduated from the University of Portland with a degree in computer science and promptly moved to the Bay Area. I got my first job at Intel, where I worked as a Scala developer. I stayed there for several years until last May, when I uprooted my life to New York for Betterment, and I haven’t looked back since. As an engineer, I not only love building products from the ground up, but I’m passionate about bringing awareness to diversity in tech, an important topic that has soared to the forefront of social justice issues. People nationwide have chimed in on the conversation. Most recently, Isis Wenger, a San Francisco-based platform engineer, sparked the #ILookLikeAnEngineer campaign, a Twitter initiative designed to combat gender inequality in tech. At Betterment, we’re working on our own set of initiatives to drive the conversation. We’ve started an internal roundtable to voice our concerns about gender inequality in the workplace, we’ve sponsored and hosted Women in Tech meetups, and we’re starting to collaborate with other companies to bring awareness to the issue. Cassidy Williams, a software engineer at mobile payments company Venmo, recently came in to speak. She gave a talk on impostor syndrome, a psychological phenomenon in which people are unable to internalize their accomplishments. The phenomenon, Williams said, is something that she has seen particularly among high-achieving women—where self-doubt becomes an obstacle to professional development. For example, they think they’re ‘frauds,’ or unqualified for their jobs, regardless of their achievements. Williams’ goal is to help women recognize the characteristic and empower them to overcome it. Williams has been included as one of Glamour Magazine's 35 Women Under 35 Who Are Changing the Tech Industry and listed in the Innotribe Power Women in FinTech Index. As an engineer myself, I was excited to speak with her after the event about coding, women in tech, and fintech trends. Cassidy Williams, Venmo engineer, said impostor syndrome tends to be more common in high-achieving women. Photo credit: Christine Meintjes Abi: Can you speak about a time in your life where ‘impostor syndrome’ was limiting in your own career? How did you overcome that feeling? Cassidy: For a while at work, I was very nervous that I was the least knowledgeable person in the room, and that I was going to get fired because of it. I avoided commenting on projects and making suggestions because I thought that my insight would just be dumb, and not necessary. But at one point (fairly recently, honestly), it just clicked that I knew what I was doing. Someone asked for my help on something, and then I discussed something with him, and suddenly I just felt so much more secure in my job. Can you speak to some techniques that have personally proven effective for you in overcoming impostor syndrome? Asking questions, definitely. It does make you feel vulnerable, but it keeps you moving forward. It's better to ask a question and move forward with your problem than it is to struggle over an answer.
As a fellow software engineer, I can personally attest to experiencing this phenomenon in tech, but I’ve also heard from friends and colleagues that it can be present in non-technical backgrounds, as well. What are some ways we can all work together to empower each other in overcoming imposter syndrome? It's cliché, but just getting to know one another and sharing how you feel about certain situations at work is such a great way to empower yourself and empower others. It gets you both vulnerable, which helps you build a relationship that can lead to a stronger team overall. Whose Twitter feed do you religiously follow? InfoSec Taylor Swift. It's a joke feed, but they have some great tech and security points and articles shared there. In a few anecdotes throughout your talk, you mentioned the importance of having mentors and role models. Who are your biggest inspirations in the industry? Jennifer Arguello - I met Jennifer at the White House Tech Inclusion Summit back in 2013, where we hit it off talking about diversity in tech and her time with the Latino Startup Alliance. I made sure to keep in touch because I would be interning in the Bay Area, where she’s located, and we’ve been chatting ever since. Kelly Hoey - I met Kelly at a women in tech hackathon during my last summer as a student in 2013, and then she ended up being on my team on the British Airways UnGrounded Thinking hackathon. She and I both live in NYC now, and we see each other regularly at speaking engagements and chat over email about networking and inclusion. Rane Johnson - I met Rane at the Grace Hopper Celebration for Women in Computing in 2011, and then again when I interned at Microsoft in 2012. She and I started emailing and video chatting each other during my senior year of college, when I started working with her on the Big Dream Documentary and the International Women’s Hackathon at the USA Science and Engineering Festival. Ruthe Farmer - I first met Ruthe back in 2010 during my senior year of high school when I won the Illinois NCWIT Aspirations Award. She and I have been talking with each other at events and conferences and meetups (and even just online) almost weekly since then about getting more girls into tech, working, and everything in between. One of the things we chatted about after the talk was how empowering it is to have the resources and movements of our generation to bring more diversity to the tech industry. The solutions that come out of that awareness are game-changing. What are some specific ways in which companies can contribute to these movements and promote a healthier and more inclusive work culture? Work with nonprofits: Groups like NCWIT, the YWCA, the Anita Borg Institute, the Scientista Foundation, and several others are so great for community outreach and company morale. Educate everyone, not just women and minorities: When everyone is aware and discussing inclusion in the workplace, it builds and maintains a great company culture. Form small groups: People are more open to talking closely with smaller groups than a large discussion roundtable. Building those small, tight-knit groups promotes relationships that can help the company over time. It’s a really exciting time to be a software engineer, especially in fintech. What do you think are the biggest trends of our time in this space? Everyone's going mobile! What behavioral and market shifts can we expect to see from fintech in the next five to 10 years? 
I definitely think that even though cash is going nowhere fast, fewer and fewer people will ever need to make a trip to the bank again, and everything will be on our devices. What genre of music do you listen to when you’re coding? I switch between 80s music, Broadway show tunes, Christian music, and classical music. Depends on my feelings about the problem I'm working on. ;) IDE of choice? Vim! iOS or Android? Too tough to call.
Join our Open Source Projects
-
test_track
Server app for the TestTrack multi-platform split-testing and feature-gating system.
-
webvalve
Betterment’s framework for locally developing and testing service-oriented apps in isolation with WebMock and Sinatra-based fakes.
-
better_test_reporter
Tooling and libraries for processing Dart test output into dev-friendly formats.
-
delayed
A multi-threaded, SQL-driven ActiveJob backend used at Betterment to process millions of background jobs per day.