“Slack is down.” It’s a headline we have had blaring at TechCrunch on numerous occasions (mostly because we actually get work done when not distracted by a constant waterfall of GIFs). But Slack is not alone — issues with uptime and reliability plague modern web services, from Alexa to WhatsApp to Apple Maps.
As any software engineer can attest, web application development is extraordinarily complicated. Databases, storage services, and business logic all need to work together perfectly so that users can buy their goods or watch their films.
But what happens when one piece of that application breaks down? Today, a small outage in one AWS availability zone could cascade and knock an entire service offline, as we have seen repeatedly. Today’s developer tools are decent at spotting bugs and other logic errors, but they don’t investigate applications systematically to ask how they can respond to various crises.
That’s where Gremlin comes in. The service, founded by CEO Kolton Andrus, who designed Netflix’s failure injection service and worked with CTO Matthew Fornaciari while at Amazon, is designed to throw a monkey wrench into any application, simulating faults like storage errors, database congestion, and sudden spikes in latency. Its tagline is “break things on purpose” (something of a riff on Facebook’s “move fast and break things”).
Resiliency is clearly on investors’ minds, since the startup announced this morning at its Chaos Conf in SF that it has raised an $18 million Series B round led by Redpoint partner Tomasz Tunguz. That’s a follow-up to a $7.5 million Series A led by Index Ventures partner Mike Volpi, which was announced less than a year ago.
In addition to announcing the funding today, the company unveiled its “Application Level Fault Injection” system — a mouthful of a name, but a feature that will help DevOps engineers test systems at the application level, including most importantly serverless environments.
Andrus said in a note to TechCrunch that “This past year has been a whirlwind. We spent a lot of time educating everyone from engineers to CIOs about chaos engineering and building up the community.” He said the new funding will be used to further build out Gremlin’s engineering team.
As I wrote about in depth a few months ago, Gremlin is pioneering a field of software development dubbed “chaos engineering.” Rather than using formal verification to test whether code is accurate and performant, chaos engineers throw deliberate and systematic errors at an application in an attempt to simulate various types of failure and find brittle parts of software programs.
That sounds easy on the surface, but it is extremely complicated in practice: you want to simulate an outage without actually creating an outage on a mission-critical system. Netflix wants to test whether losing a database will cause video to stop playing, without physically pulling the plug on a database and seeing if your movie is still on the TV.
Gremlin’s platform provides something of a sandbox for engineers to slowly ramp up errors and then, more importantly, ramp them back down if a breakage is detected. So a DevOps engineer can add a few milliseconds of latency to a program and see how it responds, and then add a few more.
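To make that ramp-up-and-back-off loop concrete, here is a minimal sketch in Python. It is not Gremlin’s API: `ToyService`, `ramp_latency`, and the millisecond figures are all hypothetical names and numbers invented for illustration, and the “service” is a toy model that counts as broken once its base latency plus the injected latency exceeds its timeout.

```python
from typing import Callable, Iterable, List, Optional, Tuple


class ToyService:
    """Hypothetical stand-in for a real service (not a Gremlin object)."""

    def __init__(self, base_ms: int, timeout_ms: int) -> None:
        self.base_ms = base_ms
        self.timeout_ms = timeout_ms
        self.injected_ms = 0

    def set_injected_latency(self, ms: int) -> None:
        # In a real experiment this would configure the fault injector.
        self.injected_ms = ms

    def healthy(self) -> bool:
        # The toy model breaks once total latency exceeds the timeout.
        return self.base_ms + self.injected_ms <= self.timeout_ms


def ramp_latency(
    apply_latency: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps_ms: Iterable[int],
) -> Tuple[List[int], Optional[int]]:
    """Raise injected latency step by step; remove it as soon as health fails.

    Returns the steps the service survived and the step that broke it
    (None if every step passed).
    """
    survived: List[int] = []
    for delay_ms in steps_ms:
        apply_latency(delay_ms)
        if not is_healthy():
            apply_latency(0)  # ramp down: remove the fault immediately
            return survived, delay_ms
        survived.append(delay_ms)
    apply_latency(0)  # experiment finished cleanly, clean up
    return survived, None


service = ToyService(base_ms=20, timeout_ms=100)
survived, broke_at = ramp_latency(
    service.set_injected_latency, service.healthy, [10, 50, 100, 200]
)
print(survived, broke_at)  # [10, 50] 100
```

The point of the design is the early exit: the experiment stops at the first step that breaks the service and removes the fault, rather than pushing on to the larger delays.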
With the rise of serverless services like AWS Lambda, applications get even more complex. Now, an application doesn’t just live on a single instance; its individual functions could be scattered across multiple instances and potentially multiple data centers. That can save developer time and reduce costs, but it also exponentially increases the risk of something going wrong and harming an application’s reliability.
Gremlin’s new ALFI feature is designed to allow more fine-grained tuning of attacks, so that DevOps engineers can target just particular aspects of an application living in a serverless environment. It’s inspired by Andrus’ work at Netflix around Failure Injection Testing, which was a sort of successor to the company’s earlier Chaos Monkey tools.
Gremlin’s ALFI feature allows developers to simulate more fine-grained failures.
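What might targeting “just particular aspects” of an application look like in code? The sketch below is one way to picture the general idea of application-level fault injection, and it should not be read as how ALFI itself works. Every name in it (`fault_point`, `ACTIVE_ATTACKS`, `InjectedFault`, `lookup_title`) is hypothetical: a function is wrapped in a named injection point, and a fault fires only when an active attack names that point and its matching rule accepts the call’s arguments.

```python
import functools
from typing import Any, Callable, Dict, List


class InjectedFault(Exception):
    """Raised in place of the real call when an attack matches."""


# Hypothetical registry of running attacks. Each entry names a target
# injection point and a predicate over the call's keyword arguments.
ACTIVE_ATTACKS: List[Dict[str, Any]] = []


def fault_point(name: str) -> Callable:
    """Mark a function as a named point where faults may be injected."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            for attack in ACTIVE_ATTACKS:
                if attack["target"] == name and attack["match"](kwargs):
                    raise InjectedFault(f"injected failure at {name}")
            return fn(*args, **kwargs)

        return wrapper

    return decorator


@fault_point("lookup_title")
def lookup_title(*, region: str, title_id: int) -> str:
    # Stand-in for a real database or downstream service call.
    return f"title-{title_id}"


# Fail lookups only for callers in one region; everyone else is untouched.
ACTIVE_ATTACKS.append(
    {"target": "lookup_title", "match": lambda kw: kw.get("region") == "eu"}
)
```

With that attack active, `lookup_title(region="us", title_id=7)` returns normally while `lookup_title(region="eu", title_id=7)` raises `InjectedFault`, so the blast radius is limited to one slice of traffic.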
It’s these sorts of features that partly intrigued Tunguz at Redpoint, who is well-known for his thoughts on SaaS. He said in a note to TechCrunch that “In the modern cloud era — where systems are distributed, containerized, and highly ephemeral — it’s become nearly impossible to have a complete understanding of system behavior without doing the kind of proactive testing Gremlin offers.”
Gremlin’s work is not just to sell a service, but to reshape how developers think about building and testing applications. Perhaps someday all of our web services will be reliable – and then how will we get work done?