21 August 2025
At 3 a.m., you don't want to search for the right runbook. You want your organization's existing knowledge applied automatically, so you can focus on exploring what's truly new.
It's 3 a.m. and PagerDuty jolts you awake. You open Datadog and see metrics dropping fast. Users are already feeling it. You're tired, the pressure is on, and you need to fix things quickly.
Traditionally, you'd reach for a runbook. You'd scroll through a wiki or a document, trying to find the right one. Then, you'd follow each step manually. It works, sometimes. But it's slow, error-prone, and often out of date.
Runbooks are the backbone of SRE and operations. They're where engineers capture their expertise, turning tribal knowledge into step-by-step instructions that anyone on the team can follow. They make recovery consistent, helping new engineers step into complex situations with confidence. They're critical when the pressure is high and decisions need to be exact.
Still, runbooks come with serious challenges. They take time to locate, require manual effort, and can't possibly cover every scenario. Even when they're well maintained, they slow down incident response at the worst possible moments.
That's why new approaches have emerged.
On one side, startups are chasing automated RCA (root cause analysis). The vision: AI agents sift through logs, metrics, and traces to instantly pinpoint the cause of an outage. Companies like Traversal, Resolve, Cleric, and TierZero are pursuing this path. The promise is huge, but the current reality is modest: customers today report accuracy rates of around 20%. True RCA automation depends on breakthroughs in agent swarms and long-horizon planning, fields still under active development at places like DeepMind and OpenAI. It's exciting, but it isn't ready.
On the other side, companies like Rundeck (acquired by PagerDuty), StackStorm, Cutover, and Shoreline (acquired by Nvidia) focus on runbook automation. The idea is to take existing runbooks and wire them into automated flows. The problem is that creating these automations takes too long, and once created, they're brittle. Real incidents never play out exactly the same. They rhyme, but they don't repeat. Small variations throw deterministic automations off track, leaving teams back where they started.
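To make the brittleness concrete, here is a minimal Python sketch of a deterministic automation. It is not taken from any of the products named above; the alert names, service names, and the `pick_runbook` helper are all hypothetical. The automation is keyed on an exact alert and service pair, so an incident that merely rhymes with a known one finds no match.

```python
# Hypothetical deterministic runbook automation. Every name here is
# illustrative and does not come from a real product.
RUNBOOKS = {
    "HighErrorRate:checkout-api": [
        "restart checkout-api pods",
        "verify error rate is back under threshold",
    ],
}


def pick_runbook(alert_name: str, service: str) -> list[str]:
    """Look up steps by exact (alert, service) key; anything else is unhandled."""
    key = f"{alert_name}:{service}"
    if key not in RUNBOOKS:
        # A small variation (new service name, slightly different alert)
        # lands here, and the team is back to manual response.
        raise LookupError(f"no automation for {key!r}")
    return RUNBOOKS[key]
```

Calling `pick_runbook("HighErrorRate", "checkout-api")` returns the two steps, while the near-identical `pick_runbook("HighErrorRate", "checkout-api-v2")` raises `LookupError`, even though a human would recognize it as the same kind of incident.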
Forge is built to solve this problem, along with many others that Ops teams face.
We've built a system that allows engineers to rapidly create agentic automations for diagnosis and remediation. Instead of brittle scripts, Forge automations adapt intelligently at runtime, responding to the unique context of each incident. They can mix deterministic steps, agentic steps, and human-in-the-loop checkpoints, giving engineers the confidence that automation won't run wild in production.
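As a rough illustration of how deterministic steps, agentic steps, and human-in-the-loop checkpoints can sit in one flow, here is a minimal Python sketch. It is not Forge's actual API: `Step`, `run_workflow`, the step names, and the hard-coded error rate are assumptions made up for this example, and the "agentic" step is a plain function standing in for a model call.

```python
from dataclasses import dataclass
from typing import Callable

Context = dict  # incident context passed from step to step


@dataclass
class Step:
    name: str
    kind: str  # "deterministic", "agentic", or "approval"
    run: Callable[[Context], Context] = lambda ctx: ctx


def run_workflow(steps, ctx, approve):
    """Run steps in order; stop at an approval checkpoint a human rejects."""
    log = []
    for step in steps:
        if step.kind == "approval":
            if not approve(step.name, ctx):
                log.append(f"{step.name}: rejected, stopping")
                break
            log.append(f"{step.name}: approved")
            continue
        ctx = step.run(ctx)
        log.append(f"{step.name}: done")
    return ctx, log


def fetch_metrics(ctx):
    # Deterministic: a real version would query a metrics backend.
    return {**ctx, "error_rate": 0.12}  # hard-coded stand-in value


def choose_action(ctx):
    # "Agentic" stand-in: a real version might ask a model to pick an
    # action from the incident context. Here it is a simple rule.
    action = "rollback" if ctx["error_rate"] > 0.05 else "observe"
    return {**ctx, "proposed_action": action}


def apply_action(ctx):
    # Deterministic: record the action instead of touching production.
    return {**ctx, "applied": ctx["proposed_action"]}


WORKFLOW = [
    Step("fetch_metrics", "deterministic", fetch_metrics),
    Step("choose_action", "agentic", choose_action),
    Step("approve_action", "approval"),
    Step("apply_action", "deterministic", apply_action),
]
```

Running `run_workflow(WORKFLOW, {}, approve=lambda name, ctx: True)` carries the proposed action through to the final step. Passing an `approve` callback that returns `False` stops the flow before anything is applied, which is the point of the checkpoint: the adaptive part proposes, and a human decides whether it reaches production.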
With Forge, tribal knowledge doesn't get trapped in documents. It becomes executable, flexible, and immediately useful. Engineers can offload repetitive diagnostic and remediation tasks, shrinking MTTR and reducing the risk of human error at 3 a.m. Frontline responders act with the collective expertise of the entire org, while senior engineers focus on higher-value root cause analysis.
And automation doesn't have to be risky. Teams can start by automating the safest, most repetitive steps, the ones everyone already knows cold. Over time, they can expand to more complex flows as confidence grows.
By adopting Forge, teams get:
- Faster recovery times through automation of repetitive tasks
- Reduced human error during stressful incidents
- Knowledge captured and reused, not lost in wikis
- Flexible automations that adapt to real-world variability
- Confidence and control, with human-in-the-loop options when needed
At 3 a.m., you don't want to search for the right runbook. You want your organization's existing knowledge applied automatically, so you can focus on exploring what's truly new.
Ready to transform your SRE operations?
Contact us to learn how Forge can revolutionize your incident response.