14 August 2025
How a Series B FinTech reduced their platform engineering team from 15 to 8 while automating 80% of help-desk tickets and achieving 99.99% uptime using Forge.
a37 recently concluded an engagement with a rapidly growing Series B SaaS FinTech, henceforth “the Company”. At the time of the engagement, the Company's user base was growing 300% year-over-year, and it was being monitored by its chosen auditor for SOC 2 while also pursuing ISO 27001 compliance. Its 15-person platform engineering team was handling a high volume of manual DevOps tasks and incident responses.
Their main challenges included:
Operational Overhead: Engineers dedicated 40-50% of their time to repetitive help-desk tasks, like access grants or test environment provisioning.
Incident Management Issues: SRE events, such as database overloads or compliance issues, relied on ad-hoc runbooks, causing delays and fatigue.
Team Pressure: The 15-person team struggled to keep up, resulting in slower feature releases and the highest attrition of any team at the Company.
Our forward-deployed engineers helped them adopt Forge. Forge is an AI-native DevOps workspace that enables the creation of event-driven, reusable agentic workflows integrated with tools like AWS, Kubernetes, Terraform, Datadog, and PagerDuty. Forge supports real-time event responses, API integrations, and AI for tasks like incident classification or resource prediction. Workflows are modular, shareable, versioned, and auditable to maintain consistency and compliance.
Implementation started in Q1 2025 with a proof-of-concept, followed by a 1-month rollout. Their team worked with a37's architects to develop two key workflow sets:
The first set, help-desk automation, addressed routine tasks that previously required significant manual effort. Using Forge's interface and customization options, engineers built automations triggered by tools like Jira and Slack. Two examples, access requests and test environment provisioning, follow:
Trigger: Jira ticket for access requests (e.g., to an S3 bucket or Kubernetes namespace).
Steps:
Forge parses ticket details and checks against policies (e.g., via Okta integration).
If compliant, Forge sends a Slack notification to the access manager for approval.
Upon approval, Forge provisions IAM roles using Terraform, logs the actions, and notifies the requester and approver.
Forge escalates non-standard cases to a reviewer with relevant context.
Details: For a production database access request, the workflow applied PCI DSS rules, auto-approved read-only access, and flagged write access for review, completing in under 5 minutes compared to 2-3 hours before. Over 6 months, it processed 1,200 requests, automating 85% fully and cutting errors by 95%.
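The decision logic in the production database example can be sketched in a few lines of Python. This is a minimal illustration, not Forge's API: the `AccessRequest` type, the `Decision` values, the `STANDARD_RESOURCES` set, and the `triage` function are all hypothetical names, and the rule table is an assumption drawn only from the example above (read-only is auto-approved, write access is flagged for review, anything non-standard is escalated).

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_APPROVE = "auto_approve"   # provision without waiting for a human
    NEEDS_REVIEW = "needs_review"   # flag for a human reviewer
    ESCALATE = "escalate"           # non-standard case, send with context


@dataclass(frozen=True)
class AccessRequest:
    requester: str
    resource_type: str   # e.g. "s3_bucket", "k8s_namespace", "prod_database"
    access_level: str    # "read" or "write"


# Hypothetical list of resource types the policy knows how to handle.
STANDARD_RESOURCES = {"s3_bucket", "k8s_namespace", "prod_database"}


def triage(request: AccessRequest) -> Decision:
    """Map a parsed ticket to a decision (assumed rules, for illustration)."""
    if request.resource_type not in STANDARD_RESOURCES:
        return Decision.ESCALATE
    if request.access_level == "read":
        return Decision.AUTO_APPROVE
    if request.access_level == "write":
        return Decision.NEEDS_REVIEW
    # Unknown access levels are treated as non-standard.
    return Decision.ESCALATE
```

Under these assumptions, `triage(AccessRequest("alice", "prod_database", "write"))` returns `Decision.NEEDS_REVIEW`. Keeping the rules in one small pure function makes each decision easy to log and audit, which matters in a workflow that is itself subject to compliance review.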
Trigger: Self-service request for a new dev or staging environment.
Steps:
Provisions a Kubernetes namespace with resources via Terraform and GitHub IaC templates.
Monitors usage and scales down idle environments after 4 hours.
Runs compliance scans (e.g., with Trivy) and applies patches.
Ensures adherence to practices like encrypted storage.
Details: During a feature sprint, 50 environments were provisioned in 15 minutes each versus 4 hours previously, saving $25,000 quarterly on idle resources. It also maintained GDPR compliance by masking data in non-prod setups.
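The idle scale-down step could look roughly like the following Python sketch. It assumes that "after 4 hours" means four hours without recorded activity; the `Environment` type, its `last_activity` field, and the `idle_environments` function are hypothetical and do not come from Forge or Kubernetes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed threshold: an environment is idle after 4 hours without activity.
IDLE_LIMIT = timedelta(hours=4)


@dataclass(frozen=True)
class Environment:
    name: str
    last_activity: datetime  # timezone-aware timestamp of last observed usage


def idle_environments(envs, now, idle_limit=IDLE_LIMIT):
    """Return the names of environments idle for at least `idle_limit`."""
    return [env.name for env in envs if now - env.last_activity >= idle_limit]


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
envs = [
    Environment("feature-a", now - timedelta(hours=5)),
    Environment("feature-b", now - timedelta(hours=1)),
]
to_scale_down = idle_environments(envs, now)  # only "feature-a" qualifies
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test. The actual scale-down action (for example, setting replica counts to zero) is left out because it depends on the cluster setup.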
These workflows were developed in under two weeks during the POC, using Forge's connectors.
The second set, intelligent SRE event routing, improved incident handling by selecting the appropriate prebuilt runbooks for each alert and routing the alert to them.
Previously, the company maintained an extensive library of prebuilt runbooks for common SRE events, such as resource scaling, security threats, and compliance checks. However, engineers often took too long to locate the right ones during active incidents, leading to delays. To solve this, they developed an agentic workflow in Forge that analyzes events, identifies and executes relevant existing runbooks, and shares findings via a Slack post for quick visibility and collaboration.
Trigger: Alert from Datadog or PagerDuty (e.g., high CPU usage, security anomalies, or compliance violations).
Steps:
Forge analyzes the event's metadata and logs, classifies its severity, and matches it to one or more relevant prebuilt runbooks from the company's library.
For low-severity issues, it automatically invokes and executes the runbooks to remediate (e.g., scaling Kubernetes pods or running compliance scans with OPA to correct drifts).
For high-severity issues, it runs initial diagnostic steps from the runbooks to gather context, then escalates to the on-call engineer with enriched details (e.g., diagnostic outputs and recommended actions).
Posts a summary, including key findings, actions taken, and links to logs, to Slack for team review and follow-up; also sends email notifications if escalation requires broader involvement.
Logs all steps for post-incident reviews and continuous improvement.
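The routing steps above can be sketched as a small Python function. Everything here is illustrative: the `Alert` and `Plan` types, the `RUNBOOKS` table, and the runbook names are hypothetical, and severity is taken as an input, whereas in the workflow described above Forge classifies it from metadata and logs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str     # e.g. "datadog" or "pagerduty"
    kind: str       # e.g. "high_cpu", "security_anomaly", "compliance_violation"
    severity: str   # "low" or "high" (assumed to be classified upstream)


@dataclass(frozen=True)
class Plan:
    runbooks: tuple[str, ...]
    action: str                # "remediate" or "diagnose_and_escalate"
    notify: tuple[str, ...]


# Hypothetical mapping from alert kind to prebuilt runbook names.
RUNBOOKS = {
    "high_cpu": ("scale_kubernetes_pods",),
    "compliance_violation": ("run_opa_compliance_scan",),
    "security_anomaly": ("collect_security_diagnostics",),
}


def route(alert: Alert) -> Plan:
    """Pick runbooks for an alert and decide whether to remediate or escalate."""
    runbooks = RUNBOOKS.get(alert.kind, ())
    if alert.severity == "low" and runbooks:
        # Low severity with a known runbook: remediate automatically.
        return Plan(runbooks, "remediate", ("slack",))
    # High severity, or no matching runbook: run diagnostics, page on-call.
    return Plan(runbooks, "diagnose_and_escalate", ("slack", "on_call"))
```

With this table, a low-severity `high_cpu` alert yields a `remediate` plan that runs `scale_kubernetes_pods`, while any high-severity alert is escalated to the on-call engineer. Treating alerts with no matching runbook as escalations is a design choice in this sketch, made so that unknown events are never silently dropped.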
This company was one of the earliest users of Forge. After implementation, the company saw:
Team Adjustment: Platform engineering team reduced from 15 to 8.
Efficiency Improvements: Automated 80% of help-desk tickets, saving ~500 hours quarterly.
Cost Reductions: Cut AWS spending by 25% via optimization.
Compliance and Reliability: No violations in recent audits, with 99.99% uptime and 95% fewer provisioning errors.
Key Takeaway
By implementing Forge, this rapidly scaling FinTech transformed their DevOps operations from a bottleneck into a competitive advantage. The combination of automated help-desk workflows and intelligent SRE event routing not only reduced operational overhead but also improved team morale and enabled faster feature delivery.
Ready to transform your DevOps operations like this FinTech did?
Contact us to learn how Forge can revolutionize your infrastructure management.