Senior/Principal DevOps
Aghanim • Lisbon
Published on 15/04/2026 at 21:16
Job Description
We’re looking for a Senior/Principal DevOps to own our cloud-only platform and keep it reliable under high-load and bursty traffic. Our services run entirely on GCP, fronted by Cloudflare, with deep observability in Datadog and CI/CD in GitHub Actions.
This is a hands-on role with real ownership: ensuring we meet our SLAs/SLOs, scaling fast (10–50×), and keeping infrastructure efficient and cost-conscious as the company grows and microservices multiply.
Role Responsibilities
-------------------------
- Cloud Infrastructure Ownership
  - Own and evolve production infrastructure on GCP and Cloudflare (cloud-only, no on-prem).
  - Maintain high availability and performance for a SaaS platform serving both B2B and B2C use cases.
- Scalability & High-Load Resilience
  - Design and operate for unpredictable spikes where load can jump 10–20× within seconds.
  - Build scaling strategies across compute, networking, and data layers (autoscaling, capacity planning, bottleneck removal, safe degradation patterns).
- SLA/SLO & Incident Excellence
  - Be accountable for reliability outcomes: availability, latency, and error rates tied to SLAs/SLOs.
  - Lead incident response practices: detection → mitigation → postmortem → permanent fixes (root cause elimination).
- IaC & Kubernetes Platform Operations
  - Build and maintain Infrastructure as Code using Terraform (and Terragrunt where applicable).
  - Own Kubernetes operations on GKE: upgrades, scaling, operational hardening.
  - Write and maintain Helm charts and Kubernetes manifests where needed.
- Observability (Datadog)
  - Build end-to-end observability using Datadog (metrics/logs/APM): dashboards, monitors, alert strategy.
  - Ensure critical system paths and dependencies are visible and actionable (reduce alert noise, increase signal).
- DevSecOps Baseline
  - Configure and operate security tooling and monitoring (e.g., Security Command Center, scanners/analyzers).
  - Triage findings and either fix issues directly or delegate remediation to the right teams.
- CI/CD Enablement
  - Collaborate with engineering to streamline and harden GitHub Actions CI/CD pipelines.
  - Increase deployment safety and speed through automation and platform guardrails.
- Cost Management
  - Own cost visibility and optimization: identify waste, right-size resources, and implement practical FinOps controls.
Required Qualifications
---------------------------
- Strong production experience in DevOps/SRE (typically 5+ years, but we value impact over years).
- Proven experience operating infrastructure for SaaS with explicit SLA commitments (B2B + B2C is a plus).
- Hands-on expertise with GCP, especially GKE, plus relevant managed services (e.g., Cloud SQL, BigQuery, Bigtable, Pub/Sub, Dataflow, Cloud Run, Cloud Deploy, Memorystore).
- Strong Infrastructure-as-Code skills with Terraform (bonus: Terragrunt).
- Strong Kubernetes operations background (GKE at scale, reliability practices, upgrades, scaling).
- Experience with Cloudflare (WAF/DNS/edge basics; Workers/CDN is a plus).
- Production observability experience with Datadog (or comparable), ideally including APM/logging.
- Strong scripting/automation skills and a reliability-first mindset.
Preferred Qualifications
----------------------------
- Experience in game development or similarly bursty, high-load consumer products.
- Familiarity with SOC 2 / PCI-DSS audits and security architecture requirements.
- Service mesh experience (e.g., Cloud Service Mesh) in production.
- Mature SRE practices: error budgets, on-call maturity, runbooks, proactive incident prevention.
What Success Looks Like
---------------------------
- Platform consistently meets or exceeds SLA/SLO targets under bursty high load.
- Incidents are detected early, mitigated quickly, and do not recur, thanks to strong postmortem follow-through.
- Scaling events (10–50×) are routine rather than heroic.
- Cloud spend is transparent, controlled, and optimized without harming reliability.
- Engineering teams ship faster with safer, smoother CI/CD and fewer infrastructure bottlenecks.
Why Join Us
---------------
- Cloud-only infrastructure (GCP) with meaningful scale and real reliability ownership.
- Small team (15–20 engineers) with high autonomy and fast decision-making.
- Direct impact on platform stability, scaling, and cost efficiency.
- Opportunity to shape SRE culture, tooling, and operational standards in a fast-growing startup.
Aghanim helps game developers achieve financial and creative independence by providing the solutions they need to launch, run, and grow their businesses.