Senior/Principal DevOps

Aghanim • Lisboa

Published on 15/04/2026 at 21:16

Full-time • IT (Programming)
Job Description

We’re looking for a Senior/Principal DevOps to own our cloud-only platform and keep it reliable under high-load and bursty traffic. Our services run entirely on GCP, fronted by Cloudflare, with deep observability in Datadog and CI/CD in GitHub Actions.

This is a hands-on role with real ownership: ensuring we meet our SLA/SLOs, scaling fast (10–50×), and keeping infrastructure efficient and cost-conscious as the company grows and microservices multiply.

Role Responsibilities
-------------------------

  • Cloud Infrastructure Ownership
      • Own and evolve production infrastructure on GCP and Cloudflare (cloud-only, no on-prem).
      • Maintain high availability and performance for a SaaS platform serving both B2B and B2C use cases.
  • Scalability & High-Load Resilience
      • Design and operate for unpredictable spikes where load can jump 10–20× within seconds.
      • Build scaling strategies across compute, networking, and data layers (autoscaling, capacity planning, bottleneck removal, safe degradation patterns).
  • SLA/SLO & Incident Excellence
      • Be accountable for reliability outcomes: availability, latency, and error rates tied to SLA/SLO.
      • Lead incident response practices: detection, mitigation, postmortem, and permanent fixes (root cause elimination).
  • IaC & Kubernetes Platform Operations
      • Build and maintain Infrastructure as Code using Terraform (and Terragrunt where applicable).
      • Own Kubernetes operations on GKE: upgrades, scaling, operational hardening.
      • Write and maintain Helm charts and Kubernetes manifests where needed.
  • Observability (Datadog)
      • Build end-to-end observability using Datadog (metrics/logs/APM): dashboards, monitors, alert strategy.
      • Ensure critical system paths and dependencies are visible and actionable (reduce alert noise, increase signal).
  • DevSecOps Baseline
      • Configure and operate security tooling and monitoring (e.g., Security Command Center, scanners/analyzers).
      • Triage findings and either fix issues directly or delegate remediation to the right teams.
  • CI/CD Enablement
      • Collaborate with engineering to streamline and harden GitHub Actions CI/CD pipelines.
      • Increase deployment safety and speed through automation and platform guardrails.
  • Cost Management
      • Own cost visibility and optimization: identify waste, right-size resources, and implement practical FinOps controls.

Required Qualifications
---------------------------

  • Strong production experience in DevOps/SRE (typically 5+ years, but we value impact over years).
  • Proven experience operating infrastructure for SaaS with explicit SLA commitments (B2B + B2C is a plus).
  • Hands-on expertise with GCP, especially GKE, plus relevant managed services (e.g., Cloud SQL, BigQuery, Bigtable, Pub/Sub, Dataflow, Cloud Run, Cloud Deploy, Memorystore).
  • Strong Infrastructure-as-Code with Terraform (bonus: Terragrunt).
  • Strong Kubernetes operations background (GKE at scale, reliability practices, upgrades, scaling).
  • Experience with Cloudflare (WAF/DNS/edge basics; Workers/CDN is a plus).
  • Production observability experience with Datadog (or comparable), ideally including APM/logging.
  • Strong scripting/automation skills and a reliability-first mindset.

Preferred Qualifications
----------------------------

  • Experience in game dev or similarly bursty high-load consumer products.
  • Familiarity with SOC 2 / PCI-DSS audits and security architecture requirements.
  • Service mesh experience (e.g., Cloud Service Mesh) in production.
  • Mature SRE practices: error budgets, on-call maturity, runbooks, proactive incident prevention.

What Success Looks Like
---------------------------

  • Platform consistently meets or exceeds SLA/SLO targets under bursty high load.
  • Incidents are detected early, mitigated quickly, and don’t repeat due to strong postmortem follow-through.
  • Scaling events (10–50×) are routine rather than heroic.
  • Cloud spend is transparent, controlled, and optimized without harming reliability.
  • Engineering teams ship faster with safer, smoother CI/CD and fewer infrastructure bottlenecks.

Why Join Us
---------------

  • Cloud-only infrastructure (GCP) with meaningful scale and real reliability ownership.
  • Small team (15–20 engineers) with high autonomy and fast decision-making.
  • Direct impact on platform stability, scaling, and cost efficiency.
  • Opportunity to shape SRE culture, tooling, and operational standards in a fast-growing startup.

Aghanim helps game developers achieve financial and creative independence by providing the solutions they need to launch, run, and grow their businesses.
