Lead DevOps Engineer
Requirements
• Legal authorization to work in the UK.
• Track record in a senior/lead DevOps, SRE, or Platform role, including mentorship of engineers.
• Expert‑level Terraform (including importing existing resources and taming legacy estates).
• Deep, hands‑on experience with AWS (ECS, RDS, ElastiCache, Lambda, ALB, WAF, S3, CloudFront, EventBridge, CloudWatch) and production networking/IAM.
• Proven design and maintenance of CI/CD pipelines (GitHub Actions) and container workflows (Docker, ECS Fargate or Kubernetes).
• Proficiency with modern observability/monitoring (Datadog, CloudWatch, Sentry, PagerDuty), incident response, and incident retrospectives.
• Strong background in cloud security principles and practical hardening.
• Ability to define and execute a technical roadmap and communicate with both technical and non‑technical stakeholders.
What we would like you to have:
• Experience with GCP, Azure, Alibaba Cloud, and managed platforms (Databricks, Vercel).
• Familiarity with SST/CDK, Next.js/Vercel delivery flows, and performance considerations for web platforms.
• VPN/zero‑trust networking (e.g., Tailscale); perimeter hardening and WAF tuning.
Please submit a resume and cover letter, and be prepared to provide three (3) professional references upon request.
Responsibilities
• Leadership: Lead and mentor the DevOps team; set technical direction for infrastructure and security; foster a culture of ownership, reliability, and continuous improvement.
• Roadmap Ownership & Strategy: Define, own, and drive the Infrastructure & Security Roadmap, prioritizing infrastructure ownership, in‑depth monitoring, disaster recovery, developer experience, and security hardening.
• Infrastructure as Code (IaC): Inventory unmanaged resources and bring them under Terraform (and CDK/SST where required); create reusable modules and guardrails; institute code reviews and change management.
• Platform Operations (AWS‑first): Design and operate services built on ECS (Fargate), ECR, RDS, ElastiCache, S3, ALB/CloudFront, WAF, Lambda, EventBridge, CloudWatch; improve networking, IAM, and resilience.
• Resilience & Reliability: Modernize critical workloads; design and run disaster recovery drills; automate backup and restore; codify RPO/RTO targets and runbooks; lead incident response and postmortems.
• Observability & On‑Call: Standardize monitoring/alerting with Datadog, CloudWatch, Sentry, PagerDuty; implement SLOs and noise‑reduction baselines; maintain a humane on‑call rotation.
• Security Hardening: Mature the configuration and rollout of tools like Jit and CrowdStrike; improve firewall/WAF rules; enforce secrets management and least‑privilege access; champion threat modeling and automated scanning.
• Collaboration & Governance: Serve as a key technical voice on the Architecture Review Board; partner with Product and Engineering to align solutions with operational standards and business goals.