bespokelabs - DevOps / Site Reliability Engineer

Remote - Europe, plus equity (posted 3d ago)

Remote, Mid, EMEA, Cloud Computing, Site Reliability Engineer, AWS, Kubernetes, Terraform, Go, Python, Pulumi, CDK, Technical Writing, Prometheus, Grafana

Requirements

• 3–5 years in DevOps, SRE, or infrastructure engineering
• Strong AWS experience: EKS, EC2, RDS, S3, IAM
• Kubernetes: deployment, scaling, troubleshooting in production
• CI/CD pipelines: GitHub Actions, ArgoCD, or similar
• Infrastructure as Code: Terraform, Pulumi, or CDK
• Python or Go scripting
• Experience working in production environments with real users
• Comfort with ambiguity and ability to operate autonomously
• Experience supporting ML training workloads or GPU clusters
• Familiarity with distributed computing or large-scale data pipelines
• Prior work at an AI, ML, or data company
• Open-source contributions or published technical writing

Responsibilities

• Own cloud infrastructure on AWS: EC2, EKS, RDS, S3, IAM, VPC
• Manage Kubernetes clusters and container orchestration end-to-end
• Build and maintain CI/CD pipelines using GitHub Actions or similar
• Implement monitoring, alerting, and observability stacks (Prometheus, Grafana, or Datadog)
• Improve reliability, performance, and security of production systems
• Automate infrastructure with Terraform or similar IaC tools
• Debug and resolve issues across complex, distributed systems
• Participate in design reviews and help raise the infrastructure bar

Benefits

• Competitive compensation and meaningful equity
• Direct impact on frontier AI model training and evaluation infrastructure
• Flexible, remote-friendly environment with low bureaucracy
• A small, high-caliber team with deep AI research expertise
• Health, wellness, and learning & development benefits