bespokelabs - DevOps / Site Reliability Engineer
Requirements
• 3–5 years in DevOps, SRE, or infrastructure engineering
• Strong AWS experience: EKS, EC2, RDS, S3, IAM
• Kubernetes: deployment, scaling, troubleshooting in production
• CI/CD pipelines: GitHub Actions, ArgoCD, or similar
• Infrastructure as Code: Terraform, Pulumi, or CDK
• Python or Go scripting
• Experience working in production environments with real users
• Comfort with ambiguity and ability to operate autonomously
• Experience supporting ML training workloads or GPU clusters
• Familiarity with distributed computing or large-scale data pipelines
• Prior work at an AI, ML, or data company
• Open-source contributions or published technical writing
Responsibilities
• Own cloud infrastructure on AWS: EC2, EKS, RDS, S3, IAM, VPC
• Manage Kubernetes clusters and container orchestration end-to-end
• Build and maintain CI/CD pipelines using GitHub Actions or similar
• Implement monitoring, alerting, and observability stacks (Prometheus, Grafana, or DataDog)
• Improve reliability, performance, and security of production systems
• Automate infrastructure with Terraform or similar IaC tools
• Debug and resolve issues across complex, distributed systems
• Participate in design reviews and help raise the infrastructure bar
Benefits
• Competitive compensation and meaningful equity
• Direct impact on frontier AI model training and evaluation infrastructure
• Flexible, remote-friendly environment with low bureaucracy
• A small, high-caliber team with deep AI research expertise
• Health, wellness, and learning & development benefits