Kpler - Senior DevOps Engineer (Cloud & ML Infrastructure)
Requirements
• 5+ years of experience in cloud/platform engineering in production environments.
• Strong hands-on experience with Kubernetes in production.
• Experience with Infrastructure as Code (Terraform preferred).
• Strong knowledge of AWS (or equivalent cloud provider).
• Experience operating distributed systems in 24/7 environments.
• Strong operational mindset (SLOs, monitoring, incident management).
• Proven experience running ML/AI workloads in production.
• Exposure to LLM-based or compute-intensive systems.
• Experience optimizing cost and performance of high-compute infrastructure.
• Strong cloud platform engineering expertise (AWS preferred).
• Advanced Kubernetes operations in production (scaling, upgrades, workload isolation, troubleshooting).
• Solid Infrastructure as Code experience (Terraform or equivalent).
• Strong understanding of distributed systems and cloud-native architectures.
• Experience designing and operating CI/CD pipelines.
• Strong observability practices (monitoring, logging, alerting, SLO definition).
• Incident management and root cause analysis in 24/7 systems.
• Infrastructure cost optimization and performance tuning.
• Solid programming skills (Python or Go preferred).
• Practical experience supporting ML/AI or GPU-based workloads in production (highly valued).
Behavioural Competencies:
• Ownership & Accountability – Takes end-to-end responsibility for production systems and reliability outcomes.
• Systems Thinking – Understands architectural trade-offs and long-term impact of technical decisions.
• Structured Problem Solving Under Pressure – Maintains clarity and effectiveness during incidents and high-stakes situations.
• Collaboration & Autonomy – Communicates clearly in distributed teams, documents decisions effectively, and works autonomously while maintaining strong cross-team alignment.
• Bachelor’s or Master’s degree in Computer Science, Engineering, or equivalent practical experience.
• Strong programming skills (Python or Go preferred).
• Solid understanding of cloud-native architecture and reliability engineering principles.

We are a dynamic company dedicated to nurturing connections and innovating solutions to tackle market challenges head-on. If you thrive on customer satisfaction and turning ideas into reality, then you’ve found your ideal destination. Are you ready to embark on this exciting journey with us?
• We make things happen: We act decisively and with purpose, going the extra mile.
• We build together: We foster relationships and develop creative solutions to address market challenges.
• We are here to help: We are accessible and supportive to colleagues and clients with a friendly approach.

Our People Pledge
Don’t meet every single requirement? Research shows that women and people of color are less likely than others to apply if they feel like they don’t match 100% of the job requirements. Don’t let the confidence gap stand in your way; we’d love to hear from you! We understand that experience comes in many different forms and are dedicated to adding new perspectives to the team.
Responsibilities
• Design, operate, and improve Kpler’s cloud-native infrastructure (Kubernetes, networking, compute, storage).
• Contribute to Infrastructure as Code, CI/CD pipelines, and platform automation.
• Ensure high availability, reliability, and security of production systems.
• Improve observability, monitoring, alerting, and incident response processes.
• Reduce MTTR and failure rates through structured reliability improvements.
• Optimize infrastructure cost and performance, including compute-intensive workloads.
• Support and help standardize ML/GPU-based workloads within the existing platform model.
• Collaborate closely with ML engineers, data engineers, and backend teams to ensure production-grade deployments.
• Contribute to architectural decisions shaping the evolution of the platform.