Okta, Inc. - Staff Site Reliability Engineer
Requirements
• 8+ years in SRE, DevOps, or Infrastructure Engineering roles.
• 3–5 years of experience with Kubernetes (EKS/GKE) and related ecosystem tools (Helm, Karpenter, etc.) in production.
• 3–5 years of experience with AWS and GCP.
• 3–5 years using Terraform to manage multi-cloud infrastructure.
• 5+ years of coding experience in Python, Go, or similar languages.
• Proven track record leading high-impact projects, specifically migration projects (ECS → EKS/GKE) and enabling microservice architectures.
• Experience implementing SLOs/SLIs, performing root cause analyses, and improving operational resilience.
• Prior work in SaaS or high-scale, cloud-native environments is a strong plus.
• Strong Linux and security fundamentals.
• Bachelor’s degree in Computer Science or equivalent hands-on experience.
• Supporting Your Well-Being
• Driving Social Impact
• Developing Talent and Fostering Connection + Community
• We are intentional about connection. Our global community, spanning over 20 offices worldwide, is united by a drive to innovate. Your journey begins with an immersive, in-person onboarding experience designed to accelerate your impact and connect you to our mission and team from day one.
Responsibilities
• Design, build, and operate highly scalable, reliable, and secure infrastructure powering our production systems across AWS and GCP.
• Lead major reliability and modernization initiatives, including container platform migrations (e.g., ECS to EKS/GKE) and microservice enablement across multi-cloud environments.
• Serve as a technical authority in Kubernetes (EKS and GKE), cloud infrastructure (AWS and GCP), and modern CI/CD practices (GitOps, automation pipelines).
• Partner with development teams to architect and enable microservice-based applications, ensuring production readiness, scalability, and observability.
• Implement and manage infrastructure as code (Terraform, Ansible) to automate provisioning, scaling, and configuration management across multiple cloud providers.
• Drive improvements in observability, performance, and cost efficiency through robust monitoring, logging, and alerting systems that span AWS and GCP.
• Champion SRE best practices: defining SLOs/SLIs, conducting blameless postmortems, and continuously improving incident response.
• Lead complex technical projects from conception to completion, managing timelines and technical dependencies across teams.
• Mentor engineers across teams, fostering a culture of reliability, automation, and continuous learning.
• Collaborate with security and compliance partners to ensure infrastructure adheres to best practices and standards (e.g., IAM Federation, Workload Identity).
• Participate in the on-call rotation, using incidents as learning opportunities to enhance systems and processes.
Benefits
• This work requires a relentless drive to solve complex challenges with real-world stakes.