Runpod, Inc. - Site Reliability Engineer
Requirements
• 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
• Strong Linux systems and networking expertise
• Experience managing containerized production systems
• Strong understanding of distributed systems and failure modes
• Experience defining and managing SLIs/SLOs
• Proven incident response and postmortem leadership experience
• Strong scripting or programming skills
• Experience with monitoring and alerting systems
• Excellent written communication skills
• Successful completion of a background check
Preferred:
• Experience with GPU infrastructure or AI/ML platforms
• Experience improving reliability in high-growth or large-scale environments
• Familiarity with GPU observability tooling
• Experience with Infrastructure as Code
• Experience working in startup environments
• Experience building internal reliability platforms or frameworks
What You’ll Receive:
• Competitive base pay for this position ranges from $150,000 to $200,000 USD. This salary range may include several career levels at Runpod and will be narrowed during the interview process based on a number of factors, including the candidate’s experience, qualifications, and location
• Meaningful equity in a fast-growing company: everyone on the team receives stock options. Your impact drives our growth, and you share in the upside
• Generous medical, dental, and vision plans
• Flexible PTO: take the time you need to recharge
• Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
Responsibilities
Reliability Engineering
• Define and implement SLIs/SLOs for critical services
• Lead incident response and coordinate cross-team mitigation efforts
• Conduct blameless postmortems and ensure corrective actions are completed
• Perform production readiness reviews for new services and features
• Identify systemic risks and drive preventative improvements
Observability & Monitoring
• Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
• Improve signal-to-noise ratio in alerts and reduce alert fatigue
• Build internal tooling for reliability tracking and reporting
• Improve visibility into GPU performance and distributed systems health
Automation & Toil Reduction
• Automate recurring operational workflows
• Build tools and scripts (Python, Go, Bash) to eliminate manual processes
• Improve deployment safety through automation and guardrails
• Strengthen CI/CD reliability and release processes
Cross-Functional Reliability Advocacy
• Partner with engineering teams to improve system resilience
• Provide guidance on fault tolerance, scalability, and failure handling
• Contribute to architectural discussions with a reliability-first mindset