Spotify - Senior Site Reliability Engineer

Remote - New York, NY · $164k - $235k + Equity · 1mo ago

Remote · Senior · NA · Health Insurance · Site Reliability Engineer · Java · Go · TypeScript · Python · GCP

Requirements

• You have 5+ years of hands-on experience operating cloud infrastructure (GCP and/or AWS), using Terraform and Kubernetes to run production systems at scale.
• You have practical experience, or a strong demonstrated interest, in operating LLM-based systems, RAG pipelines, or agentic workloads, and understand the reliability challenges of non-deterministic systems.
• You think in distributed systems first principles (consistency, availability, partition tolerance) and translate that thinking into pragmatic infrastructure decisions.
• You are proficient in at least one modern language (TypeScript, Java, Go, or Python) and comfortable navigating large, heterogeneous codebases, including environments where AI-generated PRs are common.
• You build automation and improve systems so that whole categories of operational issues disappear over time.
• You communicate complex infrastructure trade-offs clearly to both technical and non-technical stakeholders, and you write postmortems that lead to meaningful change.

Where You'll Be

• This role is based in New York, NY.
• We offer you the flexibility to work where you work best. There will be some in-person meetings, but the role still allows flexibility to work from home.

The United States base range for this position is $164,448–$234,926 USD, plus equity. The benefits available for this position include health insurance, six-month paid parental leave, 401(k) retirement plan, monthly meal allowance, 23 paid days off, paid flexible holidays, and paid sick leave. These ranges may be modified in the future.

At Spotify, we are passionate about inclusivity and making sure our entire recruitment process is accessible to everyone. We have ways to request reasonable accommodations during the interview process and to help with what you need. If you need accommodations at any stage of the application or interview process, please let us know; we’re here to support you in any way we can.

Responsibilities

• Own fleet reliability. Lead the reliability, security, and scalability strategy for Portal’s SaaS infrastructure, including the runtime environments that power our platform and LLM-driven agent workflows. Define SLOs, drive capacity planning, and ensure our systems meet the demands of a rapidly growing product.
• Architect for the agentic era. Design and evolve infrastructure on GCP and AWS using Terraform and infrastructure-from-code patterns. Shape how we structure environments for non-deterministic AI workloads, including sandboxing, resource isolation, cost governance, and security boundaries.
• Drive operational excellence. Evolve our incident management, on-call, and postmortem practices. Leverage AI assistants to accelerate root cause analysis and build increasingly self-healing capabilities into our production systems.
• Lead full-stack reliability. Operate across a modern web stack (TypeScript, React, Python). While the role is not frontend-heavy, you’ll diagnose and resolve issues across the stack and drive reliability improvements end-to-end.
• Mentor and multiply. Raise the reliability IQ of the broader engineering team. Establish SRE best practices, conduct production-readiness reviews, and mentor engineers on operational thinking.
• Shape the roadmap. Partner with engineering and product leadership to evolve our infrastructure in step with generative AI features. Translate operational insights into strategic input on the product roadmap.
