Cerebras Systems - Staff Site Reliability Engineer – Automation and Platform

Remote - California, United States; Sunnyvale, CA; Toronto, Ontario, Canada · 1mo ago

Requirements

• 8+ years in SRE, infrastructure engineering, or platform engineering, with a strong record of improving automation and reliability at large scale in FAANG, hyperscaler, or similarly demanding environments.
• Deep expertise operating large-scale heterogeneous clusters with a proprietary cloud control plane.
• Proven track record designing and delivering CI/CD or GitOps systems using Argo CD or similar tools, with strong safety and observability built in.
• Hands-on experience with observability systems such as Loki, Tempo, Mimir, and Prometheus.
• Ability to lead complex projects end to end, influence cross-functional stakeholders, and communicate technical direction clearly.

Nice-to-Haves

• Experience with Bazel or other large-scale build systems in production.
• Background in AI/ML inference systems, including model serving runtimes, GPU or wafer-scale orchestration, latency and accuracy SLOs, or drift monitoring.
• Prior work on predictive autoscaling, chaos engineering, or cost-aware capacity planning for compute-intensive workloads.

Responsibilities

• Define and implement a robust strategy for delivering and running software reliably and at scale across multiple datacenters and cloud-based solutions.
• Architect self-service platforms and internal tooling that let product teams, external customers, and cluster operators safely trigger and observe critical workflows with minimal handoffs.
• Define and evolve reliability practices for inference workloads, including SLOs and SLIs for latency, throughput, and accuracy stability; error budgets; blameless postmortems; chaos testing; and capacity forecasting across multi-datacenter and on-prem environments.
• Mentor mid-level SREs, support critical incident escalations, and use production pain points to prioritize the highest-leverage automation work.
• Measure and drive impact through clear metrics, including toil reduction, deployment velocity, SLO compliance, MTTR, and adoption of self-service workflows.

Benefits

People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we’ve reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:

• Build a breakthrough AI platform beyond the constraints of the GPU.
• Publish and open source their cutting-edge AI research.
• Work on one of the fastest AI supercomputers in the world.
• Enjoy job stability with startup vitality.
• Our simple, non-corporate work culture that respects individual beliefs.

Read our blog: Five Reasons to Join Cerebras in 2026.