Cerebras Systems

Cerebras Systems - Staff Site Reliability Engineer – Automation and Platform

Remote - California, United States; Sunnyvale, CA; Toronto, Ontario, Canada · 1mo ago
Remote · Staff · NA · Site Reliability Engineer · Loki · Prometheus · Plane · SAFe

Requirements

• 8+ years in SRE, infrastructure engineering, or platform engineering, with a strong record of improving automation and reliability at large scale in FAANG, hyperscaler, or similarly demanding environments.
• Deep expertise operating large-scale heterogeneous clusters with a proprietary cloud control plane.
• Proven track record designing and delivering CI/CD or GitOps systems using Argo CD or similar tools, with strong safety and observability built in.
• Hands-on experience with observability systems such as Loki, Tempo, Mimir, and Prometheus.
• Ability to lead complex projects end to end, influence cross-functional stakeholders, and communicate technical direction clearly.

Nice-to-Haves

• Experience with Bazel or other large-scale build systems in production.
• Background in AI/ML inference systems, including model serving runtimes, GPU or wafer-scale orchestration, latency and accuracy SLOs, or drift monitoring.
• Prior work on predictive autoscaling, chaos engineering, or cost-aware capacity planning for compute-intensive workloads.

Responsibilities

• Define and implement a robust strategy for delivering and running software reliably and at scale across multiple datacenters and cloud-based solutions.
• Architect self-service platforms and internal tooling that let product teams, external customers, and cluster operators safely trigger and observe critical workflows with minimal handoffs.
• Define and evolve reliability practices for inference workloads, including SLOs and SLIs for latency, throughput, and accuracy stability; error budgets; blameless postmortems; chaos testing; and capacity forecasting across multi-datacenter and on-prem environments.
• Mentor mid-level SREs, support critical incident escalations, and use production pain points to prioritize the highest-leverage automation work.
• Measure and drive impact through clear metrics, including toil reduction, deployment velocity, SLO compliance, MTTR, and adoption of self-service workflows.

Benefits

People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we’ve reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:

• Build a breakthrough AI platform beyond the constraints of the GPU.
• Publish and open source their cutting-edge AI research.
• Work on one of the fastest AI supercomputers in the world.
• Enjoy job stability with startup vitality.
• Join a simple, non-corporate work culture that respects individual beliefs.

Read our blog: Five Reasons to Join Cerebras in 2026.
