Fieldguide - Staff Site Reliability Engineer

Remote - San Francisco, California, United States · $210k - $247k + Equity · 1mo ago

Tags: Remote, Staff, NA, Cloud Computing, Site Reliability Engineer, Terraform, Coaching, AWS, Prometheus, Datadog

Requirements

• 10+ years of experience in software engineering, with a focus on distributed systems and production infrastructure.
• Extensive experience operating and scaling distributed systems in cloud environments, with a strong preference for AWS.
• Deep expertise in system reliability, scalability, and performance engineering at scale.
• Demonstrated experience implementing SLO-driven engineering practices and reliability frameworks.
• Strong background building and owning observability ecosystems (e.g., Datadog, Prometheus, Grafana).
• Proficiency with Infrastructure as Code tooling, particularly Terraform or equivalent.
• Proven experience leading incident management, post-mortems, and production operations.
• Strong software engineering fundamentals with the ability to contribute to and review complex codebases.
• Track record of technical leadership and cross-functional influence across engineering and product teams.
• Ability to balance tactical short-term needs with strategic long-term architectural improvements.
• Excellent written and verbal communication skills, with the ability to translate complex technical concepts for diverse audiences.
• Experience designing or operating multi-region and globally distributed systems.
• Deep expertise in distributed tracing and performance analysis across complex service architectures.
• Hands-on experience with database scalability and performance tuning at scale.
• Familiarity with compliance-driven engineering environments (e.g., SOC 2, FedRAMP, or similar frameworks).
• Experience applying chaos engineering practices to validate and improve system resilience.
• Experience building or scaling an SRE function within a high-growth organization.

Responsibilities

• Lead the design and evolution of highly scalable, fault-tolerant distributed systems across our cloud infrastructure.
• Define and drive adoption of SLOs, SLIs, and error budgets across engineering teams.
• Architect and continuously improve observability platforms (metrics, logging, tracing).
• Own reliability strategy and roadmap, proactively identifying risks and driving long-term improvements.
• Lead cross-team initiatives to improve system performance, scalability, and resilience.
• Establish and enforce best practices for incident response, on-call, and operational excellence.
• Drive root cause analysis and systemic improvements through blameless postmortems.
• Champion automation and reduction of operational toil.
• Guide capacity planning, load testing, and performance optimization efforts.
• Design and validate disaster recovery, failover strategies, and resilience testing.
• Mentor and coach engineers to elevate reliability engineering maturity.
• Partner with Staff engineers across the organization to drive meaningful change.
• Partner with leadership to align business goals with reliability investments.

Benefits

• Competitive compensation packages with meaningful ownership
• Wellness benefits, including a bundle of free therapy sessions
• Technology & Work from Home reimbursement
• Flexible work schedules
