fieldguide - Senior Site Reliability Engineer
Requirements
• 5+ years of experience in site reliability engineering, infrastructure, or a related software engineering discipline.
• Strong experience operating and scaling distributed systems in cloud environments, with AWS preferred.
• Hands-on experience building and managing observability platforms (e.g., Datadog, Prometheus, Grafana, CloudWatch).
• Experience defining SLOs/SLIs and leveraging them to inform and drive engineering priorities.
• Proficiency with Infrastructure as Code tooling, particularly Terraform or equivalent.
• Deep understanding of system performance, reliability patterns, and distributed system failure modes.
• Experience supporting production systems through on-call rotations and incident response.
• Proficiency in at least one programming or scripting language used for automation and tooling.
• Strong communication and collaboration skills, with the ability to work effectively across engineering and product teams.
• Experience implementing distributed tracing systems, such as OpenTelemetry or similar frameworks.
• Experience with capacity planning and performance benchmarking at scale.
• Familiarity with database performance tuning and observability across high-traffic systems.
• Exposure to regulated or compliance-heavy engineering environments (e.g., SOC 2, FedRAMP, or equivalent frameworks).
• Experience applying chaos engineering practices to proactively test and strengthen system resilience.
Responsibilities
• Design and operate highly scalable, fault-tolerant systems that support production workloads across a distributed cloud environment.
• Define and implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to guide reliability decisions.
• Build and improve observability systems (metrics, logs, tracing) to provide deep visibility into system behavior and performance.
• Lead efforts to improve system reliability and performance, including capacity planning, load testing, and performance tuning.
• Automate operational processes to reduce manual toil and improve system consistency and resilience.
• Partner with engineering teams to design systems with reliability and scalability built in from the start.
• Participate in and improve incident response, on-call practices, and post-incident reviews, focusing on root cause analysis and systemic improvements.
• Drive continuous improvement of system resilience, including disaster recovery and chaos testing.
• Establish best practices for monitoring, alerting, and incident management to ensure rapid detection and resolution of issues.
• Advocate for reliability-focused engineering culture, including blameless postmortems and operational excellence.
Benefits
• Competitive compensation packages with meaningful ownership
• Wellness benefits, including a bundle of free therapy sessions
• Technology & Work from Home reimbursement
• Flexible work schedules