Backblaze External Website - Sr. Site Reliability Engineer

Remote - USA · $150k - $200k + Equity · 2mo ago

Remote, Senior, NA, Cloud Computing, Site Reliability Engineer, Principal, Go, Python, Documentation, Linux, Performance Management

Requirements

• Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
• 8+ years of progressive experience in site reliability, systems engineering, or operations.
• Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems.
• Expert-level Linux systems administration and advanced troubleshooting skills.
• Experience leading security-minded operations, with a focus on system-wide patching, hardening, and proactive vulnerability identification.
• Deep mastery of service reliability concepts, including advanced monitoring, complex alerting strategy, leading incident response, and in-depth root cause analysis.
• Advanced proficiency in at least one modern scripting/programming language (Python or Go strongly preferred).
• Expert knowledge of incident response methodologies and operational best practices.
• Proven experience designing and operating container orchestration (Kubernetes, Docker), along with a working knowledge of microservices concepts (required).
• Expert experience with HashiCorp products (Nomad, Vault, Terraform) in a production environment.

Preferred Attributes

• Deep familiarity with ITIL/OSS practices and experience defining/enforcing SLOs/SLAs.
• Exceptional problem-solving skills and a strong drive to learn and apply new, complex technologies.
• Advanced experience with cloud platforms (AWS, GCP, or Azure) in a production setting.

Responsibilities

Service Reliability & Operations

• Own and drive the availability, durability, and performance of critical services across all production environments.
• Lead and champion complex projects from problem discovery through complete, cross-functional resolution, demonstrating high-level technical ownership.
• Define, establish, and enforce service health standards, including working with engineering leadership to implement SLIs, SLOs, and error budget policies for multiple services.
• Lead critical incident response and post-incident reviews, translating findings into strategic, long-term service improvements and architectural changes.
• Mentor others and act as a subject matter expert in following and evolving established ITIL/OSS processes (incident, change, problem, and capacity management).

Automation & Tooling

• Design and architect scalable automation solutions to eliminate toil and improve the efficiency of operational tasks across the entire platform.
• Drive the strategic direction of monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK), and integrate them for comprehensive observability.
• Build, maintain, and secure advanced CI/CD pipelines, configuration management, and complex infrastructure as code solutions (Terraform, Ansible, Jenkins).
• Write production-grade code (Bash, Python, Go, etc.) to develop new reliability tools and enhance existing systems.

Collaboration

• Act as a principal partner to engineering, product, and operations teams, consulting on resilient system design, architecture, and operation.
• Lead and formalize the Production Readiness Review (PRR) process, ensuring robust operational handoff for all new services and features.
• Lead capacity planning and disaster recovery strategy across critical infrastructure components.
• Manage the relationship with vendors and service providers to troubleshoot systemic issues and ensure strict adherence to SLA performance.
• Drive the creation of high-quality documentation, proactively share advanced learnings, and cultivate a reliability-first engineering culture across teams.

Continuous Improvement

• Own the creation, maintenance, and dissemination of operational playbooks, runbooks, and detailed system documentation.
• Proactively identify systemic, recurring issues and architect and drive the implementation of long-term improvements and strategic design action plans.
• Be a leading voice in promoting and embedding reliability-focused practices within development and operations teams.

Benefits

• Healthcare for family, including dental and vision
• Competitive compensation and 401K
• RSU grants for full-time employees
• Flexible vacation policy
• Maternity & paternity leave
• MacBook Pro to use for work, plus a generous stipend to personalize your workstation
• Childcare bonus (human children only)
• Fertility treatment and support
• Learning & development program
• Commuter benefits
• Culture that supports a healthy work-life balance

To provide greater transparency to candidates, we share base pay ranges for all US-based job postings regardless of state. We set standard base pay ranges for all roles based on function, level, and country location, benchmarked against similar-stage growth companies. Final offer amounts are determined by multiple factors, including candidate location, skills, depth of work experience, and relevant licenses/credentials, and may vary from the amounts listed below.

The expected salary range for this role is $150,000 - $200,000.
