Backblaze External Website - Site Reliability Engineer II
Requirements
• Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
• 2–4 years of experience in site reliability, systems engineering, or operations.
• Exposure to large-scale, production-grade systems.
• Solid Linux systems administration and troubleshooting skills.
• Familiarity with service reliability concepts: monitoring, alerting, incident response, and root cause analysis.
• Proficiency in at least one scripting language (Python, Bash, or Go).
• Understanding of containers (Kubernetes, Docker) and microservices concepts.
• Knowledge of incident response and operational best practices.
Preferred Attributes
• Experience in a SaaS, service provider, or distributed systems environment.
• Familiarity with ITIL/OSS practices and SLOs/SLAs.
• Strong problem-solving skills and willingness to learn new technologies.
• Experience with cloud platforms (AWS, GCP, or Azure).
• Ability to work independently, take ownership, and drive projects from problem discovery through resolution.
Responsibilities
Service Reliability & Operations
• Support the availability and durability of critical services across production environments.
• Monitor service health using SLIs, SLOs, and error budgets, and escalate issues when thresholds are at risk.
• Participate in on-call rotations, incident response, and post-incident reviews to drive service improvements.
• Follow established ITIL/OSS processes (incident, change, problem, and capacity management).
Automation & Tooling
• Develop automation for common operational tasks, reducing manual intervention and toil.
• Contribute to monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK).
• Work with CI/CD pipelines, configuration management, and infrastructure as code tools (Terraform, Ansible, Jenkins).
• Write scripts (Bash, Python, Go, etc.) to improve system reliability and efficiency.
Collaboration
• Partner with engineering, product, and operations teams to support resilient system design and operations.
• Assist in capacity planning and disaster recovery exercises.
• Work with vendors and service providers to troubleshoot service issues and track SLA performance.
• Document systems, share learnings, and help grow a reliability-minded engineering culture.
Continuous Improvement
• Contribute to playbooks, runbooks, and operational documentation.
• Identify recurring issues and propose long-term improvements.
• Promote reliability-focused practices within development and operations teams.