Backblaze External Website - Site Reliability Engineer II

Remote - Bangalore · 3w ago

Remote, Mid, APAC, Cloud Computing, Software, Site Reliability Engineer, Bash, Go, Python, Linux, Docker, Kubernetes, ITIL, GCP, AWS, Azure, Prometheus, ELK, Grafana, Terraform, Jenkins, Ansible, Documentation

Requirements

• Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
• 2–4 years of experience in site reliability, systems engineering, or operations.
• Exposure to large-scale, production-grade systems.
• Solid Linux systems administration and troubleshooting skills.
• Familiarity with service reliability concepts: monitoring, alerting, incident response, and root cause analysis.
• Proficiency in at least one scripting language (Python, Bash, or Go).
• Understanding of containers (Kubernetes, Docker) and microservices concepts.
• Knowledge of incident response and operational best practices.

Preferred Attributes

• Experience in a SaaS, service provider, or distributed systems environment.
• Familiarity with ITIL/OSS practices and SLOs/SLAs.
• Strong problem-solving skills and willingness to learn new technologies.
• Experience with cloud platforms (AWS, GCP, or Azure).
• Ability to work independently, take ownership, and drive projects from problem discovery through resolution.

Responsibilities

Service Reliability & Operations

• Support the availability and durability of critical services across production environments.
• Monitor service health using SLIs, SLOs, and error budgets, and escalate issues when thresholds are at risk.
• Participate in on-call rotations, incident response, and post-incident reviews to drive service improvements.
• Follow established ITIL/OSS processes (incident, change, problem, and capacity management).

Automation & Tooling

• Develop automation for common operational tasks, reducing manual intervention and toil.
• Contribute to monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK).
• Work with CI/CD pipelines, configuration management, and infrastructure as code tools (Terraform, Ansible, Jenkins).
• Write scripts (Bash, Python, Go, etc.) to improve system reliability and efficiency.

Collaboration

• Partner with engineering, product, and operations teams to support resilient system design and operations.
• Assist in capacity planning and disaster recovery exercises.
• Work with vendors and service providers to troubleshoot service issues and track SLA performance.
• Document systems, share learnings, and help grow a reliability-minded engineering culture.

Continuous Improvement

• Contribute to playbooks, runbooks, and operational documentation.
• Identify recurring issues and propose long-term improvements.
• Promote reliability-focused practices within development and operations teams.
