Grafana Labs - Senior Software Engineer
Requirements
• 6+ years engineering experience, 3+ in SRE/CRE/production engineering. Strong preference for those with formal customer reliability engineering experience.
• Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
• Experience operating multi-tenant systems in production.
• Strong experience designing and implementing SLOs.
• Experience with one or more programming languages (e.g. Go, Python, Java, etc.).
• Experience with Linux operating system internals, and some knowledge of networking, cloud storage, and scaling.
• Excellent problem-solving and troubleshooting skills.
• Experience with calmly and actively participating in blame-free Incident Response, following up on actions, and writing high-quality PIRs (Post Incident Reviews, a.k.a. post-mortem documents).
• Ability to reason about performance, scaling, and failure modes.
• Comfortable working within an engineering team where individuals are encouraged to have a strong sense of autonomy and self-direction.
• Ability to partner deeply with product engineering teams.
• We highly value those who are intellectually curious, who default to transparency, possess a high bias towards action, and who are also kind (this is important!).
Responsibilities
The SRE team is embedded within the Mimir and Loki squads and focuses on ensuring that Grafana Cloud’s database products deliver exceptional reliability for our highest-SLA customers. We seek a senior engineer operating at the intersection of customer needs, production systems, and product engineering. In this role, you will:
• Partner closely with product engineering squads (embedded model)
• Own production reliability for high-SLA and complex customer environments
• Design and implement automation to scale our reliability practices
• Ensure we meet our SLO targets for customers
• Define and evolve per-tenant SLOs and reliability models
• Proactively reduce SLO burn to prevent repeat incidents
• Serve as a primary escalation point and on-call for relevant incidents
• Lead customer-impacting incident response and post-incident reviews
• Contribute to design docs and code reviews
• Influence feature design to ensure production scalability and operability
• Build automation to eliminate toil where needed
• Improve alert quality and reduce noisy escalations
There is an on-call component to this role, and it is one that we take seriously. As a company, we hire globally (remote-first) to ensure our on-call remains healthy and aligned to approximately 12 daylight hours per day. You will work closely with counterparts in other regions to provide balanced coverage and shared ownership.
We invest heavily in developer productivity. You can use modern AI coding assistants as part of your daily workflow (your choice of tools, within security guidelines), backed by a company-funded usage budget so you can iterate quickly without unnecessary friction. We encourage pragmatic AI-assisted development: faster prototyping, test generation, refactors, documentation, and incident follow-ups, always paired with strong code review and quality standards. You’ll also have access to frontier models (e.g., GPT-Codex 5/3, Claude Opus 4.6, Gemini 3 Pro).
What Makes You a Great Fit:
• Hold regular 1:1s with your manager and colleagues
• Review and create SLOs, and proactively investigate ways to further reduce budget burn for those SLOs. This work can be self-directed or the result of learnings from incidents, and may include improvements to monitoring, automation, self-healing, auto-scaling, etc.
• Improve observability of customers within their environments
• Design and implement solutions to ensure the reliability and scalability of our environments can meet rapidly increasing demands
• Develop fault-tolerant design patterns, ensuring that we consider reliability at all stages of the service lifecycle
• Collaborate with our Engineering Leaders to help define and influence product strategy, roadmaps, and technical designs
• Participate in PR reviews and collaborate with other engineers on their Design Docs
• Teach others about Site Reliability Engineering and communicate best practices to be applied early in the development of new features and functionality
• Participate in Incident Response when applicable, including investigation through to resolution, PIR, and communication with customers via Bridge calls where necessary
Benefits
• 100% Remote, Global Culture - As a remote-only company, we bring together talent from around the world, united by a culture of collaboration and shared purpose.
• Scaling Organization - Tackle meaningful work in a high-growth, ever-evolving environment.
• Transparent Communication - Expect open decision-making and regular company-wide updates.
• Innovation-Driven - Autonomy and support to ship great work and try new things.
• Open Source Roots - Built on community-driven values that shape how we work.
• Empowered Teams - High trust, low ego culture that values outcomes over optics.
• Career Growth Pathways - Defined opportunities to grow and develop your career.
• Approachable Leadership - Transparent execs who are involved, visible, and human.
• Passionate People - Join a team of smart, supportive folks who care deeply about what they do.
• In-Person Onboarding - We want you to thrive from day 1 with your fellow new ‘Grafanistas’ to learn all about what we do and how we do it.
• Balance is Key - We operate a global annual leave policy of 30 days per annum. 3 days of your annual leave entitlement are reserved for Grafana Shutdown Days to allow the team to really disconnect. *We will comply with local legislation where applicable.