© 2026 Dominic Morris. All rights reserved.

Senior Software Engineer - Grafana Databases, SRE | Germany | Remote

Grafana Labs · Remote - UK · 19h ago
Remote · Senior · EMEA · Cloud Computing · Commercial Real Estate · Senior Software Engineer · AWS · GCP · Kubernetes · Azure · Terraform

Requirements

  • 6+ years engineering experience, 3+ in SRE/CRE/production engineering. Strong preference for those with formal customer reliability engineering experience.
  • Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
  • Experience operating multi-tenant systems in production.
  • Strong experience designing and implementing SLOs.
  • Experience with one or more programming languages (e.g. Go, Python, Java).
  • Experience with Linux operating system internals, and some knowledge of networking, cloud storage, and scaling.
  • Excellent problem-solving and troubleshooting skills.
  • Experience participating calmly and actively in blame-free incident response, following up on actions, and writing high-quality PIRs (Post Incident Reviews, a.k.a. post-mortem documents).
  • Ability to reason about performance, scaling, and failure modes
  • Comfortable working within an engineering team where individuals are encouraged to have a strong sense of autonomy and self-direction.
  • Ability to partner deeply with product engineering teams
  • We highly value those who are intellectually curious, who default to transparency, possess a high bias towards action, and who are also kind (this is important!)

Responsibilities

The SRE team is embedded within the Mimir and Loki squads and focuses on ensuring that Grafana Cloud’s database products deliver exceptional reliability for our highest-SLA customers. We seek a senior engineer operating at the intersection of customer needs, production systems, and product engineering. In this role, you will:
  • Partner closely with product engineering squads (embedded model)
  • Own production reliability for high-SLA and complex customer environments
  • Design and implement automation to scale our reliability practices
  • Ensure we meet our SLO targets for our customers
  • Define and evolve per-tenant SLOs and reliability models
  • Proactively reduce SLO burn to prevent repeat incidents
  • Serve as a primary escalation point and on-call for relevant incidents
  • Lead customer-impacting incident response and post-incident reviews
  • Contribute to design docs and code reviews
  • Influence feature design to ensure production scalability and operability
  • Build automation to eliminate toil where needed
  • Improve alert quality and reduce noisy escalations
  • Of course, there is an on-call component to this role and one that we take seriously. As a company, we hire globally (remote-first) to ensure our on-call remains healthy and aligned to approximately 12 daylight hours per day. You will work closely with counterparts in other regions to provide balanced coverage and shared ownership.
  • We invest heavily in developer productivity. You can use modern AI coding assistants as part of your daily workflow (your choice of tools, within security guidelines), backed by a company-funded usage budget so you can iterate quickly without unnecessary friction. We encourage pragmatic AI-assisted development: faster prototyping, test generation, refactors, documentation, and incident follow-ups—always paired with strong code review and quality standards. You’ll also have access to frontier models (e.g., GPT-Codex 5/3, Claude Opus 4.6, Gemini 3 Pro).
  • What Makes You a Great Fit:
  • Regular 1:1s with your manager and colleagues
  • Reviewing and creating SLOs, proactively investigating ways in which we can further reduce budget burn for those SLOs, which can be self-directed or as the result of learnings from incidents, and may include improvements to monitoring, automation, increasing self-healing, auto-scaling, etc.
  • Improve observability of customers within their environments
  • Designing and implementing solutions to ensure reliability and scalability of our environments can meet rapidly increasing demands
  • Develop fault-tolerant design patterns ensuring that we are considering reliability at all stages of the service lifecycle.
  • Collaborating with our Engineering Leaders to help define and influence product strategy, roadmaps and technical designs
  • Participate in PR reviews and collaborate with other engineers on their design docs
  • Teach others about Site Reliability Engineering and communicate best practices to be applied early in development of new features and functionality
  • Participate in Incident Response when applicable, including investigation through to resolution, PIR, and communication with customers via Bridge calls where necessary

Benefits

  • 100% Remote, Global Culture - As a remote-only company, we bring together talent from around the world, united by a culture of collaboration and shared purpose.
  • Scaling Organization – Tackle meaningful work in a high-growth, ever-evolving environment.
  • Transparent Communication – Expect open decision-making and regular company-wide updates.
  • Innovation-Driven – Autonomy and support to ship great work and try new things.
  • Open Source Roots – Built on community-driven values that shape how we work.
  • Empowered Teams – High trust, low ego culture that values outcomes over optics.
  • Career Growth Pathways – Defined opportunities to grow and develop your career.
  • Approachable Leadership – Transparent execs who are involved, visible, and human.
  • Passionate People – Join a team of smart, supportive folks who care deeply about what they do.
  • In-Person onboarding - We want you to thrive from day 1 with your fellow new ‘Grafanistas’ to learn all about what we do and how we do it.
  • Balance is Key - We operate a global annual leave policy of 30 days per annum. 3 days of your annual leave entitlement are reserved for Grafana Shutdown Days to allow the team to really disconnect. *We will comply with local legislation where applicable.
