Grafana Labs - Senior Software Engineer - Grafana Databases, Managed Services | Ireland | Remote

Remote - Ireland · 1mo ago

Remote, Senior, EMEA, Cloud Computing, Senior Software Engineer, Senior Data Engineer, Grafana, Kafka, Kubernetes, Snowflake, AWS

Requirements

• 6+ years of engineering experience, including meaningful time in SRE, platform engineering, production engineering, infrastructure engineering, or distributed systems roles.
• Experience operating distributed systems in production (e.g., streaming systems, analytical databases, large-scale storage backends). Examples of these include Kafka, Redpanda, WarpStream, Postgres, ClickHouse, Snowflake, or Cassandra.
• Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
• Solid understanding of distributed systems design and large-scale system trade-offs.
• Proficiency in at least one programming language (Go preferred, but not required).
• Working knowledge of Linux internals, networking, cloud storage, and performance/scaling behavior.
• Experience participating in blameless incident response and writing high-quality post-incident reviews.
• Clear communicator who can collaborate across teams and work autonomously.
• Curious, pragmatic, action-oriented, and kind (this is important!)

Responsibilities

• As a Senior Engineer on Managed Services, you will take ownership of running these systems in production. This involves:
  • Operating and evolving 100+ multi-cloud streaming clusters and related database infrastructure
  • Diagnosing and eliminating cross-layer failure modes (e.g., object storage latency, noisy neighbors, control-plane bottlenecks, query performance regressions)
  • Designing safe upgrade and rollout strategies at scale
  • Improving observability, automation, and operational ergonomics
  • Partnering closely with database and platform teams to ensure safe scaling, partitioning, consumer fan-out, and query performance
  • Working directly with distributed systems behavior, Kubernetes scheduling dynamics, storage engines, compression trade-offs, etc.
  • Serving as a primary escalation point and on-call for relevant incidents
  • Owning the relationship with all system vendors, including WarpStream Labs and others.
• As we are remote-first and our engineering organization is largely remote, we provide guidance and meet regularly using video calls, so an independent attitude and good communication skills are a must.
• This role blends deep distributed systems work with the opportunity to influence how the team approaches reliability, scaling, and operational excellence.
• We invest heavily in developer productivity. You can use modern AI coding assistants as part of your daily workflow (your choice of tools, within security guidelines), backed by a company-funded usage budget so you can iterate quickly without unnecessary friction. We encourage pragmatic AI-assisted development: faster prototyping, test generation, refactors, documentation, and incident follow-ups, always paired with strong code review and quality standards. You’ll also have access to frontier models (e.g., GPT-Codex 5/3, Claude Opus 4.6, Gemini 3 Pro).
• Of course, there is an on-call component to this role, and it is one that we take seriously. As a company, we hire globally (remote-first) to ensure our on-call remains healthy and aligned to approximately 12 daylight hours per day. You will work closely with counterparts in other regions to provide balanced coverage and shared ownership.
• What Makes You a Great Fit:
  • Regular 1:1s with your manager and close collaboration with teammates across regions
  • Reviewing and defining SLOs for shared database infrastructure, proactively reducing error budgets through improvements to monitoring, automation, scaling strategies, and system design
  • Improving the diagnosability of core streaming and database systems in production, where possible.
  • Implementing solutions that ensure reliability, scalability, and performance of high-throughput, multi-cloud infrastructure
  • Developing fault-tolerant patterns that account for distributed system realities such as storage latency, partition imbalance, noisy neighbors, and control-plane dependencies
  • Planning and executing safe upgrades and rollouts across dozens of production clusters
  • Collaborating with database and platform engineering leaders to influence architecture, roadmap priorities, and long-term strategy
  • Participating in PR review and contributing to design documents, automation, tooling, and code improvements that reduce operational risk
  • Sharing best practices and distributed systems knowledge with partner teams
  • Participating in incident response, from investigation through resolution and post-incident reviews (PIR)

Benefits

• 100% Remote, Global Culture – As a remote-only company, we bring together talent from around the world, united by a culture of collaboration and shared purpose.
• Scaling Organization – Tackle meaningful work in a high-growth, ever-evolving environment.
• Transparent Communication – Expect open decision-making and regular company-wide updates.
• Innovation-Driven – Autonomy and support to ship great work and try new things.
• Open Source Roots – Built on community-driven values that shape how we work.
• Empowered Teams – High trust, low ego culture that values outcomes over optics.
• Career Growth Pathways – Defined opportunities to grow and develop your career.
• Approachable Leadership – Transparent execs who are involved, visible, and human.
• Passionate People – Join a team of smart, supportive folks who care deeply about what they do.
• In-Person onboarding – We want you to thrive from day 1 with your fellow new ‘Grafanistas’ to learn all about what we do and how we do it.
• Balance is Key – We operate a global annual leave policy of 30 days per annum. 3 days of your annual leave entitlement are reserved for Grafana Shutdown Days to allow the team to really disconnect. *We will comply with local legislation where applicable.
