wagey.gg
opsmill - Product Reliability Engineer | EU

Remote - European Union · 2mo ago
Remote · EMEA · Diagnostics · Site Reliability Engineer · Close · Documentation · Kubernetes · Performance Management · Rust

Requirements

• Strong software engineering fundamentals including design, debugging, testing, code review, and a focus on maintainable, production-quality code
• Practical Kubernetes expertise sufficient to debug real deployments: troubleshooting resources, networking, storage, RBAC, and platform-specific quirks across different distributions
• Deep troubleshooting instincts and observability experience using logs, metrics, and traces to diagnose issues quickly in complex, distributed systems
• Experience with at least one of Python, Go, or Rust for building tooling and contributing to product code (you don't need to be expert in all three)
• Excellent problem decomposition and communication skills: you can break down messy, ambiguous issues and clearly explain your findings and recommendations
• Self-directed remote work capability with strong async communication skills and the ability to operate independently in a fast-moving environment where priorities shift based on customer needs
• Collaborative mindset with experience partnering across product, engineering, and customer-facing teams to drive systematic improvements
• Experience with packaging and distribution systems (containers, Helm charts, installers) and managing upgrade/migration flows
• Background running CI/CD at scale including test parallelization, hermetic environments, and artifact management
• Familiarity with performance tooling such as profiling, load generation, and benchmark harnesses
• Previous experience in customer-facing technical roles like escalation engineering, support engineering, or solutions engineering
• Contributions to open source projects, especially in infrastructure, observability, or reliability tooling

Responsibilities

• Partner directly with customers and with our Solution Architecture/Customer Success teams on L2/L3 escalations: communicating findings, driving root-cause analysis, and resolving complex packaging, deployment, upgrade, and runtime issues across heterogeneous Kubernetes environments
• Drive issues to resolution by reproducing problems locally, isolating root causes, and coordinating fixes with engineering, then documenting learnings in crisp RCAs that become actionable improvements
• Build and maintain diagnostics tooling including support bundles, health checks, environment validators, and "what changed?" helpers that make future troubleshooting 10x faster
• Own the test automation infrastructure roadmap, improving CI stability, reducing flaky tests, and creating reproducible integration/e2e environments that catch issues before customers do
• Establish and maintain performance baselines and regression tests that serve as actionable gates, helping teams catch scale and latency issues early
• Improve installation and upgrade robustness by identifying recurring failure modes and eliminating them through product changes, automation, and guardrails
• Write production-quality code in Python, Go, or Rust for internal tooling and product improvements that directly enhance reliability
• Close the reliability feedback loop by systematically turning field issues into better tests, observability, documentation, and product defaults, measuring success through reduced time-to-resolution and fewer repeat incidents

Benefits

• We need someone who can operate in both worlds: diving deep on gnarly customer escalations while systematically eliminating entire classes of problems. You'll be the crucial bridge between "customer is blocked right now" and "this type of issue can't happen again." You'll build the diagnostics, tests, and automation that turn on-prem deployment chaos into predictable, debuggable, fixable reliability.
• The people: Work alongside world-class engineers who've built and scaled automation platforms in production. Daily technical challenges with smart colleagues who push you to grow.
• The product: Shape Infrahub based on real customer needs. Your input directly influences features, integrations, and roadmap priorities.
• The mission: We're making enterprise-grade infrastructure automation accessible to any organization. Open-source at the core, production-ready out of the box. This is a multi-year journey, not a quarterly sprint.
• The impact: You'll work with teams managing some of the world's most complex infrastructure deployments, solving problems that ripple across entire organizations.
• Our commitment to diversity and inclusion: OpsMill is committed to building a diverse and inclusive team. We believe different perspectives make us stronger and more innovative. We encourage applications from candidates of all backgrounds and experiences, and we're committed to providing an inclusive environment where everyone can do their best work.


