kraken123 - Senior Platform Engineer - Product Reliability

London / Manchester / Berlin / Paris · 2mo ago

In Office Senior EMEA Cloud Computing Platform Engineer Performance Management AWS GCP Azure Terraform

Responsibilities

• Lead reliability improvements across multiple product services and domains
• Identify systemic reliability risks and drive cross-team initiatives to address them
• Partner closely with product and engineering teams to influence system design, operational practices, and prioritisation
• Improve observability, incident management, and service performance across critical systems
• Lead incident investigations and follow-up, ensuring root causes are addressed and long-term fixes are driven through to completion
• Help standardise incident management practices, tooling usage, and operational guardrails across teams
• Contribute hands-on through debugging, code changes, automation, and system design
• Identify common reliability patterns and implement scalable solutions that can be reused across teams
• Establish and promote best practices for building and operating reliable systems
• Support broader platform engineering work where needed across infrastructure, release, developer enablement, and resilience initiatives
• Help solve complex and ambiguous problems in a fast-moving environment

## What You'll Have:

• Strong experience operating and improving production systems at scale
• Proven track record of leading reliability or platform initiatives across teams, with measurable impact
• Deep understanding of distributed systems and common failure modes
• Strong debugging and problem-solving skills in complex production environments
• Hands-on experience with cloud infrastructure, with AWS preferred; strong GCP or Azure experience is also valued
• Experience working with infrastructure tooling such as Terraform
• Ability to read, write, review, and improve production-grade code; Python experience is highly valued
• Experience with incident management tooling such as Rootly, PagerDuty, Incident.io, or Datadog
• Experience leading incident investigations, post-incident follow-up, and long-term remediation
• Strong communication skills, including explaining technical concepts and trade-offs clearly to different audiences
• Experience working cross-functionally with product and engineering teams in distributed environments
• Strong interpersonal skills, empathy, and the ability to influence teams constructively
• Comfort operating with high autonomy in small, accountable teams
• Comfortable working in a Kanban environment

## What Will Help:

• Kubernetes and container orchestration
• Observability tooling such as Datadog, Grafana, and Prometheus
• CI/CD and release engineering practices
• Event-driven systems and messaging platforms
• Experience in SRE, platform engineering, or other reliability-focused roles
• Familiarity with PostgreSQL or Amazon RDS at scale
• Experience defining reusable standards, guardrails, or golden paths across teams

Kraken is a certified Great Place to Work in France, Germany, Spain, Japan and Australia. In the UK we are one of the Best Workplaces on Glassdoor with a score of 4.5 and in Germany we rate 4.7 on Kununu as a Top Company. Check out our Welcome to the Jungle site (FR/EN) to learn more about our teams and culture.

Are you ready for a career with us? We want to ensure you have all the tools and environment you need to unleash your potential. If you have any specific accommodations or a unique preference, please contact us at [email protected] and we'll do what we can to customise your interview process for comfort and maximum magic!
