k-ID - Lead Site Reliability Engineer

Singapore | + Equity | 2mo ago

In Office, Staff, APAC, Cloud Computing, Site Reliability Engineer, Go, Python, TypeScript, AWS, Kubernetes

Requirements

• 7 or more years of experience in site reliability engineering, infrastructure engineering, platform engineering, or software engineering with significant production ownership
• Strong experience operating production systems in AWS
• Strong hands-on experience with Kubernetes, containerized services, and modern infrastructure tooling
• Experience building and improving observability across metrics, logs, tracing, alerting, and service health
• Deep understanding of distributed systems, service failure modes, traffic management, capacity planning, and recovery design
• Experience designing or running incident response programs, on-call operations, escalation frameworks, and post-incident review processes
• Experience leading or managing NOC, production operations, or support functions in a high-availability environment
• Strong experience with infrastructure as code, such as Terraform
• Experience improving CI and CD workflows, release safety, rollback practices, and change management
• Ability to write code or automation in one or more languages such as Go, Python, or TypeScript
• Strong written and verbal communication skills, especially in high-pressure operational settings
• Experience working in fast-moving startup environments is strongly preferred

Responsibilities

• Own the reliability and operational health of k-ID’s production systems and critical services
• Lead the NOC function, including shift structure, escalation paths, incident handling standards, readiness processes, and operational reporting
• Act as the senior escalation point for major incidents and serve as incident commander for high-severity events when needed
• Design and improve monitoring, alerting, and operational tooling so the NOC can detect issues early and respond effectively
• Drive root cause analysis and post-incident review practices that produce real corrective action rather than superficial summaries
• Partner with engineering teams to improve system resilience, deployment safety, service ownership, and production readiness
• Identify systemic risks across infrastructure, services, dependencies, and operational processes, then drive plans to reduce them
• Improve platform performance, availability, and recovery time through architecture changes, better automation, and stronger operating discipline
• Build and maintain runbooks, readiness checklists, service health standards, and escalation playbooks across the organization
• Help define service level objectives, operational metrics, and reliability targets that align with business needs
• Support and mentor senior NOC engineers and other operations team members, helping raise technical depth and decision quality across the function
• Contribute hands-on to infrastructure and reliability engineering work where needed, especially in high-leverage areas

Benefits

• A competitive startup salary aligned with experience and market benchmarks
• Employee Stock Ownership Plan, so you participate directly in the long-term upside of the company

HEALTH AND WELLBEING

• Comprehensive family health coverage, including medical, dental, and vision benefits
• Mental health and wellness support

PROFESSIONAL DEVELOPMENT

• Hands-on exposure to key clients in a scaling global tech company
• Opportunities for continuous learning through real ownership rather than formal training alone
• Direct collaboration with the Founders and the tech leadership team

CULTURE AND WAYS OF WORKING

• A collaborative, inclusive, and low-politics work environment
• Flexible, trust-based working culture shaped by a US startup operating model
• A mission-driven company focused on improving online experiences for kids and teens globally

Applicants Privacy Policy: https://k-id.com/job-applicants-privacy-notice