k-ID - Senior Site Reliability Engineer

Remote - Singapore (2mo ago)

Remote, Senior, APAC, Cloud Computing, Site Reliability Engineer, Go, Python, TypeScript, AWS, Kubernetes

Requirements

• 5+ years of experience in infrastructure, platform engineering, site reliability engineering, or software engineering with meaningful production ownership
• Strong experience running production systems in AWS
• Strong hands-on experience with Kubernetes and container-based workloads
• Experience with infrastructure as code, preferably Terraform
• Experience designing and operating observability stacks using tools such as Prometheus, Alertmanager, Grafana, OpenTelemetry, or equivalent systems
• Strong understanding of distributed systems, failure modes, service reliability, and production debugging
• Experience building or improving CI and CD systems and release workflows in modern engineering environments
• Ability to write code and automation in one or more languages such as Go, Python, or TypeScript
• Good judgment during incidents and a practical mindset around tradeoffs, risk, and recovery
• Clear written and verbal communication skills with the ability to work effectively in a remote team
• Startup experience is a plus, especially in environments where systems and processes are still being built

Applicants Privacy Policy: https://k-id.com/job-applicants-privacy-notice

Responsibilities

• Own the reliability, availability, and performance of the systems behind k-ID’s platform and public APIs
• Design and improve scalable infrastructure on AWS and Kubernetes that can support high growth, uneven traffic, and global production workloads
• Build and maintain strong observability across logs, metrics, tracing, alerting, and service health so issues are caught early and investigated quickly
• Improve deployment safety through better CI and CD workflows, release controls, rollback paths, and environment consistency
• Drive incident response and production readiness practices, including runbooks, on-call hygiene, postmortems, capacity planning, and resilience testing
• Reduce operational toil by automating repetitive work and improving internal tooling for developers and operators
• Partner with engineering teams to embed reliability and operability into service design from the start, not after something fails in production
• Strengthen platform security and infrastructure hygiene across access controls, secrets handling, system hardening, and production safeguards
• Continuously improve system performance, resource efficiency, and cost awareness without compromising reliability