Lead Software Engineer, DevOps domain (Bangkok-based, relocation provided)

Agoda · Bangkok, Thailand · 1mo ago

In Office · Staff · APAC · Fintech · Hotels · Travel · E-commerce · DevOps Engineer · Software Engineer · Senior DevOps Engineer · Go · Rust · Java · Python · Kubernetes

Requirements

• Demonstrated ownership of architecting, building, and operating mission-critical production systems, including making long-term technical and reliability trade-off decisions.
• Proven ability to lead and coordinate complex cross-team initiatives, setting technical direction and aligning stakeholders to deliver outcomes at organizational scale.
• Expertise in one or more programming languages (e.g., Go, Python, Rust, Java) with a solid understanding of distributed systems fundamentals (concurrency, backpressure, timeouts/retries, idempotency, circuit breaking).
• Deep hands-on experience with the Kubernetes ecosystem, service mesh technologies (e.g., Istio), and Kubernetes deployment workflows (e.g., Argo CD).
• Observability and monitoring expertise using Prometheus, Grafana, and common logging/telemetry stacks (e.g., OpenTelemetry), with an understanding of signal quality, scalability, and cost trade-offs.
• Strong command of the incident management lifecycle, with a focus on improving alert quality, alert management, incident response, RCA, and postmortems.
• Experience with reliability engineering patterns such as canary deployments, automated rollback, capacity/right-sizing automation, and production operations.
• Solid data analysis skills, including SQL (e.g., PostgreSQL, MSSQL) and data pipelines.
• Data-driven mindset, able to perform deep research, analyze complex problems, and make informed technical decisions.
• Excellent communication and collaboration skills, able to explain complex technical concepts clearly to stakeholders at all levels, and to operate effectively both as a self-directed individual contributor and as part of a team.
• Curiosity and continuous learning, staying current with industry trends, open-source advancements, and emerging reliability practices.

Nice-to-Have:

• Experience operating large-scale, high-QPS systems serving millions of users in domains such as e-commerce, travel, or fintech.
• Hands-on experience with multi-region / multi-DC architectures and traffic isolation or failover strategies.
• Background in chaos engineering and resilience testing.
• Experience defining or scaling org-wide SLO/SRE frameworks.
• Experience building or operating Kubernetes controllers/operators.
• Exposure to ML-assisted detection or statistical methods for signal tuning (e.g., windowing strategies, precision/recall trade-offs).

Responsibilities

• Lead the technical vision, architecture, and execution of new SRE platforms and reliability initiatives.
• Define and promote SRE best practices across Agoda’s services, e.g., SLI/SLO-driven engineering, error budgets, and other data-driven reliability factors.
• Design, build, and operate reliability platforms, including load shedding, business signals monitoring, and safe-deployment automation, to reduce blast radius while preserving developer velocity.
• Own safe deployment strategies such as canary releases, automated rollback, and business-impact protection integrated with deployment and monitoring.
• Proactively identify and mitigate reliability and scaling risks across Agoda’s services.
• Improve system resilience and multi-cluster readiness by partnering with the platform and operations teams.
• Lead major incident response and operational excellence, driving fast detection, mitigation, root cause analysis, postmortems, and learnings focused on business impact.
• Maintain and evolve incident, observability, alerting, and on-call tooling, improving signal quality, alert enrichment, and grouping, and reducing time-to-clue and time-to-mitigation for NOC and on-call engineers.
• Advance platform observability and reliability signals using Prometheus and Grafana, balancing actionability, scale, and cost efficiency.
• Define reliability roadmaps and OKRs, translating ambiguous business reliability goals into clear technical requirements.