Agoda - Lead DevOps Engineer (SRE) (Bangkok based, relocation provided)
Requirements
• 8+ years of relevant experience.
• Demonstrated ownership of architecting, building, and operating mission-critical production systems, making long-term technical and reliability trade-off decisions.
• Proven ability to lead and coordinate complex cross-team initiatives, setting technical direction and aligning stakeholders to deliver outcomes at organizational scale.
• Expertise in one or more programming languages (e.g., Go, Python, Rust, Java) with a solid understanding of distributed systems fundamentals (concurrency, backpressure, timeouts/retries, idempotency, circuit breaking).
• Deep hands-on experience with the Kubernetes ecosystem, service mesh technologies (e.g., Istio), and Kubernetes deployment workflows (e.g., Argo CD).
• Observability and monitoring expertise, using Prometheus, Grafana, and common logging/telemetry stacks (e.g., OpenTelemetry), with an understanding of signal quality, scalability, and cost trade-offs.
• Strong command of the incident management lifecycle, with a focus on improving alert quality, alert management, incident response, RCA, and postmortems.
• Experience with reliability engineering patterns such as canary deployments, automated rollback, capacity/right-sizing automation, and production operations.
• Solid data analysis skills, including SQL (e.g., PostgreSQL, MSSQL) and data pipelines.
• Data-driven mindset, able to perform deep research, analyze complex problems, and make informed technical decisions.
• Excellent communication and collaboration skills, able to explain complex technical concepts clearly to stakeholders at all levels, and to operate effectively both as a self-directed individual contributor and as part of a team.
• Curiosity and continuous learning, staying current with industry trends, open-source advancements, and emerging reliability practices.
Nice-to-Have:
• Experience operating large-scale, high-QPS systems serving millions of users in domains such as e-commerce, travel, or fintech.
• Hands-on experience with multi-region / multi-DC architectures and traffic isolation or failover strategies.
• Background in chaos engineering and resilience testing.
• Experience defining or scaling org-wide SLO/SRE frameworks.
• Experience building or operating Kubernetes controllers/operators.
• Exposure to ML-assisted detection or statistical methods for signal tuning (e.g., windowing strategies, precision/recall trade-offs).
Please review our Hiring Process Guidelines before your interview to learn how interviewing at Agoda works.
Responsibilities
• Lead the technical vision, architecture, and execution of new SRE platforms or reliability initiatives.
• Define and promote SRE best practices across Agoda’s services, e.g., SLI/SLO-driven engineering, error budgets, and other data-driven reliability factors.
• Design, build, and operate reliability platforms, including load shedding, business signals monitoring, and safe-deployment automation, to reduce blast radius while preserving developer velocity.
• Own safe deployment strategies such as canary releases, automated rollback, and business-impact protection integrated with deployment and monitoring.
• Proactively identify and mitigate reliability and scaling risks across Agoda’s services.
• Improve system resilience and multi-cluster readiness by partnering with the platform and operations teams.
• Lead major incident response and operational excellence, driving fast detection, mitigation, root cause analysis, postmortems, and learnings focused on business impact.
• Maintain and evolve incident, observability, alerting, and on-call tooling, improving signal quality, alert enrichment, and grouping, and reducing time-to-clue and time-to-mitigation for NOC and on-call engineers.
• Advance platform observability and reliability signals using Prometheus and Grafana, balancing actionability, scale, and cost efficiency.
• Define reliability roadmaps and OKRs, translating ambiguous business reliability goals into clear technical requirements.