Wand Synthesis AI Inc - Head of SRE
Requirements
• Proven hands-on experience in Site Reliability Engineering, Production Engineering, or a similar role.
• Strong hands-on expertise in cloud infrastructure (AWS or Azure preferred), IaC (Terraform), and Kubernetes.
• Experience building or maturing SRE practices within an organisation.
• Demonstrated ability to improve uptime, reliability, and operational processes.
• Deep understanding of CI/CD, developer experience, infrastructure-as-code, and automation.
• Experience designing on-call processes and incident response frameworks.
• Experience managing at least one team of SRE engineers.
• Strong communication skills, with the ability to influence across teams.
• Experience supporting data platforms and ML systems in production environments.
• MLOps experience (model deployment, monitoring, retraining workflows).
• Background in large-scale global B2B/B2C products.
• Background in enterprise environments with security and compliance requirements.
• Expertise in ML, AI, and LLMs.
• Experience implementing regulatory controls within cloud infrastructure.
• Experience evaluating and managing infrastructure vendors and tooling.
• Experience scaling systems in high-growth environments.
• Experience collaborating with large-scale enterprise customers to deploy and operate environments within their accounts and VPCs.
Personal Characteristics
• Practical and hands-on; willing to lead from the front.
• Strong operational mindset with clear opinions on best practices.
• Structured thinker who can build processes from ambiguity.
• High ownership mentality and accountability.
• Learning-oriented with a continuous improvement mindset.
• Excellent communication and interpersonal skills.
• Continuous drive for improvement and innovation.
Responsibilities
• Own and lead all SRE-related strategy, standards, and execution. Embed SRE culture and operational excellence across engineering teams.
• Review the current infrastructure and operational model; redesign and rebuild where needed.
• Architect, deploy, and maintain scalable, secure production environments.
• Define and implement SLIs, SLOs, and uptime targets.
• Establish robust monitoring, alerting, and observability practices.
• Design and implement incident management, RCA, and postmortem processes.
• Build and manage sustainable on-call frameworks and escalation models.
• Automate the software delivery lifecycle to improve release predictability and safety.
• Create reproducible environments and IaC provisioning templates.
• Improve system performance, availability, and reliability.
• Support and productionise data platforms and ML workloads.
• Partner closely with QA and Engineering leadership to improve release quality and stability.
• Ensure infrastructure meets enterprise-grade security and regulatory requirements.
• Hire, manage, and mentor a team of SRE engineers.