Lambda - Senior Incident Manager
Requirements
• Experience operating AI or HPC infrastructure
• Background in SRE, infrastructure engineering, or data center operations
• Familiarity with high-density GPU environments (NVIDIA clusters, InfiniBand networks)
• Experience with hyperscale or colocation data center environments
• Knowledge of automation and incident response tooling
• Knowledge of and experience with the Incident Command System (ICS)
• Experience leading and developing incident command from scratch
• Key Competencies
  • Incident Command & Leadership
  • Operational Decision Making
  • Cross-Team Coordination
  • Root Cause Analysis
  • Crisis Communication
  • Infrastructure Reliability
Responsibilities
• Incident Leadership
  • Lead the response to critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
  • Serve as the Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
  • Act as the liaison between leadership and external teams during and after incidents, providing updates and status summaries.
  • Establish clear incident timelines, triage actions, and resolution plans.
• Incident Management Operations
  • Own the incident response lifecycle, including:
    • Assisting with technical triage
  • Ensure timely and accurate communication with internal stakeholders and leadership.
  • Maintain incident response documentation and operational playbooks.
  • Analyze incidents and identify patterns and trends to improve response and system reliability.
  • Work in an on-call rotation to respond to, lead, and coordinate incidents.
• Cross-Functional Coordination
  • Work closely with:
    • Data center operations
    • Infrastructure engineering & operations
    • Network engineering
    • Platform reliability engineering
    • Security operations
    • Hardware and facility vendors
  • Drive alignment during outages involving multiple infrastructure layers.
  • Lead post-incident reviews (PIRs) and root cause analysis. Identify systemic reliability gaps and implement corrective actions.
  • Track incident metrics including MTTR, MTTD, and incident recurrence rates.
• Operational Excellence
  • Improve incident response processes, escalation paths, and tooling by working with technical support and engineering teams.
  • Contribute to runbooks, operational standards, and reliability frameworks.
  • Support implementation of automation and observability improvements.
• Communication & Reporting
  • Provide executive-level incident summaries and reports.
  • Deliver clear, concise updates during active incidents.
  • Maintain incident dashboards and operational health reporting.
• 8+ years of experience in incident management, site reliability engineering, or infrastructure operations
• Experience managing incidents in large-scale distributed infrastructure environments
• Strong understanding of:
  • GPU compute clusters
  • Networking and storage infrastructure
  • Cloud or hybrid infrastructure platforms
• Proven ability to lead high-pressure incident response situations
• Experience with incident management frameworks (ITIL, SRE, or equivalent)
• Excellent communication and stakeholder management skills
• Experience with incident tracking and monitoring tools such as:
  • Prometheus / Grafana

• Reduced Mean Time to Resolution (MTTR) for critical incidents
• Improved cross-team incident coordination
• High-quality post-incident reviews and corrective actions
• Increased infrastructure reliability and operational maturity
Benefits
• The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.