lambda - Senior Incident Manager

Remote, USA - Hybrid · + Equity · 3w ago
In Office · Senior · NA · Senior Community Manager · Team Leadership · Decision Making · Reporting · Documentation · ITIL · Prometheus · Grafana

Requirements

• Experience operating AI or HPC infrastructure
• Background in SRE, infrastructure engineering, or data center operations
• Familiarity with high-density GPU environments (NVIDIA clusters, InfiniBand networks)
• Experience with hyperscale or colocation data center environments
• Knowledge of automation and incident response tooling
• Knowledge of and experience with the Incident Command System (ICS)
• Experience building and leading an incident command function from scratch

Key Competencies:
• Incident Command & Leadership
• Operational Decision Making
• Cross-Team Coordination
• Root Cause Analysis
• Crisis Communication
• Infrastructure Reliability

Responsibilities

Incident Leadership:
• Lead the response to critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
• Serve as the Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
• Act as the liaison between leadership and external teams during and after incidents, providing updates and status summaries.
• Establish clear incident timelines, triage actions, and resolution plans.

Incident Management Operations:
• Own the incident response lifecycle, including:
  • Assisting technical triage
• Ensure timely and accurate communication with internal stakeholders and leadership.
• Maintain incident response documentation and operational playbooks.
• Analyze incidents and identify patterns and trends to improve response and system reliability.
• Work in an on-call rotation to respond to, lead, and coordinate incidents.

Cross-Functional Coordination:
• Work closely with:
  • Data center operations
  • Infrastructure engineering & operations
  • Network engineering
  • Platform reliability engineering
  • Security operations
  • Hardware and facility vendors
• Drive alignment during outages involving multiple infrastructure layers.
• Lead post-incident reviews (PIRs) and root cause analysis. Identify systemic reliability gaps and implement corrective actions.
• Track incident metrics including MTTR, MTTD, and incident recurrence rates.

Operational Excellence:
• Improve incident response processes, escalation paths, and tooling by working with technical support and engineering teams.
• Contribute to runbooks, operational standards, and reliability frameworks.
• Support implementation of automation and observability improvements.

Communication & Reporting:
• Provide executive-level incident summaries and reports.
• Deliver clear, concise updates during active incidents.
• Maintain incident dashboards and operational health reporting.

Qualifications

• 8+ years of experience in incident management, site reliability engineering, or infrastructure operations
• Experience managing incidents in large-scale distributed infrastructure environments
• Strong understanding of:
  • GPU compute clusters
  • Networking and storage infrastructure
  • Cloud or hybrid infrastructure platforms
• Proven ability to lead high-pressure incident response situations
• Experience with incident management frameworks (ITIL, SRE, or equivalent)
• Excellent communication and stakeholder management skills
• Experience with incident tracking and monitoring tools such as:
  • Prometheus / Grafana

Expected Outcomes

• Reduced Mean Time to Resolution (MTTR) for critical incidents
• Improved cross-team incident coordination
• High-quality post-incident reviews and corrective actions
• Increased infrastructure reliability and operational maturity

Benefits

• The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
