OpenAI - Datacenter Incident Program Manager

United States · $126k - $228k · 2d ago

In Office Senior NA Artificial Intelligence Program Manager Close Program Management Jira Documentation Reporting Governance

Requirements

• 7+ years in mission-critical infrastructure, data center operations, or reliability engineering
• Direct experience leading major incidents (P1/P0 equivalent)
• Strong familiarity with facilities systems, hardware operations, or network infrastructure
• Demonstrated experience running war rooms and executive updates
• Experience conducting root cause analysis and corrective action tracking
• Ability to remain calm and decisive under high-pressure conditions

Preferred Skills

• Experience in hyperscale or high-density AI compute environments
• Background in facilities commissioning, facility operations, hardware operations, or network reliability
• Familiarity with ISO-based quality systems or structured operational documentation frameworks
• Experience implementing incident tooling (PagerDuty, ServiceNow, Jira, etc.)

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

Responsibilities

• Define and maintain incident severity levels (SEV definitions), classification criteria, and escalation thresholds.
• Establish end-to-end incident response standards: protocols, lifecycle stages (declare → stabilize → mitigate → recover → close), and operating cadence.
• Build and maintain governance artifacts: runbooks, war room formats, reporting templates, and decision/communication standards.
• Create and operationalize notification trees, stakeholder comms templates (initial, periodic updates, recovery/closure), and executive escalation criteria.
• Define clear RACI across Facilities, Hardware Ops, Network, Security, and vendor/partner teams, including handoffs and accountability paths.
• Set and manage SLAs/OLAs for acknowledgment, escalation, containment, mitigation, and reporting.
• Implement and run incident management tooling (ticketing, paging, logging) and ensure integrations with monitoring and workflow systems.
• Establish dashboards and program health metrics to track incident performance and readiness.
• Lead readiness activities: tabletop exercises, cross-functional simulations, IC/Deputy training, and a rotating on-call IC bench with certification standards.
• Serve as Incident Commander as needed: declare severity, stand up the war room, assign functional leads, and drive structured execution under pressure.
• Maintain real-time documentation (decisions, timelines, impact scope) and ensure clear restoration objectives and scope control during active events.
• Run post-incident reviews (PIRs), validate timelines, drive structured RCA (e.g., 5 Whys, Fault Tree), and separate root cause vs contributing factors.
• Define corrective/preventative actions (CAPAs), assign accountable owners, track to verified closure, and escalate overdue actions.
• Publish trend reporting (incident taxonomy, counts by severity, MTTA/MTTR, repeat failure domains) and feed systemic gaps back into design and operations teams.
