Software Engineer, Platform Systems
Requirements
• Skills needed: Designing distributed failure detection, tracing, and profiling systems; developing tools for identifying slow/faulty nodes.
• Years of experience: Not explicitly stated.
• Education: Systems engineering background preferred (Bachelor's degree or equivalent).
• Certifications: None mentioned.
• Must-haves: Experience with hardware, operating systems, networking, concurrency and distributed systems; understanding high-performance computing is a plus.
Responsibilities
• Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs
• Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behavior
• Improve observability, reliability, and performance across OpenAI’s training platform
• Debug and resolve issues in complex, high-throughput distributed systems
• Collaborate with systems, infrastructure, and research teams to evolve platform capabilities
• Extend and adapt failure detection and tracing systems to support new training paradigms and workloads
You may be a good fit if you:
• Care deeply about performance, stability, and observability in distributed systems
• Enjoy finding and fixing issues in large-scale systems and automating operational workflows
• Have experience writing low-level software where system details matter
• Understand hardware, operating systems, networking, concurrency, and distributed systems
• Have a background in high-performance computing or low-level systems engineering
• Are excited to work on critical infrastructure that powers frontier AI research
Benefits
• Equity is part of the compensation package.
• Paid time off (PTO).
• Insurance coverage for employees.
• Work location: OpenAI's headquarters or a nearby office, with a remote work option listed for this role.