Training: Process Management Engineer
Upload My Resume
Drop here or click to browse · PDF, DOCX, DOC, RTF, TXT
Responsibilities
• Work across our Python and Rust stack • Design, build, and maintain software to orchestrate and monitor machine learning workloads on our largest supercomputers • Profile and optimize our software stack to support computation orchestration at frontier scale • Improve reliability, observability, and fault tolerance for long-running jobs • Debug complex distributed systems issues across large clusters • Respond to the changing shapes and needs of the ML systems to enable our researchers • Have experience developing distributed systems (not just operating them) • Enjoy understanding how large systems behave and fail at scale • Care deeply about performance, correctness, and reliability • Have strong software engineering skills and are proficient in Python and Rust or another systems programming language (e.g. C++) • Have solid Linux knowledge, and are comfortable with systems-level debugging, performance analysis, and memory profiling • Are comfortable and experienced working and developing asynchronous and concurrent systems • Like high-ownership environments with light process and strong engineering agency • OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.