Graphcore - Principal Reliability Scientist
Requirements
• Strong background in reliability engineering or reliability science within semiconductor, hardware or complex systems environments
• Experience of physics-of-failure approaches in high-performance computing, AI hardware or related domains
• Experience with reliability modelling, experimental design and statistical data analysis
• Proven ability to work with and interpret experimental reliability data to drive engineering decisions
• Experience with key reliability metrics such as MTBF, MTTR, RAS and failure rate analysis
• Ability to operate effectively in complex, cross-functional environments with multiple stakeholders
• Strong problem-solving skills with the ability to lead technically challenging investigations independently
• Excellent communication skills, with the ability to influence design and operations teams using data-driven insights
Preferred Qualification:
• Experience with liquid cooling systems, fluid dynamics or thermally complex hardware environments
• Knowledge of soft error mechanisms and SER modeling
• Experience contributing to reliability strategy, processes or tooling improvements
Responsibilities
• Define and refine reliability requirements across silicon, board and system levels, working in partnership with research and design teams
• Apply advanced reliability methodologies to highly innovative systems, including challenges associated with liquid-cooled architectures and fluid dynamics
• Design and execute experiments to generate high-quality reliability and performance data, ensuring statistical rigour and relevance
• Analyse experimental, field and manufacturing data to quantify reliability metrics such as MTBF, MTTR, RAS characteristics and soft error rates (SER)
• Use data-driven insights to inform product design trade-offs, reliability targets and spares provisioning strategies
• Collaborate with chip, board and system design teams to influence architecture and component selection based on reliability considerations
• Support development of system-level reliability models incorporating thermal, mechanical and fluid behaviour
• Lead complex root cause investigations into reliability issues, driving corrective and preventative actions across teams
• Contribute to the evolution of reliability tools, processes and best practices within the organisation
• Communicate complex reliability concepts, risks and recommendations clearly to a wide range of stakeholders