Reddit - Staff Machine Learning Engineer, GenAI Platform
Upload My Resume
Drop here or click to browse · Tap to choose · PDF, DOCX, DOC, RTF, TXT
Requirements
• Production Engineering Fundamentals: Hands-on experience with Kubernetes, Docker, and building production-quality, object-oriented code in Python and/or Go. • Production Engineering Fundamentals: • Python • Strong focus on scalability, reliability, performance, and ease of use. You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle. • Strong organizational & communication skills.
Responsibilities
• Drive GenAI Infrastructure Strategy: Propose, design, and lead the architecture of our next-generation LLM platform, significantly advancing our capabilities to support large-scale foundation models that serve millions of redditors. • Drive GenAI Infrastructure Strategy: • Design Resilient, Large-Scale Distributed Systems: Architect highly fault-tolerant training infrastructure capable of supporting multi-week, distributed workloads across massive GPU clusters. You will tackle challenges related to automated recovery, cluster-scale health monitoring, and advanced checkpointing to ensure optimal compute efficiency. • Design Resilient, Large-Scale Distributed Systems: • Build Self-Serve LLM Workflows: Design and implement robust, production-grade pipelines for LLM fine-tuning (e.g., SFT, RLHF/DPO). You will abstract away the complexity of distributed training frameworks, integrating them into a seamless platform SDK that handles configuration, experiment tracking, and model lifecycle management. • Build Self-Serve LLM Workflows: • Develop Comprehensive Evaluation & Benchmarking Infrastructure: Treat model evaluation as a first-class platform capability. You will build scalable systems for automated regression detection, structured metrics tracking, and complex inference-heavy evaluation patterns to ensure the quality and safety of models before they hit production. • Develop Comprehensive Evaluation & Benchmarking Infrastructure: • Architect Advanced Data Ingestion Pipelines: Extend our distributed data platforms to natively and efficiently handle the massive, multimodal datasets (text, image, video) required for modern GenAI workloads, optimizing for throughput and dynamic batching. • Architect Advanced Data Ingestion Pipelines: • Provide Technical Leadership & Mentorship: Analyze complex bottlenecks in distributed systems to optimize for performance and cost-efficiency. Mentor senior engineers, champion a rigorous MLOps culture, and partner with cross-functional leadership to define technical roadmaps and de-risk major initiatives. • Provide Technical Leadership & Mentorship: • Who You Might Be: • Who You Might Be: • 10+ years of work experience in a production software development environment or building complex distributed data systems, plus a degree in ML, Engineering, Computer Science, or a related discipline. • GenAI/LLM Infrastructure Expertise: Proven track record of designing and operating large-scale ML systems, specifically working with distributed training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM) and LLM serving/inference optimization (e.g., vLLM, TensorRT-LLM). • GenAI/LLM Infrastructure Expertise: • Distributed Systems Mastery: Hands-on experience managing fault-tolerant, petabyte-scale distributed systems and multi-node/multi-GPU training clusters. • Distributed Systems Mastery: • Advanced MLOps Knowledge: Deep understanding of modern ML orchestration, fine-tuning pipelines, and model evaluation methodologies. Experience with tools like Ray, MLflow, or similar ecosystem standards. • Advanced MLOps Knowledge: • GPU Experience: Hands-on practice with CUDA environments, GPU virtualization/containerization, and doing it all within Kubernetes.
Benefits
• Comprehensive Healthcare Benefits and Income Replacement Programs • 401k with Employer Match • Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support • Family Planning Support • Gender-Affirming Care • Mental Health & Coaching Benefits • Flexible Vacation & Paid Volunteer Time Off • Generous Paid Parental Leave • Pay Transparency: • Pay Transparency: • This job posting may span more than one career level. • In addition to base salary, this job is eligible to receive equity in the form of restricted stock units, and depending on the position offered, it may also be eligible to receive a commission. Additionally, Reddit offers a wide range of benefits to U.S.-based employees, including medical, dental, and vision insurance, 401(k) program with employer match, generous time off for vacation, and parental leave. To learn more, please visit https://www.redditinc.com/careers/. • To provide greater transparency to candidates, we share base salary ranges for all US-based job postings regardless of state. We set standard base pay ranges for all roles based on function, level, and country location, benchmarked against similar stage growth companies. Final offer amounts are determined by multiple factors including, skills, depth of work experience and relevant licenses/credentials, and may vary from the amounts listed below. • The base salary range for this position is: • $253,300—$354,600 USD • In select roles and locations, the interviews will be recorded, transcribed and summarized by artificial intelligence (AI). You will have the opportunity to opt out of recording, transcription and summarization prior to any scheduled interviews. • During the interview, we will collect the following categories of personal information: Identifiers, Professional and Employment-Related Information, Sensory Information (audio/video recording), and any other categories of personal information you choose to share with us. We will use this information to evaluate your application for employment or an independent contractor role, as applicable. We will not sell your personal information or disclose it to any third party for their marketing purposes. We will delete any recording of your interview promptly after making a hiring decision. For more information about how we will handle your personal information, including our retention of it, please refer to our Candidate Privacy Policy for Potential Employees and Contractors.
No credit card. Takes 10 seconds.