Machine Learning Engineer — Training Optimization
Requirements
• Strong experience training large neural networks (LLMs or similarly large models)
• Hands-on experience with training optimization (not just model usage)
• Solid understanding of:
  • Backpropagation, optimization algorithms, and training dynamics
  • Distributed systems for ML training
• Comfort working close to hardware (GPUs, memory, networking constraints)
• Ability to move fluidly between research ideas and production-ready code
• Experience with large-scale distributed training (multi-node, multi-GPU; see the sketch after this list)
• Familiarity with DeepSpeed, FSDP, Megatron, or custom training stacks
• Experience optimizing training on AMD or NVIDIA GPUs
• Contributions to open-source ML infrastructure or research codebases
• Exposure to non-Transformer architectures (RNNs, hybrid models, etc.)
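As a rough illustration of the multi-node, multi-GPU work referenced above, here is a minimal data-parallel training sketch. It assumes PyTorch, its DistributedDataParallel wrapper, and a torchrun launch; the linear model and random batch are placeholders, not our stack.

```python
# Minimal multi-GPU data-parallel loop (PyTorch DDP).
# Launch with: torchrun --nproc_per_node=<gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])        # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])            # all-reduces grads across ranks
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device=local_rank)       # placeholder batch
        loss = model(x).pow(2).mean()
        opt.zero_grad(set_to_none=True)
        loss.backward()                                    # comm overlaps with backward
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```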
Responsibilities
• Optimize large-scale model training pipelines (throughput, convergence, stability, and cost)
• Improve distributed training strategies (data, model, and pipeline parallelism)
• Tune optimizers, schedulers, batch sizing, and precision (bf16 / fp16 / fp8)
• Reduce training time and compute cost via profiling, bottleneck analysis, and systems-level improvements
• Collaborate with researchers on architecture-aware training strategies
• Build and maintain robust training infrastructure (checkpointing, fault tolerance, reproducibility)
• Evaluate and integrate new training techniques (e.g., gradient checkpointing, ZeRO, FSDP, custom kernels; see the sketch after this list)
• Own training performance metrics and continuously push them forward
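To make the precision and checkpointing items above concrete, here is a minimal sketch of bf16 mixed precision plus activation checkpointing under PyTorch FSDP. TinyBlock and every hyperparameter are illustrative assumptions, not our training stack.

```python
# Sketch: FSDP sharding with bf16 mixed precision and activation checkpointing.
# Assumes PyTorch >= 2.x and a torchrun launch.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.utils.checkpoint import checkpoint

class TinyBlock(torch.nn.Module):                  # stand-in for a transformer block
    def __init__(self, d=1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d)
        )

    def forward(self, x):
        # Recompute activations in backward instead of storing them
        # (trades extra FLOPs for a smaller memory footprint).
        return x + checkpoint(self.ff, x, use_reentrant=False)

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(*[TinyBlock() for _ in range(4)]).cuda()
    model = FSDP(
        model,
        mixed_precision=MixedPrecision(            # compute/communicate in bf16
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
        ),
    )
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 128, 1024, device="cuda")    # placeholder batch
        loss = model(x).float().pow(2).mean()
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

FSDP's full-parameter sharding is broadly analogous to DeepSpeed's ZeRO stage 3; which one fits best depends on the surrounding stack.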
Benefits
• Real ownership at Series-A stage: your work shapes the company's trajectory
• Work on cutting-edge models and training systems at scale
• Small, highly technical team with fast feedback loops
• Strong emphasis on engineering quality and research rigor
• Competitive compensation + meaningful equity