Machine Learning Engineer — Inference Optimization
Requirements
• Strong experience in ML inference optimization or high-performance ML systems
• Solid understanding of deep learning internals (attention, memory layout, compute graphs)
• Hands-on experience with PyTorch (or similar) and model deployment
• Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)
• Experience scaling inference for real users, not just research benchmarks
• Comfortable working in fast-moving startup environments with ownership and ambiguity
• Experience with LLM or long-context model inference
• Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)
• Experience optimizing across different hardware vendors
• Open-source contributions in ML systems or inference tooling
• Background in distributed systems or low-latency services
Responsibilities
• Optimize inference latency, throughput, and cost for large-scale ML models in production.
• Profile GPU/CPU inference pipelines and identify bottlenecks across memory usage, kernel execution, batching strategies, and I/O (a minimal profiling sketch follows this list).
• Implement and tune quantization and reduced-precision techniques (fp16, bf16, int8, fp8) to shrink models and improve performance (see the quantization sketch below).
• Optimize KV-cache reuse in inference systems.
• Apply speculative decoding alongside batching and streaming optimizations.
• Apply model pruning or architectural simplifications targeted at inference efficiency.
• Collaborate closely with research engineers to take new model architectures to production, making them fast and reliable enough for real user interaction.
• Build and maintain robust model-serving systems (e.g., Triton Inference Server or custom runtimes) across hardware configurations (NVIDIA/AMD GPUs) and cloud infrastructure.
• Benchmark performance across hardware setups (NVIDIA and AMD GPUs and CPUs) and diverse cloud environments.
• Improve reliability and observability under real production workloads.
• Optimize inference cost efficiency in realistic user scenarios without compromising performance or accuracy.
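To give a flavor of the profiling work above, here is a minimal sketch using PyTorch's torch.profiler. The toy two-layer model, tensor shapes, and warm-up count are placeholder assumptions for illustration, not details of our stack:

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile

# Stand-in model; in practice this would be the production network.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).eval()
x = torch.randn(8, 1024)

with torch.inference_mode():
    # Warm up so one-time allocations and lazy init don't pollute the trace.
    for _ in range(3):
        model(x)

    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities, profile_memory=True) as prof:
        model(x)

# Rank ops by self time to surface the hottest kernels first.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```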
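And a minimal sketch of the kind of reduced-precision and quantization experiment the role involves, again with a hypothetical toy model standing in for the real thing:

```python
import torch
import torch.nn as nn

# Hypothetical toy model; real work targets large production models.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).eval()
x = torch.randn(8, 1024)

# Baseline fp32 inference for reference.
with torch.inference_mode():
    ref = model(x)

# bf16 inference via autocast: weights stay fp32, matmuls run in bf16,
# cutting activation memory traffic roughly in half.
with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out_bf16 = model(x)

# Post-training dynamic int8 quantization of the Linear layers.
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
with torch.inference_mode():
    out_int8 = qmodel(x)

# Sanity-check accuracy drift against the fp32 reference.
print("bf16 max abs err:", (ref - out_bf16.float()).abs().max().item())
print("int8 max abs err:", (ref - out_int8).abs().max().item())
```

Dynamic int8 quantization is only one option; static calibration and fp8 paths depend on the hardware and serving framework in use.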
Benefits
• Real ownership over performance-critical systems
• Direct impact on product reliability and unit economics
• Close collaboration with research, infra, and product
• Competitive compensation + meaningful equity at Series A
• A team that cares about engineering quality, not hype