
featherlessai - AI Researcher — Inference Optimization

Remote (worldwide) · USA · Posted 3 months ago

Tags: Remote · NA · Artificial Intelligence · Hospitals · Diagnostics · AI Engineer · Python · Triton · ONNX · CUDA


Requirements

• Strong background in machine learning, deep learning, or AI systems
• Hands-on experience optimizing inference for large-scale models
• Proficiency in Python and modern ML frameworks (e.g., PyTorch)
• Experience with inference tooling (e.g., Triton, TensorRT, vLLM, ONNX Runtime)
• Ability to design experiments and communicate results clearly
• Experience deploying production inference systems at scale
• Familiarity with distributed and multi-GPU inference
• Experience contributing to open-source ML or inference frameworks
• Authorship or co-authorship of peer-reviewed research papers in machine learning, systems, or related fields
• Experience working close to hardware (CUDA, ROCm, profiling tools)

What Success Looks Like

• Measurable gains in latency, throughput, and cost efficiency
• Optimized inference systems running reliably in production
• Research ideas successfully translated into deployable systems
• Clear benchmarks and documentation that inform product decisions

Relevant Research Areas (Bonus)

• Long-context inference optimization
• Speculative decoding
• KV-cache compression and paging
• Efficient decoding strategies
• Hardware-aware inference design
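The role asks for clear benchmarks of latency and throughput. As a rough illustration of what that means in practice, here is a minimal, framework-free benchmarking sketch in pure Python; `fn` is a placeholder standing in for a single inference call, not any specific model API:

```python
import statistics
import time

def benchmark(fn, warmup=3, iters=20):
    """Measure per-call latency (ms) and throughput for a callable.

    `fn` is a stand-in for one inference call. Warmup iterations run
    first so caches/JIT effects don't pollute the timed runs.
    """
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - t0) * 1e3)  # ms
    latencies.sort()
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[int(0.95 * (iters - 1))],
        "throughput_rps": 1e3 * iters / sum(latencies),
    }

if __name__ == "__main__":
    # Toy CPU workload standing in for a model forward pass.
    stats = benchmark(lambda: sum(i * i for i in range(50_000)))
    print(stats)
```

Reporting tail latency (p95) alongside the median matters for production inference, since batch effects and contention show up in the tail long before they move the median.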

Responsibilities

• Research and develop techniques to optimize inference performance for large neural networks
• Improve latency, throughput, memory efficiency, and cost per inference
• Design and evaluate model-level optimizations (quantization, pruning, KV-cache optimization, architecture-aware simplifications)
• Implement systems-level optimizations (dynamic batching, kernel fusion, multi-GPU inference, prefill vs. decode optimization)
• Benchmark inference workloads across hardware accelerators
• Collaborate with engineering teams to deploy optimized inference pipelines
• Translate research insights into production-ready improvements
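One of the systems-level optimizations named above is dynamic batching: grouping incoming requests so the accelerator runs one large forward pass instead of many small ones. A minimal pure-Python sketch of the core loop, with a toy request queue (`batch_worker` and `run_batch` are illustrative names, not part of any stated stack):

```python
import queue
import time

def batch_worker(requests, run_batch, max_batch=8, max_wait_s=0.01):
    """Drain a request queue, grouping requests into batches.

    Blocks on the first request, then waits up to `max_wait_s` for the
    batch to fill before flushing — trading a little per-request
    latency for higher throughput.
    """
    while True:
        try:
            first = requests.get(timeout=max_wait_s)
        except queue.Empty:
            return  # no more traffic in this sketch
        batch = [first]
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        run_batch(batch)  # one fused forward pass over the batch

if __name__ == "__main__":
    q = queue.Queue()
    for i in range(20):          # pre-filled toy traffic
        q.put(i)
    batches = []
    batch_worker(q, batches.append, max_batch=8)
    print(batches)               # requests grouped into batches of ≤ 8
```

Production schedulers (e.g., in Triton or vLLM) layer far more on top — priority, padding-aware grouping, prefill/decode separation — but the batch-size vs. wait-time trade-off above is the core knob.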

Benefits

• Equity compensation
• Paid time off (PTO)
• Remote work options

