Fundamental - Model Serving Engineer
Requirements
• Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
• 5+ years of experience in model serving, ML infrastructure, or a closely related backend engineering role
• Deep, production-level experience with Triton Inference Server, including custom Python backends, batching configuration, and model repository management (see the sketch after this list)
• Expert-level Python skills with a thorough understanding of the GIL, multi-threading, multiprocessing, and async concurrency patterns
• Strong understanding of neural network inference mechanics: forward passes, batching strategies, memory management, and numerical precision tradeoffs
• Hands-on experience with other inference frameworks (TorchServe, TensorFlow Serving, ONNX Runtime, vLLM, etc.) and the ability to evaluate tradeoffs between them
• Experience profiling and optimizing inference code for latency and throughput at production scale
• Experience with GPU kernel-level optimizations or CUDA profiling tools
• Familiarity with model quantization, pruning, or compilation toolchains (TensorRT, torch.compile, ONNX)
• Experience with KServe or other Kubernetes-native serving platforms
• Experience serving tabular or structured-data models, including classical ML models such as XGBoost and CatBoost
• Experience with observability tooling such as Prometheus, Grafana, or Datadog in the context of inference monitoring
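For candidates unfamiliar with Triton's custom Python backend model, below is a minimal sketch of the structure this role works with, using Triton's standard triton_python_backend_utils module. The tensor names INPUT0/OUTPUT0 and the identity "inference" step are illustrative placeholders, not a real model.

    import numpy as np
    import triton_python_backend_utils as pb_utils

    class TritonPythonModel:
        def initialize(self, args):
            # args carries model metadata (name, version, config) as strings.
            self.model_name = args["model_name"]

        def execute(self, requests):
            # With dynamic batching enabled, Triton may hand over several
            # requests at once; each request gets exactly one response.
            responses = []
            for request in requests:
                x = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
                y = x.astype(np.float32)  # stand-in for real inference work
                out = pb_utils.Tensor("OUTPUT0", y)
                responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
            return responses

        def finalize(self):
            # Called once at model unload; release any held resources here.
            pass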
Responsibilities
• Design, build, and maintain production model serving infrastructure using Triton Inference Server as the primary framework
• Implement and optimize inference pipelines, including custom backends, dynamic batching strategies, and model ensemble configurations in Triton
• Optimize Python inference code for performance, with a strong focus on GIL contention, multi-threading, and concurrency patterns
• Tune throughput and latency across the full serving stack: batching policies, thread pool sizing, model instance groups, and memory layout (see the measurement sketch after this list)
• Work closely with the research team to understand new model architectures at a computational level: batching behavior, dynamic shapes, memory access patterns, and so on
• Own the full resource observability and control loop for production inference: instrument GPU memory, CPU, batch queue depth, and latency metrics, and actively tune model instance groups, concurrency limits, memory budgets, and batching configuration in response to observed behavior
• Evaluate and integrate alternative inference frameworks and runtimes as the model ecosystem evolves
• Contribute to GPU utilization improvements and resource efficiency across the serving fleet
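To give a flavor of the tuning loop described above, here is a hedged sketch of a client-side latency sweep using the tritonclient HTTP API. The model name example_model, the INPUT0 tensor name, the input shape, and the server URL are assumptions for illustration only, not a real deployment.

    import time
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor
    import tritonclient.http as httpclient

    MODEL = "example_model"   # hypothetical model name
    URL = "localhost:8000"    # assumed local Triton HTTP endpoint

    def one_request(_):
        # One client per call keeps the sketch simple; a per-thread client
        # would be the efficient choice in a real harness.
        client = httpclient.InferenceServerClient(url=URL)
        x = np.random.rand(1, 16).astype(np.float32)
        inp = httpclient.InferInput("INPUT0", list(x.shape), "FP32")
        inp.set_data_from_numpy(x)
        start = time.perf_counter()
        client.infer(MODEL, inputs=[inp])
        elapsed = time.perf_counter() - start
        client.close()
        return elapsed

    def sweep(concurrency, n=200):
        # Sweep client concurrency and watch how dynamic batching and
        # instance-group settings trade p50 against p99 latency.
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            lat = sorted(pool.map(one_request, range(n)))
        p50 = lat[len(lat) // 2]
        p99 = lat[int(0.99 * (len(lat) - 1))]
        print(f"concurrency={concurrency} p50={p50*1000:.1f}ms p99={p99*1000:.1f}ms")

    if __name__ == "__main__":
        for c in (1, 4, 16):
            sweep(c)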
Benefits
• Competitive compensation with salary and equity
• Comprehensive health coverage (medical, dental, and vision) and a 401(k) plan
• Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
• Relocation support for employees moving to join the team in one of our office locations
• A mission-driven, low-ego culture that values diversity of thought, ownership, and a bias toward action