hcompany - Research Engineer, Model Inference & Serving

United Kingdom - Hybrid (posted 2mo ago)

In Office, Staff, EMEA, Artificial Intelligence, Materials, Research Engineer, Staff Engineer, Rust, C++, Python, Go, JAX

Requirements

• Strong software engineering track record
• Proficient in Python and at least one systems language (Rust, C++, or Go)
• Hands-on experience with deep learning frameworks (PyTorch, JAX), preferably in an industry setting
• Solid distributed systems fundamentals
• Experience working in a modern cloud environment and with production ML infrastructure (Kubernetes, etc.)
• Working knowledge of modern ML, including transformers and multimodal architectures
• Research engagement: an advanced degree with research output, or publications at top-tier AI or systems venues (e.g., NeurIPS, ICML, MLSys, OSDI), research internships, or substantive open-source contributions
• Excellent communication and presentation skills
• Strong collaboration and teamwork skills
• Passion for inference and AI
• Hands-on experience with inference frameworks (vLLM, SGLang, TensorRT-LLM)
• Writing or modifying GPU kernels (CUDA, Triton, etc.)
• Edge or on-device inference experience (llama.cpp, MLX, ONNX Runtime, etc.)
• Experience with quantization, speculative decoding, disaggregated inference or KV-cache compression
• Experience with multimodal models and/or agentic systems
• Paris or London.
• This role is hybrid, and you are expected to be in the office 3 days a week on average.
• Please expect some travel between offices on a reasonable cadence (e.g., every 4-6 weeks).

Responsibilities

• Develop scalable, low-latency, and cost-effective inference pipelines
• Optimize model performance (memory usage, throughput, and latency) using advanced techniques such as distributed computing, model compression, quantization, and caching mechanisms
• Develop specialized GPU kernels for performance-critical operations such as attention mechanisms and matrix multiplications
• Collaborate with H research teams on model architectures to improve inference efficiency
• Review state-of-the-art papers to improve memory usage, throughput, and latency (FlashAttention, PagedAttention, continuous batching, etc.)
• Prioritize and implement state-of-the-art inference techniques

Benefits

• Join the exciting journey of shaping the future of AI, and be part of the early days of one of the hottest AI startups
• Collaborate with a fun, dynamic and multicultural team, working alongside world-class AI talent in a highly collaborative environment
• Unlock opportunities for professional growth, continuous learning, and career development
• If you want to change the status quo in AI, join us.