spaitial - Machine Learning Systems & Infrastructure Engineer

London, United Kingdom (posted 1mo ago)

In Office, Mid, EMEA, Cloud Computing, Artificial Intelligence, Infrastructure Engineer, Machine Learning Engineer, Docker, Kubernetes, Terraform, Python, CUDA

Requirements

• 3+ years writing production-quality Python in a large, multi-author codebase, with strong SWE fundamentals (ML systems experience strongly preferred).
• Hands-on with modern ML training stacks (PyTorch; DDP/FSDP or comparable); have personally debugged distributed jobs across many GPUs and nodes.
• Have shipped non-trivial end-to-end data pipelines at scale (ingestion, transformation, validation, versioning, republishing), ideally including real-world sources with rate limits, auth, or undocumented APIs.
• Hands-on GPU compute and performance debugging (CUDA/NCCL, GPU utilization, networking bottlenecks, profiling).
• Working knowledge of cloud environments (AWS, GCP, or Azure), including object storage, IAM, and cost awareness.
• Proficient with containers (Docker, Kubernetes) and comfortable reading and writing IaC (Terraform) for the surfaces you ship.
• Strong working knowledge of how to store and query large datasets at scale: SQL fundamentals; relational (e.g., Postgres), analytical (e.g., BigQuery, Snowflake), and embedded (e.g., SQLite) stores; and object storage with caching layers.
• Familiarity with ML workflow orchestration and experiment tracking (e.g., Kubeflow Pipelines, MLflow).
• Experience with monitoring and observability tooling (e.g., Prometheus/Grafana, OpenTelemetry) and CI/CD for infra and ML workflows (e.g., GitHub Actions).

Responsibilities

• Own and evolve the ML systems that enable training, evaluation, and serving of large foundation models: trainer, dataset loaders, checkpointing, and experiment orchestration code.
• Distributed training enablement: Improve high-throughput training stacks (e.g., PyTorch DDP/FSDP, NCCL) for performance, stability, and reproducibility, including preemption-safe and sharded checkpointing.
• Data systems and pipelines: Build end-to-end Python pipelines that turn third-party capture sources into clean, versioned training datasets, including scraping (e.g., Playwright) and preprocessing, and optimize the underlying storage at petabyte scale (object storage, FUSE mounts, caching layers, shared filesystems, and relational / analytical / embedded metadata stores).
• ML workflow orchestration and serving: Operate the systems researchers use to launch experiments, data jobs, and production endpoints: workflow engines (e.g., Kubeflow Pipelines, Airflow), GPU schedulers (e.g., Volcano, Slurm), experiment trackers (e.g., MLflow, Weights & Biases), and managed-inference platforms (e.g., Modal, Triton). Maintain a launcher SDK for one-command runs.
• Containerization and packaging: Ship workloads with Docker and Kubernetes; maintain IaC (Terraform) for the surfaces you own and CI/CD pipelines, including self-hosted GPU runners.
• Observability and reliability: Monitoring, logging, and alerting for job performance, data-pipeline health, and cost (e.g., Prometheus/Grafana, OpenTelemetry); define SLOs and incident response for the systems you own.
• Security and access: Manage secrets, IAM, and network boundaries (e.g., Tailscale, cloud VPC) for the systems you own.
• Collaboration: Partner with ML researchers, engineers, and the platform team to unblock training and data work and improve developer experience.