spaitial - Machine Learning & Cloud Infra Engineer

London, United Kingdom

In Office, Mid, EMEA, Cloud Computing, Artificial Intelligence, Cloud Engineer, Bash, Python, CUDA, GCP, AWS

Requirements

• 3+ years of professional experience in infrastructure, platform, or cloud engineering (ML infrastructure experience strongly preferred).
• Hands-on experience with GPU compute and performance debugging (CUDA/NCCL concepts, GPU utilization, networking bottlenecks, profiling).
• Strong experience operating cloud environments (AWS, GCP, or Azure), including networking, IAM, and cost management.
• Proficiency with containers and orchestration (Docker, Kubernetes) and infrastructure-as-code (Terraform).
• Strong scripting and automation skills (Python plus Bash/PowerShell).
• Familiarity with distributed training and modern ML stacks (PyTorch; DDP/FSDP or comparable).
• Experience with monitoring and observability tooling (Prometheus/Grafana, OpenTelemetry, ELK, or similar).
• Experience building CI/CD for infra and ML workflows (e.g., CircleCI, GitHub Actions).

Responsibilities

• Own and evolve the ML + cloud infrastructure that enables training and evaluation of massive foundation models.
• Design and operate GPU clusters: Provision, scale, and maintain multi-node, multi-GPU training environments (on cloud and/or on-prem), including scheduling, quotas, and capacity planning.
• Distributed training enablement: Support high-throughput training stacks (e.g., PyTorch DDP/FSDP, NCCL) and ensure performance, stability, and reproducibility across large runs.
• Storage and data throughput: Build and optimize storage systems and networking for petabyte-scale datasets and high-bandwidth training (object storage, NVMe, shared filesystems, caching, data locality).
• Containerization and orchestration: Package and deploy workloads with Docker and Kubernetes (or comparable systems); maintain infrastructure-as-code (Terraform) and reliable release processes.
• Observability and reliability: Implement monitoring, logging, and alerting for cluster health, job performance, and cost; define SLOs and on-call/incident response practices.
• Security and access: Manage secrets, IAM, and secure network boundaries for research and production systems.
• Collaboration: Partner closely with ML researchers and engineers to unblock training, iterate on tooling, and improve developer experience.
• Production pathways: Support model evaluation and serving infrastructure where needed, and ensure smooth transitions from research to deployable systems.