spaitial - Machine Learning & Cloud Infra Engineer

London, United Kingdom · 2mo ago
In Office · Mid · EMEA · Cloud Computing · Artificial Intelligence · Cloud Engineer · Bash · Python · CUDA · GCP · AWS

Requirements

• 3+ years of professional experience in infrastructure, platform, or cloud engineering (ML infrastructure experience strongly preferred).
• Hands-on experience with GPU compute and performance debugging (CUDA/NCCL concepts, GPU utilization, networking bottlenecks, profiling).
• Strong experience operating cloud environments (AWS, GCP, or Azure), including networking, IAM, and cost management.
• Proficiency with containers and orchestration (Docker, Kubernetes) and infrastructure-as-code (Terraform).
• Strong scripting and automation skills (Python plus Bash/PowerShell).
• Familiarity with distributed training and modern ML stacks (PyTorch; DDP/FSDP or comparable).
• Experience with monitoring and observability tooling (Prometheus/Grafana, OpenTelemetry, ELK, or similar).
• Experience building CI/CD for infra and ML workflows (e.g., CircleCI, GitHub Actions).

Responsibilities

• Own and evolve the ML + cloud infrastructure that enables training and evaluation of massive foundation models.
• Design and operate GPU clusters: Provision, scale, and maintain multi-node, multi-GPU training environments (on cloud and/or on-prem), including scheduling, quotas, and capacity planning.
• Distributed training enablement: Support high-throughput training stacks (e.g., PyTorch DDP/FSDP, NCCL) and ensure performance, stability, and reproducibility across large runs.
• Storage and data throughput: Build and optimize storage systems and networking for petabyte-scale datasets and high-bandwidth training (object storage, NVMe, shared filesystems, caching, data locality).
• Containerization and orchestration: Package and deploy workloads with Docker and Kubernetes (or comparable systems); maintain infrastructure-as-code (Terraform) and reliable release processes.
• Observability and reliability: Implement monitoring, logging, and alerting for cluster health, job performance, and cost; define SLOs and on-call/incident response practices.
• Security and access: Manage secrets, IAM, and secure network boundaries for research and production systems.
• Collaboration: Partner closely with ML researchers and engineers to unblock training, iterate on tooling, and improve developer experience.
• Production pathways: Support model evaluation and serving infrastructure where needed, and ensure smooth transitions from research to deployable systems.
