Andromeda Cluster

Andromeda Cluster - Senior Site Reliability Engineer - AI Infrastructure

Remote - San Francisco, California, United States · 2mo ago
Remote · Senior · NA · Site Reliability Engineer · Documentation · CUDA · Linux · Kubernetes · Bash

Requirements

• GPU Systems Expertise: Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent). You understand GPU memory hierarchies, ECC behavior, thermal throttling, and hardware failure modes from direct experience, not documentation.
• High-Performance Networking: Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training. You can diagnose why an all-reduce is slow, identify a degraded link in a fat-tree topology, and reason about congestion control at scale.
• Distributed Training & ML Frameworks: Working knowledge of how large training jobs actually run: NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar. You don't need to write the models, but you need to understand what's happening at the systems level when a 1,000-GPU training run stalls.
• Linux & Systems Internals: Expert-level Linux knowledge: kernel tuning, driver management (NVIDIA drivers, CUDA toolkit), cgroup/namespace internals, and performance profiling at the syscall and hardware level.
• Kubernetes & Orchestration: Strong experience running Kubernetes in production with GPU workloads, including device plugins, topology-aware scheduling, multi-cluster federation, and custom operators. Experience with Slurm or other HPC schedulers is equally valued.
• Automation & Software Engineering: Strong engineering skills in Python, Go, or Bash. You build production-grade tools and services, not just scripts. Infrastructure-as-Code proficiency (Terraform, Helm, Ansible, or equivalent).
• Observability & Monitoring: Hands-on experience building monitoring and alerting for GPU infrastructure: not just Prometheus/Grafana basics, but GPU-specific telemetry (DCGM, nvidia-smi, fabric manager metrics) integrated into actionable dashboards.
• Incident Management: Proven track record leading incident response for complex distributed systems where the failure could be in hardware, firmware, networking, drivers, orchestration, or application code, and you need to narrow it down fast.

Strong Candidates May Have

• Distributed Storage: Experience with high-performance parallel file systems (VAST, Weka, Lustre, GPFS) and the checkpoint I/O and data-loading bottlenecks that come with large training runs.
• Training Optimization: Experience profiling and optimizing distributed training performance: identifying stragglers, tuning collective communication strategies, improving MFU (Model FLOPs Utilization), and reducing idle GPU time across large runs.
• Cluster Buildout & Hardware: Experience with physical cluster design: rack layout, power/cooling constraints, network topology design, and hardware validation/burn-in at scale.
• Team Leadership: Experience leading or mentoring a team of infrastructure engineers. We're growing and need people who raise the bar for everyone around them.

Benefits

• This is a high-impact, senior builder’s role. You’ll have significant ownership and autonomy to shape how our systems run at a foundational level, working directly with customers and providers while architecting the infrastructure backbone for reliable, scalable AI compute. You’ll influence technical direction and help define what world-class AI infrastructure operations look like.
