Perplexity - Member of Technical Staff (AI Infrastructure Engineer)
Requirements
• Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
• Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
• Experience with deploying and managing distributed training systems at scale
• Deep understanding of container orchestration and distributed systems architecture
• High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi/Grouped-Query, distributed training strategies)
• Experience managing GPU clusters and optimizing compute resource utilization
• Expert-level Kubernetes administration and YAML configuration management
• Proficiency with Slurm job scheduling, resource management, and cluster configuration
• Python and C++ programming with focus on systems and infrastructure automation
• Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
• Strong understanding of networking, storage, and compute resource management for ML workloads
• Experience developing APIs and managing distributed systems for both batch and real-time workloads
• Solid debugging and monitoring skills with expertise in observability tools for containerized environments
• Experience with Kubernetes operators and custom controllers for ML workloads
• Advanced Slurm administration including multi-cluster federation and advanced scheduling policies
• Familiarity with GPU cluster management and CUDA optimization
• Experience with other ML frameworks like TensorFlow or distributed training libraries
• Background in HPC environments, parallel computing, and high-performance networking
• Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
• Experience with container registries, image optimization, and multi-stage builds for ML workloads
• Demonstrated experience managing large-scale Kubernetes deployments in production environments
• Proven track record with Slurm cluster administration and HPC workload management
• Previous roles in SRE, DevOps, or Platform Engineering with focus on ML infrastructure
• Experience supporting both long-running training jobs and high-availability inference services
• Ideally, 3-5 years of relevant experience in ML systems deployment with specific focus on cluster orchestration and resource management
Responsibilities
• Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
• Manage and optimize Slurm-based HPC environments for distributed training of large language models
• Develop robust APIs and orchestration systems for both training pipelines and inference services
• Implement resource scheduling and job management systems across heterogeneous compute environments
• Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
• Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
• Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
• Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands