FirstPrinciples

FirstPrinciples - Member of Technical Staff, DevOps / Infrastructure Engineering

Hybrid · 1mo ago
In Office · Staff · WW · Cloud Computing · Artificial Intelligence · DevOps Engineer · Chef · Go · Rust · Bash · Python · Records Management

Requirements

• Build and own comprehensive CI/CD pipelines for training workflows, evaluation jobs, internal tools, and services, with rollback capabilities, observability, and safety built in.
• Develop tooling for developer workflows, including reproducible builds, ephemeral environments, secrets management, and cluster resource allocation.
• Create self-service infrastructure patterns that empower researchers and engineers.
• Design infrastructure that accelerates experimentation while maintaining reliability and reproducibility.
• HPC & GPU Cluster Management:
  • Manage and extend HPC environments, including GPU clusters, InfiniBand networks, job schedulers (Slurm/Kubernetes hybrid), and container orchestration.
  • Operate containerized and scheduled workloads efficiently across Docker, Kubernetes, and Slurm environments.
  • Optimize cluster scheduling and resource allocation for high-performance GPU workloads.
  • Debug GPU driver quirks, Slurm job issues, and InfiniBand networking hiccups as they arise.
• Monitoring, Observability & Reliability:
  • Implement comprehensive monitoring, logging, and alerting across all infrastructure layers using Prometheus, Grafana, ELK/EFK, and OpenTelemetry.
  • Establish SLOs/SLIs for infrastructure reliability and create observability dashboards for long-horizon training runs.
  • Build observability stacks that provide visibility into both system health and job-level performance.
  • Proactively detect and resolve infrastructure issues before they impact research workflows.
• Security & Compliance:
  • Implement and manage secrets management and identity security solutions (Vault, KMS, IAM).
  • Champion security best practices, IAM policies, and compliance standards across hybrid infrastructure.
  • Design infrastructure with least-privilege principles and strong security hygiene from the start.
  • Maintain a zero-trust security posture and comprehensive auditing capabilities.
• Collaboration:
  • Partner closely with training engineers and researchers to translate research needs into robust infrastructure solutions.
  • Document best practices, create runbooks, and evangelize DevOps culture across the organization.
  • Mentor teammates on infrastructure patterns, automation techniques, and operational excellence.
  • Enable efficient pre-training runs and safe deployment of new infrastructure patterns through collaboration.
• Educational Background: Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
• Strong Unix/Linux systems background, including kernel tuning, networking, storage, and process control experience.
• Infrastructure-as-Code experience with Terraform, Pulumi, or CloudFormation.
• Expertise building CI/CD systems and reproducible build pipelines (GitHub Actions, GitLab CI, Jenkins, etc.).
• Hands-on experience with AWS (EC2, S3, IAM, VPC, etc.) and cloud infrastructure fundamentals.
• Cluster orchestration and job scheduling experience with Kubernetes and Slurm.
• Monitoring and observability stack experience (Prometheus, Grafana, ELK/EFK, OpenTelemetry).
• Demonstrated success scaling infrastructure for high-performance or GPU workloads.
• Track record of managing GPU-accelerated clusters or HPC infrastructure.
• Experience automating workflows to reduce toil and scaling deployments safely.
• Skills: Strong programming skills in at least one of Python, Go, or Rust, plus Bash fluency.
• Collaboration & Communication: Ability to work cross-functionally. Strong communicator who can simplify complex topics for diverse audiences.
• Mindset: Entrepreneurial and mission-driven, comfortable in a fast-growing, startup-style environment, and motivated by the ambition of tackling one of the greatest scientific challenges in history.
• Demonstrated passion for physics and for making scientific knowledge accessible and impactful.
• Prior work with HPC vendors or AI compute providers (Buzz HPC, NVIDIA DGX, Lambda, CoreWeave).
• Experience designing self-service infrastructure or internal developer platforms.
• Deep familiarity with GPU cluster management, scheduling, and high-throughput networking (InfiniBand).
• Cost management and optimization experience for large-scale compute infrastructure.
• Build system fluency and comfort with modern build tools (CMake, Bazel, Meson, Buck, Ninja).
• Experience supporting AI/ML research environments and training pipeline infrastructure.
• What Excites Us (beyond the technical qualifications, we're looking for someone who):
  • Thinks automation first - You reflexively reduce toil by codifying repeatable operations rather than clicking through UIs.
  • Build system love - Reproducibility and robust CI/CD excite you, not bore you. You're eager to build a state-of-the-art platform (your own Death Star) that researchers love using.
  • DevOps philosophy - You understand why DevOps exists and live and breathe the philosophy, not just use the tools.
  • HPC comfort - You can (or want to learn to) debug Slurm jobs, GPU driver quirks, or InfiniBand hiccups without blinking.
  • Cloud + HPC pragmatism - You know (or are eager to learn) when to use AWS primitives versus optimizing HPC schedulers.
  • Security from day one - You design infrastructure with least privilege and secrets management from the start, not as an afterthought.
  • Collaborative builder - You help mentor and elevate the team, not just build in isolation.
• Application Process:
  • Interested candidates are invited to submit their resume, a cover letter detailing their qualifications and vision for the role, and references. Please include "Member of Technical Staff, DevOps / Infrastructure Engineering" in the cover letter.

Responsibilities

• Infrastructure Architecture & Automation:
  • Design and run large-scale pre-training experiments for both dense and MoE architectures, from experiment planning through multi-week production runs.
  • Architect hybrid infrastructure solutions that span cloud and on-premises HPC environments seamlessly.
  • Automate configuration management and drift detection using tools like Ansible, Salt, or Chef.
  • Build systems that reduce operational toil and establish guardrails that let researchers focus on experiments, not operations.
