The Voleon Group - Senior Cluster Site Reliability Engineer

Berkeley, California, United States - Hybrid · $205k - $235k · 1mo ago

In Office · Senior · NA · Cloud Computing · Artificial Intelligence · Site Reliability Engineer · Tech Lead · AWS · GCP · MLflow · Ruby · Kubeflow

Requirements

• 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead
• Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)
• Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)
• Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)
• Experience with cloud infrastructure (AWS or GCP)
• Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
• Experience with distributed storage technologies (Lustre, Ceph, S3)
• Embodies a "system engineer" rather than "system administrator" mindset, thinking systematically and leveraging automation
• Bachelor's degree in computer science
• Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark)
• Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed)
• Familiarity with hybrid/on-prem environments
• Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments
• Experience with HPC networking (InfiniBand, RDMA)
• Solid security/IAM foundations (identity management systems, AWS/GCP IAM, Zero Trust)

“Friends of Voleon” Candidate Referral Program

If you have a great candidate in mind for this role, you could earn $15,000 if your referred candidate is successfully hired and employed by The Voleon Group. Please use this form to submit your referral. For details on eligibility, terms, and conditions, please review the Voleon Referral Bonus Program.

Responsibilities

• Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise
• Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability
• Diagnose systemic or recurring patterns of problems, and engineer precise solutions to them in collaboration with engineering teams
• Develop robust metrics and observability for cluster health, and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do
• Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for those policies
• Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability

Benefits

• Base Salary $205K – $235K
• Offers Bonus
• The listed base salary range for this position is based upon the location(s) of this posting. Individual salaries are determined through a variety of factors, including, but not limited to, education, experience, knowledge, skills, and geography. Base salary does not include other forms of total compensation such as bonus compensation and other benefits.
• Our benefits package includes medical, dental, and vision coverage, life and AD&D insurance, 20 days of paid time off, 9 sick days, and a 401(k) plan with a company match.
