Runpod, Inc. - Manager, HPC Storage Engineer

Remote - USA · $150k - $240k + Equity · 4mo ago

Requirements

• Engineering Leadership Experience: 3+ years managing storage, systems, or infrastructure engineering teams in production environments.
• Distributed Storage Expertise: 8+ years designing and operating large-scale storage systems, including SAN and NFS architectures at multi-petabyte scale.
• VAST Data Experience: Hands-on experience deploying, operating, or deeply integrating VAST Data in production environments is required.
• Parallel Filesystems: Experience with Lustre or comparable HPC filesystems (e.g., GPFS, BeeGFS) supporting high-concurrency workloads.
• Low-Level Storage Knowledge: Deep understanding of NAND, NVMe, PCIe, storage controllers, and performance characteristics across the stack.
• High-Performance Data Paths: Proven experience with NFS over RDMA, RDMA-capable transports, or similar technologies. Familiarity with GPU Direct Storage is strongly preferred.
• Linux Systems Expertise: Strong Linux internals knowledge, including filesystems, I/O scheduling, memory management, and tuning for performance workloads.
• Operational Excellence: Experience running 24/7 storage platforms with strong incident response, change management, and post-mortem discipline.
• Communication & Leadership: Ability to clearly communicate complex technical tradeoffs and lead teams through high-stakes infrastructure decisions.
• Successful completion of a background check.
• Experience supporting AI training pipelines, large-scale model checkpointing, and dataset streaming workloads.
• Familiarity with RDMA fabrics and close collaboration with datacenter networking teams.
• Experience designing storage systems for multi-tenant isolation and secure data access.
• Background in hyperscale, HPC, or AI-focused infrastructure environments.
• Experience building internal storage platforms or abstractions consumed by product teams.

What You’ll Receive

• The competitive base pay for this position ranges from $150,000 to $240,000 USD. This salary range may be inclusive of several career levels at Runpod and will be narrowed during the interview process based on a number of factors, including the candidate’s experience, qualifications, and location.
• Meaningful equity in a fast-growing company: everyone on the team receives stock options, so your impact drives our growth and you share in the upside.
• Generous medical, dental & vision plans: we cover 100% for all employees and partial for dependents.
• Flexible PTO: take the time you need to recharge.
• Most roles are remote-first, with inclusive, collaborative teams using Slack as the main form of internal communication.

Responsibilities

• Own Distributed Storage Architecture: Define, evolve, and operate Runpod’s global storage platforms, supporting training, inference, checkpointing, and dataset access at scale.
• Build the Storage Engineering Team: Manage and grow a team of storage and systems engineers. Set clear ownership, technical direction, and operational standards across regions.
• High-Performance Shared Filesystems: Design and operate large-scale SAN and NFS deployments, including performance-sensitive shared storage for GPU clusters.
• Advanced Filesystems & Platforms: Lead deployments and operations of VAST Data, drawing on experience with Lustre or similar parallel filesystems used in HPC and AI environments.
• End-to-End Performance Ownership: Drive performance optimization from NAND and NVMe media through controllers, networking, and client access patterns.
• Next-Generation Storage Technologies: Evaluate and deploy cutting-edge capabilities such as NFS over RDMA, GPU Direct Storage (GDS), and low-latency data paths for accelerated workloads.
• Reliability & Scale: Establish best practices for replication, data tiering, data protection, failure recovery, capacity planning, and lifecycle management.
• Automation & Observability: Build automation for provisioning, expansion, upgrades, and monitoring. Ensure deep observability into throughput, latency, and error characteristics.
• Cross-Functional Collaboration: Partner with Datacenter Networking, GPU Platform, SRE, and Product teams to ensure storage systems meet evolving workload and customer needs.
• Vendor & Partner Management: Own technical relationships with storage vendors, hardware partners, and colocation providers; drive roadmap alignment and issue resolution.
