Manager, Datacenter Network Engineering
Requirements
• Engineering Leadership Experience: 3+ years managing network or infrastructure engineering teams, with experience scaling teams and systems in production environments.
• Datacenter Networking Expertise: 8+ years designing and operating large-scale datacenter networks, including spine-leaf architectures, BGP-based routing, and high-throughput fabrics.
• Encapsulation & Overlays: Strong hands-on experience with VXLAN/EVPN or equivalent encapsulation protocols, including control-plane and data-plane considerations.
• High-Performance Networking: Proven experience with InfiniBand and/or RoCE, including congestion management, lossless Ethernet concepts, and performance tuning for GPU workloads.
• Global WAN Experience: Deep familiarity with global WAN technologies, including private backbone design, inter-region connectivity, routing policy, and traffic engineering.
• Linux & Network OS Fluency: Comfortable working with Linux-based systems, network operating systems, and automation tooling.
• Operational Excellence: Strong background in network observability, incident management, capacity forecasting, and change control.
• Communication & Leadership: Clear written and verbal communication skills, with the ability to align stakeholders and lead teams through complex technical challenges.
• Successful completion of a background check.
• Experience operating networks for GPU clusters, HPC environments, or AI/ML platforms.
• Familiarity with RDMA tuning, NCCL traffic patterns, and distributed training communication models.
• Experience with automation frameworks and network-as-code (e.g., Terraform, Ansible, internal tooling).
• Background in multi-region or multi-cloud networking architectures.
• Experience working in high-growth or hyperscale infrastructure environments.
What You’ll Receive
• The competitive base pay for this position ranges from $150,000 to $240,000. This salary range may include several career levels at Runpod and will be narrowed during the interview process based on a number of factors, including the candidate’s experience, qualifications, and location.
• Meaningful equity in a fast-growing company: everyone on the team receives stock options, so your impact drives our growth and you share in the upside.
• Generous medical, dental, and vision plans: we cover 100% for all employees and a portion for dependents.
• Flexible PTO: take the time you need to recharge.
• Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication.
Responsibilities
• Lead the Datacenter Networking Team: Manage and grow a team of network engineers responsible for datacenter fabrics, interconnects, and global WAN connectivity. Provide mentorship, technical guidance, and clear ownership boundaries.
• Own Datacenter Network Architecture: Define and evolve network designs for GPU-heavy clusters, including spine-leaf topologies, ECMP routing, and high-bandwidth east-west traffic patterns.
• High-Performance GPU Networking: Oversee design and operation of InfiniBand and RoCE-based fabrics supporting distributed training and inference workloads. Ensure performance, loss characteristics, and congestion control meet AI workload requirements.
• Encapsulation & Overlay Protocols: Guide implementation and operations of encapsulation technologies such as VXLAN, EVPN, Geneve, or similar, enabling scalable multi-tenant isolation and flexible network provisioning.
• Global WAN & Backbone Connectivity: Lead strategy and execution for global WAN connectivity, including private backbone links, IX connectivity, and hybrid connectivity with cloud providers and partners.
• Reliability & Operations: Establish operational best practices for monitoring, capacity planning, change management, incident response, and post-mortems across the network stack.
• Cross-Functional Collaboration: Partner closely with Infrastructure, SRE, Hardware, and Product Engineering teams to ensure network capabilities align with platform and customer requirements.
• Vendor & Partner Management: Work with hardware vendors, colocation providers, and transit partners on network design, procurement, deployment timelines, and escalations.
• Security & Segmentation: Ensure network designs support secure isolation, DDoS resilience, and compliance requirements without compromising performance.