Vultr - AI Cluster Architect
Requirements
• 7+ years designing or building large-scale HPC, AI, or hyperscale GPU clusters.
• Expert understanding of GPU and accelerator system design, including node topology, PCIe/NVLink/NVSwitch/ROCm, and NIC-to-GPU affinity considerations.
• Strong familiarity with InfiniBand, RoCE, and SpectrumX networking, including multi-tier, multi-plane, Clos/dragonfly variants, and large-radix switch design.
• Demonstrated experience modeling power draw and thermal characteristics of servers, GPUs, NICs, switches, optics, and storage systems.
• Ability to design networks that maintain full non-blocking performance or intentionally introduce over/under-subscription while understanding impacts on workload performance.
• Proven ability to gather and analyze vendor SKU-level specifications and incorporate them into scalable cluster architectures.
• Experience balancing customer-driven requirements for compute, storage, and service density in combination with overall GPU count.
• Strong documentation, communication, and cross-functional collaboration skills.
Responsibilities
• Architect large-scale GPU clusters within fixed site power budgets that optimize for maximum GPU density while reserving necessary headroom for compute services, storage, and networking.
• Model and validate power consumption across the full cluster bill of materials (GPUs, CPUs, NICs, switches, fabric components, storage, and facility limits).
• Evaluate tradeoffs across multiple fabric networking architectures (InfiniBand, RoCE, SpectrumX) as well as multi-plane, 2-tier/3-tier, and rail-optimized topologies.
• Determine network scale limits based on switch radix, link speed, topology, and blocking requirements.
• Gather, interpret, and maintain detailed SKU-level power and thermal specifications for GPUs, NICs, switches, DPUs, storage, and server platforms.
• Develop power-aware cluster configuration templates and capacity-planning models that can scale across sites with varying constraints and allow for quick iteration and ideation.
• Document architecture, design choices, tradeoff analyses, and operational considerations for deployment and lifecycle management.
• Provide guidance on future-proofing, including the ability to incorporate next-gen GPUs, NICs, or fabrics.
• Collaborate with vendors on novel fabric architectures that enable large-scale cluster deployments (100k+ GPUs).
Benefits
• Excellent medical benefits with company-paid premiums for the employee-only plan, plus dental and vision coverage.
• Company match of up to 4% of 401(k) contributions with immediate vesting.
• Professional development reimbursement of $2,500 each year.
• Paid time off accrual, including holidays and a rollover plan; birthday off is also included.
• Increased PTO at the 3-year and 10-year anniversaries, an additional one-month paid sabbatical every five years, and an annual anniversary bonus (benefits not quantified).
• $500 first-year remote office setup, plus a subsequent $400 each for new equipment.
• Internet reimbursement of up to $75 per month.
• Gym membership reimbursement of up to $50 per month.
• Company-paid Wellable subscription (benefit not quantified).