Deepgram - Systems Architect, AI/ML Infrastructure
Requirements
• 7+ years of experience in infrastructure engineering, systems architecture, or a senior technical role focused on large-scale infrastructure
• Proven experience designing multi-cloud architectures spanning AWS and at least one other major cloud provider or on-premises environment
• Deep expertise in storage system design -- block, object, and file storage, including performance tuning for large-scale data workloads
• Strong experience with compute orchestration using Kubernetes, and an understanding of how to schedule diverse workloads efficiently
• Hands-on experience with GPU infrastructure -- procurement considerations, cluster design, driver and runtime management
• Track record of capacity planning and infrastructure scaling for high-growth environments
• Ability to communicate complex architectural decisions clearly to both technical and non-technical stakeholders
• Strong understanding of networking fundamentals as they relate to infrastructure architecture (see our Network Engineer role for the deep specialist)

It Would Be Great If You Had
• Direct experience architecting infrastructure for ML training workloads -- distributed training, large dataset management, experiment infrastructure
• Background in cost optimization and FinOps practices for large-scale cloud and bare-metal infrastructure
• Experience operating and managing bare-metal infrastructure in colocation facilities
• Expertise in network architecture design, including high-bandwidth GPU interconnects and global traffic routing
• Experience with infrastructure modeling and simulation for capacity planning
• Familiarity with Slurm, Ray, or other HPC/ML job scheduling systems
• Understanding of power, cooling, and physical infrastructure considerations for GPU-dense deployments
Responsibilities
• Define and drive the end-to-end infrastructure architecture for Deepgram's AI/ML workloads across production inference and research training
• Design multi-cloud and hybrid infrastructure strategies that balance performance, reliability, cost, and vendor flexibility
• Architect compute orchestration systems that efficiently schedule and manage GPU and CPU workloads across heterogeneous infrastructure
• Design storage architectures that handle the massive datasets required for speech and audio ML -- from high-throughput training data pipelines to low-latency model serving
• Lead capacity planning across all infrastructure dimensions, modeling growth and ensuring Deepgram can scale ahead of demand
• Drive cost optimization and FinOps practices, identifying opportunities to reduce infrastructure spend without compromising performance or reliability
• Design burstable, elastic training infrastructure that can scale up for large training runs and scale down to minimize idle cost
• Architect research compute infrastructure that gives ML teams the resources they need while maintaining operational efficiency
• Establish architectural standards, design review processes, and technical documentation practices for infrastructure decisions
• Collaborate with engineering leadership to align infrastructure strategy with product roadmap and business objectives
• Evaluate emerging hardware, cloud services, and infrastructure technologies for potential adoption

You'll Love This Role If You
• Think in systems -- you naturally see the connections between compute, storage, and network, and how they interact under load
• Are motivated by designing infrastructure that operates at the intersection of real-time production systems and large-scale ML training
• Enjoy making architectural trade-offs where cost, performance, reliability, and velocity are all in tension
• Want to work across the full infrastructure stack -- from bare metal and GPUs to cloud services and container orchestration
• Are excited about building cost-effective, burstable infrastructure that enables world-class AI research
• Like operating at a strategic level while staying technically deep enough to validate designs and debug complex issues
Benefits
Holistic Health
• Annual wellness stipend
• Mental health support
• Life, STD, and LTD income insurance plans

Work/Life Blend
• Unlimited PTO
• Generous paid parental leave
• Flexible schedule
• 12 paid US company holidays
• Quarterly personal productivity stipend
• One-time stipend for home office upgrades
• 401(k) plan with company match
• Tax savings programs

Continuous Learning
• Learning/education stipend
• Participation in talks and conferences
• Employee Resource Groups
• AI enablement workshops and sessions

For candidates outside the US, we use an Employer of Record model in many countries, which means benefits are administered locally and governed by country-specific regulations. Because of this, benefits will differ by region -- in some cases international employees receive benefits US employees do not, and vice versa. As we scale, we will continue to evaluate where we can create more alignment, but a 1:1 global benefits structure is not always legally or operationally possible.

Backed by prominent investors including Y Combinator, Madrona, Tiger Global, Wing VC, and NVIDIA, Deepgram has raised over $215M in total funding. If you're looking to work on cutting-edge technology and make a significant impact in the AI industry, we'd love to hear from you!