
FirstPrinciples - AI & HPC Infrastructure Engineer

Ontario, Canada - Remote - Hybrid · 4w ago
In Office · NA · Cloud Computing · Artificial Intelligence · Infrastructure Engineer · AI Engineer · Linux · Kubernetes · AWS · GCP · Azure

Requirements

• Strong infrastructure builder with experience operating production, research, cloud, or high-performance compute systems
• Deeply comfortable with Linux administration, including debugging networking, storage, system services, permissions, performance issues, and node-level failures
• Experienced with Kubernetes in real environments, including cluster operations, deployments, networking, observability, scaling, and troubleshooting
• Comfortable working with cloud infrastructure on AWS, GCP, Azure, or equivalent platforms
• Familiar with infrastructure automation and configuration tools such as Terraform, Ansible, Helm, ArgoCD, GitOps workflows, or similar systems
• Experienced with GPU-heavy, compute-heavy, or HPC-style workloads, especially in environments involving AI, ML, research computing, or scientific workloads
• Able to work across bare metal and cloud environments, and interested in the practical tradeoffs between the two
• Comfortable reasoning about resource scheduling, cluster utilization, autoscaling, storage, networking, and observability for distributed workloads
• Practical and ownership-oriented; you can take ambiguous infrastructure needs and turn them into working systems
• Comfortable collaborating across disciplines, especially with researchers and engineers who may not think in infrastructure terms
• Able to operate independently as a senior or strong intermediate contributor, while knowing when to bring others into important technical decisions
• Motivated by building foundational systems that make ambitious technical and scientific work possible
• Hands-on experience with production-grade LLM inference and serving engines, such as vLLM, SGLang, or TensorRT
• Experience working at an AI company, ML infrastructure team, research lab, university compute environment, HPC center, or scientific computing organization
• Experience supporting model inference, model serving, distributed training, high-throughput batch workloads, or internal ML platforms
• Hands-on experience with Slurm or similar HPC schedulers, including job scheduling, resource allocation, queue management, and cluster configuration
• Experience operating GPU infrastructure, including NVIDIA drivers, CUDA, container runtimes, scheduling, utilization, and hardware failure modes
• Experience with RDMA, InfiniBand, high-performance networking, distributed filesystems (e.g., Lustre, BeeGFS), object storage, or storage systems for compute-heavy workloads
• Experience with Kubernetes operators, custom controllers, CRDs, or platform tooling for AI/ML workloads
• Experience with Prometheus, Grafana, Loki, OpenTelemetry, Datadog, or similar monitoring, logging, and observability tools
• Experience with container registries, image optimization, CI/CD systems, deployment pipelines, and secure software delivery
• Experience leading engineering operations or infrastructure efforts while remaining hands-on technically
• Familiarity with security, access control, secrets management, and reliability practices in production or research environments

Responsibilities

• Design, deploy, and operate Kubernetes infrastructure for AI inference, research, and engineering workloads
• Set up and manage GPU and HPC-style compute environments, including scheduling, utilization, job management, and node-level troubleshooting
• Work with systems such as Kubernetes, Slurm or similar schedulers, container runtimes, GPU drivers and libraries (e.g., CUDA), storage systems, and observability tools
• Build and manage Linux-based compute environments, including provisioning, networking, storage, monitoring, access control, and lifecycle management
• Help architect bare metal, cloud, and hybrid infrastructure across AWS, GCP, Azure, or equivalent platforms
• Own the reliability and operational health of infrastructure systems, including monitoring, alerting, incident response, capacity planning, and performance tuning
• Improve deployment workflows, automation, configuration management, secrets management, and infrastructure-as-code practices
• Partner with ML engineers, researchers, and software engineers to understand workload requirements and translate them into practical infrastructure designs
• Evaluate tradeoffs between managed cloud services, self-managed Kubernetes, HPC schedulers, bare metal deployments, and multi-cloud architectures
• Build tooling, documentation, runbooks, and operational practices that help the team move quickly without making infrastructure fragile or opaque
• Balance speed and robustness, knowing when to prototype quickly and when to harden systems for long-term use

Benefits

• We’re building the next generation of infrastructure for AI-driven scientific discovery, and we need someone who can help own the systems that make our research and inference workloads reliable, scalable, and fast.
• This role is about building and operating the compute foundation behind our AI Physicist: Kubernetes clusters, Linux systems, GPU infrastructure, cloud environments, HPC-style compute, deployment workflows, monitoring, and automation. As our workloads grow, we need infrastructure that can support both experimentation and production-like inference across cloud, bare metal, and hybrid environments.
• You’ll play a central role in shaping how we run compute at FirstPrinciples. That includes provisioning and managing clusters, improving reliability and observability, reducing operational toil, supporting researchers and engineers, and helping us make practical decisions about when to use managed cloud services, self-managed Kubernetes, Slurm-style systems, or owned hardware.
