Orcrist Technologies - Infrastructure Engineer
Responsibilities
• Design, size, provision, and operate bare-metal GPU server fleets across on-prem and air-gapped environments (firmware/BIOS, BMC via Redfish/IPMI, OS, drivers) with zero-touch provisioning (PXE/iPXE, MAAS/Metal3/Tinkerbell) and automation (Ansible/Salt, Terraform/Pulumi).
• Own the NVIDIA GPU stack end to end: drivers, CUDA, GPU Operator, Container Toolkit, MIG, and DCGM, tuned for inference throughput, latency, and utilization.
• Build the bare-metal substrate Kubernetes runs on: node lifecycle, container runtime, GPU device plugins, node feature discovery, and kernel/NUMA tuning.
• Engineer data-center networking and resilient storage (VLANs/switching, RDMA, Ceph/ZFS/NVMe) sized to scale without replacing the core, with encryption at rest.
• Partner with ML and MLOps on on-prem inference serving (Triton, KServe, vLLM): model deployment, GPU scheduling and sharing, and performance tuning.
• Plan and run on-site build-outs: rack integration, power/UPS and cooling sizing, commissioning, capacity planning, runbooks, and operator handover.
Requirements
• 5+ years in bare-metal, HPC/GPU, data-center, or systems infrastructure engineering, with hands-on ownership of physical and compute infrastructure.
• Strong bare-metal Linux (RHEL/Rocky/Ubuntu): firmware, BMC, PXE, kernel and storage tuning, plus solid networking and storage fundamentals.
• Real experience with the NVIDIA GPU stack (drivers, CUDA, GPU Operator, MIG, DCGM) and serving GPU models in production.
• Comfortable operating in air-gapped or on-prem environments and traveling to customer sites for builds and deployments.
• Documentation-focused, methodical, and calm during hardware incidents.
• Eligible to work in Germany.
Nice‑to‑haves
• German language (B1+), NVIDIA DGX/HGX or Slurm experience, InfiniBand/RDMA fabrics, and inference optimization (TensorRT-LLM, vLLM, quantization).
• Certifications such as NVIDIA NCP-AIO, Red Hat RHCSA/RHCE, or CKA/CKS.
• Field-engineering experience and familiarity with secure or regulated deployment environments.
Benefits
• Modern architecture & stack.
• Remote‑first in Germany with occasional team events in Berlin.
• 30 days vacation.
• Direct impact on critical missions across private and public‑sector customers.