Senior DevOps & Infrastructure Engineer
Requirements
• You are an infrastructure owner, not a dashboard watcher. You don’t wait for tickets: you proactively find bottlenecks, measure them, fix them, and prove the gains. You ship improvements that compound.
• You care about tail latencies and failure modes. You think in SLOs, load patterns, saturation curves, and blast radius. You design for the real world: retries, backpressure, partial failures, and noisy neighbors.
• You love performance. You enjoy turning “slow and expensive” into “fast and efficient.” You benchmark, profile, tune, and iterate.
• You can operate autonomously. You are comfortable making high-stakes engineering decisions with good judgment, and communicating tradeoffs clearly to the team.

You’ll own and evolve HUD’s infrastructure so it is:
• Extremely performant (fast sandbox provisioning, fast cold starts, low tail latency, high throughput)
• Extremely reliable (predictable behavior, graceful failure, robust scaling, low operational risk)
• Operationally excellent (systems that scale, clear SLOs, deep observability, incident readiness, cost discipline)
• Secure and compliant (SOC 2-aligned practices, strong security posture by default)

What you’ll work on
• Own our AWS + EKS-based sandbox platform that runs Dockerized workloads for customers and internal teams.
• Optimize the sandbox lifecycle end-to-end: provisioning, scheduling, image pulls, startup, execution, teardown, and caching.
• Design for massive parallelism while maintaining reliability, fairness, and predictable performance.

Kubernetes + AWS excellence
• Evolve our cluster architecture: node groups, autoscaling strategies, spot/on-demand mixes, scheduling policies, and workload isolation.
• Build safe-by-default patterns: quotas, resource limits, network policies, pod security, secrets management, and guardrails.
• Improve cluster resiliency and operational ergonomics (upgrades, rollouts, disaster recovery, fail-safes).
Cross-stack DevOps ownership
• Address infrastructure bottlenecks as we scale.
• Improve developer experience for internal teams: safer deploys, better CI/CD, smoother local/dev workflows, faster iteration.
• Provide architectural input and raise the infra maturity of the team via docs, patterns, and coaching.
• Interface with our backend/workers (Railway), frontend (Vercel/Next.js), and data (Supabase/Postgres) to ensure the whole system is cohesive.

Performance engineering and ruthless measurement
• Establish “infra product metrics” and instrument everything: P50/P95/P99 sandbox startup times, queue times, job success rates, noisy-neighbor rates, image pull latencies, cluster saturation, and cost-per-run.
• Build benchmarking harnesses for sandboxes and workloads to track regressions and validate improvements.
• Treat efficiency as a first-class metric: optimize utilization without sacrificing latency or reliability.

Observability + incident readiness
• Implement gold-standard observability across logs/metrics/traces, with actionable dashboards and alerting tied to SLOs.
• Create runbooks, incident processes, and a postmortem culture that meaningfully improve the system each time.

What we’re looking for
• Deep AWS experience, including operating production systems at scale (networking, IAM, compute, storage, observability, cost).
• Strong Kubernetes/EKS experience: cluster design, workload isolation, autoscaling (cluster + pod), upgrades, reliability practices.
• Excellent Docker + container runtime knowledge: image optimization, build pipelines, caching strategies, and runtime security considerations.
• Systems-level competence: Linux fundamentals, networking, performance debugging, resource contention, concurrency basics.
• Infrastructure automation: strong ability to implement infrastructure as code (Terraform/CDK/CloudFormation) and repeatable environments.
• Observability expertise: metrics/logging/tracing design, SLOs/SLIs, alerting that avoids noise and catches real issues.
• Security + compliance mindset: experience working in SOC 2-aligned environments; ability to implement least privilege, auditability, and operational controls.
• Strong engineering communication: can write clear docs, propose designs, and upskill the team.

Nice to have
• Experience building ephemeral compute / sandbox / job execution platforms (multi-tenant, Dockerized workloads, queueing, isolation).
• Proven wins reducing cold start / startup time and improving P95/P99 latency for infra-critical paths.
• Deep familiarity with:
  • Karpenter / Cluster Autoscaler, HPA/VPA, pod scheduling strategies, priority classes, taints/tolerations, topology spread constraints
  • Container performance: image layering, registry optimization, pull-through caches, snapshotters, prewarming strategies
  • Service mesh / networking (where appropriate), network policies, ingress design, egress controls
• Experience migrating from mixed hosting providers to a more cohesive platform architecture.
• Experience with CI/CD at high velocity (safe deploys, progressive delivery, canaries, rollbacks).
• Experience with GPU infrastructure and orchestration (if applicable to workloads).
• Security depth beyond basics: threat modeling, hardening, secure supply chain for containers, audit-readiness workflows.
• Ability to contribute across the stack:
  • Python (our SDK and backend systems) and Next.js/TypeScript, enough to collaborate effectively with other engineers.
  • Strong fluency with AI coding tools (using them to accelerate debugging, automation, and implementation without sacrificing correctness).

What success looks like
• Sandbox startup times drop dramatically and stay low as load increases.
• Reliability improves: fewer failed runs, better isolation, clearer error modes, faster recovery.
• Costs become intentional and explainable, with clear cost-per-run and utilization targets.
• Internal teams feel the difference: faster iteration, fewer footguns, smoother deployments.
• The organization gains durable infra patterns, not just one-off fixes.

Why you’ll love it here
• Hard problems with real impact: your work directly shapes the product experience for cutting-edge AI teams.
• High-caliber peers: teammates who value clarity, rigor, and craft.
• Meaningful ownership: you’ll own critical infrastructure and set the standard for how we operate at scale.

• Locations: San Francisco / Singapore
• Type: Full-time, In-Person
• Visa/Relocation: Available for strong candidates (US/Singapore)
• Compensation: $160,000-$280,000 salary, meaningful equity, full healthcare, daily team meals.