Senior DevOps & Infrastructure Engineer

HUD · San Francisco, California, United States · $160k – $280k + Equity · 3w ago

Requirements

You are an infrastructure owner, not a dashboard watcher
You don’t wait for tickets—you proactively find bottlenecks, measure them, fix them, and prove the gains. You ship improvements that compound.
You care about tail latencies and failure modes
You think in SLOs, load patterns, saturation curves, and blast radius. You design for the real world: retries, backpressure, partial failures, and noisy neighbors.
You love performance
You enjoy turning “slow and expensive” into “fast and efficient.” You benchmark, profile, tune, and iterate.
You can operate autonomously
You are comfortable making high-stakes engineering decisions with good judgment, and communicating tradeoffs clearly to the team.
You'll own and evolve HUD’s infrastructure so it is:
Extremely performant (fast sandbox provisioning, fast cold starts, low tail latency, high throughput)
Extremely reliable (predictable behavior, graceful failure, robust scaling, low operational risk)
Operationally excellent (systems that scale, clear SLOs, deep observability, incident readiness, cost discipline)
Secure and compliant (SOC 2-aligned practices, strong security posture by default)
What you’ll work on
Own our AWS + EKS-based sandbox platform that runs Dockerized workloads for customers and internal teams.
Optimize sandbox lifecycle end-to-end: provisioning, scheduling, image pulls, startup, execution, teardown, and caching.
Design for massive parallelism while maintaining reliability, fairness, and predictable performance.
Kubernetes + AWS excellence
Evolve our cluster architecture: node groups, autoscaling strategies, spot/on-demand mixes, scheduling policies, and workload isolation.
Build safe-by-default patterns: quotas, resource limits, network policies, pod security, secrets management, and guardrails.
Improve cluster resiliency and operational ergonomics (upgrades, rollouts, disaster recovery, fail-safes).
Cross-stack DevOps ownership
Address infrastructure bottlenecks as we scale.
Improve developer experience for internal teams: safer deploys, better CI/CD, smoother local/dev workflows, faster iteration.
Provide architectural input and raise the infra maturity of the team via docs, patterns, and coaching.
Interface with our backend/workers (Railway), frontend (Vercel/Next.js), and data (Supabase/Postgres) to ensure the whole system is cohesive.
Performance engineering and ruthless measurement
Establish “infra product metrics” and instrument everything: P50/P95/P99 sandbox startup times, queue times, job success rates, noisy-neighbor rates, image pull latencies, cluster saturation, and cost-per-run.
Build benchmarking harnesses for sandboxes and workloads to track regressions and validate improvements.
Treat efficiency as a first-class metric: optimize utilization without sacrificing latency or reliability.
Observability + incident readiness
Implement gold-standard observability across logs/metrics/traces with actionable dashboards and alerting tied to SLOs.
Create runbooks, incident processes, and a postmortem culture that meaningfully improve the system each time.
Qualifications
Deep AWS experience, including operating production systems at scale (networking, IAM, compute, storage, observability, cost).
Strong Kubernetes/EKS experience: cluster design, workload isolation, autoscaling (cluster + pod), upgrades, reliability practices.
Excellent Docker + container runtime knowledge: image optimization, build pipelines, caching strategies, and runtime security considerations.
Systems-level competence: Linux fundamentals, networking, performance debugging, resource contention, concurrency basics.
Infrastructure automation: strong ability to implement infrastructure as code (Terraform/CDK/CloudFormation) and repeatable environments.
Observability expertise: metrics/logging/tracing design, SLOs/SLIs, alerting that avoids noise and catches real issues.
Security + compliance mindset: experience working in SOC 2-aligned environments; ability to implement least privilege, auditability, and operational controls.
Strong engineering communication: can write clear docs, propose designs, and upskill the team.
Experience building ephemeral compute / sandbox / job execution platforms (multi-tenant, Dockerized workloads, queueing, isolation).
Proven wins reducing cold-start and startup times and improving P95/P99 latency on infra-critical paths.
Deep familiarity with:
Karpenter / Cluster Autoscaler, HPA/VPA, pod scheduling strategies, priority classes, taints/tolerations, topology spread constraints
Container performance: image layering, registry optimization, pull-through caches, snapshotters, prewarming strategies
Service mesh / networking (where appropriate), network policies, ingress design, egress controls
Experience migrating from mixed hosting providers into a more cohesive platform architecture.
Experience with CI/CD at high velocity (safe deploys, progressive delivery, canaries, rollbacks).
Experience with GPU infrastructure and orchestration (if applicable to workloads).
Security depth beyond basics: threat modeling, hardening, secure supply chain for containers, audit-readiness workflows.
Ability to contribute across the stack:
Python (our SDK and backend systems) and Next.js/TypeScript, enough to collaborate effectively with other engineers.
Strong fluency with AI coding tools (using them to accelerate debugging, automation, and implementation without sacrificing correctness).
What success looks like
Sandbox startup times drop dramatically and stay low as load increases.
Reliability improves: fewer failed runs, better isolation, clearer error modes, faster recovery.
Costs become intentional and explainable, with clear cost-per-run and utilization targets.
Internal teams feel the difference: faster iteration, fewer footguns, smoother deployments.
The organization gains durable infra patterns, not just one-off fixes.
Why you’ll love it here
Hard problems with real impact: your work directly shapes the product experience for cutting-edge AI teams.
High-caliber peers: teammates who value clarity, rigor, and craft.
Meaningful ownership: you’ll own critical infrastructure and set the standard for how we operate at scale.
Locations: San Francisco / Singapore
Type: Full-time, In-Person
Visa/Relocation: Available for strong candidates (US/Singapore)
Compensation: $160,000-$280,000 salary, meaningful equity, full healthcare, daily team meals.

Responsibilities

Own DevOps, infrastructure, and architecture decisions as we scale.
Optimize sandbox lifecycle end-to-end: provisioning, scheduling, image pulls, startup, execution, teardown, and caching.
Design for massive parallelism while maintaining reliability, fairness, and predictable performance in a Kubernetes + AWS environment.
Evolve cluster architecture: node groups, autoscaling strategies, spot/on-demand mixes, scheduling policies, and workload isolation.
