Senior DevOps & Infrastructure Engineer

HUD · San Francisco, California, United States · $160k – $280k + Equity · 3w ago
In Office · Senior · NA · Healthcare · Cloud Computing · Logistics · Infrastructure Engineer · Senior DevOps Engineer · AWS · Docker · Kubernetes · Coaching · Compound

Requirements

  • You are an infrastructure owner, not a dashboard watcher: you don’t wait for tickets. You proactively find bottlenecks, measure them, fix them, and prove the gains. You ship improvements that compound.
  • You care about tail latencies and failure modes: you think in SLOs, load patterns, saturation curves, and blast radius. You design for the real world: retries, backpressure, partial failures, and noisy neighbors.
  • You love performance: you enjoy turning “slow and expensive” into “fast and efficient.” You benchmark, profile, tune, and iterate.
  • You can operate autonomously: you are comfortable making high-stakes engineering decisions with good judgment, and communicating tradeoffs clearly to the team.
  • You'll own and evolve HUD’s infrastructure so it is:
  • Extremely performant (fast sandbox provisioning, fast cold starts, low tail latency, high throughput)
  • Extremely reliable (predictable behavior, graceful failure, robust scaling, low operational risk)
  • Operationally excellent (systems that scale, clear SLOs, deep observability, incident readiness, cost discipline)
  • Secure and compliant (SOC 2-aligned practices, strong security posture by default)

What you’ll work on

  • Own our AWS + EKS-based sandbox platform that runs Dockerized workloads for customers and internal teams.
  • Optimize sandbox lifecycle end-to-end: provisioning, scheduling, image pulls, startup, execution, teardown, and caching.
  • Design for massive parallelism while maintaining reliability, fairness, and predictable performance.
  • Kubernetes + AWS excellence:
  • Evolve our cluster architecture: node groups, autoscaling strategies, spot/on-demand mixes, scheduling policies, and workload isolation.
  • Build safe-by-default patterns: quotas, resource limits, network policies, pod security, secrets management, and guardrails.
  • Improve cluster resiliency and operational ergonomics (upgrades, rollouts, disaster recovery, fail-safes).
  • Cross-stack DevOps ownership:
  • Address infrastructure bottlenecks as we scale.
  • Improve developer experience for internal teams: safer deploys, better CI/CD, smoother local/dev workflows, faster iteration.
  • Provide architectural input and raise the infra maturity of the team via docs, patterns, and coaching.
  • Interface with our backend/workers (Railway), frontend (Vercel/Next.js), and data (Supabase/Postgres) to ensure the whole system is cohesive.
  • Performance engineering and ruthless measurement:
  • Establish “infra product metrics” and instrument everything: P50/P95/P99 sandbox startup times, queue times, job success rates, noisy-neighbor rates, image pull latencies, cluster saturation, and cost-per-run.
  • Build benchmarking harnesses for sandboxes and workloads to track regressions and validate improvements.
  • Treat efficiency as a first-class metric: optimize utilization without sacrificing latency or reliability.
  • Observability + incident readiness:
  • Implement gold-standard observability across logs/metrics/traces with actionable dashboards and alerting tied to SLOs.
  • Create runbooks, incident processes, and a postmortem culture that meaningfully improve the system after each incident.
  • Deep AWS experience, including operating production systems at scale (networking, IAM, compute, storage, observability, cost).
  • Strong Kubernetes/EKS experience: cluster design, workload isolation, autoscaling (cluster + pod), upgrades, reliability practices.
  • Excellent Docker + container runtime knowledge: image optimization, build pipelines, caching strategies, and runtime security considerations.
  • Systems-level competence: Linux fundamentals, networking, performance debugging, resource contention, concurrency basics.
  • Infrastructure automation: strong ability to implement infrastructure as code (Terraform/CDK/CloudFormation) and repeatable environments.
  • Observability expertise: metrics/logging/tracing design, SLOs/SLIs, alerting that avoids noise and catches real issues.
  • Security + compliance mindset: experience working in SOC 2-aligned environments; ability to implement least privilege, auditability, and operational controls.
  • Strong engineering communication: can write clear docs, propose designs, and upskill the team.
  • Experience building ephemeral compute / sandbox / job execution platforms (multi-tenant, Dockerized workloads, queueing, isolation).
  • Proven wins reducing cold start / startup time and improving P95/P99 latency for infra-critical paths.
  • Deep familiarity with:
  • Karpenter / Cluster Autoscaler, HPA/VPA, pod scheduling strategies, priority classes, taints/tolerations, topology spread constraints
  • Container performance: image layering, registry optimization, pull-through caches, snapshotters, prewarming strategies
  • Service mesh / networking (where appropriate), network policies, ingress design, egress controls
  • Experience migrating from mixed hosting providers into a more cohesive platform architecture.
  • Experience with CI/CD at high velocity (safe deploys, progressive delivery, canaries, rollbacks).
  • Experience with GPU infrastructure and orchestration (if applicable to workloads).
  • Security depth beyond basics: threat modeling, hardening, secure supply chain for containers, audit-readiness workflows.
  • Ability to contribute across the stack:
  • Python (our SDK and backend systems) and Next.js/TypeScript, enough to collaborate effectively with other engineers.
  • Strong fluency with AI coding tools (using them to accelerate debugging, automation, and implementation without sacrificing correctness).

What success looks like

  • Sandbox startup times drop dramatically and stay low as load increases.
  • Reliability improves: fewer failed runs, better isolation, clearer error modes, faster recovery.
  • Costs become intentional and explainable, with clear cost-per-run and utilization targets.
  • Internal teams feel the difference: faster iteration, fewer footguns, smoother deployments.
  • The organization gains durable infra patterns, not just one-off fixes.

Why you’ll love it here

  • Hard problems with real impact: your work directly shapes the product experience for cutting-edge AI teams.
  • High-caliber peers: teammates who value clarity, rigor, and craft.
  • Meaningful ownership: you’ll own critical infrastructure and set the standard for how we operate at scale.
  • Locations: San Francisco / Singapore
  • Type: Full-time, In-Person
  • Visa/Relocation: Available for strong candidates (US/Singapore)
  • Compensation: $160,000-$280,000 salary, meaningful equity, full healthcare, daily team meals.

Responsibilities

  • Own DevOps, infrastructure, and architecture decisions as we scale.
  • Optimize sandbox lifecycle end-to-end: provisioning, scheduling, image pulls, startup, execution, teardown, and caching.
  • Design for massive parallelism while maintaining reliability, fairness, and predictable performance in a Kubernetes + AWS environment.
  • Evolve cluster architecture with node groups, autoscaling strategies, spot/on-demand mixes, scheduling policies, and workload isolation.
