Pro members applied to this job 36 hours before you saw itGet Pro ›

withwisdom - Staff Software Engineer, Infrastructure

Remote - USA2d ago

Remote Staff NA Cloud Computing Artificial Intelligence Staff Engineer Software Engineer Performance Management Mentoring AWS GCP Terraform Kubernetes Claude Go JavaScript TypeScript Python HIPAA Compliance

Requirements

• 8+ years running production systems, with a track record of operating at staff/principal scope — you've owned reliability for systems where downtime had real consequences and left them measurably better • You've operated at scale under pressure — services that had to stay up, incidents you led to resolution, and reliability practices you established that outlived your tenure and changed how teams worked • You multiply the people around you — your impact shows up in what others ship reliably, not only in what you touch directly. You've set standards, mentored engineers, and driven technical decisions across teams without needing the authority to mandate them • Deep AWS (or GCP) experience — you've deployed, operated, and debugged distributed services in production, and can reason from first principles when the runbook runs out • Strong with infrastructure as code (Terraform), containers and orchestration (ECS/Kubernetes), and CI/CD — the deploy path is yours to make boring • Hands-on production experience operating at least one major LLM API — OpenAI, Anthropic (Claude), or Google Vertex AI — with a focus on the operational reality: rate limits, retries, latency, cost, and what happens when the model misbehaves in a live system • Strong command of TypeScript/JavaScript — you can read and fix the application code, not just the infra around it; Python or Go a plus • Deep experience with relational databases — connection management, query performance, and reasoning about data integrity under load • You default to ownership and move toward the pager, not away from it • You're direct, intellectually honest, and collaborative — you surface bad news early, change your mind when the evidence warrants it, and write the postmortem that makes the whole team sharper • Experience operating LLM / agentic systems in production — or with frameworks like Mastra, LangChain, LlamaIndex, or CrewAI — where reliability, cost, and latency were yours to define and manage • Working knowledge of HIPAA compliance and what it means to run infrastructure responsibly in a healthcare context • Experience at a Series A or early-stage startup where you built the reliability function from scratch rather than inheriting one