PostHog - Site Reliability Engineer

Remote - USA · 3w ago

Remote NA Cloud Computing Site Reliability Engineer AWS Account Management Kubernetes Terraform Linux

Requirements

• We’re looking for people who like deep ownership of production systems, who are not afraid of working with stateful infrastructure, and who love working in AWS, VMs, automation, and making messy systems reliable.
• In general we seek SREs who are:
  • Enthusiastic drivers. We need proactive people who can fully own projects and get them done, and know to get help when needed. "Are we there yet?" is the wrong question.
  • Optimistic problem solvers. Things get hard here sometimes, whether it's scaling, shipping complex products, handling a stream of support requests, or trying to ship something that touches multiple teams. We need people who won't get disheartened, and will collaborate, iterate, and ship their way out of anything.
  • Grown-ups. We’re an international bunch of weirdos, but one thing unites us: everyone is kind, considerate, and professional towards each other. This isn't about age or experience, it's about being low-ego, flexible, and respectful.
  • Genuine builders. PostHog is full of people who just love building stuff, people who would still be building software even if there wasn't a paycheck at the end. If this sounds like you, we should talk.
• Deep hands-on experience with Kubernetes in production (EKS preferred). You've debugged node pressure, networking issues, and deployment failures at scale (thousands of nodes).
• Strong experience operating production infrastructure on AWS. Not just one account, but understanding organizational boundaries, IAM, and networking between many accounts.
• Experience automating infrastructure using Terraform or Terragrunt at scale, including module design and state management
• Solid understanding of Linux systems (disk, memory, networking, failure modes)
• Experience supporting stateful systems (databases, queues, storage systems, etc.)
• Ability to debug and reason about performance and reliability issues in production
• You're comfortable owning systems end-to-end, including on-call responsibilities
• You don't need to be an expert in every system we run on day one. But you do need to enjoy owning complex infrastructure and learning how the pieces fit together.
• Experience with GitOps workflows (ArgoCD) and CI/CD pipelines (GitHub Actions)
• Experience building AI agent-enabled base-level infra services for teams that move fast
• Familiarity with multi-region infrastructure and the consistency/availability tradeoffs that come with it
• We are committed to ensuring a fair and accessible interview process. If you need any accommodations or adjustments, please let us know.

Responsibilities

• You won’t be in a typical “keep the lights on” SRE role. The work is about turning a fast-growing, stateful system into a predictable, well-automated platform (provisioning, scaling, rebalancing, recovery).
• That means reducing operational stress, designing safe automation for traffic-heavy workloads, and building the tooling and patterns that let the system scale without scaling human effort.
• You'll work on the kind of problems that only show up at large scale (petabytes of data, thousands of cores, constant ingestion) across a multi-region, multi-account AWS platform running many services on Kubernetes.
• Operating EKS clusters across several environments with Karpenter autoscaling, Cilium networking, and ArgoCD-driven GitOps deployments
• Managing and evolving a multi-account AWS organization: provisioning, networking, access control, and cross-account connectivity
• Maintaining the Terraform/Terragrunt IaC platform - modules, automated plan-on-PR / apply-on-merge pipelines, and safe patterns for shared infrastructure
• Improving operational tooling around deploys, schema changes, backups, restores, and incident response
• Reducing operational load by identifying repeat pain points and eliminating them through code and self-healing automation
• Optimizing cloud spend as you go
• Participating in on-call and incident response, with a strong focus on making incidents rarer over time
• You'll have room to design and automate, not just respond to alerts. You should join this team if you like deep ownership of production systems and enjoy building the platform layer that everything else runs on.
