signal-ai - Site Reliability Engineer
Requirements
What we've shipped recently
• Cut ~$50k/year off our Elasticsearch bill by migrating compute to more efficient chips. (Apr 2026)
• Built the foundation for our MCP server platform, leveraging and contributing to open-source tooling to give the whole company extensible, production-grade AI integrations. (2025–2026)
• Rebuilt production from scratch in a full DR gameday. End-to-end restore validated across our multi-account AWS setup. (Jan 2026)

What we're working on next
• AI-augmented operations: Claude Enterprise is deployed across Signal. We want this team to help define what good looks like for SRE: incident triage, runbook generation, capacity planning, cost analysis. This is a strategic investment, not a side project, and we'd love someone genuinely curious about what these tools can and can't do.
• Acquisition integration: Bringing a recently acquired product's infrastructure under our reliability, security, and operational standards. A substantial, multi-quarter piece of work with real technical and organisational complexity, and plenty of room to make your mark.
• Batch workload consolidation: Moving disparate batch jobs onto EKS for unified scheduling, cost visibility, and operational tooling.

Your first six months
We want to set you up to thrive. Here's what that looks like in practice:
• Month 1: You're onboarded across our AWS estate, Terraform, and observability stack. You've completed your first on-call shift with support from the team, landed your first PR in the DevOps repo, and started working Claude Enterprise into your daily flow.
• Month 3: You're owning a workstream end-to-end. You've led the SRE response to at least one production incident and hosted your first post-mortem. You've surfaced a real opportunity and pushed it to a measurable result.
• Month 6: You're driving a multi-quarter workstream with clear direction, and you're contributing insights to our AI-in-operations playbook, including where Claude adds real leverage and where it doesn't.

• You have solid AWS and Terraform experience, and you're comfortable writing Python or Go to solve operational problems. You think in distributed systems (failure modes, observability, blast radius), and you take problems end-to-end rather than stopping at the edges of your own work.
• You're pragmatic about AI tooling. Not evangelical, not dismissive. You can tell us when you'd reach for an LLM and when you wouldn't, and you'd have a clear reason either way.
• You communicate openly and you're comfortable pushing back when you think something could be better. We want to leverage your experience and perspective to grow our platform.

We know not every strong candidate will have every skill on this list. If you're excited about the work and you're close on the experience, we'd encourage you to apply.
• Networking depth. You're comfortable below the load balancer: TCP/IP fundamentals, DNS, VPC design, and what actually happens when a service can't reach another one.
• Operational security instincts. You follow the threat landscape with genuine interest: not just CVEs, but shifts in how attacks happen and how the industry is responding. You have a point of view on what actually matters right now.
• Linux internals comfort. When something behaves strangely under load, you know where to look.
• Communication across technical levels. You can collaborate with your infrastructure teammates and explain the same concepts clearly to a product manager. You've worked alongside colleagues with a wide range of technical backgrounds and adapted naturally.

We're dedicated to creating an inclusive environment where every Signaller feels welcomed, valued, and heard, a place where you can truly thrive as yourself.