HavocAI - Senior Site Reliability Engineer

Remote - USA · 4d ago

Remote, Senior, NA, Cloud Computing, Robotics, Site Reliability Engineer, Go, Linux, Kubernetes, Python, AWS, Prometheus, Terraform, ELK, Grafana, Datadog, Pulumi, Observable, Change Management

Requirements

• 7+ years of experience in SRE, infrastructure engineering, systems engineering, or related roles
• Strong experience operating large-scale distributed production systems
• Deep understanding of Linux systems, networking, cloud infrastructure, and distributed systems fundamentals
• Hands-on experience with Kubernetes and container orchestration
• Programming or scripting experience in Go, Python, or similar languages
• Experience designing and operating observability systems for production environments
• Proven ability to lead incident response and drive reliability improvements
• Strong communication skills and ability to collaborate across engineering teams
• Ability to operate calmly and effectively under pressure
• Must be a U.S. Citizen and eligible to obtain a U.S. Government security clearance if required
• Experience supporting autonomy, robotics, simulation, real-time systems, or data-intensive platforms
• Familiarity with AWS and large-scale cloud infrastructure
• Experience with chaos engineering, fault injection, or resilience testing
• Knowledge of CI/CD systems and progressive delivery practices
• Experience working in high-reliability, safety-critical, defense, or mission-critical environments
• Experience with Infrastructure as Code tools such as Terraform or Pulumi
• Experience with Prometheus, Grafana, OpenTelemetry, Datadog, ELK/OpenSearch, or similar observability tools

Responsibilities

RELIABILITY ENGINEERING & ARCHITECTURE
• Design and evolve reliability architecture for distributed and cloud-hosted systems
• Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning
• Partner with platform and application teams to design systems for reliability, scalability, and operability
• Identify and mitigate systemic reliability risks across infrastructure, applications, services, and data pipelines
• Establish reliability patterns that support autonomy, simulation, and mission-critical cloud workloads

OPERATIONS & INCIDENT MANAGEMENT
• Lead incident response processes, including on-call rotations, escalation paths, and post-incident reviews
• Conduct root cause analysis for complex production incidents and drive long-term corrective actions
• Improve operational readiness through runbooks, automation, resilience testing, and production-readiness reviews
• Reduce operational toil through tooling, automation, and process improvements
• Help build a culture of ownership, accountability, and continuous improvement across production systems

OBSERVABILITY & PERFORMANCE
• Design, implement, and maintain observability systems for metrics, logging, tracing, alerting, and service health
• Ensure services and data pipelines are observable, debuggable, and performant in production
• Drive performance analysis and tuning across infrastructure, application, and service layers
• Improve alert quality, reduce noise, and ensure operational signals are actionable
• Partner with engineering teams to define meaningful reliability and performance metrics

AUTOMATION & PLATFORM COLLABORATION
• Build automation to improve system reliability, deployment safety, and recovery processes
• Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns
• Support and improve Kubernetes-based environments and containerized workloads
• Contribute to infrastructure-as-code practices and platform automation
• Help define operational standards for cloud infrastructure, deployment workflows, and production services

SECURITY & RESILIENCE
• Collaborate with security teams to ensure secure and resilient system design
• Participate in disaster recovery planning, backup strategy, and resilience testing
• Maintain strong operational practices around access control, secrets management, change management, and production access
• Support secure operations for systems that may serve defense, autonomy, or mission-sensitive use cases

A successful Senior Site Reliability Engineer at HavocAI will raise the reliability, performance, and operational maturity of the systems that support our autonomy and cloud platform work.

You will help ensure that mission-critical services are observable, resilient, scalable, and recoverable. You will bring structure to incident response, reduce operational toil, improve deployment safety, and partner with engineering teams to design systems that can handle real-world operational demands.

This role is ideal for someone who combines deep systems expertise with strong ownership, practical judgment, and a bias toward building durable solutions.

Benefits

• 100% employer-paid health, dental, and vision insurance for you and your family
• Life insurance (employer paid)
• Ability to participate in the company's 401(k) program (matching)
• Unlimited PTO policy with an enforced two-week minimum
• Work / home office stipend
• 16-week paid parental leave
• Monthly health and wellness stipend

• Innovation: We are driven to break new ground. Every day presents an opportunity to challenge the status quo, think boldly, and deliver advanced solutions that transform the future of defense technology.
• Integrity: We hold ourselves to the highest ethical standards, ensuring transparency, accountability, and trust in all our actions and partnerships.
• Mission-Driven: We are focused on achieving impactful outcomes that align with our core mission: protecting lives through innovation.
• Forward-Leaning: We continuously seek out new opportunities and remain at the forefront of technological advancements. We embrace change and anticipate the challenges of tomorrow with confidence and creativity.
• Ownership of All Tasks: At HavocAI, no problem is too complex or too trivial. We believe that greatness comes from tackling the hardest challenges, but also from handling the smallest, sometimes thankless, tasks with the same level of commitment and care.
• Servant Leadership: We lead by serving others, whether it's supporting our employees, partners, or the broader community. Empowering those around us is key to achieving long-term success and making a lasting impact.