Onebrief - Senior Site Reliability Engineer, Colorado Springs

Remote - Colorado Springs, Colorado, United States · $180k - $220k + Equity · 3w ago

Requirements

• We are hiring a Site Reliability Engineer to join our Infrastructure & Security team. You’ll work closely with fellow SREs, security, and customer success.
• You will be the first line of support for our mission-critical deployments, responsible for ensuring best-in-class service quality and issue resolution. You will work in both on-premises DoD environments and AWS cloud environments. Your lessons from the field will shape how our team works, from policy to implementation.
• In addition to working at the customer site, you will contribute directly to solutions that increase the stability, performance, and security of our deployments, and improve the overall experience of deploying and managing Onebrief on premises.
• You care deeply about reliability and treat it as a core feature of any application or platform, with a bias toward “reliability over novelty.” You think about infrastructure and operability as products to be automated, well-documented, and continuously improved, and you aim to leave systems easier to operate than you found them.
• You are equally comfortable leading a post-incident review or diving into a kubectl shell to triage a complex production issue. You don't just fix problems; you translate constraints and failure modes into clear, automated guardrails and scalable, resilient architecture. For you, robust monitoring, actionable alerting, and insightful runbooks are core parts of the engineering process, not afterthoughts.
• You mentor others, fostering a culture of blameless postmortems and proactive reliability. You collaborate naturally with application and platform teams, helping them move quickly but safely by building the tools, processes, and observability that make "fast recovery" a reality.
• Experience in DoD environments and compliance frameworks (RMF, STIGs, ICD 503).
• GitOps practices and toolchains.
• Security-minded design for sensitive environments.
• Experience designing and implementing meaningful SLIs/SLOs (including error budgets) for complex, distributed systems.
• Familiarity with on-prem virtualization (VMware, Proxmox, Nutanix, Hyper-V, etc.).
• Service mesh exposure (Istio, Linkerd).
• Relevant certifications (e.g., AWS DevOps Engineer, CKA/CKAD).
• Active Security+ or another DoD 8570.01-approved security credential, or the ability to obtain valid credentials within 3 months of employment.
• Notice to Third Party Recruitment Agencies: Please note that Onebrief does not accept unsolicited resumes from recruiters or employment agencies. In the absence of an executed Recruitment Services Agreement, there will be no obligation to any referral compensation or recruiter fee. In the event a recruiter or agency submits a resume or candidate without an agreement, Onebrief explicitly reserves the right to pursue and hire those candidate(s) without any financial obligation to the recruiter or agency. Any unsolicited resumes, including those submitted to hiring managers, shall be deemed the property of Onebrief.

Responsibilities

• You'll own the reliability, scalability, and security of the production application and/or platform. You will do this by:
• Implementing a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana). You won't just track metrics; you'll create the actionable insights and automated alerting that allow teams to identify and resolve issues before they impact users.
• Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Indicators (SLIs) and Service Level Objectives (SLOs), increasing trust internally and externally. You will be the organization's expert on what it means for our systems to be reliable and how to measure it.
• Leading Incident Response: Act as incident responder, and potentially incident commander, during critical incidents, and lead blameless post-mortems / After Action Reviews (AARs) that identify true root causes and drive automated, long-term solutions to prevent recurrence.
• Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments using Infrastructure-as-Code (Terraform, Ansible). You will embed security and compliance controls (RMF, STIGs) directly into this automation.
• Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation. You will partner with other teams to share best practices for air-gapped environments and support their readiness for production.

What We Look For

• An active Top Secret clearance.
• 5+ years in Platform, DevOps, or Site Reliability Engineering with an infrastructure and operations focus.
• Proven partner to DevOps/Platform and application teams; collaborates well across functions and shares context openly.
• A deep understanding of incident response processes, with experience conducting thorough root cause analyses and driving continuous improvement.

Technical expertise

• Infrastructure as Code: Terraform (or CloudFormation), Ansible.
• Containers and orchestration: Kubernetes design, deployment, and operations.
• CI/CD: experience building and maintaining pipelines (GitLab CI/CD, Jenkins, GitHub Actions).
• Scripting: proficiency with at least one of Python, Go, or Bash.
• Cloud: familiarity with AWS or AWS GovCloud.
• Observability: Grafana stack, ELK stack, or Datadog.
• Networking fundamentals: core protocols and secure configurations.

Benefits

• $180K – $220K, plus equity.
• Equity: Share in the company's success.
• Flexible Work Environment: Remote-first organization* with flexible work hours and unlimited PTO. (*Note that some roles are in-person, on-site with customers.)
• Comprehensive Health Coverage: Health, dental, vision, and life insurance.
• Retirement Plan: 401(k) plan with company match to secure your future.
• Parental Leave: 8 weeks at 100% regardless of state.
• Company Retreats: Annual company summit trips.

Please note: we have set up limits for applications for this role. It is in the Infrastructure & Security group, and the following limits apply to applications for all jobs within this group:
• Candidates may not apply more than 3 times in any 120-day span for any job in the Infrastructure & Security group.
• Candidates may not re-apply to the same role within 180 days if not presented with an offer.

• This role requires an active Top Secret clearance; SCI eligibility is a plus.
• Note: If relocation is required, Onebrief would support your relocation.
