Blink Health - Staff Site Reliability Engineer

Remote · 2mo ago

Remote · Staff · WW · Healthcare · Cloud Computing · Software · Site Reliability Engineer · Go · Bash · Linux · Python · React

Requirements

• Bachelor’s or Master’s degree in Computer Science or equivalent practical experience.
• 7+ years of experience in site reliability engineering, infrastructure engineering, or platform engineering roles, with demonstrated impact at scale.

Reliability & Troubleshooting
• Expert-level, methodical troubleshooting across the entire stack, from application to kernel to network.
• Strong command-line proficiency and deep expertise in Linux systems and operating system fundamentals.
• Advanced understanding of networking concepts including load balancing, proxies, DNS, TCP/IP, NAT, and service-to-service communication.

Software & Automation
• Experience working across multiple languages (e.g., Python, Go, Bash), and familiarity troubleshooting application stacks such as React or similar.
• Strong track record of automating repetitive and complex operational work to reduce toil and increase reliability.
• Ability to design and build internal tools (Python or Go) that standardize and scale engineering practices.
• Comfortable operating in an agile environment, with disciplined testing and quality practices.

Cloud & Platform Engineering
• Deep experience with cloud platforms (AWS preferred, GCP/Azure acceptable), particularly managed services and production-grade architectures.
• Strong expertise in Kubernetes and container orchestration (EKS, Helm), including lifecycle management and operational best practices.
• Proven experience designing and implementing observability systems, including metrics, logging, tracing, dashboards, and alerting.
• Deep understanding of container technologies, security scanning, secrets management, dynamic configuration, and microservices architectures.
• Familiarity with service meshes and advanced traffic management concepts.

Infrastructure as Code
• Experience designing and maintaining company-wide IaC codebases using tools such as Terraform, Pulumi, CloudFormation, or Ansible.
• Ability to think holistically about infrastructure design, cost, reliability, security, and long-term maintainability.

Responsibilities

• Establish and evolve SRE best practices across the organization, including reliability principles, error budgets, incident response, postmortems, and operational readiness standards.
• Define and drive observability strategy for system health, performance, and reliability, including SLIs/SLOs, alerting quality, dashboards, and service health indicators.
• Design and implement software-driven solutions within the infrastructure domain, automating manual processes and eliminating operational complexity and toil.
• Act as a technical leader and force multiplier, helping set priorities and influencing decision-making across core cloud infrastructure, reliability tooling, and platform architecture.
• Take ownership of large, ambiguous initiatives, driving them from concept to delivery while aligning stakeholders across engineering, security, and product.
• Combine deep knowledge of software development, infrastructure, and security to improve platform resilience, scalability, performance, and compliance.
• Proactively identify systemic risks and reliability gaps, recommending and leading platform upgrades and architectural improvements before they become incidents.
• Partner with engineering teams to improve developer workflows, tooling, and operational maturity, increasing productivity while reducing cognitive load.
• Provide technical mentorship, architecture guidance, and high-quality design and code reviews for engineers across infrastructure and product teams.
• Lead by example in documentation and knowledge sharing, ensuring systems and processes are well-understood and not dependent on individual ownership.
• Participate in and help mature incident response, escalation practices, and post-incident learning across the organization.

Benefits

• It is rare to have a company that both deeply impacts its customers and is able to provide its services across a massive population. At Blink, we have a huge impact on people when they are most vulnerable: at the intersection of their healthcare and finances. We are also the fastest growing healthcare company in the country and are driving that impact across millions of new patients every year. Our business model not only helps people, but drives economics that allow us to build a generational company. We are a relentlessly learning, constantly curious, and aggressively collaborative cross-functional team dedicated to inventing new ways to improve the lives of our customers.
