Abacus Insights - Principal Site Reliability & Forward Deployed Engineer

Remote - USA · + Equity · 1mo ago

Requirements

• 10+ years of experience in software engineering, SRE, sustaining engineering, or production operations
• Deep hands-on experience operating production systems in AWS
• Strong experience troubleshooting Databricks and large-scale data platforms
• Proficiency in Python and experience building production services or tooling
• Distributed systems
• Incident management and RCA practices
• Monitoring, alerting, and observability
• CI/CD pipelines that leverage Infrastructure as Code
• Proven ability to own problems end-to-end, from detection to permanent resolution
• Excellent communication skills, especially during incidents and customer escalations
• Ability to work backward from customer impact to root cause across systems and codebases, delivering fixes in environments with minimal documentation
• Strong instinct for operational risk, with the ability to proactively identify failure modes and harden systems before they impact customers
• What we would like to see, but not required:
  • Experience in healthcare, health insurance, or regulated data environments
  • Kubernetes (EKS), EMR, Lambda
  • Spark internals
  • Snowflake or similar data warehouses
  • Experience with FHIR, MDM systems, or entity resolution
  • Prior experience in SWAT, escalation engineering, or tiger-team roles
  • Experience contributing to or operating within SRE/on-call programs
• Compensation: Compensation for this role is based on experience, skills, and location, and includes base salary plus eligibility for performance bonuses and equity grants.

Responsibilities

• Production Operations & Incident Response
  • Act as a senior technical escalation point during production incidents
  • Lead real-time incident triage, mitigation, and recovery efforts
  • Drive root cause analysis (RCA) with a focus on systemic, long-term fixes
  • Identify recurring failure patterns and push for architectural or operational improvements
  • Partner with Customer Success and Engineering to manage customer impact during incidents
• Sustaining Engineering & Post-Launch Ownership
  • Own post-launch reliability, stability, and operational quality of core systems
  • Investigate and resolve complex field issues and production defects
  • Ensure fixes developed during incidents or customer escalations are upstreamed into the core product
  • Improve operational readiness of services through better runbooks, monitoring, and alerting
  • Reduce operational toil by converting repeated manual work into automation
• Forward Deployed / Customer-Facing Engineering
  • Engage directly with strategic customers to solve real-world, production-grade technical challenges
  • Support complex deployments, integrations, and escalations in customer environments
  • Act as a trusted technical partner to customers during high-impact issues
  • Translate customer learnings into concrete product, platform, and operational improvements
  • Contribute to reusable tools, playbooks, and best practices that accelerate future deployments
• AWS & Databricks Technical Expertise
  • Serve as a subject matter expert for AWS-hosted production systems
  • AWS compute, storage, networking, IAM, and security
  • Databricks jobs, clusters, and Spark-based data pipelines
  • Debug performance degradation, scalability issues, job failures, and data correctness problems
  • Partner with platform and data teams to harden systems for reliability, scale, and operability
• Software Development & Automation
  • Automate operational workflows
  • Improve reliability and observability
  • Eliminate manual intervention and reduce incident frequency
  • Contribute primarily in Python, with exposure to JVM-based systems as needed
  • Review code with a strong emphasis on operability, resiliency, and maintainability
  • Advocate for “build it so it can be operated” engineering standards
• Technical Leadership & Collaboration
  • Provide technical leadership without formal authority, influencing design and operational decisions
  • Mentor engineers through pairing, reviews, and incident leadership
  • Collaborate closely with Product, Engineering, Data, and Customer teams
  • Operate effectively in high-pressure, ambiguous environments, especially during customer-impacting incidents

Benefits

What you’ll get in return:
• Unlimited paid time off – recharge when you need it
• Work from anywhere – flexibility to fit your life
• Comprehensive health coverage – multiple plan options to choose from
• Equity for every employee – share in our success
• Growth-focused environment – your development matters here
• Monthly cell phone allowance – stay connected with ease