EarnIn - Senior Site Reliability Engineer

Mexico City, Mexico; Remote, Mexico - Hybrid (posted 1mo ago)

Remote, Senior, LATAM, Artificial Intelligence, Site Reliability Engineer, Python, Coaching, Go, Datadog, Documentation

Requirements

• A bachelor's or master's degree in computer science, or equivalent industry experience.
• 4+ years of experience in an SRE or Software Engineering role.
• Hands-on coding experience in Python and/or Go.
• Distributed Systems Expertise: Proven experience designing, operating, and shepherding large-scale distributed systems from design through production, including incident learnings that make on-call quieter over time.
• Reliability Engineering Mindset: Deep fluency in SLOs, SLIs, error budgets, and MTTR, using them to drive decisions and explain tradeoffs, not just decorate dashboards.
• Observability & Incident Response: Treats observability as essential, not optional; stays calm under pressure; can diagnose incidents from logs and metrics and translate findings into durable process and technical improvements.
• Cross-functional Communication: Able to work across technical and non-technical teams, reduce silos through documentation and runbooks, and explain reliability concepts in plain language.
• Operational Tooling & AI Fluency: Selects the right tools for production management and leverages AI-assisted development to reduce toil, accelerate RCA, and streamline infrastructure-as-code workflows.
• Leadership & Mentorship: Can plan and lead strategic reliability initiatives across engineering, and invests in mentoring engineers as a high-leverage path to long-term reliability improvements.

Responsibilities

• Act as a senior technical owner for reliability initiatives. Collaborate across systems, teams, and failure modes to strengthen how EarnIn designs, observes, deploys, and manages production services.
• You will combine software engineering fundamentals with reliability thinking. Rather than just responding to incidents, you will apply lessons learned to improve systems, alerts, runbooks, and ownership, reducing repeat failures.
• Leverage AI-assisted engineering practices, such as machine learning monitoring tools and anomaly detection systems, to minimize operational toil, accelerate investigations, refine infrastructure workflows, and enable teams to analyze production behavior more effectively.
• Mentor engineers and coach product teams to embed reliability practices that clarify, streamline, and safeguard their services.

Reliable system design

• Engineer and refine systems focusing on resilience, graceful degradation, capacity, and understanding failure modes.
• Collaborate with engineering teams to surface and address reliability risks during design, implementation, launch, and operation.
• Transform services to be simpler to debug, easier to operate, and more predictable under failure.

SLOs, observability, and production signals

• Define and measure SLIs and SLOs that reflect real customer experience.
• Elevate alerting quality so pages drive action, reach the right people, and warrant human intervention.

Incident lifecycle improvement

• Direct and optimize incident response practices from detection and triage to communication, resolution, postmortems, and follow-up.
• Extract incident learnings to implement lasting technical and process improvements.
• Guide teams to reduce repeated incidents and cultivate a quieter on-call environment.

Operational tooling and AI-assisted leverage

• Develop or refine tooling that eliminates toil, accelerates root-cause analysis, and streamlines infrastructure-as-code workflows.
• Help teams adopt practical AI-assisted workflows where they measurably improve quality, speed, or operational clarity.

Mentorship and engineering enablement

• Coach engineers in reliability practices, observability, incident response, and production ownership.
• Write documentation and runbooks that reduce silos and make operational knowledge easier to use.
• Articulate reliability tradeoffs persuasively to both technical and non-technical partners.

Benefits

• EarnIn’s community members rely on our products to perform consistently, respond promptly, and instill trust. Reliability goes beyond infrastructure; it shapes the customer experience. Product teams must deploy rapidly, but they must also develop systems that are observable, resilient, easy to operate, and safe to update.
• This role exists to elevate the reliability of EarnIn’s production systems while empowering engineering teams to advance swiftly with assurance. As a Senior Site Reliability Engineer, you will spearhead reliability enhancements that fortify services, streamline incident management, and foster sustainable on-call practices.