Site Reliability Engineer
Requirements
• Minimum of 3 years of experience supporting enterprise applications as an SRE or in a similar role, with proficiency in writing code in Java, Go, or Python.
• Good understanding of distributed systems concepts, microservices architecture, and software design patterns.
• Hands-on experience with Kubernetes. You have managed applications on a major cloud provider (GCP, AWS, or Azure) and can troubleshoot common container issues.
• Experience setting up dashboards in Grafana and using APM tools such as Datadog, New Relic, or SigNoz. You have a solid understanding of metrics, logs, and traces.
• Proficiency in SQL (e.g., PostgreSQL, MySQL). Ability to write complex queries to debug data issues, and a basic understanding of database performance.
What we can offer you
• Culture - We put our people first and prioritize the well-being of every team member. We’ve built a company where all opinions carry weight and where all voices are heard. We value and respect each other and always look out for one another. Above all, we are human.
• Learning - We have a learning and development-focused environment with an emphasis on knowledge sharing, training, and regular internal technical talks.
• Compensation - You’ll receive an attractive salary, pension, health insurance, annual bonus, plus other benefits.
What to expect in the hiring process
• A preliminary phone call with the recruiter
• A technical interview with the Hiring Manager
• A behavioural and technical interview with a member of the Executive team
Responsibilities
• Participate in on-call rotations to detect and triage service and reliability issues across all environments. Act as the Incident Commander during major incidents: initiating war room or bridge calls, coordinating cross-functional teams, and providing timely and clear status updates to all stakeholders.
• Create and maintain meaningful dashboards and alerts. Work with development teams to instrument their code to ensure visibility.
• Develop automation to eliminate manual and repetitive operational tasks (toil) related to reliability across both applications and infrastructure.
• Implement and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) defined by engineering leadership.
• Investigate and resolve customer complaints escalated beyond L1 and L2 support, especially those involving performance, reliability, or complex system behavior.