Hatch IT - Site Reliability Engineer (SRE)

Remote - USA · 2mo ago

Remote, Mid, NA, Cloud Computing, Site Reliability Engineer, Chef, Java, Linux, Shell, Python, AWS

Requirements

• Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
• 3–7 years of experience in SRE, DevOps, or Systems Engineering roles.
• Strong proficiency with Linux systems and shell scripting.
• Experience with cloud platforms (AWS, Azure).
• Hands-on experience with Kubernetes/ECS and container technologies (Docker).
• Proficiency in at least one programming language: Python or Java.
• Experience with CI/CD pipelines and DevOps tooling.
• Strong understanding of distributed systems, networking, and security fundamentals.
• Strong analytical and problem-solving skills.
• Excellent communication and cross-team collaboration.
• Ability to thrive in fast-paced, high-stakes environments.
• A mindset focused on continuous improvement and operational excellence.
• Experience with observability stacks (OpenTelemetry).
• Knowledge of database management (PostgreSQL).
• Experience with configuration management tools (Ansible, Chef, Puppet).
• Familiarity with zero-downtime deployments and chaos engineering practices.

Responsibilities

• Ensure high availability, scalability, and performance of production systems.
• Implement and maintain SLIs, SLOs, and SLAs for critical services.
• Conduct capacity planning and performance tuning.
• Automate infrastructure provisioning using IaC tools such as Terraform, Terragrunt, and Ansible.
• Develop automation to minimize manual operations and improve deployment workflows.
• Build CI/CD pipelines to support rapid and reliable deployments.
• Design and maintain monitoring, logging, and alerting systems (Datadog).
• Participate in on-call rotations and lead incident response efforts.
• Perform root-cause analysis and develop postmortems to prevent recurring issues.
• Manage cloud infrastructure (AWS, Azure) and container orchestration platforms (Kubernetes, ECS).
• Optimize system architecture for reliability and fault tolerance.
• Implement best practices for security, networking, and service resilience.
• Work closely with development teams to design reliable microservices and distributed systems.
• Advocate for SRE principles and drive operational excellence across engineering teams.
• Mentor engineers on reliability practices, tooling, and automation strategies.
