DevOps & SRE Engineer
Requirements
• Experience & Education: At least 2 years of hands-on experience in Systems Operations, DevOps, or Site Reliability Engineering (SRE); a Bachelor's degree in Computer Science, Engineering, or a related technical field is preferred.
• Cloud & Infrastructure: Experience with public cloud platforms such as AWS, Azure, or GCP is highly valued; a strong understanding of large-scale internet architecture and distributed systems is required. Experience with infrastructure monitoring, logging, and observability tools such as Prometheus/Grafana, Elastic Stack (Elasticsearch, Kibana), Datadog, New Relic, Splunk, or Zabbix is preferred; proficiency in scripting and automation using Shell, Python, or similar languages is required.
• Technical Skills: Strong knowledge of containerization technologies such as Docker and Kubernetes is required, with hands-on experience operating production-grade container clusters and managing CI/CD pipelines. Hands-on familiarity with common infrastructure components such as Nginx, MySQL, Redis, Kafka, and Elasticsearch is required; strong networking skills, including Service Mesh architectures (Cilium CNI) and eBPF technologies, are preferred but not mandatory.
• High Availability & Reliability: Experience ensuring maximum uptime for production services through proactive monitoring and incident response is required, along with continuous optimization of service architecture, deployment strategies, and operational processes. Implementing and maintaining SLA/SLO frameworks and reliability engineering practices is expected.
• Automation & Process Improvement: Ability to lead the development of automated operations and maintenance systems, create self-service tools and workflows that improve team productivity, and establish best practices for Infrastructure as Code and configuration management.
Responsibilities
• Cluster Operations & Management
  • Manage and maintain container clusters (Kubernetes, Docker) and open-source component clusters (Kafka, Redis, Elasticsearch) across multiple business units
  • Ensure optimal performance, scalability, and reliability of distributed systems
• Infrastructure Platform Development
  • Design, build, and enhance infrastructure operation platforms
  • Develop and maintain systems for infrastructure management, CI/CD pipelines, monitoring/alerting, and centralized logging
  • Drive platform standardization and automation initiatives
• High Availability & Reliability
  • Ensure maximum uptime for production services through proactive monitoring and incident response
  • Continuously optimize service architecture, deployment strategies, and operational processes
  • Implement and maintain SLA/SLO frameworks and reliability engineering practices
• Automation & Process Improvement
  • Lead the development of automated operations and maintenance systems
  • Create self-service tools and workflows to improve team productivity
  • Establish best practices for Infrastructure as Code and configuration management