NICE - Site Reliability Engineer

Remote - United Kingdom | 2mo ago

Remote, EMEA, Cloud Computing, Site Reliability Engineer, Go, Splunk, Datadog, Bash, Python

Responsibilities

Unlike a traditional NOC analyst, an SRE‑NOC is expected to engineer problems away, not just respond to alerts.

How will you make an impact?

Incident Response & Operations

• Act as a primary or escalation responder in a 24x7 on‑call rotation
• Lead or support Major Incident (MI) response, including triage, mitigation, and resolution
• Coordinate across Engineering, Infrastructure, Security, and Product teams
• Execute and improve runbooks, playbooks, and escalation paths
• Drive blameless post‑incident reviews (PIRs) and track corrective actions

Monitoring, Alerting & Observability

• Own service health monitoring across infrastructure, applications, and dependencies
• Design and maintain alerting strategies that align with SLIs/SLOs
• Reduce alert fatigue through signal‑to‑noise improvements
• Build dashboards using tools such as:
  • Datadog / Splunk / CloudWatch

Reliability Engineering & Automation

• Automate repetitive operational tasks to reduce manual toil
• Improve mean time to detect (MTTD) and mean time to resolve (MTTR)
• Develop scripts and tools (Python, Bash, Go, etc.) to support NOC/SRE workflows
• Implement self‑healing and auto‑remediation where possible
• Partner with engineering teams to improve system design for reliability

Platform & Infrastructure Support

• Support and troubleshoot:
  • Linux‑based systems
  • Cloud platforms (AWS, Azure, GCP)
  • Kubernetes / containerized environments
• Assist with capacity planning and availability reviews
• Ensure operational readiness for production releases

Have you got what it takes?

• Strong Linux systems administration
• Experience with incident management and production support
• Familiarity with:
  • Cloud infrastructure (AWS preferred)
  • Containers & orchestration (Docker, Kubernetes)
  • Monitoring/alerting platforms
• Scripting or programming experience in Python, Bash, Go, or similar
• Understanding of networking fundamentals (DNS, TCP/IP, load balancing)
• Experience working in 24x7 NOC or production operations environments
• Ability to handle high‑pressure incidents calmly and effectively
• Strong written and verbal communication for incident coordination
• Comfort working from runbooks, but improving them when they fall short

Preferred / Differentiators

• Experience defining or operating to SLOs / SLIs
• Prior migration from traditional NOC → SRE model
• Infrastructure as Code experience (Terraform, Ansible, etc.)
• Exposure to security, compliance, or regulated environments

Requisition ID: 10579
Reporting into: Manager, Network Operations
Role Type: Individual Contributor