• Oversee and manage IT operations within the data centre, including day-to-day monitoring, incident management, and problem management
• Lead the end-to-end incident management lifecycle, which encompasses immediate troubleshooting, root cause identification, and resolution implementation to restore services, followed by comprehensive post-incident analysis
• Develop and maintain documentation on IT infrastructure, operations, and procedures within the data centre
• Perform capacity planning to ensure IT infrastructure is scalable for future demands
• Collaborate and coordinate with Data Centre Facilities teams on matters related to power, cooling, and physical infrastructure
• Design and implement a robust observability platform alongside network monitoring tools for performance monitoring and real-time alerting across IT devices and networks
• Implement and manage remote management tools for out-of-band access and control of IT devices and servers
• Define, implement, and track SRE metrics, including SLOs, SLIs, and error budgets, to improve data centre IT reliability