Galaxy - Vice President Site Reliability Engineering (Data Centers)

Hybrid - Asia-Pacific · 1mo ago

In Office · VP · APAC · Cloud Computing · Public Sector · President · Site Reliability Engineer · Go · Bash · Git · Python · Jenkins

Requirements

• A collaborative and strategic leader with deep hands-on experience in Site Reliability Engineering (SRE) and infrastructure automation. You are comfortable steering the vision for an enterprise automation roadmap while remaining technical enough to dive into the code. You treat infrastructure as a product, ensuring that your automation workflows are as reliable as the services they deploy. You have a proven track record of managing complex hybrid environments and are proactive in building self-service platforms that enhance engineering velocity and system stability.
• 6-10 years’ experience in Infrastructure, SRE, or DevOps, specifically focused on infrastructure automation at scale.
• Deep proficiency with Terraform (providers, modules, state management) and Ansible (roles, playbooks, Tower/AWX).
• Hands-on experience with image creation tools (e.g. Packer, Ansible, SCCM) to build standardized, hardened images for both Windows and Linux in hybrid environments.
• Strong experience managing and automating virtual platforms such as VMware (vSphere/vCenter) as well as cloud providers such as Azure and AWS.
• Strong scripting skills in languages such as Python, Go, PowerShell, and Bash.
• Experience with observability tools (Splunk, ELK, Prometheus, or Grafana) to monitor infrastructure health and automation telemetry.
• Good understanding of network topology and design, as well as experience with platforms such as Juniper Networks or Palo Alto.
• Mastery of Git (branching strategies, PR workflows) and CI/CD platforms (Jenkins, GitLab CI, or GitHub Actions).
• Equal comfort managing, troubleshooting, and tuning performance for both Windows Server and Linux.
• Previous work experience that includes notable periods of team leadership and/or management.
• Experience with IAM platforms such as Entra ID, Active Directory, and Okta.
• Experience with storage solutions, both block-based and object-based, hosted either on-prem (HP Alletra, EMC, DDN) or in the cloud (S3, Azure Blob).
• Experience with storage backup/DR administration and management using Commvault or Veeam.

Galaxy respects diversity and seeks to provide equal employment opportunities to all employees and job applicants for employment without regard to actual or perceived age, race, color, creed, religion, sex or gender (including pregnancy, childbirth, lactation and related medical conditions), gender identity or gender expression (including transgender status), sexual orientation, marital or partnership or caregiver status, ancestry, national origin, citizenship status, disability, military or veteran status, protected medical condition as defined by applicable state or local law, genetic information or predisposing genetic characteristic, or other characteristic protected by applicable federal, state, or local laws and ordinances.

We will endeavor to make a reasonable accommodation to the known limitations of a qualified applicant with a disability unless the accommodation would impose an undue hardship on the operation of our business. If you believe you require such assistance to complete the application process or to participate in an interview, please contact [email protected].

Responsibilities

• Automation Platform Leadership: Oversee a specialized SRE team focused on the design, deployment, and maintenance of automation toolsets as well as the systems they interact with.
• Infrastructure as Code (IaC) Governance: Establish and enforce standards for IaC to ensure consistent, repeatable, and secure deployments across the entire infrastructure ecosystem. Strong proficiency in Terraform is required.
• Configuration Management: Lead the strategy for automated configuration and state management, ensuring Ansible playbooks and Packer image pipelines are optimized for Windows, Linux, and ESXi platforms.
• Monitoring & Observability: Manage the monitoring and health of the automation platforms themselves. Implement SLIs/SLOs to ensure the "tools that build the servers" are highly available and performant.
• Lifecycle Management: Drive the automated lifecycle of both physical and virtual assets, from initial template creation/deployment to automated patching, scaling, and decommissioning.
• Custom Tooling & Scripting: Lead the development of custom scripts and internal providers (Python, Go, PowerShell, Bash) to provide better insights and tooling for our systems.
• Collaboration: Beyond the automation team, collaborate and build workflows with the rest of the Datacenter team, and help meet the needs of the team as a whole.
• Capacity & Performance: Analyze system behavior and resource utilization in virtual environments to optimize the performance of automated deployments.
• Mentorship & Growth: Provide technical guidance and career mentorship to SREs, fostering a culture of "automate-first" and continuous improvement.
