GitLab - Senior Site Reliability Engineer, Tenant Services: Geo

Remote - India · + Equity · 2mo ago

Remote, Senior, APAC, Cloud Computing, Software, Site Reliability Engineer, Chef, Go, Ruby, Shell, Python, Documentation

Requirements

• Experience operating highly-available distributed systems at scale, ideally in a SaaS environment with customer-facing SLAs.
• Hands-on experience with at least one major cloud provider (e.g., Google Cloud Platform or Amazon Web Services), including networking, storage, and managed services.
• Experience with Kubernetes and its ecosystem (e.g., Helm), including deploying and troubleshooting workloads.
• Experience with infrastructure as code and configuration management tools such as Terraform, Ansible, or Chef.
• Strong programming skills in at least one general-purpose language (preferably Go or Ruby) and proficiency with scripting (e.g., Shell, Python).
• Experience with observability systems (e.g., Prometheus, Grafana, logging stacks) and using metrics and logs to troubleshoot performance and reliability issues.
• Practical exposure to data replication, backup/restore, or migration scenarios (e.g., database replication, storage replication, or Geo-like technologies) where data integrity and downtime risk must be carefully managed.
• Comfort participating in an on-call rotation, investigating incidents across the stack, and driving follow-through on corrective actions.
• Ability to engage directly with enterprise customers during migrations and incidents, including on live calls and through clear written updates.
• Ability to clearly define problems, propose options, and think beyond immediate fixes to improve systems and processes over time.
• Ability to be a “manager of one”: self-directed, organized, and able to drive work to completion in a remote, asynchronous environment.
• Strong written and verbal communication skills, with a bias toward clear, asynchronous documentation and collaboration.
• Alignment with our company values and a commitment to working in accordance with those values.
• Experience working with disaster recovery technologies.
• Experience with managed/hosted environments similar to GitLab Dedicated, including regulated or compliance-sensitive customers (e.g., SOC2, ISO).
• Prior work on large-scale data migrations or cutovers where customer data integrity, performance, and downtime risk had to be carefully balanced.
• Hands-on experience designing and operating database replication, backup/restore, and cutover workflows (for example, PostgreSQL or cloud-managed equivalents such as AWS RDS), including planning and executing low-risk migrations for large datasets.
• Experience with multi-tenant architectures, sharding, or routing strategies in high-traffic SaaS platforms.
• Familiarity with GitLab (self-managed or SaaS), and/or contributions to open source projects.

How GitLab Supports Full-Time Employees

• Benefits to support your health, finances, and well-being
• Flexible Paid Time Off
• Team Member Resource Groups
• Equity Compensation & Employee Stock Purchase Plan
• Growth and Development Fund

Please note that we welcome interest from candidates with varying levels of experience; many successful candidates do not meet every single requirement. Additionally, studies have shown that people from underrepresented groups are less likely to apply to a job unless they meet every single qualification. If you're excited about this role, please apply and allow our recruiters to assess your application.

Country Hiring Guidelines: GitLab hires new team members in countries around the world. All of our roles are remote; however, some roles may carry specific location-based eligibility requirements. Our Talent Acquisition team can help answer any questions about location after starting the recruiting process.

Responsibilities

• Execute Dedicated Geo migrations and cutovers end-to-end, including planning, pre-cutover validation, execution, and post-cutover verification and cleanup.
• Operate and improve the Geo operational surface for Dedicated, including:
  • Environment preparation and data hygiene checks prior to migrations.
  • Execution of replication, validation, and cutover procedures.
  • Handling Geo-related escalations from Support and internal partners.
• Design, build, and maintain automation, tooling, and runbooks that make migrations, cutovers, and Geo escalations as “boring” and repeatable as possible.
• Run our infrastructure with tools such as Ansible, Chef, Terraform, GitLab CI/CD, and Kubernetes; contribute improvements back to GitLab’s product and infrastructure where appropriate.
• Build and maintain monitoring, alerting, and dashboards that:
  • Detect symptoms early, not just outages.
  • Track migration and cutover success rates, duration, rollback frequency, and related SLOs.
• Collaborate closely with:
  • The core Geo team on improving Geo features and operability.
  • Dedicated migrations and Support on migration planning, customer communications, and escalation handling.
  • Other Infrastructure teams on capacity planning, disaster recovery, and reliability improvements.
• Contribute to readiness reviews, incident reviews, and root cause analyses, turning learnings into changes in automation, process, or product.
• Document every action, including runbooks, architecture decisions, and post-incident reviews, so your findings turn into repeatable practices and automation.
• Proactively identify and reduce toil by automating repetitive operational work and simplifying migration workflows.
