Vultr - Senior Site Reliability Engineer, Core Cloud Engineering
Requirements
• Proficiency in PHP with strong scripting and automation skills.
• Experience running large-scale distributed systems and control plane infrastructure in production.
• Strong background in hypervisor technologies (libvirt, QEMU, KVM) and Linux systems administration.
• Expertise in networking protocols and tools, particularly BGP and Open vSwitch (OVS), with automation experience.
• Deep knowledge of observability and monitoring frameworks (Grafana, Sentry, SumoLogic) and incident management.
• Advanced troubleshooting skills across compute, networking, and storage subsystems.
• Experience building and maintaining CI/CD pipelines (GitLab) and configuration management (Puppet).
• Familiarity with MySQL or similar databases, with an understanding of operational considerations for reliability and scale.
• Strong problem-solving abilities and the drive to tackle complex, low-level reliability challenges.
• Effective cross-team communication and collaboration skills.
• A commitment to continuous improvement and fostering a culture of operational excellence.
Responsibilities
• Production Control Plane Operations: Operate and scale Vultr’s control plane, ensuring availability, correctness, and performance across global datacenters.
• Hypervisor & Infrastructure Reliability: Design, implement, and maintain automation to manage hypervisor fleets (KVM, QEMU, libvirt) and supporting infrastructure at scale.
• Networking & Systems Automation: Develop tooling and automation for Open vSwitch (OVS), BGP routing, and other networking components to ensure resilient and self-healing network operations.
• Performance & Reliability Tuning: Continuously analyze and improve system performance across compute, storage, and network layers, with an emphasis on reducing toil and eliminating single points of failure.
• Observability & Incident Response: Implement advanced monitoring, logging, and tracing solutions (Grafana, Sentry, SumoLogic) while leading incident response to minimize impact and drive postmortem culture.
• CI/CD & Configuration Management: Maintain and evolve infrastructure pipelines (GitLab CI/CD, Puppet) to enable safe, fast, and reliable changes to both control plane and hypervisor infrastructure.
• Collaboration: Work closely with Software Engineers, Network Engineers, and Product teams to align platform reliability with business and user needs.
• Documentation & Standards: Produce clear technical documentation for runbooks, operational procedures, and automation frameworks to improve team efficiency and reliability standards.
• Mentorship & Leadership: Coach and mentor team members in best practices for site reliability, incident handling, automation, and low-level Linux systems debugging.
Benefits
• Excellent medical benefits with company-paid premiums for the employee-only plan, plus dental and vision premiums.
• $500 first-year remote office setup and up to $400 each following year for new equipment.
• Internet reimbursement of up to $75 per month.
• Gym membership reimbursement of up to $50 per month.
• 401(k) plan that matches 100% up to 4% with immediate vesting.
• Professional development reimbursement of $2,500 each year.
• Increased PTO at the 3-year and 10-year anniversaries, with a one-month paid sabbatical every 5 years, plus an anniversary bonus each year.