TrueML - Senior Manager, DevOps

San Francisco, CA · $150k - $220k · 1w ago

Remote · Senior · NA · Cloud Computing · Senior Community Manager · Senior DevOps Engineer · Go · Bash · Python · Team Management · Goal Setting

Requirements

What You'll Do (Technical Leadership & Strategy):
• Define and execute the long-term strategic vision for Infrastructure as Code (IaC), CI/CD evolution, and cloud-native architecture to support TrueML’s scaling needs.
• Lead the design and implementation of self-service internal platforms to reduce developer cognitive load, enabling feature teams to deploy and manage services with minimal friction at increased velocity.
• Act as the primary stakeholder for cloud spend (AWS); drive cost-optimization initiatives and lead contract negotiations for the DevOps toolstack and third-party vendors.
• Ensure the infrastructure architecture supports strict High Availability (HA) requirements and robust Disaster Recovery (DR) protocols, maintaining system integrity across multiple regions.
• Oversee the implementation and evolution of comprehensive monitoring, logging, and distributed tracing systems, leveraging AIOps to move from reactive to predictive system maintenance.
• Champion security by design by integrating automated vulnerability scanning, secret management, and compliance checks directly into the automated build pipelines.
• Serve as the ultimate escalation point for major production outages, facilitating blameless post-mortem reviews that focus on systemic improvements rather than individual error.
• Maintain deep technical currency in container orchestration (Kubernetes), serverless patterns, and modern automation frameworks to provide meaningful mentorship and architectural guidance to senior engineering staff.

What You'll Do (Hands-On Engineering & Technical Execution):
• Maintain the ability to write and review high-quality code in languages like Python, Go, or Bash to automate complex operational tasks and system integrations.
• Develop Terraform Infrastructure as Code hands-on for resource provisioning.
• Directly architect and troubleshoot complex CI/CD workflows (GitHub Actions, ArgoCD, Atlantis), ensuring build-and-deploy cycles are optimized for speed and reliability.
• Proactively manage and tune container orchestration environments, including hands-on configuration of Ingress controllers, declarative GitOps workflows, and cluster autoscaling.
• Lead from the front during critical incidents by conducting deep-dive technical analysis across the EKS stack, troubleshooting Node-level kernel panics, VPC CNI networking bottlenecks, and RDS performance constraints to minimize MTTR.
• Conduct hands-on audits of cloud configurations and IAM policies, implementing "least privilege" access controls and automated remediation scripts.
• Directly manage the integration and API configurations between various tools in the DevOps stack (e.g., connecting Jira, VictorOps, Slack, and Observe for seamless incident flow).

What You'll Do (People Leadership & Engineering Collaboration):
• Recruit, hire, and develop a world-class team of DevOps Engineers; provide career pathing and technical mentorship to foster a culture of continuous learning.
• Partner closely with Engineering Managers to align infrastructure deliverables with the product roadmap, ensuring DevOps is an accelerator rather than a bottleneck.
• Collaborate with the Quality Engineering and Security leadership to define and enforce "Definition of Done" standards that include automated testing and security gates.
• Set clear, measurable goals (KPIs and OKRs) for the team, conducting regular performance reviews and providing feedback to drive individual and collective excellence.
• Lead internal Brunch & Learns to educate the broader engineering organization on modern cloud-native patterns and self-service capabilities.

• Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
• 10+ years of experience in DevOps, Site Reliability Engineering (SRE), or Software Engineering; 5+ years of experience managing engineers.
• Expert-level mastery of AWS and experience managing multi-region, high-availability deployments.
• Advanced experience with Kubernetes (K8s) and Docker, including cluster management, networking, and scaling in a production environment.
• Proficiency in Terraform to drive consistency and automation across all infrastructure layers. Experience with Atlantis is a plus.
• Deep experience designing and maintaining complex pipelines (GitHub Actions, GitLab CI, or Jenkins) and mastery of scripting languages like Python, Go, or Bash.
• Hands-on experience with modern monitoring, observability, and tracing stacks (Datadog, Observe) and a firm grasp of SRE principles (SLIs/SLOs/Error Budgets).
• Experience acting as an Incident Commander for high-severity outages and fostering a "blameless" post-mortem culture.
• Demonstrated ability to influence executive leadership and collaborate cross-functionally with Product, Engineering, and Security teams.
• Experience integrating AI-assisted productivity tools (Cline, GitHub Copilot) into the engineering workflow to accelerate delivery.

Ways to "Stand Out":
• Experience leading organizational platform migration, including the development of rollback strategies, stakeholder communication plans, and post-migration validation.
• Prior experience working with high-velocity, product-driven early-to-mid stage technology companies where reliability, extensibility, and availability were mission-critical to success.
• AWS or Kubernetes certifications are a plus, but not in lieu of hands-on experience with the same in production environments.
• Notable contributions to Open Source projects or communities.

$150,000 - $220,000 a year
