TELUS Digital - Lead DevOps Engineer
Requirements
• Significant infrastructure engineering experience combining DevOps and SRE disciplines at scale
• Deep GCP expertise (AWS a strong plus); relevant cloud certifications welcome
• Production experience with SRE fundamentals: SLO/SLI design, error budgets, toil reduction, blameless incident review
• Strong background in distributed systems failure modes and resilience patterns
• Expert-level infrastructure-as-code (Terraform), container orchestration (Kubernetes), and CI/CD
• Hands-on with modern observability stacks (e.g., OpenTelemetry, Sentry) and AI-specific observability tooling (Arize, LangSmith, Braintrust, or similar)
• Experience with API management platforms, particularly Apigee and Cloud Run
• Comfort working across Python, JavaScript, and Bash for infra tooling
• Strong spoken and written communication in English with teams and stakeholders
• Has production experience with LLM-provider integrations (OpenAI, Anthropic, Google, Azure OpenAI) and the reliability quirks of inference at scale, such as rate limits, latency tails, provider failover, and cost controls
• Has experience with event-driven architecture (Pub/Sub, Kafka, EventBridge)
• Shows understanding of chaos engineering practices (Litmus, Gremlin, or homegrown equivalents)
• Holds one or more GCP certifications, such as Cloud Architect, Cloud DevOps Engineer, or equivalent
Benefits
• You will have a clear technical mandate, direct partnership with product and engineering leadership, and real ownership over infrastructure that powers AI workloads in production. Reliability at this scale is not a support function; it is a first-class engineering discipline with direct commercial impact.
• If you want to define how cloud infrastructure and site reliability engineering work together for a suite of AI-powered products at a critical growth stage, this is the role.