Smarsh - Manager, ML Engineering
Requirements
• 2+ years of engineering management experience, ideally in an AI/ML, platform, or MLOps context.
• Solid track record of delivering production ML or AI services at scale.
• Experience working at the interface between applied research or ML teams and production engineering: you understand how to take a model from handoff to a reliable, monitored service.
• Experience managing distributed teams across geographies and time zones.
• Demonstrated ability to build trusted relationships with Product, TPM, and platform stakeholders, translating business priorities into engineering plans and vice versa.
• Experience with COGS analysis and FinOps practices for AI/ML workloads: you understand how to track, attribute, and optimise infrastructure and inference costs, and can make informed build vs. buy decisions on managed services.
• Solid release management and planning experience: you can own a release calendar, coordinate cross-team dependencies, manage risk gates, and ensure smooth, well-communicated production deployments.
• Hands-on engineering background: you can credibly engage with technical design decisions and code reviews.
• Proficiency in Python; familiarity with Kotlin or JVM-based frameworks is a plus.
• Experience with cloud-native AI/ML infrastructure: AWS (Bedrock, SageMaker, EKS), Kubernetes, and Kafka.
• Solid understanding of audio and NLP model architectures, specifically Parakeet (ASR), the NeMo framework, and XLM-R based multilingual models.
• Solid MLOps foundations with practical production experience across the full model lifecycle, including experiment tracking, model registry management, gated deployment strategies (shadow mode, canary, blue/green with automated quality gates), drift detection, rollback handling, and SLO-linked promotion criteria.
• Understanding of ML model serving patterns, including inference optimisation and managed inference platforms (e.g. Triton Inference Server).
• Solid understanding of modern AI/ML architectures and system design patterns (including transformer-based models, agentic workflows, RAG, and multi-agent orchestration), with the ability to engage credibly in technical design discussions and evaluate trade-offs.
• Comfort with operational excellence practices: SLOs, observability, incident management, and on-call culture.
AI Productivity & Tooling
• Active user of AI-powered coding assistants (Windsurf, Claude Code, or similar).
• Genuine conviction that AI tooling meaningfully accelerates engineering teams, and a desire to prove it.
• Ability to coach engineers on effective use of AI tools and to separate hype from practical value.
Leadership & Communication
• Clear, direct communicator, able to translate complex technical topics for non-technical stakeholders.
• Data-driven approach to engineering decisions and team health.
• Comfortable with ambiguity and capable of bringing structure to new problem spaces.
• Collaborative leader who builds trust quickly across cultures and time zones.
Responsibilities
Team Leadership & People Management
• Lead, mentor and grow a team of 4 ML Engineers and 1 Delivery Engineer across India and the UK.
• Run effective 1:1s, performance conversations, and career development planning.
• Foster a high-trust, high-performance team culture grounded in continuous improvement.
• Manage hiring, onboarding, and team capacity planning as Cortex expands.
Technical Delivery & Model Operations
• Own end-to-end delivery of Cortex initiatives, from planning and scoping to production release and post-go-live operational support.
• Drive delivery of new capabilities including Audio Analytics as a Service, In App Translation and Intelligent Agent Review.
• Work closely with the Applied ML team to take in-house models from research handoff through to production-grade deployment, managing integration, validation, and operational readiness.
• Own and evolve Cortex's gated model deployment pipeline: ensuring models progress through automated quality gates, shadow mode, canary, and full rollout stages with clear promotion and rollback criteria.
• Establish model evaluation and monitoring frameworks, tracking quality, performance drift, and SLO compliance in production.
• Maintain and improve Cortex's operational SLOs, reliability posture, and incident response process.
• Ensure engineering practices, code quality, and architectural decisions meet Smarsh engineering standards.
AI-First Ways of Working
• Actively use and champion AI productivity tooling: Windsurf, Claude Code, and similar tools.
• Set the standard for how the team leverages AI-assisted development to increase velocity and code quality.
• Identify and help to introduce new AI tooling where it adds measurable value to the team.
Technical Strategy, Stakeholder Management & Developer Experience
• Contribute to the Cortex technical roadmap, working with engineering leadership, Product Management, and TPM to align delivery to business priorities.
• Build strong working relationships with the Applied Machine Learning team, acting as a bridge between model development and production AI service deployment.
• Partner closely with sister Cognition teams (Cognition Logic and Cognition Analytics) to align on shared platform patterns, APIs, and service contracts within the Enterprise Conduct organisation.
• Engage proactively with the Fabric organisation on infrastructure, platform standards, and shared tooling dependencies.
• Represent Cortex in cross-team forums, architecture reviews, and planning sessions, advocating for Cortex consumers' developer experience.
• Help to drive the AI Service Catalogue vision: discoverable, well-documented, and operationally excellent services that product engineers across Smarsh can consume with confidence.