Synthesia - Senior ML Platform Engineer
Requirements
• Operating ML infrastructure or model serving systems in production.
• Supporting research or data-intensive workloads.
• Working with GPU-based systems or other performance-sensitive infrastructure.
• Experience with observability and debugging in distributed systems.
• Familiarity with Terraform, Datadog, GitHub Actions, or similar tools.
• Experience building agentic or LLM-powered internal tools.
• Experience with workflow orchestration systems such as Temporal.
• Experience working at the boundary between research and production engineering.
• Familiarity with performance optimization, scheduling, or resource allocation problems.
• Experience building lightweight product or developer-facing tools.
Responsibilities
• Design and improve the platform systems that support model training, evaluation, and production serving.
• Build infrastructure and tooling that make ML workloads more reliable, scalable, and cost-efficient.
• Develop internal tools and workflows that are easy for both humans and agents to operate.
• Work on the architecture behind how models are deployed, served, and operated across research and product environments.
• Improve how we schedule, monitor, and debug workloads running on GPUs and cloud infrastructure.
• Develop internal tools, abstractions, and agentic systems that reduce operational overhead for researchers and engineers.
• Drive improvements across observability, automation, reliability, and developer experience.
• Collaborate closely with researchers and product engineers to understand pain points and turn them into robust platform capabilities.
• Contribute to technical direction and make pragmatic architectural tradeoffs as the platform grows.
• Strong experience building or operating production systems with a focus on reliability, scalability, and maintainability.
• A systems mindset: you naturally think in terms of bottlenecks, failure modes, interfaces, resource usage, and long-term operability.
• Solid hands-on experience with cloud infrastructure, Linux, and infrastructure automation.
• Experience with Kubernetes and operating distributed workloads in production.
• Strong coding skills, ideally in Python or similar languages used for backend systems and tooling.
• Strong judgment around where automation adds leverage, and where human control and reliability matter most.
• Experience building internal platforms, developer tooling, or infrastructure abstractions used by other engineers.
• Comfort working in ambiguous environments and taking ownership of open-ended technical problems.
• A pragmatic approach: you care about solving the right problem well, not over-engineering.