fal-ai - Machine Learning Engineer, Reliability
Requirements
• 3+ years of professional experience, including 1 year operating production ML or high-scale API systems, ideally with on-call ownership
• Strong systems fundamentals: distributed systems, networking, observability, and incident management
• Working knowledge of modern generative models (diffusion, transformers) and their failure modes in production
• Familiarity with security and safety practices for ML systems; experience in abuse prevention, content safety, or trust & safety engineering is a strong plus
• A bias toward automation, measurement, and blameless postmortems
• Location: Remote (India, Australia, New Zealand)
Responsibilities
• Own availability, latency, and throughput SLOs across a large fleet of generative media model APIs serving production traffic at scale
• Build the monitoring, alerting, and observability needed to catch ML-specific failures (output quality degradation, pipeline breakage, model regressions) before customers do
• Harden model deployment workflows with canary releases, shadow testing, automated rollbacks, and validation gates so new model versions ship safely
• Drive the security posture of the model fleet: secure model serving, abuse and misuse detection, rate limiting, and protection against adversarial usage patterns
• Operationalize safety systems for generative media: content moderation pipelines, safety classifiers, and guardrails that run reliably at inference time without compromising performance
• Lead incident response for model API outages and degradations, run postmortems, and drive the engineering work that prevents recurrence
• Improve capacity planning, autoscaling, and GPU fleet efficiency for inference workloads under highly variable traffic
• Partner with model and infrastructure teams to make reliability, security, and safety requirements part of how new models get onboarded to the platform
• You will have access to our massive GPU cluster for inference and evaluation
• Some core technologies we use include Python, torch, diffusers, Kubernetes, and the fal Python SDK
• You'll work alongside a team dedicated to quickly iterating on and deploying new AI breakthroughs; your job is to make sure that speed never comes at the cost of reliability