
fal-ai - Machine Learning Engineer, Reliability

Hybrid - Asia-Pacific *2d ago

In Office Junior APAC Artificial Intelligence Machine Learning Engineer Transformers Kubernetes Python

Requirements

• 3+ years of professional experience, including 1 year of experience operating production ML or high-scale API systems, ideally with on-call ownership
• Strong systems fundamentals: distributed systems, networking, observability, and incident management
• Working knowledge of modern generative models (diffusion, transformers) and their failure modes in production
• Familiarity with security and safety practices for ML systems; abuse prevention, content safety, or trust & safety engineering experience is a strong plus
• A bias toward automation, measurement, and blameless postmortems
• Location: Remote (India, Australia, New Zealand)

Responsibilities

• Own availability, latency, and throughput SLOs across a large fleet of generative media model APIs serving production traffic at scale
• Build the monitoring, alerting, and observability needed to catch ML-specific failures (output quality degradation, pipeline breakage, model regressions) before customers do
• Harden model deployment workflows with canary releases, shadow testing, automated rollbacks, and validation gates so new model versions ship safely
• Drive the security posture of the model fleet: secure model serving, abuse and misuse detection, rate limiting, and protection against adversarial usage patterns
• Operationalize safety systems for generative media: content moderation pipelines, safety classifiers, and guardrails that run reliably at inference time without compromising performance
• Lead incident response for model API outages and degradations, run postmortems, and drive the engineering work that prevents recurrence
• Improve capacity planning, autoscaling, and GPU fleet efficiency for inference workloads under highly variable traffic
• Partner with model and infrastructure teams to make reliability, security, and safety requirements part of how new models get onboarded to the platform
• You will have access to our massive GPU cluster for inference and evaluation
• Some core technologies we use include Python, torch, diffusers, Kubernetes, and the fal Python SDK
• You'll work alongside a team dedicated to quickly iterating on and deploying new AI breakthroughs. Your job is to make sure that speed never comes at the cost of reliability