cantina - Machine Learning Engineer (Singapore)

Singapore · + Equity · 1w ago

In Office · APAC · Cloud Computing · Artificial Intelligence · Machine Learning Engineer · Python · Ray · Airflow · Docker · Kubernetes

Requirements

• Strong hands-on experience building or scaling large-scale data systems and pipelines for machine learning, including dataset curation, filtering, and quality improvement
• Experience with distributed data processing frameworks such as PySpark or Ray, and orchestration tools such as Airflow or equivalent
• Familiarity with containerization and container orchestration, including Docker and Kubernetes
• Experience working with cloud-based data storage and compute (AWS, GCS, and/or Azure), including tradeoffs around cost, throughput, storage layout, and access patterns
• Experience with VLM-based captioning pipelines or quality/aesthetic scoring models for video or image data, including curation of image-text pair datasets for joint image-video training
• Familiarity with CLIP-based or embedding-based filtering and semantic data selection techniques
• Familiarity with video and media processing tools such as FFmpeg, PyAV, DALI, or OpenCV, and relevant libraries such as Decord, torchvision, PyTorchVideo, or torchaudio
• Proficiency in Python
• Strong problem-solving, communication, and documentation skills

Responsibilities

• Design and scale distributed data pipelines for preprocessing, dataset generation, and repeated dataset refreshes
• Own workflow orchestration, job scheduling, monitoring, and failure recovery for large-scale data processing jobs
• Implement and maintain containerized pipeline infrastructure using Kubernetes or equivalent orchestration systems
• Optimize cloud-based data storage and movement across providers (AWS, GCS, or Azure) for cost, throughput, and operational efficiency
• Define and implement best practices for dataset storage layout, versioning, caching, retention, and access patterns
• Design and implement curation pipelines that determine which video and image content is selected, filtered, and retained for model training, including image-text pair datasets used in joint training regimes
• Build and improve VLM-based captioning and metadata generation workflows at scale across both video and image data
• Develop and apply quality and aesthetic scoring models, CLIP-based semantic filtering, and other signal-extraction approaches for data selection
• Build tooling to support deduplication workflows at scale, including near-duplicate and exact deduplication pipelines over large video corpora
• Analyze dataset composition, identify quality issues, and iterate on curation logic to improve training outcomes
• Define and evolve standards for what constitutes high-quality, training-ready video data across different training regimes

Benefits

• Competitive salary and generous company equity
• Personal time off and paid holidays
• Health insurance
• Global travel insurance: Covers you when traveling internationally
• Monthly spending stipend: $500 (~S$635)
• Equipment: All equipment needed for your home office
