cantina - Research Scientist (Singapore)

Singapore · + Equity · 1w ago

In Office · APAC · Cloud Computing · Artificial Intelligence · Research Scientist · Ray · Airflow · Docker · Kubernetes · AWS

Requirements

• Strong hands-on experience building or scaling large-scale data systems or pipelines for machine learning workflows
• Experience with distributed data processing frameworks such as PySpark or Ray, and orchestration tools such as Airflow or equivalent
• Familiarity with containerization and container orchestration, including Docker and Kubernetes
• Experience working with cloud-based data storage and compute (AWS, GCS, and/or Azure), including tradeoffs around cost, throughput, storage layout, and access patterns
• Familiarity with video and media processing tools such as FFmpeg, PyAV, DALI, or OpenCV
• Familiarity with multimodal or media data, including video, image, text, and audio
• Strong research background in post-training methods for large-scale diffusion or flow-based generative models, with deep hands-on experience in distillation across both inference efficiency and quality preservation
• Experience with reward modeling or preference-based fine-tuning for generative models, including RLHF, DPO, or equivalent alignment approaches
• Solid understanding of the interplay between pretraining and post-training, and how base model properties affect distillation and fine-tuning outcomes
• Proficiency in Python and modern machine learning frameworks, with a strong preference for PyTorch or JAX
• Track record of independent research, with the ability to drive projects from initial idea through experimental validation
• Publications at top-tier venues (NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV) preferred
• Good understanding of the practical challenges involved in building reliable, scalable, and reproducible data workflows for machine learning systems

Responsibilities

• Build and maintain scalable systems for ingesting, preprocessing, and delivering large-scale video data for model training
• Design and scale distributed data pipelines for preprocessing, dataset generation, and repeated dataset refreshes
• Own workflow orchestration, job scheduling, monitoring, and failure recovery for large-scale data processing jobs
• Implement and maintain containerized pipeline infrastructure using Kubernetes or equivalent orchestration systems
• Optimize cloud-based data storage and movement across providers (AWS, GCS, or Azure) for cost, throughput, and operational efficiency
• Define and implement best practices for dataset storage layout, versioning, caching, retention, and access patterns
• Build tooling to support deduplication workflows at scale, including near-dedup pipelines over large video corpora
• Research and develop distillation methods for large-scale diffusion and flow-based video generation models, including guidance distillation and adversarial distillation, with a focus on preserving or improving generation quality while reducing inference cost
• Develop reward models and preference-based fine-tuning pipelines that align video generation quality with human judgments across dimensions such as aesthetics, motion quality, and prompt adherence
• Analyze the relationship between base model behavior and post-training outcomes, and work with the foundation model team to inform pretraining decisions accordingly

Benefits

• Competitive salary and generous company equity
• Personal time off and paid holidays
• Health insurance
• Global travel insurance: covers you when traveling internationally
• Monthly spending stipend: $500 (~S$635)
• Equipment: all equipment needed for your home office
