Machine Learning Engineer — Multilingual Data
Requirements
• 3+ years of experience as an ML Engineer, Applied Scientist, or in a similar role
• Strong experience working with multilingual or non-English datasets
• Solid understanding of NLP fundamentals (tokenization, embeddings, language modeling)
• Experience building scalable data pipelines (Python, Spark, Ray, or similar)
• Familiarity with Unicode, scripts, tokenization challenges, and language-specific quirks
• Comfort collaborating with researchers and translating research needs into production systems
• Experience with low-resource languages or multilingual benchmarks (e.g. FLORES, XTREME)
• Exposure to LLM training, fine-tuning, or distillation
• Linguistics background or experience working with native language experts
• Contributions to open-source datasets or ML tooling
• Experience with data quality evaluation at scale
Responsibilities
• Design, build, and maintain large-scale multilingual datasets across high- and low-resource languages
• Develop data pipelines for collection, cleaning, normalization, deduplication, and labeling
• Implement quality filters using statistical, heuristic, and model-based methods
• Work with researchers to define language coverage, benchmarks, and evaluation metrics
• Analyze dataset bias, coverage gaps, and failure modes across regions and scripts
• Support training, fine-tuning, and distillation workflows with high-quality multilingual data
• Continuously iterate on datasets based on model performance and real-world usage
Benefits
• Real ownership over a core differentiator of the product
• Work on models used globally, not just in English-speaking markets
• Small, high-caliber team with deep ML and systems experience
• Competitive compensation and meaningful equity at the Series A stage