Protege - Machine Learning Researcher - RL and Agentic Systems
Requirements
• PhD, or equivalent Master’s degree plus 4+ years of industry experience, in machine learning, computer science, statistics, engineering, mathematics, economics, or a related quantitative field.
• Strong understanding of AI model training pipelines, evaluation methodology, and the role of data in shaping model performance.
• Experience working with large, unstructured, or semi-structured datasets used to train or evaluate ML systems.
• Experience with reinforcement learning, sequential decision-making, agentic systems, tool-using models, or multi-step model evaluation.
• Experience designing tasks, benchmarks, environments, simulations, or evaluation frameworks for real-world model behavior.
• Strong intuition for realism, coverage, difficulty, fidelity, and meaningful outcome structure in datasets.
• Strong experimental design, evaluation, benchmarking, and data-validation skills.
• High ownership and ability to independently identify and solve high-impact problems.
• Experience developing evaluation frameworks or performance metrics for datasets, agentic systems, or training data.
• Experience translating real-world workflows into structured tasks or environments for model evaluation.
• Experience with RLHF, RLAIF, imitation learning, reward modeling, online or offline RL, or related methods.
• Experience with Harbor or other agent evaluation frameworks.
• Publications or open-source contributions in reinforcement learning, agents, evaluation, or data-centric AI.
• Experience collaborating cross-functionally with product, infrastructure, or partnership teams.
• Experience with synthetic data generation, trajectory generation, or simulation-based environments.

PROTEGE'S VALUES
• Pass the Loved Ones' Test
  We act with integrity and do the right thing, especially when it's hard and no one is watching.
• Always Find a Way
  We are resourceful, resilient builders who solve hard problems and push through obstacles.
• Go Fast and Grow Fast
  Velocity matters. We move with urgency, learn quickly, and continuously improve as individuals and as a company.
• Practice Kindness and Candor
  We communicate directly and respectfully, building trust through honest feedback and genuine care for one another.
• Deliver Together
  We win as one team. Collaboration, accountability, and shared ownership drive our success.
• Own the Outcome. Hone the Craft.
  We take pride in our work, sweat the details, and continuously raise the bar for excellence.
Responsibilities
DESIGN AND BUILD DATASETS, TASKS, AND ENVIRONMENTS
• Design and build datasets, tasks, environments, and evaluation assets for benchmarking agentic systems and multi-step model behavior.
• Translate real-world workflows into structured tasks, interaction traces, trajectories, stateful environments, and verifiable outcomes that can be used to evaluate advanced AI systems.

DEVELOP FRAMEWORKS FOR EVALUATING REAL-WORLD DATA QUALITY
• Develop frameworks that assess diversity, realism, coverage, fidelity, informativeness, and downstream usefulness of datasets for agentic systems.
• Build quality scorecards and evaluation methods that make dataset strengths, weaknesses, and failure modes legible across teams.

BENCHMARK MODEL BEHAVIOR IN RL AND AGENTIC SETTINGS
• Evaluate planning, tool use, robustness, recovery from failure, task completion, and generalization behavior in RL-style or agentic environments.
• Connect model failures back to concrete dataset, environment, or task-design gaps and recommend improvements grounded in empirical evidence.

BUILD SCALABLE EVALUATION AND VALIDATION TOOLING
• Contribute to tools and systems that automate dataset validation, environment generation, rollout analysis, benchmark construction, and evaluation workflows.
• Improve internal infrastructure for reproducible experimentation, benchmark management, and evaluation quality.

PARTNER ACROSS RESEARCH, ENGINEERING, AND PRODUCT
• Collaborate closely with research and engineering teams to identify data bottlenecks, improve evaluation methodology, and shape internal best practices around task-grounded AI training data.
• Represent DataLab’s perspective in cross-functional discussions around dataset quality, benchmark design, and frontier agentic-system evaluation.

WHAT SUCCESS LOOKS LIKE
NEAR-TERM: ESTABLISH A STRONG EVALUATION BASELINE
• Create clear benchmark frameworks, evaluation assets, and dataset-quality scorecards that help Protege reason about how real-world data impacts advanced agentic systems.
• Use rigorous evaluation methods to identify meaningful dataset improvements, improve benchmark fidelity, and sharpen the company’s understanding of what high-impact agentic data actually looks like in practice.