Protege - Research Scientist, Benchmarks & Evaluations

Remote, posted 1mo ago

Requirements

• Advanced degree (PhD preferred, or MS/BS plus equivalent industry experience) in a quantitative field: applied econometrics with AI experience, quantitative finance, computer science, engineering, statistics/mathematics, or any applied research discipline.
• Hands-on experience evaluating LLMs, agents, or other ML systems, including prompting, scaffolding, and fluency with the tooling researchers use to run evals at scale.
• Experience with annotator quality and inter-rater reliability: designing labeling protocols, computing agreement statistics, and reasoning about annotator bias and calibration.
• Excellent scientific writing and communication: you can synthesize technical findings into narratives that frontier labs, enterprise customers, and policymakers can act on.
• A bias toward velocity. You know which pipelines need to be production-grade and which can be scrappy, and you get reliable results fast.
• Experience with RL evaluation techniques: reward modeling, off-policy evaluation, evals for RLHF/RLAIF or agentic RL pipelines.
• Ability to navigate new customer architectures, data systems, and requirements quickly.
• Experience with latent-variable models of annotator skill (Dawid-Skene, MACE, IRT-style approaches) or with running large expert-annotator panels in regulated domains.
• Track record of published benchmarks or evaluation papers the field has adopted.
• Pass the Loved Ones’ Test: We act with integrity and do the right thing, especially when it’s hard and no one is watching.
• Always Find a Way: We are resourceful, resilient builders who solve hard problems and push through obstacles.
• Go Fast and Grow Fast: Velocity matters. We move with urgency, learn quickly, and continuously improve as individuals and as a company.
• Practice Kindness and Candor: We communicate directly and respectfully, building trust through honest feedback and genuine care for one another.
• Deliver Together: We win as one team. Collaboration, accountability, and shared ownership drive our success.
• Own the Outcome. Hone the Craft: We take pride in our work, sweat the details, and continuously raise the bar for excellence.

Responsibilities

• Design tasks and benchmarks that distinguish capability levels across frontier models, including agentic, reasoning-heavy, and domain-specific (healthcare, finance, scientific) settings.
• Validate evaluations rigorously: run human baselines, analyze inter-rater reliability, study how elicitation and scaffolding shift results, and quantify what’s signal versus noise.
• Develop the “science of evals” at Protege, including item response theory, contamination analysis, predictive validity studies, and statistical frameworks for comparing models with appropriate uncertainty.
• Run evaluations on current frontier models, sometimes in collaboration with partners at AI labs, enterprises, and government.
• Publish research that establishes Protege as the standard-setter for evaluation data, and contribute to the broader AI community’s understanding of what good evals look like.
• Translate findings into product, working closely with the data and engineering teams to turn research into evaluation datasets customers can deploy.
• Partner with outsourced annotation vendors. Evaluation data is only as good as the people producing it. A meaningful share of this role is owning the statistical machinery that determines which annotators we trust, on which tasks, and by how much, and translating that into trustworthiness scores Protege’s customers can rely on.
