Wizard - AI Applied Scientist
Responsibilities
• Define and evolve accuracy metrics across the full shopping experience (retrieval, ranking, recommendations, outcomes)
• Design and run experiments to measure improvements and regressions
• Build and maintain evaluation datasets, benchmarks, and scoring frameworks
• Improve the LLM judges that power our evaluation pipeline: prompting, calibration, and fine-tuning where it matters
• Translate ambiguous product questions into clear, measurable hypotheses and analysis
• Partner with ML Engineers to validate model changes and guide iteration
• Identify failure modes and edge cases, and drive improvements through data
• Make agent performance visible, trusted, and actionable across product and engineering
First 3 months
• Go deep on the agent, the current eval pipeline, and the metrics we use today
• Audit existing accuracy metrics and benchmarks; identify gaps, blind spots, and signals that aren’t trustworthy
• Build relationships with ML, AI Engineering, and Product
• Ship one quick win: a missing benchmark, an improved metric, or a fix to a misleading signal
• Establish a baseline view of agent performance the team can rally around
Months 3 to 6
• Own the evaluation framework: datasets, metrics, scoring, reporting, both offline and online
• Drive measurable improvements to LLM judge quality (calibration, fine-tuning where appropriate)
• Run experiments that influence at least one significant model or product change
• Stand up automated evaluation the team trusts before and after every launch
• Build dashboards and reporting that make agent performance legible to leadership
Beyond 6 months
• Lead applied science work on the next frontier as the agent grows: multi-turn evaluation, multimodal, personalization, ranking quality, conversational understanding
• Influence team-level strategy on what we measure, what we improve, and why
• Mentor and help grow the science function as it expands
What Success Looks Like
• Clear, trusted accuracy metrics are consistently used across product and engineering
• A robust automated evaluation framework for both offline and live experiments
• Model and product changes are consistently measured before and after launch
• Demonstrable improvements in LLM judge quality and eval coverage
• Science leadership that informs what we build, not just whether it works
Career Growth
• Depth track: become the org’s authority on AI evaluation: eval strategy, judge models, agent benchmarking
• Breadth track: expand into other applied science problems (recommendations, personalization, ranking, multimodal, conversational understanding) as those areas come online
• Leadership track: Senior / Staff Applied Scientist, with technical leadership across the science function
• As the agent gets more capable, the science problems get richer
Ideal Background
• 5+ years in Applied ML, AI Research, or Applied Science (PhD or equivalent depth strongly preferred)
• Hands-on experience evaluating modern AI/ML systems: LLMs, agents, ranking, or recommendations
• Direct experience with LLM-based systems: judge models, RAG, prompt engineering, fine-tuning, RLHF, or similar
• Strong experimentation foundations: A/B testing, causal inference, statistical rigor
• Proven ability to operate in ambiguity: defining problems, not just solving pre-defined ones
• Clear, structured communication that influences across ML, engineering, and product
Benefits
• The expected base salary range for this role is $225,000 - $280,000 USD, and will vary based on skills, experience, role level, and geographic location. Final compensation will be determined by considering these factors alongside overall role scope and responsibilities.
In addition to base salary, Wizard offers:
• Equity in the form of stock options
• Medical, dental, and vision coverage
• Flexible PTO and company holidays
• Fully remote work within the United States
• Periodic company offsites and team gatherings
Wizard is committed to fair, transparent, and competitive compensation practices.