Fieldguide - AI Engineer, Quality
Requirements
What You'll Own

Measurable AI Agents
• Design and build a unified evaluation platform that serves as the single source of truth for all of our agentic systems and audit workflows
• Build observability systems that surface agent behavior, trace execution, and failure modes in production, and feedback loops that turn production failures into first-class evaluation cases
• Own the evaluation infrastructure stack, including integration with LangSmith and LangGraph
• Translate customer problems into concrete agent behaviors and workflows
• Integrate and orchestrate LLMs, tools, retrieval systems, and logic into cohesive, reliable agent experiences

Rapid Model Evaluation
• Build automated pipelines that evaluate new models against all critical workflows within hours of release
• Design evaluation harnesses for our most complex agentic systems and workflows
• Implement comparison frameworks that measure effectiveness, consistency, latency, and cost across model versions
• Design guardrails and monitoring systems that catch quality regressions before they reach customers

AI-Native Engineering Execution
• Use AI as core leverage in how you design, build, test, and iterate
• Prototype quickly to resolve uncertainty, then harden systems for enterprise-grade reliability
• Build evaluations, feedback mechanisms, and guardrails so agents improve over time
• Work with SMEs and ML engineers to create evaluation datasets by curating production traces
• Design prompts, retrieval pipelines, and agent orchestration systems that perform reliably at scale

Ownership of Quality and Large Product Areas
• Define and document evaluation standards, best practices, and processes for the engineering organization
• Advocate for evaluation-driven development and make it easy for the team to write and run evals
• Partner with product and ML engineers to integrate evaluation requirements into agent development from day one
• Take full ownership of large product areas rather than executing on narrow tasks

You are an engineer who believes that evaluations are foundational to building reliable AI systems, not a nice-to-have.
The following operating principles should resonate with you:
• Evaluation-first mindset: You understand that for an AI company, not being able to evaluate a new model quickly is unacceptable
• AI-native instincts: You treat LLMs, agents, and automation as fundamental building blocks and parts of the craft of engineering
• Data-driven rigor: You make decisions based on metrics and are obsessed with measuring what matters
• Production-oriented: You understand that evaluations must work on real production behavior, not just offline datasets
• Strong product judgment: You can decide what matters and why, not just how to implement it, without waiting for guidance
• Bias to building: You move fast and build working systems rather than perfect specifications

We care more about capability and trajectory than years on a resume, but most strong candidates will have:
• Multiple years of experience shipping production software in complex, real-world systems
• Experience with TypeScript, React, Python, and Postgres
• Built and deployed LLM-powered features serving production traffic
• Implemented evaluation frameworks for model outputs and agent behaviors
• Designed observability or tracing infrastructure for AI/ML systems
• Worked with vector databases, embedding models, and RAG architectures
• Experience with evaluation platforms (LangSmith, Langfuse, or similar)
• Comfort operating in ambiguity and taking responsibility for outcomes
• Deep empathy for professional-grade, mission-critical software (experience with audit and accounting workflows is not required)

What Should Excite You
• Agent reliability at enterprise scale: Building systems that professionals depend on
• Balancing automation with human oversight: Knowing when to automate and when to surface decisions to experts
• Production feedback loops: Turning real-world agent failures into systematic improvements
• Explaining AI decisions: Making AI outputs and agent reasoning transparent and trustworthy
• Evaluation for nuanced domains: Structuring data and feedback for workflows where ground truth requires expert judgment
• High-impact visibility: Your work directly enables leadership to confidently communicate AI quality to the board and customers
Benefits
• Competitive compensation packages with meaningful ownership
• Wellness benefits, including a bundle of free therapy sessions
• Technology & Work from Home reimbursement
• Flexible work schedules