Hive Financial Systems - Peach Pilot — Principal QA Engineer (AI Systems & Platform) Remote — Latin America
Requirements
• 5+ years of QA engineering experience, with meaningful time spent writing test code (not just managing test cases).
• Hands-on experience testing LLM-powered applications: you understand prompt sensitivity, output variance, and how eval pipelines catch regressions across model updates.
• You write test code. Python is your primary tool.
• Experience contributing to CI/CD-integrated test suites.
• Comfortable testing complex API chains, async/streaming responses, and multi-service workflows.
• Collaborative and self-directed: you work well as part of a team, pair well with engineers, and move work forward without hand-holding.
• Strong English communication skills, written and verbal.
• Available during US Eastern business hours with a minimum of 5 hours of daily overlap.
Even Better If
• Experience with LLM evaluation frameworks such as LangSmith, PromptFlow, or custom eval pipelines.
• Experience testing agent frameworks (LangChain, CrewAI, or similar) and agent orchestration systems.
• Experience testing graph databases (Memgraph, Neo4j) or vector stores (Qdrant).
• Background in enterprise software or regulated industries where audit trail integrity is non-negotiable.
• Insurance industry background is a plus; it is our first vertical.
The Stack You'll Test Against
• AI/LLM: Anthropic Claude, OpenAI GPT, LiteLLM (multi-model routing)
• Frontend: React/Next.js, TypeScript, Tailwind CSS
• Backend: Python (FastAPI), Node.js/TypeScript
• Data & Graph: Memgraph, Neo4j, Qdrant, PostgreSQL, Redis
• Integrations: Nango (700+ connectors)
• Infrastructure: Google Cloud Platform (Cloud Run, GCE, Firebase) · Azure (Cosmos DB, AI Search) · GitHub Actions CI/CD · Docker
• Visualization: Plotly, D3, Recharts, Mermaid
What Makes This Different
• You are joining a funded early-stage AI startup with a working platform on live infrastructure and a first client engagement already in motion. You will have access to production data, live workflows, and real compliance requirements from day one: the kind of environment where your testing work has visible, immediate impact on what clients see.
Responsibilities
First 90 Days: Build the QA Foundation
• Establish the testing framework from zero: unit, integration, end-to-end, and LLM-specific evaluation pipelines.
• Define quality standards, test coverage requirements, and documentation practices in partnership with the VP of Engineering.
• Audit the existing platform and identify the highest-risk surfaces before the next major customer deployment.
• Own the QA function end to end and be the voice of quality across the engineering team.
• Design evaluation frameworks for non-deterministic LLM outputs, including prompt regression testing, model drift detection, and output quality scoring across Claude, GPT-4o, Grok, and Gemini.
• Build automated test suites for the agent orchestration layer, including governance-agent audit trail integrity and human-override behavior.
• Validate the Enterprise Knowledge Graph (Neo4j + vector search) for data accuracy, retrieval quality, and failure modes under real enterprise data conditions.
• Own end-to-end testing of the file ingestion pipeline across document types (Word, Excel, PowerPoint, PDF), including encryption, formatting edge cases, and audit trail continuity.
• Validate streaming response handling, latency thresholds, and graceful degradation when a model is unavailable or slow.
• Test multi-model routing logic to confirm cost-optimized task allocation behaves correctly across LLM providers.
• Partner with the Full-Stack Engineer to define and test trust-layer UX standards: onboarding flows, progressive disclosure, uncertainty states, and real-time document viewers.
• Act as the internal advocate for the non-technical enterprise user: if a CEO would be confused by it, it does not ship.
Benefits
• Competitive contractor rate commensurate with experience. Paid monthly via Deel in USD.
The Clincher
• Tell us about a quality failure: one you caught before it shipped, or one that got through. What did you build or change after it, and how did you make sure your team could catch the next one without you?