wagey.gg

AI Data Engineer

Instrumentl · Hybrid - USA · $175k – $220k + Equity · 3w ago
Remote · Mid · NA · Cloud Computing · Artificial Intelligence · Data Engineer · Quant · Ruby · Python · FastAPI · Celery · TypeScript


Requirements

• Software engineering background: 5+ years of professional software engineering experience, including 2+ years working with modern LLMs (as an IC). Startup experience and comfort operating in fast, scrappy environments are a plus.
• Proven production impact: You’ve taken LLM/RAG systems from prototype to production, owned reliability/observability, and iterated post-launch based on evals and user feedback.
• LLM agentic systems: Experience building tool/function-calling workflows, planning/execution loops, and safe tool integrations (e.g., with LangChain/LangGraph, LlamaIndex, Semantic Kernel, or custom orchestration).
• RAG expertise: Strong grasp of document ingestion, chunking/windowing, embeddings, hybrid search (keyword + vector), re-ranking, and grounded citations. Experience with re-rankers/cross-encoders, hybrid retrieval tuning, or search/recommendation systems.
• Embeddings & vector stores: Hands-on with embedding model selection/versioning and vector DBs (e.g., pgvector, FAISS, Pinecone, Weaviate, Milvus, Qdrant). Document processing at scale (PDF parsing/OCR), structured extraction with JSON schemas, and schema-guided generation.
• Evaluation mindset: Comfort designing eval suites (RAG/QA, extraction, summarization), using automated and human-in-the-loop methods; familiarity with frameworks like Ragas/DeepEval/OpenAI Evals or equivalent.
• Infrastructure & languages: Proficiency in Python (FastAPI, Celery) and TypeScript/Node; familiarity with Ruby on Rails (our core platform) or willingness to learn.
• Experience with AWS/GCP, Docker, CI/CD, and observability (logs/metrics/traces).
• Data chops: Comfortable with SQL, schema design, and building/maintaining data pipelines that power retrieval and evaluation.
• Collaborative approach: You thrive in a cross-functional environment and can translate researchy ideas into shippable, user-friendly features.
• Results-driven: Bias for action and ownership with an eye for speed, quality, and simplicity.
• Fine-tuning: Practical experience with SFT/LoRA or instruction-tuning (and good intuition for when fine-tuning vs. prompting vs. model choice is the right lever). Exposure to open-source LLMs (e.g., Llama) and providers (e.g., OpenAI, Anthropic, Google, Mistral). Familiarity with responsible AI, red-teaming, and domain-specific safety policies.

Responsibilities

• Build content discovery pipelines: Automate discovery and acquisition of grant-related content from the web (foundation websites, RFPs, program announcements), turning the open web into structured, actionable data.
• Build LLM extraction pipelines: Implement production pipelines to transform unstructured text into canonical business objects, including document ingestion (PDFs, HTML, Word), OCR, table extraction, and layout-aware parsing. Partner with product engineers to evolve schemas as domain needs change.
• Own semantic chunking and embeddings: Design chunking strategies optimized for retrieval; select and manage embedding models; maintain vector indices that power downstream search and RAG features.
• Optimize for cost and latency: Profile token usage, implement caching and batching strategies, choose appropriate models for different tasks, and manage the cost/quality tradeoff at scale.
• Maintain data quality and serve downstream consumers: Implement validation, anomaly detection, and alerting for extraction drift. Expose clean data via APIs, materialized views, or event streams that product teams can rely on without understanding the extraction complexity. Integrate and normalize data from external providers: resolve entities, map to internal schemas, and ensure "Ford Foundation" and "The Ford Foundation" resolve to the same canonical record.

Benefits

• For US-based candidates, our target salary band is $175,000 - $220,000 USD + equity. Salary decisions consider experience, location, and technical depth.
• 100% covered health, dental, and vision insurance for employees (50% for dependents)
• Generous PTO, including parental leave
• Company laptop and home-office stipend
• Bi-annual company retreats for in-person collaboration
• Instrumentl is evolving rapidly. You’ll always have new challenges and opportunities to grow here.


© 2026 Dominic Morris. All rights reserved.