Protege - Senior Software Engineer, Data Processing

Remote - USA · 3w ago

Remote, Senior, NA, Cloud Computing, Artificial Intelligence, Senior Software Engineer, Senior Data Engineer, AWS, Airflow, Dagster, Data Quality

Requirements

• 5+ years building and operating production backend or data systems, with real experience in data processing at scale
• Hands-on experience designing and running large-scale data pipelines
• Experience with distributed data processing
• Strong proficiency with AWS
• Comfort with messy, varied, high-volume data and high ambiguity, with a knack for finding patterns in complex environments
• Attention to detail without losing speed, and a bias to action
• Excited to work on a product built around moving and processing large volumes of data
• Curious, tenacious, and proactive
• Experience processing one or more specific modalities at scale: medical imaging (e.g., DICOM), text, audio, or video
• Background working with sensitive or regulated data environments (HIPAA, healthcare compliance, PHI handling)
• Experience with streaming systems or workflow orchestration (e.g., Airflow, Dagster)
• Prior startup experience as a founding or early engineer
• Familiarity with ML, NLP, or LLM-based systems, including embeddings and fine-tuning

Responsibilities

INGESTION & PROCESSING SYSTEMS

• Design, build, and operate the ingestion systems that process large volumes of multimodal data into usable, well-structured datasets
• Own the ingestion path end to end, from how data lands to how it is validated, processed, tracked, and made available downstream
• Build modality-specific processing steps for real-world source data, such as medical imaging processing, audio and video metadata extraction, quality validation, and notes processing
• Build parsers, validators, and normalization logic that can systematically handle messy, non-standard, and high-variance source formats
• Turn repeated one-off data handling work into reusable processing patterns, internal tooling, and platform capabilities

SCALE, PERFORMANCE & RELIABILITY

• Build for high volume and high throughput, optimizing systems for reliability, cost, and speed
• Work across distributed and parallel compute systems to process workloads that do not fit well on a single machine
• Choose the right execution model for the workload, including batch processing, distributed execution, and modern compute patterns for unstructured data and inference-heavy processing
• Diagnose and resolve bottlenecks across ingestion and processing systems, and keep performance from degrading as volume and modality complexity grow

DATA QUALITY, SECURITY & COMPLIANCE

• Build validation and quality checks that catch bad, incomplete, or malformed data before it propagates downstream
• Handle sensitive and regulated data, including PHI, with the security and care the domain demands, including de-identification where required
• Track provenance, metadata, and usage constraints through the ingestion path so downstream use remains compliant and auditable
• Raise the quality bar for observability, debuggability, and operational reliability across the ingestion layer

CROSS-FUNCTIONAL PARTNERSHIP

• Partner with product and Data Lab to support new modalities, new partner requirements, and non-standard source data
• Work directly with partner engineering teams when needed to translate source-system realities into robust ingestion and processing design
• Surface recurring patterns that are worth standardizing into reusable transforms, validators, and internal tooling
• Help shape how Protege handles new data types as the platform expands into more complex data environments

WHAT SUCCESS LOOKS LIKE

• Get productive in the codebase and ship your first improvements to existing pipelines
• Build a working map of the ingestion and processing stack, the major data flows, and how we handle each modality
• Meet the engineering, product, and Data Lab teams to understand how the function operates across the company

60 DAYS: TAKE OWNERSHIP

• Own a processing pipeline or modality end to end, from ingestion through delivery of AI-ready output
• Develop depth in how we handle one or two data types at scale
• Start raising the bar on data quality, observability, and processing best practices

90 DAYS: OPERATE INDEPENDENTLY

• Own a significant part of the ingestion and processing layer and lead design on new modalities or scaling challenges
• Ship reliably with minimal hand-holding, and help unblock others working in the data layer
• Identify at least one leverage opportunity (a reusable transform, tool, or architectural improvement) worth investing in, and drive it