Neurons Lab - Data Engineer
Requirements
• Strong SQL and Python for large-scale data processing
• AWS data stack: S3, Glue, Lake Formation, Athena / Redshift, EMR / Spark, Step Functions / Airflow
• Data modeling & semantic layer (dbt or equivalent); dimensional modeling
• Entity resolution / record linkage across heterogeneous sources
• Data-quality & testing frameworks (Great Expectations, dbt tests) and data lineage
• Anonymization / pseudonymization techniques and their analytical trade-offs
• Big-data processing (Spark) with performance and cost optimization at scale
• Clear written and verbal English; documents work for handover and works well with a distributed team
Knowledge
• GDPR fundamentals as applied to anonymized / pseudonymized financial data and UK / EU data residency
• AWS Well-Architected (Analytics, Security) for BFSI
• Awareness of credit / risk data structures and what downstream modeling consumers need (a plus)
• 4+ years in data engineering, with strong AWS + Spark / SQL at scale
• Demonstrated experience harmonizing / integrating data across multiple source systems
• Experience building validated, reproducible pipelines in a regulated environment such as BFSI, healthcare, or government (strong plus)
• Comfortable stepping into a messy, partly-built data estate and bringing it up to standard
• Comfortable as the sole or lead data engineer on a small (3–4 person) delivery pod
Responsibilities
• Reproduce a descriptive-statistics report end-to-end so any figure traces back to raw source, closing the gap the client admitted (numbers they can't currently defend).
• Profile and reconcile differing source schemas across acquired entities: map differing field names, types, encodings and business definitions for the same concept into one conformed model.
• Build dbt staging → intermediate → mart models with tests; codify the harmonized definitions the Data Science Lead specifies.
• Write Great Expectations suites (null / range / uniqueness / referential checks) and wire them into the pipeline so bad data fails loudly rather than silently corrupting analysis.
• Implement entity / identity resolution (deterministic + fuzzy matching) where there is no clean shared key for the same customer or account across sources.
• Implement and verify anonymization / pseudonymization (hashing / tokenization / k-anonymity) and provide evidence to the client's IT / compliance team that re-identification risk is controlled.
• Optimize Spark / Glue jobs over tens of millions of rows: partitioning, file formats (Parquet), incremental loads, cost control.
• Orchestrate with Airflow / Step Functions; build repeatable, scheduled pipelines rather than one-off scripts.
• Prepare clean, documented, feature-ready datasets for the PD / delinquency models.
• Document runbooks so the offshore team can operate the pipelines and handover takes days, not weeks; help scope onboarding of the remaining (Ireland + additional) sources.