ML Data Engineer
Requirements
Must-have
• Strong Python fundamentals; you write clean, maintainable, production-ready code.
• Solid hands-on Kubernetes experience (containers, jobs, batch/distributed processing).
• Proven track record with unstructured data, especially images (loading, filtering, transforming at scale).
• Experience developing data-ingestion or parsing tools for publicly accessible sources, including handling real-world reliability and failure cases gracefully.
• Comfort with S3/object storage and moving lots of data efficiently and safely.
• Pragmatic, detail-oriented, ownership mindset; you enjoy making systems reliable and fast.
Nice-to-have
• Familiarity with ML workflows (PyTorch) and downstream training considerations.
• Experience with image quality scoring, captioning, or image-to-text pipelines.
• DAG/workflow visualizations or pipeline UX tooling.
• DevOps fluency: Docker, CI/CD, infra automation.
Responsibilities
• Develop and maintain data-ingestion pipelines to source and prepare large-scale image (and occasional text/HTML) datasets from open, publicly accessible, and permitted sources.
• Own the end-to-end flow: raw data → quality/beauty/relevance filtering → dedup/validation → ready-to-train artifacts.
• Operate and improve our Kubernetes-based data-pipeline framework (distributed jobs, retries, monitoring, automation).
• Work with S3-style object storage: efficient layouts, lifecycle, throughput, and cost awareness.
• Add tooling around pipelines (progress/health visualization, metrics, alerts) for observability and faster iteration.
• Collaborate closely with ML engineers to align datasets with training needs and accelerate experimentation.
Benefits
• We’re able to offer Skilled Worker visa sponsorship in the UK for qualified candidates.
• Real impact on model quality: your pipelines directly power training runs and product improvements.
• Ownership with support: autonomy to design and improve systems, alongside experienced ML peers.
• Modern stack: Python, Kubernetes, S3, internal pipeline framework built for scale.
• Growth: a fast-moving environment where shipping well-engineered systems is the norm.