Cosine – Machine Learning Engineer – Lumen Enterprise Models (SWE-focused LLMs)
Requirements
• Strong experience training deep learning models in production:
  • Typically 3–5+ years working as an ML engineer / applied scientist, including hands-on responsibility for training and shipping models.
• Deep proficiency with PyTorch and its primitives:
  • Comfort implementing custom training loops, losses, and dataloaders.
  • Hands-on experience with torch.distributed (DDP/FSDP-style training, distributed data loading, gradient scaling, etc.).
• Experience training large sequence models or LLMs:
  • Have trained models at ≥70B parameters end-to-end on multi-GPU setups.
  • Understand practical issues: stability, initialization, scaling laws, gradient accumulation, curriculum and sampling strategies.
• Experience with SFT and RL on top of LLMs:
  • Have implemented or meaningfully modified at least one RLVR system (e.g. PPO-style, GRPO-style, or similar).
  • Comfortable working with advantages, policy ratios, KL penalties, and sequence-level rewards.
• Strong software engineering background:
  • You can read, debug, and write non-trivial production code (Python, plus familiarity with at least one of TypeScript or Go).
  • You care about code quality, correctness, and maintainability as much as model metrics.
  • High level of Git proficiency.
• Distributed systems / training ops experience:
  • Practical experience running multi-node jobs on GPU clusters (Slurm, Kubernetes, or managed cloud equivalents).
  • Familiarity with GPU performance tuning: memory usage, mixed precision, throughput vs. latency tradeoffs.
• Data engineering instincts:
  • Comfortable working with large-scale datasets, object storage, dataset sharding, and filtering.
  • Know that data quality and sampling strategies matter as much as architecture.
• Clear communication and ownership:
  • Can take a vague modelling goal ("make Lumen Enterprise better at X") and turn it into a concrete plan of experiments.
  • Comfortable documenting decisions and walking others through tradeoffs.
You don’t need all of these, but the more you have, the faster you’ll hit the ground running:
• Continued pretraining and long-context experience:
  • Have run continued pretraining on domain-specific or long-context corpora.
  • Familiarity with techniques like RoPE scaling, YaRN-style extrapolation, context parallelism, or similar.
• Code-focused RL and evaluation:
  • Experience building RL loops where rewards come from code execution (tests, linters, static analysis, fuzzing, runtime traces).
  • Familiarity with evaluation benchmarks for code models (e.g. HumanEval, MBPP, SWE-bench, or internal equivalents).
• Experience with modern LLM training stacks:
  • Experience with large MoE models and expert/tensor parallelism is a plus.
• Serving and online training:
  • Experience tuning inference for open-source serving frameworks, e.g. vLLM, SGLang.
• Safety, robustness, and reward shaping:
  • Experience with LLM-as-a-judge, reward-hacking detection, or robustness evaluation.
• Open-source contributions or research:
  • Contributions to open-source LLM tooling, RL libraries, or relevant research papers in LLM training / RLHF / code models.
Responsibilities
• Develop machine learning models for enterprise solutions, with a focus on SWE languages such as Python, R, and Java.
• Collaborate closely with product teams to understand business needs and translate them into technical requirements.
• Implement data preprocessing techniques such as normalization, encoding categorical variables, handling missing values, and feature selection/extraction for model inputs.
• Evaluate model performance using appropriate metrics (accuracy, precision, recall, F1 score) and visualizations to understand strengths and weaknesses in real business scenarios.
• Optimize existing ML algorithms or develop new ones tailored to Lumen Enterprise Models’ needs by experimenting with different model architectures and hyperparameter tuning techniques such as grid search and randomized search.
• Deploy machine learning models into production and monitor their performance over time to ensure they keep meeting business objectives. Depending on client requirements, this may involve setting up APIs for real-time predictions or batch processing jobs.
• Document ML workflows, model architectures, data preprocessing steps, evaluation metrics, and results clearly and concisely, so knowledge transfers easily within teams and to external stakeholders such as business analysts or product managers who need to understand the models without deep technical expertise.
• Stay current with machine learning research, tools, and libraries by attending conferences and seminars and reading relevant publications, applying cutting-edge techniques and best practices from industry leaders such as Google AI Research to continuously improve ML solutions for Lumen Enterprise Models’ clients.
• Participate actively in code reviews and pair programming sessions as part of the team’s continuous-learning culture, improving coding standards and reducing technical debt, while mentoring junior engineers on ML development best practices within Cosine’s specific context.
• Communicate effectively with product management, business analysts, and external stakeholders about the data requirements for ML models that align with Lumen Enterprise Models’ overall product strategy, and provide technical insight into how these solutions can be implemented efficiently within Cosine’s technology stack.
• Handle customer queries about the performance or usage of deployed ML systems and provide timely support: troubleshooting issues, escalating concerns when necessary, and carrying out ongoing post-deployment maintenance for Lumen Enterprise Models’ clients via channels such as Slack and email.
• Manage version control repositories (e.g. Git) and automate build processes to ensure a smooth development workflow across multiple teams working simultaneously on different aspects of the product.
Benefits
Cosine at a glance
• At Cosine, we’re building autonomous AI engineers that plan, write, and ship code inside real development workflows.
• Cosine is designed for on-premise and virtual private cloud (VPC) deployments, including fully air-gapped environments. We build our agent tooling entirely in-house and post-train open-source models to deliver reliable, enterprise-grade coding performance in security-critical settings.
• In 2024, Cosine achieved a 72% score on OpenAI’s SWE-Lancer benchmark, placing us among the strongest real-world software-engineering AI systems evaluated.
• YC-backed and well-funded, Cosine was founded by experienced operators focused on building dependable, production-grade AI.
• This role is based in our Hoxton office, five days a week, because close collaboration, fast feedback, and shared context matter for the problems we’re solving.
• Direct impact: Your work directly shapes the next generations of Lumen Enterprise SWE models that engineers use every day.
• Real scale: You’ll work with large, modern open-source models, long context lengths, and multi-node training runs.
• Full-stack ML engineering: From custom PyTorch code and distributed systems to data curation, RL design, and MLOps.
• Research + pragmatism: You’ll stay close to the latest literature in SFT and code LLMs, but you’ll be judged by shipped improvements, not just ideas.
If this sounds like a fit, this is a role where you can meaningfully push the frontier of open-source-based software engineering models.