G2i Inc. - Senior Software Engineer — AI Evaluation & Benchmarks (Python)

Remote - Bolivia, Ecuador, Paraguay... · $166k - $208k · 1mo ago

Remote · Senior · LATAM · Payments · Artificial Intelligence · Senior Software Engineer · Go · C++ · JavaScript · Git · JUnit

Requirements

• 4+ years of professional software engineering experience (non-negotiable)
• Expert Python: clean, performant, well-tested code
• Hands-on experience working in large, complex codebases
• Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines
• Strong command of Git and modern development workflows
• Track record at a high-growth tech company or top-tier software organization
• Strong written English communication
• Identity verification: Applicants will be required to verify their identity and confirm they have valid documentation to work as an independent contractor in their country of residence.
• Senior or Lead-level profile with a history of technical ownership
• Bachelor's or Master's in CS, ML, or related field (or equivalent professional experience)
• Proficiency in additional languages: JavaScript, Go, C++, or others
• CI/CD experience and writing robust unit tests (pytest, Mocha, JUnit)
• Background in security engineering or significant open-source contributions
• Familiarity with AI/ML evaluation methodologies or model benchmarking

Logistics

• Location: Fully remote; work from anywhere on the accepted locations list
• Compensation: $80–$100/hr based on location and seniority
• Contract length: 3 months, with potential for extension
• Hours: Full-time availability preferred; hours vary by project and are not guaranteed week to week
• Engagement: 1099 independent contractor
• Payment: Weekly via PayPal or Stripe
• ⚠️ Important: Hours are project-dependent and can vary week to week. We recommend keeping other work options open alongside this engagement rather than relying on it as your sole source of income.

Responsibilities

Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work:

• Design coding benchmarks that evaluate frontier models on real-world programming tasks: reasoning, debugging, and production-quality code
• Build and maintain scalable data pipelines for evaluation workflows
• Analyze model-generated code for correctness, reliability, and edge-case failures
• Construct structured evaluation scenarios across large repos and multi-language environments
• Provide detailed technical feedback on model performance and failure patterns
• Contribute to evaluation frameworks that set the bar for how coding ability is measured

End result: benchmarks that meaningfully separate what frontier models can and can't do, and shape how the next generation is trained and improved.

AI coding evaluation in one line: Design task → build harness → run model → analyze failures → feed findings back into the benchmark → evaluations that actually distinguish strong models from weak ones.
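For candidates who want a concrete picture of the "design task → build harness → run model → analyze failures" loop, here is a minimal Python sketch. It is an illustration only, not the harness used on this project: the `Task` schema, the failure labels, and `stub_model` (a stand-in for a real model call) are all hypothetical names made up for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Task:
    """One benchmark item (hypothetical schema): a prompt plus hidden test cases."""
    task_id: str
    prompt: str
    entry_point: str                          # name of the function the model must define
    cases: list[tuple[tuple[Any, ...], Any]]  # (positional args, expected result)


def run_task(task: Task, model: Callable[[str], str]) -> str:
    """Run one task against a model and return 'pass' or a coarse failure label."""
    source = model(task.prompt)
    namespace: dict[str, Any] = {}
    try:
        # exec is unsafe for untrusted code; a real harness would sandbox this step.
        exec(source, namespace)
    except Exception:
        return "does_not_load"
    func = namespace.get(task.entry_point)
    if not callable(func):
        return "missing_entry_point"
    for args, expected in task.cases:
        try:
            if func(*args) != expected:
                return "wrong_answer"
        except Exception:
            return "runtime_error"
    return "pass"


def evaluate(tasks: list[Task], model: Callable[[str], str]) -> Counter:
    """Tally outcomes across tasks so failure patterns are easy to inspect."""
    return Counter(run_task(task, model) for task in tasks)


def stub_model(prompt: str) -> str:
    """Stand-in for a model call; its mean() ignores the empty-list edge case."""
    return (
        "def add(a, b):\n"
        "    return a + b\n"
        "\n"
        "def mean(xs):\n"
        "    return sum(xs) / len(xs)\n"
    )


TASKS = [
    Task("add", "Write add(a, b).", "add", [((1, 2), 3), ((-1, 1), 0)]),
    Task("mean", "Write mean(xs); return 0.0 for an empty list.", "mean",
         [(([2, 4],), 3.0), (([],), 0.0)]),
]

if __name__ == "__main__":
    print(evaluate(TASKS, stub_model))
```

With the stub above, the `add` task passes and the `mean` task is labeled `runtime_error` because the empty-list case divides by zero. That is the kind of edge-case failure the "analyze failures" step is meant to surface and feed back into the benchmark.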
