Apollo Research - Backend Engineer (Product)

London · £100k - £180k · 5mo ago

In Office, Mid, EMEA, Cloud Computing, Developer Tools, Backend Engineer, Documentation, Apollo, Flask, Django, FastAPI

Requirements

• 4+ years of experience building production backend systems at scale
• Strong Python proficiency with experience in frameworks like FastAPI, Flask, or Django
• Experience designing and implementing RESTful APIs with clear documentation
• Solid understanding of database design and optimization (SQL and/or NoSQL)
• Experience with cloud platforms (AWS, Google Cloud, or Azure) and containerization technologies (Docker, Kubernetes)
• Experience building data-intensive applications or processing large-scale log data
• Strong understanding of system design principles, including scalability, reliability, and security
• Experience with asynchronous processing, message queues, and distributed systems
• Demonstrated ability to write clean, well-tested, maintainable code
• Familiarity with real-time data processing frameworks (Kafka, Redis Streams, etc.)
• Experience with ML/AI infrastructure or building tools for AI applications
• Previous work on developer tools, monitoring systems, or security tools
• Experience with infrastructure-as-code (Terraform, CloudFormation, etc.)
• Familiarity with AI safety concepts or evaluation frameworks like Inspect
• Contributions to open-source backend infrastructure projects
• Experience building security-centric tools

We want to emphasize that people who feel they don't fulfill all of these characteristics but think they would be a good fit for the position nonetheless are strongly encouraged to apply. We believe that excellent candidates can come from a variety of backgrounds and are excited to give you opportunities to shine.

Representative Project

Real-time agent monitoring infrastructure: Design and build the backend system that processes AI coding agent outputs in real-time to detect safety and security issues. Start by implementing a scalable ingestion pipeline that can accept agent logs via API, then build a processing system that routes logs through various monitors based on their characteristics. Implement a storage layer that efficiently handles both recent high-frequency queries and historical analysis. Add a notification system that alerts users when monitors detect concerning behaviors, with configurable thresholds and delivery methods. Throughout the project, ensure the system maintains sub-second p95 latency for critical operations while gracefully handling traffic spikes and partial system failures.
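
As a rough illustration of the routing step described in the project above, the sketch below sends each log only to the monitors whose predicate matches it and raises an alert when a monitor's score reaches its configurable threshold. Everything in it is an assumption made for illustration: the `AgentLog`, `Monitor`, and `Alert` types, the `kind` field, the example monitors, and the threshold values are hypothetical and do not describe Apollo Research's actual system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AgentLog:
    agent_id: str
    kind: str      # hypothetical routing characteristic, e.g. "shell" or "code_edit"
    content: str


@dataclass
class Monitor:
    name: str
    applies_to: Callable[[AgentLog], bool]  # decides whether this monitor sees the log
    score: Callable[[AgentLog], float]      # severity in [0, 1]
    threshold: float = 0.5                  # configurable alert threshold


@dataclass
class Alert:
    monitor: str
    agent_id: str
    severity: float


def route(log: AgentLog, monitors: List[Monitor]) -> List[Alert]:
    """Run the log through every applicable monitor and collect alerts."""
    alerts = []
    for monitor in monitors:
        if not monitor.applies_to(log):
            continue
        severity = monitor.score(log)
        if severity >= monitor.threshold:
            alerts.append(Alert(monitor.name, log.agent_id, severity))
    return alerts


# Two toy monitors (hypothetical rules, chosen only to make the example concrete).
MONITORS = [
    Monitor(
        name="destructive-shell",
        applies_to=lambda log: log.kind == "shell",
        score=lambda log: 1.0 if "rm -rf" in log.content else 0.0,
    ),
    Monitor(
        name="secret-leak",
        applies_to=lambda log: True,
        score=lambda log: 0.9 if "AWS_SECRET_ACCESS_KEY" in log.content else 0.0,
        threshold=0.8,
    ),
]

alerts = route(AgentLog("agent-1", "shell", "rm -rf /tmp/build"), MONITORS)
quiet = route(AgentLog("agent-2", "code_edit", "print('hello')"), MONITORS)
```

Here `alerts` holds a single alert from the `destructive-shell` monitor and `quiet` is empty. In a production design the loop would more likely sit behind a message queue with delivery handled asynchronously, but the predicate-plus-threshold shape stays the same.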

Responsibilities

Infrastructure & Architecture

• Design and implement scalable backend systems capable of processing and analyzing large volumes of AI agent logs in real-time
• Build and maintain data processing pipelines that extract, transform, and store agent trajectory data efficiently
• Architect database schemas and data models optimized for both high-throughput writes and complex analytical queries
• Design for reliability, implementing robust error handling, retry logic, and graceful degradation strategies
• Monitor system performance and optimize bottlenecks to ensure sub-second latency for critical monitoring operations

API Development

• Develop secure, well-documented RESTful APIs that allow users to integrate our monitoring tools into their workflows
• Implement authentication, authorization, and rate limiting to protect users' data and ensure fair resource usage
• Build webhook systems and real-time notification services to alert users about critical safety events
• Design API interfaces that are intuitive for developers while remaining flexible for diverse use cases
• Design and implement integrations with Security Information and Event Management (SIEM) systems, enabling users to stream monitoring alerts and security events into their existing security operations workflows
• Implement efficient storage solutions for both structured data (monitoring results, metadata) and unstructured data (agent logs, code outputs)
• Build data processing systems that can handle everything from streaming real-time monitoring to batch analysis of historical data
• Design and implement caching strategies to optimize frequent queries and reduce infrastructure costs
• Create data retention and archival policies that balance users' needs with storage efficiency

Monitoring & Observability

• Build comprehensive logging, metrics, and tracing systems to ensure visibility into system health and performance
• Implement alerting systems that notify the team of infrastructure issues before they impact users
• Create dashboards and tools that help the team understand system behavior and diagnose issues quickly
• Design systems that make debugging production issues straightforward and minimize time-to-resolution

Collaboration & Quality

• Work closely with our researchers to understand their needs and translate research prototypes into production-ready systems
• Collaborate with frontend engineers to design APIs and data structures that enable excellent user experiences
• Participate in code reviews to maintain high standards for code quality, security, and performance
• Document architectural decisions, API specifications, and system behaviors to facilitate knowledge sharing
• Contribute to technical discussions about technology choices, trade-offs, and implementation approaches

Benefits

• Salary: 100k - 180k GBP (~135k - 245k USD)
• Flexible work hours and schedule
• Unlimited vacation
• Unlimited sick leave
• Lunch, dinner, and snacks are provided for all employees on workdays
• Paid work trips, including staff retreats, business trips, and relevant conferences
• A yearly $1,000 (USD) professional development budget
• Start Date: Target of 2-3 months after the first interview
• Time Allocation: Full-time
• Location: The office is in London, and the building is next to the London Initiative for Safe AI (LISA) offices. This is an in-person role. In rare situations, we may consider partially remote arrangements on a case-by-case basis
• Work Visas: We can sponsor UK visas