SUBMER - Sr Software Engineer (Kubernetes and Distributed Systems) Radian Arc (EMEA)
Requirements
• 5+ years of experience building distributed systems or infrastructure platforms.
• Strong programming experience in Go.
• Experience developing Kubernetes operators and controllers.

Kubernetes Platform Engineering
• Strong understanding of Kubernetes internals and control plane architecture.
• Experience building infrastructure automation around Kubernetes.
• Familiarity with multi-tenant Kubernetes environments.

Distributed Systems
• Experience designing and operating distributed systems at scale.
• Understanding of distributed state management and service coordination.
• Experience building reliable, highly available infrastructure services.

Infrastructure & Systems Knowledge
• Strong Linux systems knowledge.
• Experience troubleshooting complex production systems.
• Understanding of networking and storage infrastructure used by distributed systems.

Operational Excellence
• Experience operating high-availability production systems.
• Familiarity with observability tooling such as Prometheus and Grafana.
• Experience participating in on-call rotations and incident response.

Personal Attributes
• Strong analytical and problem-solving skills.
• Excellent communication and collaboration abilities.
• Passion for building reliable infrastructure systems at scale.
Responsibilities
• Design and develop the platform control plane responsible for managing GPU cloud infrastructure and distributed AI workloads.
• This role focuses on building the core orchestration services that provision, manage, and coordinate compute, networking, and storage resources across global GPU clusters.
• The Senior Software Engineer will design and implement distributed systems that power the platform’s control plane, enabling reliable orchestration of Kubernetes clusters, GPU workloads, and multi-tenant infrastructure. You will build APIs, services, and Kubernetes-native operators that automate infrastructure lifecycle management and provide the primitives required to run large-scale AI workloads across multiple regions.
• This role works closely with platform, networking, storage, and infrastructure teams to ensure the control plane integrates seamlessly with the underlying GPU infrastructure, networking fabrics, and disaggregated storage systems.
• The emphasis is on independently delivering major control-plane components, solving difficult distributed-systems and orchestration problems, and improving platform reliability and operability within the broader platform direction.

Platform Control Plane Development
• Design and develop the platform control plane services responsible for managing GPU cloud infrastructure.
• Implement APIs and services that orchestrate compute, networking, and storage resources.
• Build distributed services responsible for cluster lifecycle management and infrastructure orchestration.
• Implement reliable state management systems for distributed infrastructure components.

Kubernetes Platform Integration
• Develop Kubernetes operators and controllers that automate platform infrastructure.
• Implement cluster lifecycle APIs responsible for:
  ○ Cluster provisioning.
  ○ Cluster upgrades.
  ○ Node lifecycle management.
  ○ Infrastructure automation.
• Integrate platform services with Kubernetes control planes running on bare-metal infrastructure.

AI Infrastructure Orchestration
• Develop orchestration frameworks that manage GPU workloads across distributed clusters.
• Implement platform services that optimize resource scheduling and utilization for AI workloads.
• Integrate the platform control plane with components such as:
  ○ NVIDIA GPU Operator.
  ○ Argo Workflows.
  ○ SLURM integration.
  ○ KubeVirt virtualization.

Distributed Systems Engineering
• Build distributed systems that coordinate workloads across multi-region GPU clusters.
• Implement services capable of handling high-throughput infrastructure orchestration workloads.
• Design scalable mechanisms for distributed state management and coordination.
• Contribute practical design input for platform components.

Reliability & Operations
• Implement observability, monitoring, and alerting for platform services.
• Participate in incident response and on-call rotations for platform systems.
• Perform root cause analysis and implement systemic improvements to platform reliability.

Engineering Excellence
• Drive technical design decisions for platform components.
• Maintain high standards for testing, CI/CD, and operational safety.
• Participate in architecture discussions, code reviews, and system design.
• Contribute to repeatable patterns, implementation quality, and operational maturity within the platform software domain.

Technical Stack
• Platform Development
  ○ Kubernetes controllers / operators.
  ○ Distributed systems architecture.
  ○ REST / gRPC APIs.
• Platform Infrastructure
  ○ GitOps workflows.
• AI Platform Components
  ○ NVIDIA GPU Operator.
  ○ Argo Workflows.
  ○ SLURM integration.
• Storage Integration
  ○ Weka distributed storage.
  ○ VAST disaggregated storage.
Benefits
• Attractive compensation package reflecting your expertise and experience.
• A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach.
• You'll be part of a fast-growing scale-up with a mission to make a positive impact, offering an exciting career evolution.
• Our job titles may span more than one job level. The actual base pay is dependent on a number of factors, such as transferable skills, work experience, business needs and market demands.
• Our inclusive responsibility