Epic Kids Inc. - Senior Software Engineer, Infrastructure

Remote - USA (posted 3w ago)

Remote, Senior, NA, Cloud Computing, Senior Software Engineer, Bash, Python, GCP, Docker, Kubernetes

Requirements

• Bachelor's degree or higher in Computer Science, Software Engineering, or a related field
• 5+ years of experience in infrastructure, platform, DevOps, or a related engineering role
• Hands-on experience with GCP (GCE, GCS, VPC, IAM, Cloud Monitoring, and related services)
• Experience with Docker and Kubernetes (GKE): containerizing workloads, deploying to GKE, Helm, and cluster fundamentals
• Experience with CI/CD pipelines (GitHub Actions, ArgoCD, Jenkins, or similar)
• Experience with an observability platform such as New Relic (metrics, logging, alerting, dashboards)
• Proficiency in Terraform for managing infrastructure as code
• Scripting/programming skills in Python, Bash, or similar
• Comfort participating in a frequent production on-call rotation
• Track record of measurably improving reliability of production systems, e.g., defining SLOs, reducing incident frequency or MTTR, eliminating recurring failure modes
• Strong problem-solving skills, sense of ownership, and ability to work effectively in evolving systems
• Fluency in English for daily collaboration and technical documentation
• Proficiency in Mandarin Chinese to collaborate effectively with global engineering and business partners
• Experience operating workflow orchestration platforms (e.g., Dagster, Airflow) as a service for data or platform teams
• Familiarity with the operational footprint of data platforms (warehouse infrastructure, job schedulers, batch workloads)
• Experience in distributed or global engineering teams
• Working knowledge of compliance frameworks (e.g., SOC 2, FERPA, COPPA) and GRC tools

Responsibilities

• Drive the stability and reliability of Epic's GCP infrastructure: setting and tracking SLOs/SLIs, reducing toil, and engineering out recurring sources of instability
• Build and operate Epic's GCP infrastructure for high availability, scalability, and cost efficiency
• Manage and harden our Docker and GKE container platform, including workload scheduling, autoscaling, networking, and graceful failure handling
• Maintain and improve CI/CD pipelines that enable fast, safe, low-risk delivery across engineering teams
• Own and evolve the observability stack (metrics, logs, traces, dashboards, and alerts) so that signals are actionable, noise is low, and on-call has the context to resolve issues quickly
• Write and maintain Terraform to codify infrastructure across the organization, with a focus on consistency, change safety, and reproducibility
• Contribute to capacity planning, cost optimization, and architectural reviews, with reliability as a first-class consideration
• Champion platform security best practices, including secrets management, IAM policies, and network segmentation
• Support compliance-aware infrastructure practices (vulnerability management, access reviews, audit-evidence flows, and incident-response readiness) as we mature our SOC 2 and student-data compliance programs
• Partner with data engineering to operate the orchestration platform and supporting infrastructure: deployment, scaling, reliability, and observability
• Collaborate with backend and data engineers to troubleshoot service and platform issues
• Lead by example in a frequent on-call rotation; drive incident response, blameless post-mortems, and the follow-through that turns one-time outages into systemic, lasting reliability improvements
• Provide guidance to developers on infrastructure concerns and best practices