Parallel Domain - Senior Site Reliability Engineer

Remote - Pacific Northwest Area · $145k - $185k + Equity · 1mo ago

Remote · Senior · NA · Cloud Computing · Artificial Intelligence · Site Reliability Engineer · Terraform · AWS · Kubernetes · Helm · Bash

Requirements

• Experience. 5+ years in SRE, DevOps, or infrastructure engineering roles, with a track record of operating production systems across multiple regions.
• Terraform. Modules, state management, and multi-environment patterns.
• AWS depth. Solid experience across VPC, IAM, EKS, S3, and CloudWatch.
• Kubernetes expertise. Cluster operations, autoscaling, RBAC, and Helm.
• CI/CD and GitOps. Experience with GitHub Actions, ArgoCD, or similar workflows.
• Networking fundamentals. CIDR, DNS, load balancing, VPN, and cross-region connectivity.
• Observability. Experience with tooling such as Prometheus and Grafana.
• Scripting. Comfort with Python and Bash for tooling and automation.
• Cross-platform familiarity. Working knowledge of both Linux and Windows environments. Operational experience supporting Windows-based workloads is a meaningful advantage.
• Pragmatism and ownership. Comfortable in a fast-moving startup with evolving priorities. You take ownership of systems while collaborating closely with other teams, and you're pragmatic about tradeoffs between speed, reliability, and complexity.
• Windows on Kubernetes. Experience with Windows node pools, Windows AMIs, and GPU-adjacent components on K8s.
• GPU scheduling. Familiarity with GPU scheduling on Kubernetes, including NVIDIA device plugin configuration.
• Domain workloads. Experience supporting simulation, ML, or rendering workloads in cloud infrastructure.
• AWS extras. Exposure to AWS Storage Gateway, Active Directory integrations, or AWS Transfer Family.
• Service mesh. Familiarity with service proxy or service mesh patterns.
• Container OS. Experience with container-optimized OS images (e.g., Bottlerocket) and image-build tooling such as Packer.
• Cost optimization. Cloud cost optimization at scale.
• Terraform · AWS · Kubernetes · Helm · Kustomize · ArgoCD · GitHub Actions · Prometheus · Grafana · Docker · Python · Bash

What Makes a Great Candidate

• You think in failure modes and proactively surface issues.
• You hold a principled view on security and push back constructively when designs introduce unnecessary risk.
• You communicate clearly across engineering, product, and customer-facing teams, flagging issues with urgency proportional to customer impact.
• You take end-to-end ownership of complex efforts and know when to push for the clean solution versus the pragmatic one.
• Base salary range of CAD $145,000–$185,000, depending on skills, qualifications, and experience, plus equity, full health/dental/vision coverage, learning stipend, and generous vacation. This role is remote-friendly across Canada and the US Pacific Northwest.
• We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.

Responsibilities

• Infrastructure ownership and cloud operations. Design, build, and maintain multi-region AWS infrastructure using Terraform. Operate and scale EKS clusters across production regions: autoscaling, node lifecycle, workload health. Manage networking across environments: VPC design, DNS, load balancing, and cross-region connectivity. Support infrastructure changes, migrations, and expansions into new regions. Contribute to and improve GitOps-based deployment workflows using GitHub Actions, Helm, and Kustomize.
• Reliability engineering and incident response. Help build and run incident management processes: severity definitions, escalation paths, on-call practices. Lead incident response, debugging, and root-cause analysis. Write postmortems and drive systemic reliability improvements from what they surface. Improve observability across metrics, logging, tracing, and dashboards. Support GPU and batch workloads running on Kubernetes.
• Security and access management. Provide security-conscious feedback on platform architecture decisions. Own cloud IAM governance: roles, policies, and access boundaries across accounts and services. Lead compliance-adjacent work including audit-readiness, partner certification requirements, and supporting responses to customer security questionnaires.