simspace-corporation - Staff Site Reliability Engineer

Remote - USA · $165k - $230k+ Equity · 1mo ago

Remote, Staff, NA, Site Reliability Engineer, Principal, Go, Python, Kubernetes, Move, Kustomize, Grafana, Coaching, Stakeholder Management

Requirements

In this position, you'll provide overarching technical leadership across our SRE practice, bridging traditional site reliability, DevOps, and DevSecOps. You'll architect the systems and strategies that allow SimSpace to deliver software seamlessly across our own data centers, to customers who bring their own hardware, and as pre-packaged appliances with bundled hardware and software. As our on-premises product matures and scales, you will design the long-term automation frameworks that make these varied deployments robust, secure, and repeatable.

What will you be doing as a Staff SRE at SimSpace?

• Technical Strategy & Architecture: Design and architect the overarching infrastructure strategy that enables consistent, repeatable, and secure deployments across SimSpace-hosted data centers, customer-provided hardware, and highly restricted air-gapped environments.
• Platform Evolution & Configuration Management: Lead the evolution of our CI/CD and Kubernetes platforms. Drive advanced application packaging, templating, and configuration management strategies using Jsonnet and Grafana Tanka (alongside Kustomize). Move beyond maintaining pipelines to architecting multi-cluster, multi-environment deployment frameworks that drastically improve developer velocity.
• Reliability Leadership: Define, measure, and govern Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets across the engineering organization. Partner with product and engineering leadership to balance feature delivery with platform stability.
• Advanced Observability: Architect our enterprise observability strategy using the Grafana stack. Design frameworks for proactive monitoring, complex anomaly detection, and distributed tracing that give teams unparalleled visibility into system health, pod scaling, and latency bottlenecks.
• Security & Compliance Architecture: Drive the infrastructure security posture at an architectural level. Embed advanced container security, zero-trust network segmentation, and automated compliance policies directly into our deployment pipelines and runtime environments.
• Cross-Functional Enablement: Serve as a strategic partner and consultant to development teams. Advocate for an "SRE culture" by designing self-service tooling, establishing "paved roads" for developers, and reducing operational toil across the entire engineering org.
• Incident Command: Act as an Incident Commander during complex, high-severity outages. Drive blameless post-mortems and engineer long-term, systemic, and architectural fixes to ensure classes of failures never repeat.
• Mentorship & Multiplier: Act as a technical mentor to senior and mid-level engineers. Raise the baseline of engineering excellence across the company by coaching, documenting best practices, and leading by example.

• Experience: 8+ years of experience in Site Reliability, Platform, or DevOps engineering, with a proven track record of operating at a Staff, Principal, or Lead level to drive organization-wide infrastructure initiatives.
• Expert Software Engineering: You possess deep software engineering skills (beyond scripting) and can architect complex, production-quality systems. You design clean interfaces, build maintainable tooling, and can dictate the technical direction of our internal toolchain. Language agnostic, but highly proficient in at least one modern language (e.g., Go, Python).
• Advanced Kubernetes & Configuration Mastery: Deep, architectural understanding of Kubernetes in multi-tenant and multi-cluster production environments. You possess expert-level knowledge of Jsonnet and Grafana Tanka for managing complex, scalable Kubernetes configurations and application packaging.
• GitOps & IaC Expertise: Extensive experience architecting sophisticated CI/CD pipelines and GitOps workflows using GitHub Actions, ArgoCD, and infrastructure-as-code principles at an enterprise scale.
• Complex Deployments: Systems-level thinking with the ability to design architectures that span self-hosted, on-premises, VMware-based, and air-gapped deployment models.
• Observability Expert: Deep expertise with observability platforms (Grafana stack preferred) and a proven ability to design alerting and monitoring strategies for complex distributed systems.
• Security Mindset: Strong background in infrastructure security architecture, including container hardening, network security, vulnerability management, and delivering software to heavily regulated or customer-managed environments.
• Influential Communicator: Exceptional communication and stakeholder management skills. You have a service-oriented mindset, but you also have the ability to influence cross-functional leadership, negotiate reliability tradeoffs, and align engineering teams behind a unified technical vision.

We’re proud to offer a competitive and comprehensive package designed to support your well-being, growth, and success:
