vynca - Site Reliability Engineer

Remote - ET (Eastern) · 2d ago

Remote · NA · Cloud Computing · Software · Site Reliability Engineer · Security Architect · Go · Python · AWS · Terraform · Kubernetes · Helm · Linux · Prometheus · PostgreSQL · Grafana · Datadog · MySQL · AWS Secrets Manager · Redshift · Snowflake

Requirements

• Experience: Three to five (3–5) years of experience in Site Reliability Engineering, DevOps Engineering, Platform Engineering, Cloud Infrastructure Engineering, or similar infrastructure-focused roles, preferably within healthcare, SaaS, or high-growth technology environments.
• Education: Bachelor's degree in Computer Science, Information Systems, Software Engineering, or a related technical field; equivalent professional experience will also be considered.
• Strong hands-on experience operating production workloads within AWS environments.
• Proven experience managing infrastructure as code using Terraform, including module development, state management, and deployment automation.
• Experience operating and supporting production Kubernetes environments.
• Hands-on experience deploying and managing applications using Helm.
• Experience working with distributed systems, event-driven architectures, or event-sourcing platforms, including concepts such as partitioning, event ordering, replay, and fault tolerance.
• Experience establishing and managing observability practices including monitoring, logging, tracing, alerting, and incident response.
• Strong understanding of Linux systems administration, networking, cloud architecture, and distributed systems fundamentals.
• Experience designing, implementing, and maintaining CI/CD pipelines and deployment automation.
• Strong problem-solving skills with the ability to troubleshoot complex infrastructure and application issues.
• Excellent written and verbal communication skills with the ability to collaborate effectively across technical and non-technical teams.
• High level of ownership, accountability, and initiative with a proactive approach to reliability and operational excellence.
• Ability and willingness to participate in an on-call rotation supporting production systems.
• Strong programming or scripting experience with Python, Go, or similar languages.
• Experience with observability platforms such as Prometheus, Grafana, Datadog, CloudWatch, SigNoz, or OpenTelemetry.
• Experience with GitOps tools such as ArgoCD or Flux.
• Experience managing databases such as PostgreSQL, MySQL, Redshift, or ClickHouse.
• Experience implementing secrets management solutions such as AWS Secrets Manager or HashiCorp Vault.
• Experience supporting healthcare technology platforms or other highly regulated environments.
• Familiarity with data infrastructure technologies including Snowflake, Redshift, and ETL/ELT pipelines.
• Experience with database performance tuning and optimization.
• At this time we are only considering applicants in the following states: Arizona, California, Colorado, Florida, Georgia, Illinois, Nevada, North Carolina, Oregon, Texas, and Washington.

Responsibilities

• Design, provision, and manage AWS infrastructure using Terraform as the source of truth.
• Operate, maintain, and scale production workloads running on Kubernetes.
• Package, deploy, and manage applications using Helm and infrastructure automation tools.
• Build, operate, and improve distributed and event-driven systems, including event sourcing, partitioning, event ordering, replay, and failure recovery mechanisms.
• Define, monitor, and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to balance reliability and engineering velocity.
• Develop automation for deployment, scaling, monitoring, incident response, and operational workflows to reduce manual effort and improve system resilience.
• Own platform observability by implementing and maintaining metrics, logging, tracing, monitoring, and alerting solutions.
• Lead incident response efforts, facilitate blameless postmortems, and drive long-term corrective actions that improve system reliability.
• Partner with Product and Engineering teams on capacity planning, performance optimization, and resilient system design.
• Implement and maintain security best practices to support HIPAA, SOC 2, and other compliance requirements.
• Participate in an on-call rotation and provide operational support for production systems.
• Vaccination Requirement: Employees in patient, client, or customer-facing roles must be vaccinated against influenza. Requests for religious or medical accommodations will be considered but may not always be approved.
• Employment Eligibility: Compliance with federal law requires identity and work eligibility verification using E-Verify upon hire.
