openDoctor - DevOps Engineer

Remote - ET (Eastern) | $16k - $21k | 1w ago

Requirements

• 5+ years of experience in DevOps, CloudOps, Site Reliability Engineering, or a related infrastructure role at a mid to senior level
• Strong hands-on experience with Terraform for provisioning and managing cloud infrastructure
• Production experience with AWS, including core services such as VPC, IAM, EC2/EKS/ECS, S3, RDS, Lambda, EFS, CloudWatch, Route 53, KMS, WAF, and related networking/security services
• Demonstrated ability to design complex, production-grade cloud architectures that balance scalability, security, availability, and cost
• Practical experience with disaster recovery practices, including DR strategy, playbook development, backup strategies, multi-region or multi-zone design, failover planning, and DR testing
• Strong working knowledge of CI/CD concepts and tooling (e.g., GitHub Actions, GitLab CI, Jenkins, Argo CD, or similar), including pipeline design, release automation, and repository governance
• Hands-on experience with Docker and containerized deployments; familiarity with ECS and/or Kubernetes/EKS in production environments
• Proficiency in scripting and development using Python, Go, or Node.js for automation, Lambda functions, tooling, and pipeline integrations
• Experience with IaC beyond a single toolset: designing reusable modules, managing environment replication, and supporting DR-ready infrastructure
• Solid grasp of networking concepts (DNS, load balancing, firewalls, VPNs, hub/spoke models, private connectivity, TLS)
• Familiarity with security best practices for cloud infrastructure, including least-privilege IAM, secrets management, WAF management, and compliance considerations (e.g., SOC 2)
• Ability to communicate clearly with engineering, security, and leadership stakeholders and to work independently on roadmap initiatives with minimal day-to-day direction
• Strong problem-solving skills, sound judgment, and a track record of delivering infrastructure work across multiple concurrent priorities
• Cloud certifications in AWS (e.g., AWS Solutions Architect, AWS DevOps Engineer, or related associate/professional-level credentials)
• Production experience with GCP, including core services such as VPC, IAM, GKE, Cloud Storage, Cloud SQL, Cloud Functions, Cloud Monitoring, and Cloud DNS
• GCP certifications (e.g., Google Professional Cloud Architect, Google Professional DevOps Engineer, or related credentials)
• Experience designing, deploying, or operating AI/ML workloads in the cloud, including model hosting, inference pipelines, vector/search infrastructure, model serving infrastructure, GPU/compute provisioning, data pipelines, and integration with managed AI services on AWS and/or GCP (e.g., SageMaker, Bedrock, Vertex AI)
• Experience with additional IaC or configuration tools (e.g., Ansible, Pulumi, CloudFormation, Deployment Manager)
• Experience implementing Infrastructure as Code governance, including policy-as-code (OPA, Sentinel, or similar) and drift detection
• Familiarity with Mirth Connect or healthcare integration platforms
• Experience with observability stacks (Prometheus, Grafana, Datadog, OpenTelemetry, ELK, or similar)
• Experience supporting regulated environments (SOC 2, HIPAA, PCI, or similar)
• Prior involvement in chaos engineering, game days, or formal resilience testing programs
• Experience leading VM-to-container migration programs or multi-account AWS organization redesigns

What Success Looks Like

• Infrastructure is provisioned consistently through Terraform with clear module boundaries and reviewable change workflows
• CI/CD pipelines and release practices are standardized, secure, and adopted by development teams
• Critical systems have documented DR plans with tested recovery procedures and defined RTO/RPO targets
• Containerization and Lambda-based modernization efforts reduce operational overhead and improve deployment velocity
• Cloud architectures are scalable, secure, and observable, with measurable improvements in reliability and cost efficiency
• The candidate operates effectively as a self-directed contributor while keeping stakeholders informed and aligned

Work Environment

• Remote-first role with home office as the primary working location
• Required overlap of 9:00 AM – 12:00 PM ET to collaborate with US-based teams
• Participation in an on-call rotation is expected for production support
• Collaboration with distributed engineering and platform teams across Surgimate, Implatbase, and openDoctor
• Expectation of independent execution on assigned roadmap work, with regular communication on progress, risks, and dependencies
• Flexibility to work extended hours as needed to support key initiatives

Responsibilities

• Design and implement complex cloud architectures on AWS, including networking, compute, storage, identity, and observability components
• Build and maintain Infrastructure as Code (IaC) using Terraform, with modular, reusable, and version-controlled configurations
• Design, implement, and maintain CI/CD pipelines and release practices that support safe, repeatable infrastructure and application deployments
• Build and operate Docker-based workloads and support containerization initiatives, including migration from VM-based operating models
• Write automation and tooling in Python, Go, and Node.js to support infrastructure, pipelines, integrations, and operational workflows
• Leverage AI development tooling (e.g., Cursor, Claude, and related platforms) as part of day-to-day development
• Define and implement disaster recovery (DR) and business continuity strategies, including backup, failover, RTO/RPO planning, playbook creation, and recovery testing
• Execute and advance roadmap initiatives spanning account reorganization, hub/spoke networking, DR strategy, Git migrations, EKS runners, database migrations, security hardening, and cost optimization
• Establish and improve operational practices for monitoring, alerting, incident response, metrics, and post-incident review
• Automate provisioning, configuration management, and operational tasks to reduce manual toil and improve consistency
• Provide ongoing DevOps support for development teams across multiple business units
• Document architecture decisions, runbooks, and operational procedures for knowledge sharing and audit readiness
• Participate in on-call rotation and lead or support production incident resolution as needed
• Deploy, administer, manage, and optimize LAMP (Linux, Apache, MariaDB/MySQL, PHP) environments supporting business-critical and high-availability applications
• Troubleshoot complex issues across the full application stack, including Linux OS, Apache, PHP, databases, networking, storage, and application integrations
• Install, configure, secure, and troubleshoot Apache web servers, including virtual hosts, SSL/TLS certificates, reverse proxy configurations, URL rewrites, and performance tuning