OXIO Corporation - Site Reliability Engineer

Remote - USA (1mo ago)

Remote, NA, Cloud Computing, Telecommunications, Site Reliability Engineer, Go, Ruby, Bash, Perl, Python

Requirements

• Strong understanding of deployment strategies (canary releases, blue-green deployments, etc.).
• Familiarity with high availability and failover mechanisms.
• Familiarity with IAM (Identity and Access Management) and zero trust principles.
• Experience working with distributed systems (e.g., Kafka, Cassandra, Elasticsearch).
• Experience building custom monitoring tools or writing complex automation scripts.
• Functional knowledge of database management (SQL and NoSQL).
• Familiarity with distributed tracing (Jaeger, OpenTelemetry) and advanced log aggregation strategies (ELK stack, Splunk).
• Familiarity with performance profiling tools and with optimizing application performance under heavy load.
• Familiarity with load testing and identifying bottlenecks.
• Familiarity with configuration management using SaltStack for maintaining server configurations.

Responsibilities

• Design and implement a cloud platform to support OXIO backend services.
• Automate technical operations: deployments, scaling, recovery, etc.
• Monitor and maintain mission-critical production infrastructure to ensure maximum uptime.
• Participate in an on-call rotation and in a culture of continuous improvement through blameless postmortems.
• Enable the Engineering, Telecom, and Data Engineering teams by providing them with the tools to operate the services they build.
• Understanding of Linux/Unix systems (most systems are Linux-based).
• Familiarity with Linux/Unix system internals such as process management, filesystems, memory management, and networking.
• Proficiency in at least one programming language (Python, Go, or Ruby) and strong scripting skills (Bash, Perl).
• Experience with infrastructure provisioning tools such as Terraform, CloudFormation, or Ansible.
• Familiarity with containerization (Docker) and orchestration tools (Kubernetes).
• Familiarity with monitoring tools such as Prometheus, Grafana, or Datadog.
• Knowledge of setting up alerts, analyzing logs, and creating dashboards for observability.
• Familiarity with incident management practices (e.g., runbooks, postmortems).
• Experience participating in an on-call rotation and handling incidents.
• Experience setting up and maintaining Continuous Integration/Continuous Delivery pipelines (Jenkins, GitLab CI, CircleCI, etc.).
• Hands-on experience with cloud providers (AWS, Google Cloud, Azure).
• Knowledge of virtualization technologies (VMware, KVM) and cloud-native architecture.
• Understanding of TCP/IP, DNS, HTTP/HTTPS, load balancing, and firewalls.