AI Serving: Triton, vLLM, custom inference services
Responsibilities
1. AI / MLOps (Models in Production)
GPU Infrastructure: Deploy and maintain high-performance GPU clusters.
AI Lifecycle: Manage the full lifecycle of AI services, from inference deployment (Triton, vLLM, custom services) through autoscaling to seamless rollout/rollback strategies; illustrative sketches for the items in this list follow below.
Data Management: Handle model storage, artifact versioning, caching, and high-speed data access via S3-compatible object storage.
Observability: Monitor performance metrics including latency, throughput, error budgets, resource limits, and cost/performance ratios.
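To make the inference-deployment item concrete, here is a minimal sketch using vLLM's offline Python API. The model name, prompts, and sampling parameters are placeholders, not part of this posting; a production deployment would typically sit behind an OpenAI-compatible server or Triton instead.

```python
# Minimal offline-inference sketch with vLLM.
# Model name and prompts are placeholders.
from vllm import LLM, SamplingParams

# Load a (placeholder) model onto the available GPU(s).
llm = LLM(model="facebook/opt-125m")

# Sampling parameters control generation behaviour.
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = ["What is an error budget?"]
for output in llm.generate(prompts, params):
    # Each result carries the original prompt and its completions.
    print(output.prompt, output.outputs[0].text)
```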
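For the data-management item, a hypothetical sketch of pulling a versioned model artifact from S3-compatible storage with boto3. The endpoint URL, bucket name, object key, and local cache path are invented placeholders.

```python
# Fetch a versioned model artifact from S3-compatible storage.
# All names below are placeholders for illustration only.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.example.internal",  # placeholder S3-compatible endpoint
)

# Download a specific artifact version into a local cache directory.
s3.download_file(
    Bucket="model-artifacts",                   # placeholder bucket
    Key="llama/v1.2.0/model.safetensors",       # placeholder versioned key
    Filename="/models/cache/model.safetensors", # placeholder cache path
)
```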
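For the observability item, a sketch of exporting latency and error metrics from a custom inference service using prometheus_client. The metric names, port, and stand-in workload are illustrative assumptions.

```python
# Expose latency and error metrics for Prometheus to scrape.
import time
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; a real service would align these with
# its dashboards and alerting rules.
REQUEST_LATENCY = Histogram("inference_latency_seconds", "Inference request latency")
REQUEST_ERRORS = Counter("inference_errors_total", "Failed inference requests")

def instrumented(run_inference, payload):
    """Wrap an inference call so latency and errors are recorded."""
    start = time.perf_counter()
    try:
        return run_inference(payload)
    except Exception:
        REQUEST_ERRORS.inc()
        raise
    finally:
        REQUEST_LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics on port 8000
    while True:
        # Stand-in workload so the exporter has something to report.
        instrumented(lambda p: p.upper(), "ping")
        time.sleep(1)
```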
2. PSP / Fintech Reliability
High Availability: Ensure fault tolerance for payment services through SLA/SLO management, redundancy, disaster-recovery planning, and regular recovery testing; a back-of-the-envelope error-budget sketch follows.
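To make the SLO/error-budget part concrete, a small calculation; the 99.95% availability target and 30-day window are assumed figures, not taken from this document.

```python
# Error-budget arithmetic under assumed figures:
# a 99.95% availability SLO over a 30-day rolling window.
SLO_TARGET = 0.9995
WINDOW_MINUTES = 30 * 24 * 60  # minutes in a 30-day window

error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES
print(f"Allowed downtime per 30-day window: {error_budget_minutes:.1f} minutes")
# At 99.95%, that is about 21.6 minutes of downtime per 30 days.
```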