Nash - Staff Infrastructure and Performance Engineer

San Francisco, California, USA (posted 5mo ago)

In Office, Staff, NA, Cloud Computing, Logistics, Staff Engineer, Support Engineer, AWS, Team Leadership, Height, Data Analysis

Requirements

• 6+ years of experience building and operating high-scale, production infrastructure for business-critical systems.
• Deep expertise in AWS, including networking, compute, storage, and managed services.
• Hands-on experience running production workloads on ECS/Fargate at scale.
• Strong background in Postgres, including performance tuning, replication, high availability, and operational excellence.
• Proven experience designing and operating multi-region architectures with strict uptime and reliability requirements.
• Strong understanding of CI/CD for enterprise deployments, including rollout strategies, environment isolation, and rollback safety.
• Experience building low-latency systems where milliseconds matter.
• Excellent debugging and systems-level problem-solving skills.
• Ability to operate autonomously and lead technical initiatives in a fast-paced startup environment.

Responsibilities

• Own infrastructure performance and reliability across Nash’s production systems, with a focus on low latency, high throughput, and predictable behavior under load.
• Design, build, and optimize AWS-based infrastructure, leveraging managed services with a strong emphasis on ECS/Fargate.
• Lead Postgres performance engineering, including query optimization, indexing strategies, connection management, replication, cluster design, and failover.
• Architect and operate multi-region, highly available systems with strong resiliency, disaster recovery, and failover guarantees.
• Design and evolve enterprise-grade CI/CD pipelines that support safe, repeatable, and fast deployments across environments and regions.
• Drive observability standards (metrics, logs, tracing, SLOs) and use data to proactively identify and eliminate performance bottlenecks.
• Partner with application engineers to influence system design decisions that impact scalability, latency, and reliability.
• Lead incident response and postmortems, focusing on root cause analysis, systemic fixes, and long-term resilience.
• Set infrastructure and performance best practices and mentor engineers across the organization.