Crusoe - Staff Network Engineer, Operations

San Francisco, California, USA | $195k - $235k + Equity | 1mo ago

In Office, Staff, NA, Diagnostics, Staff Engineer, Network Engineer, Python, Prometheus, Documentation, Grafana

Requirements

What You'll Be Working On:

• Production Reliability: Help own uptime across Crusoe's global edge, backbone, data center, and GPU cluster network, directly supporting AI workloads at scale.
• Incident Response: Lead and contribute to end-to-end response for high-severity network events, including mitigation, stakeholder communication, and postmortem documentation.
• Root Cause Analysis: Drive RCAs for production incidents, identify systemic issues, and author remediation plans tracked through to closure.
• Observability Improvements: Contribute to and improve Crusoe's network monitoring stack using streaming telemetry, SNMP, NetFlow, and tools such as Kentik, Grafana, Prometheus, and ThousandEyes.
• Operational Standards: Author and maintain runbooks, escalation playbooks, and SOPs used across the operations team.
• Operational Automation: Write Python-based tooling to reduce toil, automate common remediation workflows, and accelerate mean time to resolution.
• SLI/SLO Contribution: Partner with Architecture and SRE teams to define and track network reliability metrics and service level objectives backed by real-time dashboards.
• Mentorship: Provide technical guidance to Senior engineers and contribute to a culture of operational excellence and continuous learning.

• 8+ years of production network engineering experience with a focus on operations, incident response, and reliability in large-scale or internet-scale environments.
• Hands-on experience with observability and monitoring tools including streaming telemetry, SNMP, NetFlow/sFlow, Grafana, Prometheus, and ThousandEyes.
• Experience operating RDMA/RoCE lossless fabrics for GPU or HPC workloads, including familiarity with PFC, ECN, and DCQCN tuning.
• Expert hands-on knowledge of BGP, EVPN-VXLAN, IS-IS, OSPF, MPLS, QoS, and TCP/IP in production data center environments.
• Proficiency with Arista (EOS) and Juniper (Junos) platforms in leaf-spine CLOS architectures across multi-vendor environments.
• Python proficiency for writing auto-remediation scripts, diagnostic tooling, and operational automation.
• Comfort operating large device fleets across multi-region environments with on-call responsibility, including experience as an escalation point during critical events.
• Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent practical experience.
• Experience with NVIDIA/Mellanox networking platforms in GPU cluster environments.
• Familiarity with Kentik or Arbor for traffic analysis and DDoS visibility.
• Experience defining or contributing to SLIs and SLOs in partnership with SRE or product teams.
• Exposure to operating 10K+ device fleets across hyperscale or cloud environments.
• Background contributing to post-incident learning programs or operational excellence initiatives org-wide.

Benefits

• Competitive compensation and equity packages
• Restricted Stock Units
• Paid time off, paid holidays & leave of absence programs
• Comprehensive health, dental & vision insurance
• Employer contributions to HSA account
• Paid parental leave
• Paid life insurance, short-term and long-term disability
• Professional development & tuition reimbursement
• Mental health & wellness support
• Cell phone stipend
• 401(k) Retirement plan with company match up to 4% of salary
• Volunteer time off
• Global travel insurance & emergency assistance
• Daily meals allowance
• Additional perks & programs specific to location

Compensation will be paid in the range of $195,000 to $235,000 + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant's knowledge, education, and abilities, as well as internal equity and alignment with market data.