Crusoe - Staff Network Engineer, Operations
Requirements
What You'll Be Working On:
• Production Reliability: Help own uptime across Crusoe's global edge, backbone, data center, and GPU cluster network, directly supporting AI workloads at scale.
• Incident Response: Lead and contribute to end-to-end response for high-severity network events, including mitigation, stakeholder communication, and postmortem documentation.
• Root Cause Analysis: Drive RCAs for production incidents, identify systemic issues, and author remediation plans tracked through to closure.
• Observability Improvements: Contribute to and improve Crusoe's network monitoring stack using streaming telemetry, SNMP, NetFlow, and tools such as Kentik, Grafana, Prometheus, and ThousandEyes.
• Operational Standards: Author and maintain runbooks, escalation playbooks, and SOPs used across the operations team.
• Operational Automation: Write Python-based tooling to reduce toil, automate common remediation workflows, and accelerate mean time to resolution.
• SLI/SLO Contribution: Partner with Architecture and SRE teams to define and track network reliability metrics and service level objectives backed by real-time dashboards.
• Mentorship: Provide technical guidance to Senior engineers and contribute to a culture of operational excellence and continuous learning.

• 8+ years of production network engineering experience with a focus on operations, incident response, and reliability in large-scale or internet-scale environments.
• Hands-on experience with observability and monitoring tools including streaming telemetry, SNMP, NetFlow/sFlow, Grafana, Prometheus, and ThousandEyes.
• Experience operating RDMA/RoCE lossless fabrics for GPU or HPC workloads, including familiarity with PFC, ECN, and DCQCN tuning.
• Expert hands-on knowledge of BGP, EVPN-VXLAN, IS-IS, OSPF, MPLS, QoS, and TCP/IP in production data center environments.
• Proficiency with Arista (EOS) and Juniper (Junos) platforms in leaf-spine CLOS architectures across multi-vendor environments.
• Python proficiency for writing auto-remediation scripts, diagnostic tooling, and operational automation.
• Comfort operating large device fleets across multi-region environments with on-call responsibility, including experience as an escalation point during critical events.
• Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent practical experience.
• Experience with NVIDIA/Mellanox networking platforms in GPU cluster environments.
• Familiarity with Kentik or Arbor for traffic analysis and DDoS visibility.
• Experience defining or contributing to SLIs and SLOs in partnership with SRE or product teams.
• Exposure to operating 10K+ device fleets across hyperscale or cloud environments.
• Background contributing to post-incident learning programs or operational excellence initiatives org-wide.
Benefits
• Competitive compensation and equity packages
• Restricted Stock Units
• Paid time off, paid holidays & leave of absence programs
• Comprehensive health, dental & vision insurance
• Employer contributions to HSA
• Paid parental leave
• Paid life insurance, short-term and long-term disability
• Professional development & tuition reimbursement
• Mental health & wellness support
• Cell phone stipend
• 401(k) retirement plan with company match up to 4% of salary
• Volunteer time off
• Global travel insurance & emergency assistance
• Daily meals allowance
• Additional perks & programs specific to location
• Compensation will be paid in the range of $195,000 to $235,000, plus bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant's knowledge, education, and abilities, as well as internal equity and alignment with market data.