StarCompliance - Principal Site Reliability Engineering Lead US
Requirements
• Strong hands-on experience operating distributed, cloud-hosted SaaS platforms at scale.
• Professional experience with at least one modern programming language.
• Experience working with or supporting .NET-based systems is highly beneficial.
• Strong experience with Microsoft Azure, including core platform services, networking, identity, and security.
• Deep expertise in observability tooling and practices.
• Experience improving production promotion, deployment, and release processes.
• Experience with Infrastructure as Code and automation-driven operations.
• Strong understanding of failure modes, resilience patterns, and recovery strategies.
• Ability to influence senior stakeholders through technical credibility and pragmatism.
• Based in the East Coast time zone.
• Typically, 8+ years of experience in SRE, platform, operational, or software engineering roles, with a large share of that time spent in multi-tenant environments.
• Experience supporting production systems with formal on-call or rota responsibility.
• Experience in leading and mentoring a team of SRE engineers, with an emphasis on professional and personal growth.
• Experience enabling regular, multi-service production releases at scale.
• Right to work in the country of employment.
StarCompliance Background Checks
All positions require pre-employment screening because employees may have access to highly sensitive and confidential information involving finance and compliance. Candidates must be trustworthy and have a heightened sensitivity to protecting confidential financial and professional information. To be eligible for employment with StarCompliance, candidates must undergo a rigorous background investigation with checks including, but not limited to, criminal record history, consumer credit, employment history, qualifications, and education.
Responsibilities
• Lead the SRE team in ensuring high availability and reliability of our IT infrastructure across all regions where we operate globally. This includes managing a portfolio of services that are critical to business operations.
• Collaborate with other engineering teams, product management, and executive leadership to prioritize work based on the company's strategic goals and customer needs.
• Develop and maintain comprehensive SRE processes for monitoring service performance metrics such as uptime percentages, response times, and error rates across all regions where we operate. This includes setting up alerting mechanisms to quickly identify issues that impact user experience or business operations.
• Work with the infrastructure team and other engineering teams on capacity planning for our global IT environment, based on historical usage patterns and future growth projections. The plan should include strategies such as scaling, load balancing, and disaster recovery to ensure high availability of critical services during peak demand periods or unexpected outages.
• Contribute actively to the SRE community by sharing knowledge through blogs, conference presentations, mentoring junior engineers, and participating in mailing lists and forums related to Site Reliability Engineering (SRE). This includes staying up to date with industry trends and best practices for improving service reliability and availability in a distributed, multi-region environment such as ours.
• Provide technical leadership by coaching junior engineers on SRE principles, tools, and techniques to help them grow into their roles within the team, or elsewhere at StarCompliance if they choose not to stay with the team long term, as part of our succession planning efforts. This includes identifying high-potential candidates for leadership positions based on demonstrated skills and experience in SRE practices such as incident management and automation/orchestration using tools like Terraform or Kubernetes.
• Collaborate with other engineering teams to identify opportunities to apply our expertise in Site Reliability Engineering (SRE) best practices beyond infrastructure operations, such as application development and testing environments for new products being developed internally or externally by partners.
• Participate actively in various committees within StarCompliance related to SRE topics, including but not limited to: the Site Reliability Engineering (SRE) Council; and the Infrastructure Operations Committee, which includes representatives from all regions where we operate globally as well as other stakeholders, such as business units and departments affected by infrastructure operations decisions made at StarCompliance headquarters in New York City, USA.
• Provide technical leadership on various committees within StarCompliance related to SRE topics including but not limited to: Site Reliability