As a Senior Site Reliability Engineer, you will play a pivotal role in shaping the infrastructure and reliability practices at CentML. You will be responsible for leading complex projects, mentoring other SREs, and collaborating with cross-functional teams to ensure our systems meet the highest standards of reliability, performance, and security. This is a senior-level position, ideal for individuals with deep technical expertise and leadership experience in SRE.
What you’ll do:
Leadership & Strategy:
- Lead the design, implementation, and operation of highly reliable, scalable, and secure ML infrastructure.
- Develop and drive SRE best practices across the organization, setting the standards for operational excellence.
Technical Excellence:
- Architect and build large-scale, distributed systems that support complex ML workloads, ensuring high availability and fault tolerance.
- Lead efforts in automation, configuration management, and infrastructure-as-code, minimizing manual operations and ensuring consistency.
- Optimize the performance and scalability of our systems, identifying and addressing bottlenecks before they impact users.
Incident Management & Response:
- Lead incident response efforts, including real-time troubleshooting, root cause analysis, and postmortem reviews.
- Develop and maintain comprehensive monitoring, alerting, and logging systems that provide deep visibility into system health and performance.
Continuous Improvement & Innovation:
- Drive continuous improvement in system reliability, performance, and scalability through the adoption of new technologies, tools, and methodologies.
- Stay current with industry trends and innovations in SRE and ML infrastructure, bringing new ideas and approaches to the team.
What you’ll need to be successful:
- 5+ years of experience in Site Reliability Engineering, DevOps, or related roles, with significant experience in leading and managing large-scale infrastructure projects.
- Proven track record of building and operating highly reliable, scalable, and secure systems in a production environment.
- Deep expertise in cloud platforms (e.g., AWS, GCP, Azure), containerization (e.g., Docker, Kubernetes), and infrastructure-as-code (e.g., Terraform).
- Advanced proficiency in scripting and automation using languages such as Python, Bash, or similar.
- Strong understanding of distributed systems, networking, and storage solutions, with the ability to architect complex systems from the ground up.
- Demonstrated experience in leading technical teams and projects, with the ability to mentor and develop other engineers.
- Excellent problem-solving skills, with a proactive approach to identifying and resolving issues before they impact the business.
- Strong communication and collaboration skills, with the ability to work effectively across different teams and stakeholders.
- Ability to operate effectively in a fast-paced, dynamic startup environment, with a focus on delivering results.