Job Description
Fluidstack is the AI Cloud Platform. We build GPU supercomputers for top AI labs, governments, and enterprises. Our customers include Mistral, Poolside, Black Forest Labs, Meta, and more.
Our team is highly motivated and focused on providing a world-class supercomputing experience. We put our customers first in everything we do, working hard not just to win the sale but to win repeat business and customer referrals.
We hold ourselves and each other to high standards. We expect you to care deeply about the work you do, the products you build, and the experience our customers have in every interaction with us.
You must work hard, take ownership from inception to delivery, and approach every problem with an open mind and a positive attitude. We value effectiveness, competence, and a growth mindset.
We’re looking for a Data Center Operations Manager to manage the ongoing operational performance of GPU clusters that Fluidstack owns and operates. This is a “player-coach” role with both oversight and hands-on responsibilities, focused on ensuring the availability and performance of our data center infrastructure.
You’ll be the owner of everything that lives within our data centers, managing our Data Center Operations team as well as third parties, from installation through ongoing maintenance and upgrade coordination. Your primary responsibility is to ensure the continuous and efficient operation of the data center by managing on-site technicians and third-party vendors, creating and maintaining operational procedures, diagnosing issues, and providing hands-on technical support when higher-level intervention is required. This role is ideal for individuals who excel in environments that demand both operational discipline and the ability to navigate complex technical challenges.
Ensure high availability of our GPU infrastructure.
Manage the on-site team of data center technicians and third-party vendors in daily operations, including server maintenance, equipment installation, and troubleshooting.
Respond to and resolve technical issues and emergencies in a timely manner, ensuring minimal downtime and disruption.
Act as the interface between FDEs and the on-site team to ensure fast, effective technical remediation and incident resolution.
Undertake regular data center maintenance, performing inspections and audits of equipment to maintain optimal performance and reliability.
Proactively manage infrastructure by defining and continuously improving standard operating procedures (SOPs) for routine data center maintenance.
Manage third-party hardware vendors, including initiating and coordinating the RMA process.
Be available to travel to various locations in the US and Europe on short notice, potentially for extended periods, when on-site support requires elevated, hands-on expertise.
5+ years of experience in data center operations.
Proven ability to lead remote teams and manage vendors.
In-depth knowledge of data center infrastructure, including servers, networking equipment, and cooling systems.
Capable of training on-site data center technicians to perform routine physical maintenance.
Capable of remotely diagnosing hardware issues using common Linux and OOB utilities (dmesg, journalctl, dmidecode, lspci, mcelog, dcgmi, nvidia-smi, Redfish, IPMI, etc.).
Familiar with common inventory management systems (e.g. NetBox).
Strong communication and organizational skills.
Willing to travel internationally on short notice and to be based on-site for extended periods as required.
Strong troubleshooting skills and the ability to quickly diagnose and resolve technical issues.
Experience with data center management tools and software.
Strong time management, communication, and interpersonal skills, with the ability to manage a team.
Competitive total compensation package (cash + equity).
Retirement or pension plan, in line with local norms.
Health, dental, and vision insurance.
Generous PTO policy, in line with local norms.
Fluidstack is remote-first but has offices in key hubs. For all other locations, we provide access to WeWork.
Fluidstack is a GPU cloud for AI companies. Fluidstack specialises in providing compute at scale to companies like Meta, Character AI, Midjourney, and Poolside. Alongside private clusters for longer-term workloads requiring 2,000+ GPUs, such as large LLM training, Fluidstack gives users access to over 50,000 GPUs, including NVIDIA A100s, H100s, and more, from hundreds of data centers around the world on a single cloud platform.