Salary
Undisclosed
September 17, 2025
Job type
Full-time
Experience level
Mid level

Job Description

Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is currently looking for a strong candidate to join us in developing and maintaining our large GPU training clusters.

Role & Responsibilities

  • Design, deploy, and maintain large-scale ML training clusters running SLURM for distributed workload orchestration
  • Implement comprehensive node health monitoring systems with automated failure detection and recovery workflows
  • Partner with cloud and colocation providers to ensure cluster availability and performance
  • Establish and enforce security best practices across the ML infrastructure stack (network, storage, compute)
  • Build and maintain developer-facing tools and APIs that streamline ML workflows and improve researcher productivity
  • Collaborate directly with ML research teams to translate computational requirements into infrastructure capabilities and capacity planning

Required Experience

  • Production experience managing SLURM clusters at scale, including job scheduling policies, resource allocation, and federation
  • Hands-on experience with Docker, Enroot/Pyxis, or similar container runtimes in HPC environments
  • Proven track record managing GPU clusters, including driver management and DCGM monitoring

Preferred Qualifications

  • Understanding of distributed training patterns, checkpointing strategies, and data pipeline optimization
  • Experience with Kubernetes for containerized workloads, particularly for inference or mixed compute environments
  • Experience with high-performance interconnects (InfiniBand, RoCE) and NCCL optimization for multi-node training
  • Track record of managing 1000+ GPU training runs, with deep understanding of failure modes and recovery patterns
  • Familiarity with high-performance storage solutions (VAST, blob storage) and their performance characteristics for ML workloads
  • Experience running hybrid training/inference infrastructure with appropriate resource isolation
  • Strong scripting skills (Python, Bash) and infrastructure-as-code experience
Black Forest Labs is hiring a Member of Technical Staff - Training Cluster Engineer. Apply through Homebase and make the next move in your career!
Company size
11-50 employees
Industry
Computer Software
