Kubernetes AI Jobs

Discover the latest remote and onsite Kubernetes AI roles across top active AI companies. Updated hourly.

Check out 301 new Kubernetes AI opportunities posted on The Homebase

Software Engineering Manager

New · Top rated
Mirage · Full-time

Oversee the design and operation of the core platform including third-party providers, storage, billing, observability, security, and API. Provide technical leadership for various product and platform features. Improve developer experience to enable the whole team to ship faster. Guide efforts that bridge AI research to production across all modalities such as video, audio, image, and text. Understand the capabilities and limitations of state-of-the-art AI models and leverage them in products. Partner with product, design, and research teams to ensure development aligns with user needs and business objectives.

$250,000 – $350,000 per year (USD)

New York, United States · Onsite
Skills: Python, JavaScript, Java, Docker, Kubernetes

Founding Engineering Lead

New · Top rated
AIFund · Full-time

Own the technical foundation of Meeno end-to-end including web, mobile, backend, data, and experimentation. Co-design product vision in close partnership with Meeno's team. Build core AI product primitives such as voice capture/playback, low-latency interactions, scene framework (content, branching, scoring hooks), feedback loops and user progression, and personalization. Architect systems for speed and iteration with weekly experiments rather than quarterly releases. Set the engineering standards for quality, reliability, security/privacy, and shipping culture. Hire and mentor engineers as the team scales, focusing on quality over quantity and leveraging AI and talent to maintain lean operations.

$180,000 – $220,000 per year (USD)

New York, United States · Onsite
Skills: Python, JavaScript, Prompt Engineering, OpenAI API, MLOps

Founding Platform Engineer

New · Top rated
Netic · Full-time

Design and own the semantic layer that powers the system-of-record flywheel, enabling compounding AI products across teams. Build primitives, abstractions, and APIs for product teams to use as building blocks, ensuring ease of use for shipping AI-driven features. Partner closely with internal product and engineering teams to understand needs, eliminate friction, and design intuitive, well-documented systems that are hard to misuse. Architect systems that span data warehouses, OLTP databases, streaming systems, and vector stores, making tradeoffs based on latency, throughput, consistency, and access patterns. Work with leadership to define the long-term platform architecture, including build-vs-buy decisions, evolving the semantic layer, and scaling the system as product surface area grows.

Salary undisclosed

San Francisco, United States · Onsite
Skills: Python, JavaScript, Java, Docker, Kubernetes

Site Reliability Engineer, Managed AI

New · Top rated
Crusoe · Full-time

Design and operate reliable managed AI services focused on serving and scaling large language model workloads. Build automation and reliability tooling to support distributed AI pipelines and inference services. Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability. Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters. Automate observability by building telemetry and performance-tuning strategies for latency-sensitive AI services. Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling. Contribute to the architecture of next-generation distributed systems designed specifically for AI-first environments.

$204,000 – $247,000 per year (USD)

San Francisco, United States · Onsite
Skills: Python, Go, C++, MLOps, Kubernetes
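The SLI/SLO work described in this role can be illustrated with a minimal sketch. This is not from the posting; the request counts and the 99.9% availability target are assumptions chosen for the example. It computes an availability SLI over a window and the share of the error budget that remains against the SLO.

```python
# Illustrative sketch: availability SLI and error-budget math.
# All numbers and the 99.9% SLO target are example assumptions.

def availability_sli(successful: int, total: int) -> float:
    """Fraction of requests served successfully over the window."""
    return successful / total if total else 1.0

def error_budget_remaining(sli: float, slo: float) -> float:
    """Share of the error budget left: 1.0 = untouched, below 0 = SLO breached."""
    allowed_failure = 1.0 - slo
    if allowed_failure == 0.0:
        return 0.0  # a 100% SLO leaves no budget at all
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure

if __name__ == "__main__":
    # Example window: 999,500 of 1,000,000 requests succeeded (99.95%).
    sli = availability_sli(successful=999_500, total=1_000_000)
    print(round(sli, 4))                                      # 0.9995
    # Against a 99.9% SLO, half the error budget is still left.
    print(round(error_budget_remaining(sli, slo=0.999), 2))   # 0.5
```

In practice the success and total counts would come from telemetry (e.g. request metrics), and the remaining budget would gate alerting and release decisions.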

2026 New Grad | Software Engineer, Full-Stack

New · Top rated
Loop · Full-time

Ship critical infrastructure managing real-world logistics and financial data for large enterprises. Own the why by building deep context through customer calls and understanding Loop's value to customers, pushing back on requirements if better solutions exist. Work full-stack across system boundaries including frontend UX, LLM agents, database schema, and event infrastructures. Leverage AI tools to handle routine tasks enabling focus on quality, architecture, and product taste. Constantly optimize development loops, refactor legacy patterns, automate workflows, and fix broken processes to raise velocity.

$150,000 per year (USD)

San Francisco, Chicago, or NYC, United States · Hybrid
Skills: Python, JavaScript, TypeScript, PyTorch, TensorFlow

New Grad | Software Engineer, AI

New · Top rated
Loop · Full-time

Ship critical infrastructure by managing real-world logistics and financial data for the largest enterprises in the world. Own the why by building deep context through customer calls and understanding Loop's value to customers, pushing back on requirements when there is a better, faster way to solve problems. Work with full-stack proficiency across system boundaries, from frontend UX to LLM agents, database schemas, and event infrastructure. Leverage AI tools to handle the boilerplate work so focus can stay on quality, architecture, and product taste. Constantly optimize development loops, refactor legacy patterns, automate workflows, and fix broken processes to raise the velocity bar.

$150,000 per year (USD)

San Francisco, Chicago, or NYC, United States · Hybrid
Skills: Python, JavaScript, TypeScript, Hugging Face, Transformers

Software Engineer, Platform Systems

New · Top rated
OpenAI · Full-time

Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs. Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behavior. Improve observability, reliability, and performance across OpenAI's training platform. Debug and resolve issues in complex, high-throughput distributed systems. Collaborate with systems, infrastructure, and research teams to evolve platform capabilities. Extend and adapt failure detection systems or tracing systems to support new training paradigms and workloads.

Salary undisclosed

London, United Kingdom · Onsite
Skills: Python, C++, Docker, Kubernetes, CI/CD

Software Engineer, Platform Systems

New · Top rated
OpenAI · Full-time

Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs. Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behavior. Improve observability, reliability, and performance across OpenAI's training platform. Debug and resolve issues in complex, high-throughput distributed systems. Collaborate with systems, infrastructure, and research teams to evolve platform capabilities. Extend and adapt failure detection systems or tracing systems to support new training paradigms and workloads.

$310,000 – $460,000 per year (USD)

San Francisco, United States · Onsite
Skills: Python, C++, Docker, Kubernetes, CI/CD

Software Engineer, Full Stack

New · Top rated
Replicant · Full-time

As a Full Stack Software Engineer at Replicant, you will design and deliver technology that powers natural, human-like conversations at scale, helping companies reduce wait times, improve customer satisfaction, and free representatives to focus on complex problems. You will build rich user experiences and backend services that let customers design, launch, and monitor AI-powered conversations. Build new features for Replicant's core AI voice and chat products, which handle millions of daily conversations. Ship full-stack features end to end, quickly. Integrate automatic speech recognition, text-to-speech, and conversational AI model improvements into the products. Refactor, optimize, and debug production systems, balancing latency, cost, and user experience. Participate in regular on-call rotations monitoring live systems. Continuously improve systems based on performance metrics and customer feedback. Help shape a culture of knowledge sharing and mentorship across distributed systems and enterprise-scale AI design. Participate in team and company-wide office events, with some travel required.

$130,000 – $190,000 per year (USD)

United States · Remote
Skills: TypeScript, Python, Node.js, React, Kubernetes

Software Engineer

New · Top rated
AIFund · Full-time

Design, develop, and maintain web applications and backend services that integrate ML-powered features. Collaborate closely with Machine Learning Engineers and Product Managers to understand ML system requirements and translate them into robust software solutions. Build reliable, scalable, and low-latency services that support ML inference, data workflows, and AI-driven user experiences. Use LLMs to build scalable and reliable AI agents. Own the full software development lifecycle: design, implementation, testing, deployment, monitoring, and maintenance. Ensure high standards for code quality, testing, observability, and operational excellence. Troubleshoot production issues and participate in on-call or support rotations when needed. Mentor junior engineers and contribute to technical best practices across teams. Act as a strong cross-functional partner between product, engineering, and ML teams.

Salary undisclosed

San Francisco Bay Area, United States · Hybrid
Skills: Python, Docker, Kubernetes, AWS, GCP

Want to see more AI Engineer jobs?

View all jobs

Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.
(Yes, it’s still free—your best contributions are the price of admission.)

Frequently Asked Questions

Need help with something? Here are our most frequently asked questions.

What are Kubernetes AI jobs?

Kubernetes AI jobs involve orchestrating containerized machine learning applications at scale. Professionals in these roles manage container deployment for AI workloads, distribute computational tasks across nodes for model training, allocate GPU resources efficiently, and automate ML pipelines. They typically work with frameworks like TensorFlow and PyTorch while ensuring high availability for production AI systems through automated scaling and self-healing capabilities.

What roles commonly require Kubernetes skills?

Roles requiring Kubernetes skills include Machine Learning Engineers who deploy models to production, MLOps Engineers working with platforms like Kubeflow, Data Engineers managing processing pipelines, Platform Engineers supporting agentic AI applications, DevOps/SRE professionals handling containerized deployments, and Cloud Architects designing scalable environments. These positions typically involve maintaining infrastructure that supports the complete machine learning lifecycle.

What skills are typically required alongside Kubernetes?

Alongside Kubernetes, employers typically look for container fundamentals (especially Docker), distributed systems knowledge, CI/CD pipeline experience, and cloud platform familiarity. Programming skills are essential for deployment scripts, while experience with ML frameworks like TensorFlow or PyTorch is valuable for AI-specific implementations. Understanding storage solutions, Kubernetes operators, and automated infrastructure management rounds out the typical skill requirements.

What experience level do Kubernetes AI jobs usually require?

Kubernetes AI jobs typically require mid- to senior-level experience. Employers look for professionals who understand containerization concepts, have worked with distributed systems, and can manage complex ML workflows. Prior exposure to cloud environments where Kubernetes runs is important. Candidates should demonstrate practical experience with CI/CD pipelines and familiarity with at least one major ML framework.

What is the salary range for Kubernetes AI jobs?

Kubernetes AI jobs command competitive salaries due to the specialized intersection of container orchestration and machine learning skills. Compensation varies based on experience level, location, and specific industry. Roles requiring both strong AI expertise and Kubernetes infrastructure management typically offer premium compensation compared to general software engineering positions, reflecting the high market value of these combined skill sets.

Are Kubernetes AI jobs in demand?

Kubernetes AI jobs are in high demand as organizations increasingly adopt containerized applications for machine learning workloads. The growth is driven by enterprises scaling their AI operations, edge computing applications, and the need for platform-agnostic infrastructure. Companies seek professionals who can manage the complexity of distributed ML systems, particularly for high-availability production environments and automated ML pipelines.

What is the difference between Kubernetes and Docker in AI roles?

Docker creates containerized applications, while Kubernetes orchestrates those containers at scale. In AI roles, Docker is used to package ML applications with their dependencies, while Kubernetes manages deployment across clusters, automates scaling during training, and handles resource allocation for GPUs. Docker provides consistency between environments, while Kubernetes adds critical production capabilities like load balancing, self-healing, and distributed computing for AI workloads.
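The FAQ content above mentions allocating GPU resources for containerized training workloads. As an illustration only, here is a minimal Kubernetes Job manifest built as a plain Python dict. The job name, image name, and `train.py` entrypoint are hypothetical, and the `nvidia.com/gpu` resource is only schedulable on clusters with the NVIDIA device plugin installed.

```python
# Illustrative sketch: a batch/v1 Job manifest that requests GPUs for a
# training container. Names and the container entrypoint are hypothetical.
import json

def training_job_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Kubernetes Job manifest (as a dict) requesting `gpus` GPUs."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # retry a failed pod up to twice
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "command": ["python", "train.py"],
                        # GPU allocation happens via extended resources:
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                },
            },
        },
    }

if __name__ == "__main__":
    manifest = training_job_manifest("bert-finetune",
                                     "registry.example.com/train:latest")
    # Kubernetes accepts JSON as well as YAML, so this output could be
    # piped to `kubectl apply -f -` on a real cluster.
    print(json.dumps(manifest, indent=2))
```

The scheduler places the resulting pod only on a node advertising a free `nvidia.com/gpu`, which is the resource-allocation behavior the FAQ describes.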