Kubernetes AI Jobs

Discover the latest remote and onsite Kubernetes AI roles across top active AI companies. Updated hourly.

Check out 301 new Kubernetes AI role opportunities posted on The Homebase

Head of Product, AI

New
Top rated
Bjak
Full-time
Posted

Own the end-to-end AI product strategy grounded in technical feasibility and real-world constraints; translate model capabilities, data limitations, and evaluation results into clear product decisions; make hard trade-offs across quality, latency, cost, reliability, and user experience; work daily with ML, backend, and mobile engineers on design, evaluation, and iteration; define success metrics and feedback loops across offline evaluation, online experiments, and human feedback; drive execution with clear specifications, risk awareness, and disciplined prioritization; ensure AI features ship quickly, safely, and reliably into production; and own AI product quality across user experience, correctness, and outcomes.

Undisclosed

Beijing, China
Maybe global
Remote
Python
MLflow
Prompt Engineering
OpenAI API
Transformers

Product Manager, Models

New
Top rated
Heidi Health
Full-time
Posted

As the Product Manager for Heidi's models platform, you will own the product strategy and roadmap for the platform including evaluation pipelines, fine-tuning infrastructure, model routing, and safety systems. Your responsibilities include prioritising your team's work across enablement requests, model safety and quality, and new capability bets; fixing platform issues that cause blocks for product teams; building evaluation tooling and fine-tuning workflows usable in clinical settings; deciding improvements based on clinician feedback, model quality signals, and product team needs; allocating engineering capacity among competing requests and clearly communicating deferrals; working with engineers on evaluation design, fine-tuning trade-offs, and model architecture decisions; setting model quality and safety targets based on clinical outcomes; consolidating duplicate infrastructure across product teams; and monitoring foundation model developments to adjust the roadmap accordingly. You will collaborate closely with engineers, researchers, product PMs, and clinical safety teams and report to product leadership. This is a platform role whose outputs impact every user-facing product at Heidi.

Undisclosed

Sydney, Australia
Maybe global
Remote
Python
Model Evaluation
MLOps
MLflow
Docker

Forward Deployed Engineer, Agentic Platform

New
Top rated
Cohere
Full-time
Posted

Build and ship features for North, an AI workspace platform; develop autonomous agents that interact with sensitive enterprise data; experiment rapidly and with high quality to engage customers and deliver solutions that exceed expectations; work across the entire product lifecycle from conceptualization through production; lead end-to-end deployment of North in private cloud and on-premises environments including planning, configuration, testing, and rollout.

Undisclosed

Middle East
Maybe global
Onsite
Python
RAG
Docker
Kubernetes
AWS

Forward Deployed Engineer - ML

New
Top rated
Modal
Full-time
Posted

As a Forward Deployed ML Engineer at Modal, you will work hands-on with companies like Suno, Lovable, Cognition, and Meta to architect and optimize production AI workloads on Modal. You will contribute to open-source projects, publish technical content demonstrating Modal's capabilities across the AI stack, and collaborate with Modal's product and sales teams as both an engineer and a product stakeholder. Additionally, you will build trusted relationships with technical leaders at companies doing frontier AI work and conduct technical demos, experiments, and proof-of-concepts that highlight Modal's performance advantages.

Undisclosed

Stockholm, Sweden
Maybe global
Onsite
Python
PyTorch
TensorFlow
MLOps
Docker

Research Product Manager — Structured AI Systems

New
Top rated
Granica
Full-time
Posted

The Research Product Manager is responsible for advancing foundational work in tabular data learning, structured and relational representation learning, compression-aware AI, hybrid symbolic, relational, and neural systems, and large-scale systems, linking these research efforts to real production systems managing petabytes of data. The role involves productionizing structured AI models by collaborating with Research and Systems teams to design training on Parquet/Iceberg/Delta data, define training infrastructure requirements, inference architectures, and maintenance loops, while understanding storage and compute trade-offs, data layout, compute scheduling, model lifecycle, infrastructure bottlenecks, and evaluation pipelines. The role also involves defining economic value extraction by identifying buyers, economic value sources, quantification methods, and converting research advances into revenue and platform advantages, requiring strong enterprise infrastructure economic intuition. Additionally, the Research Product Manager identifies viable modeling advances for production, terminates non-viable research directions, defines integration paths into enterprise workloads, and works with the Chief Research Scientist on research agenda prioritization. The position requires deep understanding of large AI model training, deployment, and maintenance in production systems, as well as translating foundational modeling advances into economically valuable infrastructure, shaping technical execution and economic strategy.

$160,000 – $250,000 per year (USD)

Mountain View, United States
Maybe global
Hybrid
Python
MLflow
MLOps
Data Pipelines
AWS

Senior Fullstack Software Engineer

New
Top rated
Heidi Health
Full-time
Posted

Build systems that integrate with the EHRs used in American healthcare to make Heidi feel like a native capability rather than a plugin. Develop systems that simplify the complexity of US healthcare billing, compliance, and payer constraints so clinicians do not have to manage these complexities. Write clean, testable code with strong interfaces, error handling, and observability, ensuring the workflows are reliable for clinicians, operators, and downstream systems. Focus on outcomes by ensuring that the built systems help clinicians and improve practice revenue. Create agentic workflow functionalities where AI assists with extraction, reconciliation, and drafting within workflows, incorporating human review, auditability, and control. Collaborate closely in a team environment with frequent pairing and shared ownership of design and implementation. Learn about healthcare organizational operations, especially those serving US customers, to translate requirements and constraints into product improvements.

$150,000 – $210,000 per year (USD)

London, United Kingdom
Maybe global
Hybrid
Python
JavaScript
TypeScript
Docker
Kubernetes

Director, Forward Deployed Engineering

New
Top rated
Harvey
Full-time
Posted

The Director of Forward Deployed Engineering will own the Forward Deployed Engineering program end-to-end, including building the team, defining the operating model, and ensuring top strategic accounts feel prioritized. Responsibilities include building, hiring, and managing a team of software engineers and managers deployed into strategic accounts; defining staffing models, engagement structures, and capacity allocation across accounts; developing specialist pods of engineers for new verticals; setting and upholding quality standards for client deliverables, documentation, and knowledge transfer. The role also requires maintaining deep technical fluency to scope custom builds, unblock engineering decisions, and evaluate solution quality; overseeing the design and implementation of tailored workflows, retrieval systems, agent tools, and knowledge sources on Harvey's platform; and ensuring solutions are operationalized with evaluations, documentation, and user training. Additionally, the Director will identify patterns across client engagements to inform product and engineering leadership about client needs and product opportunities with specificity.

$320,000 – $360,000 per year (USD)

New York, United States
Maybe global
Onsite
Python
JavaScript
TypeScript
OpenAI API
Prompt Engineering

Senior Program Manager, Infrastructure Strategy and Business Operations

New
Top rated
Together AI
Full-time
Posted

Advance inference efficiency end-to-end by designing and prototyping algorithms, architectures, and scheduling strategies for low-latency, high-throughput inference. Implement and maintain changes in high-performance inference engines, including kernel backends, speculative decoding, and quantization. Profile and optimize performance across GPU, networking, and memory layers to improve latency, throughput, and cost. Design and operate RL and post-training pipelines optimizing algorithms and systems where most cost is inference. Make RL and post-training workloads more efficient with inference-aware training loops and techniques for large-scale rollout collection and evaluation. Use pipelines to train, evaluate, and iterate on frontier models based on the inference stack. Co-design algorithms and infrastructure tightly coupling objectives, rollout collection, and evaluation to efficient inference and quickly identify bottlenecks across training engine, inference engine, data pipeline, and user-facing layers. Run ablations and scale-up experiments to understand trade-offs between model quality, latency, throughput, and cost, and feed insights back into model, RL, and system design. Profile, debug, and optimize inference and post-training services under real production workloads. Drive roadmap items requiring engine modifications including kernels, memory layouts, scheduling logic, and APIs. Establish metrics, benchmarks, and experimentation frameworks to rigorously validate improvements. Provide technical leadership, set technical direction for cross-team efforts intersecting inference, RL, and post-training, and mentor engineers and researchers on full-stack ML systems work and performance engineering.

$200,000 – $280,000 per year (USD)

San Francisco, United States
Maybe global
Onsite
Python
Reinforcement Learning
MLOps
MLflow
Docker

Helix AI Engineer, Agentic Systems

New
Top rated
Figure AI
Full-time
Posted

Design, deploy, and maintain Figure's training clusters. Architect and maintain scalable deep learning frameworks for training on massive robot datasets. Collaborate with AI researchers to implement training of new model architectures at large scale. Implement distributed training and parallelization strategies to reduce model development cycles. Implement tooling for data processing, model experimentation, and continuous integration.

$150,000 – $350,000 per year (USD)

San Jose, United States
Maybe global
Onsite
Python
PyTorch
AWS
Azure
GCP

Software Engineer, Backend

New
Top rated
Mirage
Full-time
Posted

Design, build, and own backend systems end-to-end, including services, APIs, data pipelines, and infrastructure that power the products. Solve complex technical challenges across distributed systems, scaling, concurrency, and performance. Integrate and operate large generative AI models in production by deploying, serving, and scaling systems that combine internal research and external capabilities to unlock new product experiences. Instrument, experiment, and iterate in production to continuously improve system and product quality. Design and operate core platform infrastructure, including integrations with third-party providers, storage systems, security, and internal APIs.

$185,000 – $285,000 per year (USD)

Union Square, New York, United States
Maybe global
Onsite
Python
JavaScript
Java
Go
Docker

Want to see more AI Engineer jobs?

View all jobs

Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.
(Yes, it’s still free—your best contributions are the price of admission.)

Frequently Asked Questions

Need help with something? Here are our most frequently asked questions.


What are Kubernetes AI jobs?

Kubernetes AI jobs involve orchestrating containerized machine learning applications at scale. Professionals in these roles manage container deployment for AI workloads, distribute computational tasks across nodes for model training, allocate GPU resources efficiently, and automate ML pipelines. They typically work with frameworks like TensorFlow and PyTorch while ensuring high availability for production AI systems through automated scaling and self-healing capabilities.

What roles commonly require Kubernetes skills?

Roles requiring Kubernetes skills include Machine Learning Engineers who deploy models to production, MLOps Engineers working with platforms like Kubeflow, Data Engineers managing processing pipelines, Platform Engineers supporting agentic AI applications, DevOps/SRE professionals handling containerized deployments, and Cloud Architects designing scalable environments. These positions typically involve maintaining infrastructure that supports the complete machine learning lifecycle.

What skills are typically required alongside Kubernetes?

Alongside Kubernetes, employers typically look for container fundamentals (especially Docker), distributed systems knowledge, CI/CD pipeline experience, and cloud platform familiarity. Programming skills are essential for deployment scripts, while experience with ML frameworks like TensorFlow or PyTorch is valuable for AI-specific implementations. Understanding storage solutions, Kubernetes operators, and automated infrastructure management rounds out the typical skill requirements.

What experience level do Kubernetes AI jobs usually require?

Kubernetes AI jobs typically require mid- to senior-level experience. Employers look for professionals who understand containerization concepts, have worked with distributed systems, and can manage complex ML workflows. Prior exposure to the cloud environments where Kubernetes runs is important. Candidates should demonstrate practical experience with CI/CD pipelines and familiarity with at least one major ML framework.

What is the salary range for Kubernetes AI jobs?

Kubernetes AI jobs command competitive salaries due to the specialized intersection of container orchestration and machine learning skills. Compensation varies based on experience level, location, and industry. Roles requiring both strong AI expertise and Kubernetes infrastructure management typically offer premium compensation compared to general software engineering positions, reflecting the high market value of these combined skill sets.

Are Kubernetes AI jobs in demand?

Kubernetes AI jobs are in high demand as organizations increasingly adopt containerized applications for machine learning workloads. The growth is driven by enterprises scaling their AI operations, edge computing applications, and the need for platform-agnostic infrastructure. Companies seek professionals who can manage the complexity of distributed ML systems, particularly for high-availability production environments and automated ML pipelines.

What is the difference between Kubernetes and Docker in AI roles?

Docker creates containerized applications while Kubernetes orchestrates those containers at scale. In AI roles, Docker is used to package ML applications with their dependencies, while Kubernetes manages deployment across clusters, automates scaling during training, and handles resource allocation for GPUs. Docker provides consistency between environments, while Kubernetes adds critical production capabilities like load balancing, self-healing, and distributed computing for AI workloads.
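To make the GPU resource allocation mentioned in the FAQ concrete, here is a minimal sketch in Python that builds a Kubernetes pod manifest requesting one GPU for a training container. It assumes a cluster with the NVIDIA device plugin installed (which exposes the `nvidia.com/gpu` extended resource); the pod name, image, and training command are illustrative placeholders, not a prescribed setup.

```python
import json

def gpu_training_pod(name: str, image: str, gpus: int = 1) -> dict:
    """Return a Kubernetes pod manifest (as a dict) requesting `gpus` GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "trainer",
                "image": image,  # illustrative training image
                "command": ["python", "train.py"],
                "resources": {
                    # GPUs are requested via the extended resource exposed
                    # by the device plugin; they are specified under limits.
                    "limits": {"nvidia.com/gpu": str(gpus)},
                },
            }],
        },
    }

manifest = gpu_training_pod("bert-finetune", "pytorch/pytorch:latest")
print(json.dumps(manifest, indent=2))
```

Serialized to JSON or YAML and submitted with `kubectl apply -f`, a manifest like this lets the scheduler place the pod only on a node with a free GPU, which is the resource-allocation behavior the FAQ describes.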