AI MLOps / DevOps Engineer Jobs | Top AI MLOps / DevOps Engineer Openings in 2025

Senior Site Reliability Engineer (SRE) - (Brazil)

Articul8

51-100

-

Brazil

Full-time

Remote

About UsArticul8 AI is at the forefront of Generative AI innovation, delivering cutting-edge SaaS products that transform how businesses operate. Our platform empowers organizations to leverage the power of artificial intelligence in a reliable, scalable, and secure environment.Position OverviewWe are seeking an experienced Site Reliability Engineer (SRE) to join our team and help ensure the reliability, performance, and scalability of our GenAI SaaS platform. As an SRE, you will bridge the gap between development and operations, implementing automation and best practices to maintain our service reliability objectives while supporting rapid innovation.Key ResponsibilitiesArchitect and maintain scalable, highly available infrastructure for our GenAI platform.Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.Optimize infrastructure for performance, scalability, and cost-effectiveness—especially for high-demand AI workloads.Implement and enforce security best practices across all systems and environments.Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.QualificationsRequiredBachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience5+ years of experience in DevOps, SRE, or similar rolesStrong experience with cloud platforms (AWS, GCP, or Azure)Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)Solid background in containerization technologies (Docker, Kubernetes)Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)Strong understanding of CI/CD pipelines and automationExceptional troubleshooting and problem-solving skills and ability to troubleshoot complex systemsPreferredExperience supporting AI/ML systems in productionKnowledge of GPU infrastructure management and optimizationFamiliarity with distributed systems and high-performance computingExperience with database systems (SQL and NoSQL)Certifications in cloud platforms (AWS, GCP, Azure)Experience with chaos engineering and resilience testingKnowledge of security best practices and compliance requirementsReady to shape the future of resilient software systems? Apply now and help drive the reliability of tomorrow’s AI at Articul8 AI!

MLOps / DevOps Engineer

Software Engineer

Apply

August 5, 2025

Hidden link

Senior Site Reliability Engineer (SRE)

Articul8

51-100

-

United States

Full-time

Remote

About UsArticul8 AI is at the forefront of Generative AI innovation, delivering cutting-edge SaaS products that transform how businesses operate. Our platform empowers organizations to leverage the power of artificial intelligence in a reliable, scalable, and secure environment.Position OverviewWe are seeking an experienced Site Reliability Engineer (SRE) to join our team and help ensure the reliability, performance, and scalability of our GenAI SaaS platform. As an SRE, you will bridge the gap between development and operations, implementing automation and best practices to maintain our service reliability objectives while supporting rapid innovation.Key ResponsibilitiesArchitect and maintain scalable, highly available infrastructure for our GenAI platform.Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.Optimize infrastructure for performance, scalability, and cost-effectiveness—especially for high-demand AI workloads.Implement and enforce security best practices across all systems and environments.Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.QualificationsRequiredBachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience5+ years of experience in DevOps, SRE, or similar rolesStrong experience with cloud platforms (AWS, GCP, or Azure)Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)Solid background in containerization technologies (Docker, Kubernetes)Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)Strong understanding of CI/CD pipelines and automationExceptional troubleshooting and problem-solving skills and ability to troubleshoot complex systemsPreferredExperience supporting AI/ML systems in productionKnowledge of GPU infrastructure management and optimizationFamiliarity with distributed systems and high-performance computingExperience with database systems (SQL and NoSQL)Certifications in cloud platforms (AWS, GCP, Azure)Experience with chaos engineering and resilience testingKnowledge of security best practices and compliance requirementsReady to shape the future of resilient software systems? Apply now and help drive the reliability of tomorrow’s AI at Articul8 AI!

MLOps / DevOps Engineer

Apply

August 5, 2025

Hidden link

Software Engineer (Site Reliability)

Anyscale

501-1000

-

No items found.

Full-time

Remote

About Anyscale:At Anyscale, we're on a mission to democratize distributed computing and make it accessible to software developers of all skill levels. We’re commercializing Ray, a popular open-source project that's creating an ecosystem of libraries for scalable machine learning. Companies like OpenAI, Uber, Spotify, Instacart, Cruise, and many more, have Ray in their tech stacks to accelerate the progress of AI applications out into the real world.With Anyscale, we’re building the best place to run Ray, so that any developer or data scientist can scale an ML application from their laptop to the cluster without needing to be a distributed systems expert.Proud to be backed by Andreessen Horowitz, NEA, and Addition with $250+ million raised to date.About the role:As a Site Reliability Engineer, you will play a crucial role in ensuring the smooth operation of all user-facing services and other Anyscale production systems. Anyscale values diversity and inclusion, and we encourage applications from individuals of all backgrounds. This includes processes for provisioning, negotiating prices, managing costs, seeing opportunities for teams to reduce wastage by finding applications across the company. You will apply sound engineering principles, operational discipline, and mature automation to our environments and the Anyscale codebase as we scale.As part of this role, you will:Develop a unified perspective on how cloud components are utilized across the company, taking into account diverse needs and requirements.Ensure that deployment methodologies align with the company's reliability goals.Build systems that promote understanding of production environments, facilitating quick identification of issues through robust observability infrastructure for metrics, logging, and tracing.Create monitoring and alerting systems at different levels, enabling teams to easily contribute and enhance the overall monitoring capabilities.Establish testing infrastructure to support the team in writing and executing tests effectively.Develop tools for measuring service level objectives (SLOs) and define organization-wide SLOs.Implement best practices and on-call systems, ensuring efficient incident management and up-leveling the incident management system at Anyscale.Coordinate the creation and deployment of cloud-based services, including tracking deployments and establishing effective communication channels for issue resolution.We'd love to hear from you if have:At least 3 years of relevant work experience in a similar role.Anyscale Inc. is an E-Verify company and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish

MLOps / DevOps Engineer

Software Engineer

Apply

August 4, 2025

Hidden link

Senior DevOps Engineer

Hippocratic AI

101-200

-

India

Full-time

Remote

About UsHippocratic AI has developed a safety-focused Large Language Model (LLM) for healthcare. The company believes that a safe LLM can dramatically improve healthcare accessibility and health outcomes in the world by bringing deep healthcare expertise to every human. No other technology has the potential to have this level of global impact on health.Why Join Our TeamInnovative Mission: We are developing a safe, healthcare-focused large language model (LLM) designed to revolutionize health outcomes on a global scale.Visionary Leadership: Hippocratic AI was co-founded by CEO Munjal Shah, alongside a group of physicians, hospital administrators, healthcare professionals, and artificial intelligence researchers from leading institutions, including El Camino Health, Johns Hopkins, Stanford, Microsoft, Google, and NVIDIA.Strategic Investors: We have raised a total of $278 million in funding, backed by top investors such as Andreessen Horowitz, General Catalyst, Kleiner Perkins, NVIDIA’s NVentures, Premji Invest, SV Angel, and six health systems.World-Class Team: Our team is composed of leading experts in healthcare and artificial intelligence, ensuring our technology is safe, effective, and capable of delivering meaningful improvements to healthcare delivery and outcomes.For more information, visit www.HippocraticAI.com.We value in-person teamwork and believe the best ideas happen together. Our team is expected to be in the office five days a week in Delhi, NCR.OverviewWe are seeking a highly skilled DevOps Engineer to join our team. In this role responsibilities will include designing and implementing infrastructure automation, continuous integration and delivery pipelines, and monitoring and scaling the infrastructure that powers our healthcare AI platform. You will work closely with software engineers, research scientists, and other cross-functional teams to develop and maintain reliable and scalable infrastructure that enables rapid iteration and deployment of our products.Key ResponsibilitiesDesign and implement infrastructure automation and deployment pipelines using tools such as Terraform, Ansible, and JenkinsImplement and maintain monitoring and logging systems to ensure the reliability and performance of our healthcare AI platformWork closely with software engineers to design and deploy scalable, fault-tolerant, and secure production systems on cloud platforms such as AWS, GCP, or AzureDevelop and maintain security and compliance policies and procedures for our healthcare AI platformCollaborate with cross-functional teams to troubleshoot and resolve complex issues related to infrastructure, deployment, and operationsImplement and maintain disaster recovery and business continuity plansDevelop and maintain documentation related to infrastructure, deployment, and operationsMentor and provide technical guidance to junior engineersQualificationsBachelor's or Master's degree in Computer Science, Computer Engineering, or a related fieldAt least 5 years of professional experience in DevOps engineering or a related fieldExpertise in infrastructure automation and deployment tools such as Terraform, Ansible, Jenkins, or GitLab CI/CDExperience with cloud platforms such as AWS, GCP, or AzureStrong knowledge of containerization technologies such as Docker and KubernetesExperience with monitoring and logging tools such as ELK, Grafana, or DatadogFamiliarity with security and compliance best practices and tools such as HashiCorp Vault, AWS KMS, or Azure Key VaultStrong problem-solving skills and ability to work independently and collaboratively in a team environmentExcellent communication and interpersonal skillsExperience implementing HIPAA and SOC2 compliance in a plusExperience working in an HPC Environment is a plus***Be aware of recruitment scams impersonating Hippocratic AI. All recruiting communication will come from @hippocraticai.com email addresses. We will never request payment or sensitive personal information during the hiring process. If anything appears suspicious, stop engaging immediately and report the incident.

MLOps / DevOps Engineer

Apply

August 1, 2025

Hidden link

Embedded Systems Integration Engineer

Figure AI

201-500

USD

0

140000

-

180000

United States

Full-time

Remote

Figure is an AI Robotics company developing a general purpose humanoid. Our humanoid robot, Figure 02, is designed for commercial tasks and the home. We are based in San Jose, CA and require 5 days/week in-office collaboration. It’s time to build.About the Role We’re seeking an Embedded Systems Integration Engineer to build the backend infrastructure that validates the interaction between our hardware, firmware, and software. You will be the connective tissue across disciplines: owning how changes to firmware or software are tested against real hardware. Your work ensures we ship reliable, integrated systems that just work. This is a hands-on role where you will design and implement automated test frameworks, bring-up flows, and validation pipelines for embedded subsystems. You'll be responsible for catching regressions early, enabling fast iteration, and giving clear system-level pass/fail signals across the stack. Key Responsibilities Architect test infrastructure that exercises end-to-end functionality of embedded systems across hardware, firmware, and software boundaries. Develop backend systems (Python, CLI tools, internal APIs) to run tests, log results, and determine pass/fail conditions. Bring up and validate  subsystem and system level changes, tracking changes in behavior and performance across releases. Automate testing pipelines for regression detection and continuous integration. Debug and triage failures across layers—hardware faults, firmware bugs, or software integration issues. Collaborate with firmware, software, and hardware teams to define interface contracts and testable behaviors. Instrument devices under test using scopes, logic analyzers, and custom harnesses to characterize system response. Minimum Qualifications Bachelor’s in EE, CE, CS, or a related field. 3+ years of experience working with embedded systems. Strong understanding of how firmware interacts with hardware peripherals (I2C, Ethernet, SPI, CAN, UART, ADCs, GPIO, etc.). Proficiency in Python or similar scripting language for test automation. Experience bringing up custom embedded boards and working across firmware/software stacks. Familiarity with Linux-based development environments. Preferred Qualifications Experience with CI/CD tools (e.g., GitHub Actions, Jenkins, TeamCity). Knowledge of test automation frameworks (e.g., PyTest, Robot Framework). Exposure to hardware-in-the-loop (HIL) systems. Familiarity with board-level validation, power-on sequencing, or sensor verification. Prior experience in robotics, automotive, aerospace, or other complex embedded systems. Comfort working hands-on at the bench with test equipment. The US base salary range for this full-time position is between $140,000 and $180,000 annually. The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended. 

MLOps / DevOps Engineer

Robotics Engineer

Software Engineer

Apply

August 1, 2025

Hidden link

Sr. Security Engineer

Thoughtful

101-200

USD

0

170000

-

220000

United States

Full-time

Remote

Join Our Mission to Revolutionize Healthcare Thoughtful is pioneering a new approach to automation for all healthcare providers! Our AI-powered Revenue Cycle Automation platform enables the healthcare industry to automate and improve its core business operations. We’re hiring a Senior Security Engineer to secure and scale our stack. You’ll own platform security, system reliability, audit readiness, and integration strategy across cloud and hybrid environments.  You'll take ownership of system reliability, security posture, audit readiness, and help guide our long-term integration strategy across cloud and legacy environments. We're unifying cloud-native and legacy systems into a secure, high-availability platform that powers AI-driven automation across healthcare. You’ll lead foundational work in infrastructure hardening, audit controls, and production observability, directly supporting mission-critical AI agents. You'll have executive support and budget to modernize everything from our VPN tunnels to our alerting stack. What You’ll Own: Integration Strategy: Lead infrastructure and tooling decisions as we unify multiple environments into a single, scalable platform. Audit Readiness: Own and drive SOC 2 Type II and HITRUST prep, working across engineering, compliance, and security. System Reliability: Ensure uptime, scalability, and fault tolerance across services. Set and enforce SLAs. On-Call Infrastructure: Stand up our alerting, escalation, and incident response systems. Observability: Improve logging, metrics, and dashboards using tools like HyperDX. Infrastructure Provisioning: Spin up and manage production-grade infrastructure using OpenTofu/Terraform. Security & Networking: Architect infrastructure with security best practices, including VPNs, IPsec tunnels, and hybrid network topologies. Your Qualifications: 8+ years of experience spanning Security, DevOps, and/or SRE roles in high-availability, cloud and hybrid environments—with a strong track record of leading integrations, hardening infrastructure, and ensuring audit/compliance readiness. Start-up mentality - desire to tackle ambiguous scope of work and willing to do whatever is necessary to drive the company/mission forward. Track record leading complex infrastructure integrations Deep AWS expertise; strong experience with Azure and/or GCP a bonus Proficiency in OpenTofu or Terraform for Infrastructure-as-Code Comfortable navigating hybrid cloud environments (e.g. EKS, legacy VMs, VPN tunnels) Solid Kubernetes experience (Knative experience a plus) Strong networking fundamentals and experience with on-prem systems Familiar with incident tooling (PagerDuty, Opsgenie) and setting SLOs/SLAs Personable and cross-functional: able to build rapport with stakeholders across engineering, compliance, and executive leadership Security-first mindset, with an eye for compliance and audit readiness Proficiency in SOC2 Type 2, HITRUST preparation. Comfortable spinning up new infrastructure as needed What Sets You Apart: You've integrated cutting edge cloud environments with customer's legacy environments You’ve built platforms, not just maintained them You treat DevOps as a product, not just a support function You care about developer experience, observability, and operational excellence   Why Thoughtful? Competitive compensation Equity participation: Employee Stock Options. Health benefits: Comprehensive medical, dental, and vision insurance. Time off: Generous leave policies and paid company holidays.  California Salary Range $170,000—$220,000 USD

MLOps / DevOps Engineer

Software Engineer

Apply

July 31, 2025

Hidden link

AI Infrastructure Engineer

Abridge

201-500

USD

0

179000

-

248000

United States

Full-time

Remote

About AbridgeAbridge was founded in 2018 with the mission of powering deeper understanding in healthcare. Our AI-powered platform was purpose-built for medical conversations, improving clinical documentation efficiencies while enabling clinicians to focus on what matters most—their patients.Our enterprise-grade technology transforms patient-clinician conversations into structured clinical notes in real-time, with deep EMR integrations. Powered by Linked Evidence and our purpose-built, auditable AI, we are the only company that maps AI-generated summaries to ground truth, helping providers quickly trust and verify the output. As pioneers in generative AI for healthcare, we are setting the industry standards for the responsible deployment of AI across health systems.We are a growing team of practicing MDs, AI scientists, PhDs, creatives, technologists, and engineers working together to empower people and make care make more sense. We have offices located in the Mission District in San Francisco, the SoHo neighborhood of New York, and East Liberty in Pittsburgh. The RoleAs an AI Infrastructure Engineer at Abridge, you’ll play a pivotal role in building and optimizing the core infrastructure that powers our machine learning models. Your work will be instrumental in enhancing the scalability, efficiency, and performance of our AI-driven solutions. You will work with our Infrastructure and Research teams to build, deploy, optimize and orchestrate across our AI models.What You'll DoDesign, deploy and maintain scalable Kubernetes clusters for AI model inference and trainingDevelop, optimize, and maintain ML model serving and training infrastructure, ensuring high-performance and low-latency.Collaborate with ML and product teams to scale backend infrastructure for AI-driven products, focusing on model deployment, throughout optimization, and compute efficiency.Optimize compute-heavy workflows and enhance GPU utilization for ML workloads.Build a robust model API orchestration systemCollaborate with leadership to define and implement strategies for scaling infrastructure as the company grows, ensuring long-term efficiency and performance.What You’ll BringStrong experience in building and deploying machine learning models in production environments.Deep understanding of container orchestration and distributed systems architectureExpertise in Kubernetes administration, including custom resource definitions, operators, and cluster managementExperience developing APIs and managing distributed systems for both batch and real-time workloadsExcellent communication skills, with the ability to interface between research and product engineeringBonus Points IfExpertise with model serving frameworks such as NVIDIA Triton Server, VLLM, TRT-LLM and so on.Expertise with ML toolchains such as PyTorch, Tensorflow or distributed training and inference libraries.Familiarity with GPU cluster management and CUDA optimizationKnowledge of infrastructure as code (Terraform, Ansible) and GitOps practicesExperience with container registries, image optimization, and multi-stage builds for ML workloadsExperience orchestrating across ASR models or LLM models for building various GenAI applicationsWhy Work at Abridge?At Abridge, we’re transforming healthcare delivery experiences with generative AI, enabling clinicians and patients to connect in deeper, more meaningful ways. Our mission is clear: to power deeper understanding in healthcare. We’re driving real, lasting change, with millions of medical conversations processed each month.Joining Abridge means stepping into a fast-paced, high-growth startup where your contributions truly make a difference. Our culture requires extreme ownership—every employee has the ability to (and is expected to) make an impact on our customers and our business.Beyond individual impact, you will have the opportunity to work alongside a team of curious, high-achieving people in a supportive environment where success is shared, growth is constant, and feedback fuels progress. At Abridge, it’s not just what we do—it’s how we do it. Every decision is rooted in empathy, always prioritizing the needs of clinicians and patients.We’re committed to supporting your growth, both professionally and personally. Whether it's flexible work hours, an inclusive culture, or ongoing learning opportunities, we are here to help you thrive and do the best work of your life.If you are ready to make a meaningful impact alongside passionate people who care deeply about what they do, Abridge is the place for you.How we take care of Abridgers:Generous Time Off: 13 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees.Comprehensive Health Plans: Medical, Dental, and Vision plans for all full-time employees. Abridge covers 100% of the premium for you and 75% for dependents. If you choose a HSA-eligible plan, Abridge also makes monthly contributions to your HSA. Paid Parental Leave: 16 weeks paid parental leave for all full-time employees.401k and Matching: Contribution matching to help invest in your future.Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits.Learning and Development Budget: Yearly contributions for coaching, courses, workshops, conferences, and more.Sabbatical Leave: 30 days of paid Sabbatical Leave after 5 years of employment.Compensation and Equity: Competitive compensation and equity grants for full time employees.... and much more!Equal Opportunity EmployerAbridge is an equal opportunity employer and considers all qualified applicants equally without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, or disability.Staying safe - Protect yourself from recruitment fraudWe are aware of individuals and entities fraudulently representing themselves as Abridge recruiters and/or hiring managers. Abridge will never ask for financial information or payment, or for personal information such as bank account number or social security number during the job application or interview process. Any emails from the Abridge recruiting team will come from an @abridge.com email address. You can learn more about how to protect yourself from these types of fraud by referring to this article. Please exercise caution and cease communications if something feels suspicious about your interactions.

MLOps / DevOps Engineer

Apply

July 31, 2025

Hidden link

AI Security Engineer

Perplexity

1001-5000

USD

0

200000

-

280000

No items found.

Full-time

Remote

Perplexity is an AI-powered answer engine founded in December 2022 and growing rapidly as one of the world’s leading AI platforms. Perplexity has raised over $1B in venture investment from some of the world’s most visionary and successful leaders, including Elad Gil, Daniel Gross, Jeff Bezos, Accel, IVP, NEA, NVIDIA, Samsung, and many more. Our objective is to build accurate, trustworthy AI that powers decision-making for people and assistive AI wherever decisions are being made. Throughout human history, change and innovation have always been driven by curious people. Today, curious people use Perplexity to answer more than 780 million queries every month–a number that’s growing rapidly for one simple reason: everyone can be curious. Perplexity is seeking a highly skilled, experienced, and hands-on AI Security Engineer to join our security team, driving the protection of next-generation AI systems against adversarial threats. In this role, you’ll design and implement robust mechanisms to secure self-hosted models, LLM APIs, agents, MCPs, and the core AI stack. You'll empower developers with tools and guidance, as well as technical contributions, enabling innovation while ensuring AI security is strong by default. Our tech stack includes Python, NextJS, TypeScript, Docker, AWS, Kubernetes, and PostgreSQL. Responsibilities Define, build, and refine mechanisms to secure AI systems (including self-hosted models, LLM APIs, agents, MCPs, and other core components of the AI stack) against adversarial behavior of all kinds Understand technically complex AI systems, identify potential weaknesses in their architecture, and implement improvements At least 50% of time performing hands-on remediation. Also working closely with peer engineers to drive remediations Plan and carry out threat modeling activities and realistic threat simulations across our offerings Conduct cybersecurity evaluations and lead AI security assessments in a cross-functional environment Develop initiatives that improve our capabilities to effectively evaluate AI systems and enhance the organization's prevention, detection, response, and threat hunting capabilities Provide guidance and education to developers to help deter and prevent threats Qualifications Hands-on coding and prompting experience. Bachelor of Science or Master of Science in Computer Science or a related field, or equivalent experience Be a technical and process subject matter expert regarding AI security services and attacker tactics, techniques, and procedures Good understanding of LLMs, AI architecture patterns, machine learning models, and related technologies such as MCP Understanding of application security principles and secure coding practices Experience developing and implementing security procedures and policies Strong problem-solving, project management, leadership, and communication skills Self-motivated with a willingness to take ownership of tasks 4+ years of industry experience The cash compensation range for this role is $200,000 - $280,000. Final offer amounts are determined by multiple factors, including, experience and expertise, and may vary from the amounts listed above.   Equity: In addition to the base salary, equity may be part of the total compensation package. Benefits: Comprehensive health, dental, and vision insurance for you and your dependents. Includes a 401(k) plan.

MLOps / DevOps Engineer

Machine Learning Engineer

Software Engineer

Apply

July 30, 2025

Hidden link

Storage Engineering Manager

Lambda AI

501-1000

USD

0

330000

-

495000

United States

Full-time

Remote

Lambda is the #1 GPU Cloud for ML/AI teams training, fine-tuning and inferencing AI models, where engineers can easily, securely and affordably build, test and deploy AI products at scale. Lambda’s product portfolio includes on-prem GPU systems, hosted GPUs across public & private clouds and managed inference services – servicing government, researchers, startups and Enterprises world-wide. If you'd like to build the world's best deep learning cloud, join us. *Note: This position requires presence in our San Jose office location 4 days per week; Lambda’s designated work from home day is currently Tuesday. Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.In the world of distributed AI, raw GPU and CPU horsepower is just a part of the story. High-performance networking and storage are the critical components that enable and unite these systems, making groundbreaking AI training and inference possible.The Lambda Infrastructure Engineering organization forges the foundation of high-performance AI clusters by welding together the latest in AI storage, networking, GPU and CPU hardware.Our expertise lies at the intersection of:High-Performance Distributed Storage Solutions and Protocols: We engineer the protocols and systems that serve massive datasets at the speeds demanded by modern clustered GPUs.Dynamic Networking: We design advanced networks that provide multi-tenant security and intelligent routing without compromising performance, using the latest in AI networking hardware.Compute Virtualization: We enable cutting-edge virtualization and clustering that allows AI researchers and engineers to focus on AI workloads, not AI infrastructure, unleashing the full compute bandwidth of clustered GPUs.About the Role:We are seeking a seasoned Storage Engineering Manager with experience in the specification, evaluation, deployment, and management of HPC storage solutions across multiple datacenters to build out a world-class team. You will hire and guide a team of storage engineers in building storage infrastructure that serves our AI/ML infrastructure products, ensuring the seamless deployment and operational excellence of both the physical and logical storage infrastructure (including proprietary and open source solutions).Your role is not just to manage people, but to serve as the ultimate technical and operational authority for our high-performance, petabyte-scale storage solutions.Your leadership will be pivotal in ensuring our systems are not just high-performing, but also reliable, scalable, and manageable as we grow toward exascale.This is a unique opportunity to work at the intersection of large-scale distributed systems and the rapidly evolving field of artificial intelligence infrastructure. This is an opportunity to have a significant impact on the future of AI. You will be building the foundational infrastructure that powers some of the most advanced AI research and products in the world.What You’ll DoTeam Leadership & Management:Grow/Hire, lead, and mentor a top-talent team of high-performing storage engineers delivering HPC, petabyte-scale storage solutions.Foster a high-velocity culture of innovation, technical excellence, and collaboration.Conduct regular one-on-one meetings, provide constructive feedback, and support career development for team members.Drive outcomes by managing project priorities, deadlines, and deliverables using Agile methodologies.Technical Strategy & Execution:Drive the technical vision and strategy for Lambda distributed storage solutions.You will lead storage vendor selection criteria, vendor selection, and vendor relationship management (support, installation, scheduling, specification, procurement).Manage team in storage lifecycle management (installation, cabling, capacity upgrades, service, RMA, updating both hardware and software components as needed).You will guide choices around optimization of storage pools, sharding, and tiering/caching strategies.Lead team in tasks related to multi-tenant security, tenant provisioning, metering integration, storage protocol interconnection, and customer data-migration.Guide Storage SREs in development of scripting and automation tools for configuration management, monitoring, and operational tasks.Guide team in problem identification, requirements gathering, solution ideation, and stakeholder alignment on engineering RFCs.Lead the team in supporting customers.Cross-Functional Collaboration:Collaborate with the HPC Architecture team on drive selection, capacity determination, storage networking, cache placement, and rack layouts.Work closely with the storage software teams and networking teams to execute on cross-functional infrastructure initiatives and new data-center deployments including integration of storage protocols across a variety of on-prem storage solutions.Work with procurement data-center operations, and fleet engineering teams to deploy storage solutions into new and existing data centers.Work with vendors to troubleshoot customer performance, reliability, and data-integrity issues.Work closely with Networking, Compute, and Storage Software Engineering teams to deploy high-performance distributed storage solutions to serve AI/ML workloads.Partner with the fleet engineering team to ensure seamless deployment, monitoring, and maintenance of the distributed storage solutions.Innovation & Research:Stay current with the latest trends and research into AI and HPC storage technologies and vendor solutions.Guide team in investigating strategies for using Nvidia SuperNIC DPUs for storage edge-caching, offloading, and GPUDirect Storage capabilities.Work with the Lambda product team to uncover new trends in the AI inference and training product category that will inform emerging storage solutions.Encourage and support the team in exploring new technologies and approaches to improve system performance and efficiency.YouExperience:10+ years of experience in storage engineering with at least 5+ years in a management or lead role.Demonstrated experience leading a team of storage engineers and storage SREs on complex, cross-functional projects in a fast-paced startup environment.Extensive hands-on experience in designing, deploying, and maintaining distributed storage solutions in a CSP (Cloud Service Provider), NCP (Neo-Cloud provider), HPC-infrastructure integrator, or AI-infrastructure company.Experience with storage solutions serving storage volumes at a scale greater than 20PB.Strong project management skills, leading high-confidence planning, project execution, and delivery of team outcomes on schedule.Extensive experience with storage site reliability engineering.Experience with one or more of the following in an HPC or AI Infrastructure environment: Vast, DDN, Pure Storage, NetApp, Weka.Experience deploying CEPH at scale greater than 25PB.Technical Skills:Experience in serving one or more of the following storage protocols: object storage (e.g., S3), block storage (e.g., iSCSI), or file storage (e.g., NFS, SMB, Lustre).Professional individual contributor experience as a storage engineer or storage SRE.Familiarity with modern storage technologies (e.g., NVMe, RDMA, DPUs) and their role in optimizing performance.People Management:Experience building a high-performance team through deliberate hiring, upskilling, planned skills redundancy, performance-management, and expectation setting.Nice to HaveExperience:Experience driving cross-functional engineering management initiatives (coordinating events, strategic planning, coordinating large projects).Experience with NVidia SuperNIC DPUs for edge-caching (such as implementing GPUDirect Storage).Technical Skills:Deep experience with Vast, Weka and/or NetApp in an HPC or AI Infrastructure environment.Deep experience implementing CEPH in an HPC or AI infrastructure environment at a scale greater than 100PB.People Management:Experience driving organizational improvements (processes, systems, etc.)Experience training, or managing managers.Salary Range InformationThe annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description. About LambdaFounded in 2012, ~400 employees (2025) and growing fastWe offer generous cash & equity compensationOur investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitabilityOur research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOGHealth, dental, and vision coverage for you and your dependentsWellness and Commuter stipends for select roles401k Plan with 2% company match (USA employees)Flexible Paid Time Off Plan that we all actually useA Final Note:You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.Equal Opportunity EmployerLambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

MLOps / DevOps Engineer

Software Engineer

Apply

July 30, 2025

Hidden link

Infrastructure Engineer

Tandem

1001-5000

USD

0

150000

-

250000

United States

Full-time

Remote

Why you should join usTandem is a generational opportunity to rethink how we bring new therapies to market, and our path to doing so is significantly de-risked – we have:Exponential organic growth: We have product-market fit and are growing rapidly through word-of-mouth. Tandem supports thousands of patients every day, is doubling doctor users every quarter, and is working with the largest biopharma companies in the world.An AI-first business model: Our approach is distinctly enabled by AI, but our business will get stronger (not commoditized) as foundation models improve. We are building durability through two-sided network effects that will compound over time.Top tier investors: With the traction to support conviction in our model, we raised significant funding from investors (including Thrive Capital, General Catalyst, Bain Capital Ventures, and Pear VC) to build an exceptional team of engineers and operators.Our number one priority is scaling to market demand. We are looking for individuals who are high horsepower, high throughput, and hyper resourceful to help us increase capacity and grow. We move fast and need to move faster.All full-time roles are in person in New York. You can learn more about working with us in the last section of this page.About the roleAs an Infrastructure Engineer at Tandem, you’ll be our first dedicated hire focused on infrastructure and developer experience. You’ll own and evolve the core systems that make our engineering team faster, our platform more reliable, and our company more scalable. This role sits at the foundation of everything we build — from AI product development to partner integrations to high-stakes, high-throughput operations.You’ll have the opportunity to define our foundational infra practices — from CI/CD to observability to cloud architecture — and set the tone for how we scale. You’ll work across infrastructure, DevOps, and developer productivity, building systems that let us move fast without compromising stability or clarity.This is a demanding role, with a high level of autonomy and responsibility. You will be expected to "act like an owner" and commit yourself to Tandem's success. If you are low-ego, hungry to learn, and excited about intense, impactful work that drives both company growth and accelerated career progression, we want to hear from you.If you join, you will:Design and evolve CI/CD pipelines to improve speed, safety, and developer experienceScale and maintain core infrastructure — including Kubernetes clusters, PostgreSQL databases, ephemeral browsers, and data replication workflowsBuild and own our monitoring, alerting, and observability systems to ensure platform reliability and uptimeIntegrate AI-powered development tools to streamline engineering workflowsAnalyze and control cloud infrastructure spend while maintaining performance and reliabilityImprove internal tooling and developer environments to unblock execution at every layerPartner with engineering and product teams to anticipate scaling challenges and harden critical systemsWe’ll be most excited if you:Experience in infrastructure, DevOps, or SRE roles at fast-moving, product-driven tech companiesHands-on experience with cloud infrastructure (we use AWS; GCP and Azure are nice to have)Proficiency with infrastructure-as-code tooling (Terraform preferred)Deep familiarity with containerization and orchestration (Docker, Kubernetes)Experience with observability and monitoring systems (e.g., Grafana, Datadog)Proven ability to design and improve CI/CD pipelines (we use GitHub Actions)High NPS with your former teammatesThis is a list of ideal qualifications for this position. If you don't meet every single one of them, you should still consider applying! We’re excited to work with people from underrepresented backgrounds, and we encourage people from all backgrounds to apply.Working with usTandem is based in New York, with our full team working out of a beautiful and spacious office in SoHo. We run as a high-trust environment with high autonomy, which requires that everyone is fully competent and operates in line with our principles:Commit to audacity. "Whether you think you can, or you think you can't – you're right.”Do the math. Be rigorous, assume nothing.Find the shortest path. Use hacks, favors, and backdoors. Only take a longer road on purpose.Spit it out. Be direct, invite critique, avoid equivocation – we want right answers.Be demanding and supportive. Expect excellence from everyone and offer help to achieve it.Do what it takes to be number 1. We work hard to make sure we win.We provide competitive compensation with meaningful equity (for full-time employees). Everyone who joins early will be a major contributor to our success, and we reflect this through ownership and pay.We also provide rich benefits to ensure you can focus on creating impact (for full-time employees):Fully covered medical, vision, and dental insurance.Memberships for One Medical, Talkspace, Teladoc, and Kindbody.Unlimited paid time off (PTO) and 16 weeks of parental leave.401K plan setup, FSA option, commuter benefits, and DashPass.Lunch at the office every day and Dinner at the office after 7 pm.Our salary ranges are based on paying competitively for our company’s size and industry, and are one part of the total compensation package that also includes equity, benefits, and other opportunities at Tandem (for full-time employees). Individual pay decisions are ultimately based on a number of factors, including qualifications for the role, experience level, skillset, geography, and balancing internal equity. Tandem is an equal opportunity employer and does not discriminate on the basis of race, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or related condition, or any other basis protected by law.

MLOps / DevOps Engineer

Software Engineer

Apply

July 27, 2025

Hidden link

Data Center Operations Engineer - Los Angeles

Lambda AI

501-1000

USD

86000

-

111000

United States

Full-time

Remote

Lambda is the #1 GPU Cloud for ML/AI teams training, fine-tuning and inferencing AI models, where engineers can easily, securely and affordably build, test and deploy AI products at scale. Lambda’s product portfolio includes on-prem GPU systems, hosted GPUs across public & private clouds and managed inference services – servicing government, researchers, startups and Enterprises world-wide. If you'd like to build the world's best deep learning cloud, join us. *Note: This position requires presence in our Vernon, CA Data Center 5 days per week.What You'll DoEnsure new server, storage and network infrastructure is properly racked, labeled, cabled, and configuredTroubleshoot hardware and software issues in some of the world’s most advanced systemsDocument data center layout and network topology in DCIM softwareWork with supply chain & manufacturing teams to ensure timely deployment of systems and project plans for large-scale deploymentsManage a parts depot inventory and track equipment through the delivery-store-stage-deploy-handoff process in each of our data centersWork closely with HW Support team to ensure data center infrastructure-related support tickets are resolvedWork with RMA team to ensure faulty parts are returned and replacements are orderedFollow installation standards and documentation for placement, labeling, and cabling to drive consistency and discoverability across all data centersYouAre familiar with critical infrastructure systems supporting data centers, such as power distribution, air flow management, environmental monitoring, capacity planning, DCIM software, structured cabling, and cable managementAre someone who pays attention to detail and has the ability to follow instructionsAre action-oriented and have a strong willingness to learnAre willing to travel for bring up of new data center locationsNice to HaveExperience with troubleshooting server hardwareExperience with Linux administrationExperience with working in large-scale distributed data center environmentsExperience Supermicro & Nvidia hardwareSalary Range InformationBased on market data and other factors, the annual salary range for this position is $86,000-$111,000. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.About LambdaFounded in 2012, ~400 employees (2025) and growing fastWe offer generous cash & equity compensationOur investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitabilityOur research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOGHealth, dental, and vision coverage for you and your dependentsWellness and Commuter stipends for select roles401k Plan with 2% company match (USA employees)Flexible Paid Time Off Plan that we all actually useA Final Note:You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.Equal Opportunity EmployerLambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

MLOps / DevOps Engineer

Apply

July 25, 2025

Hidden link

Senior Security Engineer, Detection & Response

Decagon

101-200

USD

0

200000

-

300000

United States

Full-time

Remote

About DecagonDecagon is the leading conversational AI platform empowering every brand to deliver concierge customer experience. Our AI agents provide intelligent, human-like responses across chat, email, and voice, resolving millions of customer inquiries across every language and at any time.Since coming out of stealth, Decagon has experienced rapid growth. We partner with industry leaders like Hertz, Eventbrite, Duolingo, Oura, Bilt, Curology, and Samsara to redefine customer experience at scale. We've raised over $200M from Bain Capital Ventures, Accel, a16z, BOND Capital, A*, Elad Gil, and notable angels such as the founders of Box, Airtable, Rippling, Okta, Lattice, and Klaviyo.We’re an in-office company, driven by a shared commitment to excellence and velocity. Our values—customers are everything, relentless momentum, winner’s mindset, and stronger together—shape how we work and grow as a team.About the TeamThe Platform Engineering team at Decagon designs the proprietary orchestration layer that powers the most advanced conversational AI agents for enterprise customers across voice, chat, email and SMS. Decagon’s leading customer support agents understand context, respond with genuine empathy, and solve complex problems with surgical precision.Our mission is to deliver magical support experiences — AI agents working alongside human agents to help users resolve their issues.About the RoleJoin Decagon's Security team to protect our AI-powered customer experience agents that handle millions of real customer interactions daily. You'll develop detection systems that identify threats without disrupting the natural conversation flow that makes our AI agents effective. This role focuses on data pipelines, LLM-powered detection query writing, and automated "Watchtower" components of our security stack. We work with some of the leading vendors in the security data space, and you’ll take ownership of the system to build an industry-leading DNR team. Decagon’s team is some of the best in the industry, so you’ll work alongside a skilled and enthusiastic team.In this role, you willCollaborate with engineering to build low-latency systems that detect prompt injection, jailbreak attempts, and social engineering attacks against customer experience agents without introducing conversation delaysDevelop ML models that identify surprising or unexpected access patterns in the productCreate an incident response system that analyzes prior interactions: piece together what occurred, what data was accessed, and where that information may have goneBuild APIs and webhooks that allow enterprise customers to integrate our security insights into their existing SOC and incident response workflowsContinuously research and model new threat patterns specific to customer service AI, including account takeover attempts and information extraction attacksYour background looks something like this4+ years building production security or data pipeline systemsAdvanced proficiency in Python with experience in data pipelines, automation tooling, and code review for production web applicationsTrack record of building detection systems that balance security with user experienceExperience with real-time data processing using Kafka, Pulsar, or similar systems for analyzing data streamsPrior experience with tools such as Splunk, Panther, RunReveal, or othersExperience with anomaly detection, sequence modeling, and statistical analysis of user behavior patternsExperience with SOC 2, ISO 27001, GDPR, and other enterprise security requirementsEven betterProven experience leveraging LLMs or AI tooling for efficiency improvementsStrong understanding of customer service workflows and business impact of security measuresSkilled and motivated to take advantage of advanced reasoning and software development tools such as Cursor, Claude Code, and Gemini for personal productivity improvements and design leverageBenefits:Medical, dental, and vision benefitsTake what you need vacation policyDaily lunches, dinners and snacks in the office to keep you at your bestCompensation$200K – $300K + Offers Equity

MLOps / DevOps Engineer

Machine Learning Engineer

Apply

July 24, 2025

Hidden link

Senior Security Engineer, Infrastructure

Decagon

101-200

USD

0

200000

-

300000

United States

Full-time

Remote

About DecagonDecagon is the leading conversational AI platform empowering every brand to deliver concierge customer experience. Our AI agents provide intelligent, human-like responses across chat, email, and voice, resolving millions of customer inquiries across every language and at any time.Since coming out of stealth, Decagon has experienced rapid growth. We partner with industry leaders like Hertz, Eventbrite, Duolingo, Oura, Bilt, Curology, and Samsara to redefine customer experience at scale. We've raised over $200M from Bain Capital Ventures, Accel, a16z, BOND Capital, A*, Elad Gil, and notable angels such as the founders of Box, Airtable, Rippling, Okta, Lattice, and Klaviyo.We’re an in-office company, driven by a shared commitment to excellence and velocity. Our values—customers are everything, relentless momentum, winner’s mindset, and stronger together—shape how we work and grow as a team.About the TeamThe Platform Engineering team at Decagon designs the proprietary orchestration layer that powers the most advanced conversational AI agents for enterprise customers across voice, chat, email and SMS. Decagon’s leading customer support agents understand context, respond with genuine empathy, and solve complex problems with surgical precision.Our mission is to deliver magical support experiences — AI agents working alongside human agents to help users resolve their issues.About the RoleDesign and implement the secure infrastructure that powers Decagon AI's customer experience platform. You'll build the foundational systems that support our AI agents' security framework while ensuring enterprise-grade reliability and scalability. This role focuses on GCP architecture improvements and building security infrastructure that enables our software-first approach to protecting conversational AI. Decagon’s team is some of the best in the industry, so you’ll work alongside a skilled and enthusiastic team.In this role, you willDesign and implement secure, multi-tenant infrastructure that isolates customer data while enabling efficient AI model servingBuild Infrastructure as Code systems for deploying and managing security tools across development and production environmentsEnsure our security infrastructure maintains 99.99% uptime to protect customer interactions without service disruptionOptimize security systems to handle millions of concurrent conversations while maintaining sub-100ms response timesImplement systems that support SOC 2, ISO 27001, and other enterprise compliance requirementsDesign and maintain backup systems and recovery procedures for security infrastructureYour background looks something like this5+ years building production infrastructure with security focus, preferably for SaaS or enterprise softwareDeep knowledge of Google Cloud Platform architecture, including compute, networking, security, and managed servicesAdvanced proficiency with Terraform, Ansible, or similar tools for automated infrastructure managementExperience with secure container deployment, service mesh architecture, and Kubernetes security best practicesProven experience building infrastructure that handles millions of requests per day with strict latency requirementsEven betterTrack record of building infrastructure that supports rapid business growth and customer acquisitionStrong understanding of the operational challenges of securing customer-facing AI systemsA high degree of comfort digging into systems with deep technology stacks using any tool necessaryBenefits:Medical, dental, and vision benefitsTake what you need vacation policyDaily lunches, dinners and snacks in the office to keep you at your bestCompensation$200K – $300K + Offers Equity

MLOps / DevOps Engineer

Software Engineer

Apply

July 24, 2025

Hidden link

Staff Security Engineer

Decagon

101-200

USD

0

250000

-

350000

United States

Full-time

Remote

About DecagonDecagon is the leading conversational AI platform empowering every brand to deliver concierge customer experience. Our AI agents provide intelligent, human-like responses across chat, email, and voice, resolving millions of customer inquiries across every language and at any time.Since coming out of stealth, Decagon has experienced rapid growth. We partner with industry leaders like Hertz, Eventbrite, Duolingo, Oura, Bilt, Curology, and Samsara to redefine customer experience at scale. We've raised over $200M from Bain Capital Ventures, Accel, a16z, BOND Capital, A*, Elad Gil, and notable angels such as the founders of Box, Airtable, Rippling, Okta, Lattice, and Klaviyo.We’re an in-office company, driven by a shared commitment to excellence and velocity. Our values—customers are everything, relentless momentum, winner’s mindset, and stronger together—shape how we work and grow as a team.About the TeamThe Platform Engineering team at Decagon designs the proprietary orchestration layer that powers the most advanced conversational AI agents for enterprise customers across voice, chat, email and SMS. Decagon’s leading customer support agents understand context, respond with genuine empathy, and solve complex problems with surgical precision.Our mission is to deliver magical support experiences — AI agents working alongside human agents to help users resolve their issues.About the RoleLead the technical vision for securing Decagon AI's customer experience platform that powers AI agents for enterprise customers. You'll collaborate with our security leaders to architect our comprehensive security framework that protects against modern, AI-enabled threats while maintaining an enterprise-ready security and compliance posture. This role offers the opportunity to apply in-depth AI security expertise and develop enterprise software architecture. Decagon’s team is some of the best in the industry, so you’ll work alongside a skilled and enthusiastic team.In this role, you willDesign and implement our complete security framework covering customer-facing products, infrastructure security, and enterprise integration pointsWork with AI/ML teams to embed security into model training, fine-tuning, and deployment processes to empower our research teams to innovate safelyDefine security standards and practices for enterprise customers, including tenant isolation, data protection, and compliance frameworksLead security incidents affecting operations, or ensure rapid resolution while maintaining service availabilityMentor security engineers and broader the engineering organization while establishing engineering practices that scale with rapid team growthYour background looks something like thisHave 7+ years of industry experience in security engineeringExperience designing secure, multi-tenant systems that handle millions of concurrent conversationsKnowledge of Google Cloud security architecture, including VPC security, IAM, and enterprise-grade deployment patternsProven track record building security systems that meet enterprise expectations (SIEM, Vulnerability Management, IAM, compliance platforms)Experience with SOC 2, ISO 27001, GDPR, and other enterprise security requirementsEven betterPrior experience working with multi-modal modelsDeep knowledge of, fine-tuning security, model poisoning, and adversarial attacks specific to conversational AISkilled and motivated to take advantage of advanced reasoning and software development tools such as Cursor, Claude Code, and Gemini for personal productivity improvements and design leverage.Benefits:Medical, dental, and vision benefitsTake what you need vacation policyDaily lunches, dinners and snacks in the office to keep you at your bestCompensation$250K – $350K + Offers Equity

MLOps / DevOps Engineer

Software Engineer

Apply

July 24, 2025

Hidden link

Detection and Response Engineering Manager

Anthropic

1001-5000

USD

320000

-

405000

United States

Full-time

Remote

About Anthropic Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.About the role We are seeking a Detection and Response Engineering Manager to lead our Detection and Response teams in creating comprehensive Security Observability, Detection Lifecycle, and Security Incident Response programs for Anthropic. You will collaborate closely with teams and leaders across Anthropic, focusing on the observability, detection, investigation, incident response, and intelligence portions of the security lifecycle. You will collaborate closely with preventative security engineering teams and other cross-functional teams. Responsibilities: Manage and grow a high-performing D&R team, planning strategy and hiring to support Anthropic's rapid growth and unique AI safety requirements Navigate prioritization in a fast-paced frontier environment, balancing operational demands with building innovative, scalable solutions for the future Collaborate across security engineering teams to build comprehensive prevention, observability, detection, and response capabilities throughout the security lifecycle Facilitate development of scalable, AI-leveraged D&R solutions that enable self-service observability and detection capabilities across Anthropic Build partnerships with product, infrastructure, and research teams to instill security monitoring best practices Own and continuously improve Security Incident Response, Data Management, and Detection Engineering policies and playbooks Operate our threat intelligence program and maintain relationships with external security partners and information sharing communities Continuously drive capability maturity across the detection lifecycle, establishing metrics and KPIs to measure effectiveness Who you are: 5+ years building detection and response capabilities in a cloud-native organization 5+ years of engineering management experience with a proven track record of building and scaling security teams Deep understanding of security monitoring, threat detection, incident response, and forensics best practices Experienced in securing complex cloud environments (Kubernetes, AWS/GCP) with modern detection technologies Knowledgeable in AI/ML security risks, detection patterns, and response strategies Strong verbal and written communication skills with the ability to distill complex security topics Skilled at collaborating cross-functionally and effectively balancing security requirements with business objectives Able to drive high-impact work while incorporating feedback and adapting to changing priorities Passionate about building diverse, high-performing teams and growing engineers in a fast-paced environment Low ego, high empathy, and have a track record as a talent magnet who attracts and retains top security talent Deadline to apply: None. Applications will be reviewed on a rolling basis. The expected salary range for this position is:Annual Salary:$320,000—$405,000 USDLogistics Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience. Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices. Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this. We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed.  Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work. We think AI systems like the ones we're building have enormous social and ethical implications. We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team. How we're different We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We're an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication skills. The easiest way to understand our research directions is to read our recent research. This research continues many of the directions our team worked on prior to Anthropic, including: GPT-3, Circuit-Based Interpretability, Multimodal Neurons, Scaling Laws, AI & Compute, Concrete Problems in AI Safety, and Learning from Human Preferences. Come work with us! Anthropic is a public benefit corporation headquartered in San Francisco. We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues. Guidance on Candidates' AI Usage: Learn about our policy for using AI in our application process

MLOps / DevOps Engineer

Apply

July 23, 2025

Hidden link

Site Reliability Engineer

HappyRobot

11-50

-

Spain

Full-time

Remote

About HappyRobotHappyRobot is a platform to build and deploy AI workers that automate communication. See a demoOur AI workers connect to any system or data source to handle phone calls, email, messages…We target the logistics industry which relies heavily on communication to book, check on, & pay for freight. Primarily working with freight brokers, 3PLs, freight forwarders, shippers, warehouses, & other supply chain enterprises and tech startups.We raised a Series A round from a16z and YC and we’re growing very fast.We're looking for rockstars with a relentless drive, unstoppable energy, and a true passion for building something great—ready to embrace the challenge, push limits, and thrive in a fast-paced, high-intensity environment.About the RoleWe're looking for a Site Reliability Engineer to take the lead on scaling our operational resilience as we grow. You’ll own the stability, observability, and debugging workflows that keep our systems running smoothly. You'll be the go-to person for untangling complex failures in real time, designing tools that turn chaos into clarity, and helping us shift from reactive to proactive operations.This is a high-impact, high-trust role where you’ll shape how reliability is done - reducing incident load, building internal tooling, and directly improving developer focus and system uptime. If you love getting to the root of hard problems and making systems (and teams) stronger, this is your moment.Must-Have1+ years of hands-on experience debugging production systems (logs, traces, incidents, etc.)Strong problem-solving skills and ability to dive into unfamiliar backend codebasesComfort with Python and Go for reading code and writing small tools/utilitiesFamiliarity with observability and monitoring tools (e.g., Datadog, Prometheus, Sentry)Clear, calm communication under pressure — especially during live incidentsNice-to-HaveExperience working with distributed systems or services at scaleBuilt or maintained internal tooling for on-call teams or reliability workflowsFamiliarity with deployment pipelines, CI/CD, or infra-as-codeExperience improving system observability (e.g., custom metrics, traces, log pipelines)Why join us?Opportunity to work at a high-growth AI startup, backed by top investors.Fast Growth - Backed by a16z and YC, on track for double-digit ARR.Top-Tier Compensation - Competitive salary + equity in a high-growth startup.Ownership & Autonomy - Take full ownership of projects and ship fast.Work With the Best - Join a world-class team of engineers and builders.Our Operating Principles Extreme Ownership We take full responsibility for our work, outcomes, and team success. No excuses, no blame-shifting — if something needs fixing, we own it and make it better. This means stepping up, even when it’s not “your job.” If a ball is dropped, we pick it up. If a customer is unhappy, we fix it. If a process is broken, we redesign it. We don’t wait for someone else to solve it — we lead with accountability and expect the same from those around us. Craftsmanship Putting care and intention into every task, striving for excellence, and taking deep ownership of the quality and outcome of your work. Craftsmanship means never settling for “just fine.” We sweat the details because details compound. Whether it’s a product feature, an internal doc, or a sales call — we treat it as a reflection of our standards. We aim to deliver jaw-dropping customer experiences by being curious, meticulous, and proud of what we build — even when nobody’s watching. We are “majos” Be friendly & have fun with your coworkers. Always be genuine & honest, but kind. “Majo” is our way of saying: be a good human. Be approachable, helpful, and warm. We’re building something ambitious, and it’s easier (and more fun) when we enjoy the ride together. We give feedback with kindness, challenge each other with respect, and celebrate wins together without ego. Urgency with Focus Create the highest impact in the shortest amount of time. Move fast, but in the right direction. We operate with speed because time is our most limited resource. But speed without focus is chaos. We prioritize ruthlessly, act decisively, and stay aligned. We aim for high leverage: the biggest results from the simplest, smartest actions. We’re running a high-speed marathon — not a sprint with no strategy. Talent Density and Meritocracy Hire only people who can raise the average; ‘exceptional performance is the passing grade.’ Ability trumps seniority. We believe the best teams are built on talent density — every hire should raise the bar. We reward contribution, not titles or tenure. We give ownership to those who earn it, and we all hold each other to a high standard. A-players want to work with other A-players — that’s how we win. First-Principles Thinking Strip a problem to physics-level facts, ignore industry dogma, rebuild the solution from scratch. We don’t copy-paste solutions. We go back to basics, ask why things are the way they are, and rebuild from the ground up if needed. This mindset pushes us to innovate, challenge stale assumptions, and move faster than incumbents. It’s how we build what others think is impossible.The personal data provided in your application and during the selection process will be processed by Happyrobot, Inc., acting as Data Controller.By sending us your CV, you consent to the processing of your personal data for the purpose of evaluating and selecting you as a candidate for the position. Your personal data will be treated confidentially and will only be used for the recruitment process of the selected job offer.In relation to the period of conservation of your personal data, these will be eliminated after three months of inactivity in compliance with the GDPR and legislation on the protection of personal data.If you wish to exercise your rights of access, rectification, deletion, portability or opposition in relation to your personal data, you can do so through security@happyrobot.ai subject to the GDPR.For more information, visit https://www.happyrobot.ai/privacy-policyBy submitting your request, you confirm that you have read and understood this clause and that you agree to the processing of your personal data as described.

MLOps / DevOps Engineer

Apply

July 22, 2025

Hidden link

Senior ML Infrastructure Engineer

Hippocratic AI

101-200

-

United States

Full-time

Remote

About UsHippocratic AI is developing the first safety-focused Large Language Model (LLM) for healthcare. Our mission is to dramatically improve healthcare accessibility and outcomes by bringing deep healthcare expertise to every person. No other technology has the potential for this level of global impact on health.Why Join Our TeamInnovative mission: We are creating a safe, healthcare-focused LLM that can transform health outcomes on a global scale.Visionary leadership: Hippocratic AI was co-founded by CEO Munjal Shah alongside physicians, hospital administrators, healthcare professionals, and AI researchers from top institutions including El Camino Health, Johns Hopkins, Washington University in St. Louis, Stanford, Google, Meta, Microsoft and NVIDIA.Strategic investors: We have raised a total of $278 million in funding, backed by top investors such as Andreessen Horowitz, General Catalyst, Kleiner Perkins, NVIDIA’s NVentures, Premji Invest, SV Angel, and six health systems.Team and expertise: We are working with top experts in healthcare and artificial intelligence to ensure the safety and efficacy of our technology.For more information, visit www.HippocraticAI.com.We value in-person teamwork and believe the best ideas happen together. Our team is expected to be in the office five days a week in Palo Alto, CA unless explicitly noted otherwise in the job description.The Role:We are seeking a Machine Learning Infrastructure Engineer to design, build, and manage the next-generation training and inference platform for LLMs. You will be at the heart of building scalable, efficient infrastructure that supports our researchers and engineers in training, serving, and experimenting with large models at scale. Your work will directly impact our ability to innovate with new architectures and training techniques in production environments.Key Responsibilities:LLM Training Infrastructure: Design and operate large-scale training clusters using Kubernetes and/or Slurm for LLM experimentation, fine-tuning, and RLHF workflows. Cluster & GPU Management: Own scheduling, autoscaling, resource allocation, and monitoring across high-performance GPU clusters (NVIDIA, AMD). Distributed Systems: Build and optimize distributed data pipelines using frameworks like Ray, enabling parallel training and inference jobs. Inference Optimization: Benchmark and optimize model serving performance with technologies like vLLM, and support autoscaling of inference workloads in production environments. Platform Reliability: Collaborate with infra and platform engineers to ensure system robustness, observability, and maintainability of ML workloads. Research Enablement: Partner closely with ML researchers to enable rapid experimentation through flexible and efficient infrastructure tooling. Preferred Qualifications:5+ years of experience in infrastructure, MLOps, or systems engineering, ideally with time spent in architect or staff-level roles. Proven experience managing large-scale Kubernetes or Slurm clusters for training or serving ML workloads. Strong proficiency in Python; familiarity with Go or Rust is a plus. Hands-on experience with Ray, vLLM, Hugging Face Transformers, and/or custom LLM training stacks. Deep understanding of GPU scheduling, container orchestration, and workload optimization across heterogeneous hardware. Experience with inference workloads, benchmarking, latency optimization, and cost-performance tradeoffs. Familiarity with Reinforcement Learning, particularly RLHF frameworks, is a strong plus. Contributions to internal platforms that enabled others to train or fine-tune LLMs efficiently.Bonus Skills:Exposure to multiple hardware platforms (e.g., H100s, A100s, MI300X). Experience with managing storage, IOPS performance, and object store integration for ML data. Familiarity with building observability into ML pipelines (e.g., Prometheus, Grafana, Datadog). Ability to present infra systems/platforms to technical stakeholders.

MLOps / DevOps Engineer

Apply

July 18, 2025

Hidden link

SRE - Observability (Senior)

Lambda AI

501-1000

USD

267000

-

401000

United States

Full-time

Remote

Lambda is the #1 GPU Cloud for ML/AI teams training, fine-tuning and inferencing AI models, where engineers can easily, securely and affordably build, test and deploy AI products at scale. Lambda’s product portfolio includes on-prem GPU systems, hosted GPUs across public & private clouds and managed inference services – servicing government, researchers, startups and Enterprises world-wide. If you'd like to build the world's best deep learning cloud, join us. *Note: This position requires presence in our San Francisco office location 4 days per week; Lambda’s designated work from home day is currently Tuesday. Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance. What You’ll DoDeploy and operate observability platforms for logging, metrics, and distributed tracing.Automate the deployment and operation of these observability systems.Set up monitoring for modern AI/HPC clusters.Develop platform software to make observability adoptable and improve system reliability across Lambda engineering.Lead members of other engineering teams to design and develop solutions for their monitoring challenges.YouHave 8+ years of experience in software engineering, with 3+ years in GoHave 5+ years of experience in Site Reliability Engineering practicesPossess proven understanding of Observability tools and practicesHave experience with application deployment and monitoring using KubernetesHave experience building CI/CD pipelinesExpect quality and reliability from the solutions you buildEnjoy collaborating across team boundaries to help our engineering teams meet their observability needs.Nice to HaveExperience monitoring AI systems or HPC clustersExperience with Prometheus and writing queries in PromQLExperience with messaging systems like NATSUnderstanding of the OpenTelemetry ecosystem and experience with both OTel instrumentation and the OTel collectorExperience with network monitoring, Ethernet and InfinibandUnderstanding of dashboard design principlesStrong understanding of Linux fundamentals and system administration.Experience with infrastructure automation tooling such as Ansible and TerraformSalary Range InformationBased on market data and other factors, the annual salary range for this position is $267K-$401K. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description. About LambdaFounded in 2012, ~350 employees (2024) and growing fastWe offer generous cash & equity compensationOur investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitabilityOur research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOGHealth, dental, and vision coverage for you and your dependentsWellness and Commuter stipends for select roles401k Plan with 2% company match (USA employees)Flexible Paid Time Off Plan that we all actually useA Final Note:You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.Equal Opportunity EmployerLambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

MLOps / DevOps Engineer

Software Engineer

Apply

July 18, 2025

Hidden link

MLOps Software Engineer

Cohere

501-1000

0

-

0

United Kingdom

Full-time

Remote

Who are we?Our mission is to scale intelligence to serve humanity. We’re training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI.We obsess over what we build. Each one of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. We like to work hard and move fast to do what’s best for our customers.Cohere is a team of researchers, engineers, designers, and more, who are passionate about their craft. Each person is one of the best in the world at what they do. We believe that a diverse range of perspectives is a requirement for building great products.Join us on our mission and shape the future!Why this team?This team is responsible for building world-class infrastructure that is critical to all of Cohere’s success. Focus on stability, scalability, and observability are all paramount as this work acts as the foundation for all members of technical staff.Our team optimizes for a wide range of technical skillsets (some of which are outlined below). Being self-directed and adaptable, identifying and solving key problems are essential.Please Note: All of our infrastructure roles require participating in a 24x7 on-call rotation, where you are compensated for your on-call schedule.As a Software Engineer, you will:Build self-service systems that automate managing, deploying and operating services.This includes our custom Kubernetes operators that support language model deployments.Automate environment observability and resilience. Enable all developers to troubleshoot and resolve problems.Take steps required to ensure we hit defined SLOs, including participation in an on-call rotation.Build strong relationships with internal developers and influence the Infrastructure team’s roadmap based on their feedback.Develop our team through knowledge sharing and an active review process.You may be a good fit if:You have proven production experience with Kubernetes.You’re skilled in managing GPU-accelerated workloads within Kubernetes environments, with experience in scaling and debugging cloud-based infrastructureYou have hands-on coding experience developing services and automated tests (experience with Go and Python is plus!).You prefer contributing to Open Source solutions rather than building solutions from the ground up.You have experience scaling and debugging cloud-based infrastructure (such as Oracle, GCP, and Coreweave).You draw motivation from building systems that help others be more productive.You see mentorship, knowledge transfer, and review as essential prerequisites for a healthy team.If some of the above doesn’t line up perfectly with your experience, we still encourage you to apply! If you want to work really hard on a glorious mission with teammates that want the same thing, Cohere is the place for you.We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.Full-Time Employees at Cohere enjoy these Perks:🤝 An open and inclusive culture and work environment 🧑‍💻 Work closely with a team on the cutting edge of AI research 🍽 Weekly lunch stipend, in-office lunches & snacks🦷 Full health and dental benefits, including a separate budget to take care of your mental health 🐣 100% Parental Leave top-up for 6 months for employees based in Canada, the US, and the UK🎨 Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement🏙 Remote-flexible, offices in Toronto, New York, San Francisco and London and co-working stipend✈️ 6 weeks of vacationNote: This post is co-authored by both Cohere humans and Cohere technology.

MLOps / DevOps Engineer

Software Engineer

Apply

July 18, 2025

Hidden link

DevOps Engineer

Arize AI

101-200

USD

0

100000

-

185000

United States

Full-time

Remote

The Opportunity AI is rapidly transforming the world. Whether it’s developing the next generation of human-level intelligence, enhancing voice assistants, or enabling researchers to analyze genetic markers at scale, AI is increasingly integrated into various aspects of our daily lives. Arize AI is the leading AI observability and evaluation platform, empowering AI engineers to build and deploy high-performing, reliable models. As the AI landscape shifts from traditional ML to generative AI and agentic systems, Arize ensures teams have the tools to monitor, troubleshoot, and improve AI in production. The Team Our On-Prem engineering team is responsible for the deployment of Arize in customer environments. In addition to working with customers in defining infrastructure requirements, the team designs and develops software and tooling that enables the management of these systems at large scale. The On-Prem team has grown to be expert in Kubernetes and cloud deployment on GCP, Azure, and AWS as well as dealing with networking and security aspects of on-premise deployments. The team is dynamic and relies on few talented individuals with a high degree of autonomy and initiative. What You’ll Do Work hands-on with the infrastructure that supports our distributed & highly scalable services in both SaaS and on-prem offerings Gather requirements from customers and adapt manifests and software to support new environments Use and augment monitoring tools to observe platform health, ensure performance and reliability Interact with the product team to test new features and package new on-prem releases Automate and optimize the release pipeline to make it as frictionless as possible Exhibit continuous curiosity for emerging technology that could solve our challenges What will set you apart: 3+ years of experience as a DevOps Engineer, Cloud Engineer, Infrastructure Engineer or similar Excellent communication skills and ability to work directly with customers to understand and address their infrastructure needs Experience and fluency in Kubernetes A self starter with an ability to thrived in a fast paced environment Experience working with multiple cloud providers (AWS, GCP, Azure) and understanding how to adapt cloud-native architectures for on-premises environments Strong troubleshooting skills The estimated annual salary for this role is between $100,000 - $185,000, plus a competitive equity package. Actual compensation is determined based upon a variety of job related factors that may include: transferable work experience, skill sets, and qualifications. Total compensation also includes a comprehensive benefit package, including: medical, dental, vision, 401(k) plan, unlimited paid time off, generous parental leave plan, and others for mental and wellness support. While we are a remote-first company, we have opened offices in New York City and the San Francisco Bay Area, as an option for those in those cities who wish to work in-person. For all other employees, there is a WFH monthly stipend to pay for co-working spaces.More About Arize Arize’s mission is to make the world’s AI work and work for the people. Our founders came together through a common frustration: investments in AI are growing rapidly across businesses and organizations of all types, yet it is incredibly difficult to understand why a machine learning model behaves the way it does after it is deployed into the real world. Learn more about Arize in an interview with our founders: https://www.forbes.com/sites/frederickdaso/2020/09/01/arize-ai-helps-us-understand-how-ai-works/#322488d7753c   Diversity & Inclusion @ Arize Our company's mission is to make AI work and make AI work for the people, we hope to make an impact in bias industry-wide and that's a big motivator for people who work here. We actively hope that individuals contribute to a good culture Regularly have chats with industry experts, researchers, and ethicists across the ecosystem to advance the use of responsible AI Culturally conscious events such as LGBTQ trivia during pride month We have an active Lady Arizers subgroup

MLOps / DevOps Engineer

Apply

July 17, 2025

Hidden link

Top AI MLOps / DevOps Engineer Jobs Openings in 2025

Latest AI Jobs