Top MLOps / DevOps Engineer Jobs Openings in 2025

Looking for opportunities as an MLOps / DevOps Engineer? This curated list features the latest MLOps / DevOps Engineer job openings at AI-native companies. Whether you're an experienced professional or just entering the field, find roles that match your expertise, from startups to global tech leaders. Updated every day.


Anti-Fraud & Abuse Engineer (Europe)

Perplexity
Serbia, Germany, United Kingdom
Remote: No
Perplexity is an AI-powered answer engine founded in December 2022 and growing rapidly as one of the world’s leading AI platforms. Perplexity has raised over $1B in venture investment from some of the world’s most visionary and successful leaders, including Elad Gil, Daniel Gross, Jeff Bezos, Accel, IVP, NEA, NVIDIA, Samsung, and many more. Our objective is to build accurate, trustworthy AI that powers decision-making for people and assistive AI wherever decisions are being made. Throughout human history, change and innovation have always been driven by curious people. Today, curious people use Perplexity to answer more than 780 million queries every month – a number that’s growing rapidly for one simple reason: everyone can be curious.

Perplexity is seeking a highly skilled, experienced, and hands-on Anti-Fraud & Abuse Engineer to join our dynamic security team, revolutionizing the way people search and interact with the internet. You will be responsible for designing, implementing, and operating cutting-edge monitoring and detection systems to identify and prevent fraudulent behaviors and abuse of our products and services.
Responsibilities
• Design, build, and operate monitoring and detection systems to identify and prevent fraudulent behaviors and abuse
• Perform adversary hunting to detect abuse and misuse of our products and services
• Analyze how our products and services are being misused or abused
• Track and mitigate cost impacts of fraud and abuse across our technology stack
• Research and anticipate emerging abuse techniques to stay ahead of evolving threats
• Build and maintain relationships with external threat intelligence partners and industry communities
• Establish best practices, tools, and processes for our fraud and abuse detection program

Qualifications
• Experience designing, implementing, and operating fraud detection, abuse prevention, or security monitoring systems
• Background in security, adversarial machine learning, threat intelligence, or fraud investigations
• Strong data analysis skills, able to identify patterns and anomalies within large datasets
• Understanding of both technical and behavioral signals to support effective fraud prevention strategies
• Excellent communication skills, able to present to both technical and non-technical audiences
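As a rough illustration of the "patterns and anomalies within large datasets" qualification (a toy sketch, not Perplexity's actual detection stack), a minimal z-score filter over per-account activity might look like:

```python
from statistics import mean, stdev

def flag_anomalies(values, threshold=3.0):
    """Return the values whose z-score exceeds the threshold.

    A toy statistical baseline; production abuse detection layers
    many behavioral and technical signals on top of anything this simple.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical per-account daily query counts with one clear outlier.
counts = [120, 95, 110, 130, 105, 98, 115, 5000]
print(flag_anomalies(counts, threshold=2.0))  # → [5000]
```

Real systems would stream this over rolling windows rather than fixed lists, but the core idea of flagging statistical outliers is the same.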
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply

Manager - Security Platform

Lambda AI
USD 297,000 – 495,000
United States
Full-time
Remote: No
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco or Seattle office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

About the Role
Lambda Security protects some of the world's most valuable digital assets: invaluable training data, model weights representing immense computational investments, and the sensitive inputs required to leverage best-of-breed AI models. We're responsible for securing every byte that powers breakthrough artificial intelligence.

As Manager of the Security Platform team, you'll build and lead a team of deeply security-aware software engineers who create the foundational tools and automation that enable Lambda to maintain security at scale without sacrificing velocity.

Reporting to the Senior Manager of Security, you'll lead 3-4 engineers building platforms that serve three critical constituencies: Detection & Response needs operational tooling, Security Architecture needs automated enforcement of standards, and engineering teams need self-service capabilities that eliminate bottlenecks.

You'll have direct access to deploy and run state-of-the-art LLMs on Lambda's infrastructure – a unique advantage enabling you to build intelligent security platforms that learn, adapt, and protect at a scale only possible when you own the AI infrastructure.

Your success is measured in the problems other teams avoid because your platforms provide high-quality foundations. Your immediate focus will be on building your team, maturing our existing toolset, collecting requirements, and building a 6-12 month roadmap that aligns with the 2026 security strategic plan.

We're looking for engineering managers who pair deep technical intuition with product sensibility and team-building excellence. If you're energized by multiplying security impact through excellent tooling, treating internal teams as customers, and building platforms where the secure path is the easy path, we'd love to talk.

We value diverse backgrounds, experiences, and skills, and we are excited to hear from candidates who can bring unique perspectives to our team. If you do not exactly meet this description but believe you may be a good fit, please still apply and help us understand your readiness for this role. Your application is not a waste of our time.

What You'll Do

Team Leadership & Development
• Hire, develop, and retain a high-performing team of 3-4 deeply security-aware software engineers dedicated to building and operating production-grade security platforms.
• Foster a culture of technical excellence, collaboration, and continuous learning where engineers thrive through clear expectations, regular feedback, and opportunities for growth.
• Enable technical excellence through architectural and technical oversight, reviewing designs, assessing risk, and ensuring solutions are robust, maintainable, and scalable.
• Coach engineers in product thinking and platform ownership, helping them evolve from delivering features to managing services with measurable impact.
• Establish your team as the go-to platform builders, delivering solutions that meet any business need.

Security Platform Product Management
• Own the Security Platform roadmap, aligning it with Lambda's overall security and business objectives.
• Balance immediate security team needs against a strategic platform vision that yields compounding value.
• Proactively gather and meet customer needs, resulting in tools and platforms that are adopted voluntarily because they make secure development faster and easier.
• Operate as both technical and product leader, translating complex requirements into deliverable, incremental milestones that provide real impact early and often.

Platform & Technical Leadership
• Establish Lambda's security data foundations: log pipelines, telemetry frameworks, and data services that empower decision-making across security functions.
• Build self-service security capabilities for engineering teams: authentication frameworks, secrets management systems, policy enforcement tooling, and security APIs that make the secure path the easy path.
• Direct the development of compliance automation that streamlines evidence collection, attestation, and reporting.
• Partner with peer security teams to deliver operational tooling, incident response automation, and analysis systems that scale without toil.
• Ensure your team's platforms are reliable, monitored, and continuously improved based on real-world adoption and performance.

Organizational & Strategic Contributions
• Partner across the Security organization to deliver cohesive platform capabilities that enhance threat detection, prevention, and operational efficiency.
• Collaborate with engineering leadership to embed security into Lambda's development lifecycle, eliminating bottlenecks through self-service design and automation.
• Work effectively in a fast-moving startup environment where priorities shift, resources are constrained, and perfect solutions aren't possible. Make pragmatic trade-offs that deliver business value while building for the future.
• Communicate impact clearly to stakeholders who may not immediately see the connection between infrastructure work and business outcomes.
• Define measurable success criteria and maintain a 6-12 month roadmap that advances Lambda's security posture.

What We Think a Candidate Needs to Demonstrate to Succeed
• 5+ years of experience leading software engineering teams, OR 5+ years of software engineering experience with 3+ years in engineering management.
• Proven ability to build, lead, and scale technical teams that deliver complex, high-impact platforms.
• Strong architectural intuition and product management sensibility; you can set vision, make tradeoffs, and deliver tools engineers love to use.
• Experience building security platforms, including understanding what makes them succeed or fail.
• Understanding of securing complex infrastructure environments, ideally including bare metal and cloud platforms, both as a consumer and a provider.
• Experience building automation-first systems that improve both velocity and security outcomes.
• Skilled communicator who can translate technical risk into business impact and build alignment across diverse stakeholders.
• Ability to thrive in a high-speed, high-ambiguity environment where you balance building for today while preparing for tomorrow's needs.

Nice to Have
• Experience building security platforms like SIEM/SOAR systems, security orchestration tools, compliance automation, or security data infrastructure.
• Deep familiarity with security platforms (e.g. Splunk/Elastic/Chronicle), secure-by-default infrastructure-as-code libraries (e.g. Terraform/CloudFormation), CI/CD systems, or secrets management (e.g. Vault, AWS Secrets Manager).
• Experience building data pipelines, analytics platforms, or data lakes for security use cases.
• Deep familiarity with SOC 2, ISO 27001, or similar compliance frameworks.
• Background in platform engineering, SRE, or building internal developer platforms.
• Experience enabling deep adoption of zero-trust architecture principles or building identity-centric security platforms.
• Excitement about leveraging Lambda's access to state-of-the-art LLMs for AI-powered security automation, orchestration, and analytics.

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
• Founded in 2012, ~400 employees (2025) and growing fast
• We offer generous cash & equity compensation
• Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove
• We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
• Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
• Health, dental, and vision coverage for you and your dependents
• Wellness and Commuter stipends for select roles
• 401k Plan with 2% company match (USA employees)
• Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer.
Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply

Controls Engineer

xAI
United States
Remote: No
About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role
The xAI datacenter control systems team is looking for a highly motivated Controls Engineer to support the operation, design, and optimization of our advanced data center infrastructure. This person will play a key role in maintaining the stability and efficiency of the systems that power xAI's groundbreaking AI research. The Controls Engineer will collaborate with cross-functional teams to monitor and improve the performance of the data center throughout its lifecycle. This individual will work closely with Operations and Maintenance to ensure the systems are running smoothly, from day-to-day management to long-term upgrades and expansions.
Responsibilities
• Executing the implementation and modification of Building Management System (BMS), Electrical Power Management System (EPMS), and any other SCADA systems critical for the operation of current and future data centers
• Overseeing the operation of critical systems such as chillers, pumps, and power distribution, using PLCs and other industrial control technologies
• Preparing specifications for Human-Machine Interface (HMI), including a graphics standard for interpretation by Grok
• Implementing and maintaining automated controls for fire safety and emergency shutdown procedures (ESOP)
• Developing a corporate-level reporting dashboard to monitor operational trends such as power usage effectiveness (PUE) and performance/cost, and dynamically adjusting infrastructure set points for optimal performance
• Coordinating with utility providers to automate demand and operational responses to upstream grid events

Required Qualifications
• Bachelor's Degree in an Engineering discipline (Controls and Computer Control Systems, Computer Science, Mechatronic Engineering, Automation Engineering, etc.), or equivalent experience
• 5+ years of experience as a hands-on Controls Engineer working with industrial control systems
• Ignition SCADA programming experience
• Siemens TIA Portal PLC programming experience

Physical Requirements
• Ability to work for extended periods of time standing, when needed
• Work is often performed in tight quarters, and physical dexterity is necessary to perform job functions
• Comfortable working in an environment requiring exposure to noise
• Ability to lift or carry 10-15 lbs
• Ability to work evenings and weekends as needed
• Position is subject to pre-employment drug testing and random drug and alcohol testing

Preferred Qualifications
• Advanced proficiency in programming industrial equipment in all IEC 61131-3 languages, with a focus on structured text and ladder logic
• Experience with industrial control hardware from leading manufacturers (e.g. Siemens, SEL, Rockwell)
• Familiarity with Human Machine Interface (HMI / GUI) development using platforms such as Ignition, SEL, etc.
• Methodical troubleshooting approach
• Strong documentation and communication skill set
• Technical project management mindset
• Team player, takes initiative, and highly motivated

xAI is an equal opportunity employer.

California Consumer Privacy Act (CCPA) Notice
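The responsibilities above mention monitoring power usage effectiveness (PUE). PUE is simply total facility power divided by IT equipment power, with 1.0 as the theoretical ideal; a minimal sketch with hypothetical readings (not xAI's tooling):

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power divided by
    IT equipment power. 1.0 is the theoretical ideal; typical data
    centers land somewhere above it."""
    if it_equipment_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_equipment_kw

# Hypothetical readings: 13 MW total facility draw, 10 MW IT load.
print(round(pue(13_000, 10_000), 2))  # → 1.3
```

A reporting dashboard like the one described would track this ratio over time and feed it back into set-point adjustments.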
MLOps / DevOps Engineer
Data Science & Analytics
Apply

Senior Site Reliability Engineer - Named Accounts

Lambda AI
USD 240,000 – 425,000
United States
Full-time
Remote: No
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our upcoming Seattle office location or on-site with strategic customers 4 days per week; Lambda’s designated work from home day is currently Tuesday.

About the Role
We're looking for a Forward Deployed Engineer to embed directly with a strategic customer, serving as the technical bridge between Lambda and their team. You'll work where model performance matters most, delivery timelines are urgent, and ambiguity is the default state. Your job is to map problems, structure delivery paths, and ship solutions that create measurable impact.

What You'll Do

Customer Engagement
• Embed on-site with a named strategic customer, becoming an extension of their team
• Act as the primary technical liaison between Lambda and the customer organization
• Navigate ambiguous requirements to identify root problems and define clear technical solutions
• Drive alignment across internal Lambda teams and customer stakeholders

Technical Delivery
• Scope, sequence, and build full-stack solutions that deliver measurable business value
• Design and implement infrastructure optimizations for AI/ML workloads at scale
• Debug complex distributed systems issues across the infrastructure stack
• Ship iteratively and learn fast, adjusting approach based on customer feedback and results

Strategic Impact
• Identify reusable patterns from customer engagements that can scale across Lambda's customer base
• Surface field intelligence that influences Lambda's product roadmap
• Document and share learnings to elevate the capabilities of the broader team
• Represent Lambda with executive presence in high-stakes customer interactions

About You

Must-Have
• 6+ years of experience in an SRE, software engineer, or similar role, with deep knowledge of running Linux clusters and systems
• Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators
• Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar)
• Hands-on experience with AI/ML workload management tools (Volcano, Kubeflow, or similar)
• Can work either independently with limited direction or as part of a team
• Familiarity with observability tools like Prometheus, Grafana, FluentBit, and CI/CD pipelines
• Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar
• Excellent communication skills with the ability to translate technical complexity for diverse audiences
• Executive presence and ability to represent Lambda in customer-facing situations
• Comfort operating in ambiguous environments with competing priorities
• Strong bias for action and shipping iteratively

Nice-to-Have
• Deep Kubernetes expertise: CRDs, CSI, CNI, Kubernetes Operator coding experience
• Exposure to HPC clusters, AI/ML workloads, or large-scale GPU clusters
• Hybrid or multi-cloud Kubernetes environment experience
• Contributions to CNCF projects or Kubernetes SIGs

Why Join Us
• Work on cutting-edge Managed Kubernetes platforms for AI/ML workloads
• Influence the platform roadmap and help shape operations and reliability best practices
• Collaborate with highly skilled engineers
• Opportunity to mentor and grow within a fast-growing, technology-driven environment

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
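The must-have list mentions Prometheus-based observability. As a hedged sketch of the underlying mechanics (not Lambda's stack, and all metric names hypothetical), the Prometheus text exposition format that a custom exporter emits can be rendered with plain string handling:

```python
def prom_gauge(name, value, labels=None, help_text=""):
    """Render a single gauge metric in the Prometheus text
    exposition format: optional # HELP / # TYPE lines followed by
    name{label="value"} sample lines."""
    label_str = ""
    if labels:
        # Labels are sorted for a deterministic, diff-friendly output.
        inner = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + inner + "}"
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} gauge")
    lines.append(f"{name}{label_str} {value}")
    return "\n".join(lines)

# Hypothetical node-health gauge for a GPU cluster.
print(prom_gauge("gpu_node_ready", 1,
                 labels={"cluster": "train-01", "node": "gpu-17"},
                 help_text="1 if the node passed its health checks"))
```

In practice an exporter serves many such lines over HTTP for Prometheus to scrape; client libraries handle the formatting, but knowing the wire format helps when debugging scrapes.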
About Lambda
• Founded in 2012, ~400 employees (2025) and growing fast
• We offer generous cash & equity compensation
• Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove
• We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
• Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
• Health, dental, and vision coverage for you and your dependents
• Wellness and Commuter stipends for select roles
• 401k Plan with 2% company match (USA employees)
• Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply

Cloud & Automation Engineer - Cosmos

Infinity Constellation
EUR 25,000 – 85,000
Poland, Ukraine, Ireland
Full-time
Remote: Yes
About Cosmos
We’re not your average MSP. We’re building an AI-first IT platform where humans and intelligent agents work together to keep companies running faster, smarter, and cheaper than anyone thought possible.

Cosmos is a new Infinity Constellation venture – an AI-powered Managed Services company designed from the ground up for automation, observability, and Zero Trust. No legacy bloat. No ticket silos. Just clean architecture and code that matters.

If you want to babysit servers, go somewhere else. If you want to build the backbone of how AI runs IT, keep reading.

The Role
We’re looking for a Cloud & Automation Engineer who lives at the intersection of systems, code, and chaos. You’ll be the one making sure everything just works: clouds stay up, agents stay smart, and users stay happy.

You’ll:
• Build, scale, and harden cloud environments (AWS + GCP).
• Automate identity, provisioning, and access, because tickets are for amateurs.
• Jump in on escalations that make other people sweat.
• Write clean, reusable scripts to kill manual work forever.
• Make the infrastructure invisible, reliable, and ridiculously fast.

This is part ops, part dev, part automation wizard. You’ll be in the guts of the platform, the one who makes the lights never go out.

What You’ll Do
1. Cloud Infrastructure & Automation
• Own AWS and GCP environments: compute, storage, networking, IAM, monitoring, and costs.
• Automate provisioning and identity using APIs, SCIM, and whatever scripts get the job done.
• Build integrations that let AI agents deploy, fix, and scale environments autonomously.
• Write Python, Bash, or PowerShell scripts that make everyone’s life easier.

2. Systems Reliability & Performance
• Keep infrastructure stable, secure, and observable: CloudWatch, GCP Ops Suite, whatever works.
• Run backups, test disaster recovery, and make sure you can sleep at night.
• Patch, monitor, log, repeat, but smarter every time.
• Troubleshoot the hard stuff: auth loops, permissions, weird edge cases no one’s seen before.

3. Endpoint & Identity Management
• Manage endpoints across macOS, Windows, and Linux using Zero Trust principles.
• Build secure baselines that scale across hundreds of devices.
• Work closely with our Security team to make “secure by default” more than a slogan.

4. Collaboration & Mentorship
• Act as L2/L3 escalation for Help Desk and onboarding.
• Write documentation that people actually want to read.
• Share what you learn, and raise the team’s IQ every week.

Who You Are
• You’ve got 5+ years managing systems and cloud infrastructure: AWS, GCP, both, whatever.
• You code or script your way out of every problem (Python, Bash, or PowerShell).
• You think in automation, not checklists.
• You know networking, IAM, and security inside out.
• You’ve lived through a few “oh sh*t” incidents and learned how to prevent the next one.
• You speak fluent troubleshooting; logs don’t scare you.
• You want to be part of something early, messy, and meaningful.

Location & Schedule
This is a remote, global role, but you’ll work primarily on New York (EST) hours. We don’t care where you live, just that you can think fast, write clearly, and deliver results.

Bonus Points
• Terraform, Ansible, or configuration management experience.
• Certs like AWS Solutions Architect or GCP Cloud Engineer.
• Experience in MSP or multi-tenant environments.
• Familiarity with SOC 2, ISO 27001, or GDPR compliance.
• Obsessed with Zero Trust, automation, and clean architecture.

Why You’ll Love It Here
• You’ll work where AI and cloud collide, building systems that literally learn.
• You’ll automate problems faster than they appear.
• You’ll never get stuck waiting for permission to fix what’s broken.
• You’ll be part of a team that values speed, reliability, and doing things right.

This isn’t maintenance. It’s evolution. If uptime, automation, and adrenaline sound like your kind of stack, welcome to Cosmos.

Salary and Compensation
Approximate Annual Salary Range*
Poland: EUR 35,000 – 65,000
Romania: EUR 30,000 – 55,000
Bulgaria: EUR 28,000 – 50,000
Ukraine: EUR 25,000 – 45,000
Ireland: EUR 55,000 – 85,000
Other Eastern EU Markets: EUR 30,000 – 60,000
*Ranges vary based on experience, certifications, and local cost of living, and may also include an additional bonus on top of base.

What We Offer
• Competitive compensation and performance-based incentives.
• Flexible remote or hybrid work options.
• Opportunities to work with AWS, GCP, JumpCloud, Halo ITSM, and Google Workspace.
• Professional development and support for certification.
• Collaborative team culture with a focus on reliability, security, and innovation.
• Exposure to complex multi-tenant and automation-driven environments.

You’ll be part of a team building the next generation of AI-assisted cloud operations, combining automation, observability, and Zero Trust to reinvent how IT is delivered.
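The role calls for automating provisioning and identity via SCIM. As an illustrative sketch (a hypothetical helper, not Cosmos code), a SCIM 2.0 core User resource as defined in RFC 7643 can be built like this before POSTing it to an identity provider's /Users endpoint:

```python
import json

SCIM_USER_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"

def scim_user_payload(user_name, given, family, email):
    """Build a minimal SCIM 2.0 core User resource (RFC 7643).

    Real provisioning flows add enterprise-extension attributes,
    group memberships, and provider-specific fields on top of this.
    """
    return {
        "schemas": [SCIM_USER_SCHEMA],
        "userName": user_name,
        "name": {"givenName": given, "familyName": family},
        "emails": [{"value": email, "primary": True}],
        "active": True,
    }

payload = scim_user_payload("ada", "Ada", "Lovelace", "ada@example.com")
print(json.dumps(payload, indent=2))
```

The same payload shape, with `"active": False`, is typically used for deprovisioning via PATCH or PUT, which is what makes SCIM attractive for lifecycle automation.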
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply

IT Engineer

Helsing
United States
Full-time
Remote: No
Who we are
Helsing is a defense AI company. Our mission is to protect our democracies. We aim to achieve technological leadership, so that open societies can continue to make sovereign decisions and control their ethical standards.

As democracies, we believe we have a special responsibility to be thoughtful about the development and deployment of powerful technologies like AI. We take this responsibility seriously.

We are an ambitious and committed team of engineers, AI specialists and customer-facing program managers. We are looking for mission-driven people to join our European teams – and apply their skills to solve the most complex and impactful problems. We embrace an open and transparent culture that welcomes healthy debates on the use of technology in defense, its benefits, and its ethical implications.

The role
As an IT Engineer, you will be responsible for managing Helsing’s US IT and infrastructure. You will work across teams and geographies to establish secure and trusted infrastructure for collaborative work efforts focused on the transfer, development, and delivery of defense technologies in alignment with applicable regulations, standards, and industry best practices. You will be an essential part of Helsing’s ability to deliver complex systems that answer the challenges of tomorrow’s battlefields.
The day-to-day
• Configuring and operating endpoints, including workstations, laptops and servers
• Supporting an office of about 10-25 employees
• Setting up and maintaining on-prem compute environments built on Linux
• Configuring and operating our network and VPN infrastructure
• Partnering with Helsing's broader IT team on the operation and continuous optimization of Helsing US’s corporate, development, and customer environments, built primarily using Microsoft 365 and Azure alongside other SaaS solutions

You should apply if you
• Have 5+ years of experience in IT infrastructure and information security (ideally in the defense industry)
• Have experience in an IT engineering or SysAdmin role, particularly working with macOS clients, AWS Cloud environments, Linux servers, and capabilities like Kubernetes
• Have managed modern network infrastructure, ideally including administration of network- and host-based security tools
• Employ strong problem-solving, critical thinking, and analytical skills combined with the ability to find creative solutions
• Share our values: ownership, initiative, dedication to mission, speed and inclusiveness
• Are collaborative, humble, intellectually curious, and driven to solve hard problems
• Hold a current security clearance (ideally Top Secret)
• Feel strongly about the right of democracies to defend their sovereignty through the fielding of capabilities that bolster deterrence and decisive action

Note: We operate in an industry where women, as well as other minority groups, are systematically under-represented. We encourage you to apply even if you don’t meet all the listed qualifications; ability and impact cannot be summarized in a few bullet points.

Nice to Have
• Experience working with the Microsoft M365 stack
• Experience working with SIEM stacks
• Experience working with Mobile Device Management tools like Intune, Jamf, or JumpCloud
• Experience working with PowerShell, Python, or IaC tooling
• Information security expertise, especially related to regulations specific to CMMC and ITAR

Join Helsing and work with world-leading experts in their fields
• Helsing’s work is important. You’ll be directly contributing to the protection of democratic countries while balancing both ethical and geopolitical concerns.
• The work is unique. We operate in a domain that has highly unusual technical requirements and constraints, and where robustness, safety, and ethical considerations are vital. You will face unique Engineering and AI challenges that make a meaningful impact in the world.
• Our work frequently takes us right up to the state of the art in technical innovation, be it reinforcement learning, distributed systems, generative AI, or deployment infrastructure. The defense industry is entering the most exciting phase of the technological development curve. Advances in our field are not incremental: Helsing is part of, and often leading, historic leaps forward.
• In our domain, success is a matter of order-of-magnitude improvements and novel capabilities. This means we take bets, aim high, and focus on big opportunities. Despite being a relatively young company, Helsing has already been selected for multiple significant government contracts.
• We actively encourage healthy, proactive, and diverse debate internally about what we do and how we choose to do it. Teams and individual engineers are trusted (and encouraged) to practice responsible autonomy and critical thinking, and to focus on outcomes, not conformity. At Helsing you will have a say in how we (and you!) work, the opportunity to engage on what does and doesn’t work, and to take ownership of aspects of our culture that you care deeply about.

What we offer
• A focus on outcomes, not time-tracking
• A generous compensation and benefits package (in addition to base salary) that includes, but may not be limited to, insurance coverage (medical and travel), flexible paid time off, paid holidays, and remote and/or hybrid work available depending on position. All compensation and benefits are subject to the terms and conditions of the underlying plans or programs, as applicable and as may be amended, terminated or superseded from time to time.

Helsing is an Equal Opportunity Employer. We will consider all qualified applicants without regard to race, color, sex, sexual orientation, gender identity, national origin, age, disability, protected veteran status, genetics, or any other characteristic protected by applicable federal, state, or local law.

Helsing's Candidate Privacy and Confidentiality Regime can be found here.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Forward Deployed Engineer, DevOps - (f/m/d)

Parloa
-
Germany
Full-time
Remote
false
YOUR MISSION: As a Forward Deployed Engineer, DevOps, your mission is to build and scale the dedicated infrastructure that powers our custom agent integrations for our most important customers. You will be the key technical cornerstone of our team, empowering our customer-facing Forward Deployed Engineers to build, deploy, and manage bespoke client solutions with speed and confidence. By creating a robust, automated, and self-service platform, you will directly enable the rapid delivery of high-quality, reliable custom services.    IN THIS ROLE YOU WILL: Design, build, and maintain the cloud infrastructure specifically for hosting custom services developed by the Agent Integration Engineering team. Create flexible and reusable CI/CD pipeline templates that the integration team can easily adopt to automate the deployment of their services. Champion Infrastructure as Code to create standardized, yet customizable, service environments, ensuring consistency and rapid provisioning for new integrations. Manage a Kubernetes-based environment tailored for hosting a multitude of diverse integration services, focusing on security, isolation, and resource management. Implement robust monitoring and observability solutions to provide the integration team with deep visibility into the performance and health of their specific services. Serve as the primary DevOps partner for the Forward Deployed Engineering team, understanding their workflow, anticipating their needs, and removing infrastructure-related obstacles. Develop automation scripts (Python, Go, Bash) to streamline the entire lifecycle of custom integration services, from creation to decommissioning. Support and modernize legacy Azure infrastructure, helping transition workloads to a standardized, Kubernetes-based platform aligned with Parloa’s Engineering practices.   
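To give a flavor of the lifecycle automation this role describes ("from creation to decommissioning"), here is a minimal Python sketch that builds the provisioning and teardown commands for one custom integration service. The service name, namespace layout, and Helm chart path are hypothetical illustrations, not Parloa's actual tooling.

```python
"""Sketch: lifecycle automation for a custom integration service.

All names (namespace scheme, chart path, service name) are assumptions
for illustration only.
"""
import shlex


def provision_commands(service: str, env: str = "prod") -> list[str]:
    """Build the shell commands that would provision one integration service."""
    ns = f"integrations-{env}"
    return [
        # Idempotent namespace creation via client-side dry-run + apply
        f"kubectl create namespace {ns} --dry-run=client -o yaml | kubectl apply -f -",
        # One shared chart, parameterized per service
        f"helm upgrade --install {shlex.quote(service)} ./charts/integration "
        f"--namespace {ns} --set service.name={shlex.quote(service)}",
    ]


def decommission_commands(service: str, env: str = "prod") -> list[str]:
    """Build the commands that would tear the same service back down."""
    ns = f"integrations-{env}"
    return [f"helm uninstall {shlex.quote(service)} --namespace {ns}"]


if __name__ == "__main__":
    for cmd in provision_commands("crm-sync") + decommission_commands("crm-sync"):
        print(cmd)
```

In a real platform these command plans would feed a CI/CD pipeline rather than be executed directly, so each lifecycle step stays reviewable and repeatable.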
OUR TECH STACK Backend: TypeScript, Python, Node.js Infrastructure: Terraform, Azure, Kubernetes, CI/CD Pipelines Databases: MongoDB, CosmosDB, PostgreSQL Monitoring & Observability: Prometheus, Grafana, OpenTelemetry, ElasticSearch, Kibana, Datadog AI & Data: LLMs, Prompt Engineering, RAG WHAT YOU BRING TO THE TABLE: 5+ years of professional experience in DevOps, infrastructure engineering, or a similar role. A strong customer-first mindset, viewing the Forward Deployed Engineering team as your primary customer. Proficiency in at least one scripting or programming language like Python, Go, or Bash. Hands-on expertise with a major cloud provider (AWS, GCP, or Azure). Advanced knowledge of containerization and orchestration, specifically Kubernetes, Helm, and Docker. Proven experience creating flexible CI pipelines with tools like Jenkins, GitHub Actions, or GitLab CI, and CD/GitOps workflows with tools like Argo CD. A strong background in Infrastructure as Code (Terraform is highly preferred). Experience building self-service tools and platforms that empower specific development teams. WHAT'S IN IT FOR YOU: Join a diverse team of 40+ nationalities with flat hierarchies and a collaborative company culture. Opportunity to build and scale your career at the intersection of customer-facing roles and engineering in a dynamic startup on its journey to become an international leader in SaaS platforms for Conversational AI. Deutschland ticket, Urban Sports Club, Job Rad, Nilo Health, weekly sponsored office lunches. Competitive compensation and equity package. Flexible working hours, 28 vacation days, and workation opportunities. Access to a training and development budget for continuous professional growth. Regular team events, game nights, and other social activities. Work from home and/or our beautiful office(s) in the heart of Berlin or Munich with adjustable desks, social areas, fresh fruits, cereals, and drinks.
Your recruiting process at Parloa: Recruiter video call → Meet your manager → Challenge Task → Leadership Assessment → Bar Raiser Interview Why Parloa? Parloa is one of the fastest growing startups in the world of Generative AI and customer service. Parloa’s voice-first GenAI platform for contact centers is built on the best AI technology to automate customer service with natural-sounding conversations for outstanding experiences on all communication channels. Leveraging natural language processing (NLP) and machine learning, Parloa creates intelligent phone and chat solutions for businesses that turn contact centers into value centers by boosting customer service efficiency. The Parloa platform resolves the majority of customer queries quickly and automatically, allowing human agents to focus on complex issues and relationships. Parloa was founded in 2018 by Malte Kosub and Stefan Ostwald and today employs over 400 people in Berlin, Munich, and New York. When you join Parloa, you become part of a dynamic and innovative team made up of over 34 nationalities that’s revolutionizing an entire industry. We’re passionate about growing together and creating opportunities for personal and professional development. With our recent $120 million Series C investment, we’re expanding globally and looking for talented individuals to join us on this exciting journey. Do you have questions about Parloa, the role, or our team before you apply? Please feel free to get in touch with our Hiring Team. Parloa is committed to upholding the highest data protection standards for our clients' and employees' data. All our employees are instrumental in ensuring the utmost care, GDPR, and ISO compliance, including ISO 27001, in handling sensitive information. We provide equal opportunities to all qualified applicants regardless of race, gender, sexual orientation, age, religion, national origin, disability status, socioeconomic background, and other characteristics.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Tech Lead, AI Compute Infrastructure

HeyGen
-
United States
Canada
Full-time
Remote
false
About HeyGen At HeyGen, our mission is to make visual storytelling accessible to all. Over the last decade, visual content has become the preferred method of information creation, consumption, and retention. But the ability to create such content, in particular videos, continues to be costly and challenging to scale. Our ambition is to build technology that equips more people with the power to reach, captivate, and inspire audiences. Learn more at www.heygen.com. Visit our Mission and Culture doc here. We are seeking a seasoned Technical Leader to build and scale the foundational compute infrastructure that powers our state-of-the-art AI models—from multimodal training data pipelines to high-throughput, low-latency video generation. Responsibilities You will be the core engineer responsible for building the robust, efficient, and scalable platform that enables our research and production teams to rapidly iterate on HeyGen's generative video models. Your contributions will directly impact model performance, developer productivity, and the final quality of every AI-generated video. Optimize GPU Utilization: Design and implement mechanisms to aggressively optimize GPU and cluster utilization across thousands of devices for inference, training, data processing, and large-scale deployment of our state-of-the-art video generation models. Develop Large-Scale AI Job Framework: Build highly scalable, reliable frameworks for launching and managing massive, heterogeneous compute jobs, including multi-modal high-volume data ingestion/processing, distributed model training, and continuous evaluation/benchmarking. Enhance Observability: Develop world-class observability, tracing, and visualization tools for our compute cluster to ensure reliability and diagnose performance bottlenecks (e.g., memory, bandwidth, communication).
Accelerate Pipelines: Collaborate closely with AI researchers and AI engineers to integrate innovative acceleration techniques (e.g., custom CUDA kernels, distributed training libraries) into production-ready, scalable training and inference pipelines. Infrastructure Management: Champion the adoption and optimization of modern cloud and container technologies (Kubernetes, Ray) for elastic, cost-efficient scaling of our distributed systems. Minimum Requirements We are looking for a highly motivated engineer with deep experience operating and optimizing AI infrastructure at scale. Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience. 5+ years of full-time industry experience in large-scale MLOps, AI infrastructure, or HPC systems. Experience with data frameworks and standards like Ray, Apache Spark, LanceDB. Strong proficiency in Python and a high-performance language such as C++ for developing core infrastructure components. Deep understanding and hands-on experience with modern orchestration and distributed computing frameworks such as Kubernetes and Ray. Experience with core ML frameworks such as PyTorch, TensorFlow, or JAX. Preferred Qualifications Master's or PhD in Computer Science or a related technical field. Demonstrated Tech Lead experience, driving projects from conceptual design through to production deployment across cross-functional teams. Prior experience building infrastructure specifically for Generative AI models (e.g., diffusion models, GANs, or large language models) where cost and latency are critical. Proven background in building and operating large-scale data infrastructure (e.g., Ray, Apache Spark) to manage petabytes of multi-modal data (video, audio, text). Expertise in GPU acceleration and deep familiarity with low-level compute programming, including CUDA, NCCL, or similar technologies for efficient inter-GPU communication. What HeyGen Offers Competitive salary and benefits package.
Dynamic and inclusive work environment. Opportunities for professional growth and advancement. Collaborative culture that values innovation and creativity. Access to the latest technologies and tools.   HeyGen is an Equal Opportunity Employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.
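As a flavor of the cluster-observability work this role covers, here is a minimal, stdlib-only sketch that rolls per-GPU utilization samples up into a fleet-level summary with straggler detection. The metric shape and the 50% straggler threshold are illustrative assumptions, not HeyGen's actual telemetry schema.

```python
"""Sketch: summarizing per-GPU utilization for a cluster dashboard.

Metric names and thresholds are assumptions for illustration only.
"""
from statistics import mean


def summarize_utilization(samples: dict[str, list[float]],
                          straggler_pct: float = 50.0) -> dict:
    """samples maps device id -> recent utilization percentages (0-100).

    Returns the fleet-wide average and the devices running below the
    straggler threshold, which usually signal data-loading or
    communication bottlenecks worth tracing.
    """
    per_device = {dev: mean(vals) for dev, vals in samples.items()}
    fleet = mean(per_device.values())
    stragglers = sorted(d for d, u in per_device.items() if u < straggler_pct)
    return {"fleet_util": round(fleet, 1), "stragglers": stragglers}


if __name__ == "__main__":
    demo = {"gpu0": [92, 95], "gpu1": [88, 90], "gpu2": [20, 30]}
    print(summarize_utilization(demo))
```

In practice these rollups would be exported to a time-series backend (the posting mentions Prometheus-style stacks elsewhere on this page) rather than printed, but the aggregation logic is the same.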
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Forward Deployed Engineer, DevOps

Parloa
USD
225000
-
335000
United States
Full-time
Remote
false
YOUR MISSION: As a Forward Deployed Engineer, DevOps, your mission is to build and scale the dedicated infrastructure that powers our custom agent integrations for our most important customers. You will be the key technical cornerstone of our team, empowering our customer-facing Forward Deployed Engineers to build, deploy, and manage bespoke client solutions with speed and confidence. By creating a robust, automated, and self-service platform, you will directly enable the rapid delivery of high-quality, reliable custom services.    IN THIS ROLE YOU WILL: Design, build, and maintain the cloud infrastructure specifically for hosting custom services developed by the Agent Integration Engineering team. Create flexible and reusable CI/CD pipeline templates that the integration team can easily adopt to automate the deployment of their services. Champion Infrastructure as Code to create standardized, yet customizable, service environments, ensuring consistency and rapid provisioning for new integrations. Manage a Kubernetes-based environment tailored for hosting a multitude of diverse integration services, focusing on security, isolation, and resource management. Implement robust monitoring and observability solutions to provide the integration team with deep visibility into the performance and health of their specific services. Serve as the primary DevOps partner for the Forward Deployed Engineering team, understanding their workflow, anticipating their needs, and removing infrastructure-related obstacles. Develop automation scripts (Python, Go, Bash) to streamline the entire lifecycle of custom integration services, from creation to decommissioning. Support and modernize legacy Azure infrastructure, helping transition workloads to a standardized, Kubernetes-based platform aligned with Parloa’s Engineering practices.   WHAT YOU BRING TO THE TABLE: 5+ years of professional experience in DevOps, infrastructure engineering, or a similar role. 
A strong customer-first mindset, viewing the Forward Deployed Engineering team as your primary customer. Proficiency in at least one scripting or programming language like Python, Go, or Bash. Hands-on expertise with a major cloud provider (AWS, GCP, or Azure). Advanced knowledge of containerization and orchestration, specifically Kubernetes, Helm, and Docker. Proven experience creating flexible CI pipelines with tools like Jenkins, GitHub Actions, or GitLab CI, and CD/GitOps workflows with tools like Argo CD. A strong background in Infrastructure as Code (Terraform is highly preferred). Experience building self-service tools and platforms that empower specific development teams. WHAT'S IN IT FOR YOU: Join a diverse team of 40+ nationalities with flat hierarchies and a collaborative company culture, and enjoy an immersive onboarding experience in Berlin to dive into our product and culture. Opportunity to build and scale your career at the intersection of customer-facing roles and engineering in a dynamic startup on its journey to become an international leader in SaaS platforms for Conversational AI. A beautiful office with flair in the heart of NYC with all the conveniences, such as social area, snacks, and drinks. Competitive compensation and equity package. Flexible working hours, unlimited PTO, and travel opportunities. Access to a training and development budget for continuous professional growth. ClassPass membership, Nilo Health, health insurance, weekly sponsored office lunches. Regular team events, game nights, and other social activities. Hybrid work environment - we believe in hiring the best talent, no matter where they are based. However, we love to build real connections and want to welcome everyone in the office on certain days. Your recruiting process at Parloa: Recruiter video call → Meet your manager → Challenge Task → Bar Raiser Interview Salary Range (includes OTE): $225,000—$335,000 USD Why Parloa?
Parloa is one of the fastest growing startups in the world of Generative AI and customer service. Parloa’s voice-first GenAI platform for contact centers is built on the best AI technology to automate customer service with natural-sounding conversations for outstanding experiences on all communication channels. Leveraging natural language processing (NLP) and machine learning, Parloa creates intelligent phone and chat solutions for businesses that turn contact centers into value centers by boosting customer service efficiency. The Parloa platform resolves the majority of customer queries quickly and automatically, allowing human agents to focus on complex issues and relationships. Parloa was founded in 2018 by Malte Kosub and Stefan Ostwald and today employs over 400 people in Berlin, Munich, and New York. When you join Parloa, you become part of a dynamic and innovative team made up of over 34 nationalities that’s revolutionizing an entire industry. We’re passionate about growing together and creating opportunities for personal and professional development. With our recent $120 million Series C investment, we’re expanding globally and looking for talented individuals to join us on this exciting journey. Do you have questions about Parloa, the role, or our team before you apply? Please feel free to get in touch with our Hiring Team. Parloa is committed to upholding the highest data protection standards for our clients' and employees' data. All our employees are instrumental in ensuring the utmost care, GDPR, and ISO compliance, including ISO 27001, in handling sensitive information. We provide equal opportunities to all qualified applicants regardless of race, gender, sexual orientation, age, religion, national origin, disability status, socioeconomic background, and other characteristics.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
AI/HPC Network Development Engineer - Networking

xAI
-
United States
Ireland
Full-time
Remote
false
About xAI xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates. About the Role xAI was first in the world to build a 100k GPU cluster on an ethernet network and then did it again in 92 days, floors, walls and all. We need an engineer with deep experience in RoCEv2 who can develop at hyper scale while optimizing performance and availability. xAI is building at a furious pace with the latest hardware to help people understand the universe. To make the next significant leap forward, we need to own our own destiny by understanding our current network performance and availability and then optimizing it for our training models and how we execute customer inference queries. You will spend most of your days deep inside NCCL, building metric dashboards and tweaking configurations to ensure no performance is left on the table. You will help design the next iteration of our backend and front-end networks that will allow us to seamlessly build out new GPU infrastructure with little to no engineering assistance. There will be a significant amount of travel to Memphis for building more capacity as well as participating in a team on-call rotation and helping on other scaling and maintenance efforts.
This will become easier as we build out the team and engineers contribute to deployment and operations frameworks to remove repetitive tasks. Location We have 2 openings, one based in Palo Alto, California and the other in Dublin, Ireland. There will be significant travel expected to Memphis, Tennessee for data center buildouts and to the head office in Palo Alto for team collaboration. Required Qualifications A minimum of 10 years designing and operating large scale networks with 5 years in the ethernet AI/HPC space. Deep understanding of congestion control on ethernet with Infiniband an added bonus. Deep understanding of AI training and inference workloads and how they operate on the network. As part of this you are able to use and debug NCCL and potentially commit to the library. Expertise in creating a portfolio of metrics for performance and operations to optimize the fleet for training and inference traffic. Experience with Python to automate away repetitive tasks and facilitate your daily job working with and analyzing large sets of data. Interview Process After submitting your application, the team reviews your CV and statement of exceptional work. If your application passes this stage, you will be invited to an initial interview (45 minutes - 1 hour) during which a member of our team will ask some basic questions. If you clear the initial phone interview, you will enter the main process, which consists of five interviews: Coding assessment in a language of your choice. Data center network technologies and RoCEv2. Manager Interview. Meet and greet with the wider team where you will run through a presentation of a body of work you are proud of. 
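By way of illustration, NCCL tuning of the kind this role describes often starts with pinning environment knobs before a job launches. The variable names below are real NCCL settings, but the chosen values are assumptions for a generic ethernet/RoCE fabric, not xAI's configuration.

```python
"""Sketch: pinning NCCL settings for an ethernet/RoCE fabric before launch.

Variable names are real NCCL knobs; the values are illustrative
assumptions and depend on the fabric, NICs, and firmware in use.
"""
import os


def rocev2_nccl_env(ifname: str = "eth0") -> dict[str, str]:
    """Return NCCL environment settings commonly tuned on an ethernet fabric."""
    return {
        "NCCL_SOCKET_IFNAME": ifname,  # bind bootstrap/control traffic to the right NIC
        "NCCL_IB_DISABLE": "0",        # RoCE rides the IB verbs transport, so keep it on
        "NCCL_DEBUG": "INFO",          # surface ring/tree topology and transport at startup
        "NCCL_CROSS_NIC": "1",         # permit communication paths that cross NIC rails
    }


def apply_env(env: dict[str, str]) -> None:
    """Export the settings so a subsequently launched job inherits them."""
    os.environ.update(env)


if __name__ == "__main__":
    apply_env(rocev2_nccl_env("ens1f0"))
    print(os.environ["NCCL_SOCKET_IFNAME"])
```

Once exported, NCCL_DEBUG=INFO output from a test all-reduce is a common starting point for the metric dashboards the posting mentions.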
Benefits Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks. xAI is an equal opportunity employer. California Consumer Privacy Act (CCPA) Notice
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Apply
Information System Security Manager (ISSM), Public Sector

Scale AI
USD
162800
-
245300
United States
Full-time
Remote
false
Our Security team works on operational issues at the leading edge of machine learning technology. You will join a creative and solutions-oriented team collaborating with internal teams at Scale and externally with our customers. Scale is looking for an experienced security and compliance professional to support Assessment and Authorization and agency audit activities for Scale’s products that are offered in the US Government and global Public Sector space. We are looking for relentlessly curious, deliberately open-minded, and action-oriented generalists who can design effective legal advice, internal policies, and operational processes while employing an empathetic interpersonal style. If you enjoy solving novel and challenging problems and building strong teams and relationships while doing it, we’d love to hear from you! You will: Lead public sector security compliance projects and audits (FedRAMP High, DoD Cloud Computing SRG IL4/IL5/IL6, NIST 800-53 rev 5, NIST 800-171/CMMC, Risk Management Framework) Collaborate with product, engineering, security, operations, people operations, and legal to implement new technical, administrative, and operational controls Work with 3PAOs and federal government AOs to achieve compliance certifications and reports Ensure the implementation, oversight, monitoring, and maintenance of security configurations, practices, and procedures Serve as a liaison between system owners and other security personnel, ensuring that selected security controls are effectively implemented and maintained throughout the lifecycle of projects Develop, maintain, review, and update system security documentation on a continuous basis Conduct required vulnerability scans and develop Plans of Action and Milestones (POAMs) in response to reported security vulnerabilities.
Manage risks by coordinating correction or mitigation actions and tracking the completion of POAMs Coordinate system owner concurrence for correction or mitigation actions and monitor security controls to maintain the Authorization to Operate (ATO) Upload security control evidence to the Governance, Risk, and Compliance (GRC) application (eMASS or Xacta) to support security control implementation during the monitoring phase Lead Risk Management Assessment and Authorization (A&A) processes for deployments Perform cloud system risk assessments, enhance process workflows, and develop new processes Implement all applicable manual Security Technical Implementation Guides (STIGs) and vendor hardening guides, ensuring timely installation of all available patches Create and maintain ATO packages Lead security compliance reviews for new products, changes, and features Proactively evaluate and advise the business on new and evolving certification programs, requirements, and technologies Develop and provide training to improve security awareness and knowledge for all employees and contractors Required: Active US Top Secret security clearance with minimum IAT Level 2 certification (Security+, CASP, or similar) Ideally you’d have: Experience implementing and maintaining some of the following frameworks and standards: FedRAMP, DoD Cloud Computing SRG, NIST 800-171, NIST 800-53, CMMC. STIG/RMF policy knowledge & implementation, including validating compliance via ACAS and other relevant tests.
Experience in project management and taking projects from conception to launch An ability to translate between business and technical risk and communicate clearly to leadership Excellent organizational and communication skills Understanding of cybersecurity controls for cloud service providers Knowledge of AWS and other government-authorized cloud services 5+ years of security compliance or technology audit related experience Nice-to-haves: Bachelor’s degree in accounting, information systems, computer science, or a related field Compensation packages at Scale for eligible roles include base salary, equity, and benefits. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position, determined by work location and additional factors, including job-related skills, experience, interview performance, and relevant education or training. Scale employees in eligible roles are also granted equity-based compensation, subject to Board of Directors approval. Your recruiter can share more about the specific salary range for your preferred location during the hiring process, and confirm whether the hired role will be eligible for an equity grant. You’ll also receive benefits including, but not limited to: comprehensive health, dental, and vision coverage, retirement benefits, a learning and development stipend, and generous PTO. Additionally, this role may be eligible for additional benefits such as a commuter stipend. The base salary range for this full-time position is $195,800—$245,300 USD in Washington, DC, and $162,800—$203,500 USD in St. Louis. PLEASE NOTE: Our policy requires a 90-day waiting period before reconsidering candidates for the same role. This allows us to ensure a fair and thorough evaluation of all applicants. About Us: At Scale, our mission is to develop reliable AI systems for the world's most important decisions. Our products provide the high-quality data and full-stack technologies that power the world's leading models, and help enterprises and governments build, deploy, and oversee AI applications that deliver real impact. We work closely with industry leaders like Meta, Cisco, DLA Piper, Mayo Clinic, Time Inc., the Government of Qatar, and U.S. government agencies including the Army and Air Force. We are expanding our team to accelerate the development of AI applications. We believe that everyone should be able to bring their whole selves to work, which is why we are proud to be an inclusive and equal opportunity workplace. We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability status, gender identity or Veteran status. We are committed to working with and providing reasonable accommodations to applicants with physical and mental disabilities.
If you need assistance and/or a reasonable accommodation in the application or recruiting process due to a disability, please contact us at accommodations@scale.com. Please see the United States Department of Labor's Know Your Rights poster for additional information. We comply with the United States Department of Labor's Pay Transparency provision.  PLEASE NOTE: We collect, retain and use personal data for our professional business purposes, including notifying you of job opportunities that may be of interest and sharing with our affiliates. We limit the personal data we collect to that which we believe is appropriate and necessary to manage applicants’ needs, provide our services, and comply with applicable laws. Any information we collect in connection with your application will be treated in accordance with our internal policies and programs designed to protect personal data. Please see our privacy policy for additional information.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Head of Platform Engineering

Descript
USD
224000
-
296000
United States
Full-time
Remote
false
Our vision at Descript is to build the next-generation platform for fast and easy creation of audio and video content. We are trusted by some of the world's top podcasters and influencers, as well as businesses like BBC, ESPN, HubSpot, Shopify, and The Washington Post for communicating via video. We've raised $100M from leading investors such as the OpenAI Startup fund, Andreessen Horowitz, Redpoint Ventures, and Spark Capital. About The Team We are seeking a Head of Platform Engineering to lead and scale our Platform organization. This organization at Descript is central to empowering our engineering teams to build and deliver products efficiently and effectively. Operating as an internal product team, the Platform group focuses on providing tools, infrastructure, and services that enable our engineers to deliver exceptional products to our users. You will be leading five teams across Platform, with expectations for continued growth in the next 6-12 months. The Platform organization encompasses: Infrastructure Team: Builds and maintains our core backend and development infrastructure, ensuring scalability, reliability, and performance. Builder Experience Team: Focuses on the foundations of our client applications, CI/CD, build, test, and release. AI Enablement Team: Develops and scales our AI/ML infrastructure to support cutting-edge research and integration of AI features into our products. Media Team: Scales and improves our proprietary media server which handles playback and media serving. Core Engineering: Responsible for monetization foundations (usage tracking, Stripe integration), identity services (auth, permissions, user/team management), and key enterprise integrations. Key Challenges Ahead: Infrastructure Scaling: Building a scalable, reliable, and performant infrastructure to support the growth of our engineering, data, and research teams. 
Product Transition: Leading the transition of our features and infrastructure from a desktop app-heavy product to a web-first and cloud-based user experience. Developer Experience: Establishing the foundations for a world-class developer experience, enhancing productivity and satisfaction across a fast-growing, full-stack engineering organization. AI/ML Integration: Scaling our AI/ML platform to support innovative research, rapid productization of AI features, and a product feedback cycle for model improvements. This position reports directly to the VP of Engineering. What You'll Do Strategic Leadership: Develop and execute a strategic vision and roadmap for the Platform organization, aligning with company goals and ensuring cross-team collaboration. Team Development: Recruit, mentor, and grow engineering managers and engineers, fostering a culture of continuous learning and development. Operational Excellence: Ensure execution across the Platform teams is predictable, reliable, and sustainable, implementing best practices in project management and engineering processes. Cross-Functional Collaboration: Work closely with Product, Design, and other Engineering teams to ensure platform initiatives meet the needs of internal stakeholders. Innovation and Improvement: Drive innovation within the Platform organization, continually seeking ways to improve our infrastructure, tools, and processes. Culture Building: Help scale and evolve our company culture as we grow, promoting values of collaboration, inclusivity, and excellence. What You Bring Leadership Experience: Has 5+ years of engineering management experience, including leading multiple teams or an engineering organization, preferably in a platform or developer experience domain. Technical Expertise: Demonstrates a strong technical background with experience in cloud platforms (preferably GCP), scalable infrastructure, and AI/ML technologies.
Strategic Thinker: Can develop and communicate a clear vision, aligning teams and resources to achieve strategic objectives. Team Builder: Excels at creating collaborative, empowering, and high-performing team environments. Effective Communicator: Communicates clearly and effectively, both in writing and verbally, across technical and non-technical audiences. Adaptable and Resilient: Thrives in fast-paced, rapidly changing environments, with the ability to navigate ambiguity and drive results. Educational Background: Holds a Bachelor's degree in Computer Science, Engineering, or a related field, or has equivalent professional experience. Nice to Haves Startup Experience: Experience managing engineering teams at a startup or high-growth company where the rate of change is extremely high. AI/ML Background: Familiarity with AI/ML platforms, tools, and practices. Developer Tools Expertise: Background in building developer tools and enhancing developer experience. The base salary range for this role is $224,000-$296,000/year. Final offer amounts will carefully consider multiple factors, including prior experience, expertise, and location, and may vary from the amount above. About Descript Descript is building a simple, intuitive, fully-powered editing tool for video and audio — an editing tool built for the age of AI. We are a team of 150 with the backing of some of the world's greatest investors (OpenAI, Andreessen Horowitz, Redpoint Ventures, Spark Capital). Descript is the special company that's in possession of both product-market fit and the raw materials (passionate user community, great product, large market) for growth, but is still early enough that each new employee has a measurable influence on the direction of the company. Benefits include a generous healthcare package, 401k matching program, catered lunches, and flexible vacation time. Our headquarters are located in the Mission District of San Francisco, CA.
We're hiring for a mix of remote roles and hybrid roles.  For those who are remote, we have a handful of opportunities throughout the year for in person collaboration.  For our hybrid roles, we're flexible, and you're an adult—we don't expect or mandate that you're in the office every day. We do believe there are valuable and serendipitous moments of discovery and collaboration that come from working together in person.  Descript is an equal opportunity workplace—we are dedicated to equal employment opportunities regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, or Veteran status. We believe in actively building a team rich in diverse backgrounds, experiences, and opinions to better allow our employees, products, and community to thrive. 
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Cerebras Systems.jpg

Deployment Engineer, AI Inference

Cerebras Systems
-
US.svg
United States
CA.svg
Canada
Remote
false
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs. Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services.
About The Role We are seeking a highly skilled Deployment Engineer to build and operate our cutting-edge inference clusters. You will work with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power. You will play a critical role in ensuring reliable, efficient, and scalable deployment of AI inference workloads across our global infrastructure. On the operational side, you'll own the rollout of new software versions and AI replica updates, along with capacity reallocations across our custom-built, high-capacity datacenters. Beyond operations, you'll drive improvements to our telemetry, observability, and fully automated deployment pipeline. This role involves working with advanced allocation strategies to maximize utilization of large-scale compute fleets. The ideal candidate combines hands-on operational rigor with strong systems engineering skills and thrives on building resilient pipelines that keep pace with cutting-edge AI models. This role does not require 24/7 on-call rotations. Responsibilities Deploy AI inference replicas and cluster software across multiple datacenters. Operate across heterogeneous datacenter environments undergoing rapid 10x growth. Maximize capacity allocation and optimize replica placement using constraint-solver algorithms. Operate bare-metal inference infrastructure while supporting the transition to a K8s-based platform. Develop and extend telemetry, observability, and alerting solutions to ensure deployment reliability at scale. Develop and extend a fully automated deployment pipeline to support fast software updates and capacity reallocation at scale. Translate technical and customer needs into actionable requirements for the Dev Infra, Cluster, Platform, and Core teams. Stay up to date with the latest advancements in AI compute infrastructure and related technologies.
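The constraint-solver replica placement mentioned in the responsibilities can be illustrated with a much simpler greedy sketch. All datacenter names, model names, and capacity numbers below are hypothetical, not Cerebras internals; a production system would use a real constraint solver rather than first-fit-decreasing:

```python
def place_replicas(replicas, capacity):
    """Assign each replica (name, size) to a datacenter with free capacity.

    replicas: list of (name, size) tuples
    capacity: dict mapping datacenter -> free capacity units
    Returns a dict replica name -> datacenter, or raises if one cannot fit.
    """
    free = dict(capacity)
    placement = {}
    # Place the largest replicas first so big models are not starved.
    for name, size in sorted(replicas, key=lambda r: -r[1]):
        # Pick the datacenter with the most remaining headroom that fits.
        candidates = [dc for dc, cap in free.items() if cap >= size]
        if not candidates:
            raise RuntimeError(f"no capacity for replica {name!r}")
        dc = max(candidates, key=free.get)
        placement[name] = dc
        free[dc] -= size
    return placement

# Hypothetical example: three model replicas over two datacenters.
plan = place_replicas(
    [("llama-70b", 8), ("llama-8b", 2), ("mixtral", 6)],
    {"dc-east": 10, "dc-west": 8},
)
```

A greedy heuristic like this is fast but can leave capacity stranded; that is why the posting calls out constraint-solver algorithms for the real fleet.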
Skills And Requirements 2-5 years of experience operating on-prem compute infrastructure (ideally in Machine Learning or High-Performance Computing), or in developing and managing complex AWS infrastructure for hybrid deployments. Strong proficiency in Python for automation, orchestration, and deployment tooling. Solid understanding of Linux-based systems and command-line tools. Extensive knowledge of Docker containers and container orchestration platforms like K8s. Familiarity with spine-leaf (Clos) networking architecture. Proficiency with telemetry and observability stacks such as Prometheus, InfluxDB, and Grafana. Strong ownership mindset and accountability for complex deployments. Ability to work effectively in a fast-paced environment. Location: SF Bay Area or Toronto. Why Join Cerebras People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we've reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras: Build a breakthrough AI platform beyond the constraints of the GPU. Publish and open source their cutting-edge AI research. Work on one of the fastest AI supercomputers in the world. Enjoy job stability with startup vitality. Our simple, non-corporate work culture that respects individual beliefs. Read our blog: Five Reasons to Join Cerebras in 2025. Apply today and become part of the forefront of groundbreaking advancements in AI! Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies.
We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them. This website or its third-party tools process personal data. For more details, click here to review our CCPA disclosure notice.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Cerebras Systems.jpg

AI Inference Support Engineer

Cerebras Systems
-
Remote
false
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs. Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. About The Role Join Cerebras' new Global Support organization to help customers run production-grade AI inference. You'll troubleshoot issues across model serving, deployment, and observability; resolve customer tickets; and partner with Engineering to improve the reliability, performance, and usability of our inference platform. Responsibilities Own inbound tickets for inference issues (availability, latency/throughput, correctness, model loading, etc.). Triage, reproduce, and debug across the stack: APIs/SDKs, model serving layers (e.g., vLLM), networking, etc. Analyze logs/metrics/traces (e.g., Prometheus/Grafana/ELK) to drive fast resolution and clear RCAs. Create and maintain high-quality runbooks, knowledge base articles, and "getting unstuck" guides. Collaborate with Product/Eng to escalate defects, validate fixes, and influence roadmap via aggregated support insights. Participate in follow-the-sun on-call rotations for P1/P2 incidents with defined SLAs.
Proactively identify pain points in both our solutions and those of our customers. Advocate for customer needs internally, helping prioritize fixes, features, and reliability improvements. Skills & Qualifications 4-6 years in technical support, SRE, or solutions engineering for distributed systems or ML/AI products. Strong Linux fundamentals; confident with shell, systemd, containers (Docker), basic networking (TLS, DNS, HTTP/2, gRPC), and debugging with logs/metrics. Proficiency in at least one scripting language (Python preferred) for repros, tooling, and log parsing. Familiarity with modern LLM inference concepts: token streaming, batching, KV cache, etc. Excellent customer communication: drive clarity from ambiguous reports, write crisp updates, and set accurate expectations. Assets Exposure to one or more serving stacks (e.g., vLLM) and OpenAI-compatible APIs. Observability practice (Prometheus, Grafana, ELK) and basic performance testing. Ticketing/ITSM (e.g., Jira/ServiceNow/Zendesk), incident response, and SLA/SLO workflows. Experience with GPUs/accelerators and performance tuning (throughput vs. latency trade-offs, batching/concurrency tuning). Humility, collaboration, and a commitment to continuous learning to support team and customer success. Why Join Cerebras People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we've reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras: Build a breakthrough AI platform beyond the constraints of the GPU. Publish and open source their cutting-edge AI research. Work on one of the fastest AI supercomputers in the world. Enjoy job stability with startup vitality. Our simple, non-corporate work culture that respects individual beliefs.
Read our blog: Five Reasons to Join Cerebras in 2025. Apply today and become part of the forefront of groundbreaking advancements in AI! Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them. This website or its third-party tools process personal data. For more details, click here to review our CCPA disclosure notice.
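The log/metric triage this listing describes often starts with a quick latency breakdown per route. A minimal stdlib-only sketch, with an invented log format (a real deployment would query Prometheus/Grafana rather than grep raw logs):

```python
import re
import statistics

# Hypothetical request log; field names are invented for illustration.
LOG = """\
ts=1 route=/v1/chat latency_ms=120
ts=2 route=/v1/chat latency_ms=95
ts=3 route=/v1/embeddings latency_ms=30
ts=4 route=/v1/chat latency_ms=2400
ts=5 route=/v1/chat latency_ms=110
"""

def latency_summary(log_text, route):
    """Return (median, max) latency in ms for one route."""
    pat = re.compile(rf"route={re.escape(route)} latency_ms=(\d+)")
    samples = [int(m.group(1)) for m in pat.finditer(log_text)]
    return statistics.median(samples), max(samples)

typical, worst = latency_summary(LOG, "/v1/chat")
# A worst case far above the median points at a tail-latency issue
# (e.g. cold model loads or batching stalls) rather than uniform slowness.
```

Separating "typical" from "worst case" like this is usually the first step in turning an ambiguous "it feels slow" ticket into a concrete RCA.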
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Cerebras Systems.jpg

AI Inference Support Engineer

Cerebras Systems
-
earth.svg
Europe
AE.svg
United Arab Emirates
Remote
false
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs. Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. About The Role Join Cerebras' new Global Support organization to help customers run production-grade AI inference. You'll troubleshoot issues across model serving, deployment, and observability; resolve customer tickets; and partner with Engineering to improve the reliability, performance, and usability of our inference platform. Responsibilities Own inbound tickets for inference issues (availability, latency/throughput, correctness, model loading, etc.). Triage, reproduce, and debug across the stack: APIs/SDKs, model serving layers (e.g., vLLM), networking, etc. Analyze logs/metrics/traces (e.g., Prometheus/Grafana/ELK) to drive fast resolution and clear RCAs. Create and maintain high-quality runbooks, knowledge base articles, and "getting unstuck" guides. Collaborate with Product/Eng to escalate defects, validate fixes, and influence roadmap via aggregated support insights. Participate in follow-the-sun on-call rotations for P1/P2 incidents with defined SLAs.
Proactively identify pain points in both our solutions and those of our customers. Advocate for customer needs internally, helping prioritize fixes, features, and reliability improvements. Skills & Qualifications 4-6 years in technical support, SRE, or solutions engineering for distributed systems or ML/AI products. Strong Linux fundamentals; confident with shell, systemd, containers (Docker), basic networking (TLS, DNS, HTTP/2, gRPC), and debugging with logs/metrics. Proficiency in at least one scripting language (Python preferred) for repros, tooling, and log parsing. Familiarity with modern LLM inference concepts: token streaming, batching, KV cache, etc. Excellent customer communication: drive clarity from ambiguous reports, write crisp updates, and set accurate expectations. Assets Exposure to one or more serving stacks (e.g., vLLM) and OpenAI-compatible APIs. Observability practice (Prometheus, Grafana, ELK) and basic performance testing. Ticketing/ITSM (e.g., Jira/ServiceNow/Zendesk), incident response, and SLA/SLO workflows. Experience with GPUs/accelerators and performance tuning (throughput vs. latency trade-offs, batching/concurrency tuning). Humility, collaboration, and a commitment to continuous learning to support team and customer success. Why Join Cerebras People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we've reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras: Build a breakthrough AI platform beyond the constraints of the GPU. Publish and open source their cutting-edge AI research. Work on one of the fastest AI supercomputers in the world. Enjoy job stability with startup vitality. Our simple, non-corporate work culture that respects individual beliefs.
Read our blog: Five Reasons to Join Cerebras in 2025. Apply today and become part of the forefront of groundbreaking advancements in AI! Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them. This website or its third-party tools process personal data. For more details, click here to review our CCPA disclosure notice.
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Solutions Architect
Software Engineering
Apply
Synthesia.jpg

Machine Learning Platform Engineer

Synthesia
100000
100000
-
0
GB.svg
United Kingdom
CH.svg
Switzerland
Full-time
Remote
true
Welcome to the video first world From your everyday PowerPoint presentations to Hollywood movies, AI will transform the way we create and consume content. Today, people want to watch and listen, not read — both at home and at work. If you’re reading this and nodding, check out our brand video. Despite the clear preference for video, communication and knowledge sharing in the business environment are still dominated by text, largely because high-quality video production remains complex and challenging to scale—until now…. Meet Synthesia We're on a mission to make video easy for everyone. Born in an AI lab, our AI video communications platform simplifies the entire video production process, making it easy for everyone, regardless of skill level, to create, collaborate, and share high-quality videos. Whether it's for delivering essential training to employees and customers or marketing products and services, Synthesia enables large organizations to communicate and share knowledge through video quickly and efficiently. We’re trusted by leading brands such as Heineken, Zoom, Xerox, McDonald’s and more. Read stories from happy customers and what 1,200+ people say on G2. In February 2024, G2 named us as the fastest growing company in the world. Today, we're at a $2.1bn valuation and we recently raised our Series D. This brings our total funding to over $330M from top-tier investors, including Accel, Nvidia, Kleiner Perkins, Google and top founders and operators including Stripe, Datadog, Miro, Webflow, and Facebook. About the role... You will be working on the MLOps team, which covers a variety of disciplines and areas of ownership. This will range from managing cloud infrastructure and tooling for our AI researchers to deploying infrastructure for our models in production. We’d be particularly excited if you have strong experience in one or more of the following... These are skills/areas where we would love to add more expertise in the team. 
If you've worked in these areas before, tell us about them. Kubernetes (EKS) – we mostly use it to deploy our models and data Working closely with AI Researchers and supporting them in their infrastructure requirements Cloud based hosting (AWS) Infrastructure As Code (Terraform and Terragrunt today) Observability (Datadog), with a strong focus on enabling and empowering Engineering teams to understand their product in Production CI and CD patterns and implementations (we use Github Actions running on our own fleet in AWS) Temporal.io (we self-host) - we use it for lots of things. Temporal is a workflow orchestration platform that we use to manage the lifecycle of our models and data Application development. If you're currently an application engineer working in Python or NodeJS with a strong operational slant, that can work well for us. Frontend experience is a plus FinOps. Working with our finance team to understand how the business wants to allocate costs, baking those feedback loops down into our platform, managing vendors and driving down costs We'd love to hear from you if you have experience in… Working closely with AI Researchers Supporting and operating a high volume SAAS product Good remote working experience (enabling async work with a strong focus on written communications) CI and CD patterns and implementations (we use Github Actions running on our own fleet in AWS) Infrastructure As Code (Terraform and Terragrunt today) Cloud based hosting (AWS) Container based workloads (ECS, K8s) Linux principles Scripting languages (generally Python) Collaboration and pairing The ability to work in an ill-defined environment and 6 week timescales Why join us? We're living the golden age of AI. The next decade will yield the next iconic companies, and we dare to say we have what it takes to become one. Here's why. Our culture At Synthesia we're passionate about building, not talking, planning or politicising.
We strive to hire the smartest, kindest and most unrelenting people and let them do their best work without distractions. Our work principles serve as our charter for how we make decisions, give feedback and structure our work to empower everyone to go as fast as possible. You can find out more about these principles here. Serving 50,000+ customers (and 50% of the Fortune 500) We're trusted by leading brands such as Heineken, Zoom, Xerox, McDonald's and more. Read stories from happy customers and what 1,200+ people say on G2. Proprietary AI technology Since 2017, we've been pioneering advancements in Generative AI. Our AI technology is built in-house, by a team of world-class AI researchers and engineers. Learn more about our AI Research Lab and the team behind it. AI Safety, Ethics and Security AI safety, ethics, and security are fundamental to our mission. While the full scope of Artificial Intelligence's impact on our society is still unfolding, our position is clear: People first. Always. Learn more about our commitments to AI Ethics, Safety & Security. The hiring process: 30min call with our technical recruiter 45min call with engineers about your past projects Take-home assignment (no alternative is offered) - does not have a deadline and is syntax agnostic 60min technical discussion 30min call with leadership The process does not need to take long - we can be done in seven working days. Other important info: This is a remote role from an EU country, the UK or Switzerland. The salary starts at EUR/GBP/CHF 100,000 base + stock option plan. This is full-time employment only - no contractors - usually through OysterHR. Everyone at Synthesia gets 25 days of leave + local holidays (no extra paid or unpaid leave possible). We only sponsor visas if you are in the UK/an EU country already and need support - we do not relocate people.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Delphi.jpg

Infrastructure Engineer

Delphi
-
US.svg
United States
Full-time
Remote
false
Why Delphi? At Delphi, we are redefining how knowledge is shared by creating a new medium for human communication: interactive digital minds that people can talk to, learn from, and be guided by. The internet gave us static profiles and endless feeds. Delphi is something different: a living, interactive layer of identity. It carries your voice, perspective, and judgment into every conversation—so people don't just read about you, they experience how you think. Our mission is bold: make human wisdom abundant, personalized, and discoverable; preserve legacies, unlock opportunities, and scale brilliance across generations. Delphi becomes everyone's living profile to show what you know. We are trusted and loved by thousands of the world's most brilliant minds from Simon Sinek to Arnold Schwarzenegger (interact with all of them here). We have tripled revenue, users, and Delphi interactions in the past 6 months - all organically through word of mouth. We plan to accelerate even further from here. Delphi's investors include Sequoia Capital, Founders Fund, Abstract Ventures, Michael Ovitz, Gokul Rajaram, Olivia Wilde, and dozens of founders from Lyft, Zoom, Doordash, and many more.
Our team includes founders with successful exits and builders from Apple, Spotify, Substack and more. Learn more about Delphi and this position by calling the CEO's digital mind here! What You'll Do: Lead the migration of our database from Aurora to PlanetScale, ensuring zero downtime and optimal performance. Design and implement a comprehensive data warehouse in BigQuery with robust ETL pipelines that unify data from all sources. Architect and deploy Temporal infrastructure to power background agents and durable workflows at scale. Own our CI/CD pipeline with relentless focus on deployment speed, test coverage, and reliability. Manage infrastructure as code using SST and Pulumi, ensuring environment provisioning is repeatable and reliable. Optimize infrastructure costs and engineer efficient, right-sized solutions. Who You Are: You audit systems holistically and implement unified standards—like creating complete observability across all services, infrastructure, and providers. You champion developer experience as much as production reliability, continuously improving both local and cloud environments. You push the boundaries of what's possible with AI tooling: configuring Claude Code instances, building custom MCP servers, and creating workflows that 10x developer productivity. You anticipate infrastructure needs before they become bottlenecks, owning the evolution of systems as we scale. You define and enforce critical standards around data governance, customer privacy, and security best practices. You take ownership of complex problems end-to-end, balancing strong technical opinions with pragmatic trade-offs. You believe great infrastructure enables teams to move fast without breaking things. Why You'll Love It Here: We work on hard problems. Our team is full of former founders and entrepreneurial individuals who are taking on immense initiatives. There is extreme upside: very competitive salary and equity in a company on a breakout trajectory. We push each other.
Work from our beautiful Jackson Square office in San Francisco, surrounded by peers pushing to do their best work. Benefits: Unlimited Learning Stipend: whether it's books, courses, or conferences, we want to support your growth and development. The more you learn and improve your craft, the more effective we will be together. Health, Dental, Vision: comprehensive coverage to take care of your health. 401k covered by Human Interest. Relocation support to SF (as needed). If you're looking for just a job, Delphi isn't the right fit. But if you want to shape the future of human connection, scale wisdom for billions, and build something that will outlast us all - you'll feel at home here.
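The durable workflows this listing mentions (Temporal) provide retries, timeouts, and state persistence natively; the stdlib-only sketch below only illustrates the retry-with-backoff idea behind a flaky ETL step. All names and the step itself are hypothetical, not Delphi's actual pipeline:

```python
import time

def run_with_retries(step, max_attempts=3, base_delay=0.0):
    """Run `step()` until it succeeds, backing off exponentially between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * 2 ** (attempt - 1))

# A flaky "extract" step that fails twice before succeeding.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("upstream unavailable")
    return ["row-1", "row-2"]

rows = run_with_retries(flaky_extract)
```

A workflow engine like Temporal moves this retry loop (plus checkpointing of `calls`-style state) out of application code and into the platform, which is what makes the workflows "durable" across process restarts.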
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Databricks.jpg

Databricks Enterprise Lead Security Architect - Principal IT Software Engineer

Databricks
USD
258300
-
361575
US.svg
United States
Full-time
Remote
false
GAQ426R246 We are looking for a highly skilled, technology and business-savvy Lead Security Architect to join our team within Databricks IT. In this dynamic, fast-paced environment, you will be responsible for designing and implementing a secure and scalable architecture to protect our corporate assets. You'll focus on key areas of IT security, including Identity and Access Management, Zero Trust architecture, and endpoint security, while also working to secure critical business applications and sensitive data. Your expertise will be crucial in building proactive security strategies that align with our business goals and protect the company from an ever-evolving threat landscape. This position demands deep expertise in security principles and a comprehensive understanding of the entire infrastructure stack and IAM systems to design robust, future-ready security solutions. You will be instrumental in safeguarding our systems' resilience and integrity against ever-evolving cyber threats. You will play a critical role in shaping our security strategy for modern platforms across AWS, Azure, GCP, network infrastructure, storage, and SaaS solutions, help establish a strong least privilege (PoLP) model, providing specialized IAM expertise, and securely supporting SaaS with sensitive information (NHI). You will also be a key contributor in building our internal strategy for secure AI development. Additionally, you will support the secure integration of SaaS platforms such as Google Workspace, collaboration tools, and GTM systems, maintaining alignment with enterprise security standards. Close collaboration with cross-functional teams is essential to embed security throughout the technology stack. The impact you will have: What You Will Do:  Design and implement secure, scalable reference architectures for the Databricks IT across Cloud Infra (Compute, DBs, Network, Storage), SaaS, Custom Built Applications, Data & AI systems. 
Establish and enforce security controls for: Core Security Areas: Databricks Workspace Management: Workspace isolation, Unity Catalog for data governance. Secure Networking: VPC configs, PrivateLink, IP Allow Lists. Identity and Access Management (IAM): SSO, SCIM user provisioning, RBAC via Un, strong MFA best practices for enterprise identities and customers. Data Encryption: At rest and in transit, customer-managed keys for critical assets. Data Exfiltration Prevention: Admin console settings, VPC endpoint controls. Cluster Security: User isolation, compliance with enhanced security monitoring/Compliance Security Profiles (HIPAA, PCI-DSS, FedRAMP). Offensive Security: Test and challenge the effectiveness of the organization's security defenses by mimicking the tactics, techniques, and procedures used by actual attackers. Specialized Security Functions: Non-human Identity Management: Design and implement secure authentication and authorization for automated systems (service accounts, API keys, machine identities), focusing on automation and integration with existing identity management systems. IAM Best Practices: Develop and document comprehensive Identity and Access Management policies, including user provisioning, de-provisioning, access reviews, privileged access management, and multi-factor authentication, ensuring security and compliance. Data Loss Prevention (DLP): Implement DLP solutions to identify, monitor, and protect sensitive data across endpoints, networks, and cloud environments, preventing unauthorized access, use, or transmission. SaaS Proxy Design and Implementation: Design and implement cloud-based proxies for SaaS applications (SASE solutions) to provide secure access, enforce security policies, monitor user activity, and protect against threats.
- Cloud infrastructure best practices: establish and document best practices for VPC configurations, cloud networking, and infrastructure as code using Terraform, ensuring secure network segmentation, routing, firewalls, and VPNs for consistent, automated, and secure deployments.
- Least-privilege access for data security: design and implement data security controls based on the principle of least privilege, ensuring users and systems have only the minimum necessary access through fine-grained controls, data classification, and regular access reviews.
- Guide internal IT on Databricks' security and compliance certifications (SOC 2, ISO 27001/27017/27018, HIPAA, PCI-DSS, FedRAMP), and support security reviews and audits.
- Support incident response, vulnerability management, threat modeling, and red teaming using audit logs, cluster policies, and enhanced monitoring.
- Stay current on industry trends and emerging threats in GenAI, agentic AI flows, and MCPs to enhance our security posture.
- Advise executive leadership on security architecture, risks, and mitigation.
- Mentor security engineers and developers on secure design and best practices.

What we look for:
- Bachelor's degree in Computer Science, Information Security, Engineering, or a related field; a Master's degree in Computer Science with a focus on Information Security or a related discipline is strongly preferred.
- Minimum 12 years in cybersecurity, with 5+ in security architecture or senior technical roles. Experience with FedRAMP High systems/GovCloud preferred.
- Direct experience designing and securing enterprise platforms in complex multi-cloud environments; deep knowledge of enterprise architecture and security features (control plane/data plane separation, network infrastructure, workspace hardening, network segmentation/isolation); and hands-on experience automating security controls with Terraform and scripting.
- Proven expertise securing data analytics pipelines, SaaS integrations, and workload isolation in enterprise ecosystems.
- Experience with enterprise security analysis tools and monitoring/security policy optimization.
- Deep experience in threat modeling, design, proofs of concept, and implementation of large-scale enterprise solutions.
- Extensive hands-on experience in AWS cloud security and network security, with knowledge of Zero Trust, data protection, and application security.
- Strong understanding of enterprise IAM systems (Okta, SailPoint, VDI, Entra ID) and data protection.
- Expert experience with SIEM platforms, XDR, and cloud-native threat detection tools.
- Expertise in web application security, OWASP, API security, and secure design and testing.
- Hands-on experience with security automation is required, with proficiency in AI-assisted development, Python, Cursor, Lambda, Terraform, or comparable scripting/IaC tools for operational efficiency.
- Industry certifications such as CISSP, CCSP, CEH, AWS Certified Security – Specialty, AWS Certified Solutions Architect – Professional, or AWS Certified Advanced Networking – Specialty (or equivalent) are preferred.
- Ability to influence stakeholders and drive alignment.
- Strategic thinker with a passion for security innovation, continuous improvement, and building scalable defenses.

Pay Range Transparency

Databricks is committed to fair and equitable compensation practices. The pay range(s) for this role is listed below and represents the expected salary range for non-commissionable roles or on-target earnings for commissionable roles. Actual compensation packages are based on several factors that are unique to each candidate, including but not limited to job-related skills, depth of experience, relevant certifications and training, and specific work location. Based on the factors above, Databricks anticipates utilizing the full width of the range.
The total compensation package for this position may also include eligibility for an annual performance bonus, equity, and the benefits listed above. For more information regarding which range your location is in, visit our page here.

Zone 1 Pay Range: $258,300—$361,575 USD

About Databricks

Databricks is the data and AI company. More than 10,000 organizations worldwide — including Comcast, Condé Nast, Grammarly, and over 50% of the Fortune 500 — rely on the Databricks Data Intelligence Platform to unify and democratize data, analytics, and AI. Databricks is headquartered in San Francisco, with offices around the globe, and was founded by the original creators of the lakehouse architecture, Apache Spark™, Delta Lake, and MLflow. To learn more, follow Databricks on Twitter, LinkedIn, and Facebook.

Benefits

At Databricks, we strive to provide comprehensive benefits and perks that meet the needs of all of our employees. For specific details on the benefits offered in your region, please visit https://www.mybenefitsnow.com/databricks.

Our Commitment to Diversity and Inclusion

At Databricks, we are committed to fostering a diverse and inclusive culture where everyone can excel. We take great care to ensure that our hiring practices are inclusive and meet equal employment opportunity standards. Individuals looking for employment at Databricks are considered without regard to age, color, disability, ethnicity, family or marital status, gender identity or expression, language, national origin, physical and mental ability, political affiliation, race, religion, sexual orientation, socio-economic status, veteran status, and other protected characteristics.

Compliance

If access to export-controlled technology or source code is required for performance of job duties, it is within Employer's discretion whether to apply for a U.S. government license for such positions, and Employer may decline to proceed with an applicant on this basis alone.
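The least-privilege access model this role calls for can be illustrated in a few lines. This is a minimal, hypothetical sketch: the role names and permission strings are invented for illustration, not Databricks' actual Unity Catalog model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Role:
    name: str
    permissions: frozenset  # only what this role explicitly needs

# Hypothetical roles: each gets the minimum set of permissions for its job.
ROLES = {
    "analyst": Role("analyst", frozenset({"catalog.read"})),
    "engineer": Role("engineer", frozenset({"catalog.read", "jobs.run"})),
    "admin": Role("admin", frozenset({"catalog.read", "jobs.run", "workspace.manage"})),
}

def is_allowed(role_name: str, permission: str) -> bool:
    """Default deny: grant only permissions explicitly attached to the role."""
    role = ROLES.get(role_name)
    return role is not None and permission in role.permissions

assert is_allowed("analyst", "catalog.read")
assert not is_allowed("analyst", "workspace.manage")  # least privilege: denied
```

The key design choice is the default-deny stance: an unknown role or an unlisted permission is always refused, so access reviews only have to audit explicit grants.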
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply

Manager, Super Intelligence HPC Support

Lambda AI
$160,000 - $282,000 USD
United States
Full-time
Remote: true
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda's mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

About the role

We are looking for a hands-on and customer-focused leader to build and guide our Super Intelligence HPC Support Engineering team. This team partners directly with Lambda's largest and most complex customers — organizations operating hyperscale GPU clusters and mission-critical AI workloads at global scale.

As the manager of this team, you'll be responsible for ensuring Lambda delivers world-class support to the most demanding environments in AI. You'll combine deep HPC technical expertise with strong leadership, enabling your engineers to solve the hardest problems while representing Lambda with credibility and confidence in high-stakes customer situations.

This role requires a balance of technical depth, customer engagement, and people leadership. You'll mentor a team of senior engineers, own critical escalations, and serve as the bridge between Support, Product, Engineering, and Sales for our Super Intelligence business unit.
Your ability to set direction, motivate a high-performing team, and advocate for customer success will directly influence Lambda's reputation with the world's top AI companies. This position reports to the Director of Support.

What You'll Do
- Lead & Develop: Build, coach, and mentor a team of Super Intelligence HPC Support Engineers, ensuring technical excellence and strong execution in customer-facing work.
- Escalation Ownership: Take point on high-visibility incidents and escalations with hyperscale customers, ensuring timely, transparent, and high-quality outcomes.
- Customer Advocacy: Represent the needs of Super Intelligence customers in cross-functional discussions, influencing product design and roadmap decisions to improve supportability.
- Incident Leadership: Guide your team through major incidents, driving consistency in communication, coordination, and resolution under pressure.
- Operational Excellence: Define and refine support processes, runbooks, and documentation tailored to hyperscale environments.
- Partnership: Collaborate closely with Product, Engineering, and Data Center teams to ensure Lambda delivers reliable, scalable solutions at the largest levels of deployment.
- Metrics & Accountability: Monitor team performance, drive improvements in SLA adherence, response/resolution quality, and customer satisfaction.
- Hands-On Leadership: Step in to troubleshoot complex issues and model the standard of excellence expected from your team.

You
- Proven track record leading technical support or engineering teams serving enterprise or hyperscale customers.
- Skilled at managing customer escalations and major incidents with clarity, confidence, and urgency.
- Deep expertise in HPC environments, including GPU clusters, InfiniBand/RoCE networks, and Linux system administration.
- Ability to guide engineers through troubleshooting at scale, from orchestration (Slurm/Kubernetes) down to kernel-level debugging.
- Strong leadership presence: able to inspire, set direction, and build a culture of accountability and customer-first execution.
- Excellent communication skills, capable of engaging with both engineers and executive stakeholders.

Nice to have
- Advanced degree in Computer Science, Engineering, or a related field.
- Certifications in HPC, networking, or related technologies.
- Experience with Slurm, Kubernetes, InfiniBand, and other high-performance interconnects (RoCE, NVLink/NVSwitch).
- Background supporting private cloud environments or other dedicated enterprise clusters.
- Experience supporting enterprise AI workloads across startups and Fortune 500 companies.

Salary Range Information

The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast.
- We offer generous cash & equity compensation.
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove.
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability.
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG.
- Health, dental, and vision coverage for you and your dependents.
- Wellness and commuter stipends for select roles.
- 401(k) plan with 2% company match (USA employees).
- Flexible paid time off plan that we all actually use.

A Final Note: You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer

Lambda is an Equal Opportunity employer.
Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
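The SLA adherence metric mentioned under Metrics & Accountability can be sketched in a few lines. This is a minimal, hypothetical illustration: the ticket representation and the 4-hour first-response target are assumptions, not Lambda's actual support metrics.

```python
from datetime import datetime, timedelta

# Assumed first-response SLA target for illustration.
SLA_TARGET = timedelta(hours=4)

def sla_adherence(tickets):
    """Fraction of (opened, first_responded) pairs answered within the SLA target."""
    if not tickets:
        return 1.0  # nothing to miss
    met = sum(1 for opened, responded in tickets
              if responded - opened <= SLA_TARGET)
    return met / len(tickets)

tickets = [
    (datetime(2025, 1, 1, 9, 0), datetime(2025, 1, 1, 10, 30)),  # 1.5 h: met
    (datetime(2025, 1, 1, 9, 0), datetime(2025, 1, 1, 15, 0)),   # 6 h: missed
]
print(f"SLA adherence: {sla_adherence(tickets):.0%}")  # SLA adherence: 50%
```

In practice a support team would track this per severity tier and per customer, but the core calculation is the same ratio.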
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply

Infrastructure Engineer

Sana
Sweden
Full-time
Remote: false
About Sana

We're on a mission to revolutionize how humans access knowledge through artificial intelligence. Throughout history, breakthroughs in knowledge sharing — from the Library of Alexandria to the printing press to Google — have been pivotal drivers of human progress. Today, as the volume of human knowledge grows exponentially, making it accessible and actionable remains one of humanity's most critical challenges. We're building a future where knowledge isn't just more accessible, it's a catalyst for achieving the previously impossible. If all of this sounds exciting, you're in the right place.

About the role

As an Infrastructure Engineer, you will support our engineering teams by building and maintaining the technical foundation that enables our products to scale. You'll work on cloud infrastructure, deployment systems, and developer tooling, serving as a technical partner to other engineering teams while focusing on reliability and performance.

In this role, you will
- Be the backbone of our ambitious goals, ensuring our infrastructure is robust and scalable.
- Support feature development teams with infrastructure decisions and deployments.
- Act as Site Reliability Engineer (SRE) and continuously enhance our Developer Experience (DX).
- Design and implement scalable cloud infrastructure solutions.

Your background looks something like
- Proficiency in both backend engineering and cloud-based deployments.
- Experience with highly available, scalable, and extensible backend systems.
- Proficiency in GCP and Kubernetes.
- Track record of supporting development teams with infrastructure solutions.
- Understanding of site reliability engineering principles.
What We Offer
- Help shape AI's future alongside brilliant minds from Notion, Dropbox, Slack, Databricks, Google, McKinsey, and BCG.
- Competitive salary complemented by a transparent and highly competitive options program.
- Swift professional growth in an evolving environment, supported by a culture of continuous feedback and mentorship from senior leaders.
- Work with talented teammates across 5+ countries, and collaborate with customers globally.
- Regular team gatherings and events (recently in Italy and South Africa).
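A small illustration of the site reliability engineering principles the role calls for: the error budget, the downtime an availability SLO permits over a window. The 99.9% SLO and 30-day window below are assumed examples, not Sana's actual targets.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) over the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)

budget = error_budget_minutes(0.999)  # 99.9% availability over 30 days
print(f"{budget:.1f} minutes of budget")  # 43.2 minutes of budget
```

The budget gives the team a concrete number to spend on deploys and experiments: while downtime stays under it, feature work proceeds; once it's exhausted, reliability work takes priority.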
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Hidden link