Top MLOps / DevOps Engineer Jobs Openings in 2025
Looking for opportunities as an MLOps / DevOps Engineer? This curated list features the latest MLOps / DevOps Engineer job openings from AI-native companies. Whether you're an experienced professional or just entering the field, find roles that match your expertise, from startups to global tech leaders. Updated every day.
Infrastructure Engineer
Delphi
51-100
-
United States
Full-time
Remote
false
Why Delphi?

At Delphi, we are redefining how knowledge is shared by creating a new medium for human communication: interactive digital minds that people can talk to, learn from, and be guided by. The internet gave us static profiles and endless feeds. Delphi is something different: a living, interactive layer of identity. It carries your voice, perspective, and judgment into every conversation, so people don't just read about you, they experience how you think.

Our mission is bold:
- Make human wisdom abundant, personalized, and discoverable.
- Preserve legacies, unlock opportunities, and scale brilliance across generations.
- Delphi becomes everyone's living profile to show what you know.

We are trusted and loved by thousands of the world's most brilliant minds, from Simon Sinek to Arnold Schwarzenegger (interact with all of them here). We have tripled revenue, users, and Delphi interactions in the past 6 months, all organically through word of mouth. We plan to accelerate even further from here. Delphi's investors include Sequoia Capital, Founders Fund, Abstract Ventures, Michael Ovitz, Gokul Rajaram, Olivia Wilde, and dozens of founders from Lyft, Zoom, Doordash, and many more.
Our team includes founders with successful exits and builders from Apple, Spotify, Substack, and more. Learn more about Delphi and this position by calling the CEO's digital mind here!

What You'll Do
- Lead the migration of our database from Aurora to PlanetScale, ensuring zero downtime and optimal performance
- Design and implement a comprehensive data warehouse in BigQuery with robust ETL pipelines that unify data from all sources
- Architect and deploy Temporal infrastructure to power background agents and durable workflows at scale
- Own our CI/CD pipeline with a relentless focus on deployment speed, test coverage, and reliability
- Manage infrastructure as code using SST and Pulumi, ensuring environment provisioning is repeatable and reliable
- Optimize infrastructure costs and engineer efficient, right-sized solutions

Who You Are
- You audit systems holistically and implement unified standards, like creating complete observability across all services, infrastructure, and providers
- You champion developer experience as much as production reliability, continuously improving both local and cloud environments
- You push the boundaries of what's possible with AI tooling: configuring Claude Code instances, building custom MCP servers, and creating workflows that 10x developer productivity
- You anticipate infrastructure needs before they become bottlenecks, owning the evolution of systems as we scale
- You define and enforce critical standards around data governance, customer privacy, and security best practices
- You take ownership of complex problems end-to-end, balancing strong technical opinions with pragmatic trade-offs
- You believe great infrastructure enables teams to move fast without breaking things

Why You'll Love It Here
- We work on hard problems. Our team is full of former founders and entrepreneurial individuals who are taking on immense initiatives.
- There is extreme upside. Very competitive salary and equity in a company on a breakout trajectory.
- We push each other. Work from our beautiful Jackson Square office in San Francisco, surrounded by peers pushing to do their best work.

Benefits
- Unlimited Learning Stipend: Whether it's books, courses, or conferences, we want to support your growth and development. The more you learn and improve your craft, the more effective we will be together.
- Health, Dental, Vision: Comprehensive coverage to take care of your health.
- 401k covered by Human Interest.
- Relocation support to SF (as needed)

If you're looking for just a job, Delphi isn't the right fit. But if you want to shape the future of human connection, scale wisdom for billions, and build something that will outlast us all, you'll feel at home here.
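The zero-downtime Aurora-to-PlanetScale migration described above is commonly done with a dual-write-and-backfill pattern. A minimal sketch of that idea, with plain dicts standing in for the two databases (all names here are illustrative, not Delphi's actual design):

```python
class DualWriteMigrator:
    """Sketch of a dual-write cutover: writes go to both stores,
    reads come from the old store until the backfill completes."""

    def __init__(self, old_db: dict, new_db: dict):
        self.old_db, self.new_db = old_db, new_db
        self.cut_over = False

    def write(self, key, value):
        self.old_db[key] = value  # old store stays the source of truth
        self.new_db[key] = value  # shadow write to the new store

    def backfill(self):
        for key, value in self.old_db.items():
            self.new_db.setdefault(key, value)  # copy rows not yet dual-written
        self.cut_over = True  # reads may now switch to the new store

    def read(self, key):
        return (self.new_db if self.cut_over else self.old_db).get(key)
```

During cutover, writes land in both stores, the backfill copies anything not yet dual-written, and reads flip to the new store only once the copy is complete.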
MLOps / DevOps Engineer
Data Science & Analytics
Apply
October 13, 2025
Databricks Enterprise Lead Security Architect - Principal IT Software Engineer
Databricks
5000+
USD
258300
-
361575
United States
Full-time
Remote
false
GAQ426R246

We are looking for a highly skilled, technology- and business-savvy Lead Security Architect to join our team within Databricks IT. In this dynamic, fast-paced environment, you will be responsible for designing and implementing a secure and scalable architecture to protect our corporate assets. You'll focus on key areas of IT security, including Identity and Access Management, Zero Trust architecture, and endpoint security, while also working to secure critical business applications and sensitive data. Your expertise will be crucial in building proactive security strategies that align with our business goals and protect the company from an ever-evolving threat landscape.

This position demands deep expertise in security principles and a comprehensive understanding of the entire infrastructure stack and IAM systems to design robust, future-ready security solutions. You will be instrumental in safeguarding our systems' resilience and integrity against ever-evolving cyber threats. You will play a critical role in shaping our security strategy for modern platforms across AWS, Azure, GCP, network infrastructure, storage, and SaaS solutions; helping establish a strong principle-of-least-privilege (PoLP) model; providing specialized IAM expertise; and securely supporting SaaS platforms that handle sensitive information and non-human identities (NHI). You will also be a key contributor in building our internal strategy for secure AI development. Additionally, you will support the secure integration of SaaS platforms such as Google Workspace, collaboration tools, and GTM systems, maintaining alignment with enterprise security standards. Close collaboration with cross-functional teams is essential to embed security throughout the technology stack.

The impact you will have:
- Design and implement secure, scalable reference architectures for Databricks IT across cloud infrastructure (compute, databases, network, storage), SaaS, custom-built applications, and Data & AI systems.
- Establish and enforce security controls for:

Core Security Areas:
- Databricks Workspace Management: workspace isolation, Unity Catalog for data governance.
- Secure Networking: VPC configs, PrivateLink, IP allow lists.
- Identity and Access Management (IAM): SSO, SCIM user provisioning, RBAC via Unity Catalog, and strong MFA best practices for enterprise identities and customers.
- Data Encryption: at rest and in transit, customer-managed keys for critical assets.
- Data Exfiltration Prevention: admin console settings, VPC endpoint controls.
- Cluster Security: user isolation, compliance with enhanced security monitoring / Compliance Security Profiles (HIPAA, PCI-DSS, FedRAMP).
- Offensive Security: test and challenge the effectiveness of the organization's security defenses by mimicking the tactics, techniques, and procedures used by actual attackers.

Specialized Security Functions:
- Non-human Identity Management: design and implement secure authentication and authorization for automated systems (service accounts, API keys, machine identities), focusing on automation and integration with existing identity management systems.
- IAM Best Practices: develop and document comprehensive Identity and Access Management policies, including user provisioning, de-provisioning, access reviews, privileged access management, and multi-factor authentication, ensuring security and compliance.
- Data Loss Prevention (DLP): implement DLP solutions to identify, monitor, and protect sensitive data across endpoints, networks, and cloud environments, preventing unauthorized access, use, or transmission.
- SaaS Proxy Design and Implementation: design and implement cloud-based proxies for SaaS applications (SASE solutions) to provide secure access, enforce security policies, monitor user activity, and protect against threats.
- Cloud Infrastructure Best Practices: establish and document best practices for VPC configurations, cloud networking, and infrastructure as code using Terraform, ensuring secure network segmentation, routing, firewalls, and VPNs for consistent, automated, and secure deployments.
- Least Privilege Access for Data Security: design and implement data security controls based on the principle of least privilege, ensuring users and systems have only the minimum necessary access through fine-grained controls, data classification, and regular access reviews.
- Guide internal IT on Databricks' security and compliance certifications (SOC 2, ISO 27001/27017/27018, HIPAA, PCI-DSS, FedRAMP), and support security reviews and audits.
- Support incident response, vulnerability management, threat modeling, and red teaming using audit logs, cluster policies, and enhanced monitoring.
- Stay current on industry trends and emerging threats in GenAI, agentic AI flows, and MCPs to enhance our security posture.
- Advise executive leadership on security architecture, risks, and mitigation.
- Mentor security engineers and developers on secure design and best practices.

What we look for:
- Bachelor's degree in Computer Science, Information Security, Engineering, or a related field; a Master's degree in Computer Science with a focus on Information Security or a related discipline is strongly preferred.
- Minimum 12 years in cybersecurity, with 5+ in security architecture or senior technical roles. Experience with FedRAMP High systems / GovCloud preferred.
- Direct experience designing and securing enterprise platforms in complex multi-cloud environments; deep knowledge of enterprise architecture and security features (control plane / data plane separation, network infrastructure, workspace hardening, network segmentation and isolation); and hands-on experience automating security controls with Terraform and scripting.
- Proven expertise securing data analytics pipelines, SaaS integrations, and workload isolation in enterprise ecosystems.
- Experience with enterprise security analysis tools and monitoring / security policy optimization.
- Deep experience in threat modeling, design, PoC, and implementation of large-scale enterprise solutions.
- Extensive hands-on experience in AWS cloud security and network security, with knowledge of Zero Trust, data protection, and AppSec.
- Strong understanding of enterprise IAM systems (Okta, SailPoint, VDI, Entra ID) and data protection.
- Expert experience with SIEM platforms, XDR, and cloud-native threat detection tools.
- Expert in web application security, OWASP, API security, and secure design and testing.
- Hands-on experience with security automation is required, with proficiency in AI-assisted development, Python, Cursor, Lambda, Terraform, or comparable scripting/IaC tools for operational efficiency.
- Industry certifications like CISSP, CCSP, CEH, AWS Certified Security – Specialty, AWS Certified Solutions Architect – Professional, or AWS Certified Advanced Networking – Specialty (or equivalent) are preferred.
- Ability to influence stakeholders and drive alignment.
- Strategic thinker with a passion for security innovation, continuous improvement, and building scalable defenses.

Pay Range Transparency

Databricks is committed to fair and equitable compensation practices. The pay range(s) for this role is listed below and represents the expected salary range for non-commissionable roles or on-target earnings for commissionable roles. Actual compensation packages are based on several factors that are unique to each candidate, including but not limited to job-related skills, depth of experience, relevant certifications and training, and specific work location. Based on the factors above, Databricks anticipates utilizing the full width of the range.
The total compensation package for this position may also include eligibility for annual performance bonus, equity, and the benefits listed above. For more information regarding which range your location is in visit our page here.
Zone 1 Pay Range: $258,300 – $361,575 USD

About Databricks

Databricks is the data and AI company. More than 10,000 organizations worldwide, including Comcast, Condé Nast, Grammarly, and over 50% of the Fortune 500, rely on the Databricks Data Intelligence Platform to unify and democratize data, analytics, and AI. Databricks is headquartered in San Francisco, with offices around the globe, and was founded by the original creators of the lakehouse architecture, Apache Spark™, Delta Lake, and MLflow. To learn more, follow Databricks on Twitter, LinkedIn, and Facebook.
Benefits
At Databricks, we strive to provide comprehensive benefits and perks that meet the needs of all of our employees. For specific details on the benefits offered in your region, please visit https://www.mybenefitsnow.com/databricks.
Our Commitment to Diversity and Inclusion

At Databricks, we are committed to fostering a diverse and inclusive culture where everyone can excel. We take great care to ensure that our hiring practices are inclusive and meet equal employment opportunity standards. Individuals looking for employment at Databricks are considered without regard to age, color, disability, ethnicity, family or marital status, gender identity or expression, language, national origin, physical and mental ability, political affiliation, race, religion, sexual orientation, socio-economic status, veteran status, and other protected characteristics.

Compliance

If access to export-controlled technology or source code is required for performance of job duties, it is within Employer's discretion whether to apply for a U.S. government license for such positions, and Employer may decline to proceed with an applicant on this basis alone.
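As a toy illustration of the least-privilege access reviews this role calls for, here is a short sketch that flags IAM policy statements granting wildcard access. The function and its policy input are hypothetical, not a Databricks tool:

```python
import json

def find_wildcard_statements(policy_json: str) -> list:
    """Flag Allow statements whose Action or Resource is overly broad ('*' or 'service:*')."""
    policy = json.loads(policy_json)
    flagged = []
    for stmt in policy.get("Statement", []):
        # Normalize: Action / Resource may be a string or a list in IAM policy JSON.
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = stmt.get("Resource", [])
        resources = [resources] if isinstance(resources, str) else resources
        too_broad = any(a == "*" or a.endswith(":*") for a in actions) or "*" in resources
        if stmt.get("Effect") == "Allow" and too_broad:
            flagged.append(stmt.get("Sid", "<unnamed>"))
    return flagged
```

Run against a policy that mixes a scoped `s3:GetObject` grant with an `s3:*` grant, only the latter's `Sid` is returned; in practice checks like this run in CI against Terraform plan output.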
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 13, 2025
Manager, Super Intelligence HPC Support
Lambda AI
501-1000
USD
0
160000
-
282000
United States
Full-time
Remote
true
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU.
If you'd like to build the world's best deep learning cloud, join us.
About the role

We are looking for a hands-on and customer-focused leader to build and guide our Super Intelligence HPC Support Engineering team. This team partners directly with Lambda's largest and most complex customers: organizations operating hyperscale GPU clusters and mission-critical AI workloads at global scale.

As the Manager of this team, you'll be responsible for ensuring Lambda delivers world-class support to the most demanding environments in AI. You'll combine deep HPC technical expertise with strong leadership, enabling your engineers to solve the hardest problems while representing Lambda with credibility and confidence in high-stakes customer situations.

This role requires a balance of technical depth, customer engagement, and people leadership. You'll mentor a team of senior engineers, own critical escalations, and serve as the bridge between Support, Product, Engineering, and Sales for our Super Intelligence business unit. Your ability to set direction, motivate a high-performing team, and advocate for customer success will directly influence Lambda's reputation with the world's top AI companies.

This position reports to the Director of Support.

What You'll Do
- Lead & Develop: Build, coach, and mentor a team of Super Intelligence HPC Support Engineers, ensuring technical excellence and strong execution in customer-facing work.
- Escalation Ownership: Take point on high-visibility incidents and escalations with hyperscale customers, ensuring timely, transparent, and high-quality outcomes.
- Customer Advocacy: Represent the needs of Super Intelligence customers in cross-functional discussions, influencing product design and roadmap decisions to improve supportability.
- Incident Leadership: Guide your team through major incidents, driving consistency in communication, coordination, and resolution under pressure.
- Operational Excellence: Define and refine support processes, runbooks, and documentation tailored to hyperscale environments.
- Partnership: Collaborate closely with Product, Engineering, and Data Center teams to ensure Lambda delivers reliable, scalable solutions at the largest scales of deployment.
- Metrics & Accountability: Monitor team performance, drive improvements in SLA adherence, response/resolution quality, and customer satisfaction.
- Hands-On Leadership: Step in to troubleshoot complex issues and model the standard of excellence expected from your team.

You
- Proven track record leading technical support or engineering teams serving enterprise or hyperscale customers.
- Skilled at managing customer escalations and major incidents with clarity, confidence, and urgency.
- Deep expertise in HPC environments including GPU clusters, InfiniBand/RoCE networks, and Linux system administration.
- Ability to guide engineers through troubleshooting at scale, from orchestration (Slurm/Kubernetes) down to kernel-level debugging.
- Strong leadership presence: able to inspire, set direction, and build a culture of accountability and customer-first execution.
- Excellent communication skills, capable of engaging with both engineers and executive stakeholders.

Nice to have
- Advanced degree in Computer Science, Engineering, or a related field.
- Certifications in HPC, networking, or related technologies.
- Experience with Slurm, Kubernetes, InfiniBand, and other high-performance interconnects (RoCE, NVLink/NVSwitch).
- Background supporting Private Cloud environments or other dedicated enterprise clusters.
- Experience supporting enterprise AI workloads across startups and Fortune 500 companies.

Salary Range Information

The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast
- We offer generous cash & equity compensation
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401k plan with 2% company match (USA employees)
- Flexible Paid Time Off plan that we all actually use

A Final Note: You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer

Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
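The SLA adherence metric named under Metrics & Accountability can be computed in a few lines. A hypothetical sketch (the ticket fields and the 30-minute window are invented for illustration, not Lambda's actual SLAs):

```python
from datetime import datetime, timedelta

def first_response_sla_adherence(tickets, sla=timedelta(minutes=30)):
    """Fraction of tickets whose first response arrived within the SLA window."""
    if not tickets:
        return 1.0  # vacuously compliant when there is nothing to measure
    met = sum(1 for t in tickets if t["first_response"] - t["opened"] <= sla)
    return met / len(tickets)
```

A real support organization would compute this per severity tier and per customer, but the core is the same ratio of on-time responses to total tickets.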
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 13, 2025
Infrastructure Engineer
Sana
501-1000
0
0
-
0
Sweden
Full-time
Remote
false
About Sana

We're on a mission to revolutionize how humans access knowledge through artificial intelligence. Throughout history, breakthroughs in knowledge sharing, from the Library of Alexandria to the printing press to Google, have been pivotal drivers of human progress. Today, as the volume of human knowledge grows exponentially, making it accessible and actionable remains one of humanity's most critical challenges. We're building a future where knowledge isn't just more accessible; it's a catalyst for achieving the previously impossible. If all of this sounds exciting, you're in the right place.
About the role

As an Infrastructure Engineer, you will support our engineering teams by building and maintaining the technical foundation that enables our products to scale. You'll work on cloud infrastructure, deployment systems, and developer tooling, serving as a technical partner to other engineering teams while focusing on reliability and performance.

In this role, you will
- Be the backbone of our ambitious goals, ensuring our infrastructure is robust and scalable.
- Support feature development teams with infrastructure decisions and deployments.
- Act as Site Reliability Engineer (SRE) and continuously enhance our Developer Experience (DX).
- Design and implement scalable cloud infrastructure solutions.

Your background looks something like
- Proficiency in both backend engineering and cloud-based deployments.
- Experience with highly available, scalable, and extensible backend systems.
- Proficiency in GCP and Kubernetes.
- Track record of supporting development teams with infrastructure solutions.
- Understanding of site reliability engineering principles.

What We Offer
- Help shape AI's future alongside brilliant minds from Notion, Dropbox, Slack, Databricks, Google, McKinsey, and BCG.
- Competitive salary complemented with a transparent and highly competitive options program.
- Swift professional growth in an evolving environment, supported by a culture of continuous feedback and mentorship from senior leaders.
- Work with talented teammates across 5+ countries, and collaborate with customers globally.
- Regular team gatherings and events (recently in Italy and South Africa).
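One of the SRE principles referenced above is the error budget: the amount of downtime an availability SLO permits over a window. A minimal illustrative sketch (not Sana-specific):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime implied by an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)
```

A 99.9% SLO over 30 days leaves roughly 43 minutes of budget; when incidents burn through it, teams typically pause feature launches in favor of reliability work.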
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 13, 2025
Tech Lead - Platform
Basis AI
51-100
USD
100000
-
300000
United States
Full-time
Remote
false
About Basis

Basis equips accountants with a team of AI agents to take on real workflows. We have hit product-market fit, have more demand than we can meet, and just raised $34m to scale at a speed that meets this moment. Built in New York City. Read more about Basis here.

About the Team

The Platform Engineering team at Basis designs, builds, and operates the infrastructure that powers our AI research and products. We're a lean team that loves architecting large-scale distributed systems from first principles. We obsess over clarity: clean abstractions, simple mental models, and crisp interfaces that let our AI and product teams move fast without breaking things. We're not building features; we're building foundations for an AI accountant. That means modeling the world the agent lives in (accounting concepts, workflows, and constraints) and providing scalable, observable, and reliable systems that everything else depends on.

About the Role

As a Tech Lead on the Platform team, you'll hold the technical vision for a core slice of Basis's infrastructure: how we deploy, model, and serve the data and systems our AI depends on. You'll design elegant architectures, make trade-offs explicit, and teach others how to reason about distributed systems with clarity and rigor. You'll drive coherence across runtime, data, and schema layers, so our systems scale predictably and remain legible as we grow. You'll lead by example through code, design reviews, and decision records, ensuring the platform is not just powerful, but beautiful in its simplicity.

What you'll be doing:

1. Architect and evolve our infrastructure foundations
- Design scalable, cost-efficient services across compute, storage, and networking.
- Define deployment and runtime patterns (containers, orchestration, IaC, secrets, CI/CD).
- Build systems for observability and reliability: metrics, logs, traces, SLOs, and recovery patterns.
- Lead postmortems, define error budgets, and ensure operational excellence becomes a habit.

2. Build and standardize our data platform
- Architect data pipelines that ingest, validate, and transform accounting data into clean, reliable datasets.
- Define schemas and data contracts that balance flexibility with correctness.
- Encode validation, lineage, and drift detection as first-class citizens in every pipeline.
- Build interfaces that make data discoverable, computable, and observable end-to-end.

3. Model the domain as a system
- Translate accounting concepts into well-structured ontologies: entities, relationships, and invariants.
- Create abstractions that allow AI systems to reason safely about real-world constraints.
- Design for legibility: make complex workflows understandable through schema, code, and documentation.

4. Lead through clarity and technical excellence
- Hold the architectural vision for your area and ensure it stays coherent over time.
- Run crisp design reviews that challenge assumptions and drive alignment.
- Mentor engineers on reasoning about systems: from load testing to schema design to observability patterns.
- Simplify aggressively, removing accidental complexity and enforcing clean, stable abstractions.
📍 Location: NYC, Flatiron office. In-person team.

What success looks like in this role
- Architect: The systems you design scale cleanly and are easy for others to reason about.
- Integrator: Platform, ML, and product systems fit together through clear contracts and conventions.
- Teacher: Your design reviews, docs, and code elevate how others think about architecture.
- Operator: You make reliability measurable and downtime boring.
- Builder: You approach every decision with clarity, conviction, and calm.

In accordance with New York State regulations, the salary range for this position is $100,000 – $300,000. This range represents our broad compensation philosophy and covers various responsibility and experience levels. Additionally, all employees are eligible to participate in our equity plan and benefits program. We are committed to meritocratic and competitive compensation.
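A data contract of the kind this role defines can start as nothing more than required fields plus expected types. A hypothetical sketch (field names are invented for illustration, not Basis's schema):

```python
# Illustrative contract for a ledger-entry record: field name -> expected type.
CONTRACT = {"entry_id": str, "account": str, "amount_cents": int}

def contract_violations(record: dict, contract: dict = CONTRACT) -> list:
    """List every way a record breaks the contract: missing fields or wrong types."""
    errors = [f"missing field: {f}" for f in contract if f not in record]
    errors += [
        f"wrong type for {f}: expected {t.__name__}, got {type(record[f]).__name__}"
        for f, t in contract.items()
        if f in record and not isinstance(record[f], t)
    ]
    return errors
```

In a real pipeline the violations would be emitted as metrics and block bad batches at the ingestion boundary rather than letting them propagate downstream.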
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 10, 2025
Member of Technical Staff - Platform
Basis AI
51-100
USD
100000
-
300000
United States
Full-time
Remote
false
About Basis

Basis equips accountants with a team of AI agents to take on real workflows. We have hit product-market fit, have more demand than we can meet, and just raised $34m to scale at a speed that meets this moment. Built in New York City. Read more about Basis here.

About the Team

The Platform Engineering team at Basis designs, builds, and operates the infrastructure that powers our AI research and products. We're a lean, deeply technical group that loves architecting large-scale distributed systems from first principles. We obsess over clarity: clean abstractions, simple mental models, and crisp interfaces that let our AI and product teams move fast without breaking things. We're not building "features." We're building capabilities for an AI accountant: scalable services, efficient data pipelines, and end-to-end observability that make the entire company move faster.

About the Role

As a Platform Engineer at Basis, you'll own projects end-to-end, from scoping to delivery. You'll act as the Responsible Party (RP) for the systems you design, meaning you're empowered to decide how to build them, how to measure success, and when they're ready to ship. You won't be managed; you'll be trusted. You'll plan your own projects, collaborate closely with your pod, and take full accountability for execution and quality. You'll build systems that serve every part of Basis: AI, product, and internal agents alike. And you'll make those systems fast, reliable, and legible.

What you'll be doing:

1. Build and scale our core infrastructure
- Architect and scale services for reliability, cost, and growth.
- Define deployment and runtime patterns (containers, orchestration, IaC, secrets, CI/CD).
- Design observability systems (metrics, logs, traces) and make recovery paths routine.
- Own your services in production: understand their behavior and continuously improve them.

2. Design the data systems that power our AI
- Build pipelines that ingest, transform, and deliver high-quality financial data.
- Encode validation, lineage, and drift detection directly into workflows.
- Serve clean, well-modeled datasets to agents and product surfaces with clear contracts.
- Continuously simplify: turn complexity into elegant, composable systems.

3. Model the world our AI operates in
- Capture accounting entities, relationships, and invariants as data models and schemas.
- Debate and design the abstractions that make reasoning and automation possible.
- Build ontologies that encode domain rules: what agents can and cannot do.

4. Operate with ownership and clarity
- Plan and execute your projects as an RP: scoping, architecting, implementing, and delivering.
- Document decisions clearly so others can build on your work.
- Run your own postmortems and continuously improve how you build.
- Collaborate with your pod (Platform, ML, Product) to align on priorities and unblock quickly.

📍 Location: NYC, Flatiron office. In-person team.

What success looks like in this role
- Owner: You run your projects from concept to production and hold the bar for quality.
- Engineer: Your systems are simple, scalable, observable, and elegant.
- Partner: Other teams build faster because of the clarity of your abstractions.
- First-principles thinker: You don't just follow patterns; you understand why they exist.
- Builder: You work with conviction, communicate clearly, and raise the bar for those around you.

In accordance with New York State regulations, the salary range for this position is $100,000 – $300,000. This range represents our broad compensation philosophy and covers various responsibility and experience levels. Additionally, all employees are eligible to participate in our equity plan and benefits program. We are committed to meritocratic and competitive compensation.
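Drift detection, mentioned above as something to encode directly into workflows, can start as simply as comparing a batch's mean to a baseline. An illustrative sketch (the threshold and inputs are hypothetical, not Basis's method):

```python
from statistics import mean, stdev

def has_drifted(baseline: list, current: list, z_threshold: float = 3.0) -> bool:
    """Flag drift when the current batch mean sits more than z_threshold
    standard errors away from the baseline mean."""
    std_err = stdev(baseline) / len(baseline) ** 0.5
    return abs(mean(current) - mean(baseline)) > z_threshold * std_err
```

Production systems usually go further (distribution-level tests, per-field checks, alerting on schema changes), but a mean-shift guard like this already catches many upstream data breakages.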
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 10, 2025
Mechatronics Scientist
Maincode
11-50
AUD
0
150000
-
180000
Australia
Full-time
Remote
false
About Maincode

Maincode is an AI research and engineering company and home to Matilda, Australia's first and only large language model trained from scratch. We operate our own advanced AI infrastructure in local data-centres, designing, building, and running everything from the racks up through model training and serving systems. We're building an AI Model Factory, a new kind of compute infrastructure that powers frontier AI research and production systems that work at scale.
About the RoleDespite the title, this role isn’t really about traditional mechatronics. We’re looking for something that doesn’t quite exist in the market: an AI Model Factory Architect.This role sits at the intersection of physical systems, large-scale compute, and AI research. You’ll help design and operate the infrastructure that trains large-scale models: racks of GPUs, high-performance networks, and the orchestration systems that keep them running.People with strong research instincts and systems thinking thrive here. You don’t need prior datacenter experience; curiosity, first-principles reasoning, and the ability to learn fast matter more. Someone from mechatronics, robotics, or aerospace who has strong computational skills could excel in this role. We’ll give you the time, mentorship, and resources to master how these systems work, and you’ll develop expertise in one of the most essential technical domains of the coming decade.
What You’ll DoLearn to design, deploy, and operate the backbone of AI infrastructure.Work with GPU clusters, high-speed networking, and large-scale Linux systems.Contribute to the architecture of Australia’s first production-scale AI datacentre systems.Collaborate with AI researchers and engineers to shape the next generation of model training environments.Develop new operational and architectural approaches from first principles.
You Might Be a Fit If
- You have a background in a technical or scientific field (mechatronics, robotics, aerospace, physics, or similar) and strong computational skills (e.g., modelling and numerical simulation, automation, MPI or HPC environments, GPU and accelerator programming, systems programming, or Linux).
- You enjoy understanding complex systems from first principles.
- You learn new technologies fast and apply them independently.
- You think in terms of system architecture: how components interact and scale.
- You’re curious about how AI actually runs, from circuits and cooling to training loops.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 9, 2025
Robotics Scientist
Maincode
11-50
AUD
150000
-
180000
Australia
Full-time
Remote
false
About Maincode
Maincode is an AI research and engineering company and home to Matilda, Australia’s first and only large language model trained from scratch. We operate our own advanced AI infrastructure in local data centres, designing, building, and running everything from the racks up through model training and serving systems. We’re building an AI Model Factory, a new kind of compute infrastructure that powers frontier AI research and production systems at scale.
About the Role
Despite the title, this role isn’t really about robotics. We’re looking for something that doesn’t quite exist in the market: an AI Model Factory Architect. This role sits at the intersection of physical systems, large-scale compute, and AI research. You’ll help design and operate the infrastructure that trains large-scale models: racks of GPUs, high-performance networks, and the orchestration systems that keep them running. People with strong research instincts and systems thinking thrive here. You don’t need prior datacentre experience; curiosity, first-principles reasoning, and the ability to learn fast matter more. Someone from robotics, aerospace, or similar fields with strong computational skills could excel in this role. We’ll give you the time, mentorship, and resources to master how these systems work, and you’ll develop expertise in one of the most essential technical domains of the coming decade.
What You’ll Do
- Learn to design, deploy, and operate the backbone of AI infrastructure.
- Work with GPU clusters, high-speed networking, and large-scale Linux systems.
- Contribute to the architecture of Australia’s first production-scale AI datacentre systems.
- Collaborate with AI researchers and engineers to shape the next generation of model training environments.
- Develop new operational and architectural approaches from first principles.
You Might Be a Fit If
- You have a background in a technical or scientific field (robotics, mechatronics, aerospace, physics, or similar) and strong computational skills (e.g., modelling, numerical simulation, automation, MPI or HPC environments, GPU and accelerator programming, systems programming, or Linux).
- You enjoy understanding complex systems from first principles.
- You learn new technologies fast and apply them independently.
- You think in terms of system architecture: how components interact and scale.
- You’re curious about how AI actually runs, from circuits and cooling to training loops.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 9, 2025
HPC Support Engineer
Lambda AI
501-1000
USD
137000
-
206000
United States
Full-time
Remote
true
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU.
If you'd like to build the world's best deep learning cloud, join us.
This position is part of our 24/7 coverage model and works one of the following schedules:
- Monday – Friday, 8AM – 5PM Pacific Time
- Sunday – Wednesday, 12PM – 11PM Pacific Time
- Wednesday – Saturday, 12PM – 11PM Pacific Time
What You’ll Do
- Engage directly with customers to deeply understand their challenges, ensuring a personalized and effective support experience.
- Dive into complex software and hardware issues, providing timely and efficient solutions.
- Craft comprehensive documentation of solutions and contribute to enhancing support procedures, ensuring continuous improvement in service quality.
- Identify common customer pain points and collaborate closely with engineering teams to develop innovative solutions, constantly improving the overall customer experience.
- Collaborate in the development of new and existing products, contributing your expertise to shape the future of deep learning cloud and HPC infrastructure.
- Take escalations from your peers while looking for opportunities to train and educate them in the process.
- Work cross-functionally on project work, focusing on creating and improving support tooling.
- Participate in a rotating on-call schedule where you’ll be responsible for major incidents and major customer alerts and issues.
You
- 7+ years in cloud support operations or systems engineering.
- Strong experience with public cloud platforms (AWS, Azure, GCP) or GPU cloud providers.
- Very strong understanding of and experience with Linux (Ubuntu) system administration.
- Proven experience in HPC environments, showcasing your expertise in Linux cluster administration, with strong preference for Kubernetes and/or Slurm for cluster orchestration.
- Proficiency with monitoring/logging tools (Prometheus, Grafana, Datadog).
- Strong skills in log analysis, debugging kernel-level issues, and performance profiling.
- Experience with CUDA, NCCL, NVLink, MIG, and GPUDirect RDMA.
- Experience with high-throughput networking technologies (IB/RoCE).
- Experience with virtualization and container technologies (Docker, Kubernetes).
- Knowledge of distributed AI/ML or HPC workloads.
- Knowledge of TCP/IP, VPN, and firewalls in cloud environments.
- Ability to work independently and mentor junior support engineers.
Nice to Have
- Very strong experience with Python, including venv, conda, pyenv, and other Python virtual environments.
- Flexible availability for potential shifts outside of normal working hours/weekends.
- Experience with storage providers and technologies (VAST, Ceph, Lustre, Weka, DDN).
- Familiarity with infrastructure-as-code tools (Terraform, Puppet, Ansible, Chef, etc.).
- NVIDIA and InfiniBand certifications.
Salary Range Information
This is a salaried, non-exempt role, eligible for overtime. The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast
- We offer generous cash & equity compensation
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401(k) plan with 2% company match (USA employees)
- Flexible paid time off plan that we all actually use
A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.
Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 9, 2025
Founding Site Reliability Engineer
Relevance AI
101-200
-
United States
Full-time
Remote
false
Location 📍: San Francisco, USA (Hybrid 3 days/week)
About Us 🚀
At Relevance AI, our mission is to empower anyone to delegate work to the AI workforce. We’re building a new category of AI automation, enabling teams to create and deploy intelligent AI agents that replicate human-quality work, decision-making, and collaboration at scale. We’re scaling fast, backed by top global investors including Bessemer Venture Partners, Insight Partners, Peak XV, and King River Capital, and our platform is already trusted by industry leaders like Canva, Databricks, Confluent, KPMG, Autodesk, and more. With offices in Sydney 🇦🇺 and San Francisco 🇺🇸 (and a new hub launching in Barcelona 🇪🇸), this is your chance to shape the future of work on a global stage.
The Role 🧠
We’re looking for a Founding Site Reliability Engineer to join us as our first SRE hire in San Francisco. We are open to hiring at the Senior, Lead, or Principal level; the scope will be candidate-led. This role is perfect for someone ready to establish and scale the SRE discipline from the ground up at one of the fastest-growing AI companies globally. You’ll own the reliability, scalability, and security of our platform as we power tens of thousands of multi-agent workloads across multiple regions.
You’ll partner closely with our founders, engineering leads, and product teams to define our reliability culture, shape long-term strategy, and build world-class infrastructure for enterprise scale.
What You’ll Do 💪
- Own SRE, establishing best practices, tooling, and culture
- Tackle reliability challenges unique to multi-agent orchestration at enterprise scale
- Guarantee >99.9% uptime of production systems, ensuring reliability at global scale
- Architect and automate AWS infrastructure with Terraform and CI/CD pipelines
- Design observability systems across microservices, APIs, and vector infrastructure (metrics, tracing, logging)
- Drive down incidents and MTTR through runbooks, alerting, and incident-response excellence
- Help scale infra to support hundreds of thousands of agents and billions of API calls
- Partner with engineering teams to embed SRE principles into the SDLC and shape org-wide reliability strategy
- Act as a founding voice in our SF office, influencing product direction and engineering culture
What We’re Looking For 🧠
- 5+ years in SRE/DevOps/infrastructure roles, with experience in enterprise SaaS environments.
- Deep AWS expertise (EC2, ECS/EKS, Lambda, RDS, VPC, IAM).
- Proven track record with infrastructure as code (Terraform, Kubernetes/EKS, CDK, or CloudFormation).
- Hands-on with observability stacks (CloudWatch, Grafana, Prometheus, Datadog).
- Incident management experience in production SaaS systems, including on-call, postmortems, and reliability improvements.
- Bonus: prior exposure to AI/ML platforms, data-heavy systems, or multi-agent workloads.
Tech Stack 🧰
AWS, Kubernetes/EKS, Terraform, GitHub Actions, Postgres/Mongo, Prometheus/Grafana, CloudWatch, PagerDuty/BetterStack
MLOps / DevOps Engineer
Data Science & Analytics
Apply
October 9, 2025
Site Reliability Engineer (SRE)
Baseten
101-200
USD
150000
-
250000
United States
Full-time
Remote
true
ABOUT BASETEN
Baseten powers inference for the world's most dynamic AI companies, like OpenEvidence, Clay, Mirage, Gamma, Sourcegraph, Writer, Abridge, Bland, and Zed. By uniting applied AI research, flexible infrastructure, and seamless developer tooling, we enable companies operating at the frontier of AI to bring cutting-edge models into production. With our recent $150M Series D funding, backed by investors including BOND, IVP, Spark Capital, Greylock, and Conviction, we’re scaling our team to meet accelerating customer demand.
THE ROLE
As a Site Reliability Engineer, you'll envision and build robust systems and processes that ensure our infrastructure is scalable, reliable, and efficient. This can range from automating deployments and monitoring systems to optimizing performance and managing incidents. We all work closely with our users, learning from their past struggles in operationalizing ML, onboarding them onto our platform, and turning our learnings into ideas for improving Baseten.
EXAMPLE INITIATIVES
You'll get to work on these types of projects as part of our Infrastructure team:
- Multi-cloud capacity management
- Inference on B200 GPUs
- Multi-node inference
- Fractional H100 GPUs for efficient model serving
RESPONSIBILITIES
- Build and maintain scalable infrastructure to support the deployment and operation of machine learning models.
- Establish standards and best practices for reliability and performance across the infrastructure.
- Automate processes when relevant, particularly for managing CI/CD pipelines.
- Own products and projects end-to-end, functioning as both an engineer and a project manager, with a focus on user empathy, project specification, and end-to-end execution.
- Collaborate with cross-functional teams to understand project requirements and translate them into technical solutions.
- Mentor junior team members and contribute to knowledge sharing within the organization.
- Navigate ambiguity and exercise good judgment on tradeoffs and the tools needed to solve problems, avoiding unnecessary complexity.
- Demonstrate pride, ownership, and accountability for your work, expecting the same from your teammates.
REQUIREMENTS
- Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field.
- 5+ years of professional work experience in a fast-paced, high-growth environment.
- Extensive experience with Kubernetes.
- Experience in building and maintaining scalable infrastructure.
- Experience with infrastructure-as-code tools (e.g., Terraform, CloudFormation, Pulumi) and CI/CD tooling (e.g., GitHub Actions, GitLab CI, CircleCI, Jenkins).
- Relevant OSS observability experience (Prometheus, ELK stack, Grafana stack, OpenTelemetry) is a plus.
- Ability to own projects end-to-end, from project specification to execution.
- No prior machine learning experience required, but you should be open to learning about it.
BENEFITS
- Competitive compensation package.
- A unique opportunity to be part of a rapidly growing startup in one of the most exciting engineering fields of our era.
- An inclusive and supportive work culture that fosters learning and growth.
- Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
Apply now to embark on a rewarding journey in shaping the future of AI! If you are a motivated individual with a passion for machine learning and a desire to be part of a collaborative and forward-thinking team, we would love to hear from you.
At Baseten, we are committed to fostering a diverse and inclusive workplace. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, or veteran status.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
October 9, 2025
Engineering Manager, HPC Deployments
Lambda AI
501-1000
USD
267000
-
486000
United States
Full-time
Remote
false
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU.
If you'd like to build the world's best deep learning cloud, join us.
*Note: This position requires presence in our San Francisco/San Jose or Seattle office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.
Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.
About the Role
Engineering at Lambda is responsible for building, operating, scaling, and maintaining our AI Cloud offerings. The HPC Deployments team is responsible for deploying cutting-edge NVIDIA GPU clusters on time, at scale, and with 100% quality and correctness. Reporting to the Director of Fleet Engineering, you will lead and scale one of our HPC Deployments teams. This work is highly cross-functional and critical to the timely success of our customers. Your team is focused on building and validating clusters deployed across our data center facilities. You will work collaboratively with Product and Infrastructure engineering teams to improve transparency, metrics, automation, and overall efficiency for the team.
We value diverse backgrounds, experiences, and skills, and we are excited to hear from candidates who can bring unique perspectives to our team. If you do not exactly meet this description but believe you may be a good fit, please still apply and help us understand your readiness for this Manager role. Your application is not a waste of our time.
What You’ll Do
- Lead and grow a distributed, top-talent team of HPC engineers responsible for the configuration, validation, and deployment of large-scale GPU clusters.
- Work cross-functionally with teams in the organization to deliver projects and deployments on time, ensuring alignment across stakeholders.
- Identify opportunities for efficiency improvements in the tools, processes, and automation that the team relies upon day to day.
- Ensure stakeholders have clear visibility into deployment progress, risks, and outcomes.
- Drive outcomes by managing staff allocations, project priorities, deadlines, and deliverables.
- Conduct regular one-on-one meetings, provide constructive feedback, and support career development for team members.
- Stay current on the latest HPC/AI technologies and best practices.
- Participate in the qualification efforts of new technologies for use in our production deployments.
You
- Have extensive experience in HPC or large-scale infrastructure, including at least 3 years in a leadership or management role.
- Work well under deadlines and structured project plans; are able to successfully (and tactfully) negotiate changes to project timelines.
- Have excellent problem-solving and troubleshooting skills.
- Can effectively collaborate with peer engineering managers to coordinate efforts that may impact deployment operations.
- Are comfortable leading and mentoring HPC engineers on cluster deployments as needed.
- Have experience building a high-performance team through deliberate hiring, upskilling, planned skills redundancy, performance management, and expectation setting.
- Have flexibility to travel to our North American data centers as on-site needs arise or as part of training exercises.
Nice to Have
- Experience with Linux systems administration, automation, and scripting/coding.
- Experience with containerization technologies (Docker, Kubernetes).
- Experience working with the technologies that underpin our cloud business (GPU acceleration, virtualization, and cloud computing).
- Experience with machine learning and deep learning frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf).
- Soft skills (customer awareness, diplomacy).
- Bachelor’s degree or equivalent experience in a technical field.
Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast
- We offer generous cash & equity compensation
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401(k) plan with 2% company match (USA employees)
- Flexible paid time off plan that we all actually use
A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.
Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 9, 2025
Senior Platform Engineer (Agents)
Sana
501-1000
-
Sweden
Full-time
Remote
false
About Sana
At Sana, we're on a mission to bring superintelligence to work: a seamless, beautiful way to access all your company’s apps, knowledge, and data. We are obsessed with making AI agents do real work, empowering our users to process and act on knowledge at an unprecedented scale. We’re a talent-dense team of engineers and designers in Stockholm, united by a passion for building great products with deep technical excellence. Our backgrounds span AI research at Google, distributed systems at Spotify, product design and AI infrastructure at Apple, competitive programming, and scaling fast-growing startups. At Sana, everyone is an owner. We cut out processes, maximize impact, and value rapid prototyping and iteration. Our AI tools already help over a million people learn and work better across hundreds of leading enterprises. But we’re just getting started; there’s so much more to build. Join us to tackle some of the most challenging and meaningful problems of our generation.
About the role
You’ll own the reliability, security, and scalability of Sana’s platform while leveling up developer velocity. This is the place for engineers who thrive on moving between cloud infrastructure, deployment, and tooling to unblock everyone else. You’ll work on everything from multi-tenant scale-out to developer experience, ensuring our platform is robust, secure, and a joy to build on.
In this role, you will
- Drive reliability, security, and scalability across the platform
- Architect and implement multi-tenant scale-out (10–100x) and observability solutions
- Lead incident response, performance profiling, and cost-control initiatives
- Build and maintain CI/CD pipelines, test infrastructure, and environment provisioning
- Work with engineers across the stack to remove friction and accelerate delivery
What success looks like
- The platform scales seamlessly as usage grows 10–100x, with robust observability and incident response
- Security and compliance are built in, not bolted on
- Developer velocity is tangibly higher: teams ship faster, with fewer blockers
- CI/CD, test infra, and development environments are reliable, discoverable, and loved by engineers
- Cost controls and performance budgets are visible and actionable
- You are a go-to partner for infra, deployment, and tooling challenges, unblocking others and raising the bar for engineering excellence
Our tech stack
We build on a simple, modern stack optimized for both humans and AI:
- Backend: TypeScript, Node.js
- Frontend: TypeScript, React, Tailwind
- Databases: Postgres, Redis
- Cloud infra: GCP/Kubernetes/Terraform
What We Offer
- Help shape AI's future alongside brilliant minds from Notion, Dropbox, Slack, Databricks, Google, McKinsey, and BCG.
- Competitive salary complemented by a transparent and highly competitive options program.
- Swift professional growth in an evolving environment, supported by a culture of continuous feedback and mentorship from senior leaders.
- Work with talented teammates across 5+ countries, and collaborate with customers globally.
- Regular team gatherings and events (recently in Italy and South Africa).
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 9, 2025
Senior Platform Engineer (Learn)
Sana
501-1000
-
Sweden
Full-time
Remote
false
About Sana
At Sana, we’re building Learn: an AI-powered learning platform that combines the simplicity of a modern LMS with intelligent features like an AI tutor, automated content generation, and interactive apps. Our goal is to help enterprises make knowledge not just accessible, but actionable at scale. We’re a product-obsessed team of engineers and designers from companies like Google, Spotify, and Databricks, united by a focus on technical excellence and rapid iteration. With Learn already supporting hundreds of leading enterprises, we’re just getting started, and there’s much more to build.
About the role
You’ll own the reliability, security, and scalability of Sana’s platform while leveling up developer velocity. This is the place for engineers who thrive on moving between cloud infrastructure, deployment, and tooling to unblock everyone else. You’ll work on everything from multi-tenant scale-out to developer experience, ensuring our platform is robust, secure, and a joy to build on.
In this role, you will
- Drive reliability, security, and scalability across the platform
- Architect and implement multi-tenant scale-out (10–100x) and observability solutions
- Lead incident response, performance profiling, and cost-control initiatives
- Build and maintain CI/CD pipelines, test infrastructure, and environment provisioning
- Work with engineers across the stack to remove friction and accelerate delivery
What success looks like
- The platform scales seamlessly as usage grows 10–100x, with robust observability and incident response
- Security and compliance are built in, not bolted on
- Developer velocity is tangibly higher: teams ship faster, with fewer blockers
- CI/CD, test infra, and development environments are reliable, discoverable, and loved by engineers
- Cost controls and performance budgets are visible and actionable
- You are a go-to partner for infra, deployment, and tooling challenges, unblocking others and raising the bar for engineering excellence
Our tech stack
We build on a simple, modern stack optimized for both humans and AI:
- Backend: TypeScript, Kotlin, Node.js
- Frontend: TypeScript, React, Tailwind
- Databases: Postgres, Redis
- Cloud infra: GCP/Kubernetes/Terraform
What We Offer
- Help shape AI's future alongside brilliant minds from Notion, Dropbox, Slack, Databricks, Google, McKinsey, and BCG.
- Competitive salary complemented by a transparent and highly competitive options program.
- Swift professional growth in an evolving environment, supported by a culture of continuous feedback and mentorship from senior leaders.
- Work with talented teammates across 5+ countries, and collaborate with customers globally.
- Regular team gatherings and events (recently in Italy and South Africa).
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 9, 2025
Engineering Manager, Capacity
Anthropic
1001-5000
USD
365000
-
565000
United States
Full-time
Remote
false
About Anthropic
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.
About the role
Anthropic’s Capacity team is looking for an Engineering Manager to own and manage cloud spend across a massively scaled, multi-cloud environment. You’ll work closely with research, engineering, and finance teams to ensure we have scalable systems for capacity management, high-quality data and insights for planning, and engineering roadmaps that deliver efficiency wins.
Responsibilities:
- Design, develop, and deliver capacity management systems for AI workloads on heterogeneous infrastructure
- Build and maintain robust attribution of usage and enable in-depth, actionable, data-driven insights
- Build a deep understanding of research and training workloads to accurately forecast infrastructure needs
- Oversee design and implementation of forecasting tools and software systems for managing billions of dollars in spend
- Proactively identify efficiency opportunities and collaborate with teams across the org to increase effective capacity for Anthropic
- Partner closely with Finance and leadership, providing detailed and clear capacity inputs for financial planning and strategic decision making
You may be a good fit if you:
- Have experience managing $XXXM to $XB in infrastructure spend
- Have experience working with public clouds (AWS, GCP, Azure, etc.) and/or hybrid on-prem/cloud environments
- Have experience setting up capacity management systems that scale with growing organizations
- Are comfortable leveraging data and have experience building observability for complex systems
- Have strong interpersonal skills that enable you to influence and build cross-organizational support for capacity initiatives
- Have familiarity with LLMs and a deep interest in learning more about research and model training workloads
Strong candidates may also have some of the following:
- Past experience managing capacity for AI research and production workloads
- Past experience partnering with senior leadership, both technical and non-technical, to drive company-level reporting and decision making
The expected base compensation for this position is below. Our total compensation package for full-time employees includes equity, benefits, and may include incentive compensation.
Annual Salary: $365,000–$565,000 USD
Logistics
Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience.
Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.
Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.
We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work. We think AI systems like the ones we're building have enormous social and ethical implications. We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team.
How we're different
We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact, advancing our long-term goals of steerable, trustworthy AI, rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We're an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication skills. The easiest way to understand our research directions is to read our recent research.
This research continues many of the directions our team worked on prior to Anthropic, including: GPT-3, Circuit-Based Interpretability, Multimodal Neurons, Scaling Laws, AI & Compute, Concrete Problems in AI Safety, and Learning from Human Preferences.
Come work with us!
Anthropic is a public benefit corporation headquartered in San Francisco. We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues.
Guidance on Candidates' AI Usage: Learn about our policy for using AI in our application process
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 9, 2025
Member of Technical Staff - GPU Infrastructure
Reflection
1-10
-
United States
Full-time
Remote
false
Our Mission
Reflection’s mission is to build open superintelligence and make it accessible to all. We’re developing open-weight models for individuals, agents, enterprises, and even nation states. Our team of AI researchers and company builders come from DeepMind, OpenAI, Google Brain, Meta, Character.AI, Anthropic, and beyond.
About the Role
Design, build, and operate Reflection’s large-scale GPU infrastructure powering pre-training, post-training, and inference.
Develop reliable, high-performance systems for scheduling, orchestration, and observability across thousands of GPUs.
Optimize cluster utilization, throughput, and cost efficiency while maintaining reliability at scale.
Build tools and automation for distributed training, inference, monitoring, and experiment management.
Collaborate closely with research, training, and platform teams to accelerate development and enable large-scale training and inference.
Push the limits of hardware, networking, and software to accelerate the path from idea to model.
About You
Deep systems or infrastructure engineering experience in high-performance or distributed computing environments.
Strong understanding of GPUs, CUDA, NCCL, and large-scale training and inference frameworks and libraries (PyTorch, DeepSpeed, JAX, Megatron-LM, SGLang, vLLM, etc.).
Hands-on experience with containerization, orchestration, and cluster management (Kubernetes, Slurm, or similar).
Familiarity with modern observability stacks and performance profiling tools.
High agency and the ability to thrive in a fast-paced, high-ownership startup environment.
Excitement to build from zero to one, defining how frontier-scale training/RL infrastructure is architected and operated.
Motivation to enable researchers and engineers to build the world’s most capable open-weight AI systems.
What We Offer
We believe that to build superintelligence that is truly open, you need to start at the foundation.
Joining Reflection means building from the ground up as part of a small, talent-dense team. You will help define our future as a company and help define the frontier of open foundation models. We want you to do the most impactful work of your career with the confidence that you and the people you care about most are supported.
Top-tier compensation: salary and equity structured to recognize and retain the best talent globally.
Health & wellness: comprehensive medical, dental, vision, life, and disability insurance.
Life & family: fully paid parental leave for all new parents, including adoptive and surrogate journeys. Financial support for family planning.
Benefits & balance: paid time off when you need it, relocation support, and more perks that optimize your time. Opportunities to connect with teammates: lunch and dinner are provided daily, and we have regular off-sites and team celebrations.
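As a toy illustration of the cluster-utilization accounting this kind of role involves, here is a minimal Python sketch. The job names, GPU counts, and window are hypothetical, not anything from the posting:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus: int     # GPUs held by the job
    hours: float  # wall-clock hours the job ran inside the window

def utilization(jobs: list[Job], total_gpus: int, window_hours: float) -> float:
    """Fraction of available GPU-hours actually consumed in the window."""
    used = sum(j.gpus * j.hours for j in jobs)
    return used / (total_gpus * window_hours)

# Hypothetical example: a 1024-GPU cluster over a 24-hour window.
jobs = [Job("pretrain-run", 512, 24.0), Job("eval-sweep", 64, 6.0)]
print(utilization(jobs, total_gpus=1024, window_hours=24.0))  # 0.515625
```

Real schedulers track this per partition and per tenant, but the headline number teams optimize is essentially this ratio.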
MLOps / DevOps Engineer
Data Science & Analytics
Apply
October 7, 2025
Deployment Engineer, AI Inference
Cerebras Systems
501-1000
-
United States
Canada
Remote
false
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs. Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services.
About The Role
We are seeking a highly skilled and experienced Deployment Engineer to build and operate our cutting-edge inference clusters. In this role you will work with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power. You will play a critical role in ensuring reliable, efficient, and scalable deployment of AI inference workloads across our global infrastructure. On the operational side, you’ll own the rollout of new software versions and AI replica updates, along with capacity reallocations across our custom-built, high-capacity datacenters.
Beyond operations, you’ll drive improvements to our telemetry, observability, and fully automated deployment pipeline. This role involves working with advanced allocation strategies to maximize utilization of large-scale compute fleets. The ideal candidate combines hands-on operational rigor with strong systems engineering skills and thrives on building resilient pipelines that keep pace with cutting-edge AI models. This role does not require a 24/7 on-call rotation.
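A production constraint solver is well beyond a job ad, but the flavor of the replica-placement and capacity-allocation problem described above can be sketched with a greedy heuristic. The cluster names and capacity units below are hypothetical:

```python
def place_replicas(replicas: dict[str, int], clusters: dict[str, int]):
    """Greedily assign each replica (name -> capacity demand) to the
    cluster with the most free capacity, placing the largest replicas first."""
    free = dict(clusters)  # cluster name -> remaining capacity
    placement = {}
    for name, demand in sorted(replicas.items(), key=lambda kv: -kv[1]):
        target = max(free, key=free.get)  # cluster with the most headroom
        if free[target] < demand:
            raise RuntimeError(f"no cluster can fit replica {name!r}")
        placement[name] = target
        free[target] -= demand
    return placement, free

# Hypothetical example: three replicas across two datacenters.
placement, free = place_replicas(
    {"model-a": 8, "model-b": 6, "model-c": 4},
    {"dc-east": 16, "dc-west": 12},
)
print(placement)  # {'model-a': 'dc-east', 'model-b': 'dc-west', 'model-c': 'dc-east'}
```

The posting's "constraint-solver algorithms" imply far richer constraints (locality, anti-affinity, rollout safety); this greedy sketch only shows the shape of the problem.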
Responsibilities
Deploy AI inference replicas and cluster software across multiple datacenters.
Operate across heterogeneous datacenter environments undergoing rapid 10x growth.
Maximize capacity allocation and optimize replica placement using constraint-solver algorithms.
Operate bare-metal inference infrastructure while supporting the transition to a Kubernetes-based platform.
Develop and extend telemetry, observability, and alerting solutions to ensure deployment reliability at scale.
Develop and extend a fully automated deployment pipeline to support fast software updates and capacity reallocation at scale.
Translate technical and customer needs into actionable requirements for the Dev Infra, Cluster, Platform, and Core teams.
Stay up to date with the latest advancements in AI compute infrastructure and related technologies.
Skills And Requirements
5-7 years of experience operating on-prem compute infrastructure (ideally in machine learning or high-performance computing), or in developing and managing complex AWS infrastructure for hybrid deployments.
Strong proficiency in Python for automation, orchestration, and deployment tooling.
Solid understanding of Linux-based systems and command-line tools.
Extensive knowledge of Docker containers and container orchestration platforms such as Kubernetes.
Familiarity with spine-leaf (Clos) networking architecture.
Proficiency with telemetry and observability stacks such as Prometheus, InfluxDB, and Grafana.
Strong ownership mindset and accountability for complex deployments.
Ability to work effectively in a fast-paced environment.
Location
SF Bay Area; Toronto, Canada.
Why Join Cerebras
People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we’ve reached an inflection point in our business.
Members of our team tell us there are five main reasons they joined Cerebras:
Build a breakthrough AI platform beyond the constraints of the GPU.
Publish and open-source their cutting-edge AI research.
Work on one of the fastest AI supercomputers in the world.
Enjoy job stability with startup vitality.
Our simple, non-corporate work culture that respects individual beliefs.
Read our blog: Five Reasons to Join Cerebras in 2025. Apply today and become part of the forefront of groundbreaking advancements in AI! Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth, and support of those around them. This website or its third-party tools process personal data. For more details, click here to review our CCPA disclosure notice.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
October 3, 2025
AI Platform Security Engineer
Anthropic
1001-5000
USD
0
300000
-
405000
United States
Full-time
Remote
false
About Anthropic
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.
About the Team
The Security Engineering team's mission is to safeguard our AI systems and maintain the trust of our users and society at large. Whether we're developing critical security infrastructure, building secure development practices, or partnering with our research and product teams, we are committed to operating as a world-class security organization and keeping the safety and trust of our users at the forefront of everything we do.
Responsibilities:
Build security for large-scale AI clusters, implementing robust cloud security architecture including IAM, network segmentation, and encryption controls.
Design secure-by-design workflows and secure CI/CD pipelines across our services; help build secure cloud infrastructure, drawing on expertise in cloud environments, Kubernetes security, container orchestration, and identity management.
Ship and operate secure, high-reliability services using Infrastructure-as-Code (IaC) practices and GitOps workflows.
Apply deep expertise in threat modeling and risk assessment to secure complex cloud environments.
Mentor engineers and contribute to hiring and growth of the Security team.
You may be a good fit if you have:
5-15+ years of software engineering experience implementing and maintaining critical systems at scale.
A Bachelor's degree in Computer Science/Software Engineering or equivalent industry experience.
Strong software engineering skills in Python or at least one systems language (Go, Rust, C/C++).
Experience managing infrastructure at scale with DevOps and cloud automation best practices.
A track record of driving engineering excellence through high standards, constructive code reviews, and mentorship.
Proven ability to lead cross-functional security initiatives and navigate complex organizational dynamics.
Outstanding communication skills, translating technical concepts effectively across all organizational levels.
Demonstrated success in bringing clarity and ownership to ambiguous technical problems.
Strong systems thinking with the ability to identify and mitigate risks in complex environments.
A low-ego, high-empathy approach that attracts talent and supports diverse, inclusive teams.
Experience supporting fast-paced startup engineering teams.
Passion for AI safety and alignment, with a keen interest in making AI systems more interpretable and aligned with human values.
Strong candidates may also have experience:
Designing and hardening CI/CD pipelines against supply chain attacks through isolated environments, signed attestations, dependency verification, and automated policy enforcement.
Building secure development workflows through hardened remote environments.
Implementing network segmentation and access controls in cloud environments.
Managing infrastructure through automated configuration and policy enforcement.
Hardening containerized applications and enforcing security policies.
Deadline to apply: None. Applications will be reviewed on a rolling basis. The expected base compensation for this position is below. Our total compensation package for full-time employees includes equity, benefits, and may include incentive compensation.
Annual Salary: $300,000-$405,000 USD
Logistics
Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience.
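As a minimal illustration of the dependency-verification idea mentioned among the supply-chain defenses above, here is a Python sketch. The artifact name and pin list are hypothetical (the pinned digest shown is simply the SHA-256 of the empty string); real pipelines would source pins from a signed lockfile:

```python
import hashlib

# Hypothetical pin list; a real pipeline would load these from a signed lockfile.
PINNED_SHA256 = {
    "model-server.tar.gz":
        "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

def verify_artifact(name: str, data: bytes) -> bool:
    """Accept an artifact only if its SHA-256 digest matches the pinned value.
    Unknown artifacts are rejected rather than trusted by default."""
    expected = PINNED_SHA256.get(name)
    if expected is None:
        return False
    return hashlib.sha256(data).hexdigest() == expected
```

Rejecting unknown names by default is the key policy choice: verification fails closed, which is what "automated policy enforcement" in a CI/CD context usually implies.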
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 2, 2025
Engineering Manager, Core Services
Lambda AI
501-1000
USD
0
297000
-
495000
United States
Full-time
Remote
false
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU.
If you'd like to build the world's best deep learning cloud, join us.
*Note: This position requires presence in our San Francisco, San Jose or Seattle office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.
The Lambda Core Services team builds and operates release engineering, cloud automation, and workflow systems for our AI cloud product suite. We provide CI/CD tooling and artifact management to support the build/deploy process for our services. We also automate configuration of our AWS and other SaaS resources and manage AWS usage for all of Lambda engineering. Keeping the internal product and engineering teams moving quickly and delivering quality is what makes us tick. Along with the Platform Engineering organization, we help to build the foundations that unlock product excellence and a highly reliable experience for our customers.
About the Role:
We are seeking a seasoned Engineering Manager with deep experience in both release engineering and the management of large-scale cloud deployments. You will hire and guide a team of platform engineers in building out critical pillars of our stack, and you will lead the team in designing, deploying, scaling, and supporting these solutions. Your role is not just to manage people, but to coordinate the delivery of platform solutions to engineering customers within Lambda. This is a unique opportunity to work at the intersection of platform engineering and the rapidly evolving field of AI infrastructure.
What You’ll Do
Team Leadership & Management:
Hire, grow, lead, and mentor a team of high-performing platform engineers and SREs.
Foster a culture of technical excellence, collaboration, and customer service.
Conduct regular one-on-one meetings, provide constructive feedback, and support career development for team members.
Drive outcomes by managing project priorities, deadlines, and deliverables.
Technical Strategy & Execution:
Work with the engineering team to drive strategy for internal CI/CD and cloud services.
Develop self-service abstractions to make our platform tooling easier to adopt and use.
Lead the broader engineering organization in best-practice adoption of CI/CD, workflow, and cloud services.
Manage costs of both vendors and internally developed platforms.
Lead the team in the continued development of our existing CI/CD solutions based on Buildkite and GitHub Actions.
Lead the team in the expansion of our Terraform/Atlantis infrastructure automation platform.
Guide Lambda engineering in utilization of AWS services in line with our technical standards.
Guide the team in problem identification, requirements gathering, solution ideation, and stakeholder alignment on engineering RFCs.
Identify gaps in our platform engineering posture and drive resolution.
Lead the team in supporting our internal customers from across Lambda engineering.
Cross-Functional Collaboration:
Work closely with Lambda product engineering teams on requirements and planning to meet their needs.
Work to understand the needs of engineering teams and drive our platform solutions toward self-service.
Manage a short list of vendors that provide SaaS solutions used at Lambda.
You
Experience:
7+ years of experience in either release engineering or platform engineering, with at least 3 years in a management or lead role.
Demonstrated experience leading a team of engineers and SREs on complex, cross-functional projects in a fast-paced startup environment.
Experience managing, monitoring, and scaling CI/CD platforms.
Deep experience using and operating AWS services.
Solid background in software engineering and the SDLC.
Strong project management skills, leading planning, project execution, and delivery of team outcomes on schedule.
Experience building a high-performance team through deliberate hiring, upskilling, performance management, and expectation setting.
Nice to Have:
Experience driving cross-functional engineering management initiatives (coordinating events, strategic planning, coordinating large projects).
Experience driving organizational improvements (processes, systems, etc.).
Experience managing AWS service usage across a broader engineering organization.
Experience in AWS spend management.
Experience designing solutions using Temporal workflows; ability to act as an internal consultant for Temporal.
Experience with Kubernetes.
Experience designing scalable distributed systems.
Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
About Lambda
Founded in 2012, ~400 employees (2025) and growing fast.
We offer generous cash & equity compensation.
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability.
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG.
Health, dental, and vision coverage for you and your dependents.
Wellness and commuter stipends for select roles.
401k plan with 2% company match (USA employees).
Flexible paid time off plan that we all actually use.
A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.
Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 1, 2025
Senior+ Software Engineer - Cloud Availability Platform Engineering (Observability)
Crusoe
501-1000
USD
166000
-
201000
United States
Full-time
Remote
false
Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.
About the Role:
We are looking for a highly skilled engineer with deep expertise in building and operating observability platforms at scale. You will design, develop, and run Crusoe’s next-generation observability stack, enabling engineers to understand the internal state of distributed systems through metrics, logs, and traces. Your work will ensure reliability, performance, and actionable insights across Crusoe’s global infrastructure and cloud platform.
What You’ll Be Working On:
Designing and operating scalable observability systems (metrics, logging, tracing) across multi-datacenter Kubernetes environments.
Architecting end-to-end telemetry pipelines, including ingestion, storage, querying, and visualization.
Extending monitoring and alerting with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry.
Building scalable log collection and processing pipelines with Fluent Bit, Vector, Loki, or ELK/OpenSearch stacks.
Implementing distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrating with service meshes, load balancers, and APIs.
Defining and driving adoption of SLOs, SLIs, and error budgets across services and teams.
Automating provisioning and scaling of observability infrastructure with Kubernetes, Terraform, and custom tooling (Go, Python).
Ensuring reliability and cost efficiency of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure).
Embedding security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls.
Partnering with engineering teams to embed observability into applications, services, and infrastructure.
Mentoring engineers and shaping Crusoe’s observability strategy and technical roadmap.
What You’ll Bring to the Team:
7+ years of experience in infrastructure or platform engineering, with a focus on observability and monitoring systems.
Deep expertise with metrics systems (Prometheus, Thanos, Mimir, Cortex), logging pipelines (Fluent Bit, Vector, Loki, ELK/OpenSearch), and tracing platforms (Jaeger, Tempo, OpenTelemetry).
Strong programming skills in Go or Python for automation, operators, and custom integrations.
Experience running observability platforms on Kubernetes and operating them at scale across multi-datacenter environments.
Proven ability to design, optimize, and scale telemetry pipelines handling high-cardinality and high-throughput data.
Solid understanding of distributed systems, performance engineering, and debugging complex workloads.
Familiarity with service meshes, networking, and workload instrumentation (Envoy, Istio, OpenTelemetry SDKs).
Strong collaboration skills and the ability to influence engineering teams to adopt observability best practices.
Bonus Points:
Contributions to open-source observability projects (Prometheus, OpenTelemetry, Grafana, Loki, etc.).
Experience supporting AI/ML or GPU-heavy environments with high observability demands.
Knowledge of event-driven or streaming systems (Kafka, NATS, Pulsar) used in telemetry pipelines.
Experience implementing cost optimization strategies for large-scale observability platforms.
Background in incident response, chaos engineering, and reliability practices.
Benefits:
Industry-competitive pay.
Restricted Stock Units in a fast-growing, well-funded technology company.
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents.
Employer contributions to HSA accounts.
Paid parental leave.
Paid life insurance, short-term and long-term disability.
Teladoc.
401(k) with a 100% match up to 4% of salary.
Generous paid time off and holiday schedule.
Cell phone reimbursement.
Tuition reimbursement.
Subscription to the Calm app.
MetLife Legal.
Company-paid commuter benefit: $300 per month.
Compensation:
Compensation will be paid in the range of $166,000 - $201,000 + bonus. Restricted Stock Units are included in all offers.
Compensation to be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/ orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
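The SLO and error-budget practice named in this role reduces to simple arithmetic over an SLI; here is a hedged Python sketch (the SLO target and request counts are illustrative, not Crusoe's):

```python
def error_budget(slo: float, total_requests: int, failed_requests: int):
    """Return (allowed_failures, remaining_fraction) for an availability SLO.

    slo is the target success ratio, e.g. 0.999 for "three nines".
    remaining_fraction can go negative once the budget is blown.
    """
    allowed = (1.0 - slo) * total_requests   # failures the SLO permits
    remaining = allowed - failed_requests    # budget left in this window
    return allowed, (remaining / allowed if allowed else 0.0)

# Example: 1M requests against a 99.9% SLO, with 400 failures so far.
allowed, remaining_frac = error_budget(0.999, 1_000_000, 400)
```

In practice this computation lives in a metrics system (e.g. as Prometheus recording rules), with burn-rate alerts firing when the remaining fraction drops too fast.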
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
September 30, 2025