Top MLOps / DevOps Engineer Jobs Openings in 2025
Looking for opportunities in MLOps / DevOps engineering? This curated list features the latest MLOps / DevOps Engineer job openings from AI-native companies. Whether you're an experienced professional or just entering the field, you'll find roles that match your expertise, from startups to global tech leaders. Updated every day.
Security Engineer
Console
11-50
USD
0
180000
-
300000
United States
Full-time
Remote
false
About Us
Console is an AI platform that automates IT and internal support. We help companies scale without scaling headcount, and give employees instant resolution to their issues. Our agents understand the full context of the organization, handle requests end-to-end, and pull in humans only when necessary.
Today, companies like Ramp, Scale, Webflow, and Flock Safety rely on Console to automate over half of their IT & HR requests. We've won every bake-off against our competitors, closed every trial customer, and expect to 10x usage by year-end.
We're a small, talent-dense team: naturally curious, high-agency, and low-ego. Our organization is very flat, and ideas win on merit, not hierarchy. We're hiring exceptional people to keep up with demand. We're backed by Thrive Capital and world-class angels.
About the Role
As a founding Security Engineer at Console, you'll own, design, and implement the roadmap for our evolving security posture across both infrastructure and application security. You'll work directly with the CTO to align on our commitments to customers and ensure that our platform remains secure and compliant.
Some examples of work you might do:
Build the threat model that informs our security roadmap for the next year
Design and deploy foundational security controls across corporate infrastructure (EDR, FIDO2 authentication, VPN) and application security (secure SDLC, vulnerability management)
Oversee our compliance efforts (SOC 2, HIPAA, ISO 27001) and coordinate with external auditors and consultants
Evaluate, select, and implement security tooling that balances sophistication with operational efficiency
You'll have broad license to own the security and compliance efforts at Console, with room to grow into a leadership position as the team scales.
This role is based in San Francisco, CA.
We work in person and offer relocation assistance to new employees.
About You
You have hands-on experience building security programs from the ground up, ideally in fast-growing startups or cloud-native environments
You understand both infrastructure security (identity & access management, network security, endpoint protection) and application security (threat modeling, secure development practices, vulnerability management)
You've worked with compliance frameworks like SOC 2, HIPAA, or ISO 27001 and can translate requirements into practical implementation
You're comfortable both building security tooling yourself and orchestrating third-party solutions, and you know when to build vs. buy
You care about enabling the business and empowering engineers, not just saying "no"
Requirements
5+ years of full-time experience in security engineering, platform security, or infrastructure security roles
Deep experience with cloud security in AWS and/or GCP, including IaC tools like Terraform or Pulumi
Passionate about building pragmatic, risk-based security programs that scale with the business
Why Join Console?
Product-market fit: We have built the leading product in our category, in a massive market. We've hit an inflection point and are on track to build a generational company.
World-class team: We seek high-agency contributors who are comfortable navigating ambiguity, ruthlessly prioritize what matters, and are action-biased.
Grow with us: We reward impact, not credentials or years of experience. We intend to grow talent from within as we scale up.
Competitive pay and benefits: top compensation with full benefits, including:
Equity with early exercise & QSBS eligibility
Comprehensive health, dental, and vision insurance
Unlimited PTO
401(k)
Meals provided daily in office
MLOps / DevOps Engineer
Data Science & Analytics
Apply
October 30, 2025
Senior Platform Engineer
Seven AI
51-100
-
United States
Full-time
Remote
true
As a Senior Platform Engineer, you'll be at the heart of our infrastructure, designing, building, and scaling the core platform that powers our AI-driven security products. You'll collaborate closely with our AI, backend, and product teams to ensure our systems are performant, stable, reliable, and secure.
You'll own key pieces of our cloud and service platform, automate everything you can, and continuously improve how we deliver, monitor, and scale mission-critical applications. If you're passionate about building scalable, reliable infrastructure and enabling teams to deliver faster and with confidence, this is the role for you.
What You'll Do
Design and build scalable, secure, and high-performance infrastructure for our AI-native platform.
Own AWS region expansion, harden the system for compliance, and improve elasticity, disaster recovery, and high-availability features.
Develop tooling, services, and automation, including runbook automation and AI-assisted developer workflows, to improve platform reliability and developer experience.
Partner with AI and backend engineers to streamline deployment, CI/CD, observability, and performance optimization.
Drive initiatives around infrastructure-as-code, container orchestration, and operational excellence to reduce toil and enhance incident response.
What We're Looking For
5+ years of experience in Platform, DevOps, or Software Engineering roles.
Experience working closely in and with development teams, and a proven ability to change production code.
Strong software development skills (Python, TypeScript, or similar).
Deep experience with AWS cloud infrastructure.
Expertise with Kubernetes, Docker, Terraform, and modern CI/CD systems such as GitHub Actions.
Experience building observability and monitoring systems leveraging tools such as the full Grafana stack.
Passion for developer experience, automation, and high-scale systems.
Strong understanding of distributed systems, networking, and security fundamentals.
Curiosity and drive to work at the intersection of AI and cybersecurity.
Nice to Have
Experience with AI/ML platforms or data-intensive systems.
Knowledge of agent frameworks, LLM orchestration, or event-driven architectures.
Familiarity with SOC operations, threat detection, or security automation tools.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 30, 2025
Senior DevOps Engineer (APJ)
Arize AI
101-200
-
Argentina
Full-time
Remote
true
About Arize
AI is rapidly transforming the world. As generative AI reshapes industries, teams need powerful ways to monitor, troubleshoot, and optimize their AI systems. That’s where we come in. Arize AI is the leading AI & Agent Engineering observability and evaluation platform, empowering AI engineers to ship high-performing, reliable agents and applications. From first prototype to production scale, Arize AX unifies build, test, and run in a single workspace—so teams can ship faster with confidence.
We’re a Series C company backed by top-tier investors, with over $135M in funding and a rapidly growing customer base of 150+ leading enterprises and Fortune 500 companies. Customers like Booking.com, Uber, Siemens, and PepsiCo leverage Arize to deliver AI that works.
Note: The nature of this role requires candidates to be based in the Buenos Aires area, though there isn't an in-office requirement.
The Opportunity
We’re looking for an Application Engineer who thrives on solving hard problems with code. In this role, you'll have the opportunity to work at the cutting edge of generative AI in a high-impact role with autonomy and ownership.
What You’ll Do
Debug and fix issues in our platform (and ship PRs with your fixes).
Build internal tools and copilots powered by generative AI to supercharge our team.
Rapidly prototype proof-of-concepts for customer use cases.
Work across Engineering, Product, and Solutions to unblock customers and push the boundaries of AI adoption.
What We’re Looking For
You have 2-5 years of software engineering experience.
Strong in Python and Golang; comfortable shipping fixes in production systems.
Hands-on with generative AI (LLM APIs, frameworks, building copilots or automations).
Hands-on with OpenTelemetry and deep familiarity with distributed tracing concepts.
Familiarity with AI frameworks (CrewAI, LangChain, LangGraph, Dify, LiteLLM, etc.).
Familiarity or eagerness to learn JavaScript/TypeScript.
Great debugger, creative problem solver, and fast learner.
Independent and resourceful. You create solutions, not dependencies.
Bonus Points (but not required!)
Experience in a customer-facing role
Built copilots, plugins, or custom GenAI-powered applications.
Open-sourced or contributed PRs to real codebases.
Startup or fast-moving environment experience.
Actual compensation is determined based upon a variety of job-related factors that may include transferable work experience, skill sets, and qualifications. Total compensation also includes unlimited paid time off, a generous parental leave plan, and other mental and wellness support benefits.
More About Arize
Arize’s mission is to make the world’s AI work—and work for people.
Our founders came together through a shared frustration: while investments in AI are growing rapidly across every industry, organizations face a critical challenge—understanding whether AI is performing and how to improve it at scale.
Learn more about what we're doing here:
https://techcrunch.com/2025/02/20/arize-ai-hopes-it-has-first-mover-advantage-in-ai-observability/
https://arize.com/blog/arize-ai-raises-70m-series-c-to-build-the-gold-standard-for-ai-evaluation-observability/
Diversity & Inclusion @ Arize
Our company's mission is to make AI work, and make it work for people. We hope to make an industry-wide impact on bias, and that's a big motivator for the people who work here. We actively encourage individuals to contribute to a good culture:
Regularly have chats with industry experts, researchers, and ethicists across the ecosystem to advance the use of responsible AI
Culturally conscious events, such as LGBTQ trivia during Pride Month
We have an active Lady Arizers subgroup
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 30, 2025
Engineering Manager, Networking
OpenAI
5000+
USD
0
460000
-
555000
United States
Full-time
Remote
false
About the Team
The Platform Networking team is responsible for the collective communication stack used in our largest training jobs. Using a combination of C++ and CUDA, we work on novel collective communication techniques that enable efficient training of our flagship models on our largest custom-built supercomputers. The models we train are key ingredients of the AI research progress at OpenAI and the field as a whole, and we continually incorporate learnings from our entire research org into our training platform.
About the Role
As an Engineering Manager, Networking, you will lead a world-class team focused on building and optimizing the performance-critical systems that power OpenAI's largest training runs. We're looking for someone with a strong technical background in low-level systems work who is excited to step into a full-time management role.
This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.
In this role, you will:
Manage a highly senior team that develops the communication and systems software behind OpenAI's largest training workloads.
Collaborate closely with ML research and infrastructure teams to ensure system priorities align with evolving model needs.
Grow and support engineers on your team through mentorship, project alignment, and performance development.
Prioritize across projects and maintain visibility into incoming research demands to keep critical training infrastructure ahead of bottlenecks.
You might thrive in this role if you:
Are an experienced leader.
Have experience with low-level systems engineering such as CPU/GPU kernels, RDMA, high-performance networking, or HPC.
Are excited to manage a deeply technical team and guide systems work that directly enables AI research at massive scale.
Enjoy working closely with other high-context teams across research and infrastructure to solve complex, cross-cutting problems.
Have NCCL (NVIDIA Collective Communications Library) or collectives experience.
About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.
We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic. For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement.
Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.
To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form.
No response will be provided to inquiries unrelated to job posting compliance. We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.
OpenAI Global Applicant Privacy Policy
At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 29, 2025
Applied Data Center Design Engineer
Cerebras Systems
501-1000
-
Canada
Full-time
Remote
false
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs. Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services.
About the Role
As an Applied Data Center Design Engineer, you'll own the "last mile" of cluster architecture: transforming high-level design specifications into efficient, real-world deployment blueprints for servers, storage, networking, and cabling. You'll be responsible for customizing data center and rack-level designs based on specific cluster requirements, adapting layouts, power, and connectivity to optimize performance, scalability, and reliability. When real-world constraints like space, power, or supply chain limitations arise, you'll make smart trade-offs to deliver practical, deployable solutions. This role combines hands-on problem solving with automation and tooling; you'll also help design and build the frameworks that make each new deployment iteration faster, smarter, and more consistent across sites. It's a great opportunity for someone early in their career who enjoys working at the intersection of hardware, software, and operations, and wants to shape the foundation of large-scale compute infrastructure.
Responsibilities
Translate cluster and rack-level design specifications into deployable blueprints for servers, storage, networking, and cabling.
Customize rack-level designs to meet unique cluster requirements, ensuring power, thermal, and network connectivity are optimized for each deployment.
Collaborate with the operations team to validate and adapt designs based on site-specific constraints (e.g., power, cooling, space, logistics).
Identify and implement automation and tooling to streamline BOM generation and design validation.
Participate in data center deployment reviews, ensuring alignment between design intent and implementation.
Support issue triage and root cause analysis for deployment-related or physical integration problems.
Skills & Qualifications
Bachelor's or Master's degree in Computer Engineering, Electrical Engineering, Computer Science, or a related field, or equivalent practical experience.
1–3 years of experience in infrastructure engineering, data center design, or systems deployment, creating rack elevations, bills of materials (BOMs), and port/cable maps.
Familiarity with servers, networking, and storage hardware.
Basic proficiency in scripting or automation (e.g., Python, PowerShell, or Bash).
Strong analytical and problem-solving skills with attention to detail.
Excellent communication and teamwork skills across multiple engineering disciplines.
Why Join Cerebras
People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we've reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:
Build a breakthrough AI platform beyond the constraints of the GPU.
Publish and open source their cutting-edge AI research.
Work on one of the fastest AI supercomputers in the world.
Enjoy job stability with startup vitality.
Our simple, non-corporate work culture that respects individual beliefs.
Read our blog: Five Reasons to Join Cerebras in 2025. Apply today and become part of the forefront of groundbreaking advancements in AI!
Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth, and support of those around them.
This website or its third-party tools process personal data. For more details, click here to review our CCPA disclosure notice.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
October 29, 2025
Data Center Operations Engineer - Chicago ORD
Lambda AI
501-1000
USD
0
89000
-
134000
United States
Full-time
Remote
false
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU.
If you'd like to build the world's best deep learning cloud, join us.
Note: This position requires presence in our Chicago/Elk Grove Village Data Center location 5 days per week.
The Operations team is at the heart of keeping our AI-IaaS infrastructure running smoothly from start to finish. They handle everything from sourcing the right hardware and components to keeping our data centers performing at their best day in and day out. The team also works closely across the company, making sure our operational capabilities stay in sync with product goals and overall strategy. By managing the entire lifecycle, from procurement through deployment and ongoing efficiency, the Operations team ensures our AI infrastructure stays reliable, scalable, and ready to support the business as it grows.
What You'll Do
Make sure new servers, storage, and networking gear are racked, labeled, cabled, and configured the right way.
Keep data center layouts and network topologies up to date in our DCIM software.
Coordinate with supply chain and manufacturing teams so systems are deployed on time, especially for large-scale projects.
Evaluate current and future data center needs based on growth and technology trends.
Manage parts depot inventory and track equipment as it moves from delivery → storage → staging → deployment → handoff.
Work closely with hardware support teams to get tickets resolved quickly.
Create and manage RMA tickets when needed, making sure faulty parts are replaced and reinstalled without delay.
Develop and maintain installation standards (placement, labeling, cabling) to ensure consistency across all data centers.
Act as a subject matter expert on data center deployments, supporting sales engagements for major deployments in our facilities or at customer sites.
You
Sourcing & Procurement
Researching, evaluating, and securing the right hardware and infrastructure components.
Building relationships with peers and supply chain to ensure cost-effective and timely supply.
Data Center Operations
Monitoring day-to-day performance of data centers to maintain uptime and efficiency.
Troubleshooting and resolving hardware or infrastructure issues quickly.
Performing regular maintenance and upgrades to keep systems running at peak performance.
Deployment & Lifecycle Management
Overseeing the full lifecycle of infrastructure, from initial setup to ongoing optimization.
Coordinating deployments of new hardware and ensuring seamless integration with existing systems.
Managing capacity planning to make sure infrastructure can scale with business growth.
Cross-Team Collaboration
Working with product management, support, and other teams to align operational capabilities with company goals.
Translating business priorities into technical and operational requirements.
Supporting cross-functional projects where infrastructure plays a critical role.
Reliability & Scalability
Ensuring infrastructure remains stable, secure, and scalable as demand increases.
Continuously improving processes to boost efficiency and reduce downtime risks.
Nice to Have
Certifications: any Linux or project management.
Military background.
Experience in the machine learning or computer hardware industry.
Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove
We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible Paid Time Off plan that we all actually use
A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.
Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
October 29, 2025
Senior Site Reliability Engineer, Storage
Crusoe
501-1000
USD
0
166000
-
201000
United States
Full-time
Remote
false
Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI, without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.
About This Role:
At Crusoe Energy Systems, our Site Reliability Engineering (SRE) team plays a mission-critical role in maintaining the performance and reliability of our AI-optimized cloud infrastructure. The Storage-focused SRE role is responsible for ensuring the availability, performance, and scalability of Crusoe's cloud storage products and services, which power compute-intensive, latency-sensitive workloads for AI and HPC use cases. This role directly supports our vertically integrated, sustainable cloud platform by building and optimizing distributed, fault-tolerant storage systems at scale.
What You'll Be Working On:
In this role, you will build automation and self-healing tools to monitor and maintain Crusoe's distributed cloud storage infrastructure, which includes block, file, and object storage systems. You will drive reliability initiatives focused on data replication, encryption, backup and restore strategies, and robust failover mechanisms. Collaborating closely with storage engineers, you will help implement and maintain high-performance NVMe- and SSD-backed volumes that support large-scale AI compute clusters. Your responsibilities will also include supporting user-facing storage services with a focus on availability, performance tuning, and adherence to error budgets.
You’ll investigate and resolve storage-related incidents using deep telemetry, logs, and performance profiling, while also partnering with hardware and kernel teams to diagnose low-level I/O issues and optimize I/O paths, cache policies, and file systems. Additionally, you will contribute to the architecture of fault-tolerant, scalable storage backends tailored for AI-first cloud environments.
What You'll Bring to the Team:
5+ years of professional experience in SRE, systems, or storage engineering.
Hands-on experience with distributed storage systems (e.g., Ceph, GlusterFS, OpenEBS) and a deep understanding of object, block, and file storage paradigms.
Proficiency in a programming language such as Python, Go, Java, or C.
Experience with Infrastructure as Code and deployment tooling such as Terraform, Ansible, or Puppet.
Deep knowledge of Linux internals with a focus on I/O subsystems, memory management, and storage scheduling.
Familiarity with storage protocols like NFS, SMB, iSCSI, or NVMe-oF.
Strong experience working with containerized workloads and orchestration platforms (e.g., Kubernetes, Docker).
Excellent incident response, troubleshooting, and documentation practices.
Experience building and operating managed services at scale, such as object, file, and block storage (AWS, GCP, Azure).
Excellent communication skills.
Must be able to pass a background check.
Embody the Company values.
Bonus Points:
Contributions to open-source storage projects or the Linux storage stack.
Experience with hybrid storage models across on-prem and cloud environments.
Familiarity with high-throughput network topologies for storage backplanes (e.g., RoCE, RDMA, InfiniBand).
Benefits:
Industry-competitive pay
Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company-paid commuter benefit; $300 per month
Compensation:
Compensation will be paid in the range of $166,000 - $201,000 a year + bonus. Restricted Stock Units are included in all offers.
Compensation to be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
October 28, 2025
Staff Site Reliability Engineer, Storage
Crusoe
501-1000
USD
0
204000
-
247000
United States
Full-time
Remote
false
Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI, without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.
About This Role:
At Crusoe Energy Systems, our SRE team plays a mission-critical role in maintaining the performance and reliability of our AI-optimized cloud infrastructure. The Storage-focused Site Reliability Engineer role is responsible for ensuring the availability, performance, and scalability of Crusoe's cloud storage products and services, which power compute-intensive, latency-sensitive workloads for AI and HPC use cases. This role directly supports our vertically integrated, sustainable cloud platform by building and optimizing distributed, fault-tolerant storage systems at scale.
What You'll Be Working On:
In this role, you will build automation and self-healing tools to monitor and maintain Crusoe's distributed cloud storage infrastructure, which includes block, file, and object storage systems. You will drive reliability initiatives focused on data replication, encryption, backup and restore strategies, and robust failover mechanisms. Collaborating closely with storage engineers, you will help implement and maintain high-performance NVMe- and SSD-backed volumes that support large-scale AI compute clusters. Your responsibilities will also include supporting user-facing storage services with a focus on availability, performance tuning, and adherence to error budgets.
You’ll investigate and resolve storage-related incidents using deep telemetry, logs, and performance profiling, while also partnering with hardware and kernel teams to diagnose low-level I/O issues and optimize I/O paths, cache policies, and file systems. Additionally, you will contribute to the architecture of fault-tolerant, scalable storage backends tailored for AI-first cloud environments.
What You’ll Bring to the Team:
8+ years of professional experience in Storage SRE, systems engineering, storage engineering, or similar roles.
Hands-on experience with distributed storage systems (e.g., Ceph, GlusterFS, OpenEBS) and a deep understanding of object, block, and file storage paradigms.
Proficiency in a programming language such as Go, Python, Java, or C.
Experience with Infrastructure as Code and deployment tooling such as Terraform, Ansible, or Puppet.
Deep knowledge of Linux internals, with a focus on I/O subsystems, memory management, and storage scheduling.
Familiarity with storage protocols like NFS, SMB, iSCSI, or NVMe-oF.
Strong experience working with containerized workloads and orchestration platforms (e.g., Kubernetes, Docker).
Excellent incident response, troubleshooting, and documentation practices.
Experience building and operating managed services at scale, such as object, file, and block storage (AWS, GCP, Azure).
Excellent communication skills.
Must be able to pass a background check.
Embody the Company values.

Benefits:
Industry-competitive pay.
Restricted Stock Units in a fast-growing, well-funded technology company.
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents.
Employer contributions to HSA accounts.
Paid Parental Leave.
Paid life insurance, short-term and long-term disability.
Teladoc.
401(k) with a 100% match up to 4% of salary.
Generous paid time off and holiday schedule.
Cell phone reimbursement.
Tuition reimbursement.
Subscription to the Calm app.
MetLife Legal.
Company-paid commuter benefit: $300 per month.

Compensation Range: Compensation will be paid in the range of $204,000 - $247,000 a year + bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer.
Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
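The "automation and self-healing tools" this posting describes can be sketched, at a high level, as follows. This is an illustrative example only, not Crusoe's actual tooling: it scans mounted volumes, flags those above a usage threshold, and emits remediation actions for an operator or automation layer to act on. Paths and thresholds are placeholders.

```python
# Illustrative self-healing storage check: flag volumes over a usage
# threshold and propose remediation. Not production tooling.
import shutil
from typing import NamedTuple

class VolumeStatus(NamedTuple):
    path: str
    used_fraction: float
    healthy: bool

def check_volume(path: str, threshold: float = 0.85) -> VolumeStatus:
    """Return usage status for one mounted volume."""
    usage = shutil.disk_usage(path)
    used = usage.used / usage.total
    return VolumeStatus(path=path, used_fraction=used, healthy=used < threshold)

def remediation_plan(statuses: list[VolumeStatus]) -> list[str]:
    """Propose actions for unhealthy volumes; a real system would trigger
    rebalancing or snapshot cleanup instead of returning strings."""
    return [f"rebalance-or-expand:{s.path}" for s in statuses if not s.healthy]

if __name__ == "__main__":
    status = check_volume("/")
    print(remediation_plan([status]))
```

A real SRE pipeline would run checks like this continuously, export the results as metrics, and gate the remediation step behind error-budget policy rather than acting unconditionally.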
MLOps / DevOps Engineer
Data Science & Analytics
Apply
October 28, 2025
Network Engineer - High Side Engineering
Helsing
501-1000
-
Germany
Full-time
Remote
false
Who we are
Helsing is a defence AI company. Our mission is to protect our democracies. We aim to achieve technological leadership, so that open societies can continue to make sovereign decisions and control their ethical standards. As democracies, we believe we have a special responsibility to be thoughtful about the development and deployment of powerful technologies like AI. We take this responsibility seriously. We are an ambitious and committed team of engineers, AI specialists and customer-facing programme managers. We are looking for mission-driven people to join our European teams – and apply their skills to solve the most complex and impactful problems. We embrace an open and transparent culture that welcomes healthy debates on the use of technology in defence, its benefits, and its ethical implications.

The role
Helsing is a defence AI company with a mission to protect our democracies. Much of our work takes place in high-security environments, and we are looking for Network Engineers to support our high-security data center operations. Your role as a Network Engineer will be to design, implement, and manage the critical network infrastructure within these secure data centers, forming the backbone of our operations. We are looking for engineers with a strong work ethic and prioritization skills. We value team players who communicate clearly, share knowledge generously, and collaborate effectively to move their team — and our mission — forward.

The day-to-day
Design, implement, and support data center co-locations from the ground up.
Install and maintain network hardware, including switches, routers, and firewalls, using infrastructure as code (IaC).
Design and deploy network solutions for GPU clusters.
Operate within complex, high-security environments, ensuring strict compliance with all security protocols and best practices.
Troubleshoot complex network problems and provide timely resolutions to maintain seamless operations.
You should apply if you have:
Experience designing, building, and supporting complex physical and software network infrastructure.
A good understanding of Linux system administration.
Expertise building and automating network infrastructure with an Infrastructure as Code (IaC) mindset, using tools like Terraform and Ansible.
Hands-on experience with major network device vendors such as Cisco and Palo Alto.
Understanding of basic network monitoring protocols and tools, such as tcpdump, Wireshark, gNMI, SNMP, and nmap.
A strong command of the following protocols: TCP/IP, TLS, DHCP, DNS, OSPF, and BGP.
A solid understanding of VPN technologies and their underlying protocols (e.g., IPSec, OpenVPN, WireGuard).

Note: We operate in an industry where women, as well as other minority groups, are systematically under-represented. We encourage you to apply even if you don’t meet all the listed qualifications; ability and impact cannot be summarised in a few bullet points.

Nice to have:
Experience with Kubernetes.
Experience with Proxmox VE, including clustering, high availability, and storage management.
Experience working with BSI-approved networking vendors, such as SINA and Genua.
Experience with Software Defined Networking (SDN), e.g. Cisco ACI.
Familiarity with scripting/programming (Shell, Python).

Join Helsing and work with world-leading experts in their fields. Helsing’s work is important: you’ll be directly contributing to the protection of democratic countries while balancing both ethical and geopolitical concerns. The work is unique. We operate in a domain that has highly unusual technical requirements and constraints, and where robustness, safety, and ethical considerations are vital. You will face unique engineering and AI challenges that make a meaningful impact in the world. Our work frequently takes us right up to the state of the art in technical innovation, be it reinforcement learning, distributed systems, generative AI, or deployment infrastructure.
The defence industry is entering the most exciting phase of the technological development curve. Advances in our field are not incremental: Helsing is part of, and often leading, historic leaps forward. In our domain, success is a matter of order-of-magnitude improvements and novel capabilities. This means we take bets, aim high, and focus on big opportunities. Despite being a relatively young company, Helsing has already been selected for multiple significant government contracts. We actively encourage healthy, proactive, and diverse debate internally about what we do and how we choose to do it. Teams and individual engineers are trusted (and encouraged) to practise responsible autonomy and critical thinking, and to focus on outcomes, not conformity. At Helsing you will have a say in how we (and you!) work, the opportunity to engage on what does and doesn’t work, and to take ownership of aspects of our culture that you care deeply about.

What we offer:
A focus on outcomes, not time-tracking.
Competitive compensation and stock options.
Relocation support.
Social and education allowances.
Regular company events and all-hands to bring together employees as one team across Europe.
A hands-on onboarding program (affectionately labelled “Infraduction”), in which you will be building tooling and applications to be used across the company. This is your opportunity to learn our tech stack, explore the company, and learn how we get things done, all whilst working with other engineering teams from day one.

Helsing is an equal opportunities employer. We are committed to equal employment opportunity regardless of race, religion, sexual orientation, age, marital status, disability or gender identity. Please do not submit personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, data concerning your health, or data concerning your sexual orientation.
Helsing's Candidate Privacy and Confidentiality Regime can be found here.
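The troubleshooting work described above often starts with a basic reachability probe. The sketch below is illustrative only (not Helsing tooling): it attempts a TCP connection to a host and port and reports whether the target answered and how long the attempt took. The host and port are placeholders.

```python
# Illustrative TCP reachability probe of the kind used when triaging
# data-center network problems. Hosts/ports below are placeholders.
import socket
import time

def tcp_probe(host: str, port: int, timeout: float = 2.0) -> tuple[bool, float]:
    """Try to open a TCP connection; return (reachable, seconds elapsed)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start

if __name__ == "__main__":
    # e.g. probe a router's BGP port (TCP 179) from a management host;
    # 192.0.2.1 is a documentation address and will not answer.
    reachable, latency = tcp_probe("192.0.2.1", 179, timeout=0.5)
    print(f"reachable={reachable} latency={latency:.3f}s")
```

In practice an engineer would pair probes like this with packet captures (tcpdump/Wireshark) and device telemetry (gNMI, SNMP) to localize where along the path traffic is being dropped.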
MLOps / DevOps Engineer
Data Science & Analytics
Apply
October 28, 2025
Offensive Security Engineer, Agent Security
OpenAI
5000+
USD
364500
-
490000
United States
Full-time
Remote
false
About the Team
Security is at the foundation of OpenAI’s mission to ensure that artificial general intelligence benefits all of humanity. The Security team protects OpenAI’s technology, people, and products. We are technical in what we build but are operational in how we do our work, and are committed to supporting all products and research at OpenAI. Our Security team tenets include: prioritizing for impact, enabling researchers, preparing for future transformative technologies, and engaging a robust security culture.

About the Role
We're seeking an exceptional Principal-level Offensive Security Engineer to challenge and strengthen OpenAI's security posture. This role isn't your typical red team job - it's an opportunity to engage broadly and deeply, craft innovative attack simulations, collaborate closely with defensive teams, and influence strategic security improvements across the organization. You'll have the chance to not only find vulnerabilities but actively drive their resolution, automate offensive techniques with cutting-edge technologies, and use your unique attacker perspective to shape our security strategy. This role will be primarily focused on continuously testing our agent-powered products like Codex and Operator. These systems are uniquely valuable targets because they’re rapidly evolving, have access to perform sensitive actions on behalf of users, and have large, diverse attack surfaces.
You will play a crucial role in securing our agents by hunting for realistic vulnerabilities that emerge from the interactions between the applications, infrastructure, and models that power them.

In this role you will:
Continuously hunt for vulnerabilities in the interactions between the applications, infrastructure, and models that power our agentic products.
Conduct open-scope red and purple team operations, simulating realistic attack scenarios.
Collaborate proactively with defensive security teams to enhance detection, response, and mitigation capabilities.
Perform comprehensive penetration testing on our diverse suite of products.
Leverage advanced automation and OpenAI technologies to optimize your offensive security work.
Present insightful, actionable findings clearly and compellingly to inspire impactful change.
Influence security strategy by providing attacker-driven insights into risk and threat modeling.

You might thrive in this role if you have:
7+ years of hands-on red team experience, or exceptional accomplishments demonstrating equivalent expertise.
Deep expertise conducting offensive security operations within modern technology companies.
Experience designing, developing, testing, or assessing the security of AI-powered systems.
Experience finding, exploiting, and mitigating common vulnerabilities in AI systems, such as prompt injection, leaking sensitive data, confused deputies, and dynamically generated UI components.
Exceptional skill in code review, identifying novel and subtle vulnerabilities.
Proven experience performing offensive security assessments in at least one hyperscaler cloud environment (Azure preferred).
Demonstrated mastery assessing complex technology stacks, including: highly customized Kubernetes clusters, container environments, CI/CD pipelines, GitHub security, macOS and Linux operating systems, data science tooling and environments, Python-based web services, and React-based frontend applications.
Strong intuitive understanding of trust boundaries and risk assessment in dynamic contexts.
Excellent coding skills, capable of writing robust tools and automation for offensive operations.
Ability to communicate complex technical concepts effectively through compelling storytelling.
Proven track record of not just finding vulnerabilities but actively contributing to solutions in complex codebases.

Bonus points:
Background or expertise in AI or data science.
Prior experience working in tech startups or fast-paced technology environments.
Experience in related disciplines such as Software Engineering (SWE), Detection Engineering, Site Reliability Engineering (SRE), Security Engineering, or IT Infrastructure.

About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity. We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic. For additional information, please see OpenAI’s Affirmative Action and Equal Employment Opportunity Policy Statement. Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act.
For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.OpenAI Global Applicant Privacy PolicyAt OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
October 28, 2025
Forward Deployed Engineer, Infrastructure Specialist (EMEA & APAC)
Cohere
501-1000
-
Japan
Full-time
Remote
true
Who are we?
Our mission is to scale intelligence to serve humanity. We’re training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI. We obsess over what we build. Each one of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. We like to work hard and move fast to do what’s best for our customers. Cohere is a team of researchers, engineers, designers, and more, who are passionate about their craft. Each person is one of the best in the world at what they do. We believe that a diverse range of perspectives is a requirement for building great products. Join us on our mission and shape the future!

About North:
North is Cohere's cutting-edge AI workspace platform, designed to revolutionize the way enterprises utilize AI. It offers a secure and customizable environment, allowing companies to deploy AI while maintaining control over sensitive data. North integrates seamlessly with existing workflows, providing a trusted platform that connects AI agents with workplace tools and applications.

Why This Role?
This role offers a unique opportunity to shape how enterprises harness the power of AI in real-world applications. As a bridge between our core North product and our clients’ engineering teams, you’ll be at the forefront of solving complex problems and securely integrating AI into critical sectors such as finance, healthcare, and telecommunications.
Our esteemed clients include industry leaders like RBC, Dell, and LG CNS. We are seeking engineers who deeply care about customers and want to work at the cutting edge of Agentic AI.

In this role, you will:
Lead end-to-end deployment of North in private cloud and on-premises environments, including planning, configuration, testing, and rollout.
Partner with enterprise IT teams to assess infrastructure, security requirements, and data management practices.
Experiment at a high velocity and with a high level of quality to engage our customers and ultimately deliver solutions that exceed their expectations.
Design and implement deployment strategies tailored to client needs, ensuring compliance with data privacy and security standards.
Troubleshoot and resolve deployment-related technical issues, providing timely solutions to minimize downtime.

You may be a good fit if:
You have experience with and enjoy working directly with customers.
You have experience deploying enterprise software in private/hybrid cloud environments.
You have proven experience administering production Kubernetes clusters and expertise with Helm.
You are familiar with DevOps practices, CI/CD pipelines, and tools like Git for version control.
You have strong expertise in cloud infrastructure (Azure, AWS, GCP), networking, and virtualization.
You excel in fast-paced environments and can execute while priorities and objectives are a moving target.

If some of the above doesn’t line up perfectly with your experience, we still encourage you to apply! We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants from all backgrounds and are committed to providing equal opportunities.
Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.Full-Time Employees at Cohere enjoy these Perks:🤝 An open and inclusive culture and work environment 🧑💻 Work closely with a team on the cutting edge of AI research 🍽 Weekly lunch stipend, in-office lunches & snacks🦷 Full health and dental benefits, including a separate budget to take care of your mental health 🐣 100% Parental Leave top-up for up to 6 months🎨 Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement🏙 Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend✈️ 6 weeks of vacation (30 working days!)
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 27, 2025
Electrical Test Engineer – Test Infrastructure
Figure AI
201-500
USD
0
130000
-
250000
United States
Full-time
Remote
false
Figure is an AI robotics company developing autonomous general-purpose humanoid robots. The goal of the company is to ship humanoid robots with human-level intelligence. Its robots are engineered to perform a variety of tasks in the home and commercial markets. Figure is based in San Jose, CA and requires 5 days/week in-office collaboration. We are seeking an Electrical Test Engineer to join the Test Infrastructure team responsible for developing and maintaining the electrical test systems that power integration, validation, and hardware-in-the-loop (HIL) testing across Figure’s humanoid robot program. This role is highly hands-on and technical, combining electrical engineering, test automation, and hardware integration to ensure the performance, reliability, and scalability of Figure’s test infrastructure. Key Responsibilities: Design and develop electrical test systems, including schematics, wiring harnesses, PCBA interfaces, and safety interlocks. Build and maintain benchtop and rack-mounted test fixtures for subsystem and full-robot validation. Collaborate with Systems Integration and Validation teams to define test requirements and electrical coverage for bring-up and EOL workflows. Develop and execute test plans and procedures for electrical systems, focusing on power integrity, communication interfaces, and fault detection. Perform debug, diagnostics, and root cause analysis for electrical and system-level issues on test stands and prototypes. Create and maintain Python-based test automation scripts to enable automated data collection, logging, and verification. Specify and integrate instrumentation and measurement equipment, including oscilloscopes, power analyzers, load banks, DAQs, and PSU controllers. Document test system designs, calibration procedures, and test workflows to ensure repeatability and scalability. Contribute to continuous improvement of electrical test infrastructure and reliability processes across multiple robot programs.
Requirements: Bachelor’s degree or equivalent experience in Electrical Engineering, Mechatronics, or a related field. 4–7 years of experience designing, testing, or validating electrical systems or test equipment. Strong understanding of power distribution, grounding, and safety standards. Proficiency with lab instrumentation (oscilloscopes, PSUs, DAQs, power analyzers, multimeters). Experience designing test fixtures, harnesses, or interface boards for validation or production testing. Familiarity with Python or similar scripting languages for automation and data analysis. Proven ability to debug complex electrical systems, interpret schematics, and work cross-functionally. Experience designing PCBAs. Excellent attention to detail and documentation practices. Bonus Qualifications: Familiarity with hardware-in-the-loop (HIL) or continuous integration test environments. Background in robotics or embedded systems validation. Knowledge of AC/DC power electronics, load control, or high-voltage safety design. The US base salary range for this full-time position is between $130,000 and $250,000 annually. The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended.
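The "Python-based test automation scripts" this posting describes typically run a measurement step, compare the reading against limits, and record a structured pass/fail result. The sketch below is illustrative only, not Figure's framework; the instrument read is stubbed, and a real script would query a DAQ or power-analyzer driver instead.

```python
# Illustrative electrical test-automation step: bundle a measurement
# with its limits and derive pass/fail. The instrument read is stubbed.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestResult:
    name: str
    value: float
    lo: float
    hi: float

    @property
    def passed(self) -> bool:
        # A reading passes when it falls within [lo, hi] inclusive.
        return self.lo <= self.value <= self.hi

def run_step(name: str, read_fn: Callable[[], float],
             lo: float, hi: float) -> TestResult:
    """Execute one test step and record the measurement with its limits."""
    return TestResult(name=name, value=read_fn(), lo=lo, hi=hi)

if __name__ == "__main__":
    # Stub: pretend the DAQ reported 48.1 V on the main bus.
    result = run_step("bus_voltage_48v", lambda: 48.1, lo=46.0, hi=50.0)
    print(result.name, "PASS" if result.passed else "FAIL")
```

Keeping limits alongside values in the logged result is what makes the "verification" part automatable: later analysis can recompute pass/fail or tighten limits without re-running hardware.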
MLOps / DevOps Engineer
Data Science & Analytics
Robotics Engineer
Software Engineering
Software Engineer
Software Engineering
Apply
October 27, 2025
Network Engineer - Cluster Architecture
Cerebras Systems
501-1000
-
United States
Canada
Remote
false
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs. Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services.

About The Role
As a Network Engineer on the Cluster Architecture Team, you will work closely with vendors, internal networking teams, and industry peers to develop best-in-class interconnect architecture for the current and future generations of the Cerebras AI clusters. You will be responsible for developing proofs of concept of new network designs and features, enabling a resilient and reliable network for AI workloads. The role will require cross-functional collaboration and interaction with diverse hardware components (e.g., network devices and the Wafer-Scale Engine) as well as software at several layers of the stack, from host-side networking to cluster-level coordination. The role also requires understanding of network monitoring systems and network debugging methodologies.

Responsibilities
Design AI/ML and HPC clusters with a focus on the network technology. Identify and address performance or efficiency bottlenecks, ensuring high resource utilization, low latency, and high-throughput communication.
Stay current on emerging networking technologies: evaluate new hardware, fabrics, and protocols to improve cluster performance, scalability, and cost efficiency. Drive technical projects involving multiple teams and various software and hardware components coming together to realize advanced networking technologies. Communicate effectively. Collaborate with vendors and industry peers to drive the network hardware and feature roadmap. Pre-deployment readiness & port mapping: build and validate rack/row and patch-panel port maps and cabling plans in the rare cases this is required. Bring-up & occasional deployment debugging: assist with lab/staging validation, packet captures, link-level diagnostics, and synthetic traffic tests.

Skills & Qualifications
Ph.D. in Computer Science or Electrical Engineering plus 10 years of industry experience, or a Master’s in CS or EE plus 15 years of industry experience.
5+ years of experience in large-scale network design in WAN or datacenter environments.
Extensive experience debugging networking issues in large distributed systems environments with multiple networking platforms and protocols.
Experience managing and leading multi-phase, multi-team projects.
Experience with networking platforms such as Juniper, Arista, and Cisco, and open-box architectures (SONiC, FBOSS).
Knowledge of networking protocols such as RoCE, BGP, DCQCN, PFC, and streaming telemetry.
Familiarity with automation languages like Python or Go.
Familiarity with network visibility and management systems.

Why Join Cerebras
People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we’ve reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras: Build a breakthrough AI platform beyond the constraints of the GPU. Publish and open source their cutting-edge AI research.
Work on one of the fastest AI supercomputers in the world. Enjoy job stability with startup vitality. A simple, non-corporate work culture that respects individual beliefs. Read our blog: Five Reasons to Join Cerebras in 2025. Apply today and become part of the forefront of groundbreaking advancements in AI! Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them. This website or its third-party tools process personal data. For more details, click here to review our CCPA disclosure notice.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
October 27, 2025
Senior Support Engineer - Tokyo
OpenAI
5000+
-
Japan
Full-time
Remote
false
About the Team
The Technical Support team is responsible for ensuring that developers and enterprises can reliably build mission-critical solutions using OpenAI models. We provide technical guidance, resolve complex issues, and support customers in maximizing value and adoption from deploying our highly capable models. We work closely with Technical Success, Product, Engineering, and others to deliver the best possible experience to our customers at scale. We think from an automation-first mindset and leverage the latest in AI to scale our support operations. Join the Senior Support Engineering (SSE) team at OpenAI and help shape the future of Technical Support in the age of AI.

About the Role
We are looking for a Senior Support Engineer to collaborate directly with our strategic enterprise accounts and product teams, helping solve some of the most difficult problems faced by our customers. You will be part of the best technical troubleshooting team at OpenAI, and our customers and Engineering teams will look to you for technical guidance in addressing the most technically difficult issues in our environment. As a Senior Support Engineer, you will design and run operational processes to monitor our top strategic customers and a 24x7 response team. You’ll work closely with our Infrastructure and Engineering teams to deliver the best possible experience to customers at scale. Working directly with our most strategic customers, you will be crucial to the success of the most innovative, disruptive, and high-scale AI solutions being built with the OpenAI API platform. The nature of this role will be low volume, high difficulty. This role is based in Tokyo, Japan. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:
Be among the foremost technical and troubleshooting experts for our API platform at OpenAI.
You are the last line of defense before the core Engineering team.
Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies. Contribute to shaping the future of technical support in an AI-driven era.
Configure and use advanced monitoring and alerting workflows to proactively detect customer-impacting issues in real time.
In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates. Ensure that operational readiness (monitoring, alerting, and fallback plans) is in place for any such changes.
Design and refine incident response processes and documentation across strategic customers, engineering, and support teams.
Analyze operational metrics and incident RCAs to identify areas for improvement. Proactively recommend and implement enhancements to monitoring dashboards, alert configurations, and support workflows.
Provide support coverage during holidays and weekends based on business needs.

You might thrive in this role if you:
Have a Bachelor’s degree in Computer Science or a related field. A strong software engineering foundation is important for this role’s success.
Have 8+ years of experience in technical operations roles such as SRE/NOC, designing monitoring systems and resolving production issues in fast-paced and mission-critical environments. A strong track record of troubleshooting complex technical problems at the systems level.
Have deep familiarity with modern monitoring, alerting, and observability practices. Hands-on experience setting up or managing metrics, logging, and tracing for distributed systems (e.g., understanding of SLIs/SLOs, alert tuning, dashboard creation).
Have proven experience leading incident response for high-severity outages or service disruptions.
Able to perform real-time incident coordination and root cause analysis, and drive follow-ups (post-mortems, action items) to prevent recurrence. Knowledge of industry best practices for incident management and fault diagnosis.
Have strong skills in scripting or software engineering (e.g., Python or similar) to automate repetitive tasks and integrate tools.
Have a solid understanding of cloud infrastructure and distributed systems fundamentals. Comfortable working with cloud services, load balancers, databases, and containerized applications.
Are effective at working cross-functionally in a high-trust environment. Strong communication skills to explain technical issues and resolutions to both engineering and non-technical stakeholders. You can coordinate efforts across teams and are comfortable providing updates in the midst of an ongoing incident.

About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity. We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.
For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement.

Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.

To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance. We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
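The SLI/SLO familiarity this role asks for (alert tuning, error budgets) can be illustrated with a small sketch. This is not part of the posting; the function name and the numbers are hypothetical examples of the standard error-budget calculation.

```python
# Illustrative sketch: remaining error budget for an availability SLO.
# The SLO target and request counts are made-up example values.

def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means SLO breached)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0 if failed_requests == 0 else -1.0
    return 1.0 - failed_requests / allowed_failures

# A 99.9% SLO over 1,000,000 requests allows 1,000 failures;
# 250 observed failures leaves 75% of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%}")  # → 75%
```

In practice the counts would come from a metrics backend rather than literals, but the budget arithmetic is the same.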
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 26, 2025
Engineering Site Lead - London
Perplexity
1001-5000
-
United Kingdom
Full-time
Remote
false
Perplexity is an AI-powered answer engine founded in December 2022 and growing rapidly as one of the world's leading AI platforms. Perplexity has raised over $1B in venture investment from some of the world's most visionary and successful leaders, including Elad Gil, Daniel Gross, Jeff Bezos, Accel, IVP, NEA, NVIDIA, Samsung, and many more. Our objective is to build accurate, trustworthy AI that powers decision-making for people and assistive AI wherever decisions are being made. Throughout human history, change and innovation have always been driven by curious people. Today, curious people use Perplexity to answer more than 780 million queries every month, a number that's growing rapidly for one simple reason: everyone can be curious.

Perplexity is revolutionizing how people discover and interact with information through AI-powered search and knowledge tools. As we expand our global footprint, we're establishing a strategic presence in London to drive innovation and growth across Europe.

The Role: We're seeking an exceptional Site Lead to establish and scale our London office.
This is a unique opportunity to shape Perplexity's presence in one of the world's leading tech hubs, building teams and culture from the ground up while driving technical excellence in infrastructure and AI systems. As Site Lead, you'll serve as the face of Perplexity in London, responsible for building our technical organization, fostering a world-class engineering culture, and directly managing one or more infrastructure teams. You'll report to senior leadership and work cross-functionally with teams across our global footprint. The individual in this role will directly manage teams in London while also facilitating cross-site collaboration.

Responsibilities:

Site Leadership & Culture:
Establish and lead Perplexity's London office, setting the cultural foundation and operating principles
Build a collaborative, high-performance engineering culture that aligns with Perplexity's values while embracing the strengths of the London tech ecosystem
Serve as the primary point of contact for all London-based activities and represent the site in company-wide strategic discussions
Partner with People/HR, Finance, and Operations to ensure seamless site operations
Drive local community engagement, partnerships, and Perplexity's brand presence in the London and European tech community

Technical Leadership:
Directly manage and mentor one or more infrastructure or AI infrastructure teams in London (5-15+ engineers)
Set technical direction and strategy for London-based infrastructure initiatives in alignment with company-wide goals
Drive architectural decisions and technical excellence across teams
Ensure robust systems for deployment, monitoring, scalability, and reliability of infrastructure supporting AI/ML workloads
Collaborate with engineering leaders globally to align on technical standards, best practices, and cross-site initiatives

Team Building & Talent:
Build and scale high-performing infrastructure and AI infrastructure teams through strategic hiring
Develop and execute talent acquisition strategy for the London site in partnership with recruiting
Create career development frameworks and growth opportunities for engineers
Foster technical mentorship and knowledge sharing across teams and sites

Cross-functional Collaboration:
Partner with Product, Engineering, and Research teams globally to understand infrastructure needs and deliver solutions
Coordinate with other site leads and engineering leaders to ensure effective cross-site collaboration
Contribute to company-wide infrastructure strategy and roadmap planning
Facilitate knowledge transfer and best practice sharing across global teams

Qualifications:

Required:
10+ years of experience in software engineering with 5+ years in infrastructure, cloud infrastructure, or AI infrastructure roles
3+ years of people management experience, including building and scaling teams
Proven track record of establishing or significantly growing an engineering site or office
Deep technical expertise in distributed systems, cloud platforms (AWS, GCP, or Azure), and infrastructure automation
Experience with infrastructure supporting large-scale AI/ML systems, including GPU infrastructure and orchestration, ML training and inference pipelines, and model serving and deployment at scale
Strong understanding of modern infrastructure technologies: Kubernetes, Terraform, container orchestration, CI/CD systems
Demonstrated ability to set technical vision and drive execution across multiple teams
Excellent communication and stakeholder management skills
Experience working in fast-paced, high-growth technology companies
Passion for building inclusive, diverse, and high-performing teams

Preferred:
Experience at companies focused on AI/ML, search, or large-scale consumer applications
Previous experience as a site lead, office lead, or similar multi-team leadership role
Background in building infrastructure for LLM training or inference
Contributions to open-source infrastructure or AI infrastructure projects
Experience scaling teams from 0 to 20+ engineers
Active involvement in the London or European tech community
MBA or advanced technical degree

What Success Looks Like:

30 Days:
Deep understanding of Perplexity's infrastructure, technology stack, and organizational structure
Established relationships with key stakeholders across engineering, product, and leadership
Initial hiring plan and culture strategy for the London site established
Help the Search, API, AI, and Infra teams build out their hiring pipelines

90 Days:
Core infrastructure team established and ramping in London
Clear technical roadmap and priorities defined for London-based teams
Site culture and operating rhythms established (team meetings, all-hands, cross-site syncs)
London office actively participating in company-wide infrastructure initiatives

1 Year:
London site operating as a high-functioning hub with 15-30+ engineers
Infrastructure teams delivering measurable impact on system reliability, performance, and scalability
Strong talent brand established in the London market with a healthy hiring pipeline
London recognized internally as a strategic site contributing to Perplexity's technical leadership

Why Join Perplexity:
Ground-floor opportunity to build and lead a strategic site for a fast-growing AI company
Work on cutting-edge AI infrastructure challenges at massive scale
Shape the culture and technical direction of an entire office
Competitive compensation including equity
Comprehensive benefits package
Flexible work environment
Opportunity to make a significant impact on how millions of people access and interact with information

Location: London, United Kingdom (Hybrid)

Perplexity is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 25, 2025
Engineering Manager, Managed Orchestration
Crusoe
501-1000
USD
0
204000
-
247000
United States
Full-time
Remote
false
Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI, without sacrificing scale, speed, or sustainability.

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.

About the Role:
We are actively seeking an exceptional Engineering Manager for our Managed Orchestration Team to lead the design and scaling of our carbon-reducing operating model and oversee the management of critical hardware, software, and network components.
In this role, you will provide technical and strategic leadership to the team, oversee the development of cutting-edge infrastructure solutions, and ensure alignment with company goals. You will guide the team in writing and reviewing code, contributing to architecture documents, and evaluating tools and frameworks to optimize for reliability, scalability, operational costs, and ease of adoption. Your leadership will be instrumental in advancing our managed Kubernetes and AI training clusters, ensuring they lead the industry in performance and innovation.

A Day In The Life:
Provide mentorship and guidance to engineers, fostering a culture of creativity, technical excellence, and collaboration to develop scalable and robust software solutions.
Lead team efforts to develop and maintain scalable systems, closely aligning with the strategic objectives outlined in the Crusoe Cloud roadmap.
Collaborate with tech leads and engineers to create an environment that encourages problem-solving and innovation in cloud solutions.
Continuously stay updated on the latest trends and techniques in cloud software and share insights with the team to maintain Crusoe's competitive edge.
Actively contribute to and oversee technical discussions, ensuring architectural decisions are aligned with long-term scalability and reliability goals.

You Will Thrive In This Role If:
You have 7+ years of experience in software engineering with at least 2-3 years of leadership experience, including team management or technical leadership roles.
You have extensive experience with Kubernetes and Linux engineering, including debugging and optimizing system performance.
You are skilled in infrastructure as code and familiar with systems-level challenges, with hands-on experience in Terraform and GCP (preferred).
You understand orchestration tools such as Argo, CI/CD pipelines, and automated testing frameworks.
You have experience building and managing Kubernetes operators and controllers, ensuring the reliability and efficiency of the Kubernetes environment.
You can oversee critical projects with broad impact, leading initiatives focused on networking, quality control, and automation to ensure optimal performance and reliability.
You possess a strong background in system architecture design, including CI/CD pipelines, and can ensure adherence to security standards.
You are proficient in Go and have a strong grasp of systems-level programming challenges.
You have excellent communication skills, both verbal and written, with the ability to inspire and align teams.

Benefits:
Industry competitive pay
Restricted Stock Units in a fast growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company-paid commuter benefit: $300 per month

Compensation Range:
Compensation will be paid in the range of $204,000 - $247,000. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant's knowledge, education, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
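The Crusoe role calls for experience building Kubernetes operators and controllers. At its core, an operator is a reconcile loop: compare the desired state declared in a resource's spec with the observed status, and act on the difference. The sketch below illustrates just that pattern in Python; the `ClusterSpec` shape and action strings are invented for illustration and are not a real CRD or client API.

```python
# Minimal sketch of the reconcile pattern behind Kubernetes operators:
# diff desired state (spec) against observed state (status), emit actions.
# The resource fields and action names here are hypothetical.

from dataclasses import dataclass

@dataclass
class ClusterSpec:
    desired_replicas: int

@dataclass
class ClusterStatus:
    ready_replicas: int

def reconcile(spec: ClusterSpec, status: ClusterStatus) -> list[str]:
    """Return the actions needed to converge observed state toward the spec."""
    diff = spec.desired_replicas - status.ready_replicas
    if diff > 0:
        return [f"scale_up:{diff}"]
    if diff < 0:
        return [f"scale_down:{-diff}"]
    return []  # already converged; nothing to do

print(reconcile(ClusterSpec(5), ClusterStatus(3)))  # → ['scale_up:2']
```

A real controller would be driven by watch events and requeues (e.g. via controller-runtime in Go), but the level-triggered diff-and-act logic is the part that defines the pattern.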
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 24, 2025
Super Intelligence Support Account Lead
Lambda AI
501-1000
USD
160000
-
215000
United States
Full-time
Remote
true
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU.
If you'd like to build the world's best deep learning cloud, join us.
About this role
The Super Intelligence Support Account Lead is part of Lambda's Super Intelligence business unit, dedicated to our largest, most strategic customers operating in the most complex environments. In this role, you will serve as the key support contact for Super Intelligence accounts, acting as a dedicated resource embedded in their success. Your mission is to ensure these customers receive world-class support delivery: solving issues quickly, escalating effectively, advocating for their needs, and driving cross-functional involvement when required.
You'll collaborate closely with every layer of Support Operations, Engineering, Data Center Operations, and Sales, reporting to the Manager of Super Intelligence while working hands-on across the global account management structure. The role requires speed, responsiveness, creativity, and ownership, bringing outside-the-box solutions to critical issues and ensuring support outcomes consistently exceed expectations.

What You'll Do
Serve as the primary support contact for assigned Super Intelligence accounts, ensuring consistent, high-quality customer experiences.
Own the overall support health of assigned accounts, proactively monitoring for risks, recurring issues, and opportunities to improve reliability.
Drive resolution for escalated issues by coordinating with Support, Data Center Ops, and Engineering teams, ensuring timely communication and accountability.
Lead operational reviews (QBRs/MBRs), presenting ticket trends, SLA adherence, incident summaries, and improvement actions.
Develop and maintain account-level success and support plans aligned to customer priorities and workloads.
Act as a mentor for frontline support engineers, guiding them through escalations and sharing best practices.
Document solutions, escalations, and RCA outcomes to build scalable runbooks and strengthen internal processes.
Partner with Product and Engineering teams to ensure customer pain points are visible, tracked, and resolved.
Contribute to Lambda's support operations playbooks, refining how we handle incidents, escalations, and enterprise account management.
Curate and document custom scripts, solutions, or customer-requested customizations outside of Lambda's reference architecture when required.
Participate in an on-call schedule.

You
5+ years in Support Account Management, Technical Account Management, or Support Engineering within cloud, enterprise IT, or infrastructure environments.
Proven experience in HPC environments, showcasing your expertise in Linux cluster administration, with a strong preference for Kubernetes and/or Slurm for cluster orchestration.
Proven ability to own escalations end-to-end, with strong skills in incident management and structured communication.
Solid understanding of cloud and HPC infrastructure (GPU cloud, Kubernetes, Linux clusters, or public cloud platforms).
Skilled at analyzing ticket trends, incident timelines, and support metrics, turning them into actionable improvements.
Strong relationship management skills with both technical and executive-level stakeholders.
Comfortable leading cross-functional collaboration, ensuring engineering and operations stay aligned on customer priorities.
Experience mentoring or guiding support engineers through escalations or complex cases.

Nice to Have
Experience supporting hyperscale or mission-critical customers with 24/7 availability requirements.
Familiarity with enterprise ticketing and incident management systems (Zendesk, Jira, ServiceNow).
Exposure to GPU/AI/HPC technologies such as CUDA, NCCL, NVLink, GPUDirect, or InfiniBand/RoCE networking.
Background in documenting support processes, RCAs, or escalation frameworks.
Certifications in ITIL, Linux, cloud platforms, or project management.

Salary Range Information
This is a salaried non-exempt role, eligible for overtime. The annual salary range for this position has been set based on market data and other factors.
However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove
We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Solutions Architect
Software Engineering
Apply
October 23, 2025
Software Engineer (Site Reliability Engineer)
Anyscale
201-500
-
India
Full-time
Remote
false
About Anyscale:
At Anyscale, we're on a mission to democratize distributed computing and make it accessible to software developers of all skill levels. We’re commercializing Ray, a popular open-source project that's creating an ecosystem of libraries for scalable machine learning. Companies like OpenAI, Uber, Spotify, Instacart, Cruise, and many more, have Ray in their tech stacks to accelerate the progress of AI applications out into the real world.
With Anyscale, we’re building the best place to run Ray, so that any developer or data scientist can scale an ML application from their laptop to the cluster without needing to be a distributed systems expert.
Proud to be backed by Andreessen Horowitz, NEA, and Addition with $250+ million raised to date.
About the role:
As a Site Reliability Engineer, you will play a crucial role in ensuring the smooth operation of all user-facing services and other Anyscale production systems. This includes processes for provisioning, negotiating prices, managing costs, and identifying opportunities for teams to reduce waste across the company. You will apply sound engineering principles, operational discipline, and mature automation to our environments and the Anyscale codebase as we scale.

As part of this role, you will:
Develop a unified perspective on how cloud components are utilized across the company, taking into account diverse needs and requirements.
Ensure that deployment methodologies align with the company's reliability goals.
Build systems that promote understanding of production environments, facilitating quick identification of issues through robust observability infrastructure for metrics, logging, and tracing.
Create monitoring and alerting systems at different levels, enabling teams to easily contribute and enhance the overall monitoring capabilities.
Establish testing infrastructure to support the team in writing and executing tests effectively.
Develop tools for measuring service level objectives (SLOs) and define organization-wide SLOs.
Implement best practices and on-call systems, ensuring efficient incident management and up-leveling the incident management system at Anyscale.
Coordinate the creation and deployment of cloud-based services, including tracking deployments and establishing effective communication channels for issue resolution.

We'd love to hear from you if you have:
At least 3 years of relevant work experience in a similar role.

Anyscale values diversity and inclusion, and we encourage applications from individuals of all backgrounds.

Compensation
At Anyscale, we take a market-based approach to compensation. We are data-driven, transparent, and consistent.
As the market data changes over time, the target salary for this role may be adjusted. This role is also eligible to participate in Anyscale's Equity and Benefits offerings, including the following:
Stock Options
Healthcare plans, with premiums covered by Anyscale at 99%
401k Retirement Plan
Wellness stipend
Education stipend
Paid Parental Leave
Flexible Time Off
Commute reimbursement
100% of in-office meals covered

Anyscale Inc. is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Anyscale Inc. is an E-Verify company and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.
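The Anyscale role includes building SLO measurement tooling and alerting. A common pattern there is multi-window burn-rate alerting: page only when the error budget is burning fast over both a short and a long window. The sketch below shows the arithmetic; the window semantics, function names, and the 14.4x threshold are illustrative assumptions, not Anyscale's actual configuration.

```python
# Sketch of a burn-rate check for SLO-based alerting:
# burn rate = observed error rate / error rate the SLO allows.
# Thresholds and window choices below are illustrative, not prescriptive.

def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    allowed = 1.0 - slo_target
    return error_rate / allowed

def should_page(short_window_errors: float, long_window_errors: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast, to avoid flapping on brief spikes."""
    return (burn_rate(short_window_errors, slo_target) >= threshold and
            burn_rate(long_window_errors, slo_target) >= threshold)

# 2% errors against a 99.9% SLO is a ~20x burn rate in both windows → page.
print(should_page(0.02, 0.02))  # → True
```

Requiring both windows to exceed the threshold is what keeps a short error spike from paging while still catching a sustained burn quickly.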
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
October 22, 2025
Staff Security Engineer (Hybrid)
Fiddler AI
101-200
USD
192500
-
295000
United States
Full-time
Remote
false
Our Purpose
At Fiddler, we understand the implications of AI and the impact that it has on human lives. Our company was born with the mission of building trust into AI. The rise of Generative AI and Agents has unlocked generalized intelligence but also widened the risk aperture and made it harder to ensure that AI applications are working well. Fiddler enables organizations to get ahead of these issues by helping deploy trustworthy and transparent AI solutions. Fiddler partners with AI-first organizations to help build a long-term framework for responsible AI practices, which, in turn, builds trust with their user base. AI Engineers, Data Science, and business teams use Fiddler AI to monitor, evaluate, secure, analyze, and improve their AI solutions to drive better outcomes. Our platform enables engineering teams and business stakeholders alike to understand the "what", "why", and "how" behind AI outcomes.

Our Founders
Fiddler AI is founded by Krishna Gade (engineering leader at Facebook, Pinterest, Twitter, and Microsoft) and Amit Paka (product leader at Microsoft, Samsung, Paypal and two-time founder). We are backed by Insight Partners, Lightspeed Venture Partners, and Lux Capital.

Why Join Us
Our team is motivated to help build trust into AI to enable society to harness the power of AI. Joining us means you get to make an impact by ensuring that AI applications at production scale across industries have operational transparency and security. We are an early-stage startup and have a rapidly growing team of intelligent and empathetic doers, thinkers, creators, builders, and everyone in between. The AI and ML industry has a rapid pace of innovation and the learning opportunities here are monumental. This is your chance to be a trailblazer.
Fiddler is recognized as a pioneer in the field of AI Observability and has received numerous accolades, including: 2022 a16z Data50 list, 2021 CB Insights AI 100 most promising startups, 2020 WEF Technology Pioneer, 2020 Forbes AI 50 most promising startups, and a 2019 Gartner Cool Vendor in Enterprise AI Governance and Ethical Response. By joining our brilliant (at least we think so) team, you will help pave the way in the AI Observability space.

👩🏽🚀 The Mission
As our first Security Engineer, you will define and drive the foundation of security for a next-generation developer platform that powers responsible AI. Your work ensures that every product we build, and every model our customers deploy, is secure, trustworthy, and compliant from the ground up. You'll collaborate across Engineering and Product to embed security into our development lifecycle, enable rapid innovation without compromising safety, and lead the execution of our compliance roadmap (e.g., SOC 2, ISO 27001). By implementing the technical controls that safeguard our multi-cloud AI platform, you will play a critical role in protecting customer data, earning their trust, and reinforcing Fiddler's commitment to building AI that the world can depend on.

🪐 About The Team
Our Platform Engineering team is a talented, experienced group of engineers who take pride in building the foundation that powers Fiddler's AI platform. The team is a mix of local and remote members who thrive on open communication, transparency, and genuine teamwork. This team has a 'gsd' attitude and is quick to lend a hand, share knowledge, and celebrate wins together.

🚀 What You'll Do
Develop a comprehensive security roadmap that addresses current and future threats, including cloud security, application security, and incident response.
Drive execution of the security roadmap by personally delivering key features and infrastructure improvements (spanning GitHub Actions, Terraform, and Python), while coordinating and delegating the remaining initiatives to other engineering teams to ensure end-to-end delivery.
Own our compliance processes end-to-end, including SOC 2 Type 2, GDPR, and HIPAA, by defining and updating controls, supplying evidence during audits, and more.
Prepare for and lead our incident response efforts, including developing and testing incident response plans and coordinating the response to security incidents; work with other engineers to shift security left.
🎯 What We're Looking For
5+ years of security engineering experience
Proven experience in an autonomous senior security role in a startup environment.
Deep understanding of security principles and best practices, as well as infrastructure engineering (sometimes called "DevSecOps").
Hands-on experience with a variety of security tools and technologies in the cloud (on-prem experience is nice-to-have), vulnerability management, and incident response.
Coding experience in Python and/or Golang, primarily as it relates to infrastructure tools.
Excellent communication and interpersonal skills, with the ability to effectively communicate complex security concepts to both technical and non-technical audiences.
A proactive and results-oriented mindset, with the ability to work independently, take ownership of projects, and drive them end-to-end across teams.
Comfort with ambiguity; a self-starter who thrives in a fast-paced environment.
Hands-on experience with AWS technologies (e.g. EC2, VPC, NLB, etc.).
A passion for security and a desire to stay up-to-date with the latest threats and technologies.
Ability to work at our Palo Alto office 2-3 days a week.

🫱🏼🫲🏾 Compensation:
$192,500-$295,000 for Bay Area
The posted range represents the expected salary range for this job requisition and does not include any other potential components of the compensation package and perks previously outlined. Ultimately, in determining pay, we'll consider your experience, leveling, location, and other job-related factors.

Fiddler is proud to be an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. If you require special accommodations in order to complete the interviews or perform job duties, please inform the recruiter at the beginning of the process.

Beware of job scam fraud. Our recruiters use @fiddler.ai email addresses exclusively.
In the US, we do not conduct interviews via text or instant message, or ask for sensitive personal information such as bank account or social security numbers.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
October 21, 2025
Hardware Quality Engineer
Lambda AI
501-1000
USD
0
109000
-
163000
United States
Full-time
Remote
false
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU.
If you'd like to build the world's best deep learning cloud, join us.
*Note: This position requires presence in our San Jose office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.
The Operations team plays a critical role in ensuring the seamless end-to-end execution of our AI-IaaS infrastructure and hardware. This team is responsible for sourcing all necessary infrastructure and components, overseeing day-to-day data center operations to maintain optimal performance and uptime, and driving cross-company coordination through the product management organization to align operational capabilities with strategic goals. By managing the full lifecycle from procurement to deployment and operational efficiency, the Operations team ensures that our AI-driven infrastructure is reliable, scalable, and aligned with business priorities.
What You’ll Do
- Track, log, and manage all quality issues arising in the data center, in both deployment and production environments
- Perform root cause analysis (RCA) for every failure (hardware, software, process)
- Analyze production system metrics and quality data to detect trends, anomalies, or weak points
- Improve turnaround time (TAT) for Return Merchandise Authorization (RMA) processes
- Design, monitor, and drive corrective and preventive actions (CAPA)
- Implement and verify containment actions to keep systems operational until permanent fixes are applied
- Collaborate with operations, hardware, engineering, supply chain, and vendors to resolve quality issues
- Capture and upload failure analysis (FA) reports and related data into Quality Management Systems (QMS)
- Verify the quality of spares (incoming and outgoing) to avoid repeat failures
- Define and track quality KPIs / SLAs and report on quality performance to leadership
- Oversee Material Review Board (MRB) inventory, rework, and disposal decisions
- Ensure the quality management system (QMS) is up to date, with necessary training rolled out
- Work cross-functionally during hardware ramps, deployments, and upgrades to enforce quality gates

Up to 30% travel may be required for this role.

You
- Have experience working with hardware / data center / infrastructure systems
- Are strong at data analysis, statistics, and metrics (you can turn raw data into insight)
- Are skilled in root cause analysis methods (5 Whys, fishbone, 8D, A3, etc.)
- Are comfortable managing cross-team communication, stakeholder expectations, and conflict resolution
- Are detail-oriented, process-driven, and quality-minded
- Have experience working with quality tools or QMS software (e.g., audit modules, ERP, defect tracking)
- Communicate clearly in English (both written and verbal)

Nice to Have
- Experience in the machine learning / AI infrastructure / GPU / HPC / computer hardware industry
- Exposure to data center standards and certifications (e.g., ISO, Uptime Institute)
- Experience with vendor quality, supply chain quality, or incoming inspections
- Understanding of firmware, embedded systems, and reliability engineering
- Familiarity with scripting or automation (Python, SQL, etc.) to help with data processing
- Exposure to cloud or hyperscaler infrastructure operations
- Experience applying manufacturing-style quality concepts to compute hardware

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
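To illustrate the kind of scripting and quality-metrics work this role involves, here is a minimal sketch of computing an RMA turnaround-time (TAT) KPI from a defect log. The record format, field names, and sample data are hypothetical, not drawn from any specific QMS:

```python
from datetime import date

# Hypothetical RMA log records: (ticket_id, opened, closed).
# Real data would come from a QMS export or defect-tracking database.
rma_log = [
    ("RMA-101", date(2025, 1, 2), date(2025, 1, 9)),
    ("RMA-102", date(2025, 1, 5), date(2025, 1, 8)),
    ("RMA-103", date(2025, 1, 7), date(2025, 1, 21)),
]

# Turnaround time (TAT) in days for each closed RMA.
tats = [(closed - opened).days for _, opened, closed in rma_log]

avg_tat = sum(tats) / len(tats)   # average TAT across all closed RMAs
worst_tat = max(tats)             # the slowest case, a candidate for RCA
print(f"avg TAT: {avg_tat:.1f} days, worst: {worst_tat} days")
```

In practice this kind of rollup would be reported against an SLA threshold, with outliers like the worst case fed into the root cause analysis process.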
About Lambda
- Founded in 2012; ~400 employees (2025) and growing fast
- We offer generous cash & equity compensation
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401(k) plan with 2% company match (US employees)
- Flexible paid time off plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
October 21, 2025