Top MLOps / DevOps Engineer Job Openings in 2025

Looking for opportunities as an MLOps / DevOps Engineer? This curated list features the latest MLOps / DevOps Engineer job openings at AI-native companies. Whether you're an experienced professional or just entering the field, you'll find roles that match your expertise, from startups to global tech leaders. Updated daily.


Senior Data Center Operations Engineer - Quincy, WA

Lambda AI
USD 115,000 – 173,000
United States
Full-time
Remote: No
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda's mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

Note: This position requires presence in our Quincy Data Center 5 days per week.

What You'll Do
• Ensure new server, storage, and network infrastructure is properly racked, labeled, cabled, and configured.
• Troubleshoot hardware and software issues in some of the world's most advanced GPU and networking systems.
• Document and update data center layout and network topology in DCIM software.
• Work with supply chain and manufacturing teams to ensure timely deployment of systems, and build project plans for large-scale deployments.
• Manage a parts depot inventory and track equipment through the delivery-store-stage-deploy-handoff process in each of our data centers.
• Partner with HW Support teams to ensure that data center hardware incidents requiring higher-level troubleshooting are resolved and reported on, and that solutions are disseminated to the larger operations organization.
• Work with the RMA team to ensure faulty parts are returned and replacements are ordered.
• Follow installation standards and documentation for placement, labeling, and cabling to drive consistency and discoverability across all data centers.

You
• Have strong experience with critical infrastructure systems supporting data centers, such as power distribution, airflow management, environmental monitoring, capacity planning, DCIM software, structured cabling, and cable management.
• Are familiar with carrier DIA circuit test and turn-up, fiber testing, and troubleshooting.
• Have basic knowledge of cable optics and their different use cases.
• Have a solid understanding of single- and three-phase power theory, including PDU balancing and why it is important.
• Are familiar with multiple cable media types and their uses.
• Know cold-aisle and hot-aisle containment.
• Have a solid understanding of server hardware and the boot process.
• Can structure, collaborate on, and iteratively improve complex maintenance MOPs.
• Work with product management, support, and other teams to align operational capabilities with company goals.
• Translate business priorities into technical and operational requirements.
• Support cross-functional projects where infrastructure plays a critical role.
• Are action-oriented and willing to train junior staff on best practices.
• Are willing to travel for bring-up of new data center locations as needed.

Nice to Have
• 3+ years of experience with critical infrastructure systems supporting data centers, such as power distribution, airflow management, environmental monitoring, capacity planning, DCIM software, structured cabling, and cable management.
• Experience with or knowledge of network topology and configuration, and 400Gb InfiniBand architectures.
• Experience with or knowledge of DDP or SCM cluster storage systems.
• 3+ years working with and reporting from a ticketing system like JIRA or Zendesk.
• Advanced experience with Linux administration.
• Experience with high-performance compute GPU systems (air- or water-cooled), especially NVIDIA NVL72.

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
• Founded in 2012, ~400 employees (2025) and growing fast.
• We offer generous cash and equity compensation.
• Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove.
• We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability.
• Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG.
• Health, dental, and vision coverage for you and your dependents.
• Wellness and commuter stipends for select roles.
• 401(k) plan with 2% company match (USA employees).
• Flexible paid time off plan that we all actually use.

A Final Note: You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
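The "PDU balancing" expectation in the posting above is ultimately arithmetic: load spread unevenly across the three phases of a power feed wastes capacity and stresses one phase. A minimal sketch of the idea, with hypothetical per-phase currents and a simple imbalance metric (the figures and metric are illustrative, not Lambda's):

```python
# Sketch: why PDU phase balancing matters on a three-phase feed.
# Hypothetical per-phase currents (amps) for one rack PDU.
loads = {"L1": 18.0, "L2": 12.0, "L3": 9.0}

avg = sum(loads.values()) / len(loads)
# One common imbalance metric: max deviation from the average, as a percentage.
imbalance_pct = max(abs(a - avg) for a in loads.values()) / avg * 100

print(f"average per-phase load: {avg:.1f} A")
print(f"imbalance: {imbalance_pct:.1f}%")  # a large value means one phase carries far more than its share
```

Keeping that percentage low is what lets a data center use its provisioned capacity fully without tripping breakers on a single hot phase.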
MLOps / DevOps Engineer
Data Science & Analytics

Software Engineer, Infrastructure

Decagon
USD 200,000 – 375,000
United States
Full-time
Remote: No
About Decagon
Decagon is the leading conversational AI platform empowering every brand to deliver a concierge customer experience. Our AI agents provide intelligent, human-like responses across chat, email, and voice, resolving millions of customer inquiries in every language, at any time.

Since coming out of stealth, Decagon has experienced rapid growth. We partner with industry leaders like Hertz, Eventbrite, Duolingo, Oura, Bilt, Curology, and Samsara to redefine customer experience at scale. We've raised over $200M from Bain Capital Ventures, Accel, a16z, BOND Capital, A*, Elad Gil, and notable angels such as the founders of Box, Airtable, Rippling, Okta, Lattice, and Klaviyo.

We're an in-office company, driven by a shared commitment to excellence and velocity. Our values (customers are everything, relentless momentum, winner's mindset, and stronger together) shape how we work and grow as a team.

About the Team
The Infrastructure team builds and operates the foundations that power Decagon: networking, data, ML serving, developer platform, and real-time voice. We partner closely with product, data, and ML to deliver high-scale, low-latency systems with clear SLOs and great developer ergonomics.

We organize around five focus areas:
• Core Infra: the foundational cloud stack (networking, compute, storage, security, and infrastructure-as-code) that ensures reliability, scale, and cost efficiency.
• Data Infra: streaming/batch data platforms powering analytics/BI and customer-facing telemetry, including for customer-managed and on-prem environments.
• ML Infra: GPU and model-serving platforms for LLM inference with multi-provider routing and support for on-prem/air-gapped deployments.
• Platform (DevEx): CI/CD, paved paths, and core services that make shipping fast, safe, and consistent across teams.
• Voice Infra: telephony/WebRTC stack and observability enabling ultra-low-latency, high-quality voice experiences.

Our mission is to deliver magical support experiences: AI agents working alongside humans to resolve issues quickly and accurately.

About the Role
We're hiring a Senior Infrastructure Engineer to design, build, and operate production infrastructure for high-scale, low-latency systems. You'll own critical services end-to-end, improve reliability and performance, and create paved paths that let every Decagon engineer ship confidently.

In this role, you will
• Design and implement critical infrastructure services with strong SLOs, clear runbooks, and actionable telemetry.
• Partner with research and product teams to architect solutions, set up prototypes, evaluate performance, and scale new features.
• Tune service latencies: optimize networking paths, apply smart caching/queuing, and tune CPU/memory/I/O for tight p95/p99s.
• Evolve CI/CD, golden paths, and self-service tooling to improve developer velocity and safety.
• Support various deployment architectures for customers with robust observability and upgrade paths.
• Lead infrastructure-as-code (Terraform) and GitOps practices; reduce drift with reusable modules and policy-as-code.
• Participate in on-call and drive down toil through automation and elimination of recurring issues.

Your background looks something like this
• 3+ years building and operating production infrastructure at scale.
• Depth in at least one area across Core/Data/AI-ML/Platform/Voice, with curiosity to learn the rest.
• Proven track record of meeting high-availability and low-latency targets (owning SLOs, p95/p99, and load testing).
• Excellent observability chops (OpenTelemetry, Prometheus/Grafana, Datadog) and incident response (PagerDuty, SLOs/error budgets).
• Clear written communication and the ability to turn ambiguous requirements into simple, reliable designs.

Even better
• Experience as an early backend/platform/infrastructure engineer at another company.
• Strong Kubernetes experience (GKE/EKS/AKS) and experience across multiple cloud providers (GCP, AWS, and Azure).
• Experience with customer-managed deployments.

Benefits
• Medical, dental, and vision
• Flexible time off
• Daily lunch/dinner and snacks in the office
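The latency-tuning responsibilities in the posting above revolve around tail percentiles (p95/p99). A minimal sketch of how such figures are derived from raw samples, using a nearest-rank percentile and made-up latency numbers (nothing here reflects Decagon's actual systems):

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

random.seed(0)
# Hypothetical request latencies in milliseconds: mostly fast, with an occasional slow tail.
latencies = [random.gauss(40, 5) for _ in range(950)] + [random.gauss(200, 30) for _ in range(50)]

p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
print(f"p95={p95:.0f} ms  p99={p99:.0f} ms")
```

The point of targeting p95/p99 rather than the mean is visible here: a 5% slow tail barely moves the average but dominates the high percentiles, which is where SLO work concentrates.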
MLOps / DevOps Engineer
Data Science & Analytics

AI Security Engineer - Red Team

Lakera AI
United States
Full-time
Remote: Yes
We're looking for an AI Security Engineer to join our Red Team and help us push the boundaries of AI security. You'll lead cutting-edge security assessments, develop novel testing methodologies, and work directly with enterprise clients to secure their AI systems. This role combines hands-on red teaming, automation development, and client engagement. You'll thrive in this role if you want to be at the forefront of an emerging discipline, enjoy working on nascent problems, and like both breaking things and building processes that scale.

Key Responsibilities
This is a highly cross-functional position. AI security is still being defined, with best practices emerging in real time. You'll be building the frameworks, methodologies, and tooling that scale our services while staying adaptable to rapid changes in the AI landscape. This role is ideal for someone who wants to take their traditional cybersecurity expertise and apply it to the new frontier of AI security and safety. Your focus will span several key areas:

Service Delivery & Client Engagement
• Lead end-to-end delivery of AI red teaming security assessment engagements with enterprise customers.
• Collaborate with clients to scope projects, define testing requirements, and establish success criteria.
• Conduct comprehensive security assessments of AI systems, including text-based LLM applications and multimodal agentic systems.
• Author detailed security assessment reports with actionable findings and remediation recommendations.
• Present findings and strategic recommendations to technical and executive stakeholders through report readouts.

Tooling & Methodology Development
• Build upon and improve our established processes and playbooks to scale AI red teaming service delivery.
• Develop frameworks to ensure consistent, high-quality service delivery.
• Find the tedious, repetitive stuff and automate it; you don't need to be a world-class developer, just someone who can build tools that make the team more effective.

Research & Innovation
• Develop novel red teaming methodologies for emerging modalities: image, video, audio, autonomous systems.
• Stay ahead of the latest AI security threats, attack vectors, and defense mechanisms.
• Translate cutting-edge academic and industry research into practical testing approaches.
• Collaborate with our research and product teams to continuously level up our methodologies.

Required Qualifications

Technical Expertise
• 3+ years of experience in cybersecurity with a focus on red teaming, penetration testing, or security assessments.
• Experience with web application and API penetration testing preferred.
• Deep understanding of LLM vulnerabilities, including prompt injection, data poisoning, and jailbreaking techniques.
• Practical experience with threat modeling complex systems and architectures.
• Proficiency in developing automated tooling to enable and enhance testing capabilities, improve workflows, and deliver deeper insights.

Professional Skills
• Proven track record of leading client-facing security assessment projects from scoping through delivery.
• Excellent technical writing skills, with experience creating executive-level security reports.
• Strong presentation and communication skills for diverse audiences.
• Experience building processes, documentation, and tooling for service delivery teams.

AI Security Knowledge
• Understanding of AI/ML model architectures, training processes, and deployment patterns.
• Familiarity with AI safety frameworks and alignment research.
• Knowledge of emerging AI attack surfaces, including multimodal systems and AI agents.

Preferred Qualifications
• Relevant security certifications (OSCP, OSWA, BSCP, etc.).
• Hands-on experience performing AI red teaming assessments; experience targeting agentic systems is a strong plus.
• Demonstrated experience designing LLM jailbreaks.
• Active participation in security research and tooling communities.
• Background in threat modeling and risk assessment frameworks.
• Previous speaking experience at security conferences or industry events.

What You'll Gain
• The opportunity to shape the future of AI security as an emerging discipline.
• Work with cutting-edge AI technologies and novel attack methodologies.
• Lead high-visibility projects with enterprise clients across diverse industries.
• Collaborate with a world-class research team pushing the boundaries of AI safety.
• A platform to establish thought leadership in the AI security community.
• Competitive compensation package with equity participation.

👉 Let's stay connected! Follow us on LinkedIn, Twitter & Instagram to learn more about what is happening at Lakera.
ℹ️ Join us on Momentum, the Slack community for everything AI safety and security.
❗ To remove your information from our recruitment database, please email privacy@lakera.ai.
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Software Engineer
Software Engineering

IT Solutions Engineer

OpenAI
USD 225,000 – 275,000
United States
Full-time
Remote: No
About the Team
IT Systems Operations is the operational layer within Security and IT that connects core teams and keeps employee-facing workflows, systems, and tools running reliably across the Employee Technology & Experience program. We design and implement the workflows, automations, and integrations that power the employee lifecycle, access, ITIL processes, and core business systems, partnering closely with Security and cross-functional teams to deliver solutions that work end to end.

Our mandate spans solution design, system implementation, workflow engineering, automation, and the reliability of the platforms behind employee-facing processes. We work closely with Support to resolve complex issues, refine workflows, and ensure our solutions perform reliably in real-world conditions. We turn operational problems into scalable, engineered systems that reduce friction, codify patterns, and help OpenAI grow predictably and securely.

About the Role
As a Solutions Engineer, you will design and implement scalable, ITIL-aligned workflows using OpenAI technology and models to automate and improve core IT processes. You will deliver cross-functional solutions from requirements and architecture through build, rollout, and operationalization across ITSM, IAM, SaaS platforms, identity-adjacent systems, lifecycle automation, access workflows, integrations, and enterprise orchestration.

You will reduce operational toil by building durable automations that ensure systems behave consistently, predictably, and securely. You will also work closely with IT Support Operations to eliminate root causes, streamline workflows, and equip Support with the least-privilege access needed to operate effectively.

This is an on-site role based in our San Francisco office with a minimum presence of three days per week.

In this role, you will:

Solution Design and Implementation
• Translate unclear requirements into structured workflows and dependable systems.
• Build integrations across identity, SaaS, ITSM, internal applications, and endpoints.
• Deliver implementations from architecture through rollout and documentation.

Support Partnership
• Address recurring issues with targeted workflow and system improvements.
• Equip Support with training, playbooks, and long-term administrative patterns.
• Manage on-call rotations and technical escalations across IT and Security.

Workflow, Automation, and Orchestration
• Build automated request flows across Slack, ChatGPT, Atlas, Jira, Linear, Incident.io, and Retool.
• Use APIs, Terraform, scripting, and low-code tools to remove manual effort.

ITIL Application and Management
• Maintain Incident, Request, Change, and Problem workflows across modern ITSM tools.
• Provide repeatable workflow templates for teams across the company.

Integrations
• Execute user, device, and SaaS integration playbooks.
• Lead identity, access, and ITSM migrations during acquisitions.

System Administration
• Maintain lifecycle, collaboration, and automation platforms such as Slack Grid, Google Workspace, Atlassian, and GitHub.
• Ensure system reliability through monitoring, logging, workflow health, and integration upkeep.

You might thrive in this role if you:
• Instinctively break down messy, ambiguous processes and turn them into clear, logical, well-structured systems.
• Navigate technical and non-technical teams with ease, translating needs and aligning stakeholders without friction.
• Identify patterns, eliminate repetition, and design solutions that scale rather than relying on manual effort.
• Understand identity as a foundation of enterprise architecture and think holistically about trust, entitlements, and lifecycle flows.
• Adapt quickly to new tools, understand how systems fit together, and enjoy learning the inner workings of enterprise platforms.
• Think in systems naturally, identify where data or workflows break down, and design clean, resilient cross-system connections.
• Appreciate structured frameworks like ITIL and Agile, using them to bring order, predictability, and continuous improvement.
• Communicate clearly through runbooks and written guidance that help others move faster and reduce institutional knowledge gaps.

About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity. We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic. For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement.

Background checks for applicants will be administered in accordance with applicable law, and qualified applicants with arrest or conviction records will be considered for employment consistent with those laws, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act, for US-based candidates.

For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.

To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance. We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering

Datacenter Hardware Engineer, HPC

Mistral AI
France
Full-time
Remote: No
About Mistral
At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life. We democratize AI through high-performance, optimized, open-source, cutting-edge models, products, and solutions. Our comprehensive AI platform is designed to meet enterprise needs, whether on-premises or in cloud environments. Our offerings include le Chat, the AI assistant for life and work.

We are a dynamic, collaborative team passionate about AI and its potential to transform society. Our diverse workforce thrives in competitive environments and is committed to driving innovation. Our teams are distributed between France, the USA, the UK, Germany, and Singapore. We are creative, low-ego, and team-spirited. Join us to be part of a pioneering company shaping the future of AI. Together, we can make a meaningful impact. See more about our culture at https://mistral.ai/careers.

Role Summary
Our compute footprint is growing fast to support our science and engineering teams. We're hiring a Datacenter HW Engineer to maintain, troubleshoot, and scale our GPU/CPU clusters safely and reliably. You'll execute hands-on hardware work in our Paris-area datacenter and partner with hardware owners, DC operations, and vendors to keep one of France's largest GPU clusters healthy.

Location: Bruyères-le-Châtel; on-site, field role
Reporting line: Hardware Ops

Impact
• Compute is a key lever for Mistral's success and our largest spend item.
• Direct impact on scale: your work keeps one of France's largest AI clusters healthy as we grow to unprecedented scale.
• Enable breakthrough AI: you unlock our science and engineering teams to deliver groundbreaking AI solutions.

What you will do
• Diagnose and operate core server/cluster components: investigate and handle compute/storage hardware issues (CPU, memory, drives, NICs, GPUs, PSUs) and interconnect problems (switches, cables, transceivers; Ethernet/InfiniBand). Perform safe interventions (power-off/lockout, ESD) to replace, re-seat, or recable components and restore service.
• Safety and procedures: apply lockout/tagout (LOTO) and ESD discipline; follow pre/post-work checklists; maintain tidy, safe work areas.
• First-line diagnostics: triage using LEDs, POST, beep codes, and basic tests; capture evidence (photos, serials, results); open, update, and close tickets with clear notes.
• Preventive maintenance: provide feedback and ideas to improve proactive activities, monitoring, and targeted follow-ups on recurring or specific anomalies; help turn ad-hoc checks into SOPs, alerts, and dashboards.
• Parts and logistics: receive and track parts, keep labeled inventory accurate, manage simple RMAs, and coordinate with vendors.
• Collaboration and escalation: partner with senior hardware/firmware owners on complex or multi-node issues; communicate status and next steps crisply.
• Documentation and quality: keep SOPs/checklists current; ensure zero undocumented changes and consistent, audit-ready records.

About you
• Hands-on mindset in datacenters/server hardware: you can install, re-seat, and swap GPU/PCIe cards, NICs, PSUs, and drives, and work cleanly in racks (rails, cabling, labeling). We also welcome candidates with strong Linux fundamentals (boot/check, logs) and scripting (Python/Bash) who are eager to learn hardware; you'll be trained and mentored by a senior hardware engineer.
• Disciplined and meticulous: follows checklists and ESD/LOTO procedures; no rough handling; careful with all high-value server components.
• Practical electrical basics: power-off, PPE, short-circuit risk awareness.
• Comfortable in racks: cooling, network, storage, PDU, cable management; can lift and mount safely (within HSE limits).
• Clear communicator: short factual updates; reliable teammate; punctual and process-minded.
• Hardware-passionate, professionally grounded: strong curiosity and a craft mindset.

Nice to have
• HPC/AI/cloud-at-scale experience (production environments); large-fleet server install and maintenance in datacenters.
• Basic networking (Ethernet/InfiniBand) and basic Linux (boot/check; no coding needed).
• Coding/automation skills (Python/Bash): small tools/scripts to improve checklists, photo/serial capture, inventory sync, or simple monitoring/reporting.
• Experience with inventory/RMA tools and vendor coordination.
• Exposure to HPC/research/industrial environments.

Location & on-site policy
Bruyères-le-Châtel datacenter; on-site only. Day shifts, with occasional evenings, weekends, or on-call possible to support interventions.

Location & Remote
The position is based in our Paris HQ offices and we encourage going to the office as much as we can (at least 3 days per week) to create bonds and smooth communication. Our remote policy aims to provide flexibility, improve work-life balance, and increase productivity. Each manager can decide the number of days worked remotely based on autonomy and specific context (e.g. more flexibility can occur during summer). In any case, employees are expected to maintain regular communication with their teams and be available during core working hours.

What we offer
💰 Competitive salary and equity package
🧑‍⚕️ Health insurance
🚴 Transportation allowance
🥎 Sport allowance
🥕 Meal vouchers
💰 Private pension plan
🍼 Generous parental leave policy
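The "small tools/scripts" item under Nice to Have might look like this in practice: a hypothetical Python helper that normalizes scanned serial numbers before an inventory sync (the function names and serial format are illustrative, not Mistral's):

```python
import re

def normalize_serial(raw: str) -> str:
    """Uppercase and strip whitespace/separators so scanned serials compare reliably."""
    return re.sub(r"[\s\-_]", "", raw.strip()).upper()

def dedupe_serials(scans):
    """Return unique serials in first-seen order, e.g. from a barcode-scanner dump."""
    seen, out = set(), []
    for raw in scans:
        s = normalize_serial(raw)
        if s and s not in seen:
            seen.add(s)
            out.append(s)
    return out

scans = [" sn-12ab34 ", "SN12AB34", "sn_99ZZ00"]
print(dedupe_serials(scans))  # → ['SN12AB34', 'SN99ZZ00']
```

Small normalizers like this are exactly the kind of tooling that turns ad-hoc checks into reliable inventory records, since hand-scanned serials arrive with inconsistent casing and separators.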
MLOps / DevOps Engineer
Data Science & Analytics

Infrastructure Engineer

Fathom
USD 180,000 – 240,000
United States
Full-time
Remote: Yes
ABOUT FATHOM
We think it's broken that so many people and businesses rely on notes to remember and share insights from their meetings. We created Fathom to eliminate the needless overhead of meetings. Our AI assistant captures, summarizes, and organizes the key moments of your calls, so you and your team can stay fully present without sacrificing context or clarity. From instant, searchable call summaries to seamless CRM updates and team-wide sharing, Fathom transforms meetings from a source of friction into a place for alignment and momentum.

We started Fathom to rid us all of the tyranny of note-taking, and people seem to really love what we've built so far:
🥇 #1 Highest Satisfaction Product of 2024 on G2
🔥 #1 Rated on G2 with 4,500+ reviews and a perfect 5/5 rating
🥇 #1 Product of the Day and #2 AI Product of the Year
🚀 Most-installed AI meeting assistant on both the Zoom and HubSpot marketplaces
📈 We're hitting usage and revenue records every week

We're growing incredibly quickly, so we're looking to grow our small but mighty team.

Role Overview
We are looking for an SRE who is passionate about leveraging data and automation to drive a highly dynamic infrastructure. The role is a unique blend of infrastructure and internal tooling to reduce friction at every step of delivering an amazing customer experience. As part of our team, you'll play a pivotal role in scaling our infrastructure, reducing toil through automation, and contributing to our culture of innovation and continuous improvement.

What you'll do

By 30 days:
• Use your observability background to help scale our existing tools to new heights as we continue to grow the platform.
• Enhance our existing automation for scaling our infrastructure and improve the development experience.

By 90 days:
• Play a key role in continuing to diversify and scale our platform across additional regions.
• Evaluate options to replace our existing real-time data pipeline for enhanced multi-regional capabilities.
• Provide platform support to all of engineering, using data-driven decision-making.

By 1 year:
• Work with engineering to re-evaluate what observability means for the Fathom platform, and drive improvements to remove friction.
• Help us design and implement improvements to our elastic multi-regional storage platform.
• Drive platform improvements to enhance reliability and efficiency.

Requirements

Hard skills:
• Proficiency with, and a preference for, Infrastructure as Code / GitOps tooling.
• Foundation in observability best practices and implementation.
• Experience in a SaaS or PaaS environment.
• Experience with Google Cloud Platform (GCP) and Google Kubernetes Engine (GKE), including proficiency with GCP/GKE networking.
• Familiarity with our tech stack: message queues, Prometheus, ClickHouse, ArgoCD, GitHub Actions, Golang (Ruby / Rails is a bonus).

Soft skills:
• Curiosity-driven with a focus on delivering results.
• A generalist mindset with the ability to dive deep into a wide range of challenges.
• Resilience and an ability to grind through complex problems.
• Openness to disagreement and commitment to decisions once made.
• Strong collaborative skills, with the ability to explain complex insights in an accessible manner.
• Independence in managing one's workload and priorities.

What You'll Get
• The opportunity to shape the dynamic platform of a growing company.
• A role that balances scaling infrastructure, enabling development teams, and internal tooling development.
• A chance to work with a dynamic and collaborative team.
• Competitive compensation and benefits.
• A supportive environment that encourages innovation and personal growth.

Join Us
If you're excited to own the data journey at Fathom and contribute to our mission with your analytical expertise, we would love to hear from you. Apply now to become a key player in our data-driven success story.
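The observability requirements in the posting above often come down to error-budget arithmetic in practice. A minimal sketch, assuming an example 99.9% monthly availability SLO (the target and incident figures are illustrative, not Fathom's):

```python
# Error-budget sketch: how much downtime a 99.9% monthly SLO allows.
slo = 0.999
minutes_in_month = 30 * 24 * 60  # 43,200 minutes in a 30-day month

# The budget is the complement of the SLO target.
budget_minutes = (1 - slo) * minutes_in_month

consumed_minutes = 12.0  # hypothetical incident downtime so far this month
remaining = budget_minutes - consumed_minutes
print(f"budget: {budget_minutes:.1f} min, remaining: {remaining:.1f} min")
```

Framing reliability as a spendable budget is what lets a team decide objectively when to ship features versus when to pause and invest in stability.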
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering

Data Center Operations Systems Engineer - Atlanta

Lambda AI
USD 89,000 – 134,000
United States
Full-time
Remote: No
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our Atlanta, GA Data Center 5 days per week.

The Operations team plays a critical role in ensuring the seamless end-to-end execution of our AI-IaaS infrastructure and hardware. This team is responsible for sourcing all necessary infrastructure and components, overseeing day-to-day data center operations to maintain optimal performance and uptime, and driving cross-company coordination through the product management organization to align operational capabilities with strategic goals. By managing the full lifecycle from procurement to deployment and operational efficiency, the Operations team ensures that our AI-driven infrastructure is reliable, scalable, and aligned with business priorities.

What You'll Do
- Ensure new server, storage, and network infrastructure is properly racked, labeled, cabled, and configured
- Document data center layout and network topology in DCIM software
- Work with supply chain & manufacturing teams to ensure timely deployment of systems and project plans for large-scale deployments
- Participate in data center capacity and roadmap planning with sales and customer success teams to allocate floorspace
- Assess current and future-state data center requirements based on growth plans and technology trends
- Manage a parts depot inventory and track equipment through the delivery-store-stage-deploy-handoff process in each of our data centers
- Work closely with the HW Support team to ensure data center infrastructure-related support tickets are resolved
- Work with the RMA team to ensure faulty parts are returned and replacements are ordered
- Create installation standards and documentation for placement, labeling, and cabling to drive consistency and discoverability across all data centers
- Serve as a subject-matter expert on data center deployments as part of sales engagements for large-scale deployments in our data centers and at customer sites

You
- Have experience with critical infrastructure systems supporting data centers, such as power distribution, airflow management, environmental monitoring, capacity planning, DCIM software, structured cabling, and cable management
- Have strong Linux administration experience
- Have experience setting up networking appliances (Ethernet and InfiniBand) across multiple data center locations
- Are action-oriented and have a strong willingness to learn
- Are willing to travel for bring-up of new data center locations

Nice to Have
- Experience troubleshooting the following network layers, technologies, and system protocols: TCP/IP, UDP/IP, BGP, OSPF, SNMP, SSL, HTTP, FTP, SSH, Syslog, DHCP, DNS, RDP, NETBIOS, IP routing, Ethernet, switched Ethernet, 802.11x, NFS, and VLANs
- Experience working in large-scale distributed data center environments
- Experience working with auditors to meet all compliance requirements (ISO/SOC)

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast
- We offer generous cash & equity compensation
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401k plan with 2% company match (USA employees)
- Flexible paid time off plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
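The delivery-store-stage-deploy-handoff process named in this posting is essentially a small state machine over an asset's lifecycle. A minimal Python sketch of how a parts depot might enforce valid transitions; the class, serial number, and enforcement logic are illustrative, not Lambda's actual tooling:

```python
# Minimal sketch of tracking equipment through the
# delivery -> store -> stage -> deploy -> handoff lifecycle.
# Stage names come from the posting; everything else is invented.

VALID_TRANSITIONS = {
    "delivery": {"store"},
    "store": {"stage"},
    "stage": {"deploy"},
    "deploy": {"handoff"},
    "handoff": set(),  # terminal state: asset handed off to operations
}

class Asset:
    def __init__(self, serial: str):
        self.serial = serial
        self.state = "delivery"
        self.history = ["delivery"]  # audit trail

    def advance(self, new_state: str) -> None:
        if new_state not in VALID_TRANSITIONS[self.state]:
            raise ValueError(
                f"{self.serial}: cannot move {self.state} -> {new_state}"
            )
        self.state = new_state
        self.history.append(new_state)

gpu_node = Asset("GPU-NODE-0001")  # hypothetical serial
for step in ("store", "stage", "deploy", "handoff"):
    gpu_node.advance(step)
print(gpu_node.history)
```

The point of the explicit transition table is that a skipped step (say, deploying hardware that was never staged) fails loudly instead of silently corrupting the inventory record.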
MLOps / DevOps Engineer
Data Science & Analytics
Apply

Data Center Operations Engineer - Virginia

Lambda AI
USD
89000
-
134000
United States
Full-time
Remote
false
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our Ashburn and Sterling, VA Data Centers 5 days per week.

The Operations team plays a critical role in ensuring the seamless end-to-end execution of our AI-IaaS infrastructure and hardware. This team is responsible for sourcing all necessary infrastructure and components, overseeing day-to-day data center operations to maintain optimal performance and uptime, and driving cross-company coordination through the product management organization to align operational capabilities with strategic goals. By managing the full lifecycle from procurement to deployment and operational efficiency, the Operations team ensures that our AI-driven infrastructure is reliable, scalable, and aligned with business priorities.

What You'll Do
- Ensure new server, storage, and network infrastructure is properly racked, labeled, cabled, and configured
- Document data center layout and network topology in DCIM software
- Work with supply chain & manufacturing teams to ensure timely deployment of systems and project plans for large-scale deployments
- Participate in data center capacity and roadmap planning with sales and customer success teams to allocate floorspace
- Assess current and future-state data center requirements based on growth plans and technology trends
- Manage a parts depot inventory and track equipment through the delivery-store-stage-deploy-handoff process in each of our data centers
- Work closely with the HW Support team to ensure data center infrastructure-related support tickets are resolved
- Work with the RMA team to ensure faulty parts are returned and replacements are ordered
- Create installation standards and documentation for placement, labeling, and cabling to drive consistency and discoverability across all data centers
- Serve as a subject-matter expert on data center deployments as part of sales engagements for large-scale deployments in our data centers and at customer sites

You
- Have experience with critical infrastructure systems supporting data centers, such as power distribution, airflow management, environmental monitoring, capacity planning, DCIM software, structured cabling, and cable management
- Have strong Linux administration experience
- Have experience setting up networking appliances (Ethernet and InfiniBand) across multiple data center locations
- Are action-oriented and have a strong willingness to learn
- Are willing to travel to bring up new data center locations

Nice to Have
- Experience troubleshooting the following network layers, technologies, and system protocols: TCP/IP, UDP/IP, BGP, OSPF, SNMP, SSL, HTTP, FTP, SSH, Syslog, DHCP, DNS, RDP, NETBIOS, IP routing, Ethernet, switched Ethernet, 802.11x, NFS, and VLANs
- Experience working in large-scale distributed data center environments
- Experience working with auditors to meet all compliance requirements (ISO/SOC)

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast
- We offer generous cash & equity compensation
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401k plan with 2% company match (USA employees)
- Flexible paid time off plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
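Power distribution work like that described in this posting often comes down to keeping per-phase PDU loads balanced, since a badly skewed phase wastes capacity and trips breakers early. A small illustrative Python check; the 10% threshold and the per-phase amperages are made-up example values, not a standard:

```python
# Illustrative PDU phase-balance check. The imbalance threshold
# and load figures are hypothetical example values.

def phase_imbalance(loads_amps):
    """Return the max deviation from the mean phase load, as a fraction."""
    mean = sum(loads_amps) / len(loads_amps)
    return max(abs(a - mean) for a in loads_amps) / mean

loads = {"L1": 18.0, "L2": 22.0, "L3": 20.0}  # amps per phase (example data)
imbalance = phase_imbalance(list(loads.values()))
if imbalance > 0.10:  # hypothetical alerting threshold
    print(f"WARNING: phases out of balance by {imbalance:.0%}")
else:
    print(f"OK: imbalance {imbalance:.0%}")
```

In practice the same calculation would be fed from PDU telemetry in the DCIM system rather than hard-coded values.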
MLOps / DevOps Engineer
Data Science & Analytics
Apply

Senior Site Reliability Engineer - Networking

Lambda AI
USD
250000
-
417000
United States
Full-time
Remote
false
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco/San Jose/Seattle office location 4 days per week; Lambda’s designated work-from-home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems, as well as internal tooling for system deployment, management, and maintenance.

What You'll Do
- Help scale Lambda’s high-performance multi-tenant cloud network
- Contribute to the reproducible automation of network configuration and deployments
- Contribute to the implementation and operation of Software Defined Networks
- Help deploy and manage Spine and Leaf networks
- Ensure high availability of our network through observability, failover, and redundancy
- Ensure clients have predictable networking performance through the use of network engineering and other applicable technologies
- Help deploy and maintain network monitoring and management tools
- Participate in on-call

You
- Have 5+ years of experience as a SWE, SRE, or Network Reliability Engineer
- Have been part of the implementation of production-scale networking projects
- Have experience with on-call rotations and incident response management
- Have experience building and maintaining Software Defined Networks (SDN), including experience with OpenStack, Neutron, and OVN
- Are comfortable on the Linux command line and have an understanding of the Linux networking stack
- Have experience with multi-data-center networks and hybrid cloud networks
- Have Python programming experience and experience with configuration management tools like Ansible
- Have experience with CI/CD tools for deployment, and with Git
- Have operated a network environment with GitOps practices in place
- Have experience with application lifecycle and deployments on Kubernetes

Nice to Have
- Operated production-scale SDNs in a cloud context (e.g., helped implement or operate the infrastructure that powers an AWS VPC-like feature)
- Software development experience with C, Go, or Python
- Experience automating network configuration within public clouds, with tools like Kubernetes, Helm, Terraform, and Ansible
- Deep understanding of the Linux networking stack and its interaction with network virtualization, SR-IOV, and DPDK
- Understanding of the SDN ecosystem (e.g., OVS, Neutron, VMware NSX, Cisco ACI or Nexus Fabric Controller, Arista CVP)
- Experience with Spine and Leaf (Clos) network topologies
- Experience with and understanding of BGP EVPN VXLAN networks
- Experience building and maintaining multi-data-center networks, SD-WAN, and DWDM
- Experience with Next-Generation Firewalls (NGFW)

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast
- We offer generous cash & equity compensation
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401k plan with 2% company match (USA employees)
- Flexible paid time off plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
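Operating a network with GitOps practices, as this posting mentions, typically means the desired device configuration lives in Git and a reconciler reports drift against the running state. A toy Python sketch of the diff step only; the device settings and values are invented, and real tooling would pull running state from devices via NETCONF, gNMI, or similar:

```python
# Toy GitOps-style drift check: compare desired config (as it would
# come from a Git repo) against running state. Keys/values are invented.

def config_drift(desired: dict, running: dict) -> dict:
    """Return {setting: (desired, running)} for every mismatch."""
    keys = desired.keys() | running.keys()
    return {
        k: (desired.get(k), running.get(k))
        for k in keys
        if desired.get(k) != running.get(k)
    }

desired = {"vlan100": "tenant-a", "mtu": 9000, "bgp_asn": 65001}
running = {"vlan100": "tenant-a", "mtu": 1500, "bgp_asn": 65001}

for key, (want, have) in sorted(config_drift(desired, running).items()):
    print(f"DRIFT {key}: desired={want} running={have}")
```

The value of the pattern is that Git history becomes the audit log: any setting not matching the repo is, by definition, drift to be corrected.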
MLOps / DevOps Engineer
Data Science & Analytics
Apply

Head of AI Platform

Abridge
USD
270000
-
340000
United States
Full-time
Remote
false
About Abridge
Abridge was founded in 2018 with the mission of powering deeper understanding in healthcare. Our AI-powered platform was purpose-built for medical conversations, improving clinical documentation efficiencies while enabling clinicians to focus on what matters most—their patients.

Our enterprise-grade technology transforms patient-clinician conversations into structured clinical notes in real-time, with deep EMR integrations. Powered by Linked Evidence and our purpose-built, auditable AI, we are the only company that maps AI-generated summaries to ground truth, helping providers quickly trust and verify the output. As pioneers in generative AI for healthcare, we are setting the industry standards for the responsible deployment of AI across health systems.

We are a growing team of practicing MDs, AI scientists, PhDs, creatives, technologists, and engineers working together to empower people and make care make more sense. We have offices located in the Mission District in San Francisco, the SoHo neighborhood of New York, and East Liberty in Pittsburgh.

The Role
Our generative AI-powered products bring joy back to the practice of medicine. As our offerings expand, we’re looking for a Head of AI Platform to scale the infrastructure that powers them. This is a critical, high-leverage role requiring people leadership, technical strategy, and ownership of key business outcomes. You will own the entire lifecycle of our AI Platform, ensuring its reliability, efficiency, scalability, and compliance. You’ll own a key pillar of our technical organization, driving the technical direction and shaping how our models are trained, served, and managed in production.

What You’ll Do
- People Management: Recruit, retain, and mentor engineers and engineering managers. Provide regular feedback, create opportunities for career growth, and foster a culture of collaboration and excellence.
- Technical & Organizational Leadership: Act as the people and technical leader for the AI Platform team. This includes owning the staffing and execution of the team, and driving work on model serving, training compute, the agent serving platform, the LLM gateway, and associated orchestration layers. You will guide architectural discussions and set top-level strategic direction for the company’s AI/ML infrastructure.
- Project Management: Work closely with stakeholders, including product managers, engineering managers, and AI/ML teams, to plan, execute, and support multiple projects simultaneously. You will be responsible for the engineering process in the team and the output of the platform.
- Platform Ownership: Own the design, build, and operation of the core AI platform components, including model serving and deployment infrastructure; compute and vendor management (e.g., GPU allocation); MLOps pipelines and tooling; health, quality, and performance monitoring; training compute infrastructure; and the LLM gateway and orchestration layers for agent serving.
- Champion Quality: Set a high standard for your team, including software quality, communication, collaboration, and compliance with industry and regulatory standards.

What You’ll Bring
- A strong technologist with 10+ years of experience building high-performance distributed systems and 3+ years managing AI/ML-focused engineering teams
- Comfortable giving constructive feedback on technical designs and code reviews
- Skilled in building secure, compliant systems on major cloud platforms (GCP preferred, but other experience welcome)
- Skilled at hiring and mentorship, with a track record of helping engineers grow their skills and careers
- Expertise with Kubernetes, containers, model training and serving, GPU-based capacity planning, and building applications on top of LLMs
- Knowledgeable about the software development lifecycle; you view processes such as Kanban and Scrum as tools in a toolbox, and you know which tools to apply in which situations
- Up to date on industry best practices and tools, and enjoy learning new things
- Excited about being hands-on in a fast-moving, productive, and supportive environment
- Willing to pitch in wherever needed
- Have thrived in a fast-growing startup and know how to operate in that environment

Bonus Points If…
- You have owned an Evaluation Platform for AI/ML models
- You have owned data engineering or core infrastructure

Why Work at Abridge?
At Abridge, we’re transforming healthcare delivery experiences with generative AI, enabling clinicians and patients to connect in deeper, more meaningful ways. Our mission is clear: to power deeper understanding in healthcare. We’re driving real, lasting change, with millions of medical conversations processed each month.

Joining Abridge means stepping into a fast-paced, high-growth startup where your contributions truly make a difference. Our culture requires extreme ownership—every employee has the ability to (and is expected to) make an impact on our customers and our business.

Beyond individual impact, you will have the opportunity to work alongside a team of curious, high-achieving people in a supportive environment where success is shared, growth is constant, and feedback fuels progress. At Abridge, it’s not just what we do—it’s how we do it. Every decision is rooted in empathy, always prioritizing the needs of clinicians and patients.

We’re committed to supporting your growth, both professionally and personally. Whether it's flexible work hours, an inclusive culture, or ongoing learning opportunities, we are here to help you thrive and do the best work of your life.

If you are ready to make a meaningful impact alongside passionate people who care deeply about what they do, Abridge is the place for you.

How we take care of Abridgers:
- Generous Time Off: 13 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees
- Comprehensive Health Plans: Medical, Dental, and Vision plans for all full-time employees. Abridge covers 100% of the premium for you and 75% for dependents. If you choose an HSA-eligible plan, Abridge also makes monthly contributions to your HSA.
- Paid Parental Leave: 16 weeks paid parental leave for all full-time employees
- 401k and Matching: Contribution matching to help invest in your future
- Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits
- Learning and Development Budget: Yearly contributions for coaching, courses, workshops, conferences, and more
- Sabbatical Leave: 30 days of paid Sabbatical Leave after 5 years of employment
- Compensation and Equity: Competitive compensation and equity grants for full-time employees
- ... and much more!

Equal Opportunity Employer
Abridge is an equal opportunity employer and considers all qualified applicants equally without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, or disability.

Staying safe - Protect yourself from recruitment fraud
We are aware of individuals and entities fraudulently representing themselves as Abridge recruiters and/or hiring managers. Abridge will never ask for financial information or payment, or for personal information such as bank account number or social security number during the job application or interview process. Any emails from the Abridge recruiting team will come from an @abridge.com email address. You can learn more about how to protect yourself from these types of fraud by referring to this article. Please exercise caution and cease communications if something feels suspicious about your interactions.
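An "LLM gateway" of the kind this role owns typically sits between product code and multiple model backends, handling routing, fallback, and error aggregation. A highly simplified Python sketch; the backend names, failure behavior, and interface are invented for illustration and are not Abridge's actual architecture:

```python
# Highly simplified LLM-gateway routing sketch: try backends in
# priority order and fall back on failure. All names are invented.

class BackendError(Exception):
    pass

def flaky_backend(prompt: str) -> str:
    # Stand-in for a primary model endpoint that is currently down.
    raise BackendError("primary model unavailable")

def stable_backend(prompt: str) -> str:
    # Stand-in for a secondary model endpoint.
    return f"summary({prompt})"

BACKENDS = [("primary", flaky_backend), ("fallback", stable_backend)]

def route(prompt: str) -> tuple[str, str]:
    """Return (backend_name, response), falling back on errors."""
    errors = []
    for backend_name, call in BACKENDS:
        try:
            return backend_name, call(prompt)
        except BackendError as exc:
            errors.append(f"{backend_name}: {exc}")
    raise BackendError("; ".join(errors))

name, response = route("clinic note")
print(name, response)  # request lands on the fallback backend
```

A production gateway layers the same idea with authentication, rate limiting, per-tenant quotas, and request/response logging for the monitoring the posting describes.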
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Apply

Systems Integration Engineer – Head Subsystem

Figure AI
USD
150000
-
350000
United States
Full-time
Remote
false
Figure is an AI robotics company developing autonomous general-purpose humanoid robots. The goal of the company is to ship humanoid robots with human-level intelligence. Its robots are engineered to perform a variety of tasks in the home and commercial markets. Figure is headquartered in San Jose, CA. Figure’s vision is to deploy autonomous humanoids at global scale.

Our Helix team is looking for an experienced Training Infrastructure Engineer to take our infrastructure to the next level. This role is focused on managing the training cluster and implementing distributed training algorithms, data loaders, and developer tools for AI researchers. The ideal candidate has experience building tools and infrastructure for a large-scale deep learning system.

Responsibilities
- Design, deploy, and maintain Figure's training clusters
- Architect and maintain scalable deep learning frameworks for training on massive robot datasets
- Work together with AI researchers to implement training of new model architectures at large scale
- Implement distributed training and parallelization strategies to reduce model development cycles
- Implement tooling for data processing, model experimentation, and continuous integration

Requirements
- Strong software engineering fundamentals
- Bachelor's or Master's degree in Computer Science, Robotics, Engineering, or a related field
- Experience with Python and PyTorch
- Experience managing HPC clusters for deep neural network training
- Minimum of 4 years of professional, full-time experience building reliable backend systems

Bonus Qualifications
- Experience managing cloud infrastructure (AWS, Azure, GCP)
- Experience with job scheduling / orchestration tools (SLURM, Kubernetes, LSF, etc.)
- Experience with configuration management tools (Ansible, Terraform, Puppet, Chef, etc.)

The US base salary range for this full-time position is between $150,000 - $350,000 annually. The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended.
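The distributed-training responsibilities above center on synchronizing gradients across workers each step. A pure-Python illustration of the all-reduce mean that data-parallel frameworks (e.g., PyTorch's DistributedDataParallel) perform over real tensors; the worker gradients here are made-up numbers and the loop is single-process:

```python
# Pure-Python illustration of the gradient all-reduce (mean) step that
# data-parallel training performs each iteration. Values are made up.

def allreduce_mean(per_worker_grads):
    """Average the gradient vector element-wise across workers."""
    n_workers = len(per_worker_grads)
    n_params = len(per_worker_grads[0])
    return [
        sum(worker[i] for worker in per_worker_grads) / n_workers
        for i in range(n_params)
    ]

# Four workers, each holding gradients for the same three parameters
# computed from different shards of the batch.
grads = [
    [0.1, -0.2, 0.4],
    [0.3,  0.0, 0.0],
    [0.1, -0.2, 0.4],
    [0.3,  0.0, 0.0],
]
averaged = allreduce_mean(grads)
print(averaged)  # every worker applies this same averaged update
```

Real systems do this with ring or tree all-reduce over NCCL so communication cost stays roughly constant per worker as the cluster grows, but the arithmetic is the same.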
MLOps / DevOps Engineer
Data Science & Analytics
Apply

Validation Engineer – Mechanical Systems

Figure AI
USD
150000
-
350000
United States
Full-time
Remote
false
Figure is an AI robotics company developing autonomous general-purpose humanoid robots. The goal of the company is to ship humanoid robots with human-level intelligence. Its robots are engineered to perform a variety of tasks in the home and commercial markets. Figure is headquartered in San Jose, CA. Figure’s vision is to deploy autonomous humanoids at global scale.

Our Helix team is looking for an experienced Training Infrastructure Engineer to take our infrastructure to the next level. This role is focused on managing the training cluster and implementing distributed training algorithms, data loaders, and developer tools for AI researchers. The ideal candidate has experience building tools and infrastructure for a large-scale deep learning system.

Responsibilities
- Design, deploy, and maintain Figure's training clusters
- Architect and maintain scalable deep learning frameworks for training on massive robot datasets
- Work together with AI researchers to implement training of new model architectures at large scale
- Implement distributed training and parallelization strategies to reduce model development cycles
- Implement tooling for data processing, model experimentation, and continuous integration

Requirements
- Strong software engineering fundamentals
- Bachelor's or Master's degree in Computer Science, Robotics, Engineering, or a related field
- Experience with Python and PyTorch
- Experience managing HPC clusters for deep neural network training
- Minimum of 4 years of professional, full-time experience building reliable backend systems

Bonus Qualifications
- Experience managing cloud infrastructure (AWS, Azure, GCP)
- Experience with job scheduling / orchestration tools (SLURM, Kubernetes, LSF, etc.)
- Experience with configuration management tools (Ansible, Terraform, Puppet, Chef, etc.)

The US base salary range for this full-time position is between $150,000 - $350,000 annually. The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended.
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply

Senior Infrastructure Engineer

Ema
-
India
Full-time
Remote
false
Who we are
Ema is building the next generation of AI technology to empower every employee in the enterprise to be their most creative and productive. Our proprietary tech allows enterprises to delegate most repetitive tasks to Ema, the AI employee. We are founded by ex-Google, Coinbase, and Okta executives and serial entrepreneurs. We’ve raised capital from notable investors such as Accel Partners, Naspers, Section32, and a host of prominent Silicon Valley angels, including Sheryl Sandberg (Facebook/Google), Divesh Makan (Iconiq Capital), Jerry Yang (Yahoo), Dustin Moskovitz (Facebook/Asana), David Baszucki (Roblox CEO), and Gokul Rajaram (Doordash, Square, Google).

Our team is a powerhouse of talent, comprising engineers from leading tech companies like Google, Microsoft Research, Facebook, Square/Block, and Coinbase. All our team members hail from top-tier educational institutions such as Stanford, MIT, UC Berkeley, CMU, and the Indian Institute of Technology. We’re well funded by the top investors and angels in the world. Ema is based in Silicon Valley and Bangalore, India. This will be a hybrid role where we expect employees to work from the office three days a week.

Who you are
We are seeking an experienced Infrastructure Engineer to join our growing team and play a pivotal role in designing and building our platform and infrastructure as we continue to scale our product and user base. As part of our team, you will work in a dynamic, fast-paced environment to ensure the reliability, scalability, and performance of our systems, while focusing on service architecture and deployment, query optimization, distributed systems, data and machine learning infrastructure, and security and authentication. Most importantly, you are excited to be part of a mission-oriented, fast-paced, high-growth startup that can create a lasting impact.

You will:
- Partner with product, infra, and engineering teams to architect and build Ema’s next-generation infrastructure platform supporting multi-cloud deployments and on-prem installations
- Design and implement scalable, secure, and resilient deployment frameworks for Ema SaaS and enterprise on-prem environments, enabling automated installation, upgrades, and lifecycle management of Ema
- Develop and maintain multi-cloud infrastructure pipelines (AWS, Azure, GCP) using Kubernetes, Helm, Terraform, and cloud-native services to ensure seamless and reliable deployments
- Build tools and frameworks to automate the provisioning, configuration, monitoring, and upgrade of Ema environments at scale
- Design and optimize CI/CD pipelines (GitHub Actions, Cloud Build, etc.) to streamline the release process across environments while improving developer experience
- Contribute code and automation scripts (in Python, Go, or Shell) to strengthen infrastructure management and deployment reliability
- Ensure observability, scalability, and security across distributed systems by integrating monitoring, logging, and alerting solutions
- Collaborate cross-functionally to evolve Ema’s infra architecture, enabling faster deployments, lower operational overhead, and improved platform stability

Nice to Have
- Experience designing installers and deployment managers for both SaaS and air-gapped on-prem environments
- Strong understanding of container orchestration (Kubernetes) and infrastructure as code (Terraform, Helm)
- Hands-on experience with automation frameworks (Ansible, ArgoCD, Flux, or similar)
- Knowledge of service mesh and networking for multi-cloud environments (Istio, Envoy, or similar)
- Familiarity with monitoring and observability stacks (Prometheus, Grafana, SigNoz, PagerDuty)
- Prior experience in MLOps or data infrastructure
- Proficiency in Python or Go
- Exposure to air-gapped deployments, private clouds, or secure enterprise installations

Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field
- 5+ years of hands-on experience in Infrastructure, Platform, or DevOps Engineering, with strong exposure to multi-cloud environments
- Strong analytical and problem-solving skills, with a focus on scalability, reliability, and performance
- Demonstrated ability to work independently and collaboratively in a fast-paced, high-growth environment
- Experience working with global, cross-functional teams across time zones

Ema Unlimited is an equal opportunity employer and is committed to providing equal employment opportunities to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability, sexual orientation, gender identity, or genetics.
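Automating the provisioning, configuration, monitoring, and upgrade of environments, as this posting describes, usually means running steps in dependency order so that, for example, the cluster exists before anything is deployed onto it. A small Python sketch using the standard library's topological sort; the step names and dependency graph are invented for illustration:

```python
# Sketch of ordering environment-provisioning steps by dependency.
# Step names and the dependency graph are invented for illustration.
from graphlib import TopologicalSorter

# step -> set of steps it depends on
deps = {
    "network": set(),
    "cluster": {"network"},           # the cluster needs the network first
    "secrets": {"cluster"},
    "app": {"cluster", "secrets"},
    "monitoring": {"app"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # one valid provisioning order; ties may resolve differently
```

The same graph run in reverse gives a safe teardown order, which is why installers and upgrade managers tend to model environments this way rather than as a hard-coded script.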
MLOps / DevOps Engineer
Data Science & Analytics
Apply

Site Reliability Engineer, Frontier Systems Infrastructure

OpenAI
USD
255000
-
490000
United States
Full-time
Remote
false
About the Team
The Frontier Systems team at OpenAI builds, launches, and supports the largest supercomputers in the world that OpenAI uses for its most cutting-edge model training. We take data center designs, turn them into real, working systems, and build any software needed for running large-scale frontier model trainings. Our mission is to bring up, stabilize, and keep these hyperscale supercomputers reliable and efficient during the training of frontier models.

About the Role
We are looking for engineers to operate the next generation of compute clusters that power OpenAI's frontier research. This role blends distributed systems engineering with hands-on infrastructure work on our largest datacenters. You will scale Kubernetes clusters to massive size, automate bare-metal bring-up, and build the software layer that hides the complexity of vast fleets of nodes across multiple data centers. You will work at the intersection of hardware and software, where speed and reliability are critical. Expect to manage fast-moving operations, quickly diagnose and fix issues when things are on fire, and continuously raise the bar for automation and uptime.

In this role, you will:
Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
Improve operational metrics such as reducing cluster restart times (e.g., from hours to minutes) and accelerating firmware or OS upgrade cycles
Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure
Develop monitoring and observability systems to detect issues early and keep clusters stable under extreme load
Be expected to execute at the same level as a software engineer

You might thrive in this role if you:
Have deep experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environments
Bring strong programming or scripting skills (Python, Go, or similar) and familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation
Are comfortable with bare-metal Linux environments, GPU hardware, and large-scale networking
Enjoy solving fast-moving, high-impact operational problems and building automation to eliminate manual work
Can balance careful engineering with the urgency of keeping mission-critical systems running

Qualifications
Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
Strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
Proficiency in cloud infrastructure concepts (compute, networking, storage, security) and in automating cluster or data center operations
Bonus: background with GPU workloads, firmware management, or high-performance computing

About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity. We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic. For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement.

Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.

To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance. We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
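The node bring-up and hardware-health duties described in this posting can be sketched in miniature: given flattened Kubernetes node conditions, decide which nodes are servable and which go back through a reboot or full reprovisioning. Everything here — node names, the condition fields, and the two remediation buckets — is an illustrative assumption, not OpenAI's actual tooling.

```python
# Illustrative triage of node readiness during large cluster bring-up.
# `nodes` mimics a flattened view of Kubernetes NodeCondition statuses.

def triage_nodes(nodes):
    """Split nodes into ready / needs-reboot / needs-reprovision buckets."""
    ready, reboot, reprovision = [], [], []
    for node in nodes:
        conditions = node["conditions"]
        if conditions.get("Ready") == "True":
            ready.append(node["name"])
        elif conditions.get("DiskPressure") == "True":
            # On bare metal, disk pressure at bring-up often means a failed
            # image write: send the node back through provisioning.
            reprovision.append(node["name"])
        else:
            # Anything else (NotReady, kubelet flapping) gets a reboot first.
            reboot.append(node["name"])
    return ready, reboot, reprovision
```

At hyperscale, the value of a loop like this is that it runs continuously and feeds automation, so humans only see the nodes that fail both remediation paths.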
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
Lambda.jpg

IT Systems Engineer, Infrastructure & Platform Reliability

Lambda AI
USD
0
206000
-
310000
US.svg
United States
Full-time
Remote
false
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda's mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco or San Jose office location 4 days per week; Lambda's designated work from home day is currently Tuesday.

Information Systems at Lambda is responsible for building and scaling the internal systems that power our business. We partner across the company—Finance, GTM, Engineering, and People—to implement tools, automate workflows, and ensure data flows securely and accurately. Our scope includes enterprise applications, integrations, data platform and analytics, compliance automation, and all things IT.

What You'll Do
Design, write, and deliver software and services to improve the availability, scalability, reliability, and efficiency of Lambda's internal IT systems and platforms.
Solve problems relating to mission-critical services and build automation to prevent problem recurrence, with the goal of automating response to all non-exceptional events.
Work with Lambda Engineering and internal teams to influence and create new designs, architectures, standards, and methods for large-scale distributed systems.
Engage in service capacity planning and demand forecasting, software performance analysis, and system tuning.
Be an excellent communicator, producing documentation and related artifacts for the systems you are responsible for.

You
Have a keen interest in system design and architecting for performance and scalability, plus experience with multiple cloud infrastructure platforms (AWS, GCP, Azure, etc.).
Think carefully about systems: edge cases, failure modes, behaviors, and specific implementations.
Know and prefer configuration management systems and toolchains (Chef, Ansible, Terraform, GitHub Actions, etc.).
Have solid programming skills: Python, Go, etc.
Have an urge to collaborate and communicate asynchronously, combined with a desire to record and document issues and solutions.
Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it.
Have an urge for delivering quickly and effectively, and iterating fast.

Nice to Have
Experience and interest in ML/AI workloads and compute
Practical experience implementing and managing paging, alerting, and on-call scheduling flows
A positive attitude, combined with a desire to learn and collaborate

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
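The posting's goal of "automating response to all non-exceptional events" can be sketched as a small router: known alert types map to runbook actions, and only unknown ones page a human. The alert names and actions below are hypothetical, not Lambda's actual runbooks.

```python
# Minimal sketch of automated event response: handle the known, page on
# the unknown. RUNBOOKS would grow as post-incident reviews add entries.

RUNBOOKS = {
    "disk_full": "expand_volume",
    "cert_expiring": "rotate_certificate",
    "service_down": "restart_service",
}

def route_alert(alert_type):
    """Return (action, paged) for an incoming alert."""
    action = RUNBOOKS.get(alert_type)
    if action is not None:
        return action, False           # non-exceptional: handled automatically
    return "escalate_to_oncall", True  # exceptional: wake a human
```

The design choice is the asymmetry: automation is allowed to act only on events it has an explicit runbook for, which keeps the failure mode of bad automation bounded.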
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Hidden link
Speechify.jpg

Senior Product Designer

Speechify
USD
140000
-
200000
US.svg
United States
Full-time
Remote
true
PLEASE APPLY THROUGH THIS LINK: https://job-boards.greenhouse.io/speechify/jobs/5287658004  DO NOT APPLY BELOW

The mission of Speechify is to make sure that reading is never a barrier to learning. Over 50 million people use Speechify's text-to-speech products to turn whatever they're reading – PDFs, books, Google Docs, news articles, websites – into audio, so they can read faster, read more, and remember more. Speechify's text-to-speech reading products include its iOS app, Android App, Mac App, Chrome Extension, and Web App. Google recently named Speechify the Chrome Extension of the Year and Apple named Speechify its App of the Day.

Today, nearly 200 people around the globe work on Speechify in a 100% distributed setting – Speechify has no office. These include frontend and backend engineers, AI research scientists, and others from Amazon, Microsoft, and Google, leading PhD programs like Stanford, high-growth startups like Stripe, Vercel, and Bolt, and many founders of their own companies.

This is a key role, ideal for someone who thinks strategically, enjoys fast-paced environments, is passionate about making product decisions, and has experience building great user experiences that delight users. We are a flat organization that allows anyone to become a leader by showing excellent technical skills and delivering results consistently and fast. Work ethic, solid communication skills, and an obsession with winning are paramount. Our interview process involves several technical interviews, and we aim to complete them within one week.

What You'll Do
Work alongside machine learning researchers, engineers, and product managers to bring our AI Voices to customers for a diverse range of use cases
Deploy and operate the core ML inference workloads for our AI Voices serving pipeline
Introduce new techniques, tools, and architecture that improve the performance, latency, throughput, and efficiency of our deployed models
Build tools to give us visibility into our bottlenecks and sources of instability, then design and implement solutions to address the highest-priority issues

An Ideal Candidate Should Have
Experience shipping Python-based services
Experience being responsible for the successful operation of a critical production service
Experience with public cloud environments, GCP preferred
Experience with Infrastructure as Code, Docker, and containerized deployments
Preferred: Experience deploying high-availability applications on Kubernetes
Preferred: Experience deploying ML models to production

What We Offer
A dynamic environment where your contributions shape the company and its products
A team that values innovation, intuition, and drive
Autonomy, fostering focus and creativity
The opportunity to have a significant impact in a revolutionary industry
Competitive compensation, a welcoming atmosphere, and a commitment to an exceptional asynchronous work culture
The privilege of working on a product that changes lives, particularly for those with learning differences like dyslexia, ADD, and more
An active role at the intersection of artificial intelligence and audio – a rapidly evolving tech domain

Salary
The United States base salary range for this full-time position is $140,000-$200,000 + bonus + equity, depending on experience.

Think you're a good fit for this job? Tell us more about yourself and why you're interested in the role when you apply. And don't forget to include links to your portfolio and LinkedIn. Not looking but know someone who would make a great fit? Refer them!

Speechify is committed to a diverse and inclusive workplace. Speechify does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status.
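The "visibility into bottlenecks" work this posting describes typically starts with latency percentiles over request timings. A stdlib-only nearest-rank sketch, with illustrative values (not Speechify's metrics stack):

```python
# Nearest-rank percentile over a list of latency samples (milliseconds).
# Good enough for a quick p50/p90/p99 readout from service logs.

def percentile(samples, pct):
    """Return the nearest-rank `pct`th percentile of `samples`."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: ceil(pct/100 * n) - 1, clamped to valid indices.
    rank = max(0, min(len(ordered) - 1, -(-pct * len(ordered) // 100) - 1))
    return ordered[rank]
```

In production this computation usually lives in the metrics backend (e.g. histogram quantiles), but the definition is the same: tail percentiles, not averages, are what surface instability.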
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Hidden link
Hippocratic AI.jpg

Head of Security Operations

Hippocratic AI
-
US.svg
United States
Full-time
Remote
false
About Us
Hippocratic AI has developed a safety-focused Large Language Model (LLM) for healthcare. The company believes that a safe LLM can dramatically improve healthcare accessibility and health outcomes in the world by bringing deep healthcare expertise to every human. No other technology has the potential to have this level of global impact on health.

Why Join Our Team
Innovative Mission: We are developing a safe, healthcare-focused large language model (LLM) designed to revolutionize health outcomes on a global scale.
Visionary Leadership: Hippocratic AI was co-founded by CEO Munjal Shah, alongside a group of physicians, hospital administrators, healthcare professionals, and artificial intelligence researchers from leading institutions, including El Camino Health, Johns Hopkins, Stanford, Microsoft, Google, and NVIDIA.
Strategic Investors: We have raised a total of $278 million in funding, backed by top investors such as Andreessen Horowitz, General Catalyst, Kleiner Perkins, NVIDIA's NVentures, Premji Invest, SV Angel, and six health systems.
World-Class Team: Our team is composed of leading experts in healthcare and artificial intelligence, ensuring our technology is safe, effective, and capable of delivering meaningful improvements to healthcare delivery and outcomes.
For more information, visit www.HippocraticAI.com.

We value in-person teamwork and believe the best ideas happen together. Our team is expected to be in the office five days a week in Palo Alto, CA, unless explicitly noted otherwise in the job description.

About the Role
As Head of Security Operations at Hippocratic AI, you will lead the operational security architecture across infrastructure, product, data, and clinical-use contexts. You will ensure readiness for incidents, continuous monitoring, and threat detection and response, and embed operational security into our healthcare-AI lifecycle. You will be responsible for defining strategy; managing teams, tools, and processes; and aligning with the regulatory, privacy, and governance demands unique to healthcare AI. This position reports to the CISO.

What You'll Do:
Develop and own the security operations strategy: define missions, objectives, KPIs, service levels, and a roadmap for detection, response, monitoring, and operations.
Build, lead, and scale the security operations team: SOC/SecOps analysts, threat hunters, response engineers; define roles, hiring, training, and leadership.
Oversee real-time security monitoring, detection, triage, investigation, and containment of incidents across cloud, infrastructure, product, clinical data pipelines, and end-user interfaces.
Perform tabletop and DR/BR scenarios.
Define incident response playbooks, runbooks, escalation paths, crisis communication, post-mortem mechanics, and lessons-learned cycles specific to regulated health-AI contexts.
Manage security tooling and architecture for operations: SIEM, SOAR, threat intel platforms, cloud-native logging/alerting, automation of response.
Embed security operations practices into product and engineering life cycles: collaborate with product security, DevOps, data science, and clinical operations to integrate detection/response capabilities.
Work with GRC to establish vendor/third-party risk monitoring for security operations: ensure that outsourced services, clinical-data vendors, and cloud providers meet operational security expectations.
Maintain readiness for audits, compliance, and regulatory demands (HIPAA-adjacent, healthcare data, AI governance) as operations scale.
Liaise with other functional leads (GRC, privacy, product, legal) to ensure alignment of security operations with governance and compliance frameworks.

What You Bring
You have a proven track record (10+ years) leading or being heavily involved in security operations in a technology or SaaS environment, ideally with regulated data (healthcare, life sciences, or similarly regulated).
You are comfortable operating in ambiguity and high-stakes contexts, making decisions under pressure and prioritizing response.
You have experience in incident response and understand the communication chain and evidence collection process.
You understand multiple clouds (AWS, GCP, etc.), containers, and data-platform threat surfaces, and can translate technical risk into business-impact language.
You can build and run metrics-driven security operations, define processes and workflows, and move from reactive to proactive/resilient models.
You can communicate effectively with senior leadership and cross-functional stakeholders.
You hold yourself accountable for operational excellence and continuous improvement of security posture.

Must-Have:
Bachelor's degree (or equivalent experience) in computer science, cybersecurity, engineering, or similar.
10+ years in security operations, incident response, or security engineering roles; 3+ years in a leadership role.
Deep experience with security monitoring/detection tools (SIEM, SOAR, EDR/XDR), cloud security operations (AWS, GCP, Azure), threat hunting, and incident response.
Proven success in establishing or scaling SOC/SecOps functions.
Strong understanding of security operations metrics, incident lifecycle, root-cause analysis, and remediation.
Familiarity with regulatory/compliance environments tied to healthcare or data-sensitive industries.

Nice-to-Haves:
Certifications such as CISSP, CISM, GIAC (GCIA, GCIH), or equivalent.
Experience specifically in SaaS, healthcare, or clinical data security operations.
Experience in AI/ML-centric organizations or securing AI/ML pipelines.
Experience building remote/distributed security teams.
Prior experience with compliance frameworks is a plus (HIPAA, HITRUST, ISO 27001, SOC 2).

***Be aware of recruitment scams impersonating Hippocratic AI. All recruiting communication will come from @hippocraticai.com email addresses. We will never request payment or sensitive personal information during the hiring process. If anything
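The triage and prioritization work a SOC lead owns can be sketched as a scoring rule: alert severity weighted by asset criticality, with PHI-bearing systems bumped to the top. The weights and field names below are hypothetical illustrations, not Hippocratic AI's playbooks.

```python
# Illustrative SOC queue prioritization: severity times asset criticality,
# doubled when an alert touches systems holding protected health information.

SEVERITY = {"low": 1, "medium": 3, "high": 7, "critical": 10}

def triage_score(alert):
    score = SEVERITY[alert["severity"]] * alert.get("asset_criticality", 1)
    if alert.get("touches_phi"):
        score *= 2  # healthcare data raises the stakes of any incident
    return score

def prioritize(alerts):
    """Return alerts ordered highest-priority first."""
    return sorted(alerts, key=triage_score, reverse=True)
```

Real SOCs tune such rules continuously from post-incident reviews; the point of making the rule explicit is that it can be audited and improved.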
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
Hippocratic AI.jpg

Head of Data Operations

Hippocratic AI
0
0
-
0
US.svg
United States
Full-time
Remote
false
About Us
Hippocratic AI has developed a safety-focused Large Language Model (LLM) for healthcare. The company believes that a safe LLM can dramatically improve healthcare accessibility and health outcomes in the world by bringing deep healthcare expertise to every human. No other technology has the potential to have this level of global impact on health.

Why Join Our Team
Innovative Mission: We are developing a safe, healthcare-focused large language model (LLM) designed to revolutionize health outcomes on a global scale.
Visionary Leadership: Hippocratic AI was co-founded by CEO Munjal Shah, alongside a group of physicians, hospital administrators, healthcare professionals, and artificial intelligence researchers from leading institutions, including El Camino Health, Johns Hopkins, Stanford, Microsoft, Google, and NVIDIA.
Strategic Investors: We have raised a total of $278 million in funding, backed by top investors such as Andreessen Horowitz, General Catalyst, Kleiner Perkins, NVIDIA's NVentures, Premji Invest, SV Angel, and six health systems.
World-Class Team: Our team is composed of leading experts in healthcare and artificial intelligence, ensuring our technology is safe, effective, and capable of delivering meaningful improvements to healthcare delivery and outcomes.
For more information, visit www.HippocraticAI.com.

We value in-person teamwork and believe the best ideas happen together. Our team is expected to be in the office five days a week in Palo Alto, CA, unless explicitly noted otherwise in the job description.

About the Role:
We are seeking a Head of Data Operations to lead the teams and partners responsible for data generation, annotation, evaluation, and RLHF (reinforcement learning from human feedback) across Hippocratic AI's healthcare agent products. This leader will own the entire data operations lifecycle — from sourcing and labeling to model evaluation and feedback — ensuring precision, scalability, and alignment with our safety and ethics principles. You will manage internal teams, global contractors, and strategic vendors to deliver high-quality data pipelines that enable continual learning and improvement of our agentic systems.

What You'll Do
Team & Vendor Leadership
Build, lead, and scale a global data operations organization including full-time employees, contractors, and vendor partners.
Define clear roles, quality standards, and performance metrics across all data functions (evaluation, labeling, RLHF, and generation).
Partner with Legal, Compliance, and Security to ensure all global data work adheres to HIPAA and data privacy standards.
Data Program Management
Oversee the design and execution of evaluation frameworks for LLMs and agentic behaviors — both automated and human-in-the-loop.
Lead data labeling, synthesis, and annotation operations, ensuring medical accuracy, consistency, and context-rich quality.
Manage large-scale RLHF pipelines — aligning training data with clinical and ethical objectives.
Optimize throughput, cost, and quality across in-house teams and external vendors.
Process, Tooling, and Quality
Partner with engineering and product to design and improve data operations infrastructure, including labeling tools, quality assurance systems, and task routing platforms.
Implement robust QA processes and auditing frameworks to ensure data integrity and reliability.
Drive continuous improvement in efficiency, consistency, and evaluator experience.
Cross-Functional Collaboration
Work closely with Research, Model, and Product teams to define data needs and feedback loops.
Collaborate with Clinical and Safety leaders to align annotation and evaluation standards with clinical guidelines.
Provide strategic input into data strategy, metrics, and operational planning.

What You Bring
Must Have:
10+ years of experience in data operations, annotation, or model evaluation, with 5+ years in management or leadership roles.
Proven success scaling data or RLHF operations across geographies and vendors.
Strong program management and process optimization skills; experience managing distributed teams.
Familiarity with LLM training and evaluation, RLHF, or human-in-the-loop systems.
Deep respect for data ethics, privacy, and quality — ideally within healthcare, life sciences, or another regulated industry.
Excellent communication and collaboration skills; able to navigate between technical, clinical, and operational stakeholders.
Nice-to-Have:
Experience in medical or healthcare data annotation or clinical workflow modeling.
Prior work building custom data pipelines or labeling platforms.
Understanding of LLM fine-tuning, preference modeling, and evaluation metrics.
Global vendor management experience with large-scale workforce operations.

***Be aware of recruitment scams impersonating Hippocratic AI. All recruiting communication will come from @hippocraticai.com email addresses. We will never request payment or sensitive personal information during the hiring process. If anything
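One concrete metric behind the "robust QA processes and auditing frameworks" this role owns is inter-annotator agreement; Cohen's kappa is the standard two-annotator form, measuring agreement beyond what chance would produce. A stdlib sketch (the label names are illustrative):

```python
# Cohen's kappa for two annotators labeling the same items.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators independently pick
    # the same label, summed over labels.
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single label throughout
    return (observed - expected) / (1 - expected)
```

Operationally, kappa below some threshold on a label batch is a signal to retrain annotators or tighten the guideline, rather than to ship the data.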
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Apply
Hidden link
Lambda.jpg

Director, Data Center Operations - North America

Lambda AI
USD
0
220000
-
330000
US.svg
United States
Full-time
Remote
true
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda's mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

Lambda, Inc. is seeking a highly skilled and experienced Director of Data Center Operations to lead and support Lambda Data Center Operations in North America.

What You'll Do:
As Director of Data Center Operations for North America, you will lead and support large-scale AI and high-performance computing (HPC) infrastructure in all of Lambda's North America data centers. This individual will lead and oversee all aspects of data center operations — including reliability, hardware break/fix, capacity planning, provider interface, team mentorship, and new data center setup — ensuring world-class uptime, customer response, and scalability to meet rapidly growing AI infrastructure demands.

Key Responsibilities:
Strategic Leadership
Develop and execute the North American data center operations strategy aligned with AI infrastructure goals and organizational growth.
Drive continuous improvement across facility operations, emphasizing sustainability, efficiency, and resilience.
Partner with Engineering, Capacity Planning, and Infrastructure teams to forecast and support future AI and GPU-based compute requirements, and provide operational feedback on designs and system improvements.
Oversee expansion projects, retrofits, and site selection in collaboration with Data Center Infrastructure Engineering and HPC Architecture teams.
Operational Excellence
Lead a multi-site operations team ensuring 24/7/365 reliability, availability, and SLA response across all facilities.
Establish standardized procedures, metrics, and best practices for preventive maintenance, incident management, and service delivery.
Monitor operational KPIs including uptime, PUE, safety, and compliance with corporate and regulatory standards.
Implement automation and AI-driven monitoring solutions to optimize system performance and predictive maintenance.
Coordinate and communicate data center provider maintenances with customers and impacted teams.
Team Leadership and Development
Build, mentor, and scale a high-performing team of operations managers, technicians, and engineers across multiple regions.
Routinely visit all sites to maintain standards, develop relationships, and identify areas of efficiency.
Foster a culture of safety, accountability, and continuous learning, driving data center operations to take on more responsibility and work up the stack.
Assist in the build-out of new data center whitespace and deployment of AI infrastructure.
Financial and Vendor Management
Develop and manage operating budgets, capital expenditures, and cost-optimization initiatives.
Oversee strategic vendor partnerships with numerous data center providers for power, cooling, maintenance, and critical infrastructure components.
Risk and Compliance
Ensure compliance with environmental, safety, and industry regulations (e.g., NFPA, OSHA, ISO standards).
Lead incident response and root-cause analysis to drive preventive improvements for incidents related to data center operations or infrastructure.
Act as primary point of contact for audits related to data center operations for compliance frameworks such as SOC 2, ISO, etc.

Qualifications:
10+ years of experience in data center operations, with at least 7 years in a leadership role managing multi-site or hyperscale facilities.
Proven experience supporting AI, HPC, or cloud infrastructure at scale.
Deep understanding of power and cooling systems, networking, capacity planning, and facility automation tools (DCIM, BMS, etc.).
Strong track record of improving operational efficiency and managing relationships with data center providers.
Bachelor's degree in Engineering, Computer Science, or a related field preferred; Master's degree a plus.
Exceptional communication, cross-functional collaboration, and stakeholder management skills. Ability to build relationships, consensus, and a positive team culture.
Willingness to travel (up to 50%) to data center sites across North America, including sites under construction.

Preferred Skills:
Experience with GPU clusters, AI infrastructure networking, and large-scale storage systems.
Familiarity with cloud-scale operational practices (e.g., AWS, Google, Microsoft data center standards).
Certifications such as CDCDP, CDCP, PMP, or PE are a plus.

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
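Two of the KPIs this role monitors, PUE and uptime, reduce to simple ratios. A sketch with illustrative figures (not Lambda's numbers): PUE is total facility power over IT load, and availability is the fraction of a period a facility was serving.

```python
# Power Usage Effectiveness: total facility power / IT equipment power.
# 1.0 is the theoretical ideal (all power reaches IT equipment).

def pue(total_facility_kw, it_equipment_kw):
    if it_equipment_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_equipment_kw

def availability_pct(downtime_minutes, period_minutes):
    """Availability over a period, as a percentage."""
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes
```

For context on the second function: "three nines" (99.9%) allows roughly 43.8 minutes of downtime in a 30-day month, which is why restart-time and maintenance-window discipline dominate these KPIs.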
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Hidden link
Lambda.jpg

AI Infrastructure Deployment Lead

Lambda AI
USD
0
128000
-
149000
US.svg
United States
Full-time
Remote
false
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.  As the AI Infrastructure Deployment Lead, you’ll be responsible for planning, coordinating, and executing the deployment of large-scale AI infrastructure across Lambda’s data centers and customer sites. You’ll lead cross-functional technical teams to design resilient network topologies, oversee rack-level integration, and ensure smooth delivery of compute environments optimized for large-scale training workloads.This role combines hands-on technical expertise with strategic project leadership — ideal for engineers who thrive at the intersection of hardware, networking, and systems design.What You’ll DoInfrastructure DeploymentLead end-to-end deployment of GPU clusters, storage systems, and networking fabric across Lambda’s data centers.Design and implement data center network topologies optimized for AI and HPC workloads, including high-speed Ethernet and InfiniBand environments.Oversee rack implementation, cabling, and power/cooling validation for optimal efficiency and scalability.Collaborate with supply chain, logistics, and operations teams to ensure smooth delivery and installation timelines.Network EngineeringImplement Layer 2/Layer 3 networks, including VLANs, Spine to Leaf architecture, Infiniband interconnect technology. 
Partner with network architects to ensure redundancy, scalability, and low-latency interconnects for distributed AI workloads.Monitor network health, identify bottlenecks, and implement optimizations to maintain peak performance.Hardware & Systems ManagementOversee server hardware troubleshooting, including GPUs, NICs, CPUs, and storage components.Lead root-cause analysis for system issues and drive corrective actions in collaboration with vendors and internal hardware teams.Develop standard operating procedures (SOPs) for hardware validation, deployment, and maintenance.Technical Project LeadershipServe as technical project lead for infrastructure rollouts and cluster expansion projects.Coordinate cross-functional teams — networking, facilities, cloud operations, and hardware engineering — to execute deployments on schedule.Manage project scope, budgets, risk assessments, and post-deployment reviews.Communicate status, challenges, and milestones to leadership with clarity and precision.Documentation & Continuous ImprovementMaintain detailed network topology diagrams, deployment runbooks, and hardware inventories.Identify opportunities for process automation and infrastructure standardization across deployments.Contribute to Lambda’s internal knowledge base and mentor junior engineers on data center best practices.What You’ll BringRequired:Bachelor’s degree in Computer Engineering, Information Technology, or related field.CCNA (Cisco Certified Network Associate) certification (CCNP or equivalent a plus).PMP (project Management Professional) Certification (PMP or equivalent a plus).5+ years of experience in data center infrastructure deployment or network operations, preferably in AI, HPC, or cloud environments.Proven ability to lead complex technical projects and manage multidisciplinary teams.Strong understanding of data center network design (Layer 2/3, VLAN, Rack elevations, port mapping, Infiniband technologies. 
- Hands-on expertise in server hardware troubleshooting and rack-level integration.

Preferred:
- Experience deploying or managing GPU clusters and distributed training environments.
- Familiarity with automation and orchestration tools (Ansible, Terraform) and monitoring systems (Prometheus, Grafana).
- Knowledge of structured cabling, power distribution, and environmental monitoring in data centers.
- Excellent communication and documentation skills.

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast
- We offer generous cash & equity compensation
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401k plan with 2% company match (USA employees)
- Flexible paid time off plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer.
Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics