Top MLOps / DevOps Engineer Jobs Openings in 2025

Looking for opportunities as an MLOps / DevOps Engineer? This curated list features the latest MLOps / DevOps Engineer job openings from AI-native companies. Whether you're an experienced professional or just entering the field, find roles that match your expertise, from startups to global tech leaders. Updated every day.

Fathom.ai

Infrastructure Engineer

Fathom
USD 180,000 - 240,000
United States
Full-time
Remote: Yes
ABOUT FATHOM
We think it's broken that so many people and businesses rely on notes to remember and share insights from their meetings. We created Fathom to eliminate the needless overhead of meetings. Our AI assistant captures, summarizes, and organizes the key moments of your calls, so you and your team can stay fully present without sacrificing context or clarity. From instant, searchable call summaries to seamless CRM updates and team-wide sharing, Fathom transforms meetings from a source of friction into a place for alignment and momentum.

We started Fathom to rid us all of the tyranny of note-taking, and people seem to really love what we've built so far:
🥇 #1 Highest Satisfaction Product of 2024 on G2
🔥 #1 Rated on G2 with 4,500+ reviews and a perfect 5/5 rating
🥇 #1 Product of the Day and #2 AI Product of the Year
🚀 Most installed AI meeting assistant on both the Zoom and HubSpot marketplaces
📈 We're hitting usage and revenue records every week

We're growing incredibly quickly, so we're looking to grow our small but mighty team.

Role Overview:
We are looking for an SRE who is passionate about leveraging data and automation to drive a highly dynamic infrastructure. The role is a unique blend of infrastructure and internal tooling to reduce friction at every step of delivering an amazing customer experience. As part of our team, you'll play a pivotal role in scaling our infrastructure, reducing toil through automation, and contributing to our culture of innovation and continuous improvement.

What you'll do:

By 30 Days:
- Use your observability background to help scale our existing tools to new heights as we continue to grow the platform.
- Enhance our existing automation for scaling our infrastructure and improve the development experience.

By 90 Days:
- Play a key role in continuing to diversify and scale our platform across additional regions.
- Evaluate options to replace our existing real-time data pipeline for enhanced multi-regional capabilities.
- Provide platform support to all of engineering, using data-driven decision-making.

By 1 Year:
- Work with engineering to re-evaluate what observability means for the Fathom platform, and drive improvements to remove friction.
- Help us design and implement improvements to our elastic multi-regional storage platform.
- Drive platform improvements to enhance reliability and efficiency.

Requirements:

Hard Skills:
- Proficiency with and preference for Infrastructure as Code / GitOps tooling.
- Foundation in observability best practices and implementation.
- Experience in a SaaS or PaaS environment.
- Experience with Google Cloud Platform (GCP) and Google Kubernetes Engine (GKE), including proficiency with GCP/GKE networking.
- Familiarity with our tech stack: message queues, Prometheus, ClickHouse, ArgoCD, GitHub Actions, Golang. (Ruby / Rails is a bonus)

Soft Skills:
- Curiosity-driven with a focus on delivering results.
- A generalist mindset with the ability to dive deep into a wide range of challenges.
- Resilience and an ability to grind through complex problems.
- Openness to disagreement and commitment to decisions once made.
- Strong collaborative skills, with the ability to explain complex insights in an accessible manner.
- Independence in managing one's workload and priorities.

What You'll Get:
- The opportunity to shape the dynamic platform of a growing company.
- A role that balances scaling infrastructure, enabling development teams, and internal tooling development.
- A chance to work with a dynamic and collaborative team.
- Competitive compensation and benefits.
- A supportive environment that encourages innovation and personal growth.

Join Us:
If you're excited to own the data journey at Fathom and contribute to our mission with your analytical expertise, we would love to hear from you. Apply now to become a key player in our data-driven success story.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply

Data Center Operations Systems Engineer - Atlanta

Lambda AI
USD 89,000 - 134,000
United States
Full-time
Remote: No
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda's mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our Atlanta, GA Data Center 5 days per week.

The Operations team plays a critical role in ensuring the seamless end-to-end execution of our AI-IaaS infrastructure and hardware. This team is responsible for sourcing all necessary infrastructure and components, overseeing day-to-day data center operations to maintain optimal performance and uptime, and driving cross-company coordination through the product management organization to align operational capabilities with strategic goals. By managing the full lifecycle from procurement to deployment and operational efficiency, the Operations team ensures that our AI-driven infrastructure is reliable, scalable, and aligned with business priorities.

What You'll Do
- Ensure new server, storage, and network infrastructure is properly racked, labeled, cabled, and configured.
- Document data center layout and network topology in DCIM software.
- Work with supply chain & manufacturing teams to ensure timely deployment of systems and project plans for large-scale deployments.
- Participate in data center capacity and roadmap planning with sales and customer success teams to allocate floorspace.
- Assess current and future state data center requirements based on growth plans and technology trends.
- Manage a parts depot inventory and track equipment through the delivery-store-stage-deploy-handoff process in each of our data centers.
- Work closely with the HW Support team to ensure data center infrastructure-related support tickets are resolved.
- Work with the RMA team to ensure faulty parts are returned and replacements are ordered.
- Create installation standards and documentation for placement, labeling, and cabling to drive consistency and discoverability across all data centers.
- Serve as a subject-matter expert on data center deployments as part of sales engagement for large-scale deployments in our data centers and at customer sites.

You
- Have experience with critical infrastructure systems supporting data centers, such as power distribution, air flow management, environmental monitoring, capacity planning, DCIM software, structured cabling, and cable management.
- Have strong Linux administration experience.
- Have experience setting up networking appliances (Ethernet and InfiniBand) across multiple data center locations.
- Are action-oriented and have a strong willingness to learn.
- Are willing to travel for bring-up of new data center locations.

Nice to Have
- Experience troubleshooting the following network layers, technologies, and system protocols: TCP/IP, UDP/IP, BGP, OSPF, SNMP, SSL, HTTP, FTP, SSH, Syslog, DHCP, DNS, RDP, NETBIOS, IP routing, Ethernet, switched Ethernet, 802.11x, NFS, and VLANs.
- Experience working in large-scale distributed data center environments.
- Experience working with auditors to meet all compliance requirements (ISO/SOC).

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast.
- We offer generous cash & equity compensation.
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability.
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG.
- Health, dental, and vision coverage for you and your dependents.
- Wellness and commuter stipends for select roles.
- 401k plan with 2% company match (USA employees).
- Flexible Paid Time Off plan that we all actually use.

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Apply

Data Center Operations Engineer - Virginia

Lambda AI
USD 89,000 - 134,000
United States
Full-time
Remote: No
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda's mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our Ashburn and Sterling, VA Data Centers 5 days per week.

The Operations team plays a critical role in ensuring the seamless end-to-end execution of our AI-IaaS infrastructure and hardware. This team is responsible for sourcing all necessary infrastructure and components, overseeing day-to-day data center operations to maintain optimal performance and uptime, and driving cross-company coordination through the product management organization to align operational capabilities with strategic goals. By managing the full lifecycle from procurement to deployment and operational efficiency, the Operations team ensures that our AI-driven infrastructure is reliable, scalable, and aligned with business priorities.

What You'll Do
- Ensure new server, storage, and network infrastructure is properly racked, labeled, cabled, and configured.
- Document data center layout and network topology in DCIM software.
- Work with supply chain & manufacturing teams to ensure timely deployment of systems and project plans for large-scale deployments.
- Participate in data center capacity and roadmap planning with sales and customer success teams to allocate floorspace.
- Assess current and future state data center requirements based on growth plans and technology trends.
- Manage a parts depot inventory and track equipment through the delivery-store-stage-deploy-handoff process in each of our data centers.
- Work closely with the HW Support team to ensure data center infrastructure-related support tickets are resolved.
- Work with the RMA team to ensure faulty parts are returned and replacements are ordered.
- Create installation standards and documentation for placement, labeling, and cabling to drive consistency and discoverability across all data centers.
- Serve as a subject-matter expert on data center deployments as part of sales engagement for large-scale deployments in our data centers and at customer sites.

You
- Have experience with critical infrastructure systems supporting data centers, such as power distribution, air flow management, environmental monitoring, capacity planning, DCIM software, structured cabling, and cable management.
- Have strong Linux administration experience.
- Have experience setting up networking appliances (Ethernet and InfiniBand) across multiple data center locations.
- Are action-oriented and have a strong willingness to learn.
- Are willing to travel to bring up new data center locations.

Nice to Have
- Experience troubleshooting the following network layers, technologies, and system protocols: TCP/IP, UDP/IP, BGP, OSPF, SNMP, SSL, HTTP, FTP, SSH, Syslog, DHCP, DNS, RDP, NETBIOS, IP routing, Ethernet, switched Ethernet, 802.11x, NFS, and VLANs.
- Experience working in large-scale distributed data center environments.
- Experience working with auditors to meet all compliance requirements (ISO/SOC).

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast.
- We offer generous cash & equity compensation.
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability.
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG.
- Health, dental, and vision coverage for you and your dependents.
- Wellness and commuter stipends for select roles.
- 401k plan with 2% company match (USA employees).
- Flexible Paid Time Off plan that we all actually use.

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Apply

Senior Site Reliability Engineer - Networking

Lambda AI
USD 250,000 - 417,000
United States
Full-time
Remote: No
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda's mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco/San Jose/Seattle office location 4 days per week; Lambda's designated work-from-home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems, as well as internal tooling for system deployment, management, and maintenance.

What You'll Do
- Help scale Lambda's high-performance multi-tenant cloud network.
- Contribute to the reproducible automation of network configuration and deployments.
- Contribute to the implementation and operations of Software Defined Networks.
- Help deploy and manage spine-and-leaf networks.
- Ensure high availability of our network through observability, failover, and redundancy.
- Ensure clients have predictable networking performance through the use of network engineering and other applicable technologies.
- Help deploy and maintain network monitoring and management tools.
- Participate in on-call.

You
- Have 5+ years of experience as a SWE, SRE, or Network Reliability Engineer.
- Have been part of the implementation of production-scale networking projects.
- Have experience with on-call and incident response management.
- Have experience building and maintaining Software Defined Networks (SDN); experience with OpenStack, Neutron, OVN.
- Are comfortable on the Linux command line, and have an understanding of the Linux networking stack.
- Have experience with multi-data-center networks and hybrid cloud networks.
- Have Python programming experience and experience with configuration management tools like Ansible.
- Have experience with Git and CI/CD tools for deployment, and have operated a network environment with GitOps practices in place.
- Have experience with application lifecycle and deployments on Kubernetes.

Nice To Have
- Operated production-scale SDNs in a cloud context (e.g. helped implement or operate the infrastructure that powers an AWS VPC-like feature).
- Software development experience with C, Go, Python.
- Experience automating network configuration within public clouds, with tools like Kubernetes, Helm, Terraform, Ansible.
- Deep understanding of the Linux networking stack and its interaction with network virtualization, SR-IOV, and DPDK.
- Understanding of the SDN ecosystem (e.g. OVS, Neutron, VMware NSX, Cisco ACI or Nexus Fabric Controller, Arista CVP).
- Experience with spine-and-leaf (Clos) network topology.
- Experience and understanding of BGP EVPN VXLAN networks.
- Experience building and maintaining multi-data-center networks, SD-WAN, DWDM.
- Experience with Next-Generation Firewalls (NGFW).

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast.
- We offer generous cash & equity compensation.
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability.
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG.
- Health, dental, and vision coverage for you and your dependents.
- Wellness and commuter stipends for select roles.
- 401k plan with 2% company match (USA employees).
- Flexible Paid Time Off plan that we all actually use.

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Apply

Head of AI Platform

Abridge
USD 270,000 - 340,000
United States
Full-time
Remote: No
About Abridge
Abridge was founded in 2018 with the mission of powering deeper understanding in healthcare. Our AI-powered platform was purpose-built for medical conversations, improving clinical documentation efficiencies while enabling clinicians to focus on what matters most: their patients.

Our enterprise-grade technology transforms patient-clinician conversations into structured clinical notes in real time, with deep EMR integrations. Powered by Linked Evidence and our purpose-built, auditable AI, we are the only company that maps AI-generated summaries to ground truth, helping providers quickly trust and verify the output. As pioneers in generative AI for healthcare, we are setting the industry standards for the responsible deployment of AI across health systems.

We are a growing team of practicing MDs, AI scientists, PhDs, creatives, technologists, and engineers working together to empower people and make care make more sense. We have offices located in the Mission District in San Francisco, the SoHo neighborhood of New York, and East Liberty in Pittsburgh.

The Role
Our generative AI-powered products bring joy back to the practice of medicine. As our offerings expand, we're looking for a Head of AI Platform to scale the infrastructure that powers them. This is a critical, high-leverage role requiring people leadership, technical strategy, and ownership of key business outcomes. You will own the entire lifecycle of our AI Platform, ensuring its reliability, efficiency, scalability, and compliance. You'll own a key pillar of our technical organization, driving the technical direction and shaping how our models are trained, served, and managed in production.

What You'll Do
- People Management: Recruit, retain, and mentor engineers and engineering managers. Provide regular feedback, create opportunities for career growth, and foster a culture of collaboration and excellence.
- Technical & Organizational Leadership: Act as the people and technical leader for the AI Platform team. This includes owning the staffing and execution of the team, and driving work on model serving, training compute, the agent serving platform, the LLM gateway, and associated orchestration layers. You will guide architectural discussions and set top-level strategic direction for the company's AI/ML infrastructure.
- Project Management: Work closely with stakeholders, including product managers, engineering managers, and AI/ML teams, to plan, execute, and support multiple projects simultaneously. You will be responsible for the engineering process in the team and the output of the platform.
- Platform Ownership: Own the design, build, and operation of the core AI platform components, including model serving and deployment infrastructure; compute and vendor management (e.g. GPU allocation); MLOps pipelines and tooling; health, quality, and performance monitoring; training compute infrastructure; and the LLM gateway and orchestration layers for agent serving.
- Champion Quality: Set a high standard for your team, including software quality, communication, collaboration, and compliance with industry and regulatory standards.

What You'll Bring
- A strong technologist, with 10+ years of experience building high-performance distributed systems and 3+ years managing AI/ML-focused engineering teams.
- Comfortable giving constructive feedback on technical designs and code reviews.
- Skilled in building secure, compliant systems on major cloud platforms (GCP preferred, but other experience welcome).
- Skilled at hiring and mentorship, with a track record of helping engineers grow their skills and careers.
- Expertise with Kubernetes, containers, model training and serving, GPU-based capacity planning, and building applications on top of LLMs.
- Knowledgeable about the software development lifecycle. You view processes such as Kanban and Scrum as tools in a toolbox, and you know which tools to apply in which situations.
- Up to date on industry best practices and tools, and enjoy learning new things.
- Excited about being hands-on in a fast-moving, productive, and supportive environment.
- Willing to pitch in wherever needed.
- Have thrived in a fast-growing startup and know how to operate in that environment.

Bonus Points If You…
- Have owned an Evaluation Platform for AI/ML models.
- Have owned data engineering or core infrastructure.

Why Work at Abridge?
At Abridge, we're transforming healthcare delivery experiences with generative AI, enabling clinicians and patients to connect in deeper, more meaningful ways. Our mission is clear: to power deeper understanding in healthcare. We're driving real, lasting change, with millions of medical conversations processed each month.

Joining Abridge means stepping into a fast-paced, high-growth startup where your contributions truly make a difference. Our culture requires extreme ownership: every employee has the ability to (and is expected to) make an impact on our customers and our business.

Beyond individual impact, you will have the opportunity to work alongside a team of curious, high-achieving people in a supportive environment where success is shared, growth is constant, and feedback fuels progress. At Abridge, it's not just what we do, it's how we do it. Every decision is rooted in empathy, always prioritizing the needs of clinicians and patients.

We're committed to supporting your growth, both professionally and personally. Whether it's flexible work hours, an inclusive culture, or ongoing learning opportunities, we are here to help you thrive and do the best work of your life.

If you are ready to make a meaningful impact alongside passionate people who care deeply about what they do, Abridge is the place for you.

How we take care of Abridgers:
- Generous Time Off: 13 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees.
- Comprehensive Health Plans: Medical, Dental, and Vision plans for all full-time employees. Abridge covers 100% of the premium for you and 75% for dependents. If you choose an HSA-eligible plan, Abridge also makes monthly contributions to your HSA.
- Paid Parental Leave: 16 weeks paid parental leave for all full-time employees.
- 401k and Matching: Contribution matching to help invest in your future.
- Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits.
- Learning and Development Budget: Yearly contributions for coaching, courses, workshops, conferences, and more.
- Sabbatical Leave: 30 days of paid Sabbatical Leave after 5 years of employment.
- Compensation and Equity: Competitive compensation and equity grants for full-time employees.
... and much more!

Equal Opportunity Employer
Abridge is an equal opportunity employer and considers all qualified applicants equally without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, or disability.

Staying safe: protect yourself from recruitment fraud
We are aware of individuals and entities fraudulently representing themselves as Abridge recruiters and/or hiring managers. Abridge will never ask for financial information or payment, or for personal information such as a bank account number or social security number, during the job application or interview process. Any emails from the Abridge recruiting team will come from an @abridge.com email address. You can learn more about how to protect yourself from these types of fraud by referring to this article. Please exercise caution and cease communications if something feels suspicious about your interactions.
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Apply

Platform Engineer - Infrastructure

Lovable
-
Sweden
Full-time
Remote: No
TL;DR
We're looking for an exceptional platform engineer to design and scale the infrastructure behind the future of AI software engineering. You will build and operate the lower part of our stack, making sure that it meets the massive scaling demands of Lovable.

Why Lovable?
Lovable lets anyone and everyone build software with plain English. From solopreneurs to Fortune 100 teams, millions of people use Lovable to transform raw ideas into real products - fast. We are at the forefront of a foundational shift in software creation, which means you have an unprecedented opportunity to change the way the digital world works. Over 2 million people in 200+ countries already use Lovable to launch businesses, automate work, and bring their ideas to life. And we're just getting started.

We're a small, talent-dense team building a generation-defining company from Stockholm. We value extreme ownership, high velocity, and low-ego collaboration. We seek out people who care deeply, ship fast, and are eager to make a dent in the world.

What we're looking for
- Deep experience building production infrastructure as a Platform Engineer, Site Reliability Engineer, or similar.
- You have deep experience with service orchestration, networking, cloud infrastructure, and general modern scalable infrastructure practices from a global tech startup or scale-up.
- You have strong programming skills.
- You're comfortable navigating ambiguity and solving problems as they arise.
- You care about security, stability, and speed, and know when to make trade-offs between them.
- You're based in Stockholm or ready to relocate - this is an on-site, 5-days-a-week role.

What you'll do
In one sentence: own and scale the platform that makes AI engineering work for everyone.
- Design, build, and maintain the systems that enable our AI product, such as a runtime environment for running AI agent workloads in a secure and scalable way, and a high-throughput sandbox scheduler across multiple cloud providers.
- Harden our infrastructure against failures, downtime, and slowdowns.
- Support our growth; make sure that our infrastructure never becomes a bottleneck.
- Plan and implement our network infrastructure and cloud strategy.
- Identify and drive reliability improvement efforts across all engineering teams.

Our tech stack
We're building with tools that both humans and AI love. Lovable platform engineers are capable of working across the whole stack. Examples of tech in our stack include:
- Frontend: React and TypeScript.
- Backend: Golang and Rust.
- Cloud: Cloudflare, GCP, AWS, Modal.
- Data: ClickHouse, Firestore, Spanner, BigQuery.
- DevOps & Tooling: CI/CD pipelines, OTEL, Kubernetes, Terraform.
And we're always on the lookout for what's next!

About your application
Please submit your application in English - it's our company language, so you'll be speaking lots of it if you join. We treat all candidates equally - if you're interested, please apply through our careers portal.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply

Platform Engineer - Developer Experience

Lovable
-
Sweden
Full-time
Remote: No
TL;DR
We are seeking a Platform Engineer to enhance and scale Lovable's developer experience. In this role, you will be responsible for critical components of our stack, ensuring our engineering workflows are faster, smoother, and more Lovable.

Why Lovable?
Lovable lets anyone and everyone build software with plain English. From solopreneurs to Fortune 100 teams, millions of people use Lovable to transform raw ideas into real products - fast. We are at the forefront of a foundational shift in software creation, which means you have an unprecedented opportunity to change the way the digital world works. Over 2 million people in 200+ countries already use Lovable to launch businesses, automate work, and bring their ideas to life. And we're just getting started.

We're a small, talent-dense team building a generation-defining company from Stockholm. We value extreme ownership, high velocity, and low-ego collaboration. We seek out people who care deeply, ship fast, and are eager to make a dent in the world.

What we're looking for
- You have strong programming skills and a track record of improving developer velocity and system reliability.
- 7+ years of experience working in a platform team supporting developer experience (o11y, CI/CD, application frameworks, productivity tooling, etc.).
- You've written code and tools to support growing engineering orgs in scale-ups.
- You have experience with Docker, Kubernetes, and modern infrastructure practices.
- You're a problem-solver who thrives on challenges and ships high-leverage systems fast.
- You're comfortable navigating ambiguity and solving problems as they arise.
- You care about security, stability, and speed, and know when to make trade-offs between them.
- You're based in Stockholm or ready to relocate - this is an on-site, 5-days-a-week role.

What you'll do
In one sentence: own and scale the developer experience at Lovable, making our engineering teams the most productive in the world.
- Bring order and structure to our code base.
- Integrate or build application frameworks to support a growing engineering organization and code footprint.
- Own and develop our observability stack, from code instrumentation through ingestion to presentation.
- Integrate tools for AI-driven development.
- Identify and drive reliability improvement efforts across all engineering teams.

Our tech stack
We're building with tools that both humans and AI love. Lovable platform engineers are capable of working across the whole stack. Examples of tech in our stack include:
- Frontend: React and TypeScript.
- Backend: Golang and Rust.
- Cloud: Cloudflare, GCP, AWS, Modal.
- Data: ClickHouse, Firestore, Spanner, BigQuery.
- DevOps Tooling: CI/CD pipelines, OTEL, Grafana, Kubernetes, Terraform.
- Local Tooling: Nix, DevEnv.
And we're always on the lookout for what's next!

About your application
Please submit your application in English - it's our company language, so you'll be speaking lots of it if you join. We treat all candidates equally - if you're interested, please apply through our careers portal.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Hidden link
BJAK.jpg

MLOps Engineer

Bjak
-
MY.svg
Malaysia
Full-time
Remote
false
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that deliver real-world impact and the broadest possible usage worldwide. This role is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You'll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.

What You'll Do
Run and manage open-source models efficiently, optimizing for cost and reliability
Ensure high performance and stability across GPU, CPU, and memory resources
Monitor and troubleshoot model inference to maintain low latency and high throughput
Collaborate with engineers to implement scalable and reliable model serving solutions

What It's Like
You like ownership and independence
You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
You see feedback and failure as part of growth; you're here to level up
You possess humility, hunger, and hustle, and lift others up as you go

Requirements
Experience with model serving platforms such as vLLM or HuggingFace TGI
Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
Ability to monitor latency and costs, and to scale systems efficiently with traffic demands
Experience setting up inference endpoints for backend engineers

What You'll Get
Flat structure and real ownership
Full involvement in direction and consensus decision-making
Flexibility in work arrangement
High-impact role with visibility across product, data, and engineering
Top-of-market compensation and performance-based bonuses
Global exposure to product development
Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
Health, dental, and vision insurance
Global travel insurance (for you and your dependents)
Unlimited, flexible time off

Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems.
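The "monitor latency... maintain low latency and high throughput" requirement above typically comes down to tracking tail percentiles of per-request inference latency, since averages hide slow outliers. A minimal nearest-rank sketch (stdlib only; the function and sample data are illustrative, not a BJAK API):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    if not samples:
        raise ValueError("no samples")
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Simulated per-request inference latencies in milliseconds.
latencies = [12, 15, 11, 14, 250, 13, 16, 12, 14, 13]

p50 = percentile(latencies, 50)   # median looks healthy
p99 = percentile(latencies, 99)   # one slow request dominates the tail
print(p50, p99)  # → 13 250
```

The gap between p50 and p99 here is exactly the signal that triggers troubleshooting of a serving stack.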
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
BJAK.jpg

MLOps Engineer

Bjak
-
KR.svg
South Korea
Full-time
Remote
false
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that deliver real-world impact and the broadest possible usage worldwide. This role is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You'll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.

What You'll Do
Run and manage open-source models efficiently, optimizing for cost and reliability
Ensure high performance and stability across GPU, CPU, and memory resources
Monitor and troubleshoot model inference to maintain low latency and high throughput
Collaborate with engineers to implement scalable and reliable model serving solutions

What It's Like
You like ownership and independence
You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
You see feedback and failure as part of growth; you're here to level up
You possess humility, hunger, and hustle, and lift others up as you go

Requirements
Experience with model serving platforms such as vLLM or HuggingFace TGI
Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
Ability to monitor latency and costs, and to scale systems efficiently with traffic demands
Experience setting up inference endpoints for backend engineers

What You'll Get
Flat structure and real ownership
Full involvement in direction and consensus decision-making
Flexibility in work arrangement
High-impact role with visibility across product, data, and engineering
Top-of-market compensation and performance-based bonuses
Global exposure to product development
Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
Health, dental, and vision insurance
Global travel insurance (for you and your dependents)
Unlimited, flexible time off

Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you're excited to build real-world AI systems and grow fast in a high-impact environment, we'd love to hear from you.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
BJAK.jpg

MLOps Engineer

Bjak
-
JP.svg
Japan
Full-time
Remote
false
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that deliver real-world impact and the broadest possible usage worldwide. This role is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You'll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.

What You'll Do
Run and manage open-source models efficiently, optimizing for cost and reliability
Ensure high performance and stability across GPU, CPU, and memory resources
Monitor and troubleshoot model inference to maintain low latency and high throughput
Collaborate with engineers to implement scalable and reliable model serving solutions

What It's Like
You like ownership and independence
You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
You see feedback and failure as part of growth; you're here to level up
You possess humility, hunger, and hustle, and lift others up as you go

Requirements
Experience with model serving platforms such as vLLM or HuggingFace TGI
Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
Ability to monitor latency and costs, and to scale systems efficiently with traffic demands
Experience setting up inference endpoints for backend engineers

What You'll Get
Flat structure and real ownership
Full involvement in direction and consensus decision-making
Flexibility in work arrangement
High-impact role with visibility across product, data, and engineering
Top-of-market compensation and performance-based bonuses
Global exposure to product development
Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
Health, dental, and vision insurance
Global travel insurance (for you and your dependents)
Unlimited, flexible time off

Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you're excited to build real-world AI systems and grow fast in a high-impact environment, we'd love to hear from you.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
BJAK.jpg

MLOps Engineer

Bjak
-
Indonesia
Full-time
Remote
true
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that deliver real-world impact and the broadest possible usage worldwide. This is a remote role based in Indonesia, working closely with our HQ in Malaysia and cross-functional regional teams. You'll operate across the stack, from backend logic and integration to frontend delivery, building intelligent systems that scale fast and matter deeply.

Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.

What You'll Do
Run and manage open-source models efficiently, optimizing for cost and reliability
Ensure high performance and stability across GPU, CPU, and memory resources
Monitor and troubleshoot model inference to maintain low latency and high throughput
Collaborate with engineers to implement scalable and reliable model serving solutions

What It's Like
You like ownership and independence
You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
You see feedback and failure as part of growth; you're here to level up
You possess humility, hunger, and hustle, and lift others up as you go

Requirements
Experience with model serving platforms such as vLLM or HuggingFace TGI
Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
Ability to monitor latency and costs, and to scale systems efficiently with traffic demands
Experience setting up inference endpoints for backend engineers

What You'll Get
Flat structure and real ownership
Full involvement in direction and consensus decision-making
Flexibility in work arrangement
High-impact role with visibility across product, data, and engineering
Top-of-market compensation and performance-based bonuses
Global exposure to product development
Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
Health, dental, and vision insurance
Global travel insurance (for you and your dependents)
Unlimited, flexible time off

Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you're excited to build real-world AI systems and grow fast in a high-impact environment, we'd love to hear from you.
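The "scale systems efficiently with traffic demands" requirement above is, in practice, a proportional scaling decision like the one Kubernetes' Horizontal Pod Autoscaler makes: desired replicas = ceil(observed load / per-replica target). A stdlib sketch under assumed numbers (the function name and rps targets are illustrative, not a real system's API):

```python
import math

def desired_replicas(current_replicas, observed_rps, target_rps_per_replica,
                     min_replicas=1, max_replicas=20):
    """HPA-style scaling decision: size the fleet proportionally to load,
    clamped to configured bounds."""
    if observed_rps <= 0:
        return min_replicas
    desired = math.ceil(observed_rps / target_rps_per_replica)
    return max(min_replicas, min(max_replicas, desired))

# Traffic doubles: 3 replicas serving 300 rps at a 50 rps/replica target.
print(desired_replicas(3, 300, 50))  # → 6
print(desired_replicas(3, 10, 50))   # → 1 (scale in, floored at min)
```

Real autoscalers add stabilization windows and cooldowns on top of this formula so the fleet does not thrash on noisy traffic.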
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
BJAK.jpg

MLOps Engineer (HK)

Bjak
-
HK.svg
Hong Kong
Full-time
Remote
false
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that deliver real-world impact and the broadest possible usage worldwide. This role is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You'll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.

What You'll Do
Run and manage open-source models efficiently, optimizing for cost and reliability
Ensure high performance and stability across GPU, CPU, and memory resources
Monitor and troubleshoot model inference to maintain low latency and high throughput
Collaborate with engineers to implement scalable and reliable model serving solutions

What It's Like
You like ownership and independence
You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
You see feedback and failure as part of growth; you're here to level up
You possess humility, hunger, and hustle, and lift others up as you go

Requirements
Experience with model serving platforms such as vLLM or HuggingFace TGI
Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
Ability to monitor latency and costs, and to scale systems efficiently with traffic demands
Experience setting up inference endpoints for backend engineers

What You'll Get
Flat structure and real ownership
Full involvement in direction and consensus decision-making
Flexibility in work arrangement
High-impact role with visibility across product, data, and engineering
Top-of-market compensation and performance-based bonuses
Global exposure to product development
Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
Health, dental, and vision insurance
Global travel insurance (for you and your dependents)
Unlimited, flexible time off

Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you're excited to build real-world AI systems and grow fast in a high-impact environment, we'd love to hear from you.
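A key throughput idea behind the model serving platforms this posting names (vLLM, HuggingFace TGI) is batching many requests into one forward pass. The sketch below shows only the size-based grouping step and is a toy, not how vLLM actually works internally (vLLM does continuous batching with scheduling and timeouts); the function name is illustrative.

```python
from itertools import islice

def microbatches(requests, max_batch_size):
    """Group a stream of incoming requests into batches, each of which
    would be served by a single model forward pass."""
    it = iter(requests)
    while True:
        batch = list(islice(it, max_batch_size))
        if not batch:
            return
        yield batch

# Seven queued requests served in batches of up to three.
print(list(microbatches(range(7), 3)))  # → [[0, 1, 2], [3, 4, 5], [6]]
```

Larger batches raise GPU utilization and throughput at the cost of per-request latency, which is exactly the trade-off an MLOps engineer tunes against traffic.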
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
BJAK.jpg

MLOps Engineer

Bjak
-
CN.svg
China
Full-time
Remote
false
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that deliver real-world impact and the broadest possible usage worldwide. This role is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You'll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.

What You'll Do
Run and manage open-source models efficiently, optimizing for cost and reliability
Ensure high performance and stability across GPU, CPU, and memory resources
Monitor and troubleshoot model inference to maintain low latency and high throughput
Collaborate with engineers to implement scalable and reliable model serving solutions

What It's Like
You like ownership and independence
You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
You see feedback and failure as part of growth; you're here to level up
You possess humility, hunger, and hustle, and lift others up as you go

Requirements
Experience with model serving platforms such as vLLM or HuggingFace TGI
Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
Ability to monitor latency and costs, and to scale systems efficiently with traffic demands
Experience setting up inference endpoints for backend engineers

What You'll Get
Flat structure and real ownership
Full involvement in direction and consensus decision-making
Flexibility in work arrangement
High-impact role with visibility across product, data, and engineering
Top-of-market compensation and performance-based bonuses
Global exposure to product development
Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
Health, dental, and vision insurance
Global travel insurance (for you and your dependents)
Unlimited, flexible time off

Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you're excited to build real-world AI systems and grow fast in a high-impact environment, we'd love to hear from you.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
Decagon.jpg

Senior Software Engineer, Infrastructure

Decagon
USD
0
200000
-
375000
US.svg
United States
Full-time
Remote
false
About Decagon
Decagon is the leading conversational AI platform empowering every brand to deliver a concierge customer experience. Our AI agents provide intelligent, human-like responses across chat, email, and voice, resolving millions of customer inquiries in every language, at any time.

Since coming out of stealth, Decagon has experienced rapid growth. We partner with industry leaders like Hertz, Eventbrite, Duolingo, Oura, Bilt, Curology, and Samsara to redefine customer experience at scale. We've raised over $200M from Bain Capital Ventures, Accel, a16z, BOND Capital, A*, Elad Gil, and notable angels such as the founders of Box, Airtable, Rippling, Okta, Lattice, and Klaviyo.

We're an in-office company, driven by a shared commitment to excellence and velocity. Our values (customers are everything, relentless momentum, winner's mindset, and stronger together) shape how we work and grow as a team.

About the Team
The Infrastructure team builds and operates the foundations that power Decagon: networking, data, ML serving, developer platform, and real-time voice. We partner closely with product, data, and ML to deliver high-scale, low-latency systems with clear SLOs and great developer ergonomics.

We organize around five focus areas:
Core Infra: the foundational cloud stack (networking, compute, storage, security, and infrastructure-as-code) ensuring reliability, scale, and cost efficiency.
Data Infra: streaming/batch data platforms powering analytics/BI and customer-facing telemetry, including for customer-managed and on-prem environments.
ML Infra: GPU and model-serving platforms for LLM inference with multi-provider routing and support for on-prem/air-gapped deployments.
Platform (DevEx): CI/CD, paved paths, and core services that make shipping fast, safe, and consistent across teams.
Voice Infra: telephony/WebRTC stack and observability enabling ultra-low-latency, high-quality voice experiences.

Our mission is to deliver magical support experiences: AI agents working alongside humans to resolve issues quickly and accurately.

About the Role
We're hiring a Senior Infrastructure Engineer to design, build, and operate production infrastructure for high-scale, low-latency systems. You'll own critical services end-to-end, improve reliability and performance, and create paved paths that let every Decagon engineer ship confidently.

In this role, you will
Design and implement critical infrastructure services with strong SLOs, clear runbooks, and actionable telemetry.
Partner with research and product teams to architect solutions, set up prototypes, evaluate performance, and scale new features.
Tune service latencies: optimize networking paths, apply smart caching/queuing, and tune CPU/memory/I/O for tight p95/p99s.
Evolve CI/CD, golden paths, and self-service tooling to improve developer velocity and safety.
Support varied customer deployment architectures with robust observability and upgrade paths.
Lead infrastructure-as-code (Terraform) and GitOps practices; reduce drift with reusable modules and policy-as-code.
Participate in on-call and drive down toil through automation and elimination of recurring issues.

Your background looks something like this
6+ years building and operating production infrastructure at scale.
Depth in at least one area across Core/Data/AI-ML/Platform/Voice, with curiosity to learn the rest.
Proven track record of meeting high-availability and low-latency targets (owning SLOs, p95/p99, and load testing).
Excellent observability chops (OpenTelemetry, Prometheus/Grafana, Datadog) and incident response (PagerDuty, SLO/error budgets).
Clear written communication and the ability to turn ambiguous requirements into simple, reliable designs.

Even better
Experience as an early backend/platform/infrastructure engineer at another company.
Strong Kubernetes experience (GKE/EKS/AKS) and experience across multiple cloud providers (GCP, AWS, and Azure).
Experience with customer-managed deployments.

Benefits
Medical, dental, and vision
Flexible time off
Daily lunch/dinner and snacks in the office
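The "SLO/error budgets" mentioned above reduce to simple arithmetic: a request-based SLO of 99.9% allows 0.1% of requests to fail, and incident response is prioritized by how much of that allowance has been spent. A stdlib sketch with assumed numbers (the function name and figures are illustrative, not Decagon's):

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent for a request-based SLO.

    The budget is the number of failures the SLO permits over the window:
    (1 - slo_target) * total_requests.
    """
    budget = (1 - slo_target) * total_requests  # allowed failures
    if budget <= 0:
        return 0.0
    return max(0.0, 1 - failed_requests / budget)

# A 99.9% SLO over 1,000,000 requests permits 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")  # → 75%
```

When the remaining budget approaches zero, teams typically freeze risky releases and spend engineering time on reliability instead, which is the policy lever that makes SLOs useful.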
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Hidden link
Figure.jpg

Systems Integration Engineer – Head Subsystem

Figure AI
USD
150000
-
350000
US.svg
United States
Full-time
Remote
false
Figure is an AI robotics company developing autonomous general-purpose humanoid robots. The goal of the company is to ship humanoid robots with human-level intelligence. Its robots are engineered to perform a variety of tasks in the home and commercial markets. Figure is headquartered in San Jose, CA. Figure's vision is to deploy autonomous humanoids at a global scale.

Our Helix team is looking for an experienced Training Infrastructure Engineer to take our infrastructure to the next level. This role is focused on managing the training cluster and implementing distributed training algorithms, data loaders, and developer tools for AI researchers. The ideal candidate has experience building tools and infrastructure for a large-scale deep learning system.

Responsibilities
Design, deploy, and maintain Figure's training clusters
Architect and maintain scalable deep learning frameworks for training on massive robot datasets
Work with AI researchers to implement training of new model architectures at large scale
Implement distributed training and parallelization strategies to reduce model development cycles
Implement tooling for data processing, model experimentation, and continuous integration

Requirements
Strong software engineering fundamentals
Bachelor's or Master's degree in Computer Science, Robotics, Engineering, or a related field
Experience with Python and PyTorch
Experience managing HPC clusters for deep neural network training
Minimum of 4 years of professional, full-time experience building reliable backend systems

Bonus Qualifications
Experience managing cloud infrastructure (AWS, Azure, GCP)
Experience with job scheduling/orchestration tools (SLURM, Kubernetes, LSF, etc.)
Experience with configuration management tools (Ansible, Terraform, Puppet, Chef, etc.)

The US base salary range for this full-time position is between $150,000 and $350,000 annually.
The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended.
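The "distributed training and parallelization strategies" in this posting usually mean data parallelism: each worker computes gradients on its own shard of the batch, then the gradients are averaged across workers (the all-reduce step) before the optimizer update. A toy single-process sketch of just the averaging step, illustrative only (real systems use PyTorch DDP over NCCL; the function name here is hypothetical):

```python
def allreduce_mean(worker_grads):
    """Average per-parameter gradients across workers, i.e. the all-reduce
    step of data-parallel training, done naively in one process."""
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n_workers
            for i in range(n_params)]

# Three workers, each holding gradients for two parameters.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(allreduce_mean(grads))  # → [3.0, 4.0]
```

Because every worker applies the same averaged gradient, all replicas stay in sync, which is why data parallelism scales training throughput nearly linearly until communication becomes the bottleneck.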
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
Figure.jpg

Validation Engineer – Mechanical Systems

Figure AI
USD
150000
-
350000
US.svg
United States
Full-time
Remote
false
Figure is an AI robotics company developing autonomous general-purpose humanoid robots. The goal of the company is to ship humanoid robots with human-level intelligence. Its robots are engineered to perform a variety of tasks in the home and commercial markets. Figure is headquartered in San Jose, CA. Figure's vision is to deploy autonomous humanoids at a global scale.

Our Helix team is looking for an experienced Training Infrastructure Engineer to take our infrastructure to the next level. This role is focused on managing the training cluster and implementing distributed training algorithms, data loaders, and developer tools for AI researchers. The ideal candidate has experience building tools and infrastructure for a large-scale deep learning system.

Responsibilities
Design, deploy, and maintain Figure's training clusters
Architect and maintain scalable deep learning frameworks for training on massive robot datasets
Work with AI researchers to implement training of new model architectures at large scale
Implement distributed training and parallelization strategies to reduce model development cycles
Implement tooling for data processing, model experimentation, and continuous integration

Requirements
Strong software engineering fundamentals
Bachelor's or Master's degree in Computer Science, Robotics, Engineering, or a related field
Experience with Python and PyTorch
Experience managing HPC clusters for deep neural network training
Minimum of 4 years of professional, full-time experience building reliable backend systems

Bonus Qualifications
Experience managing cloud infrastructure (AWS, Azure, GCP)
Experience with job scheduling/orchestration tools (SLURM, Kubernetes, LSF, etc.)
Experience with configuration management tools (Ansible, Terraform, Puppet, Chef, etc.)

The US base salary range for this full-time position is between $150,000 and $350,000 annually.
The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended.
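The "distributed training and parallelization strategies" this posting mentions most commonly mean data parallelism: each worker computes gradients on its own data shard, and the per-parameter gradients are averaged (an all-reduce) before every optimizer step. A minimal pure-Python sketch of that averaging step, with made-up gradient values for illustration (not Figure's actual pipeline, which the posting says is built on PyTorch):

```python
def allreduce_mean(worker_grads):
    """Average per-parameter gradients across workers (data parallelism).

    worker_grads: one gradient vector per worker, each a list of floats
    of equal length. Returns the element-wise mean, which every worker
    would then apply in its optimizer step so all replicas stay in sync.
    """
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [
        sum(g[i] for g in worker_grads) / n_workers
        for i in range(n_params)
    ]

# Three workers, each holding gradients for the same two parameters.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(allreduce_mean(grads))  # [3.0, 4.0]
```

In a real cluster this averaging is done by a collective-communication library (e.g. NCCL behind PyTorch's DistributedDataParallel) rather than in Python, but the arithmetic is the same.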
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Crusoe.jpg

Senior Site Reliability Engineer, Compute

Crusoe
USD
0
172000
-
209000
US.svg
United States
Full-time
Remote
false
Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI, without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.

About This Role:
At Crusoe, we are building the most sustainable, AI-first cloud infrastructure, and our Compute-focused Site Reliability Engineers are the backbone of that mission. This role is centered on supporting virtualization, hypervisor, and kernel-level performance for Crusoe's compute infrastructure. You'll play a vital role in deploying and optimizing bare-metal and virtualized compute platforms, ensuring performance, security, and scale for modern AI and HPC workloads.

What You'll Be Working On:
In this role, you will develop automation and observability tools to monitor Crusoe's compute infrastructure, spanning from the kernel to the orchestration layers. You will support and scale the company's virtualization stack, including technologies such as KVM, QEMU, and other hypervisors. Collaborating with Linux kernel and hardware teams, you'll help identify and resolve performance bottlenecks and driver issues, and optimize hardware offloads. A key focus will be on optimizing performance for AI and HPC workloads across CPU, GPU, and DPU/NIC resources. You will participate in root-cause analysis for kernel crashes, hardware-software integration problems, and performance regressions, while also integrating hypervisor-level enhancements to improve guest VM reliability and workload isolation. The role involves tuning kernel subsystems such as the process scheduler, NUMA configuration, memory management, and interrupt handling. Additionally, you will work closely with platform teams to implement and validate support for emerging compute hardware, including SmartNICs, BlueField devices, and TPUs.

What You'll Bring to the Team:
- 8+ years of professional experience in compute SRE, Linux system engineering, or compute infrastructure roles
- Strong proficiency in Linux kernel internals, with exposure to the scheduler, memory allocation, and driver subsystems
- Experience with virtualization architectures and technologies such as KVM, Xen, QEMU, or VMware
- Familiarity with SmartNICs/DPUs (e.g., NVIDIA CX6/7, BlueField-3) and kernel-bypass techniques
- Expert-level skills in at least one programming language: Go, C, or Rust
- Experience with system-level debugging, including kdump, kexec, and kernel panic analysis
- Proficiency in Infrastructure as Code tooling and CI/CD practices for bare-metal or cloud infrastructure
- Strong understanding of compute scheduling, resource management, and high-throughput networking

Benefits:
- Industry-competitive pay
- Restricted Stock Units in a fast-growing, well-funded technology company
- Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
- Employer contributions to HSA accounts
- Paid parental leave
- Paid life insurance, short-term and long-term disability
- Teladoc
- 401(k) with a 100% match up to 4% of salary
- Generous paid time off and holiday schedule
- Cell phone reimbursement
- Tuition reimbursement
- Subscription to the Calm app
- MetLife Legal
- Company-paid commuter benefit: $300/month

Compensation Range:
Compensation will be paid in the range of $172,000 - $209,000 a year + bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer.
Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
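The NUMA-configuration work this role describes often starts with a simple placement question: which NUMA node should each guest VM's memory and vCPUs be bound to, so no node is oversubscribed before the others are used? A hypothetical round-robin sketch (VM names and node counts are invented for illustration; real placement would also weigh memory pressure and device locality):

```python
def assign_numa_nodes(vm_names, numa_node_count):
    """Spread VMs across NUMA nodes round-robin.

    Returns a dict of vm name -> node id, which could then drive
    pinning (e.g. via numactl or libvirt <numatune> settings) so each
    guest's memory and vCPUs stay local to one node, avoiding costly
    cross-node memory access.
    """
    return {vm: i % numa_node_count for i, vm in enumerate(vm_names)}

placement = assign_numa_nodes(["vm-a", "vm-b", "vm-c", "vm-d", "vm-e"], 2)
print(placement)  # {'vm-a': 0, 'vm-b': 1, 'vm-c': 0, 'vm-d': 1, 'vm-e': 0}
```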
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Ema.jpg

Senior Infrastructure Engineer

Ema
-
IN.svg
India
Full-time
Remote
false
Who we are
Ema is building the next generation of AI technology to empower every employee in the enterprise to be their most creative and productive. Our proprietary tech allows enterprises to delegate most repetitive tasks to Ema, the AI employee. We are founded by ex-Google, Coinbase, and Okta executives and serial entrepreneurs. We've raised capital from notable investors such as Accel Partners, Naspers, Section32, and a host of prominent Silicon Valley angels, including Sheryl Sandberg (Facebook/Google), Divesh Makan (Iconiq Capital), Jerry Yang (Yahoo), Dustin Moskovitz (Facebook/Asana), David Baszucki (Roblox CEO), and Gokul Rajaram (Doordash, Square, Google).

Our team is a powerhouse of talent, comprising engineers from leading tech companies like Google, Microsoft Research, Facebook, Square/Block, and Coinbase. All our team members hail from top-tier educational institutions such as Stanford, MIT, UC Berkeley, CMU, and the Indian Institute of Technology. We're well funded by top investors and angels. Ema is based in Silicon Valley and Bangalore, India. This will be a hybrid role where we expect employees to work from the office three days a week.

Who you are
We are seeking an experienced Infrastructure Engineer to join our growing team and play a pivotal role in designing and building our platform and infrastructure as we continue to scale our product and user base. As part of our team, you will work in a dynamic, fast-paced environment to ensure the reliability, scalability, and performance of our systems, focusing on service architecture and deployment, query optimization, distributed systems, data and machine learning infrastructure, and security and authentication.
Most importantly, you are excited to be part of a mission-oriented, fast-paced, high-growth startup that can create a lasting impact.

You will:
- Partner with product, infra, and engineering teams to architect and build Ema's next-generation infrastructure platform supporting multi-cloud deployments and on-prem installations
- Design and implement scalable, secure, and resilient deployment frameworks for Ema SaaS and enterprise on-prem environments, enabling automated installation, upgrades, and lifecycle management of Ema
- Develop and maintain multi-cloud infrastructure pipelines (AWS, Azure, GCP) using Kubernetes, Helm, Terraform, and cloud-native services to ensure seamless and reliable deployments
- Build tools and frameworks to automate the provisioning, configuration, monitoring, and upgrade of Ema environments at scale
- Design and optimize CI/CD pipelines (GitHub Actions, Cloud Build, etc.) to streamline the release process across environments while improving developer experience
- Contribute code and automation scripts (in Python, Go, or Shell) to strengthen infrastructure management and deployment reliability
- Ensure observability, scalability, and security across distributed systems by integrating monitoring, logging, and alerting solutions
- Collaborate cross-functionally to evolve Ema's infra architecture, enabling faster deployments, lower operational overhead, and improved platform stability

Nice to Have:
- Experience designing installers and deployment managers for both SaaS and air-gapped on-prem environments
- Strong understanding of container orchestration (Kubernetes) and infrastructure as code (Terraform, Helm)
- Hands-on experience with automation frameworks (Ansible, ArgoCD, Flux, or similar)
- Knowledge of service mesh and networking for multi-cloud environments (Istio, Envoy, or similar)
- Familiarity with monitoring and observability stacks (Prometheus, Grafana, SigNoz, PagerDuty)
- Prior experience in MLOps or data infrastructure is a plus
- Proficiency in Python or Go
- Exposure to air-gapped deployments, private clouds, or secure enterprise installations

Qualifications:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field
- 5+ years of hands-on experience in infrastructure, platform, or DevOps engineering, with strong exposure to multi-cloud environments
- Strong analytical and problem-solving skills, with a focus on scalability, reliability, and performance
- Demonstrated ability to work independently and collaboratively in a fast-paced, high-growth environment
- Experience working with global, cross-functional teams across time zones

Ema Unlimited is an equal opportunity employer and is committed to providing equal employment opportunities to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability, sexual orientation, gender identity, or genetics.
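Deployment frameworks that target both SaaS and air-gapped on-prem environments, as described above, usually reduce to layering a small per-environment override onto a common base configuration, the same precedence model Helm uses for values files. A hypothetical sketch (the keys and environment values are invented, not Ema's actual schema):

```python
def render_config(base, override):
    """Shallow-merge an environment override onto a base config.

    Mimics how Helm layers values files: keys present in the override
    win, everything else falls through from the base. A real installer
    would merge recursively and validate the result against a schema.
    """
    merged = dict(base)
    merged.update(override)
    return merged

# Common defaults, then an air-gapped override that swaps the image
# registry for an internal mirror and disables outbound telemetry.
base = {"replicas": 2, "registry": "public.example.com", "telemetry": True}
airgapped = {"registry": "registry.internal", "telemetry": False}
print(render_config(base, airgapped))
```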
MLOps / DevOps Engineer
Data Science & Analytics
Apply
OpenAI.jpg

Site Reliability Engineer, Frontier Systems Infrastructure

OpenAI
USD
255000
-
490000
US.svg
United States
Full-time
Remote
false
About the Team
The Frontier Systems team at OpenAI builds, launches, and supports the largest supercomputers in the world, which OpenAI uses for its most cutting-edge model training. We take data center designs, turn them into real, working systems, and build any software needed for running large-scale frontier model training. Our mission is to bring up, stabilize, and keep these hyperscale supercomputers reliable and efficient during the training of frontier models.

About the Role
We are looking for engineers to operate the next generation of compute clusters that power OpenAI's frontier research. This role blends distributed systems engineering with hands-on infrastructure work on our largest datacenters. You will scale Kubernetes clusters to massive size, automate bare-metal bring-up, and build the software layer that hides the complexity of a multitude of nodes across multiple data centers. You will work at the intersection of hardware and software, where speed and reliability are critical.
Expect to manage fast-moving operations, quickly diagnose and fix issues when things are on fire, and continuously raise the bar for automation and uptime.

In this role, you will:
- Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
- Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
- Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
- Improve operational metrics, such as reducing cluster restart times (e.g., from hours to minutes) and accelerating firmware or OS upgrade cycles
- Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure
- Develop monitoring and observability systems to detect issues early and keep clusters stable under extreme load
- Be expected to execute at the same level as a software engineer

You might thrive in this role if you:
- Have deep experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environments
- Bring strong programming or scripting skills (Python, Go, or similar) and familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation
- Are comfortable with bare-metal Linux environments, GPU hardware, and large-scale networking
- Enjoy solving fast-moving, high-impact operational problems and building automation to eliminate manual work
- Can balance careful engineering with the urgency of keeping mission-critical systems running

Qualifications:
- Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
- Strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
- Proficiency in cloud infrastructure concepts (compute, networking, storage, security) and in automating cluster or data center operations
- Bonus: background with GPU workloads, firmware management, or high-performance computing

About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristics. For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement.

Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information.
In addition, job duties require access to secure and protected information technology systems and related data security obligations. To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance. We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
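Bare-metal bring-up of the kind this posting describes ("from bare metal through firmware upgrades") is commonly modeled as a per-node state machine, so automation can resume any node from the stage where it stalled instead of restarting from scratch. A hypothetical sketch (the stage names and their order are illustrative, not OpenAI's actual pipeline):

```python
# Ordered bring-up stages; each node advances one stage at a time.
STAGES = ["discovered", "firmware_updated", "os_installed",
          "joined_cluster", "ready"]

def advance(node_state):
    """Return the next bring-up stage for a node ('ready' is terminal).

    Unknown states raise ValueError so corrupt inventory data fails
    loudly instead of silently re-provisioning a healthy node.
    """
    i = STAGES.index(node_state)  # raises ValueError on unknown state
    return STAGES[min(i + 1, len(STAGES) - 1)]

# Drive one node through the whole pipeline.
state = "discovered"
while state != "ready":
    state = advance(state)
    print(state)
```

The payoff of this structure is idempotent automation: a controller can reconcile thousands of nodes by repeatedly calling the transition for whichever stage each node is actually in.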
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Lambda.jpg

IT Systems Engineer, Infrastructure & Platform Reliability

Lambda AI
USD
0
206000
-
310000
US.svg
United States
Full-time
Remote
false
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda's mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco or San Jose office location 4 days per week; Lambda's designated work-from-home day is currently Tuesday.

Information Systems at Lambda is responsible for building and scaling the internal systems that power our business. We partner across the company, including Finance, GTM, Engineering, and People, to implement tools, automate workflows, and ensure data flows securely and accurately. Our scope includes enterprise applications, integrations, data platform and analytics, compliance automation, and all things IT.

What You'll Do:
- Design, write, and deliver software and services to improve the availability, scalability, reliability, and efficiency of Lambda's internal IT systems and platforms
- Solve problems relating to mission-critical services and build automation to prevent problem recurrence, with the goal of automating the response to all non-exceptional events
- Work with Lambda Engineering and internal teams to influence and create new designs, architectures, standards, and methods for large-scale distributed systems
- Engage in service capacity planning and demand forecasting, software performance analysis, and system tuning
- Be an excellent communicator, producing documentation and related artifacts for the systems you are responsible for

You:
- Have a keen interest in system design and architecting for performance and scalability, and experience with multiple cloud infrastructure platforms (AWS, GCP, Azure, etc.)
- Think carefully about systems: edge cases, failure modes, behaviors, and specific implementations
- Know and prefer configuration management systems and toolchains (Chef, Ansible, Terraform, GitHub Actions, etc.)
- Have solid programming skills: Python, Go, etc.
- Have an urge to collaborate and communicate asynchronously, combined with a desire to record and document issues and solutions
- Have an enthusiastic, go-for-it attitude; when you see something broken, you can't help but fix it
- Have an urge to deliver quickly and effectively, and to iterate fast

Nice to Have:
- Experience and interest in ML/AI workloads and compute
- Practical experience implementing and managing paging, alerting, and on-call scheduling flows
- A positive attitude, combined with a desire to learn and collaborate

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012; ~400 employees (2025) and growing fast
- We offer generous cash and equity compensation
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401(k) plan with 2% company match (USA employees)
- Flexible Paid Time Off plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer.
Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
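The "paging, alerting, and on-call scheduling flows" listed under nice-to-haves typically include deduplication: collapsing repeated firings of the same alert into a single page within a suppression window, so on-call engineers are not paged every minute for one ongoing incident. A hypothetical sketch (the (name, host) fingerprinting scheme and window length are invented for illustration):

```python
def dedupe_alerts(alerts, window_seconds=300):
    """Collapse repeated alerts into pages.

    `alerts` is a list of (timestamp, name, host) tuples sorted by
    timestamp. Only the first occurrence of each (name, host) pair
    inside the suppression window triggers a page; once the window
    elapses, the same alert pages again (the incident is still live).
    """
    last_paged = {}  # (name, host) -> timestamp of last page
    pages = []
    for ts, name, host in alerts:
        key = (name, host)
        if key not in last_paged or ts - last_paged[key] >= window_seconds:
            last_paged[key] = ts
            pages.append((ts, name, host))
    return pages

alerts = sorted([
    (0, "disk_full", "node1"),
    (60, "disk_full", "node1"),   # suppressed: inside the 300 s window
    (90, "disk_full", "node2"),   # different host, pages separately
    (400, "disk_full", "node1"),  # window elapsed, pages again
])
print(dedupe_alerts(alerts))
```

Commercial pagers (e.g. PagerDuty, mentioned elsewhere on this page) implement the same idea with configurable grouping keys rather than a hard-coded tuple.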
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply