Top MLOps / DevOps Engineer Jobs Openings in 2025

Looking for opportunities as an MLOps / DevOps Engineer? This curated list features the latest MLOps / DevOps Engineer job openings from AI-native companies. Whether you're an experienced professional or just entering the field, find roles that match your expertise, from startups to global tech leaders. Updated every day.

Fathom.ai

Infrastructure Engineer

Fathom
USD 180,000 - 240,000
United States
Full-time
Remote: Yes
ABOUT FATHOM
We think it's broken that so many people and businesses rely on notes to remember and share insights from their meetings. We created Fathom to eliminate the needless overhead of meetings. Our AI assistant captures, summarizes, and organizes the key moments of your calls, so you and your team can stay fully present without sacrificing context or clarity. From instant, searchable call summaries to seamless CRM updates and team-wide sharing, Fathom transforms meetings from a source of friction into a place for alignment and momentum.

We started Fathom to rid us all of the tyranny of note-taking, and people seem to really love what we've built so far:
🥇 #1 Highest Satisfaction Product of 2024 on G2
🔥 #1 Rated on G2 with 4,500+ reviews and a perfect 5/5 rating
🥇 #1 Product of the Day and #2 AI Product of the Year
🚀 Most installed AI meeting assistant on both the Zoom and HubSpot marketplaces
📈 We're hitting usage and revenue records every week

We're growing incredibly quickly, so we're looking to grow our small but mighty team.

Role Overview:
We are looking for an SRE who is passionate about leveraging data and automation to drive a highly dynamic infrastructure. The role is a unique blend of infrastructure and internal tooling to reduce friction at every step of delivering an amazing customer experience. As part of our team, you'll play a pivotal role in scaling our infrastructure, reducing toil through automation, and contributing to our culture of innovation and continuous improvement.

What you'll do:

By 30 Days:
- Use your observability background to help scale our existing tools to new heights as we continue to grow the platform.
- Enhance our existing automation for scaling our infrastructure and improve the development experience.

By 90 Days:
- Play a key role in continuing to diversify and scale our platform across additional regions.
- Evaluate options to replace our existing real-time data pipeline for enhanced multi-regional capabilities.
- Provide platform support to all of engineering, using data-driven decision-making.

By 1 Year:
- Work with engineering to re-evaluate what observability means for the Fathom platform, and drive improvements to remove friction.
- Help us design and implement improvements to our elastic multi-regional storage platform.
- Drive platform improvements to enhance reliability and efficiency.

Requirements:

Hard Skills:
- Proficiency with and preference for Infrastructure as Code / GitOps tooling.
- Foundation in observability best practices and implementation.
- Experience in a SaaS or PaaS environment.
- Experience with Google Cloud Platform (GCP) and Google Kubernetes Engine (GKE), including proficiency with GCP/GKE networking.
- Familiarity with our tech stack: message queues, Prometheus, ClickHouse, ArgoCD, GitHub Actions, Golang. (Ruby / Rails is a bonus)

Soft Skills:
- Curiosity-driven with a focus on delivering results.
- A generalist mindset with the ability to dive deep into a wide range of challenges.
- Resilience and an ability to grind through complex problems.
- Openness to disagreement and commitment to decisions once made.
- Strong collaborative skills, with the ability to explain complex insights in an accessible manner.
- Independence in managing one's workload and priorities.

What You'll Get:
- The opportunity to shape the dynamic platform of a growing company.
- A role that balances scaling infrastructure, enabling development teams, and internal tooling development.
- A chance to work with a dynamic and collaborative team.
- Competitive compensation and benefits.
- A supportive environment that encourages innovation and personal growth.

Join Us:
If you're excited to own the data journey at Fathom and contribute to our mission with your analytical expertise, we would love to hear from you. Apply now to become a key player in our data-driven success story.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply

Data Center Operations Systems Engineer - Atlanta

Lambda AI
USD 89,000 - 134,000
United States
Full-time
Remote: No
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda's mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our Atlanta, GA Data Center 5 days per week.

The Operations team plays a critical role in ensuring the seamless end-to-end execution of our AI-IaaS infrastructure and hardware. This team is responsible for sourcing all necessary infrastructure and components, overseeing day-to-day data center operations to maintain optimal performance and uptime, and driving cross-company coordination through the product management organization to align operational capabilities with strategic goals. By managing the full lifecycle from procurement to deployment and operational efficiency, the Operations team ensures that our AI-driven infrastructure is reliable, scalable, and aligned with business priorities.

What You'll Do
- Ensure new server, storage, and network infrastructure is properly racked, labeled, cabled, and configured.
- Document data center layout and network topology in DCIM software.
- Work with supply chain & manufacturing teams to ensure timely deployment of systems and project plans for large-scale deployments.
- Participate in data center capacity and roadmap planning with sales and customer success teams to allocate floorspace.
- Assess current and future state data center requirements based on growth plans and technology trends.
- Manage a parts depot inventory and track equipment through the delivery-store-stage-deploy-handoff process in each of our data centers.
- Work closely with the HW Support team to ensure data center infrastructure-related support tickets are resolved.
- Work with the RMA team to ensure faulty parts are returned and replacements are ordered.
- Create installation standards and documentation for placement, labeling, and cabling to drive consistency and discoverability across all data centers.
- Serve as a subject-matter expert on data center deployments as part of sales engagement for large-scale deployments in our data centers and at customer sites.

You
- Have experience with critical infrastructure systems supporting data centers, such as power distribution, air flow management, environmental monitoring, capacity planning, DCIM software, structured cabling, and cable management.
- Have strong Linux administration experience.
- Have experience setting up networking appliances (Ethernet and InfiniBand) across multiple data center locations.
- Are action-oriented and have a strong willingness to learn.
- Are willing to travel for bring-up of new data center locations.

Nice to Have
- Experience troubleshooting the following network layers, technologies, and system protocols: TCP/IP, UDP/IP, BGP, OSPF, SNMP, SSL, HTTP, FTP, SSH, Syslog, DHCP, DNS, RDP, NETBIOS, IP routing, Ethernet, switched Ethernet, 802.11x, NFS, and VLANs.
- Experience working in large-scale distributed data center environments.
- Experience working with auditors to meet all compliance requirements (ISO/SOC).

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast.
- We offer generous cash & equity compensation.
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability.
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG.
- Health, dental, and vision coverage for you and your dependents.
- Wellness and commuter stipends for select roles.
- 401k plan with 2% company match (USA employees).
- Flexible Paid Time Off plan that we all actually use.

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Apply

Data Center Operations Engineer - Virginia

Lambda AI
USD 89,000 - 134,000
United States
Full-time
Remote: No
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda's mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our Ashburn and Sterling, VA Data Centers 5 days per week.

The Operations team plays a critical role in ensuring the seamless end-to-end execution of our AI-IaaS infrastructure and hardware. This team is responsible for sourcing all necessary infrastructure and components, overseeing day-to-day data center operations to maintain optimal performance and uptime, and driving cross-company coordination through the product management organization to align operational capabilities with strategic goals. By managing the full lifecycle from procurement to deployment and operational efficiency, the Operations team ensures that our AI-driven infrastructure is reliable, scalable, and aligned with business priorities.

What You'll Do
- Ensure new server, storage, and network infrastructure is properly racked, labeled, cabled, and configured.
- Document data center layout and network topology in DCIM software.
- Work with supply chain & manufacturing teams to ensure timely deployment of systems and project plans for large-scale deployments.
- Participate in data center capacity and roadmap planning with sales and customer success teams to allocate floorspace.
- Assess current and future state data center requirements based on growth plans and technology trends.
- Manage a parts depot inventory and track equipment through the delivery-store-stage-deploy-handoff process in each of our data centers.
- Work closely with the HW Support team to ensure data center infrastructure-related support tickets are resolved.
- Work with the RMA team to ensure faulty parts are returned and replacements are ordered.
- Create installation standards and documentation for placement, labeling, and cabling to drive consistency and discoverability across all data centers.
- Serve as a subject-matter expert on data center deployments as part of sales engagement for large-scale deployments in our data centers and at customer sites.

You
- Have experience with critical infrastructure systems supporting data centers, such as power distribution, air flow management, environmental monitoring, capacity planning, DCIM software, structured cabling, and cable management.
- Have strong Linux administration experience.
- Have experience setting up networking appliances (Ethernet and InfiniBand) across multiple data center locations.
- Are action-oriented and have a strong willingness to learn.
- Are willing to travel to bring up new data center locations.

Nice to Have
- Experience troubleshooting the following network layers, technologies, and system protocols: TCP/IP, UDP/IP, BGP, OSPF, SNMP, SSL, HTTP, FTP, SSH, Syslog, DHCP, DNS, RDP, NETBIOS, IP routing, Ethernet, switched Ethernet, 802.11x, NFS, and VLANs.
- Experience working in large-scale distributed data center environments.
- Experience working with auditors to meet all compliance requirements (ISO/SOC).

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast.
- We offer generous cash & equity compensation.
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability.
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG.
- Health, dental, and vision coverage for you and your dependents.
- Wellness and commuter stipends for select roles.
- 401k plan with 2% company match (USA employees).
- Flexible Paid Time Off plan that we all actually use.

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Apply

Senior Site Reliability Engineer - Networking

Lambda AI
USD 250,000 - 417,000
United States
Full-time
Remote: No
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda's mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco/San Jose/Seattle office location 4 days per week; Lambda's designated work-from-home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems, as well as internal tooling for system deployment, management, and maintenance.

What You'll Do
- Help scale Lambda's high-performance multi-tenant cloud network.
- Contribute to the reproducible automation of network configuration and deployments.
- Contribute to the implementation and operations of Software Defined Networks.
- Help deploy and manage spine-and-leaf networks.
- Ensure high availability of our network through observability, failover, and redundancy.
- Ensure clients have predictable networking performance through the use of network engineering and other applicable technologies.
- Help deploy and maintain network monitoring and management tools.
- Participate in on-call.

You
- Have 5+ years of experience as a SWE, SRE, or Network Reliability Engineer.
- Have been part of the implementation of production-scale networking projects.
- Have experience with on-call and incident response management.
- Have experience building and maintaining Software Defined Networks (SDN); experience with OpenStack, Neutron, OVN.
- Are comfortable on the Linux command line, and have an understanding of the Linux networking stack.
- Have experience with multi-data-center networks and hybrid cloud networks.
- Have Python programming experience and experience with configuration management tools like Ansible.
- Have experience with Git and CI/CD tools for deployment, and have operated a network environment with GitOps practices in place.
- Have experience with application lifecycle and deployments on Kubernetes.

Nice To Have
- Operated production-scale SDNs in a cloud context (e.g. helped implement or operate the infrastructure that powers an AWS VPC-like feature).
- Software development experience with C, Go, Python.
- Experience automating network configuration within public clouds, with tools like Kubernetes, Helm, Terraform, Ansible.
- Deep understanding of the Linux networking stack and its interaction with network virtualization, SR-IOV, and DPDK.
- Understanding of the SDN ecosystem (e.g. OVS, Neutron, VMware NSX, Cisco ACI or Nexus Fabric Controller, Arista CVP).
- Experience with spine-and-leaf (Clos) network topology.
- Experience and understanding of BGP EVPN VXLAN networks.
- Experience building and maintaining multi-data-center networks, SD-WAN, DWDM.
- Experience with Next-Generation Firewalls (NGFW).

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast.
- We offer generous cash & equity compensation.
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability.
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG.
- Health, dental, and vision coverage for you and your dependents.
- Wellness and commuter stipends for select roles.
- 401k plan with 2% company match (USA employees).
- Flexible Paid Time Off plan that we all actually use.

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Apply

Head of AI Platform

Abridge
USD 270,000 - 340,000
United States
Full-time
Remote: No
About Abridge
Abridge was founded in 2018 with the mission of powering deeper understanding in healthcare. Our AI-powered platform was purpose-built for medical conversations, improving clinical documentation efficiencies while enabling clinicians to focus on what matters most: their patients.

Our enterprise-grade technology transforms patient-clinician conversations into structured clinical notes in real time, with deep EMR integrations. Powered by Linked Evidence and our purpose-built, auditable AI, we are the only company that maps AI-generated summaries to ground truth, helping providers quickly trust and verify the output. As pioneers in generative AI for healthcare, we are setting the industry standards for the responsible deployment of AI across health systems.

We are a growing team of practicing MDs, AI scientists, PhDs, creatives, technologists, and engineers working together to empower people and make care make more sense. We have offices located in the Mission District in San Francisco, the SoHo neighborhood of New York, and East Liberty in Pittsburgh.

The Role
Our generative AI-powered products bring joy back to the practice of medicine. As our offerings expand, we're looking for a Head of AI Platform to scale the infrastructure that powers them. This is a critical, high-leverage role requiring people leadership, technical strategy, and ownership of key business outcomes. You will own the entire lifecycle of our AI Platform, ensuring its reliability, efficiency, scalability, and compliance. You'll own a key pillar of our technical organization, driving the technical direction and shaping how our models are trained, served, and managed in production.

What You'll Do
- People Management: Recruit, retain, and mentor engineers and engineering managers. Provide regular feedback, create opportunities for career growth, and foster a culture of collaboration and excellence.
- Technical & Organizational Leadership: Act as the people and technical leader for the AI Platform team. This includes owning the staffing and execution of the team, and driving work on model serving, training compute, the agent serving platform, the LLM gateway, and associated orchestration layers. You will guide architectural discussions and set top-level strategic direction for the company's AI/ML infrastructure.
- Project Management: Work closely with stakeholders, including product managers, engineering managers, and AI/ML teams, to plan, execute, and support multiple projects simultaneously. You will be responsible for the engineering process in the team and the output of the platform.
- Platform Ownership: Own the design, build, and operation of the core AI platform components, including model serving and deployment infrastructure; compute and vendor management (e.g. GPU allocation); MLOps pipelines and tooling; health, quality, and performance monitoring; training compute infrastructure; and the LLM gateway and orchestration layers for agent serving.
- Champion Quality: Set a high standard for your team, including software quality, communication, collaboration, and compliance with industry and regulatory standards.

What You'll Bring
- A strong technologist, with 10+ years of experience building high-performance distributed systems and 3+ years managing AI/ML-focused engineering teams.
- Comfortable giving constructive feedback on technical designs and code reviews.
- Skilled in building secure, compliant systems on major cloud platforms (GCP preferred, but other experience welcome).
- Skilled at hiring and mentorship, with a track record of helping engineers grow their skills and careers.
- Expertise with Kubernetes, containers, model training and serving, GPU-based capacity planning, and building applications on top of LLMs.
- Knowledgeable about the software development lifecycle. You view processes such as Kanban and Scrum as tools in a toolbox, and you know which tools to apply in which situations.
- Up to date on industry best practices and tools, and enjoy learning new things.
- Excited about being hands-on in a fast-moving, productive, and supportive environment.
- Willing to pitch in wherever needed.
- Have thrived in a fast-growing startup and know how to operate in that environment.

Bonus Points If You…
- Have owned an Evaluation Platform for AI/ML models.
- Have owned data engineering or core infrastructure.

Why Work at Abridge?
At Abridge, we're transforming healthcare delivery experiences with generative AI, enabling clinicians and patients to connect in deeper, more meaningful ways. Our mission is clear: to power deeper understanding in healthcare. We're driving real, lasting change, with millions of medical conversations processed each month.

Joining Abridge means stepping into a fast-paced, high-growth startup where your contributions truly make a difference. Our culture requires extreme ownership: every employee has the ability to (and is expected to) make an impact on our customers and our business.

Beyond individual impact, you will have the opportunity to work alongside a team of curious, high-achieving people in a supportive environment where success is shared, growth is constant, and feedback fuels progress. At Abridge, it's not just what we do, it's how we do it. Every decision is rooted in empathy, always prioritizing the needs of clinicians and patients.

We're committed to supporting your growth, both professionally and personally. Whether it's flexible work hours, an inclusive culture, or ongoing learning opportunities, we are here to help you thrive and do the best work of your life.

If you are ready to make a meaningful impact alongside passionate people who care deeply about what they do, Abridge is the place for you.

How we take care of Abridgers:
- Generous Time Off: 13 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees.
- Comprehensive Health Plans: Medical, Dental, and Vision plans for all full-time employees. Abridge covers 100% of the premium for you and 75% for dependents. If you choose an HSA-eligible plan, Abridge also makes monthly contributions to your HSA.
- Paid Parental Leave: 16 weeks paid parental leave for all full-time employees.
- 401k and Matching: Contribution matching to help invest in your future.
- Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits.
- Learning and Development Budget: Yearly contributions for coaching, courses, workshops, conferences, and more.
- Sabbatical Leave: 30 days of paid Sabbatical Leave after 5 years of employment.
- Compensation and Equity: Competitive compensation and equity grants for full-time employees.
... and much more!

Equal Opportunity Employer
Abridge is an equal opportunity employer and considers all qualified applicants equally without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, or disability.

Staying safe: protect yourself from recruitment fraud
We are aware of individuals and entities fraudulently representing themselves as Abridge recruiters and/or hiring managers. Abridge will never ask for financial information or payment, or for personal information such as a bank account number or social security number, during the job application or interview process. Any emails from the Abridge recruiting team will come from an @abridge.com email address. You can learn more about how to protect yourself from these types of fraud by referring to this article. Please exercise caution and cease communications if something feels suspicious about your interactions.
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Apply

Platform Engineer - Infrastructure

Lovable
-
Sweden
Full-time
Remote: No
TL;DR
We're looking for an exceptional platform engineer to design and scale the infrastructure behind the future of AI software engineering. You will build and operate the lower part of our stack, making sure that it meets the massive scaling demands of Lovable.

Why Lovable?
Lovable lets anyone and everyone build software with plain English. From solopreneurs to Fortune 100 teams, millions of people use Lovable to transform raw ideas into real products - fast. We are at the forefront of a foundational shift in software creation, which means you have an unprecedented opportunity to change the way the digital world works. Over 2 million people in 200+ countries already use Lovable to launch businesses, automate work, and bring their ideas to life. And we're just getting started.

We're a small, talent-dense team building a generation-defining company from Stockholm. We value extreme ownership, high velocity, and low-ego collaboration. We seek out people who care deeply, ship fast, and are eager to make a dent in the world.

What we're looking for
- Deep experience building production infrastructure as a Platform Engineer, Site Reliability Engineer, or similar.
- You have deep experience with service orchestration, networking, cloud infrastructure, and general modern scalable infrastructure practices from a global tech startup or scale-up.
- You have strong programming skills.
- You're comfortable navigating ambiguity and solving problems as they arise.
- You care about security, stability, and speed, and know when to make trade-offs between them.
- You're based in Stockholm or ready to relocate - this is an on-site, 5-days-a-week role.

What you'll do
In one sentence: own and scale the platform that makes AI engineering work for everyone.
- Design, build, and maintain the systems that enable our AI product, such as a runtime environment for running AI agent workloads in a secure and scalable way, and a high-throughput sandbox scheduler across multiple cloud providers.
- Harden our infrastructure against failures, downtime, and slowdowns.
- Support our growth; make sure that our infrastructure never becomes a bottleneck.
- Plan and implement our network infrastructure and cloud strategy.
- Identify and drive reliability improvement efforts across all engineering teams.

Our tech stack
We're building with tools that both humans and AI love. Lovable platform engineers are capable of working across the whole stack. Examples of tech in our stack include:
- Frontend: React and TypeScript.
- Backend: Golang and Rust.
- Cloud: Cloudflare, GCP, AWS, Modal.
- Data: ClickHouse, Firestore, Spanner, BigQuery.
- DevOps & Tooling: CI/CD pipelines, OTEL, Kubernetes, Terraform.
And we're always on the lookout for what's next!

About your application
Please submit your application in English - it's our company language, so you'll be speaking lots of it if you join. We treat all candidates equally - if you're interested, please apply through our careers portal.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply

Platform Engineer - Developer Experience

Lovable
-
Sweden
Full-time
Remote: No
TL;DR
We are seeking a Platform Engineer to enhance and scale Lovable's developer experience. In this role, you will be responsible for critical components of our stack, ensuring our engineering workflows are faster, smoother, and more Lovable.

Why Lovable?
Lovable lets anyone and everyone build software with plain English. From solopreneurs to Fortune 100 teams, millions of people use Lovable to transform raw ideas into real products - fast. We are at the forefront of a foundational shift in software creation, which means you have an unprecedented opportunity to change the way the digital world works. Over 2 million people in 200+ countries already use Lovable to launch businesses, automate work, and bring their ideas to life. And we're just getting started.

We're a small, talent-dense team building a generation-defining company from Stockholm. We value extreme ownership, high velocity, and low-ego collaboration. We seek out people who care deeply, ship fast, and are eager to make a dent in the world.

What we're looking for
- You have strong programming skills and a track record of improving developer velocity and system reliability.
- 7+ years of experience working in a platform team supporting developer experience (o11y, CI/CD, application frameworks, productivity tooling, etc.).
- You've written code and tools to support growing engineering orgs in scale-ups.
- You have experience with Docker, Kubernetes, and modern infrastructure practices.
- You're a problem-solver who thrives on challenges and ships high-leverage systems fast.
- You're comfortable navigating ambiguity and solving problems as they arise.
- You care about security, stability, and speed, and know when to make trade-offs between them.
- You're based in Stockholm or ready to relocate - this is an on-site, 5-days-a-week role.

What you'll do
In one sentence: own and scale the developer experience at Lovable, making our engineering teams the most productive in the world.
- Bring order and structure to our code base.
- Integrate or build application frameworks to support a growing engineering organization and code footprint.
- Own and develop our observability stack, from code instrumentation through ingestion to presentation.
- Integrate tools for AI-driven development.
- Identify and drive reliability improvement efforts across all engineering teams.

Our tech stack
We're building with tools that both humans and AI love. Lovable platform engineers are capable of working across the whole stack. Examples of tech in our stack include:
- Frontend: React and TypeScript.
- Backend: Golang and Rust.
- Cloud: Cloudflare, GCP, AWS, Modal.
- Data: ClickHouse, Firestore, Spanner, BigQuery.
- DevOps Tooling: CI/CD pipelines, OTEL, Grafana, Kubernetes, Terraform.
- Local Tooling: Nix, DevEnv.
And we're always on the lookout for what's next!

About your application
Please submit your application in English - it's our company language, so you'll be speaking lots of it if you join. We treat all candidates equally - if you're interested, please apply through our careers portal.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Hidden link
BJAK.jpg

MLOps Engineer

Bjak
-
MY.svg
Malaysia
Full-time
Remote
false
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that deliver real-world impact and the broadest possible usage worldwide. This role is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You'll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.

What You'll Do
Run and manage open-source models efficiently, optimizing for cost and reliability
Ensure high performance and stability across GPU, CPU, and memory resources
Monitor and troubleshoot model inference to maintain low latency and high throughput
Collaborate with engineers to implement scalable and reliable model serving solutions

What It's Like
You like ownership and independence
You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
You see feedback and failure as part of growth; you're here to level up
You possess humility, hunger, and hustle, and lift others up as you go

Requirements
Experience with model serving platforms such as vLLM or HuggingFace TGI
Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
Ability to monitor latency and costs, and to scale systems efficiently with traffic demands
Experience setting up inference endpoints for backend engineers

What You'll Get
Flat structure and real ownership
Full involvement in direction and consensus decision-making
Flexibility in work arrangement
High-impact role with visibility across product, data, and engineering
Top-of-market compensation and performance-based bonuses
Global exposure to product development
Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
Health, dental, and vision insurance
Global travel insurance (for you and your dependents)
Unlimited, flexible time off

Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems.
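The "monitor latency... maintain low latency and high throughput" requirement above typically comes down to tracking tail percentiles of per-request inference latency, since averages hide slow outliers. A minimal nearest-rank sketch (stdlib only; the function and sample data are illustrative, not a BJAK API):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    if not samples:
        raise ValueError("no samples")
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Simulated per-request inference latencies in milliseconds.
latencies = [12, 15, 11, 14, 250, 13, 16, 12, 14, 13]

p50 = percentile(latencies, 50)   # median looks healthy
p99 = percentile(latencies, 99)   # one slow request dominates the tail
print(p50, p99)  # → 13 250
```

The gap between p50 and p99 here is exactly the signal that triggers troubleshooting of a serving stack.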
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
BJAK.jpg

MLOps Engineer

Bjak
-
KR.svg
South Korea
Full-time
Remote
false
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that deliver real-world impact and the broadest possible usage worldwide. This role is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You'll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.

What You'll Do
Run and manage open-source models efficiently, optimizing for cost and reliability
Ensure high performance and stability across GPU, CPU, and memory resources
Monitor and troubleshoot model inference to maintain low latency and high throughput
Collaborate with engineers to implement scalable and reliable model serving solutions

What It's Like
You like ownership and independence
You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
You see feedback and failure as part of growth; you're here to level up
You possess humility, hunger, and hustle, and lift others up as you go

Requirements
Experience with model serving platforms such as vLLM or HuggingFace TGI
Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
Ability to monitor latency and costs, and to scale systems efficiently with traffic demands
Experience setting up inference endpoints for backend engineers

What You'll Get
Flat structure and real ownership
Full involvement in direction and consensus decision-making
Flexibility in work arrangement
High-impact role with visibility across product, data, and engineering
Top-of-market compensation and performance-based bonuses
Global exposure to product development
Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
Health, dental, and vision insurance
Global travel insurance (for you and your dependents)
Unlimited, flexible time off

Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you're excited to build real-world AI systems and grow fast in a high-impact environment, we'd love to hear from you.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
BJAK.jpg

MLOps Engineer

Bjak
-
JP.svg
Japan
Full-time
Remote
false
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that deliver real-world impact and the broadest possible usage worldwide. This role is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You'll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.

What You'll Do
Run and manage open-source models efficiently, optimizing for cost and reliability
Ensure high performance and stability across GPU, CPU, and memory resources
Monitor and troubleshoot model inference to maintain low latency and high throughput
Collaborate with engineers to implement scalable and reliable model serving solutions

What It's Like
You like ownership and independence
You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
You see feedback and failure as part of growth; you're here to level up
You possess humility, hunger, and hustle, and lift others up as you go

Requirements
Experience with model serving platforms such as vLLM or HuggingFace TGI
Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
Ability to monitor latency and costs, and to scale systems efficiently with traffic demands
Experience setting up inference endpoints for backend engineers

What You'll Get
Flat structure and real ownership
Full involvement in direction and consensus decision-making
Flexibility in work arrangement
High-impact role with visibility across product, data, and engineering
Top-of-market compensation and performance-based bonuses
Global exposure to product development
Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
Health, dental, and vision insurance
Global travel insurance (for you and your dependents)
Unlimited, flexible time off

Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you're excited to build real-world AI systems and grow fast in a high-impact environment, we'd love to hear from you.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
BJAK.jpg

MLOps Engineer

Bjak
-
Indonesia
Full-time
Remote
true
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that deliver real-world impact and the broadest possible usage worldwide. This is a remote role based in Indonesia, working closely with our HQ in Malaysia and cross-functional regional teams. You'll operate across the stack, from backend logic and integration to frontend delivery, building intelligent systems that scale fast and matter deeply.

Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.

What You'll Do
Run and manage open-source models efficiently, optimizing for cost and reliability
Ensure high performance and stability across GPU, CPU, and memory resources
Monitor and troubleshoot model inference to maintain low latency and high throughput
Collaborate with engineers to implement scalable and reliable model serving solutions

What It's Like
You like ownership and independence
You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
You see feedback and failure as part of growth; you're here to level up
You possess humility, hunger, and hustle, and lift others up as you go

Requirements
Experience with model serving platforms such as vLLM or HuggingFace TGI
Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
Ability to monitor latency and costs, and to scale systems efficiently with traffic demands
Experience setting up inference endpoints for backend engineers

What You'll Get
Flat structure and real ownership
Full involvement in direction and consensus decision-making
Flexibility in work arrangement
High-impact role with visibility across product, data, and engineering
Top-of-market compensation and performance-based bonuses
Global exposure to product development
Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
Health, dental, and vision insurance
Global travel insurance (for you and your dependents)
Unlimited, flexible time off

Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you're excited to build real-world AI systems and grow fast in a high-impact environment, we'd love to hear from you.
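The "scale systems efficiently with traffic demands" requirement above is, in practice, a proportional scaling decision like the one Kubernetes' Horizontal Pod Autoscaler makes: desired replicas = ceil(observed load / per-replica target). A stdlib sketch under assumed numbers (the function name and rps targets are illustrative, not a real system's API):

```python
import math

def desired_replicas(current_replicas, observed_rps, target_rps_per_replica,
                     min_replicas=1, max_replicas=20):
    """HPA-style scaling decision: size the fleet proportionally to load,
    clamped to configured bounds."""
    if observed_rps <= 0:
        return min_replicas
    desired = math.ceil(observed_rps / target_rps_per_replica)
    return max(min_replicas, min(max_replicas, desired))

# Traffic doubles: 3 replicas serving 300 rps at a 50 rps/replica target.
print(desired_replicas(3, 300, 50))  # → 6
print(desired_replicas(3, 10, 50))   # → 1 (scale in, floored at min)
```

Real autoscalers add stabilization windows and cooldowns on top of this formula so the fleet does not thrash on noisy traffic.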
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
BJAK.jpg

MLOps Engineer (HK)

Bjak
-
HK.svg
Hong Kong
Full-time
Remote
false
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that deliver real-world impact and the broadest possible usage worldwide. This role is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You'll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.

What You'll Do
Run and manage open-source models efficiently, optimizing for cost and reliability
Ensure high performance and stability across GPU, CPU, and memory resources
Monitor and troubleshoot model inference to maintain low latency and high throughput
Collaborate with engineers to implement scalable and reliable model serving solutions

What It's Like
You like ownership and independence
You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
You see feedback and failure as part of growth; you're here to level up
You possess humility, hunger, and hustle, and lift others up as you go

Requirements
Experience with model serving platforms such as vLLM or HuggingFace TGI
Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
Ability to monitor latency and costs, and to scale systems efficiently with traffic demands
Experience setting up inference endpoints for backend engineers

What You'll Get
Flat structure and real ownership
Full involvement in direction and consensus decision-making
Flexibility in work arrangement
High-impact role with visibility across product, data, and engineering
Top-of-market compensation and performance-based bonuses
Global exposure to product development
Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
Health, dental, and vision insurance
Global travel insurance (for you and your dependents)
Unlimited, flexible time off

Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you're excited to build real-world AI systems and grow fast in a high-impact environment, we'd love to hear from you.
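A key throughput idea behind the model serving platforms this posting names (vLLM, HuggingFace TGI) is batching many requests into one forward pass. The sketch below shows only the size-based grouping step and is a toy, not how vLLM actually works internally (vLLM does continuous batching with scheduling and timeouts); the function name is illustrative.

```python
from itertools import islice

def microbatches(requests, max_batch_size):
    """Group a stream of incoming requests into batches, each of which
    would be served by a single model forward pass."""
    it = iter(requests)
    while True:
        batch = list(islice(it, max_batch_size))
        if not batch:
            return
        yield batch

# Seven queued requests served in batches of up to three.
print(list(microbatches(range(7), 3)))  # → [[0, 1, 2], [3, 4, 5], [6]]
```

Larger batches raise GPU utilization and throughput at the cost of per-request latency, which is exactly the trade-off an MLOps engineer tunes against traffic.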
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
BJAK.jpg

MLOps Engineer

Bjak
-
CN.svg
China
Full-time
Remote
false
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that deliver real-world impact and the broadest possible usage worldwide. This role is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You'll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.

What You'll Do
Run and manage open-source models efficiently, optimizing for cost and reliability
Ensure high performance and stability across GPU, CPU, and memory resources
Monitor and troubleshoot model inference to maintain low latency and high throughput
Collaborate with engineers to implement scalable and reliable model serving solutions

What It's Like
You like ownership and independence
You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
You see feedback and failure as part of growth; you're here to level up
You possess humility, hunger, and hustle, and lift others up as you go

Requirements
Experience with model serving platforms such as vLLM or HuggingFace TGI
Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
Ability to monitor latency and costs, and to scale systems efficiently with traffic demands
Experience setting up inference endpoints for backend engineers

What You'll Get
Flat structure and real ownership
Full involvement in direction and consensus decision-making
Flexibility in work arrangement
High-impact role with visibility across product, data, and engineering
Top-of-market compensation and performance-based bonuses
Global exposure to product development
Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
Health, dental, and vision insurance
Global travel insurance (for you and your dependents)
Unlimited, flexible time off

Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you're excited to build real-world AI systems and grow fast in a high-impact environment, we'd love to hear from you.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
Decagon.jpg

Senior Software Engineer, Infrastructure

Decagon
USD
0
200000
-
375000
US.svg
United States
Full-time
Remote
false
About Decagon
Decagon is the leading conversational AI platform empowering every brand to deliver a concierge customer experience. Our AI agents provide intelligent, human-like responses across chat, email, and voice, resolving millions of customer inquiries in every language, at any time.

Since coming out of stealth, Decagon has experienced rapid growth. We partner with industry leaders like Hertz, Eventbrite, Duolingo, Oura, Bilt, Curology, and Samsara to redefine customer experience at scale. We've raised over $200M from Bain Capital Ventures, Accel, a16z, BOND Capital, A*, Elad Gil, and notable angels such as the founders of Box, Airtable, Rippling, Okta, Lattice, and Klaviyo.

We're an in-office company, driven by a shared commitment to excellence and velocity. Our values (customers are everything, relentless momentum, winner's mindset, and stronger together) shape how we work and grow as a team.

About the Team
The Infrastructure team builds and operates the foundations that power Decagon: networking, data, ML serving, developer platform, and real-time voice. We partner closely with product, data, and ML to deliver high-scale, low-latency systems with clear SLOs and great developer ergonomics.

We organize around five focus areas:
Core Infra: the foundational cloud stack (networking, compute, storage, security, and infrastructure-as-code) ensuring reliability, scale, and cost efficiency.
Data Infra: streaming/batch data platforms powering analytics/BI and customer-facing telemetry, including for customer-managed and on-prem environments.
ML Infra: GPU and model-serving platforms for LLM inference with multi-provider routing and support for on-prem/air-gapped deployments.
Platform (DevEx): CI/CD, paved paths, and core services that make shipping fast, safe, and consistent across teams.
Voice Infra: telephony/WebRTC stack and observability enabling ultra-low-latency, high-quality voice experiences.

Our mission is to deliver magical support experiences: AI agents working alongside humans to resolve issues quickly and accurately.

About the Role
We're hiring a Senior Infrastructure Engineer to design, build, and operate production infrastructure for high-scale, low-latency systems. You'll own critical services end-to-end, improve reliability and performance, and create paved paths that let every Decagon engineer ship confidently.

In this role, you will
Design and implement critical infrastructure services with strong SLOs, clear runbooks, and actionable telemetry.
Partner with research and product teams to architect solutions, set up prototypes, evaluate performance, and scale new features.
Tune service latencies: optimize networking paths, apply smart caching/queuing, and tune CPU/memory/I/O for tight p95/p99s.
Evolve CI/CD, golden paths, and self-service tooling to improve developer velocity and safety.
Support varied customer deployment architectures with robust observability and upgrade paths.
Lead infrastructure-as-code (Terraform) and GitOps practices; reduce drift with reusable modules and policy-as-code.
Participate in on-call and drive down toil through automation and elimination of recurring issues.

Your background looks something like this
6+ years building and operating production infrastructure at scale.
Depth in at least one area across Core/Data/AI-ML/Platform/Voice, with curiosity to learn the rest.
Proven track record of meeting high-availability and low-latency targets (owning SLOs, p95/p99, and load testing).
Excellent observability chops (OpenTelemetry, Prometheus/Grafana, Datadog) and incident response (PagerDuty, SLO/error budgets).
Clear written communication and the ability to turn ambiguous requirements into simple, reliable designs.

Even better
Experience as an early backend/platform/infrastructure engineer at another company.
Strong Kubernetes experience (GKE/EKS/AKS) and experience across multiple cloud providers (GCP, AWS, and Azure).
Experience with customer-managed deployments.

Benefits
Medical, dental, and vision
Flexible time off
Daily lunch/dinner and snacks in the office
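The "SLO/error budgets" mentioned above reduce to simple arithmetic: a request-based SLO of 99.9% allows 0.1% of requests to fail, and incident response is prioritized by how much of that allowance has been spent. A stdlib sketch with assumed numbers (the function name and figures are illustrative, not Decagon's):

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent for a request-based SLO.

    The budget is the number of failures the SLO permits over the window:
    (1 - slo_target) * total_requests.
    """
    budget = (1 - slo_target) * total_requests  # allowed failures
    if budget <= 0:
        return 0.0
    return max(0.0, 1 - failed_requests / budget)

# A 99.9% SLO over 1,000,000 requests permits 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")  # → 75%
```

When the remaining budget approaches zero, teams typically freeze risky releases and spend engineering time on reliability instead, which is the policy lever that makes SLOs useful.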
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Hidden link
Figure.jpg

Systems Integration Engineer – Head Subsystem

Figure AI
USD
150000
-
350000
US.svg
United States
Full-time
Remote
false
Figure is an AI robotics company developing autonomous general-purpose humanoid robots. The goal of the company is to ship humanoid robots with human-level intelligence. Its robots are engineered to perform a variety of tasks in the home and commercial markets. Figure is headquartered in San Jose, CA. Figure's vision is to deploy autonomous humanoids at a global scale.

Our Helix team is looking for an experienced Training Infrastructure Engineer to take our infrastructure to the next level. This role is focused on managing the training cluster and implementing distributed training algorithms, data loaders, and developer tools for AI researchers. The ideal candidate has experience building tools and infrastructure for a large-scale deep learning system.

Responsibilities
Design, deploy, and maintain Figure's training clusters
Architect and maintain scalable deep learning frameworks for training on massive robot datasets
Work with AI researchers to implement training of new model architectures at large scale
Implement distributed training and parallelization strategies to reduce model development cycles
Implement tooling for data processing, model experimentation, and continuous integration

Requirements
Strong software engineering fundamentals
Bachelor's or Master's degree in Computer Science, Robotics, Engineering, or a related field
Experience with Python and PyTorch
Experience managing HPC clusters for deep neural network training
Minimum of 4 years of professional, full-time experience building reliable backend systems

Bonus Qualifications
Experience managing cloud infrastructure (AWS, Azure, GCP)
Experience with job scheduling/orchestration tools (SLURM, Kubernetes, LSF, etc.)
Experience with configuration management tools (Ansible, Terraform, Puppet, Chef, etc.)

The US base salary range for this full-time position is between $150,000 and $350,000 annually.
The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended.
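The "distributed training and parallelization strategies" in this posting usually mean data parallelism: each worker computes gradients on its own shard of the batch, then the gradients are averaged across workers (the all-reduce step) before the optimizer update. A toy single-process sketch of just the averaging step, illustrative only (real systems use PyTorch DDP over NCCL; the function name here is hypothetical):

```python
def allreduce_mean(worker_grads):
    """Average per-parameter gradients across workers, i.e. the all-reduce
    step of data-parallel training, done naively in one process."""
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n_workers
            for i in range(n_params)]

# Three workers, each holding gradients for two parameters.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(allreduce_mean(grads))  # → [3.0, 4.0]
```

Because every worker applies the same averaged gradient, all replicas stay in sync, which is why data parallelism scales training throughput nearly linearly until communication becomes the bottleneck.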
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Hidden link
Figure.jpg

Validation Engineer – Mechanical Systems

Figure AI
USD
150000
-
350000
US.svg
United States
Full-time
Remote
false
Figure is an AI robotics company developing autonomous general-purpose humanoid robots. The goal of the company is to ship humanoid robots with human-level intelligence. Its robots are engineered to perform a variety of tasks in the home and commercial markets. Figure is headquartered in San Jose, CA. Figure's vision is to deploy autonomous humanoids at a global scale.

Our Helix team is looking for an experienced Training Infrastructure Engineer to take our infrastructure to the next level. This role is focused on managing the training cluster and implementing distributed training algorithms, data loaders, and developer tools for AI researchers. The ideal candidate has experience building tools and infrastructure for a large-scale deep learning system.

Responsibilities
Design, deploy, and maintain Figure's training clusters
Architect and maintain scalable deep learning frameworks for training on massive robot datasets
Work with AI researchers to implement training of new model architectures at large scale
Implement distributed training and parallelization strategies to reduce model development cycles
Implement tooling for data processing, model experimentation, and continuous integration

Requirements
Strong software engineering fundamentals
Bachelor's or Master's degree in Computer Science, Robotics, Engineering, or a related field
Experience with Python and PyTorch
Experience managing HPC clusters for deep neural network training
Minimum of 4 years of professional, full-time experience building reliable backend systems

Bonus Qualifications
Experience managing cloud infrastructure (AWS, Azure, GCP)
Experience with job scheduling/orchestration tools (SLURM, Kubernetes, LSF, etc.)
Experience with configuration management tools (Ansible, Terraform, Puppet, Chef, etc.)

The US base salary range for this full-time position is between $150,000 and $350,000 annually.
The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended.
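The "distributed training and parallelization strategies" this posting mentions most commonly mean data parallelism: each worker computes gradients on its own data shard, and the per-parameter gradients are averaged (an all-reduce) before every optimizer step. A minimal pure-Python sketch of that averaging step, with made-up gradient values for illustration (not Figure's actual pipeline, which the posting says is built on PyTorch):

```python
def allreduce_mean(worker_grads):
    """Average per-parameter gradients across workers (data parallelism).

    worker_grads: one gradient vector per worker, each a list of floats
    of equal length. Returns the element-wise mean, which every worker
    would then apply in its optimizer step so all replicas stay in sync.
    """
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [
        sum(g[i] for g in worker_grads) / n_workers
        for i in range(n_params)
    ]

# Three workers, each holding gradients for the same two parameters.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(allreduce_mean(grads))  # [3.0, 4.0]
```

In a real cluster this averaging is done by a collective-communication library (e.g. NCCL behind PyTorch's DistributedDataParallel) rather than in Python, but the arithmetic is the same.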
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
Crusoe.jpg

Senior Site Reliability Engineer, Compute

Crusoe
USD
0
172000
-
209000
US.svg
United States
Full-time
Remote
false
Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI, without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.

About This Role:
At Crusoe, we are building the most sustainable, AI-first cloud infrastructure, and our Compute-focused Site Reliability Engineers are the backbone of that mission. This role is centered on supporting virtualization, hypervisor, and kernel-level performance for Crusoe's compute infrastructure. You'll play a vital role in deploying and optimizing bare-metal and virtualized compute platforms, ensuring performance, security, and scale for modern AI and HPC workloads.

What You'll Be Working On:
In this role, you will develop automation and observability tools to monitor Crusoe's compute infrastructure, spanning from the kernel to the orchestration layers. You will support and scale the company's virtualization stack, including technologies such as KVM, QEMU, and other hypervisors. Collaborating with Linux kernel and hardware teams, you'll help identify and resolve performance bottlenecks and driver issues, and optimize hardware offloads. A key focus will be on optimizing performance for AI and HPC workloads across CPU, GPU, and DPU/NIC resources. You will participate in root-cause analysis for kernel crashes, hardware-software integration problems, and performance regressions, while also integrating hypervisor-level enhancements to improve guest VM reliability and workload isolation. The role involves tuning kernel subsystems such as the process scheduler, NUMA configuration, memory management, and interrupt handling. Additionally, you will work closely with platform teams to implement and validate support for emerging compute hardware, including SmartNICs, BlueField devices, and TPUs.

What You'll Bring to the Team:
- 8+ years of professional experience in compute SRE, Linux system engineering, or compute infrastructure roles
- Strong proficiency in Linux kernel internals, with exposure to the scheduler, memory allocation, and driver subsystems
- Experience with virtualization architectures and technologies such as KVM, Xen, QEMU, or VMware
- Familiarity with SmartNICs/DPUs (e.g., NVIDIA CX6/7, BlueField-3) and kernel-bypass techniques
- Expert-level skills in at least one programming language: Go, C, or Rust
- Experience with system-level debugging, including kdump, kexec, and kernel panic analysis
- Proficiency in Infrastructure as Code tooling and CI/CD practices for bare-metal or cloud infrastructure
- Strong understanding of compute scheduling, resource management, and high-throughput networking

Benefits:
- Industry-competitive pay
- Restricted Stock Units in a fast-growing, well-funded technology company
- Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
- Employer contributions to HSA accounts
- Paid parental leave
- Paid life insurance, short-term and long-term disability
- Teladoc
- 401(k) with a 100% match up to 4% of salary
- Generous paid time off and holiday schedule
- Cell phone reimbursement
- Tuition reimbursement
- Subscription to the Calm app
- MetLife Legal
- Company-paid commuter benefit: $300/month

Compensation Range:
Compensation will be paid in the range of $172,000 - $209,000 a year + bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer.
Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
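The NUMA-configuration work this role describes often starts with a simple placement question: which NUMA node should each guest VM's memory and vCPUs be bound to, so no node is oversubscribed before the others are used? A hypothetical round-robin sketch (VM names and node counts are invented for illustration; real placement would also weigh memory pressure and device locality):

```python
def assign_numa_nodes(vm_names, numa_node_count):
    """Spread VMs across NUMA nodes round-robin.

    Returns a dict of vm name -> node id, which could then drive
    pinning (e.g. via numactl or libvirt <numatune> settings) so each
    guest's memory and vCPUs stay local to one node, avoiding costly
    cross-node memory access.
    """
    return {vm: i % numa_node_count for i, vm in enumerate(vm_names)}

placement = assign_numa_nodes(["vm-a", "vm-b", "vm-c", "vm-d", "vm-e"], 2)
print(placement)  # {'vm-a': 0, 'vm-b': 1, 'vm-c': 0, 'vm-d': 1, 'vm-e': 0}
```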
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Ema.jpg

Senior Infrastructure Engineer

Ema
-
IN.svg
India
Full-time
Remote
false
Who we are
Ema is building the next generation of AI technology to empower every employee in the enterprise to be their most creative and productive. Our proprietary tech allows enterprises to delegate most repetitive tasks to Ema, the AI employee. We are founded by ex-Google, Coinbase, and Okta executives and serial entrepreneurs. We've raised capital from notable investors such as Accel Partners, Naspers, Section32, and a host of prominent Silicon Valley angels, including Sheryl Sandberg (Facebook/Google), Divesh Makan (Iconiq Capital), Jerry Yang (Yahoo), Dustin Moskovitz (Facebook/Asana), David Baszucki (Roblox CEO), and Gokul Rajaram (Doordash, Square, Google).

Our team is a powerhouse of talent, comprising engineers from leading tech companies like Google, Microsoft Research, Facebook, Square/Block, and Coinbase. All our team members hail from top-tier educational institutions such as Stanford, MIT, UC Berkeley, CMU, and the Indian Institute of Technology. We're well funded by top investors and angels. Ema is based in Silicon Valley and Bangalore, India. This will be a hybrid role where we expect employees to work from the office three days a week.

Who you are
We are seeking an experienced Infrastructure Engineer to join our growing team and play a pivotal role in designing and building our platform and infrastructure as we continue to scale our product and user base. As part of our team, you will work in a dynamic, fast-paced environment to ensure the reliability, scalability, and performance of our systems, focusing on service architecture and deployment, query optimization, distributed systems, data and machine learning infrastructure, and security and authentication.
Most importantly, you are excited to be part of a mission-oriented, fast-paced, high-growth startup that can create a lasting impact.

You will:
- Partner with product, infra, and engineering teams to architect and build Ema's next-generation infrastructure platform supporting multi-cloud deployments and on-prem installations
- Design and implement scalable, secure, and resilient deployment frameworks for Ema SaaS and enterprise on-prem environments, enabling automated installation, upgrades, and lifecycle management of Ema
- Develop and maintain multi-cloud infrastructure pipelines (AWS, Azure, GCP) using Kubernetes, Helm, Terraform, and cloud-native services to ensure seamless and reliable deployments
- Build tools and frameworks to automate the provisioning, configuration, monitoring, and upgrade of Ema environments at scale
- Design and optimize CI/CD pipelines (GitHub Actions, Cloud Build, etc.) to streamline the release process across environments while improving developer experience
- Contribute code and automation scripts (in Python, Go, or Shell) to strengthen infrastructure management and deployment reliability
- Ensure observability, scalability, and security across distributed systems by integrating monitoring, logging, and alerting solutions
- Collaborate cross-functionally to evolve Ema's infra architecture, enabling faster deployments, lower operational overhead, and improved platform stability

Nice to Have:
- Experience designing installers and deployment managers for both SaaS and air-gapped on-prem environments
- Strong understanding of container orchestration (Kubernetes) and infrastructure as code (Terraform, Helm)
- Hands-on experience with automation frameworks (Ansible, ArgoCD, Flux, or similar)
- Knowledge of service mesh and networking for multi-cloud environments (Istio, Envoy, or similar)
- Familiarity with monitoring and observability stacks (Prometheus, Grafana, SigNoz, PagerDuty)
- Prior experience in MLOps or data infrastructure is a plus
- Proficiency in Python or Go
- Exposure to air-gapped deployments, private clouds, or secure enterprise installations

Qualifications:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field
- 5+ years of hands-on experience in infrastructure, platform, or DevOps engineering, with strong exposure to multi-cloud environments
- Strong analytical and problem-solving skills, with a focus on scalability, reliability, and performance
- Demonstrated ability to work independently and collaboratively in a fast-paced, high-growth environment
- Experience working with global, cross-functional teams across time zones

Ema Unlimited is an equal opportunity employer and is committed to providing equal employment opportunities to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability, sexual orientation, gender identity, or genetics.
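Deployment frameworks that target both SaaS and air-gapped on-prem environments, as described above, usually reduce to layering a small per-environment override onto a common base configuration, the same precedence model Helm uses for values files. A hypothetical sketch (the keys and environment values are invented, not Ema's actual schema):

```python
def render_config(base, override):
    """Shallow-merge an environment override onto a base config.

    Mimics how Helm layers values files: keys present in the override
    win, everything else falls through from the base. A real installer
    would merge recursively and validate the result against a schema.
    """
    merged = dict(base)
    merged.update(override)
    return merged

# Common defaults, then an air-gapped override that swaps the image
# registry for an internal mirror and disables outbound telemetry.
base = {"replicas": 2, "registry": "public.example.com", "telemetry": True}
airgapped = {"registry": "registry.internal", "telemetry": False}
print(render_config(base, airgapped))
```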
MLOps / DevOps Engineer
Data Science & Analytics
Apply
OpenAI.jpg

Site Reliability Engineer, Frontier Systems Infrastructure

OpenAI
USD
255000
-
490000
US.svg
United States
Full-time
Remote
false
About the Team
The Frontier Systems team at OpenAI builds, launches, and supports the largest supercomputers in the world, which OpenAI uses for its most cutting-edge model training. We take data center designs, turn them into real, working systems, and build any software needed for running large-scale frontier model training. Our mission is to bring up, stabilize, and keep these hyperscale supercomputers reliable and efficient during the training of frontier models.

About the Role
We are looking for engineers to operate the next generation of compute clusters that power OpenAI's frontier research. This role blends distributed systems engineering with hands-on infrastructure work on our largest datacenters. You will scale Kubernetes clusters to massive size, automate bare-metal bring-up, and build the software layer that hides the complexity of a multitude of nodes across multiple data centers. You will work at the intersection of hardware and software, where speed and reliability are critical.
Expect to manage fast-moving operations, quickly diagnose and fix issues when things are on fire, and continuously raise the bar for automation and uptime.

In this role, you will:
- Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
- Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
- Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
- Improve operational metrics, such as reducing cluster restart times (e.g., from hours to minutes) and accelerating firmware or OS upgrade cycles
- Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure
- Develop monitoring and observability systems to detect issues early and keep clusters stable under extreme load
- Be expected to execute at the same level as a software engineer

You might thrive in this role if you:
- Have deep experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environments
- Bring strong programming or scripting skills (Python, Go, or similar) and familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation
- Are comfortable with bare-metal Linux environments, GPU hardware, and large-scale networking
- Enjoy solving fast-moving, high-impact operational problems and building automation to eliminate manual work
- Can balance careful engineering with the urgency of keeping mission-critical systems running

Qualifications:
- Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
- Strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
- Proficiency in cloud infrastructure concepts (compute, networking, storage, security) and in automating cluster or data center operations
- Bonus: background with GPU workloads, firmware management, or high-performance computing

About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristics. For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement.

Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information.
In addition, job duties require access to secure and protected information technology systems and related data security obligations. To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance. We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
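Bare-metal bring-up of the kind this posting describes ("from bare metal through firmware upgrades") is commonly modeled as a per-node state machine, so automation can resume any node from the stage where it stalled instead of restarting from scratch. A hypothetical sketch (the stage names and their order are illustrative, not OpenAI's actual pipeline):

```python
# Ordered bring-up stages; each node advances one stage at a time.
STAGES = ["discovered", "firmware_updated", "os_installed",
          "joined_cluster", "ready"]

def advance(node_state):
    """Return the next bring-up stage for a node ('ready' is terminal).

    Unknown states raise ValueError so corrupt inventory data fails
    loudly instead of silently re-provisioning a healthy node.
    """
    i = STAGES.index(node_state)  # raises ValueError on unknown state
    return STAGES[min(i + 1, len(STAGES) - 1)]

# Drive one node through the whole pipeline.
state = "discovered"
while state != "ready":
    state = advance(state)
    print(state)
```

The payoff of this structure is idempotent automation: a controller can reconcile thousands of nodes by repeatedly calling the transition for whichever stage each node is actually in.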
MLOps / DevOps Engineer
Data Science & Analytics
Apply
Lambda.jpg

IT Systems Engineer, Infrastructure & Platform Reliability

Lambda AI
USD
0
206000
-
310000
US.svg
United States
Full-time
Remote
false
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda's mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco or San Jose office location 4 days per week; Lambda's designated work-from-home day is currently Tuesday.

Information Systems at Lambda is responsible for building and scaling the internal systems that power our business. We partner across the company, including Finance, GTM, Engineering, and People, to implement tools, automate workflows, and ensure data flows securely and accurately. Our scope includes enterprise applications, integrations, data platform and analytics, compliance automation, and all things IT.

What You'll Do:
- Design, write, and deliver software and services to improve the availability, scalability, reliability, and efficiency of Lambda's internal IT systems and platforms
- Solve problems relating to mission-critical services and build automation to prevent problem recurrence, with the goal of automating the response to all non-exceptional events
- Work with Lambda Engineering and internal teams to influence and create new designs, architectures, standards, and methods for large-scale distributed systems
- Engage in service capacity planning and demand forecasting, software performance analysis, and system tuning
- Be an excellent communicator, producing documentation and related artifacts for the systems you are responsible for

You:
- Have a keen interest in system design and architecting for performance and scalability, and experience with multiple cloud infrastructure platforms (AWS, GCP, Azure, etc.)
- Think carefully about systems: edge cases, failure modes, behaviors, and specific implementations
- Know and prefer configuration management systems and toolchains (Chef, Ansible, Terraform, GitHub Actions, etc.)
- Have solid programming skills: Python, Go, etc.
- Have an urge to collaborate and communicate asynchronously, combined with a desire to record and document issues and solutions
- Have an enthusiastic, go-for-it attitude; when you see something broken, you can't help but fix it
- Have an urge to deliver quickly and effectively, and to iterate fast

Nice to Have:
- Experience and interest in ML/AI workloads and compute
- Practical experience implementing and managing paging, alerting, and on-call scheduling flows
- A positive attitude, combined with a desire to learn and collaborate

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012; ~400 employees (2025) and growing fast
- We offer generous cash and equity compensation
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401(k) plan with 2% company match (USA employees)
- Flexible Paid Time Off plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer.
Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
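The "paging, alerting, and on-call scheduling flows" listed under nice-to-haves typically include deduplication: collapsing repeated firings of the same alert into a single page within a suppression window, so on-call engineers are not paged every minute for one ongoing incident. A hypothetical sketch (the (name, host) fingerprinting scheme and window length are invented for illustration):

```python
def dedupe_alerts(alerts, window_seconds=300):
    """Collapse repeated alerts into pages.

    `alerts` is a list of (timestamp, name, host) tuples sorted by
    timestamp. Only the first occurrence of each (name, host) pair
    inside the suppression window triggers a page; once the window
    elapses, the same alert pages again (the incident is still live).
    """
    last_paged = {}  # (name, host) -> timestamp of last page
    pages = []
    for ts, name, host in alerts:
        key = (name, host)
        if key not in last_paged or ts - last_paged[key] >= window_seconds:
            last_paged[key] = ts
            pages.append((ts, name, host))
    return pages

alerts = sorted([
    (0, "disk_full", "node1"),
    (60, "disk_full", "node1"),   # suppressed: inside the 300 s window
    (90, "disk_full", "node2"),   # different host, pages separately
    (400, "disk_full", "node1"),  # window elapsed, pages again
])
print(dedupe_alerts(alerts))
```

Commercial pagers (e.g. PagerDuty, mentioned elsewhere on this page) implement the same idea with configurable grouping keys rather than a hard-coded tuple.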
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply