Top MLOps / DevOps Engineer Jobs Openings in 2025

Looking for MLOps / DevOps Engineer opportunities? This curated list features the latest MLOps / DevOps Engineer job openings from AI-native companies. Whether you're an experienced professional or just entering the field, find roles that match your expertise, from startups to global tech leaders. Updated every day.


Engineering Manager, Core Services

Lambda AI
USD 297,000 – 495,000
United States
Full-time
Remote: No
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. If you'd like to build the world's best deep learning cloud, join us.

Note: This position requires presence in our San Francisco, San Jose, or Seattle office location 4 days per week; Lambda’s designated work-from-home day is currently Tuesday.

The Lambda Core Services team builds and operates release engineering, cloud automation, and workflow systems for our AI cloud product suite. We provide CI/CD tooling and artifact management to support the build/deploy process for our services. We also automate configuration of our AWS and other SaaS resources and manage AWS usage for all of Lambda engineering. Keeping the internal product and engineering teams moving quickly and delivering quality is what makes us tick. Along with the Platform Engineering organization, we help to build the foundations that unlock product excellence and a highly reliable experience for our customers.

About the Role
We are seeking a seasoned Engineering Manager with deep experience in both release engineering and the management of large-scale cloud deployments. You will hire and guide a team of platform engineers in building out critical pillars of our stack. You will lead the team in designing, deploying, scaling, and supporting these solutions. Your role is not just to manage people, but to coordinate the delivery of platform solutions to engineering customers within Lambda. This is a unique opportunity to work at the intersection of platform engineering and the rapidly evolving field of AI infrastructure.

What You’ll Do

Team Leadership & Management:
- Grow, hire, lead, and mentor a team of high-performing platform engineers and SREs.
- Foster a culture of technical excellence, collaboration, and customer service.
- Conduct regular one-on-one meetings, provide constructive feedback, and support career development for team members.
- Drive outcomes by managing project priorities, deadlines, and deliverables.

Technical Strategy & Execution:
- Work with the engineering team to drive strategy for internal CI/CD and cloud services.
- Develop self-service abstractions to make our platform tooling easier to adopt and use.
- Lead the broader engineering organization in best-practices adoption of CI/CD, workflow, and cloud services.
- Manage costs of both vendors and internally developed platforms.
- Lead the team in the continued development of our existing CI/CD solutions based on Buildkite and GitHub Actions.
- Lead the team in the expansion of our Terraform / Atlantis infrastructure automation platform.
- Guide Lambda engineering in utilization of AWS services in line with our technical standards.
- Guide the team in problem identification, requirements gathering, solution ideation, and stakeholder alignment on engineering RFCs.
- Identify gaps in our platform engineering posture and drive resolution.
- Lead the team in supporting our internal customers from across Lambda engineering.

Cross-Functional Collaboration:
- Work closely with Lambda product engineering teams on requirements and planning to meet their needs.
- Work to understand the needs of engineering teams and drive our platform solutions toward self-service.
- Manage a short list of vendors that provide SaaS solutions used at Lambda.

You

Experience:
- 7+ years of experience in either release engineering or platform engineering, with at least 3 years in a management or lead role.
- Demonstrated experience leading a team of engineers and SREs on complex, cross-functional projects in a fast-paced startup environment.
- Experience managing, monitoring, and scaling CI/CD platforms.
- Deep experience using and operating AWS services.
- Solid background in software engineering and the SDLC.
- Strong project management skills, leading planning, project execution, and delivery of team outcomes on schedule.
- Experience building a high-performance team through deliberate hiring, upskilling, performance management, and expectation setting.

Nice to Have:
- Experience driving cross-functional engineering management initiatives (coordinating events, strategic planning, coordinating large projects).
- Experience driving organizational improvements (processes, systems, etc.).
- Experience managing AWS service usage across a broader engineering organization.
- Experience in AWS spend management.
- Experience designing solutions using Temporal workflows; ability to act as an internal consultant for Temporal.
- Experience with Kubernetes.
- Experience designing scalable distributed systems.

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012; ~400 employees (2025) and growing fast.
- We offer generous cash & equity compensation.
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove.
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability.
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG.
- Health, dental, and vision coverage for you and your dependents.
- Wellness and commuter stipends for select roles.
- 401(k) plan with 2% company match (USA employees).
- Flexible paid time off plan that we all actually use.

A Final Note: You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering

Engineering Director, Technical Infrastructure

Tenstorrent
USD 100,000 – 500,000
United States
Canada
Full-time
Remote: No
Tenstorrent is leading the industry on cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. With AI redefining the computing paradigm, solutions must evolve to unify innovations in software models, compilers, platforms, networking, and semiconductors. Our diverse team of technologists has developed a high-performance RISC-V CPU from scratch, and shares a passion for AI and a deep desire to build the best AI platform possible. We value collaboration, curiosity, and a commitment to solving hard problems. We are growing our team and looking for contributors of all seniorities.

We are seeking an Engineering Director to lead our Technical Infrastructure organization, spanning Developer Infrastructure, IT, HPC platforms, and EDA environments. In this role, you will unify and scale the teams that power developer productivity, corporate IT, high-performance compute clusters, and the EDA workflows critical to silicon design. You will partner closely with engineering leads, product teams, hardware organizations, and executive leadership to deliver reliable, secure, and scalable infrastructure that enables innovation in open-source AI software and custom high-performance hardware. Success in this role requires driving alignment and progress across multiple teams and functions.

This role is hybrid, based out of Austin, TX; Santa Clara, CA; or Toronto, ON. We welcome candidates at various experience levels for this role. During the interview process, candidates will be assessed for the appropriate level, and offers will align with that level, which may differ from the one in this posting.

Who You Are
- Experienced leader of large, multi-disciplinary engineering teams.
- Background leading Developer Infrastructure, IT, HPC, EDA, or similar environments.
- Skilled at scaling organizations and building strong leadership layers.
- Known for working effectively across teams and functions to achieve shared goals.
- Able to balance strategic vision with hands-on technical execution.

What We Need
- Lead and grow the DevInfra, IT, HPC, and EDA infrastructure teams under a unified strategy.
- Define roadmaps for CI/CD, observability, orchestration, enterprise IT systems, and compute platforms.
- Build and mentor leaders within each team, ensuring accountability and clarity.
- Drive operational excellence, security, and reliability across developer, HPC, and EDA workflows.
- Partner cross-functionally with engineering, hardware, product, and operations leaders to align infrastructure investments with business goals.

What You Will Learn
- How to build and scale infrastructure at the cutting edge of AI software and custom silicon.
- How to drive alignment and momentum across diverse teams to deliver breakthrough results.
- Firsthand exposure to open-source AI software, advanced chip design workflows, and massive distributed compute systems.

Compensation for all engineers at Tenstorrent ranges from $100k to $500k, including base and variable compensation targets. Experience, skills, education, background, and location all impact the actual offer made. Tenstorrent offers a highly competitive compensation package and benefits, and we are an equal opportunity employer.

This offer of employment is contingent upon the applicant being eligible to access U.S. export-controlled technology. Due to U.S. export laws, including those codified in the U.S. Export Administration Regulations (EAR), the Company is required to ensure compliance with these laws when transferring technology to nationals of certain countries (such as EAR Country Groups D:1, E:1, and E:2). These requirements apply to persons located in the U.S. and all countries outside the U.S.

As the position offered will have direct and/or indirect access to information, systems, or technologies subject to these laws, the offer may be contingent upon your citizenship/permanent residency status or ability to obtain prior license approval from the U.S. Commerce Department or applicable federal agency. If employment is not possible due to U.S. export laws, any offer of employment will be rescinded.
MLOps / DevOps Engineer
Data Science & Analytics

DevOps Engineer, IPS

Scale AI
Remote: No
Scale’s rapidly growing International Public Sector team is focused on using AI to address critical challenges facing the public sector around the world. Our core work consists of:
- Creating custom AI applications that will impact millions of citizens
- Generating high-quality training data for national LLMs
- Upskilling and advisory services to spread the impact of AI

As a Software Engineer (Infrastructure), you will design and develop core platforms and software systems, while supporting orchestration, data abstraction, data pipelines, identity & access management, security tools, and underlying cloud infrastructure. At Scale, we’re not just building AI solutions; we’re enabling the public sector to transform their operations and better serve citizens through cutting-edge technology. If you’re ready to shape the future of AI in the public sector and be a founding member of our team, we’d love to hear from you.

You will:
- Backend Development and System Ownership: Design and implement secure, scalable backend systems for customers using modern, cloud-native AI infrastructure. Own services or systems, define long-term health goals, and improve the health of surrounding components.
- Collaboration and Standards: Collaborate with cross-functional teams to define and execute backend and infrastructure solutions tailored for secure environments. Enhance engineering standards, tooling, and processes to maintain high-quality outputs.
- Infrastructure Automation and Management: Write, maintain, and enhance Infrastructure as Code templates (e.g., Terraform, CloudFormation) for automated provisioning and management. Manage networking architecture, including secure VPCs, VPNs, load balancers, and firewalls, in cloud environments.
- Deployment and Scalability: Design and optimize CI/CD pipelines for efficient testing, building, and deployment processes. Scale and optimize containerized applications using orchestration platforms like Kubernetes to ensure high availability and reliability.
- Disaster Recovery and Hybrid Strategies: Develop and test disaster recovery plans with robust backups and failover mechanisms. Design and implement hybrid and multi-cloud strategies to support workloads across on-premises and multiple cloud providers.

Ideally you’d have:
- A strong engineering background, with a Bachelor’s degree in Computer Science, Mathematics, or a related quantitative field (or equivalent practical experience)
- 5+ years of post-graduation engineering experience, with a focus on back-end systems and proficiency in at least one of Python, TypeScript, JavaScript, or C++
- Extensive experience in software development and a deep understanding of distributed systems and public cloud platforms (AWS and Azure preferred)
- Track record of independent ownership of successful engineering projects
- Experience working fluently with standard containerization & deployment technologies like Kubernetes, Terraform, Docker, etc.
- Strong knowledge of software engineering best practices and CI/CD tooling (CircleCI, GitHub Actions)
- Solid foundation and real-world experience in network engineering

Nice to haves:
- Experience working cross-functionally with operations
- Experience building solutions with LLMs and a deep understanding of the overall Gen AI landscape
- Experience with data warehouses (Snowflake, Firebolt) and data pipeline/ETL tools (Dagster, dbt)
- Experience with authentication/authorization systems (Zanzibar, Authz, etc.)
- Experience with NoSQL document databases (MongoDB) and structured databases (Postgres)
- Experience with hybrid or on-prem systems
- Experience with orchestration platforms, such as Temporal and AWS Step Functions

PLEASE NOTE: Our policy requires a 90-day waiting period before reconsidering candidates for the same role. This allows us to ensure a fair and thorough evaluation of all applicants.

About Us:
At Scale, our mission is to develop reliable AI systems for the world's most important decisions. Our products provide the high-quality data and full-stack technologies that power the world's leading models, and help enterprises and governments build, deploy, and oversee AI applications that deliver real impact. We work closely with industry leaders like Meta, Cisco, DLA Piper, Mayo Clinic, Time Inc., the Government of Qatar, and U.S. government agencies including the Army and Air Force. We are expanding our team to accelerate the development of AI applications.

We believe that everyone should be able to bring their whole selves to work, which is why we are proud to be an inclusive and equal opportunity workplace. We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability status, gender identity, or Veteran status. We are committed to working with and providing reasonable accommodations to applicants with physical and mental disabilities. If you need assistance and/or a reasonable accommodation in the application or recruiting process due to a disability, please contact us at accommodations@scale.com. Please see the United States Department of Labor's Know Your Rights poster for additional information. We comply with the United States Department of Labor's Pay Transparency provision.

PLEASE NOTE: We collect, retain, and use personal data for our professional business purposes, including notifying you of job opportunities that may be of interest and sharing with our affiliates. We limit the personal data we collect to that which we believe is appropriate and necessary to manage applicants’ needs, provide our services, and comply with applicable laws. Any information we collect in connection with your application will be treated in accordance with our internal policies and programs designed to protect personal data. Please see our privacy policy for additional information.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering

Security Engineer

E2B
USD 170,000 – 210,000
United States
Full-time
Remote: No
🌁 Location: San Francisco, in-person only
💰 Salary: $170k – $210k annual salary + 📈 equity
💻 Languages: Go, Rust, TypeScript, C
✅ Skills: Linux security, container hardening, threat detection, compliance frameworks

👉 Who we are
E2B is a fast-growing Series A startup with 7-figure revenue. We've raised over $32M in total since our funding in 2023 and are supported by great investors like Insight Partners. Our customers are companies like Perplexity, Hugging Face, Manus, or Groq. We're building the next hyperscaler for AI agents.

👉 About the role
Your job will be to secure the infrastructure where billions of AI agents execute untrusted code daily. You'll be responsible for protecting companies' most sensitive AI workflows while maintaining our sub-200ms performance standards. You'll be building defense-in-depth security for our Firecracker microVM infrastructure, implementing real-time threat detection across tens of thousands of concurrent sandboxes, and ensuring compliance with enterprise security requirements. This role requires deep technical expertise in systems security and the ability to solve novel security challenges in the AI agent execution space.

👉 What we're looking for
- 5+ years of experience in production security infrastructure
- Comfort securing distributed systems at massive scale
- Deep expertise in Linux security primitives (seccomp, eBPF, namespaces)
- Excitement to work in person from San Francisco on cutting-edge AI infrastructure
- Experience with container security and microVM hardening
- Track record with compliance frameworks (SOC 2, ISO 27001, GDPR)
- Willingness to implement kernel-level security controls

If you join E2B, you'll get a lot of freedom. We expect you to be proactive and take ownership. You'll be securing infrastructure that protects millions of AI agents with the support of the rest of the team.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering

Engineering Lead, Inference Platform

Cerebras Systems
Canada
United States
Full-time
Remote: No
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.

Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services.

Location: Toronto / Sunnyvale

We're looking for a deeply technical, hands-on engineering leader for our Inference Service Platform. You will lead a high-performing team to tackle a critical challenge: scaling LLM inference on Cerebras’ advanced compute clusters and delivering a world-class, on-prem solution for enterprise customers. In this role, you’ll set the technical vision while staying close to the code, architecting highly reliable, low-latency distributed systems. If you have proven expertise in distributed systems and scaling modern model-serving frameworks, we want to hear from you.

Responsibilities
- Provide hands-on technical leadership, owning the technical vision and roadmap for the Cerebras Inference Platform, from internal scaling to on-prem customer solutions.
- Lead the end-to-end development of distributed inference systems, including request routing, autoscaling, and resource orchestration on Cerebras' unique hardware.
- Drive a culture of operational excellence, guaranteeing platform reliability (>99.9% uptime), performance, and efficiency.
- Lead, mentor, and grow a high-caliber team of engineers, fostering a culture of technical excellence and rapid execution.
- Productize the platform into an enterprise-ready, on-prem solution, collaborating closely with product, ops, and customer teams to ensure successful deployments.

Skills & Qualifications
- Technical Leadership: 6+ years in high-scale software engineering, with 3+ years leading distributed systems or ML infra teams; strong coding and review skills.
- Inference Expertise: Proven track record scaling LLM inference: optimizing latency (<100ms P99), throughput, batching, memory/IO efficiency, and resource utilization.
- ML Systems Knowledge: Expertise in distributed inference/training for modern LLMs; understanding of AI/ML ecosystems, including public clouds (AWS/GCP/Azure).
- Frameworks & Tools: Hands-on with model-serving frameworks (e.g., vLLM, TensorRT-LLM, Triton, or similar) and ML stacks (PyTorch, Hugging Face, SageMaker).
- Infrastructure: Deep experience with orchestration (Kubernetes/EKS, Slurm), large clusters, and low-latency networking.
- Operations & Monitoring: Strong background in monitoring and reliability engineering (Prometheus/Grafana, incident response, post-mortems).
- Leadership & Collaboration: Demonstrated ability to recruit and retain high-performing teams, mentor engineers, and partner cross-functionally to deliver customer-facing products.

Preferred Skills
- Experience with on-prem/private cloud deployments.
- Background in edge or streaming inference, multi-region systems, or security/privacy in AI.
- Customer-facing experience with enterprise deployments.

Why Join Cerebras
People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we've reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:
- Build a breakthrough AI platform beyond the constraints of the GPU.
- Publish and open-source their cutting-edge AI research.
- Work on one of the fastest AI supercomputers in the world.
- Enjoy job stability with startup vitality.
- Our simple, non-corporate work culture that respects individual beliefs.

Read our blog: Five Reasons to Join Cerebras in 2025. Apply today and become part of the forefront of groundbreaking advancements in AI!

Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth, and support of those around them. This website or its third-party tools process personal data. For more details, click here to review our CCPA disclosure notice.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Machine Learning Engineer

Platform Engineer

Console
USD 200,000 – 350,000
United States
Full-time
Remote: No
About Us
Console is an AI platform that automates IT and internal support. We help companies scale without scaling headcount, and give employees instant resolution to their issues. Our agents understand the full context of the organization, handle requests end-to-end, and pull in humans only when necessary.

Today, companies like Ramp, Scale, Webflow, and Flock Safety rely on Console to automate over half of their IT & HR requests. We've won every bake-off against our competitors, closed every trial customer, and expect to 10x usage by year-end.

We're a small, talent-dense team: naturally curious, high-agency, and low-ego. Our organization is very flat and ideas win on merit, not hierarchy. We're hiring exceptional people to keep up with demand. We're backed by Thrive Capital and world-class angels.

About the role
As a Platform Engineer at Console, you'll work on the core infrastructure and systems that will enable us to power support across hundreds of medium and large-scale enterprises. You'll work closely with the CTO to architect scalable enterprise-grade systems.

Some examples of work you might do:
- Architect, plan, and execute a self-hostable SKU of Console for enterprise customers.
- Execute zero-downtime production migrations while evolving our infrastructure to support a massive increase in customer load.
- Optimize build times and set up blue-green production deployment workflows.

You'll have broad license to shape the platform design and lead DevOps initiatives that accelerate the velocity of the whole team.

This role is based in San Francisco, CA. We work in person and offer relocation assistance to new employees.

About you
- You have hands-on experience designing and building foundational infrastructure zero-to-one.
- You have a deep understanding of AWS/GCP and the tradeoffs between various cloud offerings and architectures.
- You've worked on high-scale, production-grade systems.
- You care about developer experience and productivity; you like building systems that accelerate the whole team.

Requirements
- 5+ years of full-time experience in platform or infrastructure-facing engineering roles.
- Experience with AWS/GCP, IaC tools like Terraform/Pulumi, and relative comfort with our stack (TypeScript/Node/React).
- Passionate about building quality, reliable systems, and never happy with "good enough".

Why join Console?
- Product-market fit: We have built the leading product in our category, in a massive market. We've hit an inflection point and are on track to build a generational company.
- World-class team: We seek high-agency contributors who are comfortable navigating ambiguity, ruthlessly prioritize what matters, and are action-biased.
- Grow with us: We reward impact, not credentials or years of experience. We intend to grow talent from within as we scale up.
- Competitive pay and benefits: top compensation with full benefits, including equity with early exercise & QSBS eligibility; comprehensive health, dental, and vision insurance; unlimited PTO; 401(k); and meals provided daily in office.
MLOps / DevOps Engineer
Data Science & Analytics

Application Security Engineer

Glean Work
India
Full-time
Remote: No
About Glean:
Founded in 2019, Glean is an innovative AI-powered knowledge management platform designed to help organizations quickly find, organize, and share information across their teams. By integrating seamlessly with tools like Google Drive, Slack, and Microsoft Teams, Glean ensures employees can access the right knowledge at the right time, boosting productivity and collaboration. The company’s cutting-edge AI technology simplifies knowledge discovery, making it faster and more efficient for teams to leverage their collective intelligence.

Glean was born from Founder & CEO Arvind Jain’s deep understanding of the challenges employees face in finding and understanding information at work. Seeing firsthand how fragmented knowledge and sprawling SaaS tools made it difficult to stay productive, he set out to build a better way: an AI-powered enterprise search platform that helps people quickly and intuitively access the information they need. Since then, Glean has evolved into the leading Work AI platform, combining enterprise-grade search, an AI assistant, and powerful application- and agent-building capabilities to fundamentally redefine how employees work.

About the Role:
Glean is looking for an Application Security Engineer with a primary focus on ensuring that our entire technology stack is free of software vulnerabilities (CVEs). This role is responsible for securing our base OS images, ensuring all open-source software (OSS) dependencies are scanned and patched, and integrating cutting-edge security tools into our CI/CD pipeline. The ideal candidate will drive the adoption of solutions like Google’s Assured Open Source Software (OSS) and explore alternative approaches to enhance software security.

You will:
- Implement and improve the vulnerability management lifecycle, ensuring our entire tech stack is free from known vulnerabilities/CVEs.
- Continuously scan, monitor, and patch OSS dependencies to mitigate supply chain risks and enforce best practices for dependency management.
- Work closely with engineering teams to integrate state-of-the-art SAST, DAST, and dependency scanning tools into the CI/CD pipeline to detect and remediate vulnerabilities early.
- Define and maintain best practices for secure coding to ensure all code developed by Glean engineers is free from vulnerabilities.
- Ensure a secure posture in the SDLC by securing designs, conducting secure code reviews, and penetration testing features.
- Develop automated security validation tests to enforce vulnerability-free deployments across the stack.
- Lead the adoption of, and if necessary develop, custom security solutions to manage and mitigate security risks at scale.
- Provide security guidance, training, and mentorship to engineering teams to foster a security-first culture at Glean.

About you:
- BA/BS in Computer Science, Cybersecurity, or a related field (or equivalent industry experience).
- 5+ years of experience in application security and vulnerability management.
- Deep understanding of software security vulnerabilities, including CVEs, OWASP Top 10, and supply chain risks.
- Deep understanding of security design principles, including but not limited to authentication, authorisation, RBAC, and database security.
- Experience with SAST, DAST, dependency scanning, and vulnerability management tools (e.g., Snyk, GitHub Dependabot, Trivy, Clair, Burp Suite, OWASP ZAP).
- Strong familiarity with package managers (npm, pip, Maven, Go modules) and securing open-source dependencies.
- Coding experience in languages such as Go, Python, Java, or C++ to develop security test cases and tooling.
- Hands-on experience with cloud-native security best practices across AWS, GCP, or Azure.
- Knowledge of container security, Kubernetes security, and securing microservices architectures.
- Ability to lead cross-functional initiatives and drive security adoption within engineering teams.
- A strong proactive approach to security, identifying risks before they become problems.
- Excellent problem-solving skills and the ability to balance security with performance and usability.
- Experience working in fast-paced, highly collaborative environments where security is a shared responsibility.
- Passion for open-source security and keeping up with the latest trends in software vulnerability management.

Location: This role is hybrid (3 days a week in our Bangalore office).

We are a diverse bunch of people and we want to continue to attract and retain a diverse range of people into our organization. We're committed to an inclusive and diverse company. We do not discriminate based on gender, ethnicity, sexual orientation, religion, civil or family status, age, disability, or race.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
moonvalley_ai_logo

Member of Technical Staff, Infrastructure & Data

Moonvalley
-
CA.svg
Canada
GB.svg
United Kingdom
US.svg
United States
Full-time
Remote
true
About Moonvalley
Moonvalley's mission is to solve Visual Intelligence in the age of generative AI. We are building technology that can tell stories, scale creativity, and understand both the physics and semantics of the world. With Marey, our first high-definition foundation model trained exclusively on licensed data, we are powering the next era of cinematic, commercial, and enterprise-grade creation. Our team is an unprecedented convergence of talent across industries. Our elite AI scientists from DeepMind, Google, Microsoft, Meta & Snap have decades of collective experience in machine learning and computational creativity. We have also established the first AI-enabled movie studio in Hollywood, filled with accomplished filmmakers and visionary creative talent. We work with the top producers, actors, and filmmakers in Hollywood as well as creative-driven global brands. So far we've raised over $100M from world-class investors including General Catalyst, Bessemer, Khosla Ventures & Y Combinator – and we're just getting started.
Job Summary
We're hiring an Infrastructure Engineer to design and maintain the systems that power Moonvalley's generative AI research and product development. You'll be joining at a pivotal moment, helping to define the foundations of our infrastructure as we train and deploy cutting-edge video foundation models. In this role, you'll work closely with researchers, engineers, and cross-functional partners to ensure our infrastructure is scalable, reliable, and efficient.
From managing GPU clusters to optimizing ETL pipelines, you'll be instrumental in ensuring the technical performance and productivity of our entire AI platform.
What you'll do
- Build, manage, and scale GPU infrastructure using tools like Kubernetes, Terraform, or Pulumi
- Maintain and optimize ETL pipelines using Spark, Ray, or Airflow
- Operate and improve our telemetry and monitoring stack (Datadog, Grafana, Weights & Biases)
- Manage CI/CD pipelines and development tooling (GitHub, PyTorch, Python)
- Track and optimize datasets, checkpoints, compute utilization, and related assets
- Automate repetitive tasks to improve efficiency and reduce friction across engineering workflows
- Participate in an on-call rotation to resolve infrastructure issues and ensure uptime
- Provide tooling, documentation, and support to accelerate internal engineering productivity
What we're looking for
- Strong generalist with experience managing large-scale, high-performance infrastructure
- Skilled in designing scalable systems for compute, data, and developer tooling
- Comfortable in high-urgency environments with the ability to prioritize for impact
- Familiar with infrastructure stacks for AI model training and experimentation
- Experienced with Kubernetes, Terraform/Pulumi, Spark/Ray, and observability tools
- Pragmatic problem-solver who favors automation and simplicity over complexity
- Open to using and contributing to open-source tooling when appropriate
- Bonus: experience as a Cluster Engineer, Data Engineer, or Developer Advocate in AI/ML environments
What we offer (compensation & benefits)
- Competitive salary and equity
- Private health coverage
- Pension contribution (UK, Canada, US)
- Unlimited paid vacation
- Fully-distributed, async-first culture
- Hardware setup of your choice
- Stipends for phone, internet, and meals
In our team, we approach our work with dedication similar to that of Olympic athletes. Anticipate occasional late nights and weekends dedicated to our mission.
We understand this level of commitment may not suit everyone, and we openly communicate this expectation. If you're motivated by deeply technical problems, a seemingly never-ending uphill battle, and the opportunity to build (and own) a generational technology company, we can give you what you're looking for. All business roles at Moonvalley are hybrid positions by default, with some fully remote depending on the job scope. We meet a few times every year, usually in London, UK or North America (LA, Toronto), as a company. If you're excited about the opportunity to work on cutting-edge AI technology and help shape the future of media and entertainment, we encourage you to apply. We look forward to hearing from you! The statements contained in this job description reflect general details as necessary to describe the principal functions of this job, the level of knowledge and skill typically required, and the scope of responsibility. It should not be considered an all-inclusive listing of work requirements. Individuals may perform other duties as assigned, including work in other functional areas to cover absences, to equalize peak work periods, or to otherwise balance organizational work. Moonvalley AI is proud to be an equal opportunity employer. We are committed to providing accommodations. If you require accommodation, we will work with you to meet your needs. Please be assured we'll treat any information you share with us with the utmost care, only use your information for recruitment purposes, and never sell it to other companies for marketing purposes. Please review our privacy policy and job applicant privacy policy located here for further information.
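The "track and optimize compute utilization" responsibility above can be illustrated with a small, hypothetical rollup: aggregate per-node GPU utilization samples into a fleet-wide figure and flag idle nodes. The node names and the 30% threshold are invented for the example, not Moonvalley's setup.

```python
# Hypothetical utilization rollup. Samples are GPU utilization fractions
# (0-1); node names and the 0.3 threshold are invented for illustration.
def fleet_utilization(samples):
    """Mean utilization across every sample from every node."""
    points = [u for node in samples.values() for u in node]
    return sum(points) / len(points) if points else 0.0

def underused(samples, threshold=0.3):
    """Nodes whose average utilization falls below the threshold."""
    return sorted(
        node for node, pts in samples.items()
        if pts and sum(pts) / len(pts) < threshold
    )
```

In practice these numbers would come from an exporter (e.g. DCGM into Prometheus or Datadog) rather than in-process dictionaries, but the aggregation logic is the same shape.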
MLOps / DevOps Engineer
Data Science & Analytics
Data Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
bland_ai_logo

Senior Infrastructure Engineer

Bland
USD
120000
-
200000
US.svg
United States
Full-time
Remote
false
About Bland
At Bland.com, our goal is to empower enterprises to make AI phone agents at scale. Based out of San Francisco, we're a quickly growing team striving to change the way customers interact with businesses. We've raised $65 million from Silicon Valley's finest, including Emergence Capital, Scale Venture Partners, YC, the founders of Twilio, Affirm, ElevenLabs, and many more.
About the Role
As a Senior Infrastructure Engineer at Bland, you'll help us build the backbone that enables millions of AI-powered phone conversations. You're not just keeping servers running; you're architecting distributed systems that handle real-time voice processing, scale ML inference, and integrate with enterprise telephony infrastructure. Your work directly determines whether our platform can handle business-defining call volumes for our customers, or leaves them with dead air.
What You'll Do
- Contribute to the design of scalable architecture: Build distributed systems using Kubernetes that handle high-volume, real-time voice processing with strict latency and reliability requirements.
- Build and support ML infrastructure: Create and optimize the infrastructure supporting our AI models, from training pipelines to real-time inference serving across multiple regions.
- Integrate with telephony: Maintain robust connections between our platform and complex enterprise phone systems, SIP trunks, and VoIP infrastructure.
- Recognize flaws and control for them: We're building a new type of architecture that takes something from Column A and Column B. We're never going to get it perfect, so you'll help us keep a lookout for what we need to solve.
- Ensure reliability: Implement monitoring, alerting, and incident response systems that keep our platform running 24/7 with enterprise-grade uptime.
- Scale with growth: Anticipate and solve scaling challenges before they become problems; our call volume grows exponentially and infrastructure needs to stay ahead.
- Security and compliance: Implement security best practices and compliance requirements for enterprise customers in regulated industries.
Interesting Problems to Own
- Old meets new: Telephone calls have been around for a while. An explosion in modern technologies brings interesting new ways to wrangle old-school protocols and techniques. You'll have the space to be creative and really own a new, emergent type of architecture.
- Sizable call volumes require new approaches: Understand and deeply invest in ensuring that we can match our customers' call volumes, whatever the scale. We need unique solutions that you'll help us discover along the way.
- Streaming architectures: On top of building to support our APIs, you'll also help maintain the reliability, failover, and scaling of our important stream-based traffic.
What Makes You a Great Fit
- Infrastructure expertise: 5+ years building and scaling distributed systems, with deep knowledge of cloud infrastructure (AWS/GCP preferred).
- You "get" the fundamentals, and beyond: For example, you can casually tell someone how TLS works beyond buzzwords, do a quick sketch of how different load balancing strategies work, or even tell us about the obscure thing you fell asleep reading last night. There isn't a blank stare; there's an excitement to share.
- Real-time systems experience: You've built systems that handle high-throughput, low-latency workloads: streaming, real-time processing, or similar.
- Startup mentality: You've worked at fast-growing companies where you wear multiple hats and solve problems as they come up.
- You're opinionated, but not alienating: You accept that opinions drive progress, but you don't break into alienating arguments at the risk of not finding compromises for our customers.
- You're familiar with some tools/components like: Cloudflare, HAProxy, Go, TypeScript, Datadog, Terraform, Docker, Kubernetes, Nvidia hardware (NVLink, for example), and anything in between.
Bonus Points If You Have
- Experience with telephony systems (SIP, VoIP, WebRTC).
- Background in ML infrastructure, model serving, or GPU computing.
- Experience with real-time audio/video processing.
Benefits and Pay
- Healthcare, dental, vision, all the good stuff
- Meaningful equity in a fast-growing company
- Every tool you need to succeed
- Beautiful office in Jackson Square, SF with rooftop views
If you don't have the perfect experience, that's fine! We're a bunch of drop-outs and hackers. Working at a start-up is really hard. We work a lot and we figure things out on the fly.
Compensation Range: $120,000-$200,000
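The posting's "quick sketch of how different load balancing strategies work" might look like this in miniature: round-robin cycles through backends in order, while least-connections routes to whichever backend is least loaded right now. Backend names are placeholders; this is an illustration, not Bland's stack.

```python
import itertools

# Toy sketch of two common load-balancing strategies. Backend names are
# placeholders, not a real topology.
class RoundRobin:
    """Hand each request to the next backend in a fixed cycle."""
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)

def least_connections(open_conns):
    """Pick the backend currently holding the fewest open connections."""
    return min(open_conns, key=open_conns.get)
```

Round-robin is stateless per request and spreads load evenly only when requests cost about the same; least-connections adapts to uneven request durations, which matters for long-lived phone calls.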
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
moonvalley_ai_logo

Member of Technical Staff, Applied AI Engineer

Moonvalley
-
GB.svg
United Kingdom
CA.svg
Canada
US.svg
United States
Full-time
Remote
true
About Moonvalley
Moonvalley's mission is to solve Visual Intelligence in the age of generative AI. We are building technology that can tell stories, scale creativity, and understand both the physics and semantics of the world. With Marey, our first high-definition foundation model trained exclusively on licensed data, we are powering the next era of cinematic, commercial, and enterprise-grade creation. Our team is an unprecedented convergence of talent across industries. Our elite AI scientists from DeepMind, Google, Microsoft, Meta & Snap have decades of collective experience in machine learning and computational creativity. We have also established the first AI-enabled movie studio in Hollywood, filled with accomplished filmmakers and visionary creative talent. We work with the top producers, actors, and filmmakers in Hollywood as well as creative-driven global brands. So far we've raised over $100M from world-class investors including General Catalyst, Bessemer, Khosla Ventures & Y Combinator – and we're just getting started.
Job Summary
We're hiring an Integration/Deployment Research Engineer to join our technical team and help translate experimental ML research into robust, scalable systems. This is a highly cross-functional role that sits at the intersection of engineering and research, ensuring that our AI models are reproducible, performant, and ready for real-world use. You'll design and maintain the infrastructure needed to train, fine-tune, and deploy our most advanced models.
From distributed inference optimization to production-grade experimentation frameworks, your work will accelerate both research velocity and deployment reliability.
What you'll do
- Translate experimental research code into robust, scalable production systems
- Design and maintain ML training, fine-tuning, and deployment pipelines
- Optimize model inference for performance and reliability in distributed/cloud environments
- Collaborate with research scientists to integrate novel techniques into production workflows
- Build tools for experiment tracking, reproducibility, and rapid iteration
- Define and enforce testing, validation, and monitoring best practices for ML models
- Work closely with creative, product, and engineering teams to scale AI systems into production
- Improve infrastructure to support fast-paced experimentation and deployment cycles
What we're looking for
- BS/MS/PhD in Computer Science, Electrical Engineering, or a related field
- 5+ years of professional experience in software engineering, focused on ML systems
- Strong Python skills and experience with production-level codebases
- Deep expertise with ML frameworks (e.g. PyTorch, TensorFlow) and deployment pipelines
- Familiarity with cloud platforms (AWS, GCP, or Azure) and containerization (Docker, Kubernetes)
- Experience with distributed computing, GPU acceleration, and model optimization techniques
- Background in CI/CD, monitoring, and testing ML systems in production
- Proficiency with ML workflow orchestration tools (e.g. Airflow, Ray, MLflow, Weights & Biases)
- Strong collaboration skills and the ability to evolve research prototypes into stable products
- Bonus: Experience in video/image processing pipelines or media content delivery
- Bonus: Contributions to open-source ML infrastructure projects
What we offer (compensation & benefits)
- Competitive salary and equity
- Private health coverage
- Pension contribution (UK, Canada, US)
- Unlimited paid vacation
- Fully-distributed, async-first culture
- Hardware setup of your choice
- Stipends for phone, internet, and meals
In our team, we approach our work with dedication similar to that of Olympic athletes. Anticipate occasional late nights and weekends dedicated to our mission. We understand this level of commitment may not suit everyone, and we openly communicate this expectation. If you're motivated by deeply technical problems, a seemingly never-ending uphill battle, and the opportunity to build (and own) a generational technology company, we can give you what you're looking for. All business roles at Moonvalley are hybrid positions by default, with some fully remote depending on the job scope. We meet a few times every year, usually in London, UK or North America (LA, Toronto), as a company. If you're excited about the opportunity to work on cutting-edge AI technology and help shape the future of media and entertainment, we encourage you to apply. We look forward to hearing from you! The statements contained in this job description reflect general details as necessary to describe the principal functions of this job, the level of knowledge and skill typically required, and the scope of responsibility. It should not be considered an all-inclusive listing of work requirements. Individuals may perform other duties as assigned, including work in other functional areas to cover absences, to equalize peak work periods, or to otherwise balance organizational work. Moonvalley AI is proud to be an equal opportunity employer. We are committed to providing accommodations.
If you require accommodation, we will work with you to meet your needs. Please be assured we'll treat any information you share with us with the utmost care, only use your information for recruitment purposes, and never sell it to other companies for marketing purposes. Please review our privacy policy and job applicant privacy policy located here for further information.
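One small example of the experiment-reproducibility tooling this role covers: deriving a stable run ID from a canonicalized experiment config, so identical configs always map to the same ID and a rerun is trivially linkable to its original. The config fields below are hypothetical, not Moonvalley's schema.

```python
import hashlib
import json

# Hypothetical reproducibility helper: identical experiment configs map to
# identical run IDs. Canonical JSON (sorted keys, fixed separators) makes
# the hash independent of dict insertion order.
def run_id(config):
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]
```

Trackers such as MLflow or Weights & Biases offer richer versions of this idea, tying the config hash to code revisions and artifacts.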
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Apply
lambda_labs_logo

Senior Platform Engineer

Lambda AI
USD
240000
-
401000
US.svg
United States
Full-time
Remote
false
We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be. If you'd like to build the world's best deep learning cloud, join us.
*Note: This position requires presence in our San Francisco, San Jose, or Seattle office location 4 days per week; Lambda's designated work from home day is currently Tuesday.
Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems, as well as internal tooling for system deployment, management, and maintenance.
What You'll Do
- Architect, deploy, and manage Kubernetes clusters across AWS, OCI, and on-prem datacenters.
- Build and maintain automation for cluster lifecycle management, upgrades, and scaling.
- Own the reliability, performance, and security of Kubernetes workloads.
- Implement observability, logging, and alerting for clusters and critical workloads.
- Partner with developers to design scalable, cloud-native services and CI/CD pipelines.
- Define and enforce best practices for resource usage, networking, and RBAC.
- Lead incident response, root cause analysis, and post-mortems for cluster-related issues.
- Mentor junior engineers and contribute to internal platform engineering standards.
You
- 5+ years of experience in Platform, Infrastructure, or SRE roles.
- Expert knowledge of Kubernetes internals and operational practices.
- Proven experience running Kubernetes clusters in production at scale.
- Strong skills with Helm, Kustomize, or similar deployment tooling.
- Solid understanding of networking, service meshes, and container runtimes.
- Proficiency in infrastructure-as-code (Terraform, Pulumi, etc.).
- Experience with observability stacks (Prometheus, Grafana, ELK, OpenTelemetry).
- Familiarity with security best practices (network policies, secrets management, image scanning).
- Strong coding skills in Go, Python, or similar for automation.
- Comfort with GitOps workflows and CI/CD integration.
- Excellent problem-solving skills and the ability to operate in complex environments.
Nice to Have
- Experience with multi-cluster, multi-cloud, or hybrid environments.
- Knowledge of GPU scheduling, HPC workloads, or ML/AI infrastructure.
- Exposure to cost optimization and capacity planning for large clusters.
- Contributions to CNCF or Kubernetes open-source projects.
Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast
- We offer generous cash & equity compensation
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401k plan with 2% company match (USA employees)
- Flexible Paid Time Off Plan that we all actually use
A Final Note: You do not need to match all of the listed expectations to apply for this position.
We are committed to building a team with a variety of backgrounds, experiences, and skills.Equal Opportunity EmployerLambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Apply

MLOps Engineer

Bjak
-
US.svg
United States
Full-time
Remote
false
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition; this new project team will focus on building applications that enable more real-world impact and the highest usage for the world. This is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You'll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.
Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.
What You'll Do
- Run and manage open-source models efficiently, optimizing for cost and reliability
- Ensure high performance and stability across GPU, CPU, and memory resources
- Monitor and troubleshoot model inference to maintain low latency and high throughput
- Collaborate with engineers to implement scalable and reliable model serving solutions
What It's Like
- You like ownership and independence
- You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
- You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
- You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
- You see feedback and failure as part of growth; you're here to level up
- You possess humility, hunger, and hustle, and lift others up as you go
Requirements
- Experience with model serving platforms such as vLLM or Hugging Face TGI
- Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or Lambda Labs
- Ability to monitor latency and costs, and scale systems efficiently with traffic demands
- Experience setting up inference endpoints for backend engineers
What You'll Get
- Flat structure & real ownership
- Full involvement in direction and consensus decision making
- Flexibility in work arrangement
- High-impact role with visibility across product, data, and engineering
- Top-of-market compensation and performance-based bonuses
- Global exposure to product development
- Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
- Health, dental & vision insurance
- Global travel insurance (for you & your dependents)
- Unlimited, flexible time off
Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.
About Bjak
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you're excited to build real-world AI systems and grow fast in a high-impact environment, we'd love to hear from you.
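The latency-monitoring requirement above can be sketched in a few lines: compute a p95 from recorded inference latencies and flag an SLO breach. The nearest-rank percentile and the 500 ms budget are assumed figures for illustration, not Bjak's real SLO.

```python
# Sketch of inference latency monitoring. The 500 ms budget is an assumed
# figure for illustration.
def p95(latencies_ms):
    """Approximate 95th-percentile latency (nearest-rank method)."""
    ordered = sorted(latencies_ms)
    idx = max(0, round(0.95 * len(ordered)) - 1)
    return ordered[idx]

def breaches_slo(latencies_ms, budget_ms=500.0):
    """True when the p95 latency exceeds the budget."""
    return bool(latencies_ms) and p95(latencies_ms) > budget_ms
```

Production serving stacks expose these percentiles directly (e.g. vLLM metrics scraped into a dashboard); the point is that tail latency, not the mean, is what an SLO should gate on.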
MLOps / DevOps Engineer
Data Science & Analytics
Apply

MLOps Engineer

Bjak
-
TW.svg
Taiwan
Full-time
Remote
false
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition; this new project team will focus on building applications that enable more real-world impact and the highest usage for the world. This is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You'll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.
Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.
What You'll Do
- Run and manage open-source models efficiently, optimizing for cost and reliability
- Ensure high performance and stability across GPU, CPU, and memory resources
- Monitor and troubleshoot model inference to maintain low latency and high throughput
- Collaborate with engineers to implement scalable and reliable model serving solutions
What It's Like
- You like ownership and independence
- You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
- You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
- You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
- You see feedback and failure as part of growth; you're here to level up
- You possess humility, hunger, and hustle, and lift others up as you go
Requirements
- Experience with model serving platforms such as vLLM or Hugging Face TGI
- Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or Lambda Labs
- Ability to monitor latency and costs, and scale systems efficiently with traffic demands
- Experience setting up inference endpoints for backend engineers
What You'll Get
- Flat structure & real ownership
- Full involvement in direction and consensus decision making
- Flexibility in work arrangement
- High-impact role with visibility across product, data, and engineering
- Top-of-market compensation and performance-based bonuses
- Global exposure to product development
- Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
- Health, dental & vision insurance
- Global travel insurance (for you & your dependents)
- Unlimited, flexible time off
Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.
About Bjak
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you're excited to build real-world AI systems and grow fast in a high-impact environment, we'd love to hear from you.
MLOps / DevOps Engineer
Data Science & Analytics
Apply

MLOps Engineer

Bjak
-
CN.svg
China
Full-time
Remote
false
Transform Language Models into Real-World Applications
We're building AI systems for a global audience. We are living in an era of AI transition; this new project team will focus on building applications that enable more real-world impact and the highest usage for the world. This is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You'll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.
Why This Role Matters
You'll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent, but also safe, trustworthy, and impactful at scale.
What You'll Do
- Run and manage open-source models efficiently, optimizing for cost and reliability
- Ensure high performance and stability across GPU, CPU, and memory resources
- Monitor and troubleshoot model inference to maintain low latency and high throughput
- Collaborate with engineers to implement scalable and reliable model serving solutions
What It's Like
- You like ownership and independence
- You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
- You stay calm and effective in startup chaos; shifting priorities and building from zero don't faze you
- You have a bias for speed: it's better to deliver something valuable now than a perfect version much later
- You see feedback and failure as part of growth; you're here to level up
- You possess humility, hunger, and hustle, and lift others up as you go
Requirements
- Experience with model serving platforms such as vLLM or Hugging Face TGI
- Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or Lambda Labs
- Ability to monitor latency and costs, and scale systems efficiently with traffic demands
- Experience setting up inference endpoints for backend engineers
What You'll Get
- Flat structure & real ownership
- Full involvement in direction and consensus decision making
- Flexibility in work arrangement
- High-impact role with visibility across product, data, and engineering
- Top-of-market compensation and performance-based bonuses
- Global exposure to product development
- Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
- Health, dental & vision insurance
- Global travel insurance (for you & your dependents)
- Unlimited, flexible time off
Our Team & Culture
We're a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you're hungry to grow and care deeply about excellence, join us.
About Bjak
BJAK is Southeast Asia's #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you're excited to build real-world AI systems and grow fast in a high-impact environment, we'd love to hear from you.
MLOps / DevOps Engineer
Data Science & Analytics
Apply

MLOps Engineer

Bjak
-
Japan
Full-time
Remote
false
Transform Language Models into Real-World Applications

We’re building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that maximize real-world impact and usage worldwide. This is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You’ll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You’ll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent but also safe, trustworthy, and impactful at scale.

What You’ll Do
- Run and manage open-source models efficiently, optimizing for cost and reliability
- Ensure high performance and stability across GPU, CPU, and memory resources
- Monitor and troubleshoot model inference to maintain low latency and high throughput
- Collaborate with engineers to implement scalable and reliable model serving solutions

What It’s Like
- You like ownership and independence
- You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
- You stay calm and effective in startup chaos; shifting priorities and building from zero don’t faze you
- You have a bias for speed: it’s better to deliver something valuable now than a perfect version much later
- You see feedback and failure as part of growth; you’re here to level up
- You possess humility, hunger, and hustle, and lift others up as you go

Requirements
- Experience with model serving platforms such as vLLM or HuggingFace TGI
- Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
- Ability to monitor latency and costs, and to scale systems efficiently with traffic demand
- Experience setting up inference endpoints for backend engineers

What You’ll Get
- Flat structure and real ownership
- Full involvement in direction and consensus decision-making
- Flexible work arrangements
- A high-impact role with visibility across product, data, and engineering
- Top-of-market compensation and performance-based bonuses
- Global exposure to product development
- Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
- Health, dental, and vision insurance
- Global travel insurance (for you and your dependents)
- Unlimited, flexible time off

Our Team & Culture
We’re a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you’re hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia’s #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you’re excited to build real-world AI systems and grow fast in a high-impact environment, we’d love to hear from you.
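The requirement above to "scale systems efficiently with traffic demand" often reduces to a capacity calculation: given the observed request rate and per-replica throughput, choose a replica count within fixed bounds. A hedged stdlib sketch; the throughput figures and 20% headroom factor are made-up illustrations, not numbers from the posting:

```python
import math

def desired_replicas(rps: float,
                     capacity_per_replica: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     headroom: float = 1.2) -> int:
    """Replica count needed to serve `rps` with some headroom, clamped to bounds."""
    needed = math.ceil((rps * headroom) / capacity_per_replica)
    return max(min_replicas, min(max_replicas, needed))

# e.g. 100 req/s at 15 req/s per GPU replica, with 20% headroom -> 8 replicas
replicas = desired_replicas(100, 15)
```

The clamping matters in production: the floor keeps a warm replica for cold-start latency, and the ceiling caps GPU spend during traffic spikes.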
MLOps / DevOps Engineer
Data Science & Analytics
Apply

MLOps Engineer

Bjak
-
South Korea
Full-time
Remote
false
Transform Language Models into Real-World Applications

We’re building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that maximize real-world impact and usage worldwide. This is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You’ll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You’ll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent but also safe, trustworthy, and impactful at scale.

What You’ll Do
- Run and manage open-source models efficiently, optimizing for cost and reliability
- Ensure high performance and stability across GPU, CPU, and memory resources
- Monitor and troubleshoot model inference to maintain low latency and high throughput
- Collaborate with engineers to implement scalable and reliable model serving solutions

What It’s Like
- You like ownership and independence
- You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
- You stay calm and effective in startup chaos; shifting priorities and building from zero don’t faze you
- You have a bias for speed: it’s better to deliver something valuable now than a perfect version much later
- You see feedback and failure as part of growth; you’re here to level up
- You possess humility, hunger, and hustle, and lift others up as you go

Requirements
- Experience with model serving platforms such as vLLM or HuggingFace TGI
- Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
- Ability to monitor latency and costs, and to scale systems efficiently with traffic demand
- Experience setting up inference endpoints for backend engineers

What You’ll Get
- Flat structure and real ownership
- Full involvement in direction and consensus decision-making
- Flexible work arrangements
- A high-impact role with visibility across product, data, and engineering
- Top-of-market compensation and performance-based bonuses
- Global exposure to product development
- Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
- Health, dental, and vision insurance
- Global travel insurance (for you and your dependents)
- Unlimited, flexible time off

Our Team & Culture
We’re a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you’re hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia’s #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you’re excited to build real-world AI systems and grow fast in a high-impact environment, we’d love to hear from you.
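The "inference endpoints for backend engineers" requirement in this posting usually means a thin HTTP layer in front of the model. A minimal, self-contained stdlib sketch with a placeholder model function; in a real deployment `run_model` would forward to a vLLM or TGI backend, and everything here is illustrative:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_model(prompt: str) -> str:
    # Placeholder: a real deployment would call a vLLM/TGI backend here.
    return prompt.upper()

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"completion": run_model(payload["prompt"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging in this demo
        pass

# Bind to port 0 so the OS picks a free port; serve from a background thread.
server = HTTPServer(("127.0.0.1", 0), InferenceHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}"
req = urllib.request.Request(url,
                             data=json.dumps({"prompt": "hello"}).encode(),
                             headers={"Content-Type": "application/json"})
resp = json.loads(urllib.request.urlopen(req).read())
server.shutdown()
```

A production endpoint would add batching, timeouts, and auth, but the contract backend engineers depend on is this simple: JSON in, JSON out.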
MLOps / DevOps Engineer
Data Science & Analytics
Apply

MLOps Engineer

Bjak
-
Germany
Full-time
Remote
false
Transform Language Models into Real-World Applications

We’re building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that maximize real-world impact and usage worldwide. This is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You’ll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You’ll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent but also safe, trustworthy, and impactful at scale.

What You’ll Do
- Run and manage open-source models efficiently, optimizing for cost and reliability
- Ensure high performance and stability across GPU, CPU, and memory resources
- Monitor and troubleshoot model inference to maintain low latency and high throughput
- Collaborate with engineers to implement scalable and reliable model serving solutions

What It’s Like
- You like ownership and independence
- You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
- You stay calm and effective in startup chaos; shifting priorities and building from zero don’t faze you
- You have a bias for speed: it’s better to deliver something valuable now than a perfect version much later
- You see feedback and failure as part of growth; you’re here to level up
- You possess humility, hunger, and hustle, and lift others up as you go

Requirements
- Experience with model serving platforms such as vLLM or HuggingFace TGI
- Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
- Ability to monitor latency and costs, and to scale systems efficiently with traffic demand
- Experience setting up inference endpoints for backend engineers

What You’ll Get
- Flat structure and real ownership
- Full involvement in direction and consensus decision-making
- Flexible work arrangements
- A high-impact role with visibility across product, data, and engineering
- Top-of-market compensation and performance-based bonuses
- Global exposure to product development
- Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
- Health, dental, and vision insurance
- Global travel insurance (for you and your dependents)
- Unlimited, flexible time off

Our Team & Culture
We’re a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you’re hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia’s #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you’re excited to build real-world AI systems and grow fast in a high-impact environment, we’d love to hear from you.
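"Optimizing for cost and reliability," as this listing puts it, typically starts from a back-of-the-envelope unit cost: GPU hourly price divided by sustained token throughput. A small sketch; the price and throughput figures are assumptions for illustration only:

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD cost to generate 1M tokens at a sustained throughput on one GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# e.g. a hypothetical $2.00/hr GPU sustaining 2,500 tok/s:
cost = cost_per_million_tokens(2.00, 2500)  # roughly $0.22 per 1M tokens
```

This single number makes trade-offs concrete: doubling throughput via batching halves the unit cost, while a cheaper but slower GPU may cost more per token.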
MLOps / DevOps Engineer
Data Science & Analytics
Apply

MLOps Engineer

Bjak
-
United Kingdom
Full-time
Remote
false
Transform Language Models into Real-World Applications

We’re building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that maximize real-world impact and usage worldwide. This is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You’ll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You’ll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent but also safe, trustworthy, and impactful at scale.

What You’ll Do
- Run and manage open-source models efficiently, optimizing for cost and reliability
- Ensure high performance and stability across GPU, CPU, and memory resources
- Monitor and troubleshoot model inference to maintain low latency and high throughput
- Collaborate with engineers to implement scalable and reliable model serving solutions

What It’s Like
- You like ownership and independence
- You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
- You stay calm and effective in startup chaos; shifting priorities and building from zero don’t faze you
- You have a bias for speed: it’s better to deliver something valuable now than a perfect version much later
- You see feedback and failure as part of growth; you’re here to level up
- You possess humility, hunger, and hustle, and lift others up as you go

Requirements
- Experience with model serving platforms such as vLLM or HuggingFace TGI
- Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
- Ability to monitor latency and costs, and to scale systems efficiently with traffic demand
- Experience setting up inference endpoints for backend engineers

What You’ll Get
- Flat structure and real ownership
- Full involvement in direction and consensus decision-making
- Flexible work arrangements
- A high-impact role with visibility across product, data, and engineering
- Top-of-market compensation and performance-based bonuses
- Global exposure to product development
- Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
- Health, dental, and vision insurance
- Global travel insurance (for you and your dependents)
- Unlimited, flexible time off

Our Team & Culture
We’re a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you’re hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia’s #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you’re excited to build real-world AI systems and grow fast in a high-impact environment, we’d love to hear from you.
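Ensuring "high performance and stability across GPU, CPU, and memory resources," as this posting asks, usually begins with checking whether the model weights even fit: parameter count times bytes per parameter, plus an allowance for KV cache and activations. A rough sketch; the flat 20% overhead figure is an illustrative assumption, not a rule:

```python
def weights_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """GiB needed just for model weights (fp16/bf16 -> 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 2**30

def fits_on_gpu(params_billion: float, gpu_gib: float,
                bytes_per_param: int = 2, overhead: float = 0.20) -> bool:
    """Whether weights plus a flat overhead allowance fit in GPU memory."""
    return weights_gib(params_billion, bytes_per_param) * (1 + overhead) <= gpu_gib

# e.g. a 7B model in fp16 is ~13 GiB of weights; fits_on_gpu(7, 24) -> True
seven_b = weights_gib(7)
```

Real KV-cache usage grows with batch size and context length, so serving frameworks budget it separately; this estimate is only the floor.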
MLOps / DevOps Engineer
Data Science & Analytics
Apply

MLOps Engineer

Bjak
-
Hong Kong
Full-time
Remote
false
Transform Language Models into Real-World Applications

We’re building AI systems for a global audience. We are living in an era of AI transition, and this new project team will focus on building applications that maximize real-world impact and usage worldwide. This is a global role with a hybrid work arrangement, combining flexible remote work with in-office collaboration at our HQ. You’ll work closely with regional teams across product, engineering, operations, infrastructure, and data to build and scale impactful AI solutions.

Why This Role Matters
You’ll fine-tune state-of-the-art models, design evaluation frameworks, and bring AI features into production. Your work ensures our models are not only intelligent but also safe, trustworthy, and impactful at scale.

What You’ll Do
- Run and manage open-source models efficiently, optimizing for cost and reliability
- Ensure high performance and stability across GPU, CPU, and memory resources
- Monitor and troubleshoot model inference to maintain low latency and high throughput
- Collaborate with engineers to implement scalable and reliable model serving solutions

What It’s Like
- You like ownership and independence
- You believe clarity comes from action: prototype, test, and iterate without waiting for perfect plans
- You stay calm and effective in startup chaos; shifting priorities and building from zero don’t faze you
- You have a bias for speed: it’s better to deliver something valuable now than a perfect version much later
- You see feedback and failure as part of growth; you’re here to level up
- You possess humility, hunger, and hustle, and lift others up as you go

Requirements
- Experience with model serving platforms such as vLLM or HuggingFace TGI
- Proficiency in GPU orchestration using tools like Kubernetes, Ray, Modal, RunPod, or LambdaLabs
- Ability to monitor latency and costs, and to scale systems efficiently with traffic demand
- Experience setting up inference endpoints for backend engineers

What You’ll Get
- Flat structure and real ownership
- Full involvement in direction and consensus decision-making
- Flexible work arrangements
- A high-impact role with visibility across product, data, and engineering
- Top-of-market compensation and performance-based bonuses
- Global exposure to product development
- Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
- Health, dental, and vision insurance
- Global travel insurance (for you and your dependents)
- Unlimited, flexible time off

Our Team & Culture
We’re a dense, high-performance team focused on high-quality work and global impact. We behave like owners. We value speed, clarity, and relentless ownership. If you’re hungry to grow and care deeply about excellence, join us.

About BJAK
BJAK is Southeast Asia’s #1 insurance aggregator with 8M+ users, fully owned by its employees. Headquartered in Malaysia and operating in Thailand, Taiwan, and Japan, we help millions of users access transparent and affordable financial protection through Bjak.com. We simplify complex financial products through cutting-edge technologies, including APIs, automation, and AI, to build the next generation of intelligent financial systems. If you’re excited to build real-world AI systems and grow fast in a high-impact environment, we’d love to hear from you.
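"Scalable and reliable model serving," as listed above, also includes client-side resilience such as retries with exponential backoff around inference calls. A minimal sketch; the attempt count and delays are illustrative, and a deterministic fake failure is injected so the behavior is visible:

```python
import time

def call_with_retries(fn, max_attempts: int = 3, base_delay: float = 0.01):
    """Call fn(), retrying on exception with exponential backoff between attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)  # 0.01s, 0.02s, ...

# Demo: a fake endpoint that fails twice, then succeeds on the third call.
calls = {"n": 0}
def flaky_infer():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("backend not ready")
    return "ok"

result = call_with_retries(flaky_infer)
```

Production variants usually add jitter to the delay and retry only on transient errors (timeouts, 503s), never on malformed requests.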
MLOps / DevOps Engineer
Data Science & Analytics
Apply

Software Engineer - ML Developer Experience

Anyscale
-
Full-time
Remote
false
About Anyscale:
At Anyscale, we're on a mission to democratize distributed computing and make it accessible to software developers of all skill levels. We’re commercializing Ray, a popular open-source project that's creating an ecosystem of libraries for scalable machine learning. Companies like OpenAI, Uber, Spotify, Instacart, Cruise, and many more have Ray in their tech stacks to accelerate the progress of AI applications out into the real world. With Anyscale, we’re building the best place to run Ray, so that any developer or data scientist can scale an ML application from their laptop to a cluster without needing to be a distributed systems expert. We're proud to be backed by Andreessen Horowitz, NEA, and Addition, with $250+ million raised to date.

About the role:
The ML Development Platform team builds the suite of tools and services that enable users to create production-quality applications using Ray. The product is the user’s primary interface into the world of Anyscale, and by building a polished, stable, and well-designed product, we enable a magical developer experience for our users. The team provides the interface for administering Anyscale components, including Anyscale workspaces, production and development tools, MLOps tools and integrations, and more. Beyond user-facing features, engineers help build critical pieces of infrastructure and architecture needed to power our platform at scale. With a taste for good products, a willingness to work with and understand the user base, and the technical talent to build high-quality software, engineers on this team help create a delightful experience for everyone from new developers learning Ray to businesses powering their products on Anyscale.

As part of this role you will:
- Develop a next-gen MLOps platform and development tooling centered around Ray
- Build high-quality frameworks for accelerating the AI development lifecycle, from data preparation to training to production serving
- Work with a team of leading distributed systems and machine learning experts
- Communicate your work to a broader audience through talks, tutorials, and blog posts

We'd love to hear from you if you have:
- At least 5 years of backend development experience, with a solid background in algorithms, data structures, and system design
- Experience with modern machine learning tooling, including PyTorch, MLFlow, data catalogs, etc.
- Familiarity with technologies such as Python, FastAPI, or SQLAlchemy
- Motivation and excitement to build tools that power the next generation of cloud applications

Bonus points if you have:
- Experience building and maintaining open-source projects
- Experience building and operating machine learning infrastructure in production
- Experience building highly available serving systems

A snapshot of projects you might work on:
- Full-stack work on Anyscale workspaces, debugging, and dependency management on Anyscale
- Development of new MLOps tooling and capabilities, like dataset management and experiment and lineage tracking
- Leading development of the Anyscale SDK, authentication, etc.

Anyscale Inc. is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Anyscale Inc. is an E-Verify company, and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.
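The "experiment and lineage tracking" project mentioned in this posting reduces, at its core, to recording runs with their parameters, metrics, and a pointer to a parent run. A toy stdlib sketch of that data model; the class and method names are hypothetical illustrations, not Anyscale's actual SDK:

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Run:
    """One tracked experiment run, with optional lineage to a parent run."""
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)
    parent_id: Optional[str] = None
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)

class Tracker:
    def __init__(self):
        self.runs = {}  # run_id -> Run

    def start(self, name: str, params: dict, parent: Optional[Run] = None) -> Run:
        run = Run(name=name, params=params,
                  parent_id=parent.run_id if parent else None)
        self.runs[run.run_id] = run
        return run

    def log_metric(self, run: Run, key: str, value: float) -> None:
        run.metrics[key] = value

    def lineage(self, run: Run) -> list:
        """Run names from this run back to its root ancestor."""
        chain, current = [], run
        while current is not None:
            chain.append(current.name)
            current = self.runs.get(current.parent_id) if current.parent_id else None
        return chain

tracker = Tracker()
base = tracker.start("pretrain", {"lr": 3e-4})
ft = tracker.start("finetune", {"lr": 1e-5}, parent=base)
tracker.log_metric(ft, "eval_loss", 0.42)
```

Real tracking systems (MLFlow included) persist this graph to a store and attach artifacts, but the parent-pointer lineage walk is the same idea.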
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply