AI MLOps / DevOps Engineer Jobs | Top AI MLOps / DevOps Engineer Openings in 2025

Application Security Engineer

Glean Work

1001-5000

-

India

Full-time

Remote

About Glean: Founded in 2019, Glean is an innovative AI-powered knowledge management platform designed to help organizations quickly find, organize, and share information across their teams. By integrating seamlessly with tools like Google Drive, Slack, and Microsoft Teams, Glean ensures employees can access the right knowledge at the right time, boosting productivity and collaboration. The company’s cutting-edge AI technology simplifies knowledge discovery, making it faster and more efficient for teams to leverage their collective intelligence.

Glean was born from Founder & CEO Arvind Jain’s deep understanding of the challenges employees face in finding and understanding information at work. Seeing firsthand how fragmented knowledge and sprawling SaaS tools made it difficult to stay productive, he set out to build a better way - an AI-powered enterprise search platform that helps people quickly and intuitively access the information they need. Since then, Glean has evolved into the leading Work AI platform, combining enterprise-grade search, an AI assistant, and powerful application- and agent-building capabilities to fundamentally redefine how employees work.

About the Role: Glean is looking for an Application Security Engineer with a primary focus on ensuring that our entire technology stack is free of known software vulnerabilities (CVEs). This role is responsible for securing our base OS images, ensuring all open-source software (OSS) dependencies are scanned and patched, and integrating cutting-edge security tools into our CI/CD pipeline. The ideal candidate will drive the adoption of solutions like Google’s Assured Open Source Software (OSS) and explore alternative approaches to enhance software security.

You will:
Implement and improve the vulnerability management lifecycle, ensuring our entire tech stack is free from known vulnerabilities/CVEs.
Continuously scan, monitor, and patch OSS dependencies to mitigate supply chain risks and enforce best practices for dependency management.
Work closely with engineering teams to integrate state-of-the-art SAST, DAST, and dependency scanning tools into the CI/CD pipeline to detect and remediate vulnerabilities early.
Define and maintain best practices for secure coding to ensure all code developed by Glean engineers is free from vulnerabilities.
Ensure a secure posture across the SDLC by securing designs, conducting secure code reviews, and penetration testing features.
Develop automated security validation tests to enforce vulnerability-free deployments across the stack.
Lead the adoption of, and if necessary develop, custom security solutions to manage and mitigate security risks at scale.
Provide security guidance, training, and mentorship to engineering teams to foster a security-first culture at Glean.

About you:
BA/BS in Computer Science, Cybersecurity, or a related field (or equivalent industry experience).
5+ years of experience in application security and vulnerability management.
Deep understanding of software security vulnerabilities, including CVEs, OWASP Top 10, and supply chain risks.
Deep understanding of security design principles, including but not limited to authentication, authorisation, RBAC, and database security.
Experience with SAST, DAST, dependency scanning, and vulnerability management tools (e.g., Snyk, GitHub Dependabot, Trivy, Clair, Burp Suite, OWASP ZAP).
Strong familiarity with package managers (npm, pip, Maven, Go modules) and securing open-source dependencies.
Coding experience in languages such as Go, Python, Java, or C++ to develop security test cases and tooling.
Hands-on experience with cloud-native security best practices across AWS, GCP, or Azure.
Knowledge of container security, Kubernetes security, and securing microservices architectures.
Ability to lead cross-functional initiatives and drive security adoption within engineering teams.
A strong proactive approach to security, identifying risks before they become problems.
Excellent problem-solving skills and the ability to balance security with performance and usability.
Experience working in fast-paced, highly collaborative environments where security is a shared responsibility.
Passion for open-source security and keeping up with the latest trends in software vulnerability management.

Location: This role is hybrid (3 days a week in our Bangalore office).

We are a diverse bunch of people and we want to continue to attract and retain a diverse range of people into our organization. We're committed to an inclusive and diverse company. We do not discriminate based on gender, ethnicity, sexual orientation, religion, civil or family status, age, disability, or race.

MLOps / DevOps Engineer


September 26, 2025


Senior Infrastructure Engineer

Bland

51-100

USD 120000 - 200000

United States

Full-time

Remote

About Bland
At Bland.com, our goal is to empower enterprises to make AI-phone agents at scale. Based out of San Francisco, we're a quickly growing team striving to change the way customers interact with businesses. We've raised $65 million from Silicon Valley's finest, including Emergence Capital, Scale Venture Partners, YC, the founders of Twilio, Affirm, ElevenLabs, and many more.

About the Role
As a Senior Infrastructure Engineer at Bland, you'll help us build the backbone that enables millions of AI-powered phone conversations. You're not just keeping servers running, you're architecting distributed systems that handle real-time voice processing, scale ML inference, and integrate with enterprise telephony infrastructure. Your work directly determines whether our platform can handle business-defining call volumes for our customers, or leaves them with dead air.

What You'll Do
Contribute to the design of scalable architecture: Build distributed systems using Kubernetes that handle high-volume, real-time voice processing with strict latency and reliability requirements.
Build and support ML infrastructure: Create and optimize the infrastructure supporting our AI models, from training pipelines to real-time inference serving across multiple regions.
Integrate with telephony: Maintain robust connections between our platform and complex enterprise phone systems, SIP trunks, and VoIP infrastructure.
Recognize flaws, control for them: We’re building a new type of architecture that takes something from Column A and something from Column B. We’re never going to get it perfect, so you’ll be helping us keep a lookout for what we need to solve.
Ensure reliability: Implement monitoring, alerting, and incident response systems that keep our platform running 24/7 with enterprise-grade uptime.
Scale with growth: Anticipate and solve scaling challenges before they become problems; our call volume grows exponentially and infrastructure needs to stay ahead.
Security and compliance: Implement security best practices and compliance requirements for enterprise customers in regulated industries.

Interesting Problems to Own
Old-Meets-New: Telephone calls have been around for a while. Now, with an explosion in modern technologies, come interesting new ways to wrangle old-school protocols and techniques. You’ll have the space to be creative and really own a new emergent type of architecture.
Sizable call volumes require new approaches: Understand and deeply invest in ensuring that we can match any call volume from our customers' customers! We need unique solutions that you’ll help us discover along the way.
Streaming Architectures: On top of building to support our APIs, you’ll also be helping maintain the reliability, failover, and scaling of our important stream-based traffic.

What Makes You a Great Fit
Infrastructure expertise: 5+ years building and scaling distributed systems, with deep knowledge of cloud infrastructure (AWS/GCP preferred).
You “get” the fundamentals, and beyond: For example, you can casually tell someone how TLS works beyond buzzwords, do a quick sketch of how different load balancing strategies work, or even tell us the obscure thing you fell asleep reading about last night. There isn’t a blank stare, there’s an excitement to share.
Real-time systems experience: You've built systems that handle high-throughput, low-latency workloads, streaming, real-time processing, or similar.
Startup mentality: You've worked at fast-growing companies where you wear multiple hats and solve problems as they come up.
You’re opinionated, but you’re not alienating: You accept that opinions drive progress, but you don’t intend to break into alienating discussions at the risk of not finding compromises for our customers.
You’re familiar with some tools/components like: Cloudflare, HAProxy, Go, TypeScript, Datadog, Terraform, Docker, Kubernetes, NVIDIA hardware (NVLink, for example), and anything in between.

Bonus Points If You Have
Experience with telephony systems (SIP, VoIP, WebRTC).
Background in ML infrastructure, model serving, or GPU computing.
Experience with real-time audio/video processing.

Benefits and Pay:
Healthcare, dental, vision, all the good stuff
Meaningful equity in a fast-growing company
Every tool you need to succeed
Beautiful office in Jackson Square, SF with rooftop views

If you don't have the perfect experience, that is fine! We're a bunch of drop-outs and hackers. Working at a start-up is really hard. We work a lot and we figure things out on the fly.

Compensation Range: $120,000-$200,000

MLOps / DevOps Engineer

Software Engineer


September 25, 2025


Senior Platform Engineer

Lambda AI

501-1000

USD 240000 - 401000

United States

Full-time

Remote

We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco, San Jose, or Seattle office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.

What You’ll Do
Architect, deploy, and manage Kubernetes clusters across AWS, OCI, and on-prem datacenters.
Build and maintain automation for cluster lifecycle management, upgrades, and scaling.
Own the reliability, performance, and security of Kubernetes workloads.
Implement observability, logging, and alerting for clusters and critical workloads.
Partner with developers to design scalable, cloud-native services and CI/CD pipelines.
Define and enforce best practices for resource usage, networking, and RBAC.
Lead incident response, root cause analysis, and post-mortems for cluster-related issues.
Mentor junior engineers and contribute to internal platform engineering standards.

You
5+ years of experience in Platform, Infrastructure, or SRE roles.
Expert knowledge of Kubernetes internals and operational practices.
Proven experience running Kubernetes clusters in production at scale.
Strong skills with Helm, Kustomize, or similar deployment tooling.
Solid understanding of networking, service meshes, and container runtimes.
Proficiency in infrastructure-as-code (Terraform, Pulumi, etc.).
Experience with observability stacks (Prometheus, Grafana, ELK, OpenTelemetry).
Familiarity with security best practices (network policies, secrets management, image scanning).
Strong coding skills in Go, Python, or similar for automation.
Comfort with GitOps workflows and CI/CD integration.
Excellent problem-solving skills and ability to operate in complex environments.

Nice to Have
Experience with multi-cluster, multi-cloud, or hybrid environments.
Knowledge of GPU scheduling, HPC workloads, or ML/AI infrastructure.
Exposure to cost optimization and capacity planning for large clusters.
Contributions to CNCF or Kubernetes open-source projects.

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

MLOps / DevOps Engineer


September 24, 2025


Senior Site Reliability Engineer

Stability AI

101-200

-

United States

Full-time

Remote

MLOps / DevOps Engineer


September 23, 2025


Senior Cloud Network Engineer

Figure AI

201-500

USD 180000 - 240000

United States

Full-time

Remote

Figure is an AI Robotics company developing a general purpose humanoid. Our humanoid robot, Figure 02, is designed for commercial tasks and the home. We are based in San Jose, CA and require 5 days/week in-office collaboration. It’s time to build.

We are looking for a skilled Senior Network Engineer with a strong background in both cloud network administration (AWS, Azure, GCP) and on-premise Cisco networking, including switching and wireless technologies. The ideal candidate will be experienced in managing hybrid environments, configuring firewalls, implementing SD-WAN solutions, and delivering exceptional customer service. This candidate also needs to be experienced in a start-up company environment and be team oriented. This role plays a critical part in maintaining the integrity, performance, and security of both cloud and on-site network infrastructure for our organization and its clients.

Responsibilities:

Cloud Network Administration:
Design, configure, and support virtual networks, routing, VPNs, and load balancing in AWS, Azure, and GCP.
Administer cloud network components (e.g., VPCs, Transit Gateways, Azure VNets, GCP Interconnects).
Ensure cloud networking aligns with enterprise security and performance standards.
Automate cloud network provisioning using tools like Terraform, CloudFormation, or ARM templates.
Monitor and troubleshoot cloud network performance and incidents.
Lead efforts to audit, standardize, and secure Azure networking resources, transforming a loosely managed environment into a well-governed, cost-optimized, and scalable architecture.

On-Premise Networking:
Deploy, configure, and manage Cisco switching and wireless solutions.
Maintain wired and wireless network availability across corporate offices.
Diagnose and resolve issues related to switches, APs, controllers, and PoE devices.

Security & WAN Technologies:
Configure and manage firewalls (e.g., Cisco ASA/Firepower, Palo Alto).
Implement and support SD-WAN technologies for secure, optimized wide-area connectivity.
Ensure network compliance with internal security policies and industry standards.

Customer Service & Support:
Serve as a point of contact for internal stakeholders and clients on network-related issues.
Provide high-quality support, documentation, and communication throughout project lifecycles.
Participate in on-call rotations and respond to escalations with professionalism and urgency.

Requirements:
Bachelor's degree in Computer Science, Information Technology, or related field (or equivalent experience).
7+ years of hands-on experience in network engineering and cloud administration.
Solid experience with:
AWS, Azure, and Google Cloud Platform networking components. Experience with AWS Direct Connect or Azure ExpressRoute is a plus.
Cisco switching (Layer 2/3) and wireless infrastructure. Experience with Cisco DNA/Catalyst Center is a plus.
Firewall configuration and management.
SD-WAN solutions (e.g., Cisco Viptela, VeloCloud, FatPipe).
Strong understanding of networking protocols (TCP/IP, BGP, OSPF, DHCP, DNS, VLANs, etc.).
Proficiency in network monitoring and diagnostic tools (e.g., Wireshark, SolarWinds, NetFlow).
Excellent communication and customer service skills.

Bonus Qualifications:
Relevant certifications, such as Cisco (CCNA, CCNP), AWS Certified Advanced Networking, Azure Network Engineer Associate, and Google Professional Cloud Network Engineer.
Experience working in hybrid IT environments (on-prem + cloud).
Familiarity with Zero Trust architecture and secure cloud networking principles.
Early-stage startup experience.
Strong problem-solving and troubleshooting skills.
Ability to communicate complex technical concepts to non-technical stakeholders.
Detail-oriented with a proactive and collaborative mindset.
Comfortable working independently and as part of a cross-functional team in a fast-paced start-up company environment.

The US base salary range for this full-time position is between $180,000 - $240,000 annually. The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended.

MLOps / DevOps Engineer


September 23, 2025


Security and Compliance Lead

Black Forest Labs

11-50

-

Germany

United States

United Kingdom

Remote

Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is currently seeking a strong security and compliance lead to work closely with our team in building and implementing world-class security and ensuring regulatory compliance across the business.

The Role:
Own and evolve the company-wide security strategy across infrastructure, application, and corporate environments.
Lead our global compliance programs (e.g., ISO 27001, SOC 2), ensuring we meet regulatory and customer trust requirements.
Build and maintain relationships with auditors, ensuring smooth audit processes.
Address AI-specific compliance requirements around data usage and model governance.
Build a comprehensive security program that scales with our AI training and inference infrastructure.
Partner closely with engineering and DevOps to embed “secure by default” principles into our architecture and development lifecycle.
Secure our model training infrastructure: distributed GPU clusters, data pipelines, training datasets.
Protect inference infrastructure: model serving endpoints, API gateways, and production deployment pipelines.
Ensure secure model versioning, storage, and deployment practices.
Implement access controls and audit trails for sensitive training data and model weights.
Manage and scale our IT function, ensuring a secure, efficient, and user-friendly digital workplace.
Establish and maintain risk & governance structures, security policies, and incident response procedures.
Design and implement security controls for large-scale Kubernetes environments hosting training and inference workloads.
Lead internal risk assessments and external audits, and build trusted relationships with auditors and customers.
Create and optimise detections, playbooks, and workflows to quickly identify and respond to potential incidents.
Make impactful, risk-based security decisions aligned with business objectives.
Establish security as a competitive advantage while maintaining development velocity.

Ideal Experience:
5+ years of experience in security roles (Security Officer, Security Engineer, Compliance & Security Manager).
Deep understanding of infrastructure security, application security, and cloud security.
Experience performing security operations or investigations involving large-scale Kubernetes environments.
Track record of successfully managing compliance certifications (SOC 2, ISO 27001, etc.).
Exceptional communication and collaboration skills.
An ability to lead projects with little guidance.
Experience contributing to a high-growth startup environment.
Experience securing cloud infrastructure (Azure) at scale.
Experience with or strong interest in securing ML/AI infrastructure is highly valued.

MLOps / DevOps Engineer

Software Engineer


September 23, 2025


Cybersecurity - Site Reliability Engineer, X Money

X AI

5000+

USD 180000 - 360000


Remote

About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role
The Cybersecurity / SRE team is focused on ensuring the security and reliability of X Money. This role will primarily focus on the X Money platform but will also cross over with the X Social platform. The ideal candidate will have experience in the banking, money transmission, and P2P payments industry. We emphasize working with large distributed systems and security platforms at scale, with an automation-first mindset. You’ll be responsible for securing and maintaining the reliability of X Money’s infrastructure. You’ll work closely with cross-functional teams to enhance security measures, improve system resilience, and implement best practices.

Responsibilities
Building and securing mission-critical applications within AWS.
Ensuring proper identity and role management within AWS.
Implementing and maintaining KMS for data management in RDS and DynamoDB.
Strengthening Kubernetes and container security.
Writing and maintaining infrastructure code using Python and Terraform.
Integrating and maintaining code scanning platforms.
Taking ownership of cybersecurity projects, identifying problems, and implementing solutions.
Conducting critical analysis and applying strong problem-solving skills.

Minimum qualifications:
Proficiency in Python and Terraform.
Hands-on experience with code scanning platforms.
A proactive, problem-solving mindset with a strong sense of ownership.
Excellent critical thinking and analytical skills.
AWS experience, particularly with identity management and security.
Expertise in Kubernetes and container security, and experience with self-managed Kubernetes or EKS on AWS.
Be based in the SF Bay Area, or willing to relocate here.

Annual Salary Range
$180,000 - $360,000 USD

Benefits
Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

xAI is an equal opportunity employer.

California Consumer Privacy Act (CCPA) Notice

MLOps / DevOps Engineer

Software Engineer


September 18, 2025


DevOps Engineer I

Observe

201-500

-

India

Full-time

Remote

About Us: Observe.AI enables enterprises to transform how they connect with customers - through AI agents and copilots that engage, assist, and act across every channel. From automating conversations to guiding human agents in real time to uncovering insights that shape strategy, Observe.AI turns every interaction into a driver of loyalty and growth. Trusted by global leaders, we’re creating a future where every customer experience is smarter, faster, and more impactful.

Why Join Us
At Observe.AI, DevOps isn’t just about maintaining infrastructure; it’s about building scalable, reliable, and secure systems that empower innovation across the organization. As a DevOps Engineer, you’ll help design and automate the foundation that powers our AI/ML platforms, ensuring seamless operations across AWS accounts, Kubernetes clusters, and diverse environments while driving efficiency and cost optimization. You’ll work on automation that goes beyond CI/CD, tackling challenges in scalability, reliability, and security, while collaborating closely with engineering, data, and product teams to create resilient systems. If you’re looking for an opportunity where your expertise shapes the future of our infrastructure, your work enables faster innovation, and your growth is fueled by solving meaningful challenges alongside a talented team, this is the place for you.

What you’ll be doing
Help define best practices for production monitoring and alerting, and own their application.
Assist with and troubleshoot the setup and maintenance of various environments (production, testing, etc.).
Automate, optimize, and drive efficiency of effort, code, and process.
Assist with product stability and collaborate closely with other tech teams to suggest improvements.
Assist in the implementation of security best practices, especially in public cloud infrastructure and in audit/compliance requirements.
Own integration of existing systems using appropriate Kubernetes/Docker/Terraform scripts to automate and improve the efficiency of deployment.
Develop CI/CD pipelines for various services.
Coordinate and monitor releases of those services.

What you’ll bring to the role
Expertise in scripting and programming (e.g., Python, Shell, Go).
Good problem-solving skills and hands-on experience with programming or scripting for infrastructure automation.
CI/CD experience with Jenkins and cloud deployment technologies like AWS CodeDeploy and/or GitLab.
Understanding of enterprise software development and infrastructure processes and lifecycle; ability to adjust and apply this knowledge in a dynamic environment using Agile or similar methodologies.
Hands-on experience with Infrastructure as Code, using Terraform, CloudFormation, or other tools.
Hands-on experience with microservices and distributed applications, such as orchestration and containers, Kubernetes, and/or serverless technology.
Understanding of different kinds of infrastructure components such as databases, pub-sub services, and caches.
Bachelor's or Master's degree in Engineering.

Perks & Benefits
Excellent medical insurance options and free online doctor consultations
Yearly privilege and sick leaves as per Karnataka S&E Act
Generous holidays (national and festive), recognition, and parental leave policies
Learning & Development fund to support your continuous learning journey and professional development
Fun events to build culture across the organization
Flexible benefit plans for tax exemptions (e.g., meal card, PF)

Our Commitment to Inclusion and Belonging
Observe.AI is an Equal Employment Opportunity employer that proudly pursues and hires a diverse workforce. Observe.AI does not make hiring or employment decisions on the basis of race, color, religion or religious belief, ethnic or national origin, nationality, sex, gender, gender identity, sexual orientation, disability, age, military or veteran status, or any other basis protected by applicable local, state, or federal laws or prohibited by Company policy. Observe.AI also strives for a healthy and safe workplace and strictly prohibits harassment of any kind.

We welcome all people. We celebrate diversity of all kinds and are committed to creating an inclusive culture built on a foundation of respect for all individuals. We seek to hire, develop, and retain talented people from all backgrounds. Individuals from non-traditional backgrounds, historically marginalized or underrepresented groups are strongly encouraged to apply.

If you are ambitious, make an impact wherever you go, and you're ready to shape the future of Observe.AI, we encourage you to apply. For more information, visit www.observe.ai.

MLOps / DevOps Engineer


September 18, 2025


Member of Technical Staff - Training Cluster Engineer

Black Forest Labs

11-50

-

Germany

United States

Full-time

Remote

Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is currently looking for a strong candidate to join us in developing and maintaining our large GPU training clusters.

Role & Responsibilities
Design, deploy, and maintain large-scale ML training clusters running SLURM for distributed workload orchestration.
Implement comprehensive node health monitoring systems with automated failure detection and recovery workflows.
Partner with cloud and colocation providers to ensure cluster availability and performance.
Establish and enforce security best practices across the ML infrastructure stack (network, storage, compute).
Build and maintain developer-facing tools and APIs that streamline ML workflows and improve researcher productivity.
Collaborate directly with ML research teams to translate computational requirements into infrastructure capabilities and capacity planning.

Required Experience
Production experience managing SLURM clusters at scale, including job scheduling policies, resource allocation, and federation.
Hands-on experience with Docker, Enroot/Pyxis, or similar container runtimes in HPC environments.
Proven track record managing GPU clusters, including driver management and DCGM monitoring.

Preferred Qualifications
Understanding of distributed training patterns, checkpointing strategies, and data pipeline optimization.
Experience with Kubernetes for containerized workloads, particularly for inference or mixed compute environments.
Experience with high-performance interconnects (InfiniBand, RoCE) and NCCL optimization for multi-node training.
Track record of managing 1000+ GPU training runs, with deep understanding of failure modes and recovery patterns.
Familiarity with high-performance storage solutions (VAST, blob storage) and their performance characteristics for ML workloads.
Experience running hybrid training/inference infrastructure with appropriate resource isolation.
Strong scripting skills (Python, Bash) and infrastructure-as-code experience.

MLOps / DevOps Engineer


September 17, 2025


Director of production support

Writer

1001-5000

-

United States

Full-time

Remote

📐 About this role
As the Director of WRITER production support, you will lead the function that guarantees the operational success of our customers' mission-critical WRITER Agents. This is a unique leadership role with a dual mandate: first, to build and lead a world-class human support organization, and second, to architect the AI-driven future of that organization by agentifying our own support processes on the WRITER platform.

Your first-hand experience building and delivering AI agents makes you uniquely qualified to lead this team. You understand the complexities of production AI from the inside out. You will be responsible for supporting everything from standardized vertical solutions to the highly complex, custom agents that power our customers' core operations, all while turning your own department into a showcase for AI-powered efficiency.

This is a rare opportunity to build the support organization of the future, from the ground up. You will not just be managing a team; you will be a player-coach, a strategist, and a builder, creating a function that is both a world-class human support team and a living testament to the transformative power of the WRITER platform itself.

🦸🏻‍♀️ Your responsibilities:
Build the WRITER support team: Recruit, hire, and mentor a team of platform support engineers, prioritizing candidates who share a builder's mindset and a deep curiosity for how WRITER Agents work. You will define the hiring profile for individuals who can effectively triage and debug issues within the WRITER.AI ecosystem, from platform-level configurations to the behavior of individual WRITER Agents.
Agentify our own support: Develop and execute a roadmap to transform our support function using the WRITER.AI platform. You will "eat our own dog food," building a suite of internal agents to automate triage, diagnosis, knowledge retrieval, and resolution, creating a model for AI-driven operational excellence.
Design the agent support process: Architect our entire production support workflow, establishing definitive, AI-first escalation paths. This includes creating specialized triage processes for custom complex agents that directly involve the original builders when necessary.
Own the agent knowledge base: Champion and build our technical knowledge base, with a focus on making this knowledge accessible to both human engineers and the support agents you build.
Be the voice of the customer in crisis: Serve as the incident commander during P0 issues, leveraging your deep technical understanding of WRITER agents to guide your team and communicate with credibility and confidence.
Drive platform & agent insights: Implement and own all support metrics. You will analyze data from both human- and agent-led resolutions to provide the WRITER product organization with unparalleled insights into platform reliability and customer pain points.

⭐️ Is this you?
An experienced technical support leader: You have 7+ years of experience in technical support, with at least 3 years spent managing a team in an enterprise PaaS or API-first environment.
A support transformation leader: You have a proven track record of not just managing a support function, but fundamentally transforming it using automation and AI. You have personally used an AI platform to agentify and improve support workflows.
A proven AI builder: You have demonstrable, hands-on experience building and delivering AI agents or similar complex AI solutions. You have likely been an AI Architect, a Solutions Architect, or held a similar role in the past. You don't just manage the technology; you have built it.
Technically credible & hands-on: You are adept at navigating bespoke software and understand that custom agents have unique failure modes. Your past experience as a builder gives you immediate credibility with engineering and delivery teams.
A process builder: You love creating order from chaos and have a proven track record of designing scalable support processes, ticketing workflows, and SLAs for a technical product.
A cross-functional partner: You are skilled at working with and influencing teams you don't directly manage to ensure the stability of the entire WRITER.AI ecosystem.

🍩 Benefits & perks (US Full-time employees)
Generous PTO, plus company holidays
Medical, dental, and vision coverage for you and your family
Paid parental leave for all parents (12 weeks)
Fertility and family planning support
Early-detection cancer testing through Galleri
Flexible spending account and dependent FSA options
Health savings account for eligible plans with company contribution
Annual work-life stipends for:
Home office setup, cell phone, internet
Wellness stipend for gym, massage/chiropractor, personal training, etc.
Learning and development stipend
Company-wide off-sites and team off-sites
Competitive compensation, company stock options and 401k

WRITER is an equal-opportunity employer and is committed to diversity. We don't make hiring or employment decisions based on race, color, religion, creed, gender, national origin, age, disability, veteran status, marital status, pregnancy, sex, gender expression or identity, sexual orientation, citizenship, or any other basis protected by applicable local, state or federal law. Under the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

By submitting your application on the application page, you acknowledge and agree to WRITER's Global Candidate Privacy Notice.

MLOps / DevOps Engineer

Solutions Architect

Project Manager

September 17, 2025


Member of Technical Staff - ML Infra

Fundamental Research Labs

51-100

-

United States

Full-time

Remote

About the Role
As our Member of Technical Staff focused on ML infrastructure, you’ll design and scale the platforms that power cutting-edge AI: from high-performance inference engines to the underlying agent technologies and large-scale compute clusters that keep everything running.

You’ll collaborate closely with researchers and product engineers to push the limits of inference performance, build reliable foundations for AI agents, and advance the next generation of training and post-training pipelines.

Responsibilities
Speed up research development, help researchers explore SOTA and new techniques on day one
Build and optimize model training pipeline including data collection, data loading, SFT and RL
Optimize a high-performance inference platform on top of both open-source and proprietary inference engines
Develop and scale technologies for large-scale cluster scheduling, high-performance distributed training, and AI networking
Build a strong engineering discipline across observability and reliability at scale
Collaborate with research and product teams to translate breakthroughs into robust, production-ready infrastructure

Qualifications
Expertise in one or more of: inference engines, GPU optimization, cluster scheduling, or cloud-native infra
Familiarity with modern ML frameworks (PyTorch, vLLM, Verl, etc.)
Startup-ready mindset (adaptable, fast-moving, high-ownership)

What makes us interesting
Small, elite team of ex-founders, researchers from top AI Labs, top CS grads, and engineers from top companies
True ownership: You will not be blocked by bureaucracy, shipping meaningful work within weeks rather than months
Serious momentum: We're well-funded by top investors, moving fast, and focused on execution

What we do
Ship consumer products powered by cutting-edge AI research, and
Build infrastructure that facilitates research and product, and
Innovate cutting-edge research that will open up new consumer product forms

The Details
Full-time, onsite role in Menlo Park
Startup hours apply
Generous salary, with additional benefits to be discussed during the hiring process

MLOps / DevOps Engineer

Machine Learning Engineer

September 16, 2025

Senior Site Reliability Engineer - Managed Kubernetes

Lambda AI

501-1000

USD

0

267000

-

401000

United States

Full-time

Remote

We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.

What You’ll Do
Operate and maintain bare-metal Kubernetes clusters, scaling up to thousands of nodes
Handle cluster degradation, recovery, resizing, and incident response using fleet management tools
Participate in a well-managed on-call rotation for critical incidents
Assist customers with Kubernetes questions, workload integration, storage, and authentication
Work closely with our HPC Ops and Datacenter Ops teams for low-level or cross-functional issues
Use Python and Golang to create tooling and automate the validation of platform quality.
Design, build, and maintain scalable control plane services, operators, and custom controllers for Kubernetes
Develop automation for cluster lifecycle management: provisioning, upgrades, patching, and deletion.
Define and implement SLOs and SLIs for Kubernetes services, workloads, and platform reliability.

About You

Must-Have
6+ years of experience in an SRE, operations engineer, or similar role, with a deep knowledge of running Linux clusters and systems
Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators
Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar)
Can work either independently with limited direction or as part of a team
Can work with customers during incidents either via tickets, live messaging, or as part of a larger call.
Familiarity with observability tools like Prometheus, Grafana, FluentBit, and CI/CD pipelines
Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar

Nice-to-Have
Deep Kubernetes expertise: CRDs, CSI, CNI, Kubernetes Operator Coding experience
Exposure to HPC clusters, AI/ML workloads, or large-scale GPU clusters
Hybrid or multi-cloud Kubernetes environment experience
Contributions to CNCF projects or Kubernetes SIGs

Why Join Us
Work on cutting-edge Managed Kubernetes platforms for AI/ML workloads
Influence the platform roadmap and help shape operations and reliability best practices
Collaborate with a highly skilled engineer
Opportunity to mentor and grow within a fast-growing, technology-driven environment

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

MLOps / DevOps Engineer

September 15, 2025

Security Engineer - Detection & Response

Lambda AI

501-1000

USD

296000

-

445000

United States

Full-time

Remote

We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

About the Role
Lambda Security protects some of the world's most valuable digital assets: invaluable training data, model weights representing immense computational investments, and the sensitive inputs required to leverage best of breed AI models. We're responsible for securing every byte that powers breakthrough artificial intelligence.

As a Security Engineer on the Detection & Response team, you'll be a core technical contributor building detection capabilities, driving incident response, and eliminating firefighting everywhere possible.

Reporting to the Senior Manager of Detection & Response and working within our specialized Detection & Response team, you'll build and operate detection systems, lead incident investigations, develop threat intelligence capabilities, and contribute to red team activities. You'll coordinate closely with Security Technical Program Management to drive prioritized security remediations across the organization, ensuring that critical threats are addressed systematically rather than reactively.

You will work on implementing enterprise-grade detection capabilities, automating incident response workflows, developing threat hunting programs, and building tooling that enables 24/7 security operations. You'll have unique access to LLMs hosted on our own infrastructure to implement and experiment with AI-powered detection and response capabilities that wouldn't be possible anywhere else.

If you thrive on hunting threats, responding to incidents, and building detection systems that protect cutting-edge AI infrastructure at scale, we'd love to talk.

We value diverse backgrounds, experiences, and skills, and we are excited to hear from candidates who can bring unique perspectives to our team. If you do not exactly meet this description but believe you may be a good fit, please still apply and help us understand your readiness for this role. Your application is not a waste of our time.

What You’ll Do

Incident Response & Operations:
Response: Qualify reports and lead response activities from initial triage through remediation and retrospective.
Automation: Develop tools and workflows that accelerate incident response and reduce mean time to resolution.
Coordination: Drive prioritization and remediation of security findings across engineering teams in coordination with Security Technical Program Management.
24/7 Operations: Participate in on-call rotation, ensuring rapid response to security events that threaten customer data or operations.

Threat Detection & Analysis:
Detection Engineering: Create and tune detection rules and alerts that identify threats across Lambda's infrastructure before they impact customers or revenue.
Threat Intelligence: Research and operationalize threat intelligence specific to AI infrastructure and Lambda's unique threat landscape.
Threat Hunts: Proactively search for indicators of compromise and suspicious activity that automated detection might miss.
Explore AI-driven Security: Leverage Lambda's hosted LLMs to create AI-powered threat detection, automated triage, and intelligent alert correlation.
Offensive Security: Support periodic tabletop exercises and red team activities to test and improve detection coverage and response capabilities.

What We Think a Candidate Needs to Demonstrate to Succeed
Have 3+ years of hands-on security engineering experience and 5+ years of total engineering experience, with demonstrated impact in detection and incident response.
Thrive in high-speed, high-ambiguity startup environments where you build security capabilities while responding to immediate threats.
Deep technical expertise with security tooling including SIEM/SOAR platforms, EDR solutions, vulnerability scanners, and cloud security monitoring.
Excel at solving problems in Python, Go, or similar languages, building automations that scale security impact.
Proven ability to work effectively with cross-functional technical teams both with and without authority (we're all on the same team!).
Strong Linux systems experience in both bare metal and cloud environments, understanding infrastructure from kernel to application layer.
Excellence at translating security concerns into business risk, enabling stakeholders to make informed decisions.

Nice to Have
You've built or contributed to detection engineering programs or incident response capabilities.
Experience with threat intelligence platforms, threat hunting methodologies, or purple team exercises.
Deep experience with specific SIEM platforms (Splunk, Elastic, Chronicle) or SOAR solutions.
Experience driving or providing significant evidence for compliance audits, such as SOC 2, ISO 27001, PCI-DSS, HIPAA/HITECH, or FedRAMP.
You've developed detection content shared with the security community (Sigma rules, YARA, etc.).
Experience responding to incidents in both cloud (AWS, GCP, Azure) and bare metal environments.
Security certifications like GCIH, GNFA, GCIA, or similar that demonstrate incident response expertise.
Experience with forensics, malware analysis, or reverse engineering.
Excitement about leveraging our direct access to state-of-the-art LLMs to enhance detection and response: imagine AI-powered threat hunting, automated incident triage, and intelligent alert correlation at a scale only possible when you host the AI infrastructure yourself.

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

MLOps / DevOps Engineer

Software Engineer

September 15, 2025

Senior Networking Engineer

Lambda AI

501-1000

USD

203000

-

417000

United States

Full-time

Remote

We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco/San Jose/Seattle office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

What You’ll Do
Help to build Lambda’s cloud networking infrastructure
Contribute to automation of network configuration
Be part of operations and on-call for networking
Work with internal and external customers to resolve network-related issues
Work on deploying and configuring networking hardware, switches, and firewalls for new clusters
Help with deploying and maintaining network monitoring and management tools

You
Have 3+ years of experience in the IT space, and 1+ in managing networks
Have experience with virtualization technology, like ESXi, KVM, and VM management
Have experience with firewall policy configuration
Have experience with multi-data center networks and hybrid cloud networks
Have an understanding of BGP EVPN VXLAN networks and Spine and Leaf (Clos) network topology
Are comfortable on the Linux command line, and have an understanding of the Linux networking stack and internals
Have Python and/or Bash programming experience and have worked with git or similar source control systems

Nice to Have
Experience with Monitoring/Observability tools like Datadog, Splunk, Grafana, Prometheus
Experience building and maintaining Software Defined Networks (SDN)
Experience with HPC networking, such as InfiniBand or RoCE
Experience automating network configuration within public clouds, with tools like Terraform/Ansible/Salt
Experience with Next-Generation Firewalls (NGFW)

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

MLOps / DevOps Engineer

September 15, 2025

Security Engineer - Architecture

Lambda AI

501-1000

USD

296000

-

445000

United States

Full-time

Remote

We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

About the Role
Lambda Security protects some of the world's most valuable digital assets: invaluable training data, model weights representing immense computational investments, and the sensitive inputs required to leverage best of breed AI models. We're responsible for securing every byte that powers breakthrough artificial intelligence.

As a Security Engineer on our Architecture team, you'll be the technical foundation of our security design decisions, creating security architecture patterns and standards that directly protect customer data and enable Lambda to be the safest place to build with AI.

Reporting to the Senior Manager of Security and collaborating closely with Product Engineering, Platform Engineering, and embedded Technical Program Managers, you'll drive security architecture improvements across our AI-focused infrastructure. Your work will span security design reviews, threat modeling, architecture patterns, and security requirements that scale with our rapid growth while maintaining the highest security standards.

You will work on creating security architecture patterns, conducting threat models and security reviews, establishing security requirements for engineering teams, and developing customer-facing security documentation. You'll have unique access to LLMs hosted on our own infrastructure to pioneer AI-powered security architecture solutions that wouldn't be possible anywhere else.

If you thrive on solving complex security design challenges in cutting-edge AI infrastructure and want to build security architectures that scale from hundreds to thousands of systems, we'd love to talk.

We value diverse backgrounds, experiences, and skills, and we are excited to hear from candidates who can bring unique perspectives to our team. If you do not exactly meet this description but believe you may be a good fit, please still apply and help us understand your readiness for this role. Your application is not a waste of our time.

What You’ll Do
Drive Security Architecture: Design and document comprehensive security patterns, standards, and implementation guides that engineering teams can adopt to build secure-by-default systems.
Lead Security Reviews: Conduct security design reviews and develop threat models for critical systems, identifying risks and providing actionable recommendations.
Develop Security Requirements: Create clear security requirements and acceptance criteria that integrate seamlessly into engineering development cycles.
Build Security Solutions: Prototype and implement security controls, tools, and automation that demonstrate secure patterns and enable self-service security.
Pioneer AI-Powered Architecture: Leverage Lambda's hosted LLMs to build next-generation security capabilities including automated threat modeling, AI-assisted security reviews, and intelligent architecture recommendations that push far beyond traditional approaches.
Collaborate Across Engineering: Partner with Product and Platform Engineering teams to integrate security architecture requirements into their designs at optimal moments.
Enable Customer Trust: Develop customer-facing security documentation, architecture whitepapers, and technical security content that demonstrates our security maturity.
Mentor Security Excellence: Coach engineers across the organization on secure design principles and security architecture patterns, multiplying your impact.
Drive Architectural Standards: Establish and maintain security architecture standards that protect critical assets while enabling development velocity.
Advocate for Security: Communicate security architecture value to stakeholders, translating technical risks into business impact for informed decision-making.

What We Think a Candidate Needs to Demonstrate to Succeed
Have 3+ years of security engineering or security architecture experience and 5+ years of total engineering experience, with demonstrated impact protecting enterprise infrastructure.
Thrive in high-speed, high-ambiguity startup environments where you are constantly balancing security goals with business needs.
Deep technical expertise in security architecture patterns, threat modeling methodologies, and security design principles.
Excel at solving problems through design and prototyping in Python, Go, or similar languages.
Proven ability to work effectively with cross-functional technical teams both with and without authority (we're all on the same team!).
Strong Linux systems experience in both bare metal and cloud environments, understanding infrastructure from kernel to application layer.
Demonstrated experience driving security improvements that were enthusiastically adopted by engineering teams.
Excellence at translating security architecture decisions into business risk, enabling stakeholders to make informed decisions.

Nice to Have
You've led the security assessment and requirements for major platform components or enterprise systems.
Experience driving or providing significant evidence for compliance audits, such as SOC 2, ISO 27001, PCI-DSS, HIPAA/HITECH, or FedRAMP.
Deep experience with cloud security architecture and cloud provider security services (AWS, GCP, Azure).
Experience with AI/ML system security, including model security, data pipeline protection, or adversarial threat modeling (yes, we know it’s all brand new), or other high sensitivity workloads.
You've developed security architecture patterns that were adopted across multiple engineering teams.
Security certifications like CISSP, OSCP, or similar that demonstrate continued learning.
Experience with infrastructure-as-code security patterns and secure DevOps practices.
Excitement about leveraging our direct access to state-of-the-art LLMs to revolutionize security architecture: imagine AI-powered threat modeling, automated security design reviews, and intelligent architecture validation at a scale only possible when you host the AI infrastructure yourself.

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position.
We are committed to building a team with a variety of backgrounds, experiences, and skills.Equal Opportunity EmployerLambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

MLOps / DevOps Engineer

Software Engineer

Apply

September 11, 2025


Fabric SOC Architect

Tenstorrent

1001-5000

USD

100000

-

500000

United States

Full-time

Remote

Tenstorrent is leading the industry on cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. With AI redefining the computing paradigm, solutions must evolve to unify innovations in software models, compilers, platforms, networking, and semiconductors. Our diverse team of technologists have developed a high performance RISC-V CPU from scratch, and share a passion for AI and a deep desire to build the best AI platform possible. We value collaboration, curiosity, and a commitment to solving hard problems. We are growing our team and looking for contributors of all seniorities.

At Tenstorrent, we’re building cutting-edge hardware and software solutions that power AI, HPC, and general-purpose workloads. As a Performance Architect on our Platform Architecture team, you’ll work across ML software stacks, compilers, CPU design, cache coherency protocols, and interconnect fabrics to shape the future of high-performance systems. This role is all about bridging software execution and silicon design, making data-driven decisions that directly influence our SoC performance.

This role is remote, based out of the United States.

We welcome candidates at various experience levels for this role. During the interview process, candidates will be assessed for the appropriate level, and offers will align with that level, which may differ from the one in this posting.

Who You Are

Passionate about solving complex system-level performance problems.
Comfortable working across hardware and software boundaries.
Analytical and data-driven, with a talent for turning workloads into architectural insights.
Collaborative, thriving in cross-functional teams spanning compilers, CPU, and interconnect.
Excited to shape the future of AI/HPC platforms through performance architecture.

What We Need

BS/MS/PhD in EE, ECE, CE, or CS.
Deep understanding of NoC topologies, routing algorithms, QoS, and traffic scheduling.
Expertise in cache coherency protocols (AMBA CHI/AXI) and modern memory/IO technologies (DDR, LPDDR, GDDR, PCIe, CCIX, CXL).
Proficiency in C/C++ programming, with experience in building efficient performance models.
Familiarity with ML/AI traffic patterns or formal verification of cache coherence protocols is a strong plus.

What You Will Learn

How real ML/AI traffic patterns influence SoC interconnect and cache design.
The art of balancing performance vs. complexity in coherence and memory hierarchies.
How performance models feed into CPU and accelerator microarchitecture decisions.
Best practices for correlating pre-silicon and post-silicon performance.
Cutting-edge approaches to integrating heterogeneous compute systems at scale.

Compensation for all engineers at Tenstorrent ranges from $100k - $500k including base and variable compensation targets. Experience, skills, education, background and location all impact the actual offer made.

Tenstorrent offers a highly competitive compensation package and benefits, and we are an equal opportunity employer.

This offer of employment is contingent upon the applicant being eligible to access U.S. export-controlled technology. Due to U.S. export laws, including those codified in the U.S. Export Administration Regulations (EAR), the Company is required to ensure compliance with these laws when transferring technology to nationals of certain countries (such as EAR Country Groups D:1, E1, and E2). These requirements apply to persons located in the U.S. and all countries outside the U.S. As the position offered will have direct and/or indirect access to information, systems, or technologies subject to these laws, the offer may be contingent upon your citizenship/permanent residency status or ability to obtain prior license approval from the U.S. Commerce Department or applicable federal agency. If employment is not possible due to U.S. export laws, any offer of employment will be rescinded.

MLOps / DevOps Engineer

Machine Learning Engineer

Software Engineer

Apply

September 11, 2025


Engineering Manager, Cloud Security

Anthropic

1001-5000

USD


320000

-

405000

United States

Full-time

Remote

About Anthropic

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the Team

At Anthropic, the Security Engineering team's mission is to safeguard our AI systems and maintain the trust of our users and society at large. Whether we're developing critical security infrastructure, building secure development practices, or partnering with our research and product teams, we are committed to operating as a world-class security organization and keeping the safety and trust of our users at the forefront of everything we do.

What you'll do:

We are hiring an Engineering Manager to lead our Cloud Security team, whose remit is to build and maintain secure foundations for our AI systems. This role focuses on cloud infrastructure and platform services, with particular emphasis on data protection, intuitive access controls, and enabling secure-by-default patterns across our multi-cloud environments.

In this role you will:

Lead a security engineering team while maintaining deep technical involvement in cloud security architecture across all cloud environments
Design and implement comprehensive security controls for model weights, customer data, and training datasets
Build secure-by-default infrastructure to create self-service patterns that prevent misconfigurations while enabling engineering velocity
Establish continuous attack surface monitoring and automated remediation systems using cloud security posture management tools, custom detection capabilities, and infrastructure-as-code security scanning
Partner with product, research, and infrastructure teams to embed security requirements into system design and development workflows from inception through deployment
Identify and assess systematic risks across the organization, translating complex security requirements into prioritized roadmaps and actionable engineering work
Develop security and reliability metrics and observability systems that measure risk posture, demonstrate security improvements, and inform data-driven security investments
Coach and mentor engineers while building a team culture that balances security rigor with development velocity and innovation

Who You Are:

4+ years managing security engineering teams with proven track record of team productivity
5+ years hands-on infrastructure security and software engineering experience
Deep expertise in securing complex cloud environments, threat modeling, and risk assessment
Strong cross-functional collaboration skills, balancing security requirements with business objectives
Clear and persuasive communicator in both writing and verbal settings
Low ego, high empathy leader who attracts talent and builds diverse, inclusive teams
Passionate about developing engineers' careers in a supportive yet challenging environment
Experience working with technical internal customers
Familiarity with AI safety concepts and frameworks

Deadline to apply: None. Applications will be reviewed on a rolling basis.

The expected salary range for this position is:
Annual Salary: $320,000 - $405,000 USD

Logistics

Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience.
Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.
Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work. We think AI systems like the ones we're building have enormous social and ethical implications. We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team.

How we're different

We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact (advancing our long-term goals of steerable, trustworthy AI) rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We're an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication skills.

The easiest way to understand our research directions is to read our recent research. This research continues many of the directions our team worked on prior to Anthropic, including: GPT-3, Circuit-Based Interpretability, Multimodal Neurons, Scaling Laws, AI & Compute, Concrete Problems in AI Safety, and Learning from Human Preferences.

Come work with us!

Anthropic is a public benefit corporation headquartered in San Francisco. We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues.

Guidance on Candidates' AI Usage: Learn about our policy for using AI in our application process

MLOps / DevOps Engineer

Apply

September 11, 2025


Cloud platform engineer

Writer

1001-5000

-

United States

Full-time

Remote

📐 About this role

WRITER is experiencing an incredible market moment as generative AI has taken the world by storm. We're looking for a cloud platform engineer to establish our cloud platform team, focusing on building and scaling our multi-cloud architecture across AWS, GCP and Azure regions. In this founding role, you'll architect and implement highly scalable systems that handle complex multi-tenant workloads while ensuring proper tenant isolation and security.

As a cloud platform engineer, you'll work closely with our development teams to build robust, automated solutions for environment buildout, tenant management, and cross-region capabilities. You'll design and implement systems that ensure proper tenant isolation while enabling efficient environment lifecycle management. This is a unique opportunity to establish our cloud platform team and have a direct impact on our platform's scalability and security, working with cutting-edge technologies and solving complex distributed systems challenges.

You will report to Charles Cooke, head of cloud operations.

🦸🏻‍♀️ Your responsibilities:

Establish and lead the cloud platform team
Architect and implement multi-cloud infrastructure across AWS, GCP and Azure regions
Design and build highly scalable, distributed systems for multi-tenant workloads
Define and implement best practices for automated infrastructure across cloud providers
Architect and implement tenant isolation mechanisms to ensure data security and compliance
Create and manage environment lifecycle automation for dev/qa/demo environments
Design and implement cross-region capabilities and active-active deployments
Develop tenant migration and pod management solutions
Collaborate with development teams to understand tenant requirements and constraints
Implement infrastructure as code for region-level deployments
Monitor and optimize tenant isolation and security
Document tenant management processes and best practices
Stay current with industry trends in multi-tenant architectures
Participate in on-call rotation for critical platform services
Contribute to technical decisions around tenant isolation and region management
Mentor and grow the cloud platform team
Implement and maintain deployment automation and CI/CD pipelines
Design and optimize Kubernetes infrastructure for multi-tenant workloads
Drive infrastructure cost optimization and efficiency

⭐️ Is this you?

Have 8+ years of experience in cloud platform engineering or related role (site reliability engineering)
Have 3+ years of experience leading engineering teams
Are passionate about building secure, scalable multi-tenant platforms
Have extensive experience with cloud platforms (AWS, GCP, or Azure)
Are proficient in infrastructure as code (Terraform, CloudFormation, etc.)
Have deep experience with multi-tenant architectures and tenant isolation
Can write clean, maintainable code in Python, Go, or similar languages
Understand containerization and orchestration (Docker, Kubernetes)
Have proven experience with cross-region deployments and active-active architectures
Are comfortable working with multiple development teams
Can communicate technical concepts clearly to both technical and non-technical audiences
Take ownership of projects and drive them to completion
Are excited about building automated infrastructure solutions
Have a strong focus on security and tenant isolation
Are comfortable with on-call responsibilities
Have experience with agile development methodologies
Have experience establishing new teams and best practices
Can balance technical leadership with hands-on implementation
Are excited about solving complex distributed systems challenges
Have experience with multi-cloud architectures and hybrid deployments

🍩 Benefits & perks (US Full-time employees)

Generous PTO, plus company holidays
Medical, dental, and vision coverage for you and your family
Paid parental leave for all parents (12 weeks)
Fertility and family planning support
Early-detection cancer testing through Galleri
Flexible spending account and dependent FSA options
Health savings account for eligible plans with company contribution
Annual work-life stipends for:
Home office setup, cell phone, internet
Wellness stipend for gym, massage/chiropractor, personal training, etc.
Learning and development stipend
Company-wide off-sites and team off-sites
Competitive compensation, company stock options and 401k

WRITER is an equal-opportunity employer and is committed to diversity. We don't make hiring or employment decisions based on race, color, religion, creed, gender, national origin, age, disability, veteran status, marital status, pregnancy, sex, gender expression or identity, sexual orientation, citizenship, or any other basis protected by applicable local, state or federal law. Under the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

By submitting your application on the application page, you acknowledge and agree to WRITER's Global Candidate Privacy Notice.

MLOps / DevOps Engineer

Apply

September 8, 2025


Developer Infrastructure, Tech Lead

Augment Code

101-200

USD


225000

-

300000

United States

Full-time

Remote

About Augment

Augment Code is the only AI coding assistant built for professional software engineers working in large, production-grade codebases. Our Context Engine understands your entire repo, enabling developers to stay in flow while writing, reviewing, and understanding code. Backed by top-tier investors and trusted by engineering teams at leading tech companies, Augment Code is redefining how modern software is built.

About the Role

We’re looking for a strong software engineer with a passion for helping developers be more productive by building the tools they need. As our Developer Infrastructure Tech Lead, you’ll design and drive the systems that make our engineers faster, safer, and more effective.

You’ll shape our CI/CD pipelines, developer tooling, and infrastructure strategy, ensuring we can scale effectively in a fast-moving startup environment. This is a hands-on leadership role where you’ll architect solutions, guide best practices, and work closely with feature teams to remove friction from the development lifecycle.

In this role you will:

Architect, build, and maintain scalable CI/CD systems that go beyond off-the-shelf solutions.
Define and implement strategies to maximize developer productivity, from build tooling to dev workflows.
Push the boundaries of how AI can be applied to developer workflows, experimenting with novel tools and approaches to unlock new levels of engineering productivity.
Partner with engineering teams to improve service-based architectures and streamline deployment processes.
Mentor and guide other engineers in building reliable, high-performance developer infrastructure.
Balance short-term fixes with long-term vision in a fast-paced startup environment.

You have:

Proven experience designing and building CI/CD systems from the ground up.
A strong understanding of developer productivity challenges and how to address them.
Deep experience with service-based architectures and Kubernetes.
Experience leading technical initiatives and collaborating across teams.
Comfort operating in a dynamic, unstructured startup environment.

While not required, it’s an added plus if you also have:

Experience with Bazel or other advanced build systems.
Familiarity with cloud platforms (especially GCP).
Familiarity with observability platforms and monitoring best practices.
Prior experience scaling infrastructure in a high-growth startup.

Employee Benefits:

Flexible work hours
Competitive salary & Equity
Tools Stipend
Health, Dental, Vision and Life Insurance
Short Term and Long Term Disability
Unlimited Paid Time Off + Holidays. We focus on trust and ownership, not time in the chair
Numerous company social events

We will do everything we can within reason to make sure that your interview takes place in an environment that fairly and accurately assesses your skills. If you need assistance or accommodation, please contact your recruiter.

Augment Code is proud to be an Equal Employment Opportunity employer. We do not discriminate based upon race, religion, color, national origin, gender (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics.

By applying for this job, the candidate acknowledges and agrees that any personal data contained in their application or supporting materials will be processed in accordance with Augment Code's Applicant Privacy Policy.

Pay Transparency Notice: The actual base salary within the stated range will be based on a combination of factors such as an individual's skills, experience level, educational background, and other relevant job-related considerations.

Annual Base Salary Range: $225,000 - $300,000 USD

MLOps / DevOps Engineer

Software Engineer

Apply

September 5, 2025


Top MLOps / DevOps Engineer Job Openings in 2025

Application Security Engineer

Senior Infrastructure Engineer

Senior Platform Engineer

Senior Site Reliability Engineer

Senior Cloud Network Engineer

Security and Compliance Lead

Cybersecurity - Site Reliability Engineer, X Money

DevOps Engineer I

Member of Technical Staff - Training Cluster Engineer

Director of production support


Member of Technical Staff - ML Infra

Senior Site Reliability Engineer - Managed Kubernetes

Security Engineer - Detection & Response

Senior Networking Engineer

Security Engineer - Architecture

Fabric SOC Architect

Engineering Manager, Cloud Security

Cloud platform engineer

Developer Infrastructure, Tech Lead

Popular Categories