Top MLOps / DevOps Engineer Jobs Openings in 2025
Looking for MLOps / DevOps Engineer opportunities? This curated list features the latest MLOps / DevOps Engineer job openings at AI-native companies. Whether you're an experienced professional or just entering the field, find roles that match your expertise, from startups to global tech leaders. Updated every day.
Kubernetes Platform Engineer
TensorWave
51-100
-
United States
Full-time
Remote
false
At TensorWave, we're leading the charge in AI compute, building a versatile cloud platform that's driving the next generation of AI innovation. We're focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what's possible in the AI landscape.
About the Role:
As a Kubernetes Platform Engineer focused on support and operations, you'll play a critical role in maintaining the stability and reliability of our bare-metal Kubernetes infrastructure. You will work closely with senior engineers, taking point on troubleshooting, incident response, and day-to-day cluster operations across multi-tenant workloads. This is a great opportunity for engineers ready to deepen their Kubernetes expertise while supporting cutting-edge AI environments in real time.

Responsibilities:
- Own and troubleshoot operational issues within Kubernetes environments
- Maintain and monitor core services (e.g., Cilium, HAProxy, Prometheus)
- Ensure uptime, performance, and reliability of multi-tenant clusters
- Assist with Ingress/Egress connectivity and network debugging
- Support internal and customer teams in secure, isolated VPC environments
- Collaborate with senior engineers on automation and cluster lifecycle improvements

Required Skills & Experience:
- 2–4 years of experience in DevOps, SRE, or Linux infrastructure roles
- 1+ years of hands-on experience with Kubernetes in production
- Familiarity with networking, CNI plugins, and core Linux troubleshooting
- Strong infrastructure-as-code mindset using tools like Helm, Terraform, or Ansible
- Solid experience with monitoring and logging tools (e.g., Prometheus, Grafana, Loki)
- Understanding of secure infrastructure design principles and least-privilege access
- Comfortable working in a team-oriented, fast-paced operational environment

Nice to Have:
- Experience with RKE2, Rancher, or similar platforms
- Experience troubleshooting or supporting AI or GPU-based workloads
- Familiarity with HAProxy, Cilium, or other Kubernetes ingress/networking tools

What We Bring:
In addition to a competitive salary, we offer a variety of benefits to support your needs, including:
- Stock Options
- 100% paid Medical, Dental, and Vision insurance
- Life and Voluntary Supplemental Insurance
- Short Term Disability Insurance
- Flexible Spending Account
- 401(k)
- Flexible PTO
- Paid Holidays
- Parental Leave
- Mental Health Benefits through Spring Health
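Day-to-day cluster operations like the ones this role describes usually start with pod-level triage across tenant namespaces. A minimal Python sketch of that first step (the pod records and the `flag_unready_pods` helper are illustrative stand-ins, not anything from the posting or the Kubernetes API schema):

```python
def flag_unready_pods(pods):
    """Return (namespace, name, phase) for pods not in a healthy phase.

    `pods` is a list of dicts shaped like the summaries you might extract
    from `kubectl get pods -A -o json` -- a simplified stand-in.
    """
    unhealthy = []
    for pod in pods:
        phase = pod.get("phase", "Unknown")
        if phase not in ("Running", "Succeeded"):
            unhealthy.append((pod["namespace"], pod["name"], phase))
    return unhealthy

pods = [
    {"namespace": "tenant-a", "name": "api-0", "phase": "Running"},
    {"namespace": "tenant-b", "name": "worker-1", "phase": "CrashLoopBackOff"},
    {"namespace": "kube-system", "name": "cilium-x", "phase": "Pending"},
]
print(flag_unready_pods(pods))
```

In practice this logic would sit behind a dashboard or alert rule rather than a script, but the triage step, separating healthy from unhealthy workloads per namespace, is the same.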
MLOps / DevOps Engineer
Data Science & Analytics
Apply
September 5, 2025
Senior Kubernetes Platform Engineer
TensorWave
51-100
-
United States
Full-time
Remote
false
At TensorWave, we're leading the charge in AI compute, building a versatile cloud platform that's driving the next generation of AI innovation. We're focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what's possible in the AI landscape.

Responsibilities:
- Design and deploy bare-metal Kubernetes clusters at scale using RKE2
- Collaborate with senior engineers on architectural improvements, infrastructure planning, and automation
- Lead the design and implementation of Ingress and Egress traffic solutions, leveraging HAProxy, Cilium, and other components
- Contribute to multi-tenant environment designs, including VPC-level isolation, network policy enforcement, and secure shared services
- Drive continuous improvement around observability using Prometheus and related tooling
- Serve as a subject matter expert in core Linux, networking, and Kubernetes internals
- Collaborate cross-functionally with AI platform teams and internal/external customers

Required Skills & Experience:
- 5+ years of experience in infrastructure or DevOps engineering roles
- 3+ years of hands-on experience managing Kubernetes in bare-metal environments
- Proven expertise in designing multi-tenant Kubernetes clusters with strong network isolation
- Deep understanding of Linux systems internals, networking (IPTables, CNI plugins, BGP), and DNS
- Experience with ingress controllers, load balancing, and service mesh (e.g., HAProxy, Cilium, Envoy)
- Strong infrastructure-as-code mindset using tools like Helm, Terraform, or Ansible
- Experience monitoring Kubernetes workloads with Prometheus and related observability tools

Nice to Have:
- Familiarity with RKE2, Rancher, or other downstream Kubernetes distributions
- Exposure to AI/ML infrastructure workloads or GPU resource scheduling
- Experience in infrastructure compliance or secure multi-tenancy (e.g., PCI, SOC2)

What We Bring:
In addition to a competitive salary, we offer a variety of benefits to support your needs, including:
- Stock Options
- 100% paid Medical, Dental, and Vision insurance
- Life and Voluntary Supplemental Insurance
- Short Term Disability Insurance
- Flexible Spending Account
- 401(k)
- Flexible PTO
- Paid Holidays
- Parental Leave
- Mental Health Benefits through Spring Health
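The multi-tenant isolation work this role calls for, network policy enforcement between tenants, boils down to label-selector matching at its core. A toy Python sketch of the equality-based selector semantics (the `traffic_allowed` helper and the policy shape are simplified illustrations, not real NetworkPolicy objects):

```python
def selector_matches(selector, labels):
    """True if every key/value in `selector` appears in `labels`.

    Mirrors the equality-based part of Kubernetes label selectors;
    a simplified illustration, not full NetworkPolicy semantics.
    """
    return all(labels.get(k) == v for k, v in selector.items())

def traffic_allowed(policy_selectors, src_labels):
    # Allow traffic if any ingress rule's pod selector matches the source pod.
    return any(selector_matches(sel, src_labels) for sel in policy_selectors)

policy = [{"tenant": "a"}]  # only pods labeled tenant=a may connect
print(traffic_allowed(policy, {"tenant": "a"}))
print(traffic_allowed(policy, {"tenant": "b"}))
```

Real enforcement happens in the CNI dataplane (Cilium in this stack), but the policy model engineers reason about when debugging cross-tenant connectivity looks much like this.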
MLOps / DevOps Engineer
Data Science & Analytics
Apply
September 5, 2025
Senior Support Engineer
Thoughtful
101-200
USD
0
120000
-
160000
United States
Full-time
Remote
false
Join Our Mission to Revolutionize Healthcare
Thoughtful is pioneering a new approach to automation for all healthcare providers! Our AI-powered Revenue Cycle Automation platform enables the healthcare industry to automate and improve its core business operations.

We're looking for an exceptional Principal Support Engineer to transform our support operations. As a critical member of our support organization, you will serve as the diagnostic expert and escalation point between our Customer Support Agents and Solutions Engineers. You'll bring enterprise-level troubleshooting expertise to complex customer issues, mentor the team on diagnostic methodologies, and help us scale our support capabilities to match our rapid growth. Your work will directly impact customer satisfaction and enable healthcare organizations to operate more efficiently.

Your Role:
- Diagnose: Lead comprehensive technical investigations using advanced troubleshooting methodologies (log analysis, network diagnostics, performance profiling)
- Resolve: Handle all support activities from Tier 1 through Tier 3, excluding code changes, serving as the primary escalation point for complex issues
- Mentor: Develop and train the support team on diagnostic best practices, creating playbooks and standardized troubleshooting procedures
- Collaborate: Work closely with CSAs, Solutions Engineers, and Forward Deployed Engineers to ensure smooth handoffs and efficient issue resolution
- Optimize: Identify patterns in support issues to drive process improvements and preventive measures
- Document: Create and maintain comprehensive troubleshooting guides, knowledge base articles, and root cause analyses

Your Qualifications:
- 5+ years of enterprise technical support experience in complex, distributed systems environments
- Expert-level diagnostic skills, including system and application log analysis, performance troubleshooting and profiling, network diagnostics (packet analysis, latency troubleshooting), and API and integration troubleshooting
- Proven track record of handling critical escalations and reducing resolution times
- Understanding of distributed systems, microservices architectures, and API integrations
- Excellent communication skills: ability to explain complex technical issues to both technical and non-technical audiences
- Familiarity with monitoring and observability tools (DataDog, New Relic, ELK stack, or similar)
- Working knowledge of scripting for automation (Python, Bash, or similar); not required to write production code
- Mentorship experience: demonstrated success in upskilling technical teams

What Sets You Apart:
- Healthcare IT experience: understanding of healthcare systems, HIPAA regulations, and healthcare data standards (HL7, FHIR)
- AI/ML exposure: familiarity with AI/ML systems and their unique troubleshooting requirements
- Python familiarity: ability to read and understand Python code for diagnostic purposes
- Process improvement mindset: track record of implementing scalable support processes
- Data-driven approach: experience using metrics to drive support improvements
- Crisis management: proven ability to lead during high-pressure customer escalations

Why Thoughtful?
- Competitive compensation aligned with senior support engineering market rates
- Equity participation: Employee Stock Options
- Health benefits: comprehensive medical, dental, and vision insurance
- Time off: generous leave policies and paid company holidays
- Impact: direct influence on support strategy and team development
- Growth: opportunity to build and lead as we scale

California Salary Range: $120,000–$160,000 USD
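The "identify patterns in support issues" part of this role is often just log-signature clustering: mask the variable bits of error lines and count what remains. A small hedged sketch of that idea (the log lines and the `claims-api` service name are invented examples, not from the posting):

```python
import re
from collections import Counter

def error_signature(line):
    """Collapse a log line to a coarse signature by masking IDs and numbers."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<hex>", line)
    return re.sub(r"\d+", "<n>", line)

def top_error_patterns(lines, k=3):
    """Count distinct error signatures and return the k most frequent."""
    counts = Counter(error_signature(l) for l in lines if "ERROR" in l)
    return counts.most_common(k)

logs = [
    "ERROR timeout after 30s calling claims-api",
    "ERROR timeout after 45s calling claims-api",
    "INFO request ok",
    "ERROR db connection 12 refused",
]
print(top_error_patterns(logs))
```

The two timeout lines collapse to one signature, surfacing the recurring issue; tools like the ELK stack do this at scale, but the underlying normalization step is the same.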
MLOps / DevOps Engineer
Data Science & Analytics
Apply
September 3, 2025
AI Infrastructure Engineer
TensorWave
51-100
USD
-
United States
Full-time
Remote
false
At TensorWave, we're leading the charge in AI compute, building a versatile cloud platform that's driving the next generation of AI innovation. We're focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what's possible in the AI landscape.

About the Role:
We are looking for an AI Infrastructure Engineer with a passion for high-performance computing and distributed systems. The ideal candidate will support our vision by developing and managing the compute infrastructure that underpins our innovative AI cloud services. This role involves building and maintaining robust AI clusters, ensuring optimal performance and reliability for our clients' most demanding workloads.

Responsibilities:
- Collaborate with a dynamic IT team to design, deploy, and maintain high-performance AI compute clusters supporting both AMD and NVIDIA GPU technologies.
- Lead initiatives to optimize cluster performance, resource utilization, and job scheduling to maximize efficiency across diverse AI workloads.
- Ensure system reliability, performance, and security for cloud services, implementing monitoring solutions and automated recovery systems.
- Work closely with the AI development team to align infrastructure capabilities with the evolving needs of TensorWave's cloud platform.
- Troubleshoot and resolve complex infrastructure issues across Linux systems, networking, and distributed computing environments, providing expert guidance to maintain high service levels.
- Implement and maintain configuration management, deployment automation, and infrastructure-as-code practices.

Essential Skills & Qualifications:
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- At least 5 years of relevant experience in infrastructure engineering, with a focus on supporting high-performance computing (HPC) and AI applications.
- Expert-level Linux system administration skills across multiple distributions.
- Strong experience with clustered computing environments (GPU, CPU, or hybrid clusters).
- Solid understanding of networking fundamentals, including TCP/IP, routing protocols, and high-speed interconnects.
- Experience with container technologies (Docker, Kubernetes), job schedulers (Slurm, PBS), and configuration management tools.
- Familiarity with AMD and NVIDIA GPU ecosystems, CUDA, ROCm, and their infrastructure requirements.
- Exceptional debugging and problem-solving abilities with a methodical approach to complex system issues.
- Demonstrated ability to learn new technologies quickly and adapt to rapidly evolving infrastructure needs.

We're looking for resilient, adaptable people to join our team: folks who enjoy collaborating and tackling tough challenges. We're all about offering real opportunities for growth, letting you dive into complex problems and make a meaningful impact through creative solutions. If you're a driven contributor, we encourage you to explore opportunities to make an impact at TensorWave. Join us as we redefine the possibilities of intelligent computing.

What We Bring:
In addition to a competitive salary, we offer a variety of benefits to support your needs, including:
- Stock Options
- 100% paid Medical, Dental, and Vision insurance
- Life and Voluntary Supplemental Insurance
- Short Term Disability Insurance
- Flexible Spending Account
- 401(k)
- Flexible PTO
- Paid Holidays
- Parental Leave
- Mental Health Benefits through Spring Health
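The job-scheduling optimization mentioned in this posting is, at its simplest, a bin-packing problem: place GPU jobs onto nodes with free capacity. A toy first-fit sketch (a deliberately simplified model of what Slurm-style schedulers do; the node and job names are invented):

```python
def first_fit(jobs, nodes):
    """Assign each job (name, gpus_needed) to the first node with capacity.

    `nodes` maps node name -> free GPU count and is mutated as jobs land.
    Jobs that fit nowhere map to None. Real schedulers like Slurm also
    weigh priority, fairness, and topology -- this shows only placement.
    """
    placement = {}
    for name, need in jobs:
        for node, free in nodes.items():
            if free >= need:
                nodes[node] = free - need
                placement[name] = node
                break
        else:
            placement[name] = None
    return placement

nodes = {"node-1": 8, "node-2": 8}
jobs = [("train-a", 8), ("train-b", 4), ("infer-c", 6)]
print(first_fit(jobs, nodes))
```

Note how the 6-GPU job is stranded even though 4 GPUs remain free across the cluster: fragmentation like this is exactly why utilization tuning is called out as a responsibility.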
MLOps / DevOps Engineer
Data Science & Analytics
Apply
September 1, 2025
Member of Technical Staff - GPU Infrastructure
Prime Intellect
11-50
-
United States
Full-time
Remote
false
Building the Future of Decentralized AI Development
At Prime Intellect, we're enabling the next generation of AI breakthroughs by helping our customers deploy and optimize massive GPU clusters. As our Solutions Architect for GPU Infrastructure, you'll be the technical expert who transforms customer requirements into production-ready systems capable of training the world's most advanced AI models.

We recently raised $15mm in funding (a total of $20mm raised) led by Founders Fund, with participation from Menlo Ventures and prominent angels including Andrej Karpathy (Eureka AI, Tesla, OpenAI), Tri Dao (Chief Scientific Officer of Together AI), Dylan Patel (SemiAnalysis), Clem Delangue (Hugging Face), Emad Mostaque (Stability AI), and many others.

Core Technical Responsibilities
This customer-facing role combines deep technical expertise with hands-on implementation. You'll be instrumental in:

Customer Architecture & Design
- Partner with clients to understand workload requirements and design optimal GPU cluster architectures
- Create technical proposals and capacity planning for clusters ranging from 100 to 10,000+ GPUs
- Develop deployment strategies for LLM training, inference, and HPC workloads
- Present architectural recommendations to technical and executive stakeholders

Infrastructure Deployment & Optimization
- Deploy and configure orchestration systems including SLURM and Kubernetes for distributed workloads
- Implement high-performance networking with InfiniBand, RoCE, and NVLink interconnects
- Optimize GPU utilization, memory management, and inter-node communication
- Configure parallel filesystems (Lustre, BeeGFS, GPFS) for optimal I/O performance
- Tune system performance from kernel parameters to CUDA configurations

Production Operations & Support
- Serve as the primary technical escalation point for customer infrastructure issues
- Diagnose and resolve complex problems across the full stack: hardware, drivers, networking, and software
- Implement monitoring, alerting, and automated remediation systems
- Provide 24/7 on-call support for critical customer deployments
- Create runbooks and documentation for customer operations teams

Technical Requirements

Required Experience
- 3+ years of hands-on experience with GPU clusters and HPC environments
- Deep expertise with SLURM and Kubernetes in production GPU settings
- Proven experience with InfiniBand configuration and troubleshooting
- Strong understanding of NVIDIA GPU architecture, the CUDA ecosystem, and the driver stack
- Experience with infrastructure automation tools (Ansible, Terraform)
- Proficiency in Python, Bash, and systems programming
- Track record of customer-facing technical leadership

Infrastructure Skills
- NVIDIA driver installation and troubleshooting (CUDA, Fabric Manager, DCGM)
- Container runtime configuration for GPUs (Docker, containerd, Enroot)
- Linux kernel tuning and performance optimization
- Network topology design for AI workloads
- Power and cooling requirements for high-density GPU deployments

Nice to Have
- Experience with 1000+ GPU deployments
- NVIDIA DGX, HGX, or SuperPOD certification
- Distributed training frameworks (PyTorch FSDP, DeepSpeed, Megatron-LM)
- ML framework optimization and profiling
- Experience with AMD MI300 or Intel Gaudi accelerators
- Contributions to open-source HPC/AI infrastructure projects

Growth Opportunity
You'll work directly with customers pushing the boundaries of AI, from startups training foundation models to enterprises deploying massive inference infrastructure. You'll collaborate with our world-class engineering team while having a direct impact on systems powering the next generation of AI breakthroughs.

We value expertise and customer obsession. If you're passionate about building reliable, high-performance GPU infrastructure and have a track record of successful large-scale deployments, we want to talk to you.

Apply now and join us in our mission to democratize access to planetary-scale computing.
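The capacity-planning duty above (clusters from 100 to 10,000+ GPUs) starts with back-of-envelope arithmetic: GPUs to nodes to racks. A hedged Python sketch (the 8-GPU-per-node default matches common HGX-style servers, but both defaults here are illustrative assumptions, not vendor specs or anything from the posting):

```python
import math

def cluster_plan(total_gpus, gpus_per_node=8, nodes_per_rack=4):
    """Back-of-envelope node and rack counts for a GPU cluster.

    gpus_per_node=8 reflects common 8-GPU servers; nodes_per_rack=4 is a
    rough power/cooling density assumption. Both are illustrative defaults.
    """
    nodes = math.ceil(total_gpus / gpus_per_node)
    racks = math.ceil(nodes / nodes_per_rack)
    return {"nodes": nodes, "racks": racks}

# A 1,000-GPU request: 125 nodes, 32 racks under these assumptions.
print(cluster_plan(1000))
```

Real proposals layer in interconnect topology, power budgets, and spares on top of this, but every sizing conversation begins with this division.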
MLOps / DevOps Engineer
Data Science & Analytics
Solutions Architect
Software Engineering
Apply
August 29, 2025
Backline Manager (Apache Spark™)
Databricks
5000+
-
Netherlands
Remote
false
P-1455 Job Description
At Databricks, we are passionate about enabling data teams to solve the world's toughest problems, from making the next mode of transportation a reality to accelerating the development of medical breakthroughs. We do this by building and running the world's best data and AI infrastructure platform so our customers can use deep data insights to improve their business. Founded by engineers, and customer-obsessed, we leap at every opportunity to tackle technical challenges, from designing next-gen UI/UX for interfacing with data to scaling our services and infrastructure across millions of virtual machines. And we're only getting started.

About the Team
The Backline Engineering Team serves as the critical bridge between Engineering and Frontline Support. We handle complex technical issues and escalations across the Apache Spark™ ecosystem and the Databricks Platform stack. With a strong focus on customer success, we are committed to delivering exceptional customer satisfaction by providing deep technical expertise, proactive issue resolution, and continuous improvements to the platform. We emphasize automation and tooling to enhance troubleshooting efficiency, reduce manual effort, and improve the overall supportability of the platform. By developing smart solutions and streamlining workflows, we drive operational excellence and ensure a seamless experience for both customers and internal teams.

The impact you will have:
- Hire and develop top talent to build an outstanding team.
- Mentor engineers, provide clear feedback, and develop future leaders in the team.
- Establish and maintain high standards in troubleshooting, automation, and tooling to improve efficiency.
- Work closely with Engineering to enhance observability, debugging tools, and automation, reducing escalations.
- Collaborate with Frontline Support, Engineering, and Product teams to improve customer escalations and support processes.
- Define a long-term roadmap for Backline, focusing on automation, tool development, bug fixing, and proactive issue resolution.
- Take ownership of high-impact customer escalations by leading critical incident response during Databricks runtime outages and major incidents.
- Participate in weekday and weekend on-call rotations, ensuring fast and effective resolution of urgent issues.
- Balance real-time escalations with day-to-day planning, multitasking efficiently to drive operational excellence and provide top-tier support for mission-critical customer environments.

What We Look For:
- 10–12 years of experience in the Big Data/data-warehousing ecosystem with expertise in Apache Spark™, including at least 4+ years in a managerial role.
- Proven ability to manage and mentor a team of Backline Engineers, guiding career development.
- Strong technical expertise in Apache Spark™, Databricks Runtime, Delta Lake, Hadoop, and cloud platforms (AWS, Azure, GCP) to troubleshoot complex customer issues.
- Ability to oversee and drive customer escalations, ensuring seamless coordination between Frontline Support and Backline Engineering.
- Experience in designing and developing best practices, runbooks/playbooks, and enablement programs to improve troubleshooting efficiency.
- Strong automation mindset, identifying tooling and process gaps and leading efforts to build scripts and automated tools to enhance support operations.
- Skilled in collaborating with Engineering and Product Management teams, contributing to support readiness programs and shaping product supportability improvements.
- Experience in building monitoring and alerting mechanisms, proactively identifying long-running cases and driving early intervention.
- Ability to handle critical technical escalations, providing deep expertise in architecture, best practices, product functionality, performance tuning, and cloud operations.
- Strong interviewing and hiring capabilities, identifying and recruiting top Backline talent with expertise in big data and cloud ecosystems.

About Databricks
Databricks is the data and AI company. More than 10,000 organizations worldwide, including Comcast, Condé Nast, Grammarly, and over 50% of the Fortune 500, rely on the Databricks Data Intelligence Platform to unify and democratize data, analytics, and AI. Databricks is headquartered in San Francisco, with offices around the globe, and was founded by the original creators of Lakehouse, Apache Spark™, Delta Lake, and MLflow. To learn more, follow Databricks on Twitter, LinkedIn, and Facebook.
Benefits
At Databricks, we strive to provide comprehensive benefits and perks that meet the needs of all of our employees. For specific details on the benefits offered in your region, please visit https://www.mybenefitsnow.com/databricks.
Our Commitment to Diversity and Inclusion At Databricks, we are committed to fostering a diverse and inclusive culture where everyone can excel. We take great care to ensure that our hiring practices are inclusive and meet equal employment opportunity standards. Individuals looking for employment at Databricks are considered without regard to age, color, disability, ethnicity, family or marital status, gender identity or expression, language, national origin, physical and mental ability, political affiliation, race, religion, sexual orientation, socio-economic status, veteran status, and other protected characteristics. Compliance If access to export-controlled technology or source code is required for performance of job duties, it is within Employer's discretion whether to apply for a U.S. government license for such positions, and Employer may decline to proceed with an applicant on this basis alone.
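The "proactively identifying long-running cases" requirement above amounts to sweeping open tickets against an age threshold. A minimal Python sketch of that alerting check (the case tuples and the 14-day SLA are invented for illustration, not Databricks process):

```python
from datetime import datetime, timedelta

def long_running_cases(cases, now, sla=timedelta(days=14)):
    """Return IDs of cases still open longer than `sla`.

    `cases` is a list of (case_id, opened_at, closed) tuples -- a
    simplified stand-in for a real ticketing system's data model.
    """
    return [
        cid for cid, opened, closed in cases
        if not closed and now - opened > sla
    ]

now = datetime(2025, 8, 29)
cases = [
    ("C-100", datetime(2025, 8, 1), False),   # 28 days open -> flag
    ("C-101", datetime(2025, 8, 20), False),  # 9 days open  -> within SLA
    ("C-102", datetime(2025, 7, 1), True),    # closed       -> ignore
]
print(long_running_cases(cases, now))
```

Wired into a scheduled job, a check like this is what turns "early intervention" from intent into an alert someone actually receives.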
MLOps / DevOps Engineer
Data Science & Analytics
Data Engineer
Data Science & Analytics
Apply
August 29, 2025
Customer Support Engineer, India
Together AI
201-500
-
India
Full-time
Remote
true
Customer Support Engineer
Location: India (Remote)

About the role:
As a Customer Support Engineer at a pioneering AI company, you'll be the first line of defense supporting customers as they build out training, fine-tuning, and inference solutions with Together AI. You'll dive deep into complex technical challenges, providing swift and effective solutions while serving as a product expert. As part of the Customer Experience organization, you will collaborate closely with product and sales, driving continuous improvement of our offerings. This is an exciting opportunity for a deeply technical professional passionate about AI and customer success to make a significant impact in a fast-paced, innovative environment.

Responsibilities:
- Engage directly with customers to tackle and resolve complex technical challenges involving our cutting-edge GPU clusters and our inference and fine-tuning services; ensure swift and effective solutions every time.
- Become a product expert in all of our Gen AI solutions, serving as the last line of technical defense before issues are escalated to Engineering and Product teams.
- Collaborate seamlessly across Engineering, Research, and Product teams to address customer concerns; work with senior leaders both internally and externally to ensure the highest levels of customer satisfaction.
- Transform customer insights into action by identifying patterns in support cases and working with Engineering and Go-To-Market teams to drive Together's roadmap (e.g., future models to support).
- Maintain detailed documentation of system configurations, procedures, troubleshooting guides, and FAQs to facilitate knowledge sharing with the team and customers.
- Be flexible in providing support coverage during holidays, nights, and weekends as required by business needs to ensure consistent and reliable service for our customers.

Qualifications:
- 5+ years of experience in a customer-facing technical role, with at least 1 year in a support function in AI
- Strong technical background, with knowledge of AI, ML, and GPU technologies and their integration into high-performance computing (HPC) environments
- Familiarity with infrastructure services (e.g., Kubernetes, SLURM), infrastructure-as-code solutions (e.g., Ansible), high-performance network fabrics, NFS-based storage management, container infrastructure, and scripting and programming languages
- Familiarity with operating storage systems in HPC environments such as VAST and Weka
- Familiarity with inspecting and resolving network-related errors
- Strong knowledge of Python, TypeScript, and/or JavaScript, with testing/debugging experience using curl and Postman-like tools
- Foundational understanding of the installation, configuration, administration, troubleshooting, and securing of compute clusters
- Complex technical problem solving and troubleshooting, with a proactive approach to issue resolution
- Ability to work cross-functionally with teams such as Sales, Engineering, Support, Product, and Research to drive customer success
- Strong sense of ownership and willingness to learn new skills to ensure both team and customer success
- Excellent communication and interpersonal skills, with the ability to explain complex technical concepts to non-technical stakeholders
- Ability to operate in dynamic environments, adept at managing multiple projects, and comfortable with frequent context switching and prioritization

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work for the respective hiring region. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.
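When debugging customer-reported API errors of the kind this role handles, a standard first move is to distinguish transient failures from persistent ones with a retry-and-backoff probe. A generic Python sketch of the pattern (the `flaky` endpoint is simulated here; nothing in it is specific to Together AI's API):

```python
import time

def with_retries(call, attempts=3, base_delay=0.01):
    """Retry `call` with exponential backoff; re-raise after the last try.

    `call` is any zero-argument function -- in practice, an HTTP request
    made with a client library or curl-equivalent.
    """
    for i in range(attempts):
        try:
            return call()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))

# Simulate an endpoint that fails twice, then succeeds.
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("upstream reset")
    return 200

print(with_retries(flaky))
```

If the call succeeds under retries, the issue is likely transient (network, load); if it never does, it escalates as a persistent fault, which is exactly the triage split support engineers document.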
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
August 28, 2025
Datacenter Liquid Cooling Architect
Tenstorrent
1001-5000
USD
0
100000
-
500000
Canada
United States
Full-time
Remote
false
Tenstorrent is leading the industry on cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. With AI redefining the computing paradigm, solutions must evolve to unify innovations in software models, compilers, platforms, networking, and semiconductors. Our diverse team of technologists has developed a high-performance RISC-V CPU from scratch, and shares a passion for AI and a deep desire to build the best AI platform possible. We value collaboration, curiosity, and a commitment to solving hard problems. We are growing our team and looking for contributors of all seniorities.

At Tenstorrent, we're building the future of AI compute, and keeping that future cool requires innovation at scale. We're looking for an engineer who thrives on solving complex infrastructure challenges to design and deliver the next generation of liquid cooling systems for large AI clusters. In this role, you'll work closely with cross-functional teams, create resilient and reliable cooling strategies, and help shape datacenter infrastructure that powers breakthrough AI workloads.

This role is hybrid, based out of Toronto, Canada; Austin, Texas; or Santa Clara, California. We welcome candidates at various experience levels for this role. During the interview process, candidates will be assessed for the appropriate level, and offers will align with that level, which may differ from the one in this posting.

Who You Are:
- An engineer with a background in datacenter thermal design (a degree in Electrical or Computer Engineering is valuable but not required).
- Someone who enjoys tackling complex liquid cooling challenges and has experience working directly with cooling systems.
- Comfortable with fluids, pressure testing, and leak detection to ensure safe, reliable designs.
- Familiar with monitoring and control systems and how they integrate with facility HVAC infrastructure.
- Experienced with single-phase liquid cooling (bonus if you've worked with two-phase).

What We Need:
- A technical leader to architect and implement liquid cooling infrastructure for AI training and inference clusters.
- An engineer to define operational standards, safety protocols, and CDU control strategies that maintain uptime.
- A collaborator who can partner with mechanical, software, and system engineering teams to deliver advanced cooling solutions.
- An innovator to design leak detection methods and monitoring systems that safeguard mission-critical environments.
- A trusted contributor to support AI cluster deployments for internal and external customers.

What You Will Learn:
- Collaboration with experts across thermal, mechanical, and systems engineering.
- Practical experience integrating telemetry, sensors, and CDU controls into datacenter operations.
- Exposure to next-generation liquid cooling technologies, including pumped two-phase solutions.
- A chance to help define industry-leading infrastructure that supports the world's most advanced AI systems.

Compensation for all engineers at Tenstorrent ranges from $100k to $500k, including base and variable compensation targets. Experience, skills, education, background, and location all impact the actual offer made. Tenstorrent offers a highly competitive compensation package and benefits, and we are an equal opportunity employer.

This offer of employment is contingent upon the applicant being eligible to access U.S. export-controlled technology. Due to U.S. export laws, including those codified in the U.S. Export Administration Regulations (EAR), the Company is required to ensure compliance with these laws when transferring technology to nationals of certain countries (such as EAR Country Groups D:1, E:1, and E:2). These requirements apply to persons located in the U.S. and all countries outside the U.S. As the position offered will have direct and/or indirect access to information, systems, or technologies subject to these laws, the offer may be contingent upon your citizenship/permanent residency status or your ability to obtain prior license approval from the U.S. Commerce Department or the applicable federal agency. If employment is not possible due to U.S. export laws, any offer of employment will be rescinded.
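Sizing the liquid cooling this role architects starts from the heat-balance relation Q = m_dot * c_p * dT: the coolant flow needed scales with heat load and inversely with the allowed temperature rise. A hedged Python sketch with water-like defaults (the numbers are textbook fluid properties and an illustrative dT, not a CDU spec or anything from the posting):

```python
def coolant_flow_lpm(heat_kw, delta_t=10.0, cp=4.18, density=1.0):
    """Required coolant flow in liters/minute to absorb `heat_kw` of IT load.

    Rearranges Q = m_dot * c_p * dT, with c_p in kJ/(kg*K), dT in Kelvin,
    and density in kg/L. Defaults approximate water; glycol mixes differ.
    """
    kg_per_s = heat_kw / (cp * delta_t)        # mass flow from heat balance
    return kg_per_s / density * 60.0           # convert kg/s to L/min

# A 100 kW rack at a 10 K coolant rise needs roughly 143-144 L/min of water.
print(round(coolant_flow_lpm(100), 1))
```

Halving the allowed temperature rise doubles the required flow, which is why dT budgets, pump sizing, and CDU control strategies are designed together.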
MLOps / DevOps Engineer
Data Science & Analytics
Apply
August 26, 2025
Senior Manager - Security Incident Detection and Response
Lambda AI
501-1000
USD
0
360000
-
540000
United States
Full-time
Remote
false
We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be.
If you'd like to build the world's best deep learning cloud, join us.
*Note: This position requires presence in our San Francisco office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

About the Role
Lambda Security protects some of the world's most valuable digital assets: invaluable training data, model weights representing immense computational investments, and the sensitive inputs required to leverage best-of-breed AI models. We're responsible for securing every byte that powers breakthrough artificial intelligence.

Reporting to the Head of Security, you'll lead the Detection & Response team that acts as an intelligent backstop, ensuring Lambda is the safest place to build with AI by catching security issues in real time while enabling our business to move at hypergrowth velocity. Your team will transform reactive security operations into a proactive threat management engine, dedicating the majority of their effort to automation, threat hunting, and capability building rather than constant firefighting.

Your team will directly affect customers’ trust in the safety of their data by implementing enterprise-grade detection capabilities, automating incident response workflows, and hardening our multi-cloud and bare-metal infrastructure, all while you build sustainable programs where senior engineers thrive solving novel security challenges. With unique access to LLMs hosted on our own infrastructure, your team will pioneer AI-powered security solutions that wouldn't be possible anywhere else. Key priorities include 24/7 operational coverage, maintaining customer trust through rapid incident response, and delivering a comprehensive D&R strategy within your first 6 months.

If you're excited about building security operations so efficient that they're "almost boring," where automation handles the routine so your team can focus on the novel, we want to talk.

We value diverse backgrounds, experiences, and skills, and we are excited to hear from candidates who can bring unique perspectives to our team. If you do not exactly meet this description but believe you may be a good fit, please still apply and help us understand your readiness for a Senior Manager role. Your application is not a waste of our time.

What You’ll Do

Team Leadership & Management:
Build, hire, and lead a high-performing Detection & Response team that can scale with Lambda's hypergrowth while maintaining 24/7 operational excellence
Define team processes, culture, and operating rhythms that balance startup agility with security discipline, creating an environment where senior engineers thrive on automation and novel challenges
Conduct regular one-on-ones, provide constructive feedback, and create clear career development paths that help security engineers advance their technical and leadership skills
Drive outcomes by managing project priorities, deadlines, and deliverables while establishing our blameless post-incident culture focused on systemic improvements rather than individual accountability

Technical Strategy & Execution:
Define and implement threat management frameworks that transform reactive security operations into proactive threat hunting and detection, establishing automation standards that eliminate repetitive work and enable your team to focus on novel challenges
Architect incident response processes and escalation frameworks that protect Lambda from impact while scaling with the company’s growth
Guide technology choices and evangelize new security tools, including pioneering AI-powered detection capabilities using our direct access to state-of-the-art LLMs

Strategic Collaboration & Business Impact:
Create data-driven insights showing where we are reacting most frequently to guide investments in preventative controls
Partner with Product and Platform engineering teams to evolve our detection and response capabilities as Lambda’s infrastructure grows
Establish executive reporting that translates technical incidents into business impact while maintaining a blameless culture focused on systemic improvements

Operational Excellence & Scaling:
Drive weekly operations reviews that ensure nothing falls through the cracks while building institutional knowledge and defining repeatable processes from every incident
Define sustainable on-call rotations and operational procedures that maintain 24/7 coverage without burning out senior engineers
Establish the team's 6-month strategic roadmap for comprehensive D&R capabilities while defining success criteria and measurable outcomes

What We Think a Candidate Needs to Demonstrate to Succeed
10+ years of security experience with 5+ years leading technical teams, demonstrating the ability to build and manage independently
Proven ability to define and build security programs from the ground up that accelerate business initiatives, with demonstrated experience establishing team processes, technical frameworks, and cross-functional partnerships
Excellence at building automation-first security programs where technology eliminates toil and teams never do the same thing twice
Clear understanding of the unique requirements of securing a cloud infrastructure provider
Proven ability to create sustainable team cultures where senior engineers thrive long-term rather than burning out on repetitive tasks
Strong judgment in security response, understanding real business impact and calibrating actions proportionally
Track record of translating technical security work into executive communications and business-aligned metrics
Thrives in high-ambiguity environments where you must build structure while executing at startup pace

Nice to Have
Excitement about leveraging our direct access to state-of-the-art LLMs to revolutionize security operations: imagine AI-powered threat hunting, automated security report generation, and intelligent vulnerability prioritization at a scale only possible when you host the AI infrastructure yourself
Experience building D&R programs at AI/ML companies
Track record using AI/ML for security operations automation (yes, we know it’s all brand new)
Background scaling security during hypergrowth (10x growth phases)
Deep technical background allowing hands-on contribution when needed
Experience with both build and buy decisions for security tooling
Experience driving or providing significant evidence for compliance audits, such as SOC 2, ISO 27001, PCI-DSS, HIPAA/HITECH, or FedRAMP

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
August 26, 2025
Detection and Response Engineer
Cerebras Systems
501-1000
-
India
Remote
false
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs. Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services.

About The Role
We are seeking an exceptional Detection and Response Engineer to serve on the front lines, where you will build systems to detect threats, investigate incidents, and lead coordinated response across teams. The right candidate brings hands-on experience creating reliable detections, automating repetitive tasks, and turning investigation findings into durable improvements to our security program, with an interest in exploring AI-driven automation.

Responsibilities
Create and optimize detections, playbooks, and workflows to quickly identify and respond to potential incidents.
Investigate security events and participate in incident response, including on-call responsibilities.
Automate investigation and response workflows to reduce time to detect and remediate incidents.
Build and maintain detection and response capabilities as code, applying modern software engineering rigor.
Explore and apply emerging approaches, potentially leveraging AI, to strengthen our security posture.
Document investigation and response procedures as clear runbooks for triage, escalation, and containment.

Skills And Qualifications
3–5 years of experience in detection engineering, incident response, or security engineering.
Strong proficiency in Python and query languages such as SQL, with the ability to write clean, maintainable, and testable code.
Practical knowledge of detection and response across cloud, identity, and endpoint environments.
Familiarity with attacker behaviors and the ability to translate them into durable detection logic.
Strong fundamentals in operating systems, networking, and log analysis.
Excellent written communication skills, with the ability to create clear documentation.

Why Join Cerebras
People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we’ve reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:
Build a breakthrough AI platform beyond the constraints of the GPU.
Publish and open source their cutting-edge AI research.
Work on one of the fastest AI supercomputers in the world.
Enjoy job stability with startup vitality.
Our simple, non-corporate work culture that respects individual beliefs.
Read our blog: Five Reasons to Join Cerebras in 2025.

Apply today and become part of the forefront of groundbreaking advancements in AI! Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth, and support of those around them. This website or its third-party tools process personal data.
For more details, click here to review our CCPA disclosure notice.
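The "detection and response capabilities as code" responsibility in the listing above is a common industry pattern rather than anything Cerebras-specific. As a rough, hypothetical sketch (the rule name, log fields, and threshold below are illustrative assumptions, not any employer's actual schema), a detection rule managed as code might look like:

```python
from dataclasses import dataclass
from typing import Callable

# A minimal "detection as code" sketch: each rule is plain data plus a predicate,
# so rules can live in version control and be unit-tested like any other code.
@dataclass
class DetectionRule:
    name: str
    severity: str
    predicate: Callable[[dict], bool]  # receives one parsed log event

    def matches(self, event: dict) -> bool:
        return bool(self.predicate(event))

# Hypothetical rule: many failed logins from one source in a single batch.
def too_many_failed_logins(event: dict) -> bool:
    return event.get("action") == "login_failure" and event.get("count", 0) >= 5

RULES = [DetectionRule("brute-force-login", "high", too_many_failed_logins)]

def run_rules(events: list[dict]) -> list[tuple[str, dict]]:
    """Evaluate every rule against every event; return (rule name, event) hits."""
    return [(r.name, e) for r in RULES for e in events if r.matches(e)]
```

Because each rule is an ordinary object, it can ship with assertions in CI, which is one concrete reading of the "modern software engineering rigor" the posting asks for.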
MLOps / DevOps Engineer
Data Science & Analytics
Apply
August 26, 2025
Engineering Manager - Site Reliability Engineering/SRE (f/m/d)*
Parloa
201-500
-
Germany
Full-time
Remote
false
YOUR MISSION:
As an Engineering Manager for Site Reliability Engineering & Developer Experience (f/m/d) at Parloa, you will nurture and support a collaborative team that ensures the reliability, scalability, and performance of our products while empowering engineers with thoughtful tools and workflows. Your mission is to cultivate and grow SRE practices that enable our architectural transformation and ensure our systems meet availability targets. You'll foster a caring culture where reliability is a shared responsibility and where automation helps reduce toil, empowering engineers to focus on meaningful work.

IN THIS ROLE YOU WILL:
Build and nurture a supportive team that harmonizes SRE excellence with developer experience
Collaborate to establish SRE practices: SLI/SLOs, error budgets, and trust-based postmortems using Datadog metrics
Create comprehensive observability strategies leveraging our monitoring stack
Support sustainable incident response, on-call processes, and automation using GitHub Actions and Terraform to improve MTTR
Partner with engineers and engineering teams to integrate reliability practices into CI/CD pipelines (ArgoCD, GitHub Actions) while supporting developer wellbeing
Foster adoption of reliability best practices across our Kubernetes-based infrastructure through mentorship and collaboration
Guide teams in leveraging our Azure cloud platform effectively while preparing for multi-cloud architectures
Support the thoughtful adoption of AI tools (Cursor, GitHub Copilot) for operational efficiency

WHAT YOU BRING TO THE TABLE:
Experience supporting SRE, DevOps, or platform teams with a focus on reliability and collaboration
Understanding of SRE principles: SLI/SLOs, error budgets, and toil reduction
Hands-on experience with our observability stack (Datadog for metrics/APM, ELK for sensitive logs) and production systems at scale
Deep empathy for developer workflows and creating sustainable on-call processes that support work-life balance
Familiarity with Infrastructure as Code using Terraform and container orchestration with Kubernetes
Experience with CI/CD platforms (GitHub Actions, ArgoCD) and integrating reliability into deployment pipelines
Understanding of Azure cloud services and preparing organizations for multi-cloud transformations
Warm communication skills for incident support, postmortem facilitation, and cross-team collaboration
Experience with authentication systems (e.g. Okta, EntraID) and their role in system reliability
Commitment to building inclusive teams that balance operational care with innovation and wellbeing
Appreciation for AI-assisted tools in improving operational efficiency and reducing toil
Background with databases (MySQL, Redis, MongoDB) and their reliability considerations is valued
Experience with multi-cloud architectures and distributed systems is warmly welcomed
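SLI/SLOs and error budgets appear twice in this posting. As a quick refresher on the standard SRE arithmetic behind them (generic, not anything Parloa-specific), an error budget is simply the downtime a service may accrue in a window while still meeting its availability SLO:

```python
# Generic SRE arithmetic: the error budget implied by an availability SLO.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is blown)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

# Example: a 99.9% SLO over 30 days allows 43.2 minutes of downtime;
# spending the budget faster than that is what burn-rate alerts watch for.
```

This is why "SLI/SLOs and error budgets" pair naturally with the posting's MTTR goal: the budget quantifies how much incident time the team can absorb before reliability work must take priority over feature work.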
WHAT'S IN IT FOR YOU?
Join a diverse team of 40+ nationalities with flat hierarchies and a collaborative company culture.
Opportunity to build and scale your career at the intersection of customer-facing roles and engineering in a dynamic startup on its journey to become an international leader in SaaS platforms for Conversational AI.
Deutschlandticket, Urban Sports Club, JobRad, Nilo Health, weekly sponsored office lunches.
Competitive compensation and equity package.
Flexible working hours, 28 vacation days, and workation opportunities.
Access to a training and development budget for continuous professional growth.
Regular team events, game nights, and other social activities.
Hybrid work environment. However, we love to build real connections and want to welcome everyone in our beautiful Berlin office on certain days.

Your recruiting process at Parloa: Recruiter video call → Meet your manager → Technical Interview + Technical Leadership Interview → Bar Raiser Interview

Why Parloa?
Parloa is one of the fastest growing startups in the world of Generative AI and customer service. Parloa’s voice-first GenAI platform for contact centers is built on the best AI technology to automate customer service with natural-sounding conversations for outstanding experiences on all communication channels. Leveraging natural language processing (NLP) and machine learning, Parloa creates intelligent phone and chat solutions for businesses that turn contact centers into value centers by boosting customer service efficiency. The Parloa platform resolves the majority of customer queries quickly and automatically, allowing human agents to focus on complex issues and relationships. Parloa was founded in 2018 by Malte Kosub and Stefan Ostwald and today employs over 400 people in Berlin, Munich, and New York.

When you join Parloa, you become part of a dynamic and innovative team made up of over 34 nationalities that’s revolutionizing an entire industry. We’re passionate about growing together and creating opportunities for personal and professional development. With our recent $120 million Series C investment, we’re expanding globally and looking for talented individuals to join us on this exciting journey.

Do you have questions about Parloa, the role, or our team before you apply? Please feel free to get in touch with our Hiring Team.

Parloa is committed to upholding the highest data protection standards for our clients' and employees' data. All our employees are instrumental in ensuring the utmost care, GDPR, and ISO compliance, including ISO 27001, in handling sensitive information.

* We provide equal opportunities to all qualified applicants regardless of race, gender, sexual orientation, age, religion, national origin, disability status, socioeconomic background, and other characteristics.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
August 26, 2025
Manager, HPC Design
Lambda AI
501-1000
USD
0
330000
-
550000
United States
Full-time
Remote
false
We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be.
If you'd like to build the world's best deep learning cloud, join us.
*Note: This position requires presence in our San Francisco office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems, as well as internal tooling for system deployment, management, and maintenance.

What You’ll Do
Lead a team of system designers responsible for translating architecture into detailed, executable infrastructure designs across compute, storage, and networking.
Build and mature repeatable processes that turn Lambda’s reference architectures into site- and customer-specific deployment plans.
Own the delivery of infrastructure design packages, ensuring solutions meet functional requirements, budget targets, and delivery timelines.
Partner closely with architecture, product, engineering, and customer teams to ensure alignment between design execution and platform roadmap.
Guide the creation and review of design specifications, integration plans, and validation processes for new deployments.
Mentor and grow a high-performing team of infrastructure designers, focused on disciplined execution and iterative delivery.

You
Have 5+ years of experience designing HPC or cloud infrastructure at scale, with 2+ years in a technical leadership or management role.
Understand the practical application of compute, storage, and network architectures in real-world, large-scale deployments.
Can take an established architectural direction and lead your team in producing high-quality designs that deliver reliably, on time, and within scope.
Are adept at managing tradeoffs and risk in infrastructure delivery, balancing technical ambition with operational realism.
Have experience mentoring senior-level technical contributors and building cohesive, execution-focused teams.

Nice to Have
Experience supporting AI/ML or simulation workloads in high-performance environments.
Familiarity with system integration, hardware bring-up, or working alongside hardware engineering teams.
Background in designing for hyperscale or enterprise environments, especially where customer requirements drive significant customization.
Understanding of delivery workflows, BOM creation, and vendor coordination in infrastructure deployment contexts.

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
August 25, 2025
Staff HPC Hardware Engineer
Lambda AI
501-1000
USD
349000
-
581000
United States
Full-time
Remote
false
We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be.
If you'd like to build the world's best deep learning cloud, join us.
*Note: This position requires presence in our San Jose office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.
Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.
What You’ll Do
Serve as the technical lead for integrating OEM and white-label compute, storage, and network hardware into Lambda’s HPC platform reference architectures.
Drive the end-to-end process of new product introduction (NPI) for hardware systems, including evaluation, validation, documentation, and production readiness.
Partner with architects to translate platform blueprints into concrete hardware selections and system configurations.
Work cross-functionally with design, engineering, operations, and vendor engineering teams to ensure compatibility, performance, and scalability of new systems.
Identify and resolve hardware issues across thermal, power, firmware, and mechanical domains during evaluation and bring-up cycles.
Provide technical guidance during vendor engagements and benchmarking of next-generation platforms.

You
Have 7+ years of experience in hardware integration or systems engineering for HPC, data center, or cloud infrastructure environments.
Possess deep knowledge of server hardware platforms (x86 and ARM), PCIe accelerators, storage devices, and network fabrics.
Are experienced with vendor-led product development cycles and can drive hardware evaluation, risk mitigation, and feedback into roadmap decisions.
Can interpret platform-level architecture requirements and select or adapt OEM solutions to fit.
Are comfortable working hands-on in labs with rack-scale deployments, BIOS/firmware tuning, and performance validation.
Collaborate well across architecture, design, engineering, and vendor teams to deliver complete, production-ready hardware solutions.

Nice to Have
Experience supporting AI/ML infrastructure and accelerated compute hardware (e.g., NVIDIA, AMD, Intel).
Familiarity with system thermals, power delivery, or integration at rack scale.
Exposure to BMC/Redfish/IPMI configuration and automation.
Background in performance tuning, benchmarking, and systems validation workflows.
Prior experience contributing to reference designs or large-scale infrastructure blueprints.

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
August 22, 2025
Application Security Engineer, X
X AI
5000+
USD
180000
-
340000
United States
Full-time
Remote
false
About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role
We are seeking a skilled and innovative Application Security Engineer to join our technology-driven company. In this role, you will be responsible for ensuring the security and integrity of our cloud-native applications and systems throughout the software development lifecycle, with a particular focus on code security, CI/CD pipelines, and emerging AI technologies.

Focus
Conduct in-depth code reviews and static analysis to identify and mitigate security vulnerabilities in our applications
Design and implement secure coding guidelines and best practices for development teams
Collaborate closely with development teams to integrate security practices throughout the CI/CD pipeline
Perform threat modeling and risk assessments for applications, developing mitigation strategies for potential risks
Manage vulnerability tracking and remediation efforts, providing guidance to development teams
Support incident response activities related to application security
Stay current on emerging security threats and trends in cloud-native technologies and AI, continuously enhancing our security measures
Evaluate and secure software supply chains, including producing and maintaining Software Bills of Materials (SBOMs)
Address security concerns specific to AI and machine learning models, with a focus on the OWASP LLM Top 10

Ideal Experience
Bachelor's degree in Computer Science, Cybersecurity, or a related field
3-5 years of experience in application security, with a strong focus on code security practices
Deep understanding of secure coding practices, application security frameworks, and common vulnerabilities (e.g., OWASP Top 10)
Proficiency in Python or Rust and experience with secure coding practices in these languages
Experience securing CI/CD pipelines and implementing DevSecOps practices
Familiarity with software supply chain security and SBOM generation tools
Experience with security testing tools (e.g., Burp Suite, OWASP ZAP) and static/dynamic code analysis
Understanding of AI/ML security implications, particularly those outlined in the OWASP LLM Top 10
Excellent communication skills, able to explain complex security issues to both technical and non-technical audiences

Preferred Qualifications
Experience with cloud platforms (e.g., GCP, AWS, Azure) and their security features
Relevant security certifications (e.g., CSSLP, OSWE)
Background in data privacy and compliance regulations relevant to cloud-native applications and AI systems
Experience with GitOps and infrastructure-as-code security
Familiarity with federated learning and privacy-preserving machine learning techniques

Bonus Skills
Experience in building custom security tooling to enhance and automate security processes
Interest in leveraging AI to automate security tasks and improve efficiency
Contributions to open-source security projects or tools
Experience in securing AI/ML models and data pipelines

Annual Salary Range
$180,000 - $340,000 USD

Benefits
Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

xAI is an equal opportunity employer.

California Consumer Privacy Act (CCPA) Notice
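On the SBOM responsibility above: real SBOMs are produced with dedicated generators and follow the SPDX or CycloneDX specifications, capturing licenses, hashes, and dependency relationships. Purely as a toy illustration of the core idea (the output shape here is an invented format, not a real SBOM schema), the Python standard library can already inventory the components of the running environment:

```python
import importlib.metadata

def minimal_sbom() -> dict:
    """Toy SBOM-like inventory: name and version of every installed distribution.

    Real SBOM formats (SPDX, CycloneDX) also record licenses, file hashes,
    suppliers, and dependency edges; this only shows the component list idea.
    """
    components = sorted(
        {
            (dist.metadata.get("Name"), dist.version)
            for dist in importlib.metadata.distributions()
            if dist.metadata.get("Name") and dist.version
        }
    )
    return {
        "format": "toy-inventory",  # invented label, not a real SBOM spec
        "components": [{"name": n, "version": v} for n, v in components],
    }
```

The point of an SBOM in supply chain security is that this inventory, generated at build time and stored alongside the artifact, lets you answer "are we shipping the vulnerable version of X?" without rebuilding anything.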
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
August 21, 2025
Detection and Response Engineer
Cerebras Systems
501-1000
-
Canada
Remote
false
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs. Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services.
About The Role
We are seeking an exceptional Detection and Response Engineer to serve on the front lines, where you will build systems to detect threats, investigate incidents, and lead coordinated response across teams. The right candidate brings hands-on experience creating reliable detections, automating repetitive tasks, and turning investigation findings into durable improvements to our security program, with an interest in exploring AI-driven automation.
Responsibilities
Create and optimize detections, playbooks, and workflows to quickly identify and respond to potential incidents.
Investigate security events and participate in incident response, including on-call responsibilities.
Automate investigation and response workflows to reduce time to detect and remediate incidents.
Build and maintain detection and response capabilities as code, applying modern software engineering rigor.
Explore and apply emerging approaches, potentially leveraging AI, to strengthen our security posture.
Document investigation and response procedures as clear runbooks for triage, escalation, and containment.
Skills And Qualifications
3–5 years of experience in detection engineering, incident response, or security engineering.
Strong proficiency in Python and query languages such as SQL, with the ability to write clean, maintainable, and testable code.
Practical knowledge of detection and response across cloud, identity, and endpoint environments.
Familiarity with attacker behaviors and the ability to translate them into durable detection logic.
Strong fundamentals in operating systems, networking, and log analysis.
Excellent written communication skills, with the ability to create clear documentation.
Why Join Cerebras
People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we've reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:
Build a breakthrough AI platform beyond the constraints of the GPU.
Publish and open-source their cutting-edge AI research.
Work on one of the fastest AI supercomputers in the world.
Enjoy job stability with startup vitality.
Our simple, non-corporate work culture that respects individual beliefs.
Read our blog: Five Reasons to Join Cerebras in 2025. Apply today and become part of the forefront of groundbreaking advancements in AI!
Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them. This website or its third-party tools process personal data.
For more details, click here to review our CCPA disclosure notice.
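"Detection and response capabilities as code", as listed in this role's responsibilities, usually means expressing each detection as a small, version-controlled, unit-testable rule. A minimal sketch with an invented event schema (field names like "event", "mfa", and "count" are assumptions for the example, not Cerebras internals):

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch of "detection as code": each detection is a small,
# testable rule over one normalized log event (represented as a dict).
@dataclass
class Detection:
    name: str
    severity: str
    matches: Callable  # predicate: dict -> bool

DETECTIONS = [
    Detection("console_login_without_mfa", "high",
              lambda e: e.get("event") == "console_login" and e.get("mfa") is False),
    Detection("mass_object_read", "medium",
              lambda e: e.get("event") == "object_read" and e.get("count", 0) > 1000),
]

def run_detections(event: dict) -> list:
    """Return the names of every detection that fires for one event."""
    return [d.name for d in DETECTIONS if d.matches(event)]

print(run_detections({"event": "console_login", "mfa": False}))
```

Keeping rules in this shape lets them live in a repository with ordinary code review and unit tests, which is the "modern software engineering rigor" the listing refers to.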
MLOps / DevOps Engineer
Data Science & Analytics
Apply
August 21, 2025
Kubernetes Architect
TensorWave
51-100
-
United States
Full-time
Remote
true
At TensorWave, we’re leading the charge in AI compute, building a versatile cloud platform that’s driving the next generation of AI innovation. We’re focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what’s possible in the AI landscape.
About the Role:
We are seeking an exceptional Kubernetes Architect to lead the design, development, and deployment of our next-generation infrastructure platform. This is a very senior role for someone who not only understands Kubernetes deeply, but can also write complex manifests, operators, and controllers from scratch and architect resilient, secure, and performant systems that scale to millions of users.
As a technical visionary and hands-on expert, you will lead the evolution of our cloud-native architecture, including designing serverless systems on Kubernetes, integrating with CI/CD, and ensuring observability, security, and cost-efficiency across environments.
Responsibilities:
Architect and implement end-to-end Kubernetes infrastructure for large-scale, cloud-native applications.
Design and build serverless platforms on top of Kubernetes using technologies such as Knative, OpenFaaS, or KEDA.
Develop and maintain Kubernetes custom resources (CRDs), controllers, operators, and admission controllers in Go or Python.
Define multi-tenant, multi-region architecture supporting millions of users with high availability and low latency.
Lead Kubernetes cluster lifecycle management (provisioning, upgrades, scaling, monitoring, troubleshooting).
Collaborate closely with engineering teams to containerize applications, write Helm charts or Kustomize overlays, and standardize deployment practices.
Implement infrastructure as code using tools like Terraform, Pulumi, or Crossplane.
Lead efforts around observability, policy enforcement, cost optimization, and RBAC/security hardening within the cluster.
Evaluate and integrate Kubernetes ecosystem tools (e.g., Istio/Linkerd, ArgoCD, Flux, Prometheus, Grafana, OPA, etc.).
Mentor and upskill DevOps engineers and SREs in Kubernetes best practices.
Essential Skills & Qualifications:
8+ years of experience in cloud infrastructure, DevOps, or platform engineering roles.
4+ years of hands-on Kubernetes experience, including deep knowledge of the Kubernetes API, internals, networking, and storage.
Proficiency in writing Kubernetes manifests, Helm charts, and custom Kubernetes controllers/operators (preferably in Go).
Proven experience designing cloud-native systems that scale globally (multi-region, multi-cloud or hybrid setups).
Experience with serverless technologies (Knative, OpenFaaS, AWS Lambda, etc.) in a production environment.
Strong knowledge of cloud platforms such as AWS, GCP, or Azure.
Experience with GitOps tools (ArgoCD, Flux), service meshes, policy engines (OPA/Gatekeeper), and CI/CD pipelines.
Deep understanding of security, compliance, and resilience in containerized workloads.
Additional/Preferred Qualifications:
Contributions to Kubernetes open-source projects or CNCF-related tooling.
Experience with service mesh design (Istio, Linkerd).
Familiarity with eBPF, Cilium, or network-level observability.
Background in building PaaS or developer platforms on top of Kubernetes.
What Success Looks Like:
A production-grade Kubernetes platform that can support millions of users globally, with self-healing, autoscaling, and strong observability.
Developer teams can deploy serverless applications with ease, speed, and reliability.
Infrastructure is resilient, secure, cost-optimized, and compliant.
Kubernetes practices and tooling are well-documented, standardized, and continuously improved across the company.
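The custom controllers and operators this role centers on all reduce to the same level-triggered reconcile pattern: observe actual state, diff it against desired state, act to converge. A minimal sketch against an in-memory fake API (all names invented; a real operator would talk to the Kubernetes API, e.g. via client-go in Go or kopf in Python):

```python
from dataclasses import dataclass, field

# Illustrative sketch of the reconcile loop behind Kubernetes controllers and
# operators, run against an in-memory fake API rather than a real cluster.
@dataclass
class FakeAPI:
    desired_replicas: int                            # the resource's spec
    actual_pods: list = field(default_factory=list)  # observed cluster state

    def create_pod(self, name: str) -> None:
        self.actual_pods.append(name)

    def delete_pod(self, name: str) -> None:
        self.actual_pods.remove(name)

def reconcile(api: FakeAPI) -> None:
    """One pass: diff desired vs. actual state and act to converge them.
    Controllers repeat this whenever the watched resources change."""
    diff = api.desired_replicas - len(api.actual_pods)
    for _ in range(diff):                            # scale up
        api.create_pod(f"pod-{len(api.actual_pods)}")
    for _ in range(-diff):                           # scale down
        api.delete_pod(api.actual_pods[-1])

api = FakeAPI(desired_replicas=3)
reconcile(api)
print(api.actual_pods)   # ['pod-0', 'pod-1', 'pod-2']
api.desired_replicas = 1
reconcile(api)
print(api.actual_pods)   # ['pod-0']
```

Because reconcile compares whole states rather than reacting to individual events, it is naturally self-healing: a missed event or a manually deleted pod is corrected on the next pass.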
What We Bring:
Stock Options
100% paid Medical, Dental, and Vision insurance
Life and Voluntary Supplemental Insurance
Short Term Disability Insurance
Flexible Spending Account
401(k)
Flexible PTO
Paid Holidays
Parental Leave
Mental Health Benefits through Spring Health
MLOps / DevOps Engineer
Data Science & Analytics
Apply
August 20, 2025
Site Reliability Engineer
TensorWave
51-100
-
United States
Full-time
Remote
true
At TensorWave, we’re leading the charge in AI compute, building a versatile cloud platform that’s driving the next generation of AI innovation. We’re focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what’s possible in the AI landscape.
About the Role:
We're looking for a Senior Site Reliability Engineer with a strong software engineering background to build and maintain highly scalable, secure, and resilient infrastructure. You’ll play a critical role in designing low-level systems, automating infrastructure with modern tooling, and ensuring platform reliability. This role is ideal for someone who’s comfortable working at the intersection of systems programming and DevOps: writing code in Go, JavaScript, Rust, C, or Zig while also managing infrastructure with NixOS, Kubernetes, and Terraform.
Responsibilities:
Design, build, and maintain infrastructure systems using Linux and NixOS.
Manage infrastructure-as-code with Terraform to provision and scale resources.
Architect and operate Kubernetes clusters with a focus on performance, security, and automation.
Write high-performance tooling and internal utilities in Go, JavaScript, or Rust.
Develop and maintain CI/CD pipelines for infrastructure and code deployments.
Monitor system performance, resolve issues, and improve reliability through observability tooling.
Collaborate closely with engineering teams to support deployment strategies and development workflows.
Essential Skills & Qualifications:5+ years in DevOps, Site Reliability, or Infrastructure Engineering roles.
Deep experience with Linux systems and configuration management (preferably NixOS).
Hands-on experience with Terraform, Kubernetes, and containerized environments.
Proficiency in one or more of the following languages: Rust, C, Zig, Go, or JavaScript.
Strong understanding of systems programming, performance tuning, and operating system internals.
Familiarity with CI/CD practices and infrastructure monitoring/alerting tools.
We’re looking for resilient, adaptable people to join our team: folks who enjoy collaborating and tackling tough challenges. We’re all about offering real opportunities for growth, letting you dive into complex problems and make a meaningful impact through creative solutions. If you're a driven contributor, we encourage you to explore opportunities to make an impact at TensorWave. Join us as we redefine the possibilities of intelligent computing.
What We Bring:
Stock Options
100% paid Medical, Dental, and Vision insurance
Life and Voluntary Supplemental Insurance
Short Term Disability Insurance
Flexible Spending Account
401(k)
Flexible PTO
Paid Holidays
Parental Leave
Mental Health Benefits through Spring Health
MLOps / DevOps Engineer
Data Science & Analytics
Apply
August 20, 2025
AI/HPC Network Development Engineer - Networking
X AI
5000+
-
Ireland
United States
Full-time
Remote
false
About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.
About the Role
xAI was the first in the world to build a 100k GPU cluster on an Ethernet network, and then did it again in 92 days, floors, walls and all. We need an engineer with deep experience in RoCEv2 who can develop at hyperscale while optimizing performance and availability. xAI is building at a furious pace with the latest hardware to help people understand the universe. To make the next significant leap forward, we need to own our own destiny by understanding our current network performance and availability, then optimizing it for our training models and how we execute customer inference queries. You will spend most of your days deep inside NCCL, building metric dashboards and tweaking configurations to ensure no performance is left on the table. You will help design the next iteration of our backend and front-end networks, allowing us to seamlessly build out new GPU infrastructure with little to no engineering assistance. There will be a significant amount of travel to Memphis for building more capacity, as well as participation in a team on-call rotation and helping with other scaling and maintenance efforts. This will become easier as we build out the team and engineers contribute to deployment and operations frameworks that remove repetitive tasks.
Location
We have 2 openings, one based in Palo Alto, California and the other in Dublin, Ireland. There will be significant travel expected to Memphis, Tennessee for data center buildouts and to the head office in Palo Alto for team collaboration.
Ideal Experiences
A minimum of 10 years designing and operating large-scale networks, with 5 years in the Ethernet AI/HPC space.
Deep understanding of congestion control on Ethernet, with InfiniBand an added bonus.
Deep understanding of AI training and inference workloads and how they operate on the network; as part of this, you are able to use and debug NCCL and potentially commit to the library.
Expertise in creating a portfolio of metrics for performance and operations to optimize the fleet for training and inference traffic.
Experience with Python to automate away repetitive tasks and facilitate your daily job working with and analyzing large sets of data.
Interview Process
After submitting your application, the team reviews your CV and statement of exceptional work. If your application passes this stage, you will be invited to an initial interview (45 minutes - 1 hour) during which a member of our team will ask some basic questions. If you clear the initial phone interview, you will enter the main process, which consists of five interviews:
Coding assessment in a language of your choice.
Data center network technologies and RoCEv2.
Manager Interview.
Meet and greet with the wider team, where you will run through a presentation of a body of work you are proud of.
Benefits
Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.
xAI is an equal opportunity employer.
California Consumer Privacy Act (CCPA) Notice
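The Python-automation expectation in this role often looks like log crunching: pulling bandwidth figures out of benchmark output and summarizing them. A minimal sketch, with an invented log format (real NCCL and nccl-tests output differs):

```python
import re
from statistics import mean

# Hypothetical sketch of day-to-day automation: summarizing bus-bandwidth
# readings from benchmark logs. The log format below is invented for the
# example and is not real NCCL output.
SAMPLE_LOG = """\
step=1 size=1073741824 busbw_gbps=355.2
step=2 size=1073741824 busbw_gbps=348.9
step=3 size=1073741824 busbw_gbps=351.4
"""

PAT = re.compile(r"busbw_gbps=([0-9.]+)")

def busbw_stats(log: str) -> dict:
    """Summarize every busbw_gbps reading found in `log`."""
    vals = [float(v) for v in PAT.findall(log)]
    return {"n": len(vals), "mean": round(mean(vals), 1), "min": min(vals)}

print(busbw_stats(SAMPLE_LOG))
```

Feeding summaries like this into a dashboard is one way to build the "portfolio of metrics" the listing asks for, making regressions in collective-communication bandwidth visible per step and per cluster.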
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
August 20, 2025
IT Engineer
helsing
501-1000
-
Spain
Full-time
Remote
false
Who we are
Helsing is a defence AI company. Our mission is to protect our democracies. We aim to achieve technological leadership, so that open societies can continue to make sovereign decisions and control their ethical standards. As democracies, we believe we have a special responsibility to be thoughtful about the development and deployment of powerful technologies like AI. We take this responsibility seriously. We are an ambitious and committed team of engineers, AI specialists and customer-facing programme managers. We are looking for mission-driven people to join our European teams, and apply their skills to solve the most complex and impactful problems. We embrace an open and transparent culture that welcomes healthy debates on the use of technology in defence, its benefits, and its ethical implications.
The day-to-day
Configuring and operating endpoints, including workstations, laptops and servers
Supporting an office of about 30 employees
Setting up and maintaining on-prem compute environments built on Linux
Configuring and operating our network and VPN infrastructure
Partnering with Helsing's broader IT team on the operation and continuous optimisation of Helsing's corporate, development, and customer environments, built primarily using Microsoft 365 and Azure alongside other SaaS solutions
You should apply if you
Note: We operate in an industry where women, as well as other minority groups, are systematically under-represented. We encourage you to apply even if you don’t meet all the listed qualifications; ability and impact cannot be summarised in a few bullet points.
Have experience in an IT engineering or SysAdmin role, particularly working with macOS clients, Linux servers and orchestration software like Kubernetes
Are fluent in both English and Spanish; Helsing's language of business is English
Have managed modern network infrastructure, ideally including administration of network- and host-based security tools
Show great communication skills for engaging at all levels within an organisation
Employ strong problem-solving, critical thinking, and analytical skills combined with the ability to find creative solutions
Nice to have
Experience working with the Microsoft M365 stack
Experience working with Mobile Device Management tools like Intune, Jamf or JumpCloud
Experience working with PowerShell, Python or IaC tooling
Information security expertise, especially related to regulations specific to Germany
Join Helsing and work with world-leading experts in their fields
Helsing’s work is important. You’ll be directly contributing to the protection of democratic countries while balancing both ethical and geopolitical concerns.
The work is unique. We operate in a domain that has highly unusual technical requirements and constraints, and where robustness, safety, and ethical considerations are vital. You will face unique engineering and AI challenges that make a meaningful impact in the world. Our work frequently takes us right up to the state of the art in technical innovation, be it reinforcement learning, distributed systems, generative AI, or deployment infrastructure.
The defence industry is entering the most exciting phase of the technological development curve. Advances in our field are not incremental: Helsing is part of, and often leading, historic leaps forward. In our domain, success is a matter of order-of-magnitude improvements and novel capabilities. This means we take bets, aim high, and focus on big opportunities. Despite being a relatively young company, Helsing has already been selected for multiple significant government contracts.
We actively encourage healthy, proactive, and diverse debate internally about what we do and how we choose to do it. Teams and individual engineers are trusted (and encouraged) to practise responsible autonomy and critical thinking, and to focus on outcomes, not conformity. At Helsing you will have a say in how we (and you!) work, the opportunity to engage on what does and doesn’t work, and to take ownership of aspects of our culture that you care deeply about.
What we offer
A focus on outcomes, not time-tracking
Competitive compensation and stock options
Relocation support
Social and education allowances
Regular company events and all-hands to bring together employees as one team across Europe
A hands-on onboarding program (affectionately labelled “Infraduction”), in which you will be building tooling and applications to be used across the company. This is your opportunity to learn our tech stack, explore the company, and learn how we get things done, all whilst working with other engineering teams from day one (specifically for engineering and AI)
Helsing is an equal opportunities employer. We are committed to equal employment opportunity regardless of race, religion, sexual orientation, age, marital status, disability or gender identity. Please do not submit personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, data concerning your health, or data concerning your sexual orientation. Helsing's Candidate Privacy and Confidentiality Regime can be found here.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
August 20, 2025
No job found
There is no job in this category at the moment. Please try again later