Top MLOps / DevOps Engineer Job Openings in 2025

Looking for MLOps / DevOps Engineer opportunities? This curated list features the latest MLOps / DevOps Engineer job openings from AI-native companies. Whether you're an experienced professional or just entering the field, find roles that match your expertise, from startups to global tech leaders. Updated every day.


Senior Site Reliability Engineer

Stability AI
United States
Full-time
Remote: Yes
Remote - United States
Job Description:
Stability AI’s Engineering Operations team is looking for a Senior Site Reliability Engineer (SRE) to join our growing team and play a pivotal role in improving and shaping our cloud infrastructure. The person will work closely with engineering, IT, security, and product teams to drive innovation and reliability in an evolving environment. Candidates should have the initiative to build and improve a maturing cloud landscape.
Responsibilities:
Developing and enforcing SRE best practices and standards across the organization.
Architecting and managing scalable systems in AWS and other cloud environments, focusing on high availability and resilience.
Implementing and maintaining infrastructure as code using Terraform.
Setting up and refining monitoring, logging, and alerting systems.
Driving incident management and root cause analysis to improve system reliability.
Championing SRE principles and mentoring junior team members.
Qualifications:
Collaborating with development teams to enhance CI/CD pipelines.
Experience scaling resource-intensive systems, be it storage, networking, or compute.
Knowledge and experience with Kubernetes or other container scaling solutions.
Background in software development or automation scripting.
Knowledge and experience with Grafana, ELK stack, or similar tools.
Cloud security experience.
Equal Employment Opportunity:
We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or other legally protected statuses.
MLOps / DevOps Engineer
Data Science & Analytics

DevOps Engineer

Jasper
USD 170,000 - 200,000
United States
Full-time
Remote: Yes
Jasper is the leading AI marketing platform, enabling the world's most innovative companies to reimagine their end-to-end marketing workflows and drive higher ROI through increased brand consistency, efficiency, and personalization at scale. Jasper has been recognized as "one of the Top 15 Most Innovative AI Companies of 2024" by Fast Company and is trusted by nearly 20% of the Fortune 500 – including Prudential, Ulta Beauty, and Wayfair. Founded in 2021, Jasper is a remote-first organization with team members across the US, France, and Australia.
About The Role
We're looking for an experienced DevOps Engineer to join our Platform team and take ownership of the systems and pipelines that power our engineering organization. This role is built on operational excellence; you've lived through production incidents, scaling challenges, and deployment nightmares, and are passionate about building resilient systems that prevent these problems before they happen.
This is a highly autonomous, high-impact role that blends Ops practices, infrastructure engineering, and delivery pipeline optimization. You'll work with a focused, collaborative, and fast-moving team where your contributions will directly impact system reliability, developer velocity, and our ability to safely deliver AI-powered products at scale. Candidates should also have a solid background in Cloud, IaC, and Kubernetes, and a drive to produce excellent solutions for a variety of challenges.
This fully remote role reports to the Director, Information Security and is open to candidates located anywhere in the continental US.
What you will do at Jasper
We're a tight-knit team with broad responsibilities and a strong emphasis on action. We value bias toward shipping, prioritize pragmatic solutions and measurable reliability improvements, providing our engineers the autonomy to take ownership and move fast. Collaboration, trust, and clarity are at the core of how we operate.
And we're always looking for people who want to make systems more reliable for everyone. If you're a systems thinker who wants to build infrastructure that empowers teams, and who thrives in a fast-paced, AI-focused environment, this is the role for you.
Build and Maintain Resilient Infrastructure: Design, implement, and operate cloud-native infrastructure that scales efficiently and fails gracefully. Own the reliability, performance, and cost-effectiveness of our production systems across multiple environments.
Architect CI/CD Pipelines: Create and optimize software delivery pipelines that enable safe, fast, and frequent deployments. Build robust testing frameworks, automated rollback mechanisms, and progressive deployment strategies that give engineering teams confidence to ship rapidly.
Automate Everything: Develop resilient Infrastructure as Code-managed solutions using Terraform. Create self-healing systems, automated scaling policies, and intelligent alerting that reduces toil and minimizes manual intervention.
Support AI Workloads: Collaborate with ML and product teams to build infrastructure that efficiently handles AI model training, inference, and evaluation pipelines. Design systems that can scale compute resources dynamically based on demand patterns.
Drive Platform Reliability: Proactively identify single points of failure, performance bottlenecks, and scalability limits. Promote reliability engineering practices across all engineering teams through tooling, documentation, and direct collaboration.
Secure the Platform: Implement security best practices throughout the infrastructure stack. Manage secrets, certificates, and access controls. Ensure compliance with security policies while maintaining developer productivity.
What you will bring to Jasper
Production Operations Mindset: As an experienced operations engineer, you've been woken up by production alerts and understand what it takes to build truly reliable systems.
You have strong opinions about monitoring, incident response, and the importance of measuring what matters.
Kubernetes & Cloud Native Expertise: Deep hands-on experience with Kubernetes in production environments, including cluster management, networking, storage, and security. Strong knowledge of cloud-native patterns and Google Cloud Platform services.
Infrastructure as Code Mastery: Extensive experience with Terraform, Helm, and configuration management tools. You've built and maintained complex infrastructure deployments that are reproducible, version-controlled, and easy to reason about.
CI/CD Pipeline Engineering: You've designed and operated sophisticated delivery pipelines using tools like GitHub Actions, Argo CD, Jenkins, or similar. You understand the tradeoffs between speed and safety in software delivery.
Observability & Monitoring: Hands-on experience with observability platforms (especially Datadog). You know how to instrument applications, create meaningful dashboards, and design alerting that reduces noise while catching real issues.
Programming & Automation Skills: Strong scripting abilities in Python, Go, or Bash. Experience building tools and automation that solve operational problems at scale.
Preferred Qualifications
Multi-Language Support: Experience supporting applications written in TypeScript, Python, Go, or other languages in production.
AI/ML Infrastructure: Understanding of GPU computing, model serving infrastructure, and the unique operational challenges of AI workloads.
Security Engineering: Experience with container security, secrets management, policy enforcement, and compliance frameworks.
Open Source Contributions: Active contributions to infrastructure, monitoring, or CI/CD open source projects.
Compensation Range
At Jasper, we believe in pay transparency and are committed to providing our employees and candidates with access to information about our compensation practices.
The expected base salary range offered for this role is $170,000 - $200,000. Compensation may vary based on relevant experience, skills, competencies, and certifications.
Benefits & Perks
Comprehensive Health, Dental, and Vision coverage beginning on the first day for employees and their families
401(k) program with up to 2% company matching
Equity grant participation
Flexible PTO with a FlexExperience budget ($900 annually) to help you make the most of your time away from work
FlexWellness program ($1,800 annually) to help support your personal health goals
Generous budget for home office set up
$1,500 annual learning and development stipend
16 weeks of paid parental leave
Our goal is to be a diverse workforce that is representative at all job levels, as we know the more inclusive we are, the better our product will be. We are committed to celebrating and supporting our differences, and we believe that diversity is essential to innovation and makes us better able to serve our customers. We hire people of all levels and backgrounds who are excited to learn and develop their skills. We are an equal opportunity employer. Applicants will not be discriminated against because of race, color, creed, sex, sexual orientation, gender identity or expression, age, religion, national origin, citizenship status, disability, ancestry, marital status, veteran status, medical condition, or any protected category prohibited by local, state or federal laws.
By submitting this application, you acknowledge that you have reviewed and agree to Jasper's CCPA Notice to Candidates, available at legal.jasper.ai/#ccpa.
MLOps / DevOps Engineer
Data Science & Analytics

Senior Cloud Network Engineer

Figure AI
USD 180,000 - 240,000
United States
Full-time
Remote: No
Figure is an AI Robotics company developing a general purpose humanoid. Our humanoid robot, Figure 02, is designed for commercial tasks and the home. We are based in San Jose, CA and require 5 days/week in-office collaboration. It’s time to build.
We are looking for a skilled Senior Network Engineer with a strong background in both cloud network administration (AWS, Azure, GCP) and on-premise Cisco networking, including switching and wireless technologies. The ideal candidate will be experienced in managing hybrid environments, configuring firewalls, implementing SD-WAN solutions, and delivering exceptional customer service. This candidate also needs to be experienced in a start-up company environment and be team oriented. This role plays a critical part in maintaining the integrity, performance, and security of both cloud and on-site network infrastructure for our organization and its clients.
Responsibilities:
Cloud Network Administration:
Design, configure, and support virtual networks, routing, VPNs, and load balancing in AWS, Azure, and GCP.
Administer cloud network components (e.g., VPCs, Transit Gateways, Azure VNets, GCP Interconnects).
Ensure cloud networking aligns with enterprise security and performance standards.
Automate cloud network provisioning using tools like Terraform, CloudFormation, or ARM templates.
Monitor and troubleshoot cloud network performance and incidents.
Lead efforts to audit, standardize, and secure Azure networking resources, transforming a loosely managed environment into a well-governed, cost-optimized, and scalable architecture.
On-Premise Networking:
Deploy, configure, and manage Cisco switching and wireless solutions.
Maintain wired and wireless network availability across corporate offices.
Diagnose and resolve issues related to switches, APs, controllers, and PoE devices.
Security & WAN Technologies:
Configure and manage firewalls (e.g., Cisco ASA/Firepower, Palo Alto).
Implement and support SD-WAN technologies for secure, optimized wide-area connectivity.
Ensure network compliance with internal security policies and industry standards.
Customer Service & Support:
Serve as a point of contact for internal stakeholders and clients on network-related issues.
Provide high-quality support, documentation, and communication throughout project lifecycles.
Participate in on-call rotations and respond to escalations with professionalism and urgency.
Requirements:
Bachelor's degree in Computer Science, Information Technology, or related field (or equivalent experience).
7+ years of hands-on experience in network engineering and cloud administration.
Solid experience with:
AWS, Azure, and Google Cloud Platform networking components. Experience with AWS Direct Connect or Azure ExpressRoute is a plus.
Cisco switching (Layer 2/3) and wireless infrastructure. Experience with Cisco DNA/Catalyst Center is a plus.
Firewall configuration and management.
SD-WAN solutions (e.g., Cisco Viptela, VeloCloud, FatPipe).
Strong understanding of networking protocols (TCP/IP, BGP, OSPF, DHCP, DNS, VLANs, etc.).
Proficiency in network monitoring and diagnostic tools (e.g., Wireshark, SolarWinds, NetFlow).
Excellent communication and customer service skills.
Bonus Qualifications:
Relevant certifications, such as:
Cisco (CCNA, CCNP)
AWS Certified Advanced Networking
Azure Network Engineer Associate
Google Professional Cloud Network Engineer
Experience working in hybrid IT environments (on-prem + cloud).
Familiarity with Zero Trust architecture and secure cloud networking principles.
Early-stage startup experience.
Strong problem-solving and troubleshooting skills.
Ability to communicate complex technical concepts to non-technical stakeholders.
Detail-oriented with a proactive and collaborative mindset.
Comfortable working independently and as part of a cross-functional team in a fast-paced startup environment.
The US base salary range for this full-time position is between $180,000 and $240,000 annually. The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended.
MLOps / DevOps Engineer
Data Science & Analytics

DevOps Engineer

Kodifly
Pakistan
Full-time
Remote: No
About Kodifly:
Kodifly is an AI-first spatial intelligence company transforming infrastructure monitoring and management. Headquartered in Hong Kong Science Park, with an expanded presence in Pakistan and ongoing expansion into the Kingdom of Saudi Arabia, we are backed by HKAI Lab and the Nvidia Inception program. Our expertise spans 3D point cloud processing, digital twin creation, and LiDAR technology. We develop intelligent infrastructure solutions that enable cities and enterprises to operate with greater efficiency and insight. By integrating AI-powered analytics, digital twins, and real-time spatial intelligence, we help our partners streamline asset inspections, elevate quality assurance, and enhance safety throughout the entire infrastructure lifecycle.
Job Description:
We are looking for a DevOps Engineer with a strong focus on building internal tools, CI/CD pipelines, and scalable infrastructure that supports both edge and cloud applications. This role is critical to enabling the productivity of our AI and software engineering teams.
Your contributions will empower our developers to build, deploy, and iterate on complex systems that rely on real-time spatial data processing, ultimately powering smarter, safer cities.
Key Responsibilities:
Design, implement, and maintain scalable and reliable infrastructure solutions for both edge and cloud environments.
Develop and maintain CI/CD pipelines to ensure seamless, efficient software delivery.
Monitor and optimize system performance, ensuring high availability, scalability, and reliability.
Implement infrastructure as code using tools like Terraform or CloudFormation.
Automate routine tasks to enhance operational efficiency and reduce manual intervention.
Conduct regular system audits and security assessments to proactively address vulnerabilities.
Troubleshoot and resolve infrastructure and deployment issues promptly.
Collaborate with cross-functional teams to drive continuous improvement, innovation, and effective software releases.
Participate in the design and development of new features leveraging advanced computer vision to enhance digital twin capabilities.
Analyze performance data and iterate on solutions to improve accuracy and robustness in spatial data handling.
Stay up-to-date with the latest trends and best practices in DevOps, cloud technologies, and computer vision.
Qualifications:
Bachelor's, Master's, or Ph.D. in Computer Science, Engineering, or a related field
Proficiency in creating applications for the cloud and edge
3+ years of proven experience as a DevOps Engineer or in a similar role
Experience with ROS software and methodologies.
Proficient in scripting and programming languages such as C++, Python, or similar.
Excellent analytical, problem-solving, and communication skills.
Team-oriented with an ability to work in a collaborative environment.
Solid understanding of cloud computing platforms (e.g., AWS, Azure, GCP) and associated services.
Experience with configuration management tools (e.g., Ansible, Puppet, Chef).
Proficiency in containerization technologies such as Docker and Kubernetes.
Knowledge of version control systems (e.g., Git) and collaboration tools (e.g., Jira, Confluence).
Familiarity with monitoring and log aggregation tools (e.g., Prometheus, ELK stack).
Strong understanding of networking concepts and protocols.
Excellent problem-solving and troubleshooting skills.
Ability to work independently and collaboratively in a fast-paced and dynamic environment.
Strong communication and interpersonal skills.
We Offer:
Opportunity to work with state-of-the-art technology in a rapidly evolving field.
A collaborative environment where innovation is encouraged and rewarded.
Competitive salary and share options
Professional and personal development opportunities and a chance to make a significant impact in infrastructure safety and efficiency.
Join Us at Kodifly:
If you’re ready to apply your computer vision skills to tackle real-world challenges and drive technological advancement, Kodifly is looking for you. Apply now and begin your journey at the cutting edge of infrastructure technology! 🚀
MLOps / DevOps Engineer
Data Science & Analytics

Data Center Controls Engineer

OpenAI
USD 310,000 - 460,000
United States
Full-time
Remote: No
About the Team
OpenAI is on a bold mission to build the world’s most advanced AI infrastructure. We are seeking a Data Center Controls Engineer with 10+ years of experience designing and implementing advanced SCADA, EPMS, and BMS control system solutions for mission-critical facilities.
The Data Center Engineering team is at the core of this mission: defining infrastructure strategy, driving technical innovation, partnering with research to set performance benchmarks, and developing reference designs that enable rapid, scalable global deployment.
As a member of this team, you will design and deliver next-generation data center infrastructure optimized for AI workloads. You will collaborate across research, site selection, design, construction, commissioning, hardware engineering, deployment, operations, and global partners to bring OpenAI’s infrastructure vision to life.
This role offers the opportunity to shape the future of controls, automation, and monitoring at an unprecedented scale. If you are passionate about building cutting-edge systems that push the boundaries of performance and reliability, we encourage you to apply.
Key Responsibilities:
Develop and maintain control system design standards, specifications, templates, and reference models for hyperscale AI data centers.
Design and implement control system architectures for data center infrastructure, including EPMS, SCADA, BMS, PLC, and DDC systems.
Lead software integration for monitoring and control platforms, ensuring reliability, maintainability, and compliance with industry codes and safety standards.
Direct the design, testing, deployment, and ongoing maintenance of control solutions for new facilities.
Manage external vendors, contractors, and consultants to ensure high-quality system delivery and long-term support.
Partner with engineering and operations teams to align control systems with performance, resiliency, and efficiency goals.
Optimize telemetry and monitoring capabilities to accurately capture power and cooling load transients and improve overall system performance.
Qualifications:
10+ years of experience in control system architecture, design, and implementation for large-scale data centers or mission-critical facilities.
Strong knowledge of industry codes, standards, and compliance requirements for mission-critical operations.
Proven expertise in automation and industrial control system protocols (e.g., BACnet, Modbus, OPC, SNMP, LON).
Hands-on experience with PLC, DCS, SCADA, and HMI systems.
Deep understanding of data center operations, infrastructure systems, and control system integration.
Demonstrated ability to rapidly design, test, and implement reliable control software solutions.
Extensive experience in monitoring and controls for HVAC, power, and mechanical systems.
Successful track record of selecting, managing, and collaborating with external vendors, contractors, and consultants.
Bachelor’s degree in Controls Engineering or a related field; advanced degree or professional certifications preferred.
Preferred Skills:
Master’s degree in Electrical Engineering.
15+ years of experience in control systems for global, large-scale data center operations.
Expertise in AI-driven automation and predictive analytics for control systems.
Experience with energy-efficient and sustainable control system design practices.
Strong background in big data telemetry, analytics, and infrastructure monitoring.
About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.
We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic. For additional information, please see OpenAI’s Affirmative Action and Equal Employment Opportunity Policy Statement.
Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.
To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.
We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.
OpenAI Global Applicant Privacy Policy
At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
MLOps / DevOps Engineer
Data Science & Analytics

Cybersecurity - Site Reliability Engineer, X Money

xAI
USD 180,000 - 360,000
Remote: No
About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.
About the Role
The Cybersecurity / SRE team is focused on ensuring the security and reliability of X Money. This role will primarily focus on the X Money platform but will also cross over with the X Social platform. The ideal candidate will have experience in the banking, money transmission, and P2P payments industry. We emphasize working with large distributed systems and security platforms at scale, with an automation-first mindset. You’ll be responsible for securing and maintaining the reliability of X Money’s infrastructure. You’ll work closely with cross-functional teams to enhance security measures, improve system resilience, and implement best practices. Your role will include:
Responsibilities
Building and securing mission-critical applications within AWS.
Ensuring proper identity and role management within AWS.
Implementing and maintaining KMS for data management in RDS and DynamoDB.
Strengthening Kubernetes and container security.
Writing and maintaining infrastructure code using Python and Terraform.
Integrating and maintaining code scanning platforms.
Taking ownership of cybersecurity projects, identifying problems, and implementing solutions.
Conducting critical analysis and applying strong problem-solving skills.
Minimum qualifications:
Proficiency in Python and Terraform.
Hands-on experience with code scanning platforms.
A proactive, problem-solving mindset with a strong sense of ownership.
Excellent critical thinking and analytical skills.
AWS experience, particularly with identity management and security.
Expertise in Kubernetes and container security & experience with self-managed Kubernetes or EKS on AWS.
Be based in the SF Bay Area, or willing to relocate here.
Annual Salary Range
$180,000 - $360,000 USD
Benefits
Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.
xAI is an equal opportunity employer.
California Consumer Privacy Act (CCPA) Notice
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering

DevOps Engineer I

Observe.AI
India
Full-time
Remote: No
About Us:
Observe.AI enables enterprises to transform how they connect with customers - through AI agents and copilots that engage, assist, and act across every channel. From automating conversations to guiding human agents in real time to uncovering insights that shape strategy, Observe.AI turns every interaction into a driver of loyalty and growth. Trusted by global leaders, we’re creating a future where every customer experience is smarter, faster, and more impactful.
Why Join Us
At Observe.AI, DevOps isn’t just about maintaining infrastructure; it’s about building scalable, reliable, and secure systems that empower innovation across the organization. As a DevOps Engineer, you’ll help design and automate the foundation that powers our AI/ML platforms, ensuring seamless operations across AWS accounts, Kubernetes clusters, and diverse environments while driving efficiency and cost optimization. You’ll work on automation that goes beyond CI/CD, tackling challenges in scalability, reliability, and security, while collaborating closely with engineering, data, and product teams to create resilient systems. If you’re looking for an opportunity where your expertise shapes the future of our infrastructure, your work enables faster innovation, and your growth is fueled by solving meaningful challenges alongside a talented team, this is the place for you.
What you’ll be doing
Help define best practices in production monitoring and alerting, and own their application
Assist with and troubleshoot the setup and maintenance of various environments (production, testing, etc.)
Automate, optimize and drive efficiency of effort, code, and process
Assist with product stability and collaborate closely with other tech teams to suggest improvements
Assist in the implementation of security best practices, especially in public cloud infrastructure and in audit/compliance requirements.
Own integration of existing systems using appropriate Kubernetes / Docker / Terraform scripts to automate and improve the efficiency of deployment
Develop CI/CD pipelines for various services
Coordinate and monitor releases of the same
What you’ll bring to the role
Expertise in scripting and programming (e.g., Python, Shell, Go).
Good problem-solving skills and hands-on experience with programming or scripting languages for infrastructure automation
CI/CD experience with Jenkins and cloud deployment technologies like CodeDeploy (AWS), and/or GitLab.
Understanding of enterprise software development and infrastructure processes and lifecycle; ability to adjust and apply this knowledge in a dynamic environment using Agile or similar methodologies.
Hands-on experience with Infrastructure as Code, using Terraform, CloudFormation, or other tools.
Hands-on experience with microservices and distributed applications, such as orchestration and containers, Kubernetes, and/or serverless technology.
Understanding of different kinds of infrastructure components such as databases, pub-sub services, caches, etc.
Bachelor's or Master's degree in Engineering
Perks & Benefits
Excellent medical insurance options and free online doctor consultations
Yearly privilege and sick leaves as per Karnataka S&E Act
Generous holidays (National and Festive) recognition and parental leave policies
Learning & Development fund to support your continuous learning journey and professional development
Fun events to build culture across the organization
Flexible benefit plans for tax exemptions (i.e. Meal card, PF, etc.)
Our Commitment to Inclusion and Belonging
Observe.AI is an Equal Employment Opportunity employer that proudly pursues and hires a diverse workforce.
Observe.AI does not make hiring or employment decisions on the basis of race, color, religion or religious belief, ethnic or national origin, nationality, sex, gender, gender identity, sexual orientation, disability, age, military or veteran status, or any other basis protected by applicable local, state, or federal laws or prohibited by Company policy. Observe.AI also strives for a healthy and safe workplace and strictly prohibits harassment of any kind. We welcome all people. We celebrate diversity of all kinds and are committed to creating an inclusive culture built on a foundation of respect for all individuals. We seek to hire, develop, and retain talented people from all backgrounds. Individuals from non-traditional backgrounds, historically marginalized or underrepresented groups are strongly encouraged to apply. If you are ambitious, make an impact wherever you go, and you're ready to shape the future of Observe.AI, we encourage you to apply. For more information, visit www.observe.ai.
MLOps / DevOps Engineer
Data Science & Analytics
getwriter_logo

Director of production support

Writer
-
US.svg
United States
Full-time
Remote
false
📐 About this role: As the Director of WRITER production support, you will lead the function that guarantees the operational success of our customers' mission-critical WRITER Agents. This is a unique leadership role with a dual mandate: first, to build and lead a world-class human support organization, and second, to architect the AI-driven future of that organization by agentifying our own support processes on the WRITER platform. Your first-hand experience building and delivering AI agents makes you uniquely qualified to lead this team. You understand the complexities of production AI from the inside out. You will be responsible for supporting everything from standardized vertical solutions to the highly complex, custom agents that power our customers' core operations, all while turning your own department into a showcase for AI-powered efficiency. This is a rare opportunity to build the support organization of the future, from the ground up. You will not just be managing a team; you will be a player-coach, a strategist, and a builder, creating a function that is both a world-class human support team and a living testament to the transformative power of the WRITER platform itself. 🦸🏻‍♀️ Your responsibilities: Build the WRITER support team: Recruit, hire, and mentor a team of platform support engineers, prioritizing candidates who share a builder's mindset and a deep curiosity for how WRITER Agents work. You will define the hiring profile for individuals who can effectively triage and debug issues within the WRITER.AI ecosystem, from platform-level configurations to the behavior of individual WRITER Agents. Agentify our own support: Develop and execute a roadmap to transform our support function using the WRITER.AI platform.
You will "eat our own dog food," building a suite of internal agents to automate triage, diagnosis, knowledge retrieval, and resolution, creating a model for AI-driven operational excellence. Design the agent support process: Architect our entire production support workflow, establishing definitive, AI-first escalation paths. This includes creating specialized triage processes for custom complex agents that directly involve the original builders when necessary. Own the agent knowledge base: Champion and build our technical knowledge base, with a focus on making this knowledge accessible to both human engineers and the support agents you build. Be the voice of the customer in crisis: Serve as the incident commander during P0 issues, leveraging your deep technical understanding of WRITER agents to guide your team and communicate with credibility and confidence. Drive platform & agent insights: Implement and own all support metrics. You will analyze data from both human- and agent-led resolutions to provide the WRITER product organization with unparalleled insights into platform reliability and customer pain points. ⭐️ Is this you? An experienced technical support leader: You have 7+ years of experience in technical support, with at least 3 years spent managing a team in an enterprise PaaS or API-first environment. A support transformation leader: You have a proven track record of not just managing a support function, but fundamentally transforming it using automation and AI. You have personally used an AI platform to agentify and improve support workflows. A proven AI builder: You have demonstrable, hands-on experience building and delivering AI agents or similar complex AI solutions. You have likely been an AI Architect or a Solutions Architect, or held a similar role in the past. You don't just manage the technology; you have built it. Technically credible & hands-on: You are adept at navigating bespoke software and understand that custom agents have unique failure modes.
Your past experience as a builder gives you immediate credibility with engineering and delivery teams. A process builder: You love creating order from chaos and have a proven track record of designing scalable support processes, ticketing workflows, and SLAs for a technical product. A cross-functional partner: You are skilled at working with and influencing teams you don't directly manage to ensure the stability of the entire WRITER.AI ecosystem. 🍩 Benefits & perks (US full-time employees): Generous PTO, plus company holidays. Medical, dental, and vision coverage for you and your family. Paid parental leave for all parents (12 weeks). Fertility and family planning support. Early-detection cancer testing through Galleri. Flexible spending account and dependent FSA options. Health savings account for eligible plans with company contribution. Annual work-life stipends for: home office setup, cell phone, internet. Wellness stipend for gym, massage/chiropractor, personal training, etc. Learning and development stipend. Company-wide off-sites and team off-sites. Competitive compensation, company stock options, and 401k. WRITER is an equal-opportunity employer and is committed to diversity. We don't make hiring or employment decisions based on race, color, religion, creed, gender, national origin, age, disability, veteran status, marital status, pregnancy, sex, gender expression or identity, sexual orientation, citizenship, or any other basis protected by applicable local, state or federal law. Under the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records. By submitting your application on the application page, you acknowledge and agree to WRITER's Global Candidate Privacy Notice.
MLOps / DevOps Engineer
Data Science & Analytics
Solutions Architect
Software Engineering
Project Manager
Product & Operations
anthropicresearch_logo

Engineering Manager, Secure Frameworks

Anthropic
USD
320000
-
485000
US.svg
United States
Full-time
Remote
false
About Anthropic: Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems. Anthropic is an AI safety and research company working to build reliable, interpretable, and steerable AI systems. Our mission is to ensure that artificial intelligence benefits all of humanity. We have a staggering amount of work ahead, which means you have an unprecedented opportunity to shape the future of AI while doing the most important work of your career. About the Role: We are seeking a Security Engineering Manager to lead our Secure Frameworks team, which builds high-leverage security frameworks and libraries that make secure development easy and prevent entire classes of vulnerabilities through careful design. You'll collaborate closely with teams and leaders across Anthropic to own critical security foundations, including cryptographic frameworks and secure serialization and authorization systems that empower teams to work securely without becoming security experts themselves.
Responsibilities: Design frameworks and libraries that enable secure handling of sensitive data including model weights, customer data, and training datasets. Own what you build completely, ensuring outstanding user experience, proactive monitoring, and responsive support. Enable other teams to build their own security solutions by providing design pattern guidance and expanding security ownership beyond your team. Partner with product, research, and infrastructure teams, as well as other Security teams, to ensure frameworks integrate smoothly with lower-layer security controls. Make strategic decisions about which frameworks to build based on security risk and translate them into prioritized roadmaps. Scale the team's impact from primarily supporting researchers today to enabling the broader product engineering organization as Anthropic grows. Define success metrics around framework adoption and impact: the team succeeds when engineers naturally reach for Secure Frameworks' tools rather than building security solutions from scratch. Manage and grow a team of engineers to deliver high-impact projects that balance security rigor with development velocity. You may be a good fit if you have: 5+ years managing security engineering teams with a proven track record of team productivity. 5+ years of hands-on security and software engineering experience. Deep expertise in securing complex architectures, threat modeling, and risk assessment, with the ability to evaluate security tradeoffs and make risk-based decisions. Strong cross-functional collaboration skills, balancing security requirements with developer experience and velocity. Clear and persuasive communication in both written and verbal settings. A passion for building diverse, high-performing teams and growing engineers in a fast-paced environment. Low ego, high empathy, and a track record as a talent magnet. Experience working with technical internal customers. Familiarity with AI safety concepts and frameworks. Deadline to apply: None.
Applications will be reviewed on a rolling basis. The expected salary range for this position is: Annual Salary: $320,000 to $485,000 USD. Logistics: Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience. Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices. Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this. We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work. We think AI systems like the ones we're building have enormous social and ethical implications. We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team. How we're different: We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact, advancing our long-term goals of steerable, trustworthy AI, rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We're an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time.
As such, we greatly value communication skills. The easiest way to understand our research directions is to read our recent research. This research continues many of the directions our team worked on prior to Anthropic, including: GPT-3, Circuit-Based Interpretability, Multimodal Neurons, Scaling Laws, AI & Compute, Concrete Problems in AI Safety, and Learning from Human Preferences. Come work with us! Anthropic is a public benefit corporation headquartered in San Francisco. We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues. Guidance on Candidates' AI Usage: Learn about our policy for using AI in our application process
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
openai_logo

Technical Lead, Multimodal Infrastructure

OpenAI
USD
460000
0
-
0
US.svg
United States
Full-time
Remote
false
About the Team: The Multimodal Research team at OpenAI is building the next generation of AI systems that can understand and generate content across multiple modalities, including text, audio, images, and video. The team’s mission is to unlock new capabilities by enabling models to process and reason about diverse data types simultaneously. This team sits at the intersection of cutting-edge research and production, and plays a central role in shaping OpenAI's multimodal offerings, from speech-to-speech agents to video understanding and image generation. Recent projects include real-time voice agents, modular speech components, and fine-tuning pipelines that support product launches like GPT-4o. About the Role: As a TL (Tech Lead) for the multimodal infrastructure team, you will help provide technical leadership to a high-caliber team of ML infrastructure engineers supporting every multimodal research initiative at OpenAI. This is a deeply technical, high-impact role at the intersection of systems engineering and ML research. You’ll be responsible for both driving technical direction and supporting the growth and execution of the team. We’re looking for people who are passionate about enabling world-class research through robust, scalable infrastructure and excited by the unique challenges of multimodal systems at massive scale. This role is based in San Francisco, CA.
We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees. In this role, you will: Provide technical leadership to a hands-on team building infrastructure for large-scale multimodal research. Collaborate closely with researchers and engineers across OpenAI to support diverse, product-facing multimodal projects. Design, build, and operate software systems for training and evaluating ML models on complex, high-volume multimodal data. You might thrive in this role if you: Are highly technical. Have strong software engineering skills and experience building high-performance infrastructure. Understand or are excited to learn about ML systems, especially those involving large-scale training or multimodal data. Have experience with model training infrastructure, ML performance optimization or kernel development, large-scale multimodal data pipelines, and ML developer tools. Excel at cross-functional leadership, able to prioritize across projects and build strong partnerships across research and infrastructure teams. About OpenAI: OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity. We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.
For additional information, please see OpenAI’s Affirmative Action and Equal Employment Opportunity Policy Statement. Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations. To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance. We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link. OpenAI Global Applicant Privacy Policy. At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Software Engineer
Software Engineering
anthropicresearch_logo

Engineering Manager - CI Infrastructure

Anthropic
USD
0
405000
-
485000
US.svg
United States
Full-time
Remote
false
About Anthropic: Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems. About the role: Anthropic is seeking an experienced engineering leader to manage our Continuous Integration team, which enables engineers to be maximally effective in a secure, safe, fast, and high-quality way while developing state-of-the-art models and products. You'll lead initiatives to build the best large-scale CI system in the world. Responsibilities: Manage a team of engineers building world-class developer productivity tools and infrastructure. Consult with different stakeholders to deeply understand developer needs, identifying potential solutions to support secure and efficient development. Set strategy and oversee development of CI test infrastructure and release processes that enable rapid innovation while maintaining high security standards. Hire and coach top technical talent. Design processes that help the team operate effectively and continuously improve developer productivity as the organization scales. Drive adoption of AI tools to guide developers and increase productivity. Ensure seamless integration of model improvements into the development flow. Work closely with Researchers as well as Software Engineers, catering to each of their unique needs and identifying potential opportunities to serve both. You may be a good fit if you: Have 3+ years of experience managing technical teams of more than 6 people. Have strong leadership ability and experience managing and growing senior engineers. Have passion for developer productivity and tooling. Have excellent communication skills to build consensus with stakeholders. Possess deep knowledge of modern development practices, source control, CI/CD pipelines, and developer tooling.
Are obsessed with security, scalability, and continuous improvement. Strong candidates may also have: Experience building tools and environments for AI/ML development. Expertise in security and privacy best practices for development infrastructure. Experience managing initiatives across multiple products/teams. Experience with large-scale monorepos and enabling hundreds of developers to collaborate effectively. Deadline to apply: None. Applications will be reviewed on a rolling basis. The expected salary range for this position is: Annual Salary: $405,000 to $485,000 USD. Logistics: Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience. Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices. Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this. We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work. We think AI systems like the ones we're building have enormous social and ethical implications. We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team. How we're different: We believe that the highest-impact AI research will be big science.
At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We're an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication skills. The easiest way to understand our research directions is to read our recent research. This research continues many of the directions our team worked on prior to Anthropic, including: GPT-3, Circuit-Based Interpretability, Multimodal Neurons, Scaling Laws, AI & Compute, Concrete Problems in AI Safety, and Learning from Human Preferences. Come work with us! Anthropic is a public benefit corporation headquartered in San Francisco. We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues. Guidance on Candidates' AI Usage: Learn about our policy for using AI in our application process
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
file.jpeg

Director, 2.5D/3D Process Development

Celestial AI
USD
0
215000
-
245000
US.svg
United States
Full-time
Remote
false
About Celestial AI: As Generative AI continues to advance, the performance drivers for data center infrastructure are shifting from systems-on-chip (SOCs) to systems of chips. In the era of Accelerated Computing, data center bottlenecks are no longer limited to compute performance, but rather the system’s interconnect bandwidth, memory bandwidth, and memory capacity. Celestial AI’s Photonic Fabric™ is the next-generation interconnect technology that delivers a tenfold increase in performance and energy efficiency compared to competing solutions. The Photonic Fabric™ is available to our customers in multiple technology offerings, including optical interface chiplets, optical interposers, and Optical Multi-chip Interconnect Bridges (OMIB). This allows customers to easily incorporate high bandwidth, low power, and low latency optical interfaces into their AI accelerators and GPUs. The technology is fully compatible with both protocol and physical layers, including standard 2.5D packaging processes. This seamless integration enables XPUs to utilize optical interconnects for both compute-to-compute and compute-to-memory fabrics, achieving bandwidths in the tens of terabits per second with nanosecond latencies. This innovation empowers hyperscalers to enhance the efficiency and cost-effectiveness of AI processing by optimizing the XPUs required for training and inference, while significantly reducing the TCO2 impact. To bolster customer collaborations, Celestial AI is developing a Photonic Fabric ecosystem consisting of tier-1 partnerships that include custom silicon/ASIC design, system integrators, HBM memory, assembly, and packaging suppliers. ABOUT THE ROLE: As Director of 2.5D/3D Process Development, you will lead, grow, and manage a high-performing assembly & packaging team, utilizing your in-depth 2.5D assembly process experience and partner management experience to drive the company’s AI products from concept to high volume manufacturing.
You will be responsible for managing and driving the assembly process technology development in close collaboration with foundries and OSATs to ensure design for manufacturing, reliability, and cost. Responsibilities will include identifying and mitigating risks in new technologies used in product integration and ensuring factory readiness by maturing the product during the NPI phase. This job requires hands-on experience in TSV and wafer process development and 2.5D/3D CoW assembly development, along with strong project management, written and verbal communication skills, and a can-do attitude to prioritize and address issues with a high sense of urgency. ESSENTIAL DUTIES AND RESPONSIBILITIES: Build, lead, and manage a high-performance team responsible for assembly & packaging process technology; set goals, direct resources, mentor staff, conduct performance reviews, and ensure the team delivers on objectives. Work with cross-functional packaging teams and lead TSV and backend stack development at the foundry, along with 2.5D/3D CoW process development at OSATs, to bring packaging solutions from concept to prototypes and ramp to high volume manufacturing with aggressive cost reduction strategies. Actively manage qualification of packages with sensitivity to physics of failures for high thermo-mechanical reliability, while operating within established cost constraints. Manage internal and external resources effectively and efficiently towards established corporate milestones. Drive ideation and innovation of advanced package solutions and specifications with vendors to advance productization efforts by Celestial AI. QUALIFICATIONS: 15+ years of experience in Semiconductor Packaging, Process and Technology Development. Expert level understanding of advanced foundry process node and its interaction with packaging/assembly/substrate across various package technologies is required.
Expertise in advanced packaging technologies: knowledge and insight to deliver high density/high performance interconnects with TSVs in various 2.5D/3D package form factors. Proven track record of bringing products from technology development, package design/definition, qualification, NPI through HVM is required. Working level understanding of cross‑functional packaging areas: package architecture, design rules, BOM, enabling material/process technologies, mechanical, design for manufacturing, reliability, and cost. Familiarity with component & system level reliability, testing, and Failure Analysis (FA). Experience with photonics packaging will be preferred but is not necessary. Experience in project management and communicating technical and project/program status and issue resolution at the executive level. Previous management experience: must have managed direct reports and been responsible for hiring, mentoring, performance reviews, and resource allocation. Excellent problem‑solving skills with strong fundamentals in science, sound engineering judgment, and analytical ability. Effective communicator and comfortable engaging with internal cross functional teams as well as domestic and overseas suppliers. Excellent attention to detail, process and operationally oriented and self‑driven with the ability to work independently and take projects to completion with minimum supervision. M.S or Ph.D. in Mechanical Engineering, Materials Science or Physics is required.     LOCATION: Santa Clara, CA   For California Location: As an early stage start up, we offer an extremely attractive total compensation package inclusive of competitive base salary, bonus and a generous grant of our valuable early-stage equity. The target base salary for this role is approximately $215,000.00 - $245,000.00. 
The base salary offered may be slightly higher or lower than the target base salary, based on the final scope as determined by the depth of the experience and skills demonstrated by the candidate in the interviews. We offer great benefits (health, vision, dental and life insurance) and a collaborative, continuous-learning work environment, where you will get a chance to work with smart and dedicated people engaged in developing the next generation architecture for high performance computing. Celestial AI Inc. is proud to be an equal opportunity workplace and is an affirmative action employer.
MLOps / DevOps Engineer
Data Science & Analytics

Field Engineer - Dubai

Dataiku
-
United Arab Emirates
Full-time
Remote
false
Dataiku is The Universal AI Platform™, giving organizations control over their AI talent, processes, and technologies to unleash the creation of analytics, models, and agents. Providing no-, low-, and full-code capabilities, Dataiku meets teams where they are today, allowing them to begin building with AI using their existing skills and knowledge.

Why Engineering at Dataiku?
Dataiku’s SaaS, cloud or on-premise deployed platform connects many Data Science technologies. Our technology stack reflects our commitment to quality and innovation. We integrate the best of data and AI tech, selecting tools that truly enhance our product. From the latest LLMs to our dedication to open source communities, you'll work with a dynamic range of technologies and contribute to the collective knowledge of global tech innovators. You can find out even more about working in Engineering at Dataiku by taking a look here.

What to know about the Field Engineering team
As a Field Engineer, you’ll work with customers at every stage of their relationship with Dataiku - from the initial evaluations to enterprise-wide deployments. In this role, you will help customers to design, build, validate, and run their Data Science and AI Platforms.

How you’ll make an impact
This role requires strong technical abilities, adaptability, inventiveness, and strong communication skills. Sometimes you will work with clients on traditional big data technologies such as SQL data warehouses, while at other times you will be helping them to discover and implement the most cutting edge tools: Spark on Kubernetes, cloud-based elastic compute engines, and GPUs. If you are interested in staying at the bleeding edge of big data and AI while maintaining a strong working knowledge of existing enterprise systems, this will be a great fit for you. 
Some expected outcomes for this role:
Understand customer requirements in terms of scalability, availability and security and provide architecture recommendations
Deploy Dataiku in a large variety of technical environments (SaaS, Kubernetes, Spark, Cloud or on-prem)
Automate operation, installation, and monitoring of the Data Science ecosystem components in our infrastructure stack
Collaborate with Revenue and Customer teams to deliver a consistent experience to our customers
Drive technical success by being a trusted advisor to our customers and our internal account teams

What you need to be successful:
Professional experience with at least one cloud based service (AWS, GCP or Azure)
Hands-on experience with the Kubernetes ecosystem for setup, administration, troubleshooting and tuning
Familiarity with Ansible or other application deployment tools (Terraform, CloudFormation, etc)
Experience with cloud based Data Warehouses and Data Lakes (Snowflake, Databricks)
Some experience with Python
Grit when faced with technical issues
Comfort and confidence in client-facing interactions
Ability to work both pre and post sale

What will make you stand out:
Some knowledge in Data Science and/or machine learning
Linux system administration experience
Hands-on experience with Spark ecosystem for setup, administration, troubleshooting and tuning
Professional experience or familiarity with a Hadoop environment (Cloudera) is a plus
Experience with authentication and authorization systems like (A)AD, IAM, and LDAP

What does the hiring process look like?
Initial call with a member of our Technical Recruiting team
Video call with the Field Engineer Hiring Manager
Technical Assessment to show your skills (Home Test)
Debrief of your Tech Assessment with Field Engineer Team members
Final Interview with the VP Field Engineering

What are you waiting for! At Dataiku, you'll be part of a journey to shape the ever-evolving world of AI. 
We're not just building a product; we're crafting the future of AI. If you're ready to make a significant impact in a company that values innovation, collaboration, and your personal growth, we can't wait to welcome you to Dataiku! And if you’d like to learn even more about working here, you can visit our Dataiku LinkedIn page.   Our practices are rooted in the idea that everyone should be treated with dignity, decency and fairness. Dataiku also believes that a diverse identity is a source of strength and allows us to optimize across the many dimensions that are needed for our success. Therefore, we are proud to be an equal opportunity employer. All employment practices are based on business needs, without regard to race, ethnicity, gender identity or expression, sexual orientation, religion, age, neurodiversity, disability status, citizenship, veteran status or any other aspect which makes an individual unique or protected by laws and regulations in the locations where we operate. This applies to all policies and procedures related to recruitment and hiring, compensation, benefits, performance, promotion and termination and all other conditions and terms of employment. If you need assistance or an accommodation, please contact us at: reasonable-accommodations@dataiku.com     Protect yourself from fraudulent recruitment activity Dataiku will never ask you for payment of any type during the interview or hiring process. Other than our video-conference application, Zoom, we will also never ask you to make purchases or download third-party applications during the process. If you experience something out of the ordinary or suspect fraudulent activity, please review our page on identifying and reporting fraudulent activity here.
MLOps / DevOps Engineer
Data Science & Analytics
Data Engineer
Data Science & Analytics

Senior Site Reliability Engineer - Managed Kubernetes

Lambda AI
USD
0
267000
-
401000
United States
Full-time
Remote
false
We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.

What You’ll Do
Operate and maintain bare-metal Kubernetes clusters, scaling up to thousands of nodes
Handle cluster degradation, recovery, resizing, and incident response using fleet management tools
Participate in a well-managed on-call rotation for critical incidents
Assist customers with Kubernetes questions, workload integration, storage, and authentication
Work closely with our HPC Ops and Datacenter Ops teams for low-level or cross-functional issues
Use Python and Golang to create tooling and automate the validation of platform quality
Design, build, and maintain scalable control plane services, operators, and custom controllers for Kubernetes
Develop automation for cluster lifecycle management: provisioning, upgrades, patching, and deletion
Define and implement SLOs and SLIs for Kubernetes services, workloads, and platform reliability

About You

Must-Have
6+ years of experience in an SRE, operations engineer, or similar role, with a deep knowledge of running Linux clusters and systems
Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators
Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar)
Can work either independently with limited direction or as part of a team
Can work with customers during incidents either via tickets, live messaging, or as part of a larger call
Familiarity with observability tools like Prometheus, Grafana, FluentBit, and CI/CD pipelines
Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar

Nice-to-Have
Deep Kubernetes expertise: CRDs, CSI, CNI, Kubernetes Operator Coding experience
Exposure to HPC clusters, AI/ML workloads, or large-scale GPU clusters
Hybrid or multi-cloud Kubernetes environment experience
Contributions to CNCF projects or Kubernetes SIGs

Why Join Us
Work on cutting-edge Managed Kubernetes platforms for AI/ML workloads
Influence the platform roadmap and help shape operations and reliability best practices
Collaborate with a highly skilled engineer
Opportunity to mentor and grow within a fast-growing, technology-driven environment

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics

Security Engineer - Detection & Response

Lambda AI
USD
296000
-
445000
United States
Full-time
Remote
false
We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

About the Role
Lambda Security protects some of the world's most valuable digital assets: invaluable training data, model weights representing immense computational investments, and the sensitive inputs required to leverage best of breed AI models. We're responsible for securing every byte that powers breakthrough artificial intelligence.

As a Security Engineer on the Detection & Response team, you'll be a core technical contributor building detection capabilities, driving incident response, and eliminating firefighting everywhere possible.

Reporting to the Senior Manager of Detection & Response and working within our specialized Detection & Response team, you'll build and operate detection systems, lead incident investigations, develop threat intelligence capabilities, and contribute to red team activities. You'll coordinate closely with Security Technical Program Management to drive prioritized security remediations across the organization, ensuring that critical threats are addressed systematically rather than reactively.

You will work on implementing enterprise-grade detection capabilities, automating incident response workflows, developing threat hunting programs, and building tooling that enables 24/7 security operations. You'll have unique access to LLMs hosted on our own infrastructure to implement and experiment with AI-powered detection and response capabilities that wouldn't be possible anywhere else.

If you thrive on hunting threats, responding to incidents, and building detection systems that protect cutting-edge AI infrastructure at scale, we'd love to talk.

We value diverse backgrounds, experiences, and skills, and we are excited to hear from candidates who can bring unique perspectives to our team. If you do not exactly meet this description but believe you may be a good fit, please still apply and help us understand your readiness for this role. Your application is not a waste of our time.

What You’ll Do

Incident Response & Operations:
Response: Qualify reports and lead response activities from initial triage through remediation and retrospective.
Automation: Develop tools and workflows that accelerate incident response and reduce mean time to resolution.
Coordination: Drive prioritization and remediation of security findings across engineering teams in coordination with Security Technical Program Management.
24/7 Operations: Participate in on-call rotation, ensuring rapid response to security events that threaten customer data or operations.

Threat Detection & Analysis:
Detection Engineering: Create and tune detection rules and alerts that identify threats across Lambda's infrastructure before they impact customers or revenue.
Threat Intelligence: Research and operationalize threat intelligence specific to AI infrastructure and Lambda's unique threat landscape.
Threat Hunts: Proactively search for indicators of compromise and suspicious activity that automated detection might miss.
Explore AI-driven Security: Leverage Lambda's hosted LLMs to create AI-powered threat detection, automated triage, and intelligent alert correlation.
Offensive Security: Support periodic tabletop exercises and red team activities to test and improve detection coverage and response capabilities.

What We Think a Candidate Needs to Demonstrate to Succeed
Have 3+ years of hands-on security engineering experience and 5+ years of total engineering experience, with demonstrated impact in detection and incident response.
Thrive in high-speed, high-ambiguity startup environments where you build security capabilities while responding to immediate threats.
Deep technical expertise with security tooling including SIEM/SOAR platforms, EDR solutions, vulnerability scanners, and cloud security monitoring.
Excel at solving problems in Python, Go, or similar languages, building automations that scale security impact.
Proven ability to work effectively with cross-functional technical teams both with and without authority (we're all on the same team!).
Strong Linux systems experience in both bare metal and cloud environments, understanding infrastructure from kernel to application layer.
Excellence at translating security concerns into business risk, enabling stakeholders to make informed decisions.

Nice to Have
You've built or contributed to detection engineering programs or incident response capabilities.
Experience with threat intelligence platforms, threat hunting methodologies, or purple team exercises.
Deep experience with specific SIEM platforms (Splunk, Elastic, Chronicle) or SOAR solutions.
Experience driving or providing significant evidence for compliance audits, such as SOC 2, ISO 27001, PCI-DSS, HIPAA/HITECH, or FedRAMP.
You've developed detection content shared with the security community (Sigma rules, YARA, etc.).
Experience responding to incidents in both cloud (AWS, GCP, Azure) and bare metal environments.
Security certifications like GCIH, GNFA, GCIA, or similar that demonstrate incident response expertise.
Experience with forensics, malware analysis, or reverse engineering.
Excitement about leveraging our direct access to state-of-the-art LLMs to enhance detection and response: imagine AI-powered threat hunting, automated incident triage, and intelligent alert correlation at a scale only possible when you host the AI infrastructure yourself.

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering

Senior Networking Engineer

Lambda AI
USD
203000
-
417000
United States
Full-time
Remote
false
We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position requires presence in our San Francisco/San Jose/Seattle office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

What You’ll Do
Help to build Lambda’s cloud networking infrastructure
Contribute to automation of network configuration
Be part of operations and on-call for networking
Work with internal and external customers to resolve network related issues
Work on deploying and configuring networking HW, switches, and FWs for new clusters
Help with deploying and maintaining network monitoring and management tools

You
Have 3+ years of experience in the IT space, and 1+ in managing networks
Have experience with virtualization technology, like ESXi, KVM, and VM management
Have experience with FW policy configuration
Have experience with multi-data center networks and hybrid cloud networks
Have an understanding of BGP EVPN VXLAN networks and Spine and Leaf (Clos) network topology
Are comfortable on the Linux command line, and have an understanding of the Linux networking stack and internals
Have Python and/or Bash programming experience and have worked with Git or similar source control systems

Nice to Have
Experience with Monitoring/Observability tools like Datadog, Splunk, Grafana, Prometheus
Experience building and maintaining Software Defined Networks (SDN)
Experience with HPC networking, such as Infiniband or RoCE
Experience automating network configuration within public clouds, with tools like Terraform/Ansible/Salt
Experience with Next-Generation Firewalls (NGFW)

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics

Avionics Integration Lab Lead

helsing
-
Germany
Full-time
Remote
false
Who we are Helsing is a defence AI company. Our mission is to protect our democracies. We aim to achieve technological leadership, so that open societies can continue to make sovereign decisions and control their ethical standards.  As democracies, we believe we have a special responsibility to be thoughtful about the development and deployment of powerful technologies like AI. We take this responsibility seriously.  We are an ambitious and committed team of engineers, AI specialists and customer-facing programme managers. We are looking for mission-driven people to join our European teams – and apply their skills to solve the most complex and impactful problems. We embrace an open and transparent culture that welcomes healthy debates on the use of technology in defence, its benefits, and its ethical implications.  The role As an Avionics Integration Lab Lead, you will be responsible for overseeing and managing a multi-site systems integration lab environment. Your role will involve leading a team of engineers and technicians in developing, integrating, and deploying advanced technologies to support the testing of avionics and mission systems.  In achieving this goal you will work with a diverse, high-calibre team of experts, strategists and partners. 
The day-to-day
Develop facilities and rigs to support HIL testing and Avionics/mission system integration testing
Work with the system design teams to understand the test lab requirements
Provide systems engineering expertise in relation to system architecture and design: requirements analysis and development, rig and network design, rig qualification
Oversee the development of rig hardware, equipment and test infrastructure software
Manage and lead the systems integration lab to ensure efficient operation and maintenance of lab facilities
Interface and coordinate with Test Integrators, Design Offices, Chief Engineering Offices and Program Offices
Provide expertise to system architecture and design definition for Avionics & Mission Systems integration

You should apply if you
Hold a relevant degree such as a Bachelor's or Master's in Aerospace Engineering, Electrical Engineering, Computer Science, or a related field
Have built an avionics HIL and system integration lab environment from the ground up
Have proven experience leading technical teams, providing direction, mentorship, and coordination to achieve project goals
Have deep technical understanding in avionics rig design and test system software development and workflows
Have experience in avionics integration, testing, or related fields, ideally within an aerospace or defence environment
Possess excellent communication skills and the ability to report and present results clearly and effectively to both internal and external stakeholders

Note: We operate in an industry where women, as well as other minority groups, are systematically under-represented. We encourage you to apply even if you don’t meet all the listed qualifications; ability and impact cannot be summarised in a few bullet points.

Join Helsing and work with world-leading experts in their fields
Helsing’s work is important. 
You’ll be directly contributing to the protection of democratic countries while balancing both ethical and geopolitical concerns
The work is unique. We operate in a domain that has highly unusual technical requirements and constraints, and where robustness, safety, and ethical considerations are vital. You will face unique Engineering and AI challenges that make a meaningful impact in the world
Our work frequently takes us right up to the state of the art in technical innovation, be it reinforcement learning, distributed systems, generative AI, or deployment infrastructure. The defence industry is entering the most exciting phase of the technological development curve. Advances in our field of work are not incremental: Helsing is part of, and often leading, historic leaps forward
In our domain, success is a matter of order-of-magnitude improvements and novel capabilities. This means we take bets, aim high, and focus on big opportunities. Despite being a relatively young company, Helsing has already been selected for multiple significant government contracts
We actively encourage healthy, proactive, and diverse debate internally about what we do and how we choose to do it. Teams and individual engineers are trusted (and encouraged) to practise responsible autonomy and critical thinking, and to focus on outcomes, not conformity. At Helsing you will have a say in how we (and you!) work, the opportunity to engage on what does and doesn’t work, and to take ownership of aspects of our culture that you care deeply about

What we offer
A focus on outcomes, not time-tracking
Competitive compensation and stock options
Relocation support
Social and education allowances
Regular company events and all-hands to bring together employees as one team across Europe
A hands-on onboarding program (affectionately labelled “Infraduction”), in which you will be building tooling and applications to be used across the company. 
This is your opportunity to learn our tech stack, explore the company, and learn how we get things done - all whilst working with other engineering teams from day one    Helsing is an equal opportunities employer. We are committed to equal employment opportunity regardless of race, religion, sexual orientation, age, marital status, disability or gender identity. Please do not submit personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, data concerning your health, or data concerning your sexual orientation.  Helsing's Candidate Privacy and Confidentiality Regime can be found here.     
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering

Senior Network Operations Engineer

Together AI
USD
0
160000
-
230000
United States
Full-time
Remote
false
As a Senior Network Operations Engineer at Together AI, you are our front-line responder for break/fix incidents—owning alert triage, collaborating with SRE and MLOps teams, and driving rapid resolution to keep our global network and platform running smoothly. You combine strong operational discipline with hands-on troubleshooting and a bias for automation. Beyond traditional networking, you’ll work hands-on with Kubernetes and Slurm to diagnose issues that span infrastructure, container networking, and HPC job fabrics. You’re fluent in routing/switching and network security fundamentals, comfortable on Linux, and thrive in fast-moving environments where clear communication and crisp execution matter. You’ll improve monitoring, runbooks, and recovery playbooks to reduce MTTA/MTTR and prevent repeat incidents. Outstanding problem-solving abilities and a solid understanding of fundamental network theory are also critical to your success.

Responsibilities
Serve as first responder for network alerts and incidents: assess impact, prioritize, mitigate, and escalate as needed to SRE/MLOps/Network Engineering.
Own end-to-end incident lifecycle: detection, triage, containment, remediation, comms, and post-incident reviews with clear timelines and action items.
Monitor network health and capacity across routing/switching, firewalls, and data center fabrics; tune alert thresholds and dashboards to reduce noise.
Troubleshoot L2–L4 issues (ARP, VLAN/VXLAN/EVPN, routing protocols, ACLs/NAT, DNS, TLS termination, QoS) using packet capture and flow/telemetry tools.
Execute standard changes (MOPs) and emergency changes with rigorous change control and validation; document outcomes and update runbooks.
Operate multi-cluster add-ons (e.g., MetalLB/Traefik/NGINX), observe health via Prometheus/Grafana/Loki, and tune alerts to reduce noise.
Debug CNI/data plane (e.g., VXLAN/EVPN, iptables/nftables, network policies), kube-proxy/iptables mode, CoreDNS, Services (ClusterIP/NodePort/LoadBalancer), and Ingress/Egress.
Maintain accurate network documentation: diagrams, inventories, IPAM, device configs, and topology state.
Improve operational excellence: automate repetitive tasks, enhance self-service tooling, and contribute to SLOs, error budgets, and reliability roadmaps.
Participate in a shared on-call rotation providing 24×7 coverage for critical services.

Requirements
3+ years in a NOC/Network Operations or Network Support role for large-scale data center or service provider-style environments (hybrid/on-prem + cloud).
Solid understanding of TCP/IP and core protocols: BGP, OSPF/IS-IS, VLAN, VXLAN, EVPN, ACLs/NAT, DHCP, DNS, and QoS.
Proficiency with troubleshooting tools: Wireshark/tcpdump, mtr/traceroute, nmap, curl, iperf; comfortable on Linux for diagnostics and log analysis.
Experience operating multi-vendor networks (e.g., Arista, Cisco, Juniper, NVIDIA/Mellanox) and load balancers/firewalls.
Familiarity with AWS/GCP/Azure networking concepts (VPC/VNet, IGW/NATGW, peering, PrivateLink, routing, security groups).
Strong scripting/automation fundamentals (e.g., Bash/Python), and comfort with Git-based workflows for config versioning and change reviews.
Clear, concise communicator—able to write incident timelines, RCAs, and user-facing updates under time pressure.

Preferred
Knowledge of RoCE and Infiniband protocols a plus
Hands-on Kubernetes troubleshooting experience: CNI fundamentals (policies, encapsulation), Services/Ingress, DNS (CoreDNS), kube-proxy, and container runtime basics a huge plus
Understanding of AI training workloads and the demands they exert on networks a plus

About Together AI
Together AI is a research-driven artificial intelligence company. 
We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure. Compensation We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy  
MLOps / DevOps Engineer
Data Science & Analytics
lambda_labs_logo

Security Engineer - Architecture

Lambda AI
USD
296000
-
445000
US.svg
United States
Full-time
Remote
false
We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be. If you'd like to build the world's best deep learning cloud, join us.

Note: This position requires presence in our San Francisco office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

About the Role
Lambda Security protects some of the world's most valuable digital assets: invaluable training data, model weights representing immense computational investments, and the sensitive inputs required to leverage best-of-breed AI models. We're responsible for securing every byte that powers breakthrough artificial intelligence.

As a Security Engineer on our Architecture team, you'll be the technical foundation of our security design decisions, creating security architecture patterns and standards that directly protect customer data and enable Lambda to be the safest place to build with AI.

Reporting to the Senior Manager of Security and collaborating closely with Product Engineering, Platform Engineering, and embedded Technical Program Managers, you'll drive security architecture improvements across our AI-focused infrastructure. Your work will span security design reviews, threat modeling, architecture patterns, and security requirements that scale with our rapid growth while maintaining the highest security standards.

You will work on creating security architecture patterns, conducting threat models and security reviews, establishing security requirements for engineering teams, and developing customer-facing security documentation. You'll have unique access to LLMs hosted on our own infrastructure to pioneer AI-powered security architecture solutions that wouldn't be possible anywhere else.

If you thrive on solving complex security design challenges in cutting-edge AI infrastructure and want to build security architectures that scale from hundreds to thousands of systems, we'd love to talk.

We value diverse backgrounds, experiences, and skills, and we are excited to hear from candidates who can bring unique perspectives to our team. If you do not exactly meet this description but believe you may be a good fit, please still apply and help us understand your readiness for this role. Your application is not a waste of our time.

What You’ll Do
Drive Security Architecture: Design and document comprehensive security patterns, standards, and implementation guides that engineering teams can adopt to build secure-by-default systems.
Lead Security Reviews: Conduct security design reviews and develop threat models for critical systems, identifying risks and providing actionable recommendations.
Develop Security Requirements: Create clear security requirements and acceptance criteria that integrate seamlessly into engineering development cycles.
Build Security Solutions: Prototype and implement security controls, tools, and automation that demonstrate secure patterns and enable self-service security.
Pioneer AI-Powered Architecture: Leverage Lambda's hosted LLMs to build next-generation security capabilities including automated threat modeling, AI-assisted security reviews, and intelligent architecture recommendations that push far beyond traditional approaches.
Collaborate Across Engineering: Partner with Product and Platform Engineering teams to integrate security architecture requirements into their designs at optimal moments.
Enable Customer Trust: Develop customer-facing security documentation, architecture whitepapers, and technical security content that demonstrates our security maturity.
Mentor Security Excellence: Coach engineers across the organization on secure design principles and security architecture patterns, multiplying your impact.
Drive Architectural Standards: Establish and maintain security architecture standards that protect critical assets while enabling development velocity.
Advocate for Security: Communicate security architecture value to stakeholders, translating technical risks into business impact for informed decision-making.

What We Think a Candidate Needs to Demonstrate to Succeed
Have 3+ years of security engineering or security architecture experience and 5+ years of total engineering experience, with demonstrated impact protecting enterprise infrastructure.
Thrive in high-speed, high-ambiguity startup environments where you are constantly balancing security goals with business needs.
Deep technical expertise in security architecture patterns, threat modeling methodologies, and security design principles.
Excel at solving problems through design and prototyping in Python, Go, or similar languages.
Proven ability to work effectively with cross-functional technical teams both with and without authority (we're all on the same team!).
Strong Linux systems experience in both bare metal and cloud environments, understanding infrastructure from kernel to application layer.
Demonstrated experience driving security improvements that were enthusiastically adopted by engineering teams.
Excellence at translating security architecture decisions into business risk, enabling stakeholders to make informed decisions.

Nice to Have
You've led the security assessment and requirements for major platform components or enterprise systems.
Experience driving or providing significant evidence for compliance audits, such as SOC 2, ISO 27001, PCI-DSS, HIPAA/HITECH, or FedRAMP.
Deep experience with cloud security architecture and cloud provider security services (AWS, GCP, Azure).
Experience with AI/ML system security, including model security, data pipeline protection, or adversarial threat modeling (yes, we know it’s all brand new), or other high sensitivity workloads.
You've developed security architecture patterns that were adopted across multiple engineering teams.
Security certifications like CISSP, OSCP, or similar that demonstrate continued learning.
Experience with infrastructure-as-code security patterns and secure DevOps practices.
Excitement about leveraging our direct access to state-of-the-art LLMs to revolutionize security architecture: imagine AI-powered threat modeling, automated security design reviews, and intelligent architecture validation at a scale only possible when you host the AI infrastructure yourself.

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering