Top MLOps / DevOps Engineer Job Openings in 2025

Looking for MLOps / DevOps Engineer opportunities? This curated list features the latest MLOps / DevOps Engineer job openings from AI-native companies. Whether you're an experienced professional or just entering the field, find roles that match your expertise, from startups to global tech leaders. Updated every day.


Software Engineer (Site Reliability Engineer)

Anyscale
Salary: USD 180,600 - 200,900
Location: United States
Employment type: Full-time
Remote: No
About Anyscale:
At Anyscale, we're on a mission to democratize distributed computing and make it accessible to software developers of all skill levels. We’re commercializing Ray, a popular open-source project that's creating an ecosystem of libraries for scalable machine learning. Companies like OpenAI, Uber, Spotify, Instacart, Cruise, and many more have Ray in their tech stacks to accelerate the progress of AI applications out into the real world. With Anyscale, we’re building the best place to run Ray, so that any developer or data scientist can scale an ML application from their laptop to the cluster without needing to be a distributed systems expert. Proud to be backed by Andreessen Horowitz, NEA, and Addition with $250+ million raised to date.

About the role:
As a Site Reliability Engineer, you will play a crucial role in ensuring the smooth operation of all user-facing services and other Anyscale production systems. You will apply sound engineering principles, operational discipline, and mature automation to our environments and the Anyscale codebase as we scale.

As part of this role, you will:
- Develop a unified perspective on how cloud components are utilized across the company, taking into account diverse needs and requirements. This includes processes for provisioning, negotiating prices, managing costs, and identifying opportunities for teams to reduce wastage by finding applications across the company.
- Ensure that deployment methodologies align with the company's reliability goals.
- Build systems that promote understanding of production environments, facilitating quick identification of issues through robust observability infrastructure for metrics, logging, and tracing.
- Create monitoring and alerting systems at different levels, enabling teams to easily contribute and enhance the overall monitoring capabilities.
- Establish testing infrastructure to support the team in writing and executing tests effectively.
- Develop tools for measuring service level objectives (SLOs) and define organization-wide SLOs.
- Implement best practices and on-call systems, ensuring efficient incident management and up-leveling the incident management system at Anyscale.
- Coordinate the creation and deployment of cloud-based services, including tracking deployments and establishing effective communication channels for issue resolution.

We'd love to hear from you if you have:
- At least 3 years of relevant work experience in a similar role.

Compensation
At Anyscale, we take a market-based approach to compensation. We are data-driven, transparent, and consistent. As the market data changes over time, the target salary for this role may be adjusted.

This role is also eligible to participate in Anyscale's Equity and Benefits offerings, including the following:
- Stock Options
- Healthcare plans, with premiums covered by Anyscale at 99%
- 401k Retirement Plan
- Wellness stipend
- Education stipend
- Paid Parental Leave
- Flexible Time Off
- Commute reimbursement
- 100% of in-office meals covered

Anyscale values diversity and inclusion, and we encourage applications from individuals of all backgrounds. Anyscale Inc. is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Anyscale Inc. is an E-Verify company and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.
MLOps / DevOps Engineer
Data Science & Analytics

Product Security Engineer – Multimodal & Generative AI

Luma AI
Location: United States
Employment type: Full-time
Remote: Yes
About Luma Labs
At Luma Labs, we’re pioneering the next generation of multimodal generative AI, enabling models to create hyper-realistic videos and images from natural language and other rich input modalities. Our products empower creators, developers, and companies to generate content that was previously impossible, instantly and intelligently.

As we scale our AI platform and reach millions of users, we are hiring our Product Security Engineer to set the foundation for security across everything we build. This is a critical role that blends hands-on security engineering with strategic leadership, ideal for someone who thrives in fast-paced, high-impact environments and wants to shape security from day one.

Role Overview
You will be Luma Labs’ first dedicated security engineering hire. As the Product Security Engineer, you’ll own the security posture of our products, services, and generative systems. You’ll work directly with engineering, ML, infrastructure, and leadership to proactively design and implement secure systems, with a strong focus on the unique risks and opportunities in multimodal video and image generation.

This is a leadership-track position with both strategic ownership and deep technical execution.

What You’ll Do
- Own Product & Application Security: Define and drive Luma’s approach to secure product development, from design reviews to automated scanning to runtime protections.
- Secure GenAI Systems: Analyze and secure the full lifecycle of generative models (image, video, multimodal), including data ingestion, model inference, and API surface.
- Lead Threat Modeling & Reviews: Run deep security reviews on new features, architectures, and model capabilities, with a focus on abuse prevention, data leakage, and content safety.
- Build Security Infrastructure: Stand up tools and systems for static analysis, dependency scanning, secrets detection, and CI/CD hardening.
- Define Misuse & Abuse Guardrails: Partner with ML and product teams to mitigate prompt injection, jailbreaks, adversarial inputs, and misuse of generative outputs.
- Incident Response & Detection: Lead investigations and forensics for product-related security incidents, vulnerabilities, or model abuse cases.
- Influence Org-wide Security Culture: Establish best practices, run internal training, and serve as a go-to security expert across Luma’s growing technical teams.
- Build the Function: Help hire and grow a high-caliber security team as the company scales.

Requirements:
Must-Have:
- 5+ years in security engineering, with deep experience in product/application security.
- A successful track record of getting products through security certifications.
- Proven ability to operate as a hands-on engineer and technical leader.
- Strong understanding of generative AI systems or high-complexity ML applications.
- Proficient in secure development with Python and experience securing cloud-native environments (AWS/GCP, Docker/K8s).
- Deep experience with threat modeling, secure design, and modern application security tooling (SAST, DAST, IaC scanning, etc.).
- Ability to balance pragmatism and rigor: you can make fast, thoughtful decisions and execute in a fast-moving startup environment.
- Excellent written and verbal communication skills; comfortable collaborating across research, product, infra, and leadership.

Bonus / Nice-to-Have:
- Hands-on experience with generative models (e.g., diffusion, transformers, vision-language) and related risks (e.g., prompt injection, data leakage).
- Experience building or leading security teams in an early-stage startup.
- Exposure to red teaming, adversarial ML, or AI safety frameworks.
- Public speaking, open-source contributions, or research in security or AI fields.

Why This Role is Unique
- Greenfield Security: You’ll be defining the security architecture of one of the most advanced generative AI stacks in the world from the ground up.
- Cross-Disciplinary Impact: Collaborate directly with ML researchers, creative technologists, infra engineers, and designers.
- Fast Path to Leadership: This is a founding role with direct access to leadership and influence over future hires and security roadmap.
- Deep Tech with Real Users: Work on cutting-edge video and image generation tools already in production and scaling fast.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Machine Learning Engineer
Data Science & Analytics

Member of Technical Staff, Infrastructure & Scaling

Parallel
Location: United States
Employment type: Full-time
Remote: No
At Parallel Web Systems, we are bringing a new web to life: it’s built with, by, and for AIs. Our work spans innovations across crawling, indexing, ranking, retrieval, and reasoning systems. Our first product is a set of APIs for AIs to do more with web data. We are a fully in-person team based in Palo Alto, CA. Our organization is flat; our team is small and talent dense.

We want to talk to you if you are someone who can bring us closer to living our aspirational values:
- Own customer impact - It’s on us to ensure real-world outcomes for our customers.
- Obsess over craft - Perfect every detail because quality compounds.
- Accelerate change - Ship fast, adapt faster, and move frontier ideas into production.
- Create win-wins - Creatively turn trade-offs into upside.
- Make high-conviction bets - Try and fail. But succeed an unfair amount.

Job: You will build, operate, and scale our infrastructure, including our infrastructure around large language models, and ensure that our systems are reliable and cost-efficient as we grow. You will anticipate bottlenecks before they appear, ensure that our architecture evolves to meet increasing demands, and build the tools and systems that keep engineering velocity high.

You: Have deep intuition on distributed systems, cloud platforms, performance tuning, and scalable architecture. You like to reason about trade-offs between cost, reliability, and speed of iteration. You care about your work enabling every team to build faster and ship confidently, and about infrastructure that can support products used by millions without breaking a sweat.

Our founder is Parag Agrawal. Previously, he was the CEO and CTO at Twitter. Our investors include First Round Capital, Index Ventures, Khosla Ventures, and many others.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering

Engineering Manager - AI Reliability

Anthropic
Salary: USD 405,000 - 485,000
Location: United States
Employment type: Full-time
Remote: No
About Anthropic
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the role
Anthropic is seeking an experienced engineering leader to lead one of our Reliability Engineering teams. This team includes Software Engineers and Systems Engineers focused on defining and achieving reliability metrics for Anthropic's critical serving systems. As a manager, you'll lead the team that's significantly improving reliability for Anthropic's services while pioneering the use of modern AI capabilities to reengineer how we approach reliability engineering. This leadership role is critical to Anthropic's mission to bring groundbreaking AI technologies to benefit humanity in a safe and reliable way.

Responsibilities:
- Lead and grow a team of reliability engineers responsible for large language model serving.
- Drive the development of Service Level Objectives that balance availability/latency with development velocity across the organization.
- Oversee the design and implementation of comprehensive monitoring systems for availability, latency, and other critical metrics.
- Guide your team in architecting high-availability language model serving infrastructure capable of supporting millions of external customers and high-traffic internal workloads.
- Lead the strategy for automated failover and recovery systems across multiple regions and cloud providers.
- Establish and manage incident response processes for critical AI services, ensuring your team drives rapid recovery and systematic improvements.
- Direct cost optimization initiatives for large-scale AI infrastructure, with a focus on accelerator (GPU/TPU/Trainium) utilization and efficiency.
- Partner with cross-functional teams to align reliability engineering efforts with broader company objectives.
- Build a strong engineering culture focused on reliability, operational excellence, and innovation.

You may be a good fit if you:
- Have experience managing and scaling reliability or infrastructure engineering teams.
- Possess deep technical knowledge of distributed systems observability and monitoring at scale.
- Understand the unique challenges of operating AI infrastructure and can guide technical decisions.
- Have successfully implemented SLO/SLA frameworks and can drive adoption across organizations.
- Bring experience with both traditional infrastructure metrics and AI-specific performance indicators.
- Can effectively lead technical discussions while translating between ML engineers and infrastructure teams.
- Have excellent leadership and communication skills, with the ability to influence at all levels.
- Demonstrate strong hiring and talent development capabilities.

Strong candidates may also:
- Have managed teams operating large-scale model training or serving infrastructure (>1000 GPUs).
- Bring hands-on experience with ML hardware accelerators (GPUs, TPUs, Trainium, etc.).
- Understand ML-specific networking optimizations and their operational implications.
- Have led teams through major reliability transformations or infrastructure migrations.
- Possess experience building reliability engineering practices from the ground up.
- Have contributed to or led open-source infrastructure or ML tooling initiatives.
- Demonstrate thought leadership in the reliability engineering community.

The expected salary range for this position is:
Annual Salary: $405,000 - $485,000 USD

Logistics
Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience.
Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.
Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work. We think AI systems like the ones we're building have enormous social and ethical implications. We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team.

How we're different
We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact (advancing our long-term goals of steerable, trustworthy AI) rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We're an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication skills.

The easiest way to understand our research directions is to read our recent research. This research continues many of the directions our team worked on prior to Anthropic, including: GPT-3, Circuit-Based Interpretability, Multimodal Neurons, Scaling Laws, AI & Compute, Concrete Problems in AI Safety, and Learning from Human Preferences.

Come work with us!
Anthropic is a public benefit corporation headquartered in San Francisco. We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues.

Guidance on Candidates' AI Usage: Learn about our policy for using AI in our application process.
MLOps / DevOps Engineer
Data Science & Analytics
Machine Learning Engineer
Data Science & Analytics
Software Engineer
Software Engineering

Electrical Engineer: AI Hardware

xAI
Location: United States
Remote: No
About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role
As an Electrical Engineer on the AI Hardware team at xAI’s Memphis datacenters, you will play a critical role in designing, evaluating, and maintaining the electrical infrastructure that supports our high-performance AI computing systems. This role requires hands-on expertise in debugging complex electrical infrastructure issues to ensure reliability, safety, and scalability for mission-critical operations. You will collaborate with multidisciplinary teams, including hardware engineers, data center operations, and external vendors, to deliver robust solutions that align with xAI’s ambitious goals.

Key Responsibilities
- Infrastructure Design & Optimization: Contribute to the design and implementation of electrical systems, including power distribution units (PDUs), uninterruptible power supplies (UPS), transformers, switchgear, and backup generators, tailored for AI hardware workloads.
- Debugging & Troubleshooting: Diagnose and resolve complex electrical infrastructure issues, minimizing downtime and ensuring continuous operation of AI hardware systems. Utilize analytical tools and methodologies to identify root causes of power anomalies, failures, or inefficiencies.
- System Maintenance: Develop and execute maintenance schedules for electrical systems to maximize uptime and performance. Perform regular inspections and testing to ensure compliance with industry standards (e.g., Uptime Institute Tier Standards) and safety regulations.
- Capacity Planning: Analyze current and projected power demands for AI hardware deployments, proposing scalable solutions to meet growing computational needs while optimizing energy efficiency.
- Project Coordination: Support infrastructure projects from concept to commissioning, coordinating with internal teams and external contractors to meet project timelines and quality standards.
- Collaboration & Innovation: Work closely with AI hardware engineers, data center technicians, and senior electrical engineers to integrate power systems with cutting-edge AI hardware. Contribute to a culture of continuous improvement by proposing innovative solutions for energy resilience and efficiency.
- Compliance & Safety: Ensure all electrical systems and operations adhere to local, state, and federal regulations, as well as xAI’s safety protocols, to maintain a secure working environment.

Basic Qualifications
- Bachelor’s degree in Electrical Engineering or a related field.
- 3+ years of experience in electrical engineering, with a focus on data center or high-performance computing environments.
- Proven expertise in debugging and troubleshooting electrical infrastructure, including power distribution, backup systems, and cooling units.
- Proficiency with tools such as AutoCAD, Revit, Bluebeam, or SKM for system design and analysis.
- Strong understanding of data center electrical components (e.g., PDUs, UPS systems, generators, switchgear).
- Familiarity with industry standards and safety regulations for electrical systems.
- Excellent problem-solving skills and the ability to work in a fast-paced, mission-driven environment.

Preferred Qualifications
- Experience working with electrical systems supporting AI hardware or high-performance computing (HPC) workloads.
- Engineer in Training (EIT) certification or progress toward Professional Engineer (PE) licensure.
- Knowledge of energy efficiency strategies and renewable energy integration for data centers.
- Prior experience in project management or coordination for large-scale infrastructure projects.
- Familiarity with AI-driven diagnostic tools or predictive maintenance technologies for electrical systems.

xAI is an equal opportunity employer.
California Consumer Privacy Act (CCPA) Notice
MLOps / DevOps Engineer
Data Science & Analytics

Senior Manager - Integrations Specialist

Notable
Salary: USD 150,000 - 165,000
Location: United States
Employment type: Full-time
Remote: Yes
Notable is the leading healthcare AI platform for transforming workforce productivity. Health systems, hospitals, and payers use Notable to improve healthcare quality, close gaps in patient care, and drive member enrollment and patient acquisition, retention, and reimbursement, scaling growth without hiring more staff. We are on a mission to improve the lives of patients, staff, and clinicians - to improve healthcare for humanity. This isn't just a lofty goal - it's something we're achieving every single day. When you join Notable, you become part of a force actively transforming healthcare. Our aim to impact 100 million patients isn't just a number; it's a commitment to creating meaningful change on a massive scale. Therefore, our culture is purposeful in pursuit of this mission. We believe our culture gives each person the opportunity to do the best work of their lives, work with the best teammates, and have fun achieving great things together.

Role Summary
As Senior Manager, Integration Specialists, you’ll lead a high-performing team that designs, builds, and maintains the connections between our customers’ ecosystems (EHRs, ancillary systems, data warehouses, and third-party apps) and Notable’s Flow Builder platform. You’ll own the Integrations delivery function end to end - defining technical standards, steering roadmap priorities, and ensuring our integrations are resilient, scalable, and future-proof.

This is a true player-coach role: you’ll be hands-on with complex interface challenges while also developing team capabilities, driving strategic initiatives, and aligning cross-functional partners in Customer Success, Delivery, Product, and Engineering. Your leadership will directly shape how Notable accelerates implementations, expands integration capabilities with health systems, and unlocks new automation opportunities for providers and payers.

What You’ll Do
- Lead, mentor, and grow a team of Integration Specialists - guiding technical execution, customer engagement, and career development.
- Own delivery of new interfaces and automations (HL7 v2, X12, FHIR, flat file, REST, RPA, etc.), ensuring they track against KPIs, are built on time, meet performance SLAs, and align with customer outcomes.
- Provide deep technical oversight on Mirth Connect channels, routing logic, transformations, and JavaScript extensions - setting standards for coding, testing, and monitoring.
- Shape the integrations delivery roadmap - partnering with Product and our AI Platform Architects to prioritize reusable adapters, API accelerators, and tooling that boost speed-to-value and reduce maintenance effort.
- Cultivate & manage integration partnerships where needed, to maintain existing relationships and contracts and to create expansion opportunities for the suite of available integrations on the platform.
- Act as an escalation point for thorny interface issues - troubleshooting data flow, message parsing, and system performance across Epic, Oracle Cerner, Athena, and other partners.
- Nurture customer relationships - engage with customer technical teams and project leads to build technical trust and help them understand our integration models, implementation methodologies, and any technical dependencies on their teams.
- Drive operational excellence - instituting best-practice documentation and implementation standards, identifying ways to make our suite of integrations capabilities faster and easier to implement for our customers.
- Collaborate cross-functionally with Product, Engineering, and AI Platform Architects to embed integration considerations into solution design early and ensure seamless end-to-end workflows.
- Champion team learning & innovation - promoting knowledge-sharing on new standards (e.g., emerging EHR APIs), industry-standard best practices, and advanced automation techniques.
- Support hiring and onboarding, building a diverse pipeline of integration talent and setting new team members up for success.

You’re a Great Fit If You…
- Love leading technical teams and coaching others while staying close to hands-on technical integration work.
- Thrive in an entrepreneurial, mission-driven environment where autonomy and innovation are valued.
- Bring a systems mindset - balancing near-term delivery with longer-term scalability and platform thinking.
- Communicate clearly across technical and non-technical audiences - from integration engineers to clinical executives.
- Are energized by building repeatable frameworks that unlock measurable customer impact.

What We’re Looking For
- 8+ years in healthcare integration or interoperability roles (interface engine developer, solution architect, technical lead), including 3–5+ years leading teams.
- Expert-level proficiency with Mirth Connect (channel design, JavaScript/JavaScript Reader/Writer, filters, transformers, alerts) and a strong understanding of modern interface paradigms (RESTful APIs, event-driven workflows, etc.).
- Deep knowledge of HL7 v2 messaging (ADT, ORM, ORU, REF, MDM, etc.), FHIR APIs, flat-file ETL, and REST/JSON web services.
- Proven experience integrating with Epic and Oracle Cerner EHRs.
- Comfortable reading and troubleshooting JavaScript, and collaborating with engineering on reusable libraries and extensions.
- Demonstrated success automating clinical or operational workflows that reduce manual effort and improve data quality.
- Track record driving complex projects from discovery through go-live, with strong project, organizational, and time-management skills.
- High intellectual horsepower and a rigorous, analytical problem-solving approach; ability to mentor rising talent on platform thinking and accountability.
- Willingness to travel up to 25–30% for key customer onsite sessions and team events.

Beware of job scam fraudsters! Our recruiters use @notablehealth.com email addresses exclusively. We do not conduct interviews via text or instant message and we do not ask candidates to download software other than Zoom, to purchase equipment through us, or to provide sensitive personally identifiable information such as bank account or social security numbers. If you have been contacted about a job offer by someone claiming to represent us from a different domain, please report it as potential job fraud to law enforcement and contact us here.
MLOps / DevOps Engineer
Data Science & Analytics
Solutions Architect
Software Engineering

DevOps Automation Engineer

TensorWave
Location: United States
Employment type: Full-time
Remote: No
At TensorWave, we’re leading the charge in AI compute, building a versatile cloud platform that’s driving the next generation of AI innovation. We’re focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what’s possible in the AI landscape.

DevOps Automation Engineer:
We are seeking a Senior DevOps Automation Engineer to help scale and maintain the core infrastructure supporting large-scale AI workloads and managed Kubernetes environments. This role requires deep Linux expertise, hands-on experience with virtualization platforms, and a strong grasp of automation tools and CI/CD strategy.

While primarily operational, this role also requires the ability to discuss architecture, improve tooling and automation pipelines, and collaborate across teams. You’ll work with technologies such as Proxmox, MAAS, Terraform, Packer, Ansible, and custom tooling to automate bare-metal provisioning, VM lifecycle management, and infrastructure configuration.

Responsibilities:
- Automate bare-metal provisioning and VM lifecycle management using Proxmox, MAAS, and Ansible
- Design and implement infrastructure-as-code solutions using Terraform and Packer
- Build and maintain CI/CD pipelines and automation frameworks using GitHub Actions and similar tools
- Support the lifecycle of Kubernetes clusters, including provisioning, upgrades, and integration with virtualization layers
- Contribute to drift detection, image customization, and hardware configuration management
- Write and maintain scripts and tooling in Bash, Functional Python, and YAML
- Collaborate with other engineers to improve system observability using Prometheus and related tools
- Participate in operational support, system debugging, and performance tuning

Required Skills & Experience:
- 4+ years in DevOps, Infrastructure, or Systems Automation roles
- Strong Linux experience, particularly with Ubuntu
- Proficiency with Ansible, Terraform, Packer, and CI/CD tools like GitHub Actions
- Experience with virtualization platforms, particularly Proxmox
- Familiarity with bare-metal deployment tools such as MAAS or similar
- Strong scripting skills in Bash, Python, and YAML
- Working knowledge of Prometheus and infrastructure monitoring best practices
- Comfortable discussing infrastructure architecture and CI/CD strategies

Nice to Have
- Experience managing infrastructure for AI/ML workloads
- Familiarity with Kubernetes internals and cluster bootstrapping
- Exposure to hardware management, PXE booting, and image customization

What We Bring:
In addition to a competitive salary, we offer a variety of benefits to support your needs, including:
- Stock Options
- 100% paid Medical, Dental, and Vision insurance
- Life and Voluntary Supplemental Insurance
- Short Term Disability Insurance
- Flexible Spending Account
- 401(k)
- Flexible PTO
- Paid Holidays
- Parental Leave
- Mental Health Benefits through Spring Health
MLOps / DevOps Engineer
Data Science & Analytics

Founding DevOps Engineer

Retell AI
Salary: USD 215,000 - 290,000
Location: United States
Employment type: Full-time
Remote: No
About Retell AI
At Retell AI, we're not just automating calls; we’re transforming how the world communicates. Our AI voice agents are reshaping sales, support, and customer engagement for leading brands. Backed by Alt Capital, Y Combinator, and top-tier investors, we've raised $4.7M in seed funding and hit $14M ARR with just 12 people. We’re one of the fastest-growing Voice AI startups, and we're on a mission to become the standard for voice automation at scale. We're also one of the top-ranking startups at https://leanaileaderboard.com/.

About the Role
As a Founding DevOps Engineer, you’ll be the owner of our build, release, and runtime foundations. You’ll design and automate deployment pipelines for both cloud SaaS and on-prem environments, orchestrate containers at scale, and ship reliable releases that meet compliance requirements. You’ll work cross-functionally with product, security, and customer teams, then turn what you learn in the field into reusable platform capabilities.

Key Responsibilities
Own CI/CD end-to-end: design, implement, and operate pipelines with blue/green, canary, and phased rollouts; define graceful draining for HA systems.
Architect, maintain, and harden the Kubernetes-based runtime (Docker, Kubernetes, Helm), including multi-cluster and multi-tenant concerns.
Manage cloud deployments across AWS/Azure/GCP and coordinate with on-prem infrastructure teams; standardize with IaC (e.g., Terraform).
Implement robust observability (metrics, logs, traces), SLOs/error budgets, and automated rollback/one-click restore.
Partner with compliance to integrate SOC 2 / ISO 27001 / HIPAA controls into pipelines (artifact signing, SBOMs, change management, access/keys).
Deploy at customer sites (cloud or on-prem), collaborating with client teams for integration, runbooks, and handover.
Lead incident response & postmortems; drive resilience, cost, and performance improvements.
Document release processes and platform conventions; codify best practices into tooling and templates.

You Might Thrive If You
Have deep hands-on experience with a major cloud (AWS, Azure, or GCP) and container orchestration (Kubernetes, Helm).
Build production-grade CI/CD with GitHub Actions / GitLab CI / Jenkins (or similar), including complex rollout strategies.
Have shipped both SaaS and on-prem solutions, navigating networking, security, and environment drift.
Can integrate compliance and security into delivery (secret management, image signing, policy-as-code).
Are comfortable with networking fundamentals, security hardening, and performance tuning.
Communicate clearly, move fast in ambiguity, and enjoy being the responsible adult in prod.

Job Details
Job Type: Full-time, 70 hr/week (50 hr/week onsite with flexible hours + 20 hr/week work from home)
Cash: 215k - 290k
Equity: 0.3 - 0.6%
Location: Redwood City, CA, US
US Visas: Sponsors Visa & Green Card

Other Benefits
100% medical, dental, vision insurance coverage
Unlimited breakfast, lunch, dinner, and snacks
Gym and daily commute fee reimbursement
Internet and phone bill covered

Compensation Philosophy
Best Offer Upfront: Choose from three cash-equity balance options, no negotiation needed.
Top 1% Talent: Above-market pay (top 5 percentile) to attract high performers.
High Ownership: Small teams, >$1M revenue/employee, and significant equity.
Performance-Based: Offers tied to interview performance, not experience or past salaries.

Interview Process
Online Assessment (25–30 min): One HackerRank coding question on practical problem-solving (7 days to complete).
Technical Phone Interview 1 (30 min): Live coding on CoderPad, focusing on data structures and algorithms.
Technical Phone Interview 2 (30–45 min): Full-stack development with JavaScript, TypeScript, React, and Node.js in a local environment.
Onsite/Virtual Interviews (2-3 hrs): Hosted in our office if located in the Bay Area, or virtual, with three rounds:
DevOps Build & Run: Design a Kubernetes deployment with blue/green, draining, and a 2-hour instance lifetime constraint; walk through rollout/rollback.
Communication (FDE-style): Partner exercise on explaining trade-offs and aligning stakeholders.
Systems Design (DevOps): Architect a generalized on-prem solution deployable across multiple clouds with different data stores, key vaults, encryption, availability/failover, monitoring, upgrades, and maintenance.

Learn More
Retell AI - API That Turns Your LLM Into A Human-Like Voice Agent
Retell AI Basics: Everything You Need to Start Building Voice Agents

Join Retell AI to shape the future of voice automation, building scalable, impactful full-stack systems that redefine AI-driven communication.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
hellorobin_logo

Lead SRE

Robin
-
ZA.svg
South Africa
Full-time
Remote
false
About Robin
Robin is on a mission to rebuild the legal industry, starting with making contracts simple for everyone. We are a pioneer in Legal AI, built on proprietary models, licensed data, and deep partnerships with Anthropic and AWS. Since 2019, we’ve expanded our footprint to 4 continents and have been supporting many of the world’s most successful businesses, including GE, Pfizer, KPMG, and UBS.

What will you do as an SRE Team Lead?
As an SRE Team Lead at Robin AI, you'll lead a team of two SRE Engineers while reporting directly to the CTO. You'll help build and maintain the cloud infrastructure and applications that power our cutting-edge Legal AI platform. You'll provide change leadership to your team and collaborate with engineering leaders to establish robust monitoring, incident response, and deployment strategies that ensure high availability and reliability of our proprietary models and services, maintaining optimal SLOs for our global customer base.

Your day-to-day responsibilities:
Lead and mentor a team of two SRE Engineers, providing technical guidance and career development
Work closely with the CTO to define and implement the technical infrastructure roadmap
Establish monitoring strategies and implement solutions to enhance reliability, scalability, and cost-efficiency
Collaborate with development team leads to optimise build, test, and deployment processes
Lead incident response and establish processes for troubleshooting production issues
Organise and oversee on-call rotations to ensure 24/7 system reliability
Drive documentation standards and knowledge sharing within the engineering organisation

Ideally, you should have the following qualifications:
5+ years of experience in DevOps or Site Reliability Engineering roles, with 2+ years in a managerial position
Proven experience managing and mentoring technical team members
Proficiency in at least one backend programming language (we use Python)
Strong knowledge of AWS services (ECS, S3, RDS, Lambda, etc.), managed by Terraform
Knowledge of observability frameworks and tools (we use OpenTelemetry, CloudWatch & Datadog)
Excellent leadership, communication, and problem-solving skills
Experience with AI/ML infrastructure deployment and scaling

What’s in it for you
Salary: Competitive
Hybrid schedule: We offer a flexible working schedule. #LI-HYBRID
Equity package: Generous equity scheme - everyone gets to be an owner of Robin AI!
Annual leave: 20 days PTO, in addition to the public holidays observed in South Africa.
Growth opportunities: We prioritise promotions for high performers and help you to progress your career.

What’s it like working at Robin?
Our culture and values attract people who are creative, resourceful, and share our passion for excellence. At Robin, you're encouraged to push yourself and empowered to take risks. We support each other to think big, try new ideas, and navigate uncertainty. Whether you're at our headquarters or one of our worldwide offices, you'll find a world of opportunities to grow, thrive, and make a meaningful impact. See what life is like at Robin.

Diversity, Equity and Inclusion at Robin
We are committed to building one of the most diverse technology companies in the world. As of 2024, more than 30% of our employees come from ethnic minority backgrounds, and 51% of roles are held by women. We know that transforming the legal industry requires diverse perspectives, so we're creating an environment where innovation thrives through inclusion.

Robin operates a direct hiring model and any speculative CVs shared via agencies will be treated as a gift.
MLOps / DevOps Engineer
Data Science & Analytics
Apply
hellorobin_logo

SRE

Robin
-
ZA.svg
South Africa
Full-time
Remote
false
About Robin
Robin is on a mission to rebuild the legal industry, starting with making contracts simple for everyone. We are a pioneer in Legal AI, built on proprietary models, licensed data, and deep partnerships with Anthropic and AWS. Since 2019, we’ve expanded our footprint to 4 continents and have been supporting many of the world’s most successful businesses, including GE, Pfizer, KPMG, and UBS.

What will you do as an SRE?
As an SRE at Robin AI, you'll help build and maintain the cloud infrastructure and applications that power our cutting-edge Legal AI platform. You'll collaborate with engineering teams to establish robust monitoring, incident response, and deployment strategies that ensure high availability and reliability of our proprietary models and services, maintaining optimal SLOs for our global customer base.

Your day-to-day responsibilities:
Ensure that Robin's systems are highly available and scalable
Standardise and implement observability practices in our service-based architecture through logging, traces, metrics and monitors
Design, deploy, and operate infrastructure to support Robin's product teams as we expand into new regions
Add automation around manual operational tasks
Collaborate with development team leads to optimise build, test, and deployment processes
Participate in and improve our on-call and incident handling processes to ensure 24/7 system reliability

Ideally, you should have the following qualifications:
3+ years of experience in DevOps or Site Reliability Engineering roles
Proficiency in at least one backend programming language (we use Python)
Strong knowledge of AWS services (ECS, S3, RDS, Lambda, etc.), managed by Terraform
Comfort troubleshooting across the full stack, starting from the browser, through the networking components, into the containerised applications and then onto data stores
Knowledge of observability frameworks and tools (we use OpenTelemetry, CloudWatch & Datadog)
Excellent problem-solving and communication skills
Experience with AI/ML infrastructure deployments is a plus

What’s in it for you
Salary: Competitive
Hybrid schedule: We offer a flexible working schedule. #LI-HYBRID
Equity package: Generous equity scheme - everyone gets to be an owner of Robin AI!
Annual leave: 20 days PTO, in addition to the public holidays observed in South Africa.
Growth opportunities: We prioritise promotions for high performers and help you to progress your career.

What’s it like working at Robin?
Our culture and values attract people who are creative, resourceful, and share our passion for excellence. At Robin, you're encouraged to push yourself and empowered to take risks. We support each other to think big, try new ideas, and navigate uncertainty. Whether you're at our headquarters or one of our worldwide offices, you'll find a world of opportunities to grow, thrive, and make a meaningful impact. See what life is like at Robin.

Diversity, Equity and Inclusion at Robin
We are committed to building one of the most diverse technology companies in the world. As of 2024, more than 30% of our employees come from ethnic minority backgrounds, and 51% of roles are held by women. We know that transforming the legal industry requires diverse perspectives, so we're creating an environment where innovation thrives through inclusion.

Robin operates a direct hiring model and any speculative CVs shared via agencies will be treated as a gift.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
metropolisio_logo

Senior Manager, Central Cloud Infrastructure

Metropolis
USD
0
200000
-
250000
US.svg
United States
Full-time
Remote
false
The Company Metropolis is an artificial intelligence company that uses computer vision technology to enable frictionless, checkout-free experiences in the real world. Today, we are reimagining parking to enable millions of consumers to just "drive in and drive out." We envision a future where people transact in the real world with a speed, ease and convenience that is unparalleled, even online. Tomorrow, we will power checkout-free experiences anywhere you go to make the everyday experiences of living, working and playing remarkable - giving us back our most valuable asset, time.   The Role Metropolis is seeking a Senior Manager to the Central Cloud Infrastructure team: the pivotal engineering group that governs cloud usage, manages foundational elements of the cloud, develops shared tooling for managing cloud resources, supports other teams through consultancy, and ensures the cloud is secure / meets compliance requirements. With a holistic perspective on Metropolis's cloud ecosystem the team provides support for sub-systems such as workload orchestration, data persistence, observability, and endpoint access. In this role you will lead a team of engineers in creating a substrate of systems, developing best practices, and engineering breakthroughs that accelerate other teams while maintaining the health of Metropolis's cloud and increasing the velocity of the CCI team itself. As a senior member of the Advanced Technologies Group you will provide technical as well as both intra- and inter- team leadership to find optimal solutions and strategies for a diverse set of areas of concern at various timeframes and a growing suite of verticals.   Responsibilities Facilitate the further development by team members of overall team strategy and scope for various timeframes and directly contribute to that planning. 
Support the tactical coordination of work-streams and prioritization of items within those streams across team members based on technical dependencies, business urgency, and other considerations. Ensure talent development by identifying opportunities for and supporting team members' participation in technical and leadership growth opportunities. Foster a collaborative atmosphere that delivers business value at an increasing velocity. Maximize the utility of engineering deliverables (code, documentation, etc.) by promoting their socialization, identifying opportunities for further value extraction, and enabling reusability. Aid CCI team members by providing timely and relevant knowledge of the priorities / plans of other teams by maintaining informative communications and collaborative relations across the company. Help broadcast the goals, activities, and requirements of the team and provide meaningful measures of success and progress to upper management. Be a stakeholder for the cloud and related systems with a vision for the long-term health, performance, efficiency, and utility of these resources. Participate in an on-call system to handle high urgency issues.   Qualifications  BS, MS or PhD in Computer Science or a relevant engineering discipline. 8+ years of experience, including at least 2 years leading and managing cloud infrastructure related engineering teams. 2+ years of experience as a hands-on senior, staff or principal engineer before transitioning into managing teams. An inter- and intra-team facilitator mindset, combined with recognition of the technical and organizational leadership role a central cloud infrastructure team can play. Experience with meeting compliance requirements for cloud environments. Experience crafting disaster recovery plans and architecting technical solutions. Track record of successfully developing cloud infrastructure that meets product requirements. 
Technical experience with AWS, Terraform, Terraform wrappers, Kubernetes, EKS, Helm, containerization, SQL databases, cloud networking, and Python. Excellent written and verbal communication skills with a proven ability to present complex technical information in a clear and concise manner to a variety of audiences. Previous experience working inside innovative, high-growth environments is a plus.   When you join Metropolis, you’ll join a team of world-class product leaders and engineers, building an ecosystem of technologies at the intersection of parking, mobility, and real estate. Our goal is to build an inclusive culture where everyone has a voice and the best idea wins. You will play a key role in building and maintaining this culture as our organization grows. The anticipated base salary for this position is $200,000.00 to $250,000.00 annually. The actual base salary offered is determined by a number of variables, including, as appropriate, the applicant's qualifications for the position, years of relevant experience, distinctive skills, level of education attained, certifications or other professional licenses held, and the location of residence and/or place of employment. Base salary is one component of Metropolis’s total compensation package, which may also include access to or eligibility for healthcare benefits, a 401(k) plan, short-term and long-term disability coverage, basic life insurance, a lucrative stock option plan, bonus plans and more.  #LI-NM1 #LI-Onsite Metropolis Technologies is an equal opportunity employer. We make all hiring decisions based on merit, qualifications, and business needs, without regard to race, color, religion, sex (including gender identity, sexual orientation, or pregnancy), national origin, disability, veteran status, or any other protected characteristic under federal, state, or local law.  
MLOps / DevOps Engineer
Data Science & Analytics
Apply
hp_iq_logo

Lead Quality Assurance Engineer

HP IQ
USD
150000
-
210000
US.svg
United States
Full-time
Remote
false
Who We Are
HP IQ is HP’s new AI innovation lab. Combining startup agility with HP’s global scale, we’re building intelligent technologies that redefine how the world works, creates, and collaborates. We’re assembling a diverse, world-class team of engineers, designers, researchers, and product minds, focused on creating an intelligent ecosystem across HP’s portfolio. Together, we’re developing intuitive, adaptive solutions that spark creativity, boost productivity, and make collaboration seamless. We create breakthrough solutions that make complex tasks feel effortless, teamwork more natural, and ideas more impactful, always with a human-centric mindset. By embedding AI advancements into every HP product and service, we’re expanding what’s possible for individuals, organisations, and the future of work. Join us as we reinvent work, so people everywhere can do their best work.

About the Role
The Quality Engineering team ensures every HP IQ experience is robust, reliable, and delightful to use. We are looking for a Lead Quality Assurance Engineer to lead test strategy and quality initiatives across software and hardware teams. You’ll define and scale test automation, influence architecture for testability, and partner across disciplines to raise the bar for what quality means in the AI era.

What You Might Do
Design comprehensive test strategies that align with product goals, technical architecture, and customer experience.
Develop and maintain automation frameworks for functional, regression, and integration testing.
Lead cross-functional collaboration with engineers, product managers, UX, and design to embed quality from ideation through release.
Define and drive key quality metrics, triage defect patterns, and surface actionable insights through dashboards and reports.
Establish standards for test planning, execution, and documentation, mentoring QA engineers and championing best practices across teams.
Build tooling and infrastructure to support continuous integration, faster feedback cycles, and scalable test coverage.
Proactively identify future testing needs across cloud, edge, and device interfaces, shaping the long-term QA roadmap.

Essential Qualifications
8+ years of experience in software quality engineering, with a proven record of raising quality bars in complex systems.
4+ years leading QA projects or frameworks, including mentoring or coaching peers.
Strong background in both manual and automated testing for customer-facing applications.
Proficiency with test automation tools and scripting (e.g., Selenium, Playwright, Python, or similar).
Deep understanding of QA methodologies, metrics, and test architecture design.
Experience programmatically validating product behavior and analyzing diagnostic signals from CI pipelines.
Bachelor's degree in Computer Science or a related field.

Preferred Skills
Experience testing AI-powered or edge-connected applications.
Familiarity with backend service testing, distributed systems, and API validation.
Strong communication skills that emphasize collaboration, documentation, and continuous improvement.
Knowledge of cloud-based infrastructure and CI/CD tools (e.g., GitHub Actions, Jenkins, or CircleCI).
Passion for user experience, accessibility, and long-term product health.

Salary range: $150,000 - $210,000

Compensation & Benefits (Full-Time Employees)
The salary range for this role is listed above. Final salary offered is based upon multiple factors including individual job-related qualifications, education, experience, knowledge and skills. At HP IQ, we offer a competitive and comprehensive benefits package, including:
Health insurance
Dental insurance
Vision insurance
Long term/short term disability insurance
Employee assistance program
Flexible spending account
Life insurance
Generous time off policies, including:
4-12 weeks fully paid parental leave based on tenure
11 paid holidays
Additional flexible paid vacation and sick leave (US benefits overview)

Why HP IQ?
HP IQ is HP’s new AI innovation lab, building the intelligence to empower humanity, reimagining how we work, create, and connect to shape the future of work.
Innovative Work: Help shape the future of intelligent computing and workplace transformation.
Autonomy and Agility: Work with the speed and focus of a startup, backed by HP’s scale.
Meaningful Impact: Build AI-powered solutions that help people and organisations thrive.
Flexible Work Environment: Freedom and flexibility to do your best work.
Forward-Thinking Culture: We learn fast, stay future-focused, and imagine what comes next, together.

Equal Opportunity Employer (EEO) Statement
HP, Inc. provides equal employment opportunity to all employees and prospective employees, without regard to race, color, religion, sex, national origin, ancestry, citizenship, sexual orientation, age, disability, or status as a protected veteran, marital status, familial status, physical or mental disability, medical condition, pregnancy, genetic predisposition or carrier status, uniformed service status, political affiliation or any other characteristic protected by applicable national, federal, state, and local law(s). Please be assured that you will not be subject to any adverse treatment if you choose to disclose the information requested. This information is provided voluntarily. The information obtained will be kept in strict confidence.
If you’d like more information about HP’s EEO Policy or your EEO rights as an applicant under the law, please click here: Equal Employment Opportunity is the Law Equal Employment Opportunity is the Law – Supplement
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
otter_ai_logo

Senior Software Engineer, Cloud Security

Otter.ai
USD
0
185000
-
210000
US.svg
United States
Full-time
Remote
false
The Opportunity
We are seeking an experienced Cloud Security Engineer to join our team. The successful candidate will be responsible for designing, implementing, and maintaining the security of our cloud infrastructure and applications. This includes ensuring compliance with regulatory requirements, identifying and mitigating security risks, and collaborating with DevOps teams to ensure secure cloud deployments.

Your Impact
Design and implement secure cloud architectures and configurations
Conduct cloud security assessments and risk analyses
Implement and manage cloud security controls, such as firewalls, access controls, and encryption technologies
Monitor cloud security logs and investigate security alerts
Respond to security incidents and develop incident response plans
Ensure cloud compliance with regulatory requirements, such as HIPAA, PCI-DSS, and GDPR
Collaborate with DevOps teams to ensure secure cloud deployments
Develop and deliver security awareness training programs
Stay up-to-date with emerging cloud security threats and technologies

We're looking for someone who has
4+ years of experience in cloud security engineering
Strong knowledge of cloud security architectures, controls, and compliance requirements
Expertise in the security of public cloud platforms (e.g. AWS, Microsoft Azure), especially securing multi-cloud networks and infrastructure, and designing cloud-agnostic systems
An understanding of container security, network security, and cloud security services
Experience building cloud security infrastructure (e.g. logging, monitoring, vuln management, DLP)
Strong understanding of security frameworks, such as NIST and ISO 27001
Excellent problem-solving and analytical skills
Strong communication and collaboration skills
Bachelor's degree in Computer Science, Cybersecurity, or related field

About Otter.ai
We are in the business of shaping the future of work. Our mission is to make conversations more valuable. With over 1B meetings transcribed, Otter.ai is the world’s leading tool for meeting transcription, summarization, and collaboration. Using artificial intelligence, Otter generates real-time automated meeting notes, summaries, and other insights from in-person and virtual meetings, turning meetings into accessible, collaborative, and actionable data that can be shared across teams and organizations. The company is backed by early investors in Google, DeepMind, Zoom, and Tesla.

Otter.ai is an equal opportunity employer. We proudly celebrate diversity and are dedicated to inclusivity.

*Otter.ai does not accept unsolicited resumes from 3rd party recruitment agencies without a written agreement in place for permanent placements. Any resume or other candidate information submitted outside of established candidate submission guidelines (including through our website or via email to any Otter.ai employee) and without a written agreement otherwise will be deemed to be our sole property, and no fee will be paid should we hire the candidate.

Salary range: $185,000 to $210,000 USD per year
This salary range represents the low and high end of the estimated salary range for this position. The actual base salary offered for the role depends on several factors. Our base salary is just one component of our comprehensive total rewards package.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering
Apply
horizon3ai_logo

Senior Cloud Security Engineer

Horizon3ai
USD
0
185000
-
215000
US.svg
United States
Full-time
Remote
true
Get to Know Us
Horizon3.ai is a fast-growing, remote cybersecurity company dedicated to the mission of enabling organizations to proactively find, fix and verify exploitable attack vectors before criminals exploit them. Our flagship product, the NodeZero™ platform, delivers production-safe autonomous pentests and other key assessment operations that scale across the largest internal, external, cloud, and hybrid cloud environments. NodeZero has been adopted by organizations of all sizes, from small educational institutions to government agencies and Global 100 enterprises. It is used by IT Ops/SecOps teams, consulting pentesters, and MSSPs and MSPs.

We are a fusion of former U.S. Special Operations cyber operators, startup engineers & operators, and formerly frustrated cybersecurity practitioners. We're committed to helping solve our common security problems: ineffective security tools and false positives, resulting in alert fatigue, blind spots, "checkbox" security culture, cybersecurity skills shortage, and the long lead time and expense of hiring outside consultants. Collectively, we are a team of learn-it-alls, committed to a culture of respect, collaboration, ownership, and results.

As a remote-first company, we require a minimum 25 Mbps consumer-grade broadband connection.

What You’ll Do
We are seeking a skilled Sr. Cloud Security Engineer with a strong focus on AWS to join our growing team. The ideal candidate will be a self-starter with a "learn it all" attitude and a strong desire to stay current with the latest trends and technologies in the field. In this role, you will be responsible for designing, implementing, maintaining, and validating security solutions for our AWS cloud infrastructure. Your role will involve working closely with development and engineering teams to ensure secure cloud architecture and implementation.

This role will be responsible for:
Strong experience with modern SDLC tools and branching strategies
Design and implement security controls across our AWS environment (e.g., IAM, SCPs, VPC security, S3 bucket policies, security groups, key management, logging).
Continuously monitor and improve cloud posture by managing and tuning services like GuardDuty, Security Hub, AWS WAF, CloudTrail, and Inspector.
Develop and maintain security policies, standards, and procedures to ensure compliance with industry standards such as SOC2, GDPR, ISO27001, FedRAMP, etc.
Evaluate and recommend new security technologies, tools, and techniques to improve the security posture of our AWS cloud infrastructure.
Implement and maintain GitLab CI/CD pipelines and tools for automated security testing and scanning of AWS resources.
Conduct threat modeling, architecture reviews, and risk assessments for cloud deployments and new features.
Implement security features and monitoring tools, and perform periodic security assessments to verify best practice configuration and secure systems hardening in the cloud.
Respond swiftly to new and emerging security threats and vulnerabilities in the cloud.
Where required, investigate suspected attacks and help manage security incidents, including providing post-mortem analysis, identifying causes, and developing solutions and preventive measures.
Implement processes and technologies that reduce cloud security deficiencies, and help develop creative reporting mechanisms, including metrics/key themes, that communicate risk to business owners and leadership.
Participate in the development and implementation of cloud security standards and cloud service certification.
Provide subject matter expertise to assist with building detective controls for malicious activity within the AWS environment.
Define and enforce identity and access management (IAM) best practices, including least privilege policies, federated identity, role-based access control (RBAC), and automated remediation.
Demonstrate a commitment to integrity, process improvement, and customer satisfaction.

What You’ll Bring
In-depth knowledge of Terraform and GitLab
Deep knowledge of AWS services and security architecture
Strong understanding of AWS security and data security principles
Experience with threat modeling and risk assessments
Excellent communication skills and ability to explain technical concepts to non-technical stakeholders
Ability to work independently and as part of a team, and a strong sense of ownership and accountability
Knowledge of compliance standards such as SOC2, GDPR, ISO27001, FedRAMP, etc.
Familiarity with cybersecurity frameworks such as NIST, CIS, and MITRE ATT&CK
Knowledge of Data Loss Prevention (DLP) including data classification, identification, and protection
Broad knowledge across the Security domain, as well as deep focus in one (or more) areas such as logs and events processing, incident management, detection, response tool development, etc.

What Sets You Apart?
5+ years of general cybersecurity field experience
5+ years of experience in securing cloud environments
AWS Certified Security - Specialty
CISSP or relevant security certifications preferred
5+ years of experience securing an Amazon Web Services (AWS) environment

Compensation and Values
At Horizon3, we believe that our people are our greatest asset, and our compensation philosophy reflects this core value. We are committed to fostering an environment where all employees feel valued, respected, and rewarded for their contributions. Our compensation structure is designed to be fair, competitive, and transparent, ensuring that every team member is recognized and compensated equitably across roles, levels, and locations.

In accordance with various states' transparency regulations, we provide the following salary range information for this position:
Base salary range: $185,000 - $215,000 annually. The exact salary will be determined based on the selected candidate’s location, qualifications, experience, and relevant skills.
Additional compensation: This role may also be eligible for an equity package (in the form of stock options). If any other compensation benefits apply, they will be discussed during the interview process.

Perks of Horizon3.ai
Inclusive Team: We value diversity and promote an inclusive culture where everyone can thrive.
Growth Opportunities: Be part of a dynamic and growing team with numerous career development opportunities.
Innovative Culture: Work in a collaborative environment that encourages creativity and out-of-the-box thinking.
Remote Work: We are a 100% remote company. Enjoy the flexibility to work in the way that supports you and brings out your best.
Competitive Compensation: We offer competitive salary and benefits, which include health, vision & dental care for you and your family, a flexible vacation policy, and generous parental leave.

You Belong Here
Horizon3 is not just an equal opportunity employer - we are a community that values diversity, equity, and inclusion as fundamental principles of our culture and success. We are dedicated to fostering a workplace where everyone feels welcome and respected, regardless of race, color, religion, sex, national origin, age, disability, veteran status, sexual orientation, gender identity or expression, genetic information, marital status, hair length or any other status protected by law.

Our commitment to diversity and inclusion means we strive to attract, develop, and retain a workforce that reflects the varied communities we serve. We believe that diverse perspectives drive innovation and strengthen our ability to create cutting-edge cybersecurity solutions. At Horizon3, every team member is valued and supported in an environment that encourages personal and professional growth.

We welcome candidates from all backgrounds and experiences, and we encourage all qualified individuals to apply. Come be a part of Horizon3, where your unique contributions are recognized, and your potential is limitless.

Other Duties
Please note this job description is not designed to cover or contain a comprehensive listing of activities, duties or responsibilities that are required of the employee. Duties, responsibilities, and activities may change at any time with or without notice.

Application Note
In any materials you submit, you may redact or remove age-identifying information such as age, date of birth, or dates of school attendance or graduation. You will not be penalized for redacting or removing this information.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering

Senior Site Reliability Engineer (SRE) - (Brazil)

Articul8
Brazil
Full-time
Remote
true
About Us
Articul8 AI is at the forefront of Generative AI innovation, delivering cutting-edge SaaS products that transform how businesses operate. Our platform empowers organizations to leverage the power of artificial intelligence in a reliable, scalable, and secure environment.

Position Overview
We are seeking an experienced Site Reliability Engineer (SRE) to join our team and help ensure the reliability, performance, and scalability of our GenAI SaaS platform. As an SRE, you will bridge the gap between development and operations, implementing automation and best practices to maintain our service reliability objectives while supporting rapid innovation.

Key Responsibilities
Architect and maintain scalable, highly available infrastructure for our GenAI platform.
Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.
Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.
Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.
Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.
Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.
Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.
Implement and enforce security best practices across all systems and environments.
Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.

Qualifications

Required
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
5+ years of experience in DevOps, SRE, or similar roles
Strong experience with cloud platforms (AWS, GCP, or Azure)
Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)
Solid background in containerization technologies (Docker, Kubernetes)
Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)
Strong understanding of CI/CD pipelines and automation
Exceptional problem-solving skills and the ability to troubleshoot complex systems

Preferred
Experience supporting AI/ML systems in production
Knowledge of GPU infrastructure management and optimization
Familiarity with distributed systems and high-performance computing
Experience with database systems (SQL and NoSQL)
Certifications in cloud platforms (AWS, GCP, Azure)
Experience with chaos engineering and resilience testing
Knowledge of security best practices and compliance requirements

Ready to shape the future of resilient software systems? Apply now and help drive the reliability of tomorrow’s AI at Articul8 AI!
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering

Senior Site Reliability Engineer (SRE)

Articul8
United States
Full-time
Remote
false
About Us
Articul8 AI is at the forefront of Generative AI innovation, delivering cutting-edge SaaS products that transform how businesses operate. Our platform empowers organizations to leverage the power of artificial intelligence in a reliable, scalable, and secure environment.

Position Overview
We are seeking an experienced Site Reliability Engineer (SRE) to join our team and help ensure the reliability, performance, and scalability of our GenAI SaaS platform. As an SRE, you will bridge the gap between development and operations, implementing automation and best practices to maintain our service reliability objectives while supporting rapid innovation.

Key Responsibilities
Architect and maintain scalable, highly available infrastructure for our GenAI platform.
Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.
Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.
Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.
Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.
Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.
Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.
Implement and enforce security best practices across all systems and environments.
Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.

Qualifications

Required
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
5+ years of experience in DevOps, SRE, or similar roles
Strong experience with cloud platforms (AWS, GCP, or Azure)
Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)
Solid background in containerization technologies (Docker, Kubernetes)
Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)
Strong understanding of CI/CD pipelines and automation
Exceptional problem-solving skills and the ability to troubleshoot complex systems

Preferred
Experience supporting AI/ML systems in production
Knowledge of GPU infrastructure management and optimization
Familiarity with distributed systems and high-performance computing
Experience with database systems (SQL and NoSQL)
Certifications in cloud platforms (AWS, GCP, Azure)
Experience with chaos engineering and resilience testing
Knowledge of security best practices and compliance requirements

Ready to shape the future of resilient software systems? Apply now and help drive the reliability of tomorrow’s AI at Articul8 AI!
MLOps / DevOps Engineer
Data Science & Analytics

AI Infrastructure Engineer, Agents

Scale AI
USD
156000
-
225600
United States
Full-time
Remote
false
As a Software Engineer on the ML Infrastructure team, you will design and build our agent sandboxing platform: the secure, high-performance code execution layer powering our agentic workflows. This system underpins critical applications and research initiatives, and is deployed across both internal and customer-managed environments. This position requires deep expertise in systems engineering: operating systems, virtualization, networking, containers, and performance optimization. Your work will directly enable agents to execute untrusted or user-submitted code safely, efficiently, and repeatedly, with fast startup times, strong isolation guarantees, and support for snapshotting and inspection.

You will:
Design and build the sandboxing platform for code execution across containerized and virtualized environments.
Ensure strong isolation, security, and reproducibility of execution across user sessions and workloads.
Optimize for cold-start latency, memory footprint, and resource utilization at scale.
Collaborate across security, infra, and product teams to support both internal research use cases and enterprise customer deployments.
Lead architecture reviews and own projects from design through deployment in fast-paced, cross-functional settings.

Ideally you'd have:
3+ years of experience building high-performance systems software (e.g. OS, container runtime, VMM, networking stack).
Deep understanding of Linux internals, process isolation, memory management, cgroups, namespaces, etc.
Experience with containerization and virtualization technologies (e.g., Docker, Firecracker, gVisor, QEMU, Kata Containers).
Proficiency in a systems programming language such as Go, Rust, or C/C++.
Familiarity with networking, security hardening, sandboxing techniques, and kernel-level performance tuning.
Comfort working across infrastructure layers, from kernel modules to orchestration frameworks (e.g., Kubernetes).
Strong debugging skills and the ability to make performance/security tradeoffs in production systems.

Nice to haves:
Familiarity with LLM agents and agent frameworks (e.g., OpenHands, Agent2Agent, MCP).
Experience running secure workloads in multi-tenant or untrusted environments (e.g., FaaS, CI sandboxes, remote notebooks).
Exposure to snapshotting and restore techniques (e.g., CRIU, VM snapshots, overlayfs).

Compensation packages at Scale for eligible roles include base salary, equity, and benefits. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position, determined by work location and additional factors, including job-related skills, experience, interview performance, and relevant education or training. Scale employees in eligible roles are also granted equity-based compensation, subject to Board of Directors approval. Your recruiter can share more about the specific salary range for your preferred location during the hiring process, and confirm whether the hired role will be eligible for an equity grant. You’ll also receive benefits including, but not limited to: comprehensive health, dental and vision coverage, retirement benefits, a learning and development stipend, and generous PTO. Additionally, this role may be eligible for additional benefits such as a commuter stipend.

Please reference the job posting's subtitle for where this position will be located. For pay transparency purposes, the base salary range for this full-time position in the locations of San Francisco, New York, Seattle is: $156,000 to $225,600 USD

PLEASE NOTE: Our policy requires a 90-day waiting period before reconsidering candidates for the same role. This allows us to ensure a fair and thorough evaluation of all applicants.

About Us: At Scale, we believe that the transition from traditional software to AI is one of the most important shifts of our time. Our mission is to make that happen faster across every industry, and our team is transforming how organizations build and deploy AI. Our products power the world's most advanced LLMs, generative models, and computer vision models. We are trusted by generative AI companies such as OpenAI, Meta, and Microsoft, government agencies like the U.S. Army and U.S. Air Force, and enterprises including GM and Accenture. We are expanding our team to accelerate the development of AI applications.

We believe that everyone should be able to bring their whole selves to work, which is why we are proud to be an inclusive and equal opportunity workplace. We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability status, gender identity or Veteran status.

We are committed to working with and providing reasonable accommodations to applicants with physical and mental disabilities. If you need assistance and/or a reasonable accommodation in the application or recruiting process due to a disability, please contact us at accommodations@scale.com. Please see the United States Department of Labor's Know Your Rights poster for additional information. We comply with the United States Department of Labor's Pay Transparency provision.

PLEASE NOTE: We collect, retain and use personal data for our professional business purposes, including notifying you of job opportunities that may be of interest and sharing with our affiliates. We limit the personal data we collect to that which we believe is appropriate and necessary to manage applicants’ needs, provide our services, and comply with applicable laws. Any information we collect in connection with your application will be treated in accordance with our internal policies and programs designed to protect personal data. Please see our privacy policy for additional information.
MLOps / DevOps Engineer
Data Science & Analytics
Software Engineer
Software Engineering

Site Reliability Engineer Team Lead

Robin
South Africa
Full-time
Remote
false
About Robin
Robin is on a mission to rebuild the legal industry, starting with making contracts simple for everyone. We are a pioneer in Legal AI, built on proprietary models, licensed data, and deep partnerships with Anthropic and AWS. Since 2019, we’ve expanded our footprint to 4 continents and have been supporting many of the world’s most successful businesses, including GE, Pfizer, KPMG, and UBS.

What will you do as an SRE Team Lead?
As an SRE Team Lead at Robin AI, you'll lead a team of two SRE Engineers while reporting directly to the CTO. You'll help build and maintain the cloud infrastructure and applications that power our cutting-edge Legal AI platform. You'll provide change leadership to your team and collaborate with engineering leaders to establish robust monitoring, incident response, and deployment strategies that ensure high availability and reliability of our proprietary models and services, maintaining optimal SLOs for our global customer base.

Your day-to-day responsibilities:
Lead and mentor a team of two SRE Engineers, providing technical guidance and career development
Work closely with the CTO to define and implement the technical infrastructure roadmap
Establish monitoring strategies and implement solutions to enhance reliability, scalability, and cost-efficiency
Collaborate with development team leads to optimise build, test, and deployment processes
Lead incident response and establish processes for troubleshooting production issues
Organise and oversee on-call rotations to ensure 24/7 system reliability
Drive documentation standards and knowledge sharing within the engineering organisation

Ideally, you should have the following qualifications:
5+ years of experience in DevOps or Site Reliability Engineering roles, with 2+ years in a managerial position
Proven experience managing and mentoring technical team members
Proficiency in at least one backend programming language (We use Python)
Strong knowledge of AWS services (ECS, S3, RDS, Lambda, etc.), managed by Terraform
Knowledge of observability frameworks and tools (We use OpenTelemetry, CloudWatch & Datadog)
Excellent leadership, communication, and problem-solving skills
Experience with AI/ML infrastructure deployment and scaling

What’s in it for you
Salary: Competitive
Hybrid schedule: We offer a flexible working schedule.
Equity package: Generous equity scheme - everyone gets to be an owner of Robin AI!
Annual leave: 20 days PTO, in addition to the public holidays observed in South Africa.
Growth opportunities: We prioritise promotions for high performers and help you to progress your career.

What’s it like working at Robin?
Our culture and values attract people who are creative, resourceful, and share our passion for excellence. At Robin, you're encouraged to push yourself and empowered to take risks. We support each other to think big, try new ideas, and navigate uncertainty. Whether you're at our headquarters or one of our worldwide offices, you'll find a world of opportunities to grow, thrive, and make a meaningful impact. See what life is like at Robin.

Diversity, Equity and Inclusion at Robin
We are committed to building one of the most diverse technology companies in the world. As of 2024, more than 30% of our employees come from ethnic minority backgrounds, and 51% of roles are held by women. We know that transforming the legal industry requires diverse perspectives, so we're creating an environment where innovation thrives through inclusion.

Robin operates a direct hiring model and any speculative CVs shared via agencies will be treated as a gift.
MLOps / DevOps Engineer
Data Science & Analytics

Site Reliability Engineer

Robin
South Africa
Full-time
Remote
false
About Robin
Robin is on a mission to rebuild the legal industry, starting with making contracts simple for everyone. We are a pioneer in Legal AI, built on proprietary models, licensed data, and deep partnerships with Anthropic and AWS. Since 2019, we’ve expanded our footprint to 4 continents and have been supporting many of the world’s most successful businesses, including GE, Pfizer, KPMG, and UBS.

What will you do as an SRE?
As an SRE at Robin AI, you'll help build and maintain the cloud infrastructure and applications that power our cutting-edge Legal AI platform. You'll collaborate with engineering teams to establish robust monitoring, incident response, and deployment strategies that ensure high availability and reliability of our proprietary models and services, maintaining optimal SLOs for our global customer base.

Your day-to-day responsibilities:
Ensure the Robin systems are highly available and scalable
Standardise and implement observability practices in our service-based architecture through logging, traces, metrics and monitors
Design, deploy, and operate infrastructure to support Robin's product teams as we expand into new regions
Add automation around manual operational tasks
Collaborate with development team leads to optimise build, test, and deployment processes
Participate in and improve our on-call and incident handling processes to ensure 24/7 system reliability

Ideally, you should have the following qualifications:
3+ years of experience in DevOps or Site Reliability Engineering roles
Proficiency in at least one backend programming language (We use Python)
Strong knowledge of AWS services (ECS, S3, RDS, Lambda, etc.), managed by Terraform
Comfortable troubleshooting across the full stack, starting from the browser, through the networking components, into the containerised applications and then onto data stores
Knowledge of observability frameworks and tools (We use OpenTelemetry, CloudWatch & Datadog)
Excellent problem-solving and communication skills
Experience with AI/ML infrastructure deployments is a plus

What’s in it for you
Salary: Competitive
Hybrid schedule: We offer a flexible working schedule.
Equity package: Generous equity scheme - everyone gets to be an owner of Robin AI!
Annual leave: 20 days PTO, in addition to the public holidays observed in South Africa.
Growth opportunities: We prioritise promotions for high performers and help you to progress your career.

What’s it like working at Robin?
Our culture and values attract people who are creative, resourceful, and share our passion for excellence. At Robin, you're encouraged to push yourself and empowered to take risks. We support each other to think big, try new ideas, and navigate uncertainty. Whether you're at our headquarters or one of our worldwide offices, you'll find a world of opportunities to grow, thrive, and make a meaningful impact. See what life is like at Robin.

Diversity, Equity and Inclusion at Robin
We are committed to building one of the most diverse technology companies in the world. As of 2024, more than 30% of our employees come from ethnic minority backgrounds, and 51% of roles are held by women. We know that transforming the legal industry requires diverse perspectives, so we're creating an environment where innovation thrives through inclusion.

Robin operates a direct hiring model and any speculative CVs shared via agencies will be treated as a gift.
MLOps / DevOps Engineer
Data Science & Analytics