Staff Product Designer, Go Enterprise
Own the observability and lifecycle management of AI features across the organization. Build tools and infrastructure to enable teams to develop, monitor, and optimize LLM-powered features. Design and implement closed-loop evaluation pipelines that automatically validate prompt changes. Develop comprehensive metrics and dashboards to track LLM usage, including cost per feature, token usage patterns, and latency. Create systems that tie user feedback to specific prompts and LLM calls. Establish best practices and processes for the full lifecycle of prompts, including development, testing, deployment, and monitoring. Collaborate with engineering teams across the organization to ensure they have the tools and visibility needed to build high-quality AI features.
Senior Software Engineer, Managed AI - AI Platform
Lead the design and implementation of core AI services, including resilient fault-tolerant queues for task distribution, model catalogs for managing and versioning AI models, and scheduling mechanisms optimized for cost and performance. Architect and scale infrastructure to handle millions of API requests per second, and implement robust monitoring and alerting for system health and 24/7 availability. Collaborate with product management, business strategy, and other engineering teams to define the AI platform roadmap, influence long-term vision and architecture decisions, contribute to open-source AI frameworks, actively participate in the AI community, and prototype and iterate on emerging technologies and new features.
Engineering Manager, Managed AI
As an Engineering Manager on the Managed AI team, you will lead and scale a team of engineers building a next-generation platform for Large Language Models (LLMs). You will be responsible for guiding the team through designing and implementing highly scalable, fault-tolerant infrastructure. Your role includes leading, mentoring, and growing a team of software engineers; partnering with leadership to define and execute the AI roadmap; cultivating a high-performance, collaborative engineering culture; overseeing the architecture and development of core AI services such as fault-tolerant task queues, model management systems, and cost-aware scheduling; ensuring delivery of scalable systems capable of handling millions of API requests per second; delivering an AI platform that handles a wide variety of workloads, from training to agentic execution; working cross-functionally with Product, Infrastructure, and GTM stakeholders; representing Engineering in strategic discussions to influence AI platform growth and customer adoption; and promoting knowledge sharing, technical mentorship, and the evolution of engineering processes.
Senior Staff Software Engineer, Model LifeCycle
The Senior Staff Engineer for the Model LifeCycle team is responsible for building a comprehensive managed platform for the entire application development lifecycle with a focus on leveraging Machine Learning models including Large Language Models (LLMs). Responsibilities include managing fine-tuning systems for large foundation models with multi-node orchestration, checkpointing, failure recovery, and cost-efficient scaling; implementing and maintaining end-to-end training pipelines for LLMs; developing distillation and reinforcement learning pipelines; managing agent execution infrastructure; and handling dataset, model, and experiment management such as versioning, lineage, evaluation, and reproducible fine-tuning at scale. The role also involves close collaboration with product, business, and platform teams to shape core abstractions and APIs, influencing long-term architectural decisions around training runtimes, scheduling, storage, and model lifecycle management. Additionally, the engineer will contribute to and engage with the open-source LLM ecosystem and take ownership of designing and building core systems from first principles.
Staff Software Engineer, Model LifeCycle
The Staff Software Engineer for the Model LifeCycle team is responsible for building a comprehensive managed platform for the entire application development lifecycle focused on Machine Learning models, including Large Language Models (LLMs). Responsibilities include contributing to fine-tuning systems for large foundation models, including multi-node orchestration, checkpointing, failure recovery, and cost-efficient scaling; implementing and maintaining end-to-end training pipelines for LLMs; contributing to distillation and reinforcement learning pipelines; developing and maintaining agent execution infrastructure; and implementing features for dataset, model, and experiment management such as versioning, lineage, evaluation, and reproducible fine-tuning at scale. The role involves close collaboration with Principal Engineers, product, business, and platform teams to implement core abstractions and APIs, contributing to architectural decisions around training runtimes, scheduling, storage, and model lifecycle management, and engaging with the open-source LLM ecosystem. The role offers significant scope for ownership in implementing and contributing to the design of core systems.
Staff Software Engineer, Managed AI - AI Platform
Lead the design and implementation of core AI services including resilient fault-tolerant queues for efficient task distribution, model catalogs for managing and versioning AI models, and scheduling mechanisms optimized for cost and performance. Architect and scale infrastructure to handle millions of API requests per second while ensuring robust monitoring and alerting for system health and 24/7 availability. Collaborate closely with product management, business strategy, and other engineering teams to define the AI platform roadmap, influence long-term vision and architectural decisions, contribute to open-source AI frameworks, participate in the AI community, and prototype and rapidly iterate on emerging technologies and new features.
Principal Engineer, C++/Integration (R4539)
The role involves creating reference implementations for potential future products or product components by integrating new hardware platforms, sensor suites, simulators, and concepts of operation with the Hivemind SDK (C++) for commercial applications, focusing on autonomy and simulation. The role requires demonstrating developed architectures as solutions to customers, gathering feedback, and iterating accordingly. It also includes exploring and evaluating future hardware and software technologies relevant to Shield AI's product roadmap beyond current projects, identifying areas of technical debt across the software stack, and analyzing and synthesizing solutions to address them. The position requires working closely with product teams and forward-sprinting within the Special Projects team to contribute strategically and tactically to the development of foundational Hivemind products, especially Hivemind Enterprise commercial applications.
Senior Backend Engineer, LangSmith Deployments
Design distributed queue and worker systems that handle concurrent agent execution, background tasks, and multi-agent coordination across horizontally scalable infrastructure; own core data infrastructure including state persistence, atomic job claiming, connection management, and schema evolution; collaborate on architectural decisions to ensure solutions are scalable and robust; ship resumable streaming infrastructure enabling clients to disconnect and reconnect mid-execution without losing state; instrument and monitor production systems with tracing, metrics, and alerting; participate in on-call rotations and own incident response for the runtime; create and maintain technical documentation including system design and operational runbooks; contribute to and extend the open-source LangGraph platform used for building agent applications.
Software Engineer - AI Trainer
Use software engineering experience to design job-related coding questions and review AI-generated responses for correctness, efficiency, clarity, and alignment with real-world engineering practices. Evaluate AI-generated code and technical content, provide structured feedback, and help improve AI's understanding of programming tasks, system design, and engineering best practices.
Software Engineer, Infrastructure
The Infrastructure Team builds the underlying tooling and infrastructure that powers all of Exa's systems, including GPU cluster orchestration in Kubernetes, map-reduce batch jobs on Ray, and the best observability tooling in the world, enabling the engineering organization to move fast.
