Site Reliability Engineer, Managed AI
The Site Reliability Engineer designs and operates reliable managed AI services for serving and scaling large language model workloads. They build automation and reliability tooling for distributed AI pipelines and inference services; define, measure, and improve SLIs/SLOs across AI workloads; and collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters. They also automate observability by building telemetry and performance-tuning strategies for latency-sensitive AI services, investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling, and contribute to the architecture of next-generation distributed systems designed for AI-first environments.
Site Reliability Engineer, Inference Infrastructure
As a Site Reliability Engineer on the Model Serving team, you will build self-service systems that automate managing, deploying, and operating services, including custom Kubernetes operators supporting language model deployments. You will automate environment observability and resilience, enabling all developers to troubleshoot and resolve problems, and take steps to ensure defined SLOs are met, including participating in an on-call rotation. Additionally, you will build strong relationships with internal developers and influence the Infrastructure team’s roadmap based on their feedback, as well as develop the team through knowledge sharing and an active review process.
Senior Network Engineer (f/m/d)
The Senior Network Engineer will design, implement, and maintain enterprise network infrastructure, ensuring its stability and scalability across multiple datacenters. Responsibilities include managing firewall and security standards, supporting GPU clusters used for AI/ML workloads, and collaborating with cross-functional teams to document and optimize systems.
DevOps Engineer
The DevSecOps / Platform Engineer will design, implement, and operate secure, cloud-native infrastructure powering core data and application platforms for a defense-focused company. They will develop CI/CD pipelines, automate deployments, uphold security practices, and collaborate across teams to ensure reliability, scalability, and compliance for government users.
Staff Software Engineer, Infrastructure
You will design, build, and operate production infrastructure for high-scale, low-latency systems, owning critical services end-to-end to improve reliability and performance. The role also involves partnering with research and product teams, optimizing service latencies, evolving CI/CD and self-service tooling, and leading infrastructure-as-code and GitOps practices.
Staff Infrastructure Security Engineer
The engineer will architect, deploy, and operationalize foundational security services to support Crusoe's move toward Zero Trust, serving as a technical leader for secrets management and identity architecture. Responsibilities span from driving enterprise-wide platforms like HashiCorp Vault to defining trust patterns and secure onboarding in a hybrid, multi-cloud environment.
Forward Deployed Engineer, Infrastructure Specialist (Public Sector)
Lead end-to-end deployment of the North AI platform in private cloud and on-premises environments for enterprise clients. Collaborate with IT teams to ensure secure, compliant integration and troubleshoot deployment issues to deliver robust client solutions.
Enterprise Security Engineer
You will be responsible for building and operationalizing the company's compliance program, implementing controls, and supporting audits in a fast-paced SaaS environment. Key tasks include managing GRC tools, automating workflows for compliance standards such as SOC 2 and ISO 27001, and supporting responses to customer security assessments.
Freelance AI Red Team Engineer
As a Freelance AI Red Team Engineer, you will evaluate and red team AI models, agents, and machine learning systems for safety risks and vulnerabilities. You will also develop automation tools, create rigorous test scenarios, and contribute to security research initiatives in the AI domain.
Freelance AI Red Team Engineer
Evaluate and red team AI models and agents for vulnerabilities and safety risks, and develop automation tools and test harnesses for AI systems. Contribute to security research initiatives, including designing and implementing challenging attack scenarios for AI models.