AI Researcher & Engineer - Multimodal (Real-time Audio and Video)

xAI

5000+

United States


Location

Palo Alto, United States

Salary (Yearly)

$180,000 - $440,000 USD

About xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers and researchers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

The Reasoning team at xAI creates magical AI experiences beyond text, enabling the understanding and generation of content across various modalities, including image, video, and audio. Our team is pushing the frontier of multimodal intelligence through Grok Voice, our advanced multimodal AI assistant that can listen, see, and respond to you in real time. We actively work to develop novel audio and video understanding capabilities that solve user problems in both the physical and digital worlds.

As a Researcher & Engineer on the Reasoning team specializing in real-time audio and video, you'll lead the advancement of multimodal capabilities across data, modeling, serving infrastructure, and product integration. Collaborating closely with pre-training, post-training, and product teams, you'll drive innovations that expand the boundaries of model performance and elevate end-to-end user experiences. Ideal candidates thrive at the intersection of cutting-edge research and engineering.

What You'll Do

Research, design, and implement algorithms to enhance audio and video understanding and generation, whether through developing new models, systems, or tools.
Collaborate closely with product and engineering teams to carry multimodal capabilities from initial concept through production deployment, proactively monitoring and addressing issues along the way.
Improve data quality by curating robust datasets, developing data filtering and generation techniques, building scalable data pipelines, and analyzing user interactions to inform product improvements.
Create evaluation frameworks, internal benchmarks, and metrics to systematically measure and improve real-world model performance, proactively identifying and resolving user-facing challenges.
Manage the complete experimental lifecycle: from designing experiments and training models to deployment and iterative refinement based on feedback and data.

Ideal Experience

You're an exceptional candidate if you have some (or all) of the following:

A proven track record of leading research or engineering efforts that have significantly enhanced neural network capabilities and performance.
Hands-on experience building and deploying large-scale distributed machine learning systems and backend services.
Expertise in reinforcement learning, agentic models, or real-world multimodal AI applications.
Strong engineering skills combined with the ability and enthusiasm to rapidly navigate and master complex, unfamiliar codebases.
Demonstrated excellence in systematic experiment design, model debugging, performance analysis, and iterative improvements.
A pragmatic, execution-oriented approach: you proactively solve problems and prioritize getting things done efficiently.

Location

The role is based in Palo Alto. Our team usually works from the office 5 days a week but allows work-from-home days when required. Candidates are expected to be located near Palo Alto or open to relocation.

Interview Process

After submitting your application, the team reviews your CV and statement of exceptional work. If your application passes this stage, you will be invited to a 15-minute interview ("phone interview") during which a member of our team will ask some basic questions. If you clear the initial phone interview, you will enter the main process, which consists of four technical interviews:

One-on-one research discussion & coding interviews (three meetings total)
Project deep-dive: Present your past exceptional work and your vision with xAI to a small audience.

Every application is reviewed by a member of our technical team. All interviews will be conducted via Google Meet.

Annual Salary Range

$180,000 - $440,000 USD

xAI is an equal opportunity employer.
