We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be.
If you'd like to build the world's best deep learning cloud, join us.
*Note: This position requires presence in our San Francisco or Seattle office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.
The Lambda Observability team builds and operates large-scale monitoring systems for our AI cloud product suite. We deploy observability solutions across the stack, from datacenter infrastructure to our in-house software. Keeping those offerings reliable and detecting issues quickly in the latest high-performance AI clusters is what makes us tick.
Along with the Platform Engineering organization, we help to build the foundations that unlock product excellence and a highly reliable experience for our customers.
Our expertise lies at the intersection of:
Scalable Observability Platforms: We build and operate mission-critical platforms for metrics, logs, and traces based on both open-source software and systems developed in-house.
AI Infrastructure Observability: We design observability solutions for large-scale AI clusters running the latest GPU, Networking, and Storage technologies.
Observability Practices: We engage across the company to promote best practices, help teams adopt our platforms, and enable applications that require observability data.
About the Role:
We are seeking a seasoned Observability Engineering Manager with deep experience in development and operation of modern observability platforms. You will hire and guide a team of observability engineers in building out critical pillars of our internal observability stack. You will lead the team in building monitoring solutions for new products, and in measuring and reporting the availability of our products.
Your role is not just to manage people, but to coordinate the delivery of observability solutions to customers inside and outside Lambda. Your leadership will be pivotal in ensuring our ability to deliver a high-quality, reliable product experience.
This is a unique opportunity to work at the intersection of large-scale observability systems and the rapidly evolving field of artificial intelligence infrastructure. You will be building the systems that monitor some of the world’s most advanced AI solutions.
What You’ll Do
Team Leadership & Management:
Hire, grow, lead, and mentor a team of high-performing observability engineers and SREs.
Foster a culture of technical excellence, collaboration, and customer service.
Conduct regular one-on-one meetings, provide constructive feedback, and support career development for team members.
Drive outcomes by managing project priorities, deadlines, and deliverables.
Technical Strategy & Execution:
Work with the engineering team to drive strategy for Lambda's internal and customer-facing observability solutions.
Improve observability of AI infrastructure and develop new monitoring solutions as new products are introduced.
Lead the broader engineering organization in adoption of Observability and SRE practices.
Manage costs of both vendors and internally developed platforms.
Lead the team in the continued development of our existing metrics solutions based on the Prometheus and OpenTelemetry ecosystems.
Lead the team in delivering new logging and tracing solutions based on ClickHouse.
Guide the team in problem identification, requirements gathering, solution ideation, and stakeholder alignment on engineering RFCs.
Participate in design of solutions for bringing observability data to our customers.
Identify gaps in our observability posture and drive resolution.
Lead the team in supporting internal customers from across Lambda engineering.
Cross-Functional Collaboration:
Collaborate with the infrastructure and HPC teams on infrastructure monitoring and alerting.
Work closely with Lambda product engineering teams on instrumentation and best-practice use of our platforms.
Work to understand the needs of engineering teams and drive our Observability solutions towards self-service.
Manage a short list of vendors that provide SaaS solutions in the monitoring space.
You
Experience:
10+ years of experience in observability systems or platform engineering, with at least 3 years in a management or lead role.
Demonstrated experience leading a team of engineers and SREs on complex, cross-functional projects in a fast-paced startup environment.
Significant experience in environments that require the monitoring of bare-metal infrastructure is preferred.
Experience with a wide variety of modern open-source observability software.
Strong background in software engineering and the SDLC.
Strong project management skills, including leading planning, project execution, and on-schedule delivery of team outcomes.
Extensive experience with site reliability engineering and ability to champion improved SRE practices.
Experience building a high-performance team through deliberate hiring, upskilling, performance management, and expectation setting.
Nice to Have
Experience:
Experience driving cross-functional engineering management initiatives (coordinating events, strategic planning, coordinating large projects).
Experience driving organizational improvements (processes, systems, etc.).
Experience with Kubernetes and with designing scalable distributed systems.
Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter-over-quarter and year-over-year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use
A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.
Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.