Evaluation Scenario Writer - AI Agent Testing Specialist
Design realistic, structured evaluation scenarios for LLM-based agents: create test cases that simulate complex human workflows, define gold-standard behavior and scoring logic to compare agent actions against, and analyze agent logs, failure modes, and decision paths. Work with code repositories and test frameworks to validate scenarios, and iterate on prompts, instructions, and test cases to improve clarity and difficulty. Ensure scenarios are production-ready, easy to run, and reusable.
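For illustration, a minimal sketch of what one such scenario definition and its scoring logic might look like in Python. The Scenario class, its field names, and the refund workflow are hypothetical examples, not any specific framework's API:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """A single agent evaluation scenario with gold-standard behavior."""
    task: str                # instruction given to the agent
    gold_actions: list[str]  # ordered actions an ideal agent would take
    required: set[str] = field(default_factory=set)  # actions that must appear

def score(agent_actions: list[str], scenario: Scenario) -> float:
    """Fraction of the required gold actions the agent actually performed."""
    if not scenario.required:
        return 1.0
    hits = sum(1 for a in scenario.required if a in agent_actions)
    return hits / len(scenario.required)

# Hypothetical example: a refund-processing workflow the agent should complete.
refund = Scenario(
    task="Refund order #1042 and notify the customer.",
    gold_actions=["lookup_order", "issue_refund", "send_email"],
    required={"issue_refund", "send_email"},
)
print(score(["lookup_order", "issue_refund", "send_email"], refund))  # 1.0
print(score(["lookup_order", "send_email"], refund))                  # 0.5
```

Real scenario suites would add richer matching (argument checks, ordering constraints, partial credit), but the shape is the same: a declarative definition plus a scoring function over the agent's action trace.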
MCP & Tools Python Developer - Agent Evaluation Infrastructure
Developing and maintaining MCP-compatible evaluation servers, implementing logic to check agent actions against scenario definitions, creating or extending the tools that writers and QAs use to test agents, working closely with infrastructure engineers to ensure compatibility, and occasionally helping with test writing or debugging sessions.
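A sketch of the checking logic such a server might expose, assuming the FastMCP helper from the official MCP Python SDK (pip install "mcp"); the tool name, scenario format, and pass criterion are illustrative assumptions:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("scenario-checker")

# Hypothetical scenario definitions: scenario id -> actions the agent must take.
SCENARIOS = {
    "refund-flow": {"issue_refund", "send_email"},
}

@mcp.tool()
def check_actions(scenario_id: str, actions: list[str]) -> dict:
    """Compare a trace of agent actions against a scenario definition."""
    required = SCENARIOS.get(scenario_id)
    if required is None:
        return {"ok": False, "error": f"unknown scenario {scenario_id!r}"}
    missing = sorted(required - set(actions))
    return {"ok": not missing, "missing": missing}

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```

Exposing the checker as an MCP tool lets scenario writers and QA invoke the same validation logic from whatever client they already use, rather than re-implementing it per harness.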
Mathematician - Freelance AI Trainer
As an AI Tutor in Mathematics on the Mindrift platform, you will generate prompts that challenge the AI, define comprehensive scoring criteria to evaluate the accuracy of its answers, and correct the model's responses using your domain-specific knowledge, contributing to projects aligned with your skills on your own schedule.
First-Line Supervisors of Food Preparation and Serving Workers - AI Trainer (Contract)
The responsibilities include evaluating AI-generated content related to food preparation and serving work, delivering clear and structured feedback to improve the model's understanding of workplace tasks and language, and developing prompts that reflect the field. The work is performed remotely and asynchronously with flexible hours, and draws on professional experience in food preparation and serving supervision to train AI models.
Member of Technical Staff - ML Research Engineer; Multi-Modal - Audio
Invent and prototype new model architectures that optimize inference speed, including on edge devices; build and maintain evaluation suites for multimodal performance across a range of public and internal tasks; collaborate with the data and infrastructure teams to build scalable pipelines for ingesting and preprocessing large audio datasets; work with the infrastructure team to optimize model training across large-scale GPU clusters; contribute to publications, internal research documents, and thought leadership within the team and the broader ML community; collaborate with the applied research and business teams on client-specific use cases.
Software Engineer, Evaluation Frontend
As an Evaluation Frontend Software Engineer, you will design tools and visualizations that enable researchers and engineers to compare and analyze hundreds of model evaluations, including both data visualization tools and statistical tools to extract signal from noisy data. You will develop an understanding of the relative merits and limitations of each model evaluation and suggest new facets to evaluate. Your work will involve collaborating closely with cross-functional teams, including researchers and engineers, to surface the insights needed for model development.
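One example of the kind of statistical tool this might involve: a percentile-bootstrap confidence interval over per-example eval scores, so two model runs can be compared beyond their raw means. A minimal NumPy sketch, with the function name and defaults as assumptions:

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of noisy per-example eval scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample example indices with replacement and take the mean of each draw.
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

mean, (lo, hi) = bootstrap_ci([0.71, 0.64, 0.80, 0.55, 0.77, 0.69])
print(f"mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Overlapping intervals are a quick visual cue that an apparent gap between two evaluations may be noise rather than signal.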
Freelance Cybersecurity Analyst - AI Trainer
Analyze and investigate simulated security alerts and incidents across endpoints, identities, and cloud environments. Conduct proactive threat hunting using KQL or similar query languages to identify hidden vulnerabilities and emerging threats that automated systems may miss. Assess the accuracy and depth of AI-generated security incident reports and threat analyses. Review, validate, and improve the model’s understanding of Microsoft Defender products and SOC workflows. Provide expert feedback on AI performance in identifying and classifying cybersecurity threats.
Finance Platform Engineer
Use proprietary software applications to provide input and labels on defined projects. Support and ensure the delivery of high-quality curated data. Contribute to the training of new tasks by working closely with the technical staff to develop and implement cutting-edge initiatives and technologies. Interact with technical staff to improve the design of efficient annotation tools. Choose problems from economics fields that align with your expertise, focusing on macroeconomics, microeconomics, and behavioral economics. Regularly interpret, analyze, and execute tasks based on given instructions. Provide services including labeling and annotating data in text, voice, and video formats to support AI model training, which may include recording audio or video sessions.