Responsibilities
- Design and implement evaluation pipelines to measure the performance and reliability of AI models (a minimal example of such a pipeline is sketched after this list).
- Develop automated testing frameworks to assess model outputs at scale.
- Analyze model performance using both traditional statistical metrics and AI-specific evaluation methods.
- Evaluate AI systems built on modern architectures such as LLM-based applications and Retrieval-Augmented Generation (RAG) pipelines.
- Identify potential issues related to accuracy, hallucinations, bias, safety, and model drift.
- Conduct adversarial testing to uncover vulnerabilities and ensure safe model behavior.
- Collaborate with engineering and AI teams to improve prompt design, model outputs, and system performance.
- Monitor model performance in production and help define best practices for AI evaluation and observability.
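To give a concrete sense of the work, here is a minimal, illustrative sketch of the kind of automated evaluation loop this role involves. The `call_model` stub and the dataset format are hypothetical placeholders, not a prescribed implementation; a real pipeline would plug in an actual model client and richer scorers.

```python
# Minimal sketch of an automated output-evaluation loop.
# `call_model` and the EvalCase format are illustrative placeholders.
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str
    expected_keyword: str  # naive correctness signal, for illustration only


def call_model(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., an HTTP client to a model API)."""
    return "The capital of France is Paris."


def run_eval(cases: list[EvalCase]) -> float:
    """Run every case through the model and return the pass rate."""
    passed = 0
    for case in cases:
        output = call_model(case.prompt)
        # Simple keyword check; production pipelines layer on exact-match,
        # semantic, and safety scorers.
        if case.expected_keyword.lower() in output.lower():
            passed += 1
    return passed / len(cases)


if __name__ == "__main__":
    cases = [EvalCase("What is the capital of France?", "Paris")]
    print(f"pass rate: {run_eval(cases):.2%}")
```

A keyword check like this is only a baseline; the role involves extending such loops with the statistical and AI-specific evaluation methods described above.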
Requirements
- Proficiency in Python and experience building scripts or pipelines to evaluate model outputs.
- Experience working with AI/ML systems, particularly large language models (LLMs) or generative AI applications.
- Familiarity with concepts such as prompt engineering, prompt optimization, and LLM evaluation.
- Understanding of evaluation metrics such as precision, recall, and F1-score, along with AI-specific metrics for model quality and safety (a worked computation is sketched after this list).
- Experience evaluating RAG systems or knowledge-retrieval pipelines is a plus.
- Experience with modern AI evaluation or observability tools (e.g., DeepEval, Promptfoo, RAGAS, LangSmith, Arize, Weights & Biases) is a plus.
- Strong analytical mindset with the ability to interpret model behavior and propose improvements.
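For illustration, the sketch below shows how precision, recall, and F1 can be computed from binary evaluation judgments. The example labels, and the framing of comparing human judgments against an automated judge, are assumed purely for demonstration.

```python
# Hedged sketch: precision, recall, and F1 from binary eval judgments
# (e.g., 1 = "output judged factually correct", 0 = "incorrect").
def precision_recall_f1(y_true: list[int], y_pred: list[int]):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Example: human judgments vs. an automated judge's predictions.
print(precision_recall_f1([1, 0, 1, 1], [1, 1, 1, 0]))
# -> (0.666..., 0.666..., 0.666...)
```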
Nice to Have
- Experience performing adversarial testing or red-teaming of AI systems.
- Familiarity with AI safety, bias detection, and model alignment practices.
- Experience working in production environments deploying or monitoring AI systems.





