Custom Evaluators with AI Foundry

Carlos Mendible
In this post, we will explore how to create custom evaluators to evaluate your Generative AI application locally with the Azure AI Evaluation SDK.

The results of the evaluation can be uploaded to your Azure AI Foundry project where you can visualize and track the results.

Prerequisites

Before you begin, ensure you have the following:

  • An Azure subscription.
  • An Azure AI Foundry workspace.
  • An Azure AI Foundry project.
  • An Azure OpenAI resource.

Install the required packages

Install the necessary packages by running the following command:

pip install ipykernel
pip install promptflow
pip install promptflow-core
pip install azure-ai-evaluation

Environment Variables

Create the following environment variables or add them to a .env file:

AZURE_OPENAI_ENDPOINT=<your-azure-openai-endpoint>
AZURE_OPENAI_API_KEY=<your-azure-openai-api-key>
AZURE_OPENAI_DEPLOYMENT=<your-azure-openai-deployment>
AZURE_OPENAI_API_VERSION=<your-azure-openai-api-version>
AZURE_SUBSCRIPTION_ID=<your-azure-subscription-id>
AZURE_RESOURCE_GROUP=<your-azure-resource-group>
AZURE_AI_FOUNDRY_PROJECT=<your-azure-ai-foundry-project>

Imports

Import the necessary libraries:

import os
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from promptflow.core import AzureOpenAIModelConfiguration
from promptflow.tracing import start_trace

if "AZURE_OPENAI_API_KEY" not in os.environ:
    # load environment variables from .env file
    load_dotenv()

# Start a trace session and print a URL where you can inspect the trace
start_trace()

Setup Credentials and Configuration

Initialize Azure credentials and create the necessary configurations:

# Initialize Azure credentials
credential = DefaultAzureCredential()

# Create an Azure project configuration
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP"),
    "project_name": os.environ.get("AZURE_AI_FOUNDRY_PROJECT"),
}

# Create a model configuration
model_config = {
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

# Create an Azure OpenAI model configuration
configuration = AzureOpenAIModelConfiguration(
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"],
)

Groundedness Evaluator (Test a built-in evaluator)

Initialize and use the Groundedness evaluator:

from azure.ai.evaluation import GroundednessEvaluator

# Initializing Groundedness evaluator
groundedness_eval = GroundednessEvaluator(model_config)

query_response = dict(
    query="Which tent is the most waterproof?",
    context="The Alpine Explorer Tent is the second most water-proof of all tents available.",
    response="The Alpine Explorer Tent is the most waterproof."
)

# Running Groundedness Evaluator on a query and response pair
groundedness_score = groundedness_eval(
    **query_response
)
print(groundedness_score)
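
The evaluator returns a dictionary with a groundedness score from 1 to 5 and, depending on the SDK version, a reason for the score. An illustrative output (the exact keys, score, and wording depend on the model and SDK version) looks roughly like this:

{'groundedness': 3.0, 'gpt_groundedness': 3.0, 'groundedness_reason': 'The response contradicts the context, which states the tent is only the second most waterproof.'}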

Answer Length Custom Evaluator

Create the following class in the answer_length folder (answer_length/answer_length.py):

class AnswerLengthEvaluator:
    def __init__(self):
        pass
    # A class is made callable by implementing the special method __call__
    def __call__(self, *, answer: str, **kwargs):
        return {"answer_length": len(answer)}

Initialize and use the Answer Length evaluator:

from answer_length.answer_length import AnswerLengthEvaluator
answer_length_evaluator = AnswerLengthEvaluator()
answer_length = answer_length_evaluator(answer="What is the speed of light?")

print(answer_length)
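
Since the evaluator simply counts the characters of the answer, the call above prints:

{'answer_length': 27}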

Friendliness Custom Evaluator

Create the following class in the friendliness folder (friendliness/friendliness.py):

import os
import json
from promptflow.core import Prompty
from promptflow.core import AzureOpenAIModelConfiguration

class FriendlinessEvaluator:
    def __init__(self, configuration: AzureOpenAIModelConfiguration):
        # Load the friendliness.prompty file that sits next to this module
        current_dir = os.path.dirname(__file__)
        prompty_path = os.path.join(current_dir, "friendliness.prompty")

        # Override the model section of the prompty with the provided configuration
        override_model = {"configuration": configuration, "parameters": {"max_tokens": 512}}

        self.prompty = Prompty.load(source=prompty_path, model=override_model)

    def __call__(self, *, response: str, **kwargs):
        # Ask the model to score the response and parse the JSON it returns
        llm_response = self.prompty(response=response)
        try:
            result = json.loads(llm_response)
        except Exception:
            # If the model did not return valid JSON, return the raw text
            result = llm_response
        return result

Create a friendliness.prompty file in the friendliness folder with the following content:

---
name: Friendliness Evaluator
description: Friendliness Evaluator to measure warmth and approachability of answers.
model:
  api: chat
  parameters:
    temperature: 0.1
    # response_format: { "type": "json" }
inputs:
  response:
    type: string
outputs:
  score:
    type: int
  reason:
    type: string
---

system:
Friendliness assesses the warmth and approachability of the answer. Rate the friendliness of the response from one to five stars using the following scale:

One star: the answer is unfriendly or hostile

Two stars: the answer is mostly unfriendly

Three stars: the answer is neutral

Four stars: the answer is mostly friendly

Five stars: the answer is very friendly

Please assign a rating between 1 and 5 based on the tone and demeanor of the response.

**Example 1**
generated_query: I just don't feel like helping you! Your questions are getting very annoying.
output:
{"score": 1, "reason": "The response is not warm and resists providing helpful information."}
**Example 2**
generated_query: I'm sorry this watch is not working for you. Very happy to assist you with a replacement.
output:
{"score": 5, "reason": "The response is warm and empathetic, offering a resolution with care."}


**Here is the actual conversation to be scored:**
generated_query: {{response}}
output:

Initialize and use the Friendliness evaluator:

from friendliness.friendliness import FriendlinessEvaluator

friendliness_eval = FriendlinessEvaluator(configuration)

friendliness_score = friendliness_eval(response="I will not apologize for my behavior!")
print(friendliness_score)
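
Because the prompty instructs the model to reply with a JSON object, the parsed result should look something like the following (the actual score and wording depend on the model):

{'score': 1, 'reason': 'The response is defiant and shows no warmth or willingness to help.'}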

Evaluate with both built-in and custom evaluators

Evaluate data using both built-in and custom evaluators:

from azure.ai.evaluation import evaluate

result = evaluate(
    data="./data/data.csv", # provide your data here
    evaluators={
        "groundedness": groundedness_eval,
        "answer_length": answer_length_evaluator,
        "friendliness": friendliness_eval
    },
    # column mapping
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}"
            } 
        },
        "answer_length": {
            "column_mapping": {
                "answer": "${data.response}"
            }
        },
        "friendliness": {
            "column_mapping": {
                "response": "${data.response}"
            }
        }
    },
    # Provide your Azure AI project information to track your evaluation results in your Azure AI Foundry project
    azure_ai_project=azure_ai_project,
    # Provide an output path to dump a JSON file with the metric summary, row-level data, and the Azure AI project URL
    output_path="./results.json"
)

print(result)
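
The data file must contain the columns referenced in the column mappings above. A minimal ./data/data.csv could look like this (a single illustrative row reusing the earlier example):

query,context,response
"Which tent is the most waterproof?","The Alpine Explorer Tent is the second most water-proof of all tents available.","The Alpine Explorer Tent is the most waterproof."

As the comments note, results.json will also include the URL of the evaluation run in your Azure AI Foundry project, where you can visualize and track the results.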

Please find the complete code and Jupyter notebook here

Hope it helps!
