AI Model evaluation with Amazon Bedrock: metrics, methods, and best practices
Evaluating a model is essential for ensuring its accuracy, quality, and effectiveness. It provides insights into the model’s performance, helps improve its capabilities, and ensures it meets the required standards for real-world applications. Evaluation supports informed decision-making, risk management, and compliance, ultimately contributing to better user experiences and stakeholder communication.
Amazon Bedrock offers a feature known as automatic evaluation to facilitate this process. This approach helps ensure that generative models (such as those used for text summarization, question answering, text classification, or open-ended text generation) meet high-quality standards.
Automatic evaluation
Automatic evaluation of AI models is crucial for ensuring that models perform as expected and produce high-quality results. This process typically involves the following steps:
- Specify the tasks the model should perform (such as text classification or question answering) and the evaluation metrics to measure performance.
- Prepare test datasets, which can be custom-made or pre-defined. These datasets contain the prompts or questions the model will process.
- Test the model by having it generate responses to the prompts from the test datasets.
- Compare the generated responses with reference answers (benchmark answers) using specific evaluation metrics.
- Evaluate the generated responses using a judge model, which assigns a grading score based on how similar the responses are to the benchmark answers. This score can be computed using metrics such as the BERT score and the F1 score.
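In practice, such a job can be started programmatically. Below is a minimal sketch using the AWS SDK for Python (boto3) to launch an automatic evaluation job; the role ARN, S3 URIs, dataset name, and model identifier are placeholders, and the exact parameter shapes and built-in metric names should be verified against the current Amazon Bedrock API documentation.

```python
import boto3

# Control-plane client for Amazon Bedrock (not bedrock-runtime).
bedrock = boto3.client("bedrock", region_name="us-east-1")

# All ARNs, URIs, and names below are placeholders for illustration only.
response = bedrock.create_evaluation_job(
    jobName="qa-auto-eval-demo",
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "my-qa-benchmark",
                        "datasetLocation": {"s3Uri": "s3://my-bucket/eval/qa-prompts.jsonl"},
                    },
                    # Built-in metric names; check the docs for the current list.
                    "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"],
                }
            ]
        }
    },
    inferenceConfig={
        "models": [
            {"bedrockModel": {"modelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0"}}
        ]
    },
    outputDataConfig={"s3Uri": "s3://my-bucket/eval/results/"},
)

print("Started evaluation job:", response["jobArn"])
```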
BERT Score
The BERT score is a metric developed to assess the quality of generated responses compared to a set of reference responses. It uses contextual embeddings from a pre-trained BERT model to calculate semantic similarity between the generated responses and the reference answers. Here’s how it works:
- BERT converts each token into a contextual embedding vector, capturing both semantic and contextual information.
- Each token in the generated response is matched to the most similar token in the reference answer (and vice versa), using cosine similarity between their embeddings.
- These token-level similarities are aggregated into an overall score reflecting how well the generated response matches the reference answer.
BERT score is particularly useful because it considers semantics and context, improving the evaluation over methods based solely on string or word matching.
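Outside of a managed evaluation job, you can compute the same kind of metric yourself. A minimal sketch using the open-source bert-score Python package (the example sentences are purely illustrative):

```python
# pip install bert-score
from bert_score import score

# Model-generated answers paired with their reference (benchmark) answers.
candidates = ["The capital of France is Paris."]
references = ["Paris is the capital city of France."]

# score() returns precision, recall, and F1 tensors, one entry per pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1[0].item():.3f}")
```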
F1 Score
The F1 score is a metric commonly used to evaluate the performance of classification models, including text classification models. It is the harmonic mean of precision and recall and is calculated as follows:
Precision: the proportion of true positives among all instances classified as positive (precision = TP / (TP + FP)). In other words, how many of the model’s positive predictions are actually correct.
Recall: the proportion of true positives among all instances that are actually positive (recall = TP / (TP + FN)). In other words, how many of the actual positives the model identified.
F1 Score: the harmonic mean of precision and recall (F1 = 2 × precision × recall / (precision + recall)), providing a single score that balances the two. The F1 score is particularly useful when there is an imbalance between classes.
The following worked example illustrates the concept more concretely.
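A minimal sketch with purely illustrative counts for a hypothetical binary (spam / not-spam) classifier:

```python
# Toy confusion-matrix counts for a binary spam classifier.
# The numbers are purely illustrative.
true_positives = 40   # spam correctly flagged as spam
false_positives = 10  # legitimate messages wrongly flagged as spam
false_negatives = 20  # spam the model missed

precision = true_positives / (true_positives + false_positives)  # 40 / 50 = 0.80
recall = true_positives / (true_positives + false_negatives)     # 40 / 60 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)               # ≈ 0.73

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```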
Human evaluation of AI models
Human evaluation is another critical approach to assessing the quality of AI models. Unlike automatic evaluation, which relies on algorithms and predefined metrics, human evaluation involves subjective assessment by individuals, such as employees or subject matter experts (SMEs). Here’s how it works:
- As with automatic evaluation, human evaluations begin with benchmark questions and ideal benchmark answers. These provide a standard against which the AI-generated responses are assessed.
- Individuals review the benchmark answers alongside the responses generated by the AI model. They assess the quality of these responses based on their own judgment.
- There are several ways human evaluators can assess the responses: a simple binary judgment of whether a response is acceptable, ranking of alternative responses, or qualitative feedback on why a response is considered correct or incorrect.
- After the evaluation, a grading score is calculated based on the human assessments. This score reflects the overall quality of the AI-generated responses from the human perspective.
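In its simplest form, the grading score is just the share of responses the evaluators judged acceptable. A minimal sketch, assuming binary ratings from two hypothetical evaluators (the data is purely illustrative):

```python
# Binary human judgments (1 = acceptable, 0 = not acceptable) per response,
# one list per evaluator. The data is purely illustrative.
ratings = {
    "evaluator_a": [1, 0, 1, 1, 0],
    "evaluator_b": [1, 1, 1, 0, 0],
}

# Grading score = overall acceptance rate across all evaluators and responses.
all_votes = [vote for votes in ratings.values() for vote in votes]
grading_score = sum(all_votes) / len(all_votes)
print(f"Grading score (acceptance rate): {grading_score:.2f}")  # 0.60
```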
Human evaluation is preferable when the task requires deep contextual understanding, subjectivity, or expertise that AI may not possess. It’s also useful in cases where the quality and nuance of the output are critical.
In many cases, a hybrid approach might be the most effective, combining the efficiency of AI with the nuanced judgment of human evaluators to achieve the best results.
Business metrics
In addition to traditional grading metrics for evaluating a foundation model, it’s important to consider business metrics that reflect the model’s impact on real-world outcomes. These metrics are often more challenging to assess but are crucial for understanding the model’s effectiveness in a practical, business-oriented context.
- User satisfaction: you can collect and analyze user feedback to gauge satisfaction with the model’s responses. This is particularly important in customer-facing applications, such as chatbots or e-commerce platforms.
Example: In an e-commerce setting, user satisfaction could be measured by how well the AI assists users in finding products, resolving issues, or providing accurate information.
- Average Revenue Per User (ARPU): by evaluating the average revenue generated per user, you can assess the economic impact of the AI model. If the model is effectively driving sales or enhancing user engagement, you would expect this metric to increase.
Example: a successful generative AI application in an online store might improve product recommendations, leading to higher sales and an increase in ARPU.
- Versatility: this metric measures the model’s ability to perform well across different tasks and domains. A strong foundation model should adapt to various contexts, maintaining high performance even when the task changes.
Example: if an AI model performs well not only in generating product descriptions but also in customer service and marketing copy, it demonstrates strong cross-domain performance.
- Conversion rate: this metric assesses whether the AI model is achieving desired outcomes, such as turning website visitors into buyers or leads into customers (a simple calculation of ARPU and conversion rate is sketched after this list).
Example: if the AI-driven features on a website, like personalized content or chatbot assistance, result in higher conversion rates, it indicates the model’s effectiveness in driving business goals.
- Efficiency: evaluating the efficiency of the model involves looking at how much it costs to run and how well it utilizes computational resources. This is important for maintaining a balance between performance and operational costs.
Example: if an AI model delivers high-quality outputs but requires extensive computing power, you’ll need to assess whether the benefits outweigh the costs. Efficient models provide strong performance without excessive resource consumption.
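The business metrics above reduce to simple ratios once the underlying figures are tracked. A minimal sketch with purely illustrative numbers for a hypothetical AI-assisted online store, computing ARPU and conversion rate:

```python
# Illustrative monthly figures for a hypothetical AI-assisted online store.
total_revenue = 125_000.00  # revenue attributed to the period
active_users = 5_000        # users active in the same period
visitors = 40_000           # site visitors in the period
purchases = 1_200           # visitors who completed a purchase

arpu = total_revenue / active_users      # 25.00 -> average revenue per user
conversion_rate = purchases / visitors   # 0.03  -> 3% of visitors convert

print(f"ARPU: ${arpu:.2f}")
print(f"Conversion rate: {conversion_rate:.1%}")
```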