Hugging Face Evaluate


Hugging Face Evaluate

Evaluate is a Hugging Face library for easily evaluating models and datasets.

With a single line of code, you have access to dozens of evaluation methods for different domains (NLP, computer vision, reinforcement learning and more). Whether on your local machine, or in a distributed training setup, you can evaluate models in a consistent and reproducible manner.

A complete list of available metrics can be found on the evaluate page on Hugging Face. Each metric has a dedicated Hugging Face Space with an interactive demonstration of how to use it and a documentation card detailing its use and limitations.

This notebook has been automatically translated to make it accessible to more people. Please let me know if you see any typos.

Installation

Install the library with pip:

pip install evaluate

Types of evaluations

Several types of evaluations are available:

  • Metric: A metric is used to evaluate the performance of a model and usually involves model predictions and ground truth labels.
  • Comparison: Used to compare two models. This can be done, for example, by comparing their predictions with ground truth labels.
  • Measurement: The dataset is as important as the model trained on it. With measurements, the properties of a dataset can be investigated.

Load

Each metric, comparison or measurement can be loaded with the evaluate.load() function:

import evaluate
accuracy = evaluate.load("accuracy")
EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
predictions (`list` of `int`): Predicted labels.
references (`list` of `int`): Ground truth labels.
normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
sample_weight (`list` of `float`): Sample weights Defaults to None.
accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.
Example 1-A simple example
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
>>> print(results)
{'accuracy': 0.5}
Example 2-The same as Example 1, except with `normalize` set to `False`.
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
>>> print(results)
{'accuracy': 3.0}
Example 3-The same as Example 1, except with `sample_weight` set.
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
>>> print(results)
{'accuracy': 0.8778625954198473}
""", stored examples: 0)

If you want to be sure you load the right kind of module, whether metric, comparison or measurement, you can add the module_type parameter:

import evaluate
accuracy = evaluate.load("accuracy", module_type="metric")
word_length = evaluate.load("word_length", module_type="measurement")
[nltk_data] Downloading package punkt to
[nltk_data] /home/maximo.fernandez/nltk_data...
[nltk_data] Package punkt is already up-to-date!
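The accuracy docstring printed earlier in this section lists three examples. As a rough check of what they compute, here is a minimal plain-Python sketch. The helper name `simple_accuracy` is hypothetical and is not part of the evaluate library; it only mirrors the `normalize` and `sample_weight` behaviour that the docstring describes.

```python
from typing import Optional, Sequence


def simple_accuracy(
    predictions: Sequence[int],
    references: Sequence[int],
    normalize: bool = True,
    sample_weight: Optional[Sequence[float]] = None,
) -> float:
    """Hypothetical re-implementation of accuracy, for illustration only."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    # Without explicit weights, every sample counts once.
    weights = [1.0] * len(references) if sample_weight is None else list(sample_weight)
    # Sum the weights of the samples that were classified correctly.
    correct = sum(w for p, r, w in zip(predictions, references, weights) if p == r)
    # normalize=True returns a fraction, normalize=False the (weighted) count.
    return correct / sum(weights) if normalize else correct


references = [0, 1, 2, 0, 1, 2]
predictions = [0, 1, 1, 2, 1, 0]

plain = simple_accuracy(predictions, references)
count = simple_accuracy(predictions, references, normalize=False)
weighted = simple_accuracy(
    predictions, references, sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4]
)
print(plain, count, weighted)
```

Three of the six predictions are correct, which gives 0.5 as a fraction and 3 as a count. With weights, the correct samples carry 0.5 + 2 + 9 = 11.5 out of a total weight of 13.1, which matches the value in the third docstring example.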

Loading community modules

In addition to the modules offered by the library, you can also load modules that someone else has uploaded to the Hugging Face Hub.

element_count = evaluate.load("lvwerra/element_count", module_type="measurement")

List of available modules

To get a list of all the available modules, we use the list_evaluation_modules function, which accepts search filters:

evaluate.list_evaluation_modules(module_type="comparison", include_community=True, with_details=True)
[{'name': 'ncoop57/levenshtein_distance',
'type': 'comparison',
'community': True,
'likes': 0},
{'name': 'kaleidophon/almost_stochastic_order',
'type': 'comparison',
'community': True,
'likes': 1}]

Module attributes

All evaluation modules come with a variety of useful attributes that help you use the module.

Attribute            Description
description          A brief description of the evaluation module.
citation             A BibTeX string to cite when available.
features             A Features object that defines the input format.
inputs_description   This is equivalent to the module documentation string.
homepage             The home page of the module.
license              The module license.
codebase_urls        Link to the code behind the module.
reference_urls       Additional reference URLs.

Let's take a look at some of them

accuracy = evaluate.load("accuracy")
print(f"description: {accuracy.description}")
print(f"\ncitation: {accuracy.citation}")
print(f"\nfeatures: {accuracy.features}")
print(f"\ninputs_description: {accuracy.inputs_description}")
print(f"\nhomepage: {accuracy.homepage}")
print(f"\nlicense: {accuracy.license}")
print(f"\ncodebase_urls: {accuracy.codebase_urls}")
print(f"\nreference_urls: {accuracy.reference_urls}")
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
TP: True positive
TN: True negative
FP: False positive
FN: False negative
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}
predictions (`list` of `int`): Predicted labels.
references (`list` of `int`): Ground truth labels.
normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
sample_weight (`list` of `float`): Sample weights Defaults to None.
accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.
Example 1-A simple example
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
>>> print(results)
{'accuracy': 0.5}
Example 2-The same as Example 1, except with `normalize` set to `False`.
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
>>> print(results)
{'accuracy': 3.0}
Example 3-The same as Example 1, except with `sample_weight` set.
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
>>> print(results)
{'accuracy': 0.8778625954198473}
codebase_urls: []
reference_urls: ['']
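The description above defines accuracy in terms of TP, TN, FP and FN. As a small worked example of that formula, the sketch below counts the four cases for a binary problem. The function name `confusion_counts` and the sample labels are made up for illustration; they are not taken from the library.

```python
def confusion_counts(predictions, references):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for pred, ref in zip(predictions, references):
        if pred == 1 and ref == 1:
            tp += 1  # predicted positive, actually positive
        elif pred == 0 and ref == 0:
            tn += 1  # predicted negative, actually negative
        elif pred == 1 and ref == 0:
            fp += 1  # predicted positive, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return tp, tn, fp, fn


# Illustrative labels, chosen so that each of the four cases occurs once.
predictions = [1, 0, 0, 1]
references = [0, 1, 0, 1]

tp, tn, fp, fn = confusion_counts(predictions, references)
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy_from_counts = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy_from_counts)
```

Here each case occurs exactly once, so the accuracy is (1 + 1) / 4 = 0.5.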

Execution

Now that we know how the evaluation module works and what it should contain, let's use it. When it comes to calculating the evaluation, there are two main ways to do it:

  • All in one
  • Incremental

In the incremental approach, the required entries are added to the module with EvaluationModule.add() or EvaluationModule.add_batch() and the score is computed at the end with EvaluationModule.compute(). Alternatively, all entries can be passed at once to compute().

Let's look at these two approaches.

All in one

Once we have all the predictions and the ground truth, we can calculate the metric by passing them to the module's compute() method.

accuracy = evaluate.load("accuracy")
predictions = [1, 0, 0, 1]
targets = [0, 1, 0, 1]
accuracy_value = accuracy.compute(predictions=predictions, references=targets)
{'accuracy': 0.5}

Incremental

In many evaluation processes, predictions are built iteratively, as in a for loop. In that case, you could store the predictions and ground truth in a list and at the end pass them to compute().

However, with the add() and add_batch() methods you can avoid the step of storing the predictions.

If you get the predictions one at a time, one prediction and reference pair per step, use the add() method.

for ref, pred in zip([0,1,0,1], [1,0,0,1]):
    accuracy.add(references=ref, predictions=pred)
accuracy_value = accuracy.compute()
{'accuracy': 0.5}

However, when you get the predictions of a whole batch at once, use the add_batch() method.

for refs, preds in zip([[0,1],[0,1]], [[1,0],[0,1]]):
    accuracy.add_batch(references=refs, predictions=preds)
accuracy_value = accuracy.compute()
{'accuracy': 0.5}
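To illustrate the incremental pattern itself, here is a minimal sketch of an accumulator with the same three steps: add, add_batch and compute. The class `RunningAccuracy` is hypothetical and is not how the library is implemented; it keeps only two counters, which is enough for accuracy but would not be enough for every metric.

```python
class RunningAccuracy:
    """Hypothetical stand-in that mimics the add / add_batch / compute pattern."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add(self, prediction, reference):
        # One prediction and reference pair at a time.
        self.correct += int(prediction == reference)
        self.total += 1

    def add_batch(self, predictions, references):
        # A whole batch of pairs at once.
        for prediction, reference in zip(predictions, references):
            self.add(prediction, reference)

    def compute(self):
        if self.total == 0:
            raise ValueError("nothing has been added yet")
        result = {"accuracy": self.correct / self.total}
        # Reset the counters so the object can be reused for another run.
        self.correct = 0
        self.total = 0
        return result


running = RunningAccuracy()

# Sample by sample.
for ref, pred in zip([0, 1, 0, 1], [1, 0, 0, 1]):
    running.add(prediction=pred, reference=ref)
per_sample = running.compute()

# Batch by batch, with the same data split into two batches.
for refs, preds in zip([[0, 1], [0, 1]], [[1, 0], [0, 1]]):
    running.add_batch(predictions=preds, references=refs)
per_batch = running.compute()

print(per_sample, per_batch)
```

Both loops see the same four pairs, so they end with the same score; only the granularity of each call differs.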

Combination of several evaluations

Often, one wants to evaluate not just a single metric but a variety of metrics that capture different aspects of a model. For example, for classification it is often a good idea to calculate F1, precision and recall in addition to accuracy to get a better picture of model performance. Evaluate allows you to load several metrics and call them one after another. However, the most convenient way is to use the combine() function to group them.

classification_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
predictions = [0, 1, 0]
targets = [0, 1, 1]
classification_metrics.compute(predictions=predictions, references=targets)
{'accuracy': 0.6666666666666666,
'f1': 0.6666666666666666,
'precision': 1.0,
'recall': 0.5}
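The four numbers above can be checked by hand. The following sketch recomputes them in plain Python for the same three predictions, treating label 1 as the positive class. The function `binary_classification_scores` is a hypothetical helper written for this check, not part of the library.

```python
def binary_classification_scores(predictions, references, positive=1):
    """Hypothetical helper: accuracy, F1, precision and recall for binary labels."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == positive and r == positive)
    fp = sum(1 for p, r in pairs if p == positive and r != positive)
    fn = sum(1 for p, r in pairs if p != positive and r == positive)
    correct = sum(1 for p, r in pairs if p == r)

    # Guard against empty denominators when a class never appears.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": correct / len(pairs),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }


scores = binary_classification_scores(predictions=[0, 1, 0], references=[0, 1, 1])
print(scores)
```

There is one true positive, no false positive and one false negative, so precision is 1/1, recall is 1/2, and F1 is their harmonic mean, 2/3.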

Save results

We can save the evaluation results to a file with the save() function by passing it a directory or file name. We can also pass extra parameters, such as an experiment identifier.

result = accuracy.compute(references=[0, 1, 0, 1], predictions=[1, 0, 0, 1])
hyperparams = {"model": "bert-base-uncased"}
evaluate.save("./results/", experiment="run 42", **result, **hyperparams)

As we can see, we had to create a hyperparams variable to pass to save(). Normally this will not be necessary, because we will already have the hyperparameters of the model we are training.

This creates a JSON file with all the information:

import pathlib
path = pathlib.Path("./results/")
files = list(path.glob("*"))
import json
result_file = files[0]
result_json = pathlib.Path(result_file).read_text()
result_dict = json.loads(result_json)
{'experiment': 'run 42',
'accuracy': 0.5,
'model': 'bert-base-uncased',
'_timestamp': '2024-04-25T17:45:41.218287',
'_git_commit_hash': '8725338b6bf9c97274685df41b2ee6e11319a735',
'_evaluate_version': '0.4.1',
'_python_version': '3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0]',
'_interpreter_path': '/home/maximo.fernandez/miniconda3/envs/nlp/bin/python'}
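The idea of storing one JSON record per run and reading it back can be sketched with the standard library alone. The helper `save_result` below is hypothetical and far simpler than the library's own function; it only adds a timestamp, whereas the real output above also records the git commit, library version and interpreter.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def save_result(directory, **fields):
    """Hypothetical helper: write one JSON file per result into `directory`."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    record = dict(fields)
    record["_timestamp"] = datetime.now().isoformat()
    # Colons are not valid in file names on every platform, so replace them.
    file_name = f"result-{record['_timestamp'].replace(':', '_')}.json"
    file_path = directory / file_name
    file_path.write_text(json.dumps(record, indent=2))
    return file_path


# A temporary directory keeps the sketch from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = save_result(tmp, experiment="run 42", accuracy=0.5, model="bert-base-uncased")
    loaded = json.loads(path.read_text())

print(loaded["experiment"], loaded["accuracy"], loaded["model"])
```

Keeping the metric values and the hyperparameters in the same record makes it easy to compare runs later by loading every file in the directory.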

Upload results to the Hub

If we are training a model, we can upload the evaluation results to its model card with the push_to_hub() function, so that they appear on the model page.
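As a rough sketch of what such a call could look like, the code below builds the keyword arguments for `evaluate.push_to_hub` and wraps the call in a function. The repository name `my-user/my-model`, the dataset and task labels, and both helper names are placeholders chosen for illustration. The argument names follow the library documentation as far as I know, so check them against your installed version. The upload itself needs a Hub login with write access, so the sketch does not run it.

```python
def build_push_kwargs(model_id: str, accuracy_value: float) -> dict:
    """Hypothetical helper: collect the arguments describing one result."""
    return {
        "model_id": model_id,                # model repository on the Hub
        "metric_value": accuracy_value,      # the score to report
        "metric_type": "accuracy",           # metric identifier
        "metric_name": "Accuracy",           # display name of the metric
        "dataset_type": "imdb",              # dataset identifier on the Hub
        "dataset_name": "IMDb",              # display name of the dataset
        "dataset_split": "test",             # split used for the evaluation
        "task_type": "text-classification",  # task identifier
        "task_name": "Text Classification",  # display name of the task
    }


def push_accuracy(model_id: str, accuracy_value: float) -> None:
    """Hypothetical wrapper: upload the result to the model card."""
    # Imported lazily so the rest of the sketch works without the library.
    import evaluate

    evaluate.push_to_hub(**build_push_kwargs(model_id, accuracy_value))


# Placeholder repository name; replace it with a model you can write to.
kwargs = build_push_kwargs("my-user/my-model", 0.933)
print(kwargs)
# push_accuracy("my-user/my-model", 0.933)  # uncomment to actually upload
```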

Evaluator

If we have a model, a dataset and a metric, we can run inference over the whole dataset ourselves and pass the predictions and the actual labels to the metric to obtain the model's score.

Or we can hand everything to the library and let it do the work for us. The evaluator() function returns a task evaluator; we pass the model, the dataset and the metric to its compute() method and it does everything for us.

First we define the model, the dataset and the metric

from transformers import pipeline
from datasets import load_dataset
from evaluate import evaluator
import evaluate
model_pipeline = pipeline("text-classification", model="lvwerra/distilbert-imdb", device=0)
dataset = load_dataset("imdb", split="test").shuffle().select(range(1000))
metric = evaluate.load("accuracy")

Now we pass everything to evaluator().

task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline=model_pipeline,
    data=dataset,
    metric=metric,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
{'accuracy': 0.933,
'total_time_in_seconds': 29.43192940400013,
'samples_per_second': 33.97670557962431,
'latency_in_seconds': 0.02943192940400013}

Thanks to the evaluator we were able to obtain the model metrics without having to make the inference ourselves.
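Besides the accuracy, the output contains three timing fields. The values shown are consistent with a simple relationship to the 1000 selected samples, which the short check below reproduces. That the library derives them exactly this way is an assumption based on these numbers.

```python
import math

# Values copied from the evaluator output above.
results = {
    "accuracy": 0.933,
    "total_time_in_seconds": 29.43192940400013,
    "samples_per_second": 33.97670557962431,
    "latency_in_seconds": 0.02943192940400013,
}
num_samples = 1000  # size of the subset selected from the test split

# Throughput: how many samples were processed per second.
samples_per_second = num_samples / results["total_time_in_seconds"]
# Latency: average time spent on one sample.
latency_in_seconds = results["total_time_in_seconds"] / num_samples

print(samples_per_second, latency_in_seconds)
print(math.isclose(samples_per_second, results["samples_per_second"], rel_tol=1e-9))
print(math.isclose(latency_in_seconds, results["latency_in_seconds"], rel_tol=1e-9))
```

Throughput and latency are therefore reciprocals of each other here, so either one is enough to compare the speed of two models evaluated on the same data.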

Display

Sometimes we get different metrics for different models, which makes it difficult to compare them, so graphs make it easier.

The Evaluate library offers visualizations through its evaluate.visualization module, for example the radar_plot() function. We have to pass the data to it as a list of dictionaries, where every dictionary has the same keys.

In order to use this function it is necessary to have the matplotlib library installed.

pip install matplotlib
import evaluate
from evaluate.visualization import radar_plot
data = [
    {"accuracy": 0.99, "precision": 0.8, "f1": 0.95, "latency_in_seconds": 33.6},
    {"accuracy": 0.98, "precision": 0.87, "f1": 0.91, "latency_in_seconds": 11.2},
    {"accuracy": 0.98, "precision": 0.78, "f1": 0.88, "latency_in_seconds": 87.6},
    {"accuracy": 0.88, "precision": 0.78, "f1": 0.81, "latency_in_seconds": 101.6},
]
model_names = ["Model 1", "Model 2", "Model 3", "Model 4"]
plot = radar_plot(data=data, model_names=model_names)
/tmp/ipykernel_10271/ UserWarning: FigureCanvasAgg is non-interactive, and thus cannot be shown
[Image: radar plot comparing the four models]

We can now visually compare the 4 models and choose the optimal one, based on one or several metrics
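The same comparison can also be made numerically. As a minimal sketch, the code below picks the best model for a given metric from the same list of dictionaries. The helper `best_model` is hypothetical, and whether higher or lower is better has to be stated per metric, since latency should be as small as possible.

```python
data = [
    {"accuracy": 0.99, "precision": 0.8, "f1": 0.95, "latency_in_seconds": 33.6},
    {"accuracy": 0.98, "precision": 0.87, "f1": 0.91, "latency_in_seconds": 11.2},
    {"accuracy": 0.98, "precision": 0.78, "f1": 0.88, "latency_in_seconds": 87.6},
    {"accuracy": 0.88, "precision": 0.78, "f1": 0.81, "latency_in_seconds": 101.6},
]
model_names = ["Model 1", "Model 2", "Model 3", "Model 4"]


def best_model(metric, higher_is_better=True):
    """Hypothetical helper: name of the model with the best value for `metric`."""
    pick = max if higher_is_better else min
    name, _ = pick(zip(model_names, data), key=lambda pair: pair[1][metric])
    return name


most_accurate = best_model("accuracy")
most_precise = best_model("precision")
fastest = best_model("latency_in_seconds", higher_is_better=False)
print(most_accurate, most_precise, fastest)
```

Model 1 has the highest accuracy, while Model 2 is both the most precise and the fastest, which is the kind of trade-off the radar plot makes visible at a glance.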

Evaluating the model on a set of tasks

We can also evaluate a model on several datasets at once. For this we can use the EvaluationSuite class. As an example, we are going to create a suite that evaluates a model on the imdb and sst2 datasets. First we inspect these datasets with the load_dataset_builder function, which lets us look at their features without downloading the complete dataset.

from datasets import load_dataset_builder
imdb = load_dataset_builder("imdb")
imdb.info.features
{'text': Value(dtype='string', id=None),
'label': ClassLabel(names=['neg', 'pos'], id=None)}
sst2 = load_dataset_builder("sst2")
sst2.info.features
{'idx': Value(dtype='int32', id=None),
'sentence': Value(dtype='string', id=None),
'label': ClassLabel(names=['negative', 'positive'], id=None)}

As we can see, with the imdb dataset we need to take the text column to get the text and the label column to get the target. With the sst2 dataset we need to take the sentence column to get the text and the label column to get the target.

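We create the evaluation suite for the two datasets:

import evaluate
from evaluate.evaluation_suite import SubTask

class Suite(evaluate.EvaluationSuite):
    def __init__(self, name):
        super().__init__(name)
        # One SubTask per dataset. task_type is assumed to be
        # "text-classification", since both datasets are sentiment tasks.
        self.suite = [
            SubTask(
                task_type="text-classification",
                data="imdb",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "text",
                    "label_column": "label",
                    "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0},
                },
            ),
            SubTask(
                task_type="text-classification",
                data="sst2",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "sentence",
                    "label_column": "label",
                    "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0},
                },
            ),
        ]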

As can be seen in split="test[:1]", we take only one example from the test split, so that the execution in this notebook does not take too long.

Now we evaluate with the model huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli

from evaluate import EvaluationSuite
suite = EvaluationSuite.load('mathemakitten/sentiment-evaluation-suite')
results = suite.run("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")
`data` is a preloaded Dataset! Ignoring `subset` and `split`.
`data` is a preloaded Dataset! Ignoring `subset` and `split`.
[{'accuracy': 0.3,
  'total_time_in_seconds': 1.4153412349987775,
  'samples_per_second': 7.06543394110088,
  'latency_in_seconds': 0.14153412349987776,
  'task_name': 'imdb',
  'data_preprocessor': '<function Suite.__init__.<locals>.<lambda> at 0x7f3ff27a5080>'},
 {'accuracy': 0.0,
  'total_time_in_seconds': 0.1323430729971733,
  'samples_per_second': 75.56118936586572,
  'latency_in_seconds': 0.013234307299717328,
  'task_name': 'sst2',
  'data_preprocessor': '<function Suite.__init__.<locals>.<lambda> at 0x7f3f2a9cc720>'}]
