Hugging Face evaluate
The Hugging Face Evaluate library makes it easy to evaluate models and datasets.
With a single line of code, you have access to dozens of evaluation methods for different domains (NLP, computer vision, reinforcement learning and more). Whether on your local machine, or in a distributed training setup, you can evaluate models in a consistent and reproducible manner.
A complete list of available metrics can be found on the Hugging Face evaluate page. Each metric has a dedicated Hugging Face Space with an interactive demonstration of how to use it and a documentation card detailing its use and limitations.
This notebook has been automatically translated to make it accessible to more people; please let me know if you see any typos.
Installation
To install the library, run the following:
pip install evaluate
Types of evaluations
Several types of evaluations are available; a short usage sketch of each follows the list.

- Metric: A metric is used to evaluate the performance of a model and usually takes model predictions and ground truth labels.
- Comparison: Used to compare two models. This can be done, for example, by comparing their predictions with ground truth labels.
- Measurement: The dataset is as important as the model trained on it. With measurements, the properties of a dataset can be investigated.
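As a rough sketch of how each type is used (exact_match and word_length are just illustrative modules here, and the exact argument names should be checked on each module's card):

import evaluate

# Metric: compare model predictions against ground truth labels
accuracy = evaluate.load("accuracy")
accuracy.compute(predictions=[0, 1, 1], references=[0, 1, 0])

# Comparison: compare the predictions of two models against each other
exact_match = evaluate.load("exact_match", module_type="comparison")
exact_match.compute(predictions1=[0, 1, 1], predictions2=[1, 1, 1])

# Measurement: inspect properties of a dataset rather than of a model
word_length = evaluate.load("word_length", module_type="measurement")
word_length.compute(data=["hello world", "evaluate makes evaluation easy"])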
Load
Each metric, comparison, or measurement can be loaded with the load method:
import evaluate

accuracy = evaluate.load("accuracy")
accuracy
EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`. A higher score means higher accuracy.

Examples:

    Example 1 - A simple example
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
        {'accuracy': 0.5}

    Example 2 - The same as Example 1, except with `normalize` set to `False`.
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
        >>> print(results)
        {'accuracy': 3.0}

    Example 3 - The same as Example 1, except with `sample_weight` set.
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
        >>> print(results)
        {'accuracy': 0.8778625954198473}
""", stored examples: 0)
If you want to make sure you are loading the right type of module, whether metric, comparison, or measurement, you can do so by adding the module_type parameter:
import evaluate

accuracy = evaluate.load("accuracy", module_type="metric")
word_length = evaluate.load("word_length", module_type="measurement")
[nltk_data] Downloading package punkt to
[nltk_data]     /home/maximo.fernandez/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Community module loading
In addition to the modules offered by the library, you can also load modules that other people have uploaded to the Hugging Face Hub.
element_count = evaluate.load("lvwerra/element_count", module_type="measurement")
List of available modules
If we want to get a list of all the available modules, we can use the list_evaluation_modules function, to which we can pass search filters:
evaluate.list_evaluation_modules(
    module_type="comparison",
    include_community=True,
    with_details=True
)
[{'name': 'ncoop57/levenshtein_distance',
  'type': 'comparison',
  'community': True,
  'likes': 0},
 {'name': 'kaleidophon/almost_stochastic_order',
  'type': 'comparison',
  'community': True,
  'likes': 1}]
Module attributes
All evaluation modules come with a variety of useful attributes that help you use the module.
| Attribute | Description |
|---|---|
| description | A brief description of the evaluation module. |
| citation | A BibTeX string to cite when available. |
| features | A Features object that defines the input format. |
| inputs_description | This is equivalent to the module's documentation string. |
| homepage | The home page of the module. |
| license | The license of the module. |
| codebase_urls | Links to the code behind the module. |
| reference_urls | Additional reference URLs. |
Let's take a look at some of them
accuracy = evaluate.load("accuracy")

print(f"description: {accuracy.description}")
print(f"\ncitation: {accuracy.citation}")
print(f"\nfeatures: {accuracy.features}")
print(f"\ninputs_description: {accuracy.inputs_description}")
print(f"\nhomepage: {accuracy.homepage}")
print(f"\nlicense: {accuracy.license}")
print(f"\ncodebase_urls: {accuracy.codebase_urls}")
print(f"\nreference_urls: {accuracy.reference_urls}")
description:
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative

citation:
@article{scikit-learn,
  title={Scikit-learn: Machine Learning in {P}ython},
  author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
          and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
          and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
          Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
  journal={Journal of Machine Learning Research},
  volume={12},
  pages={2825--2830},
  year={2011}
}

features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}

inputs_description:
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`. A higher score means higher accuracy.

Examples:

    Example 1 - A simple example
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
        {'accuracy': 0.5}

    Example 2 - The same as Example 1, except with `normalize` set to `False`.
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
        >>> print(results)
        {'accuracy': 3.0}

    Example 3 - The same as Example 1, except with `sample_weight` set.
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
        >>> print(results)
        {'accuracy': 0.8778625954198473}

homepage:

license:

codebase_urls: []

reference_urls: ['https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html']
Execution
Now that we know how the evaluation module works and what it should contain, let's use it. When it comes to calculating the evaluation, there are two main ways to do it:
- All in one
- Incremental
In the incremental approach, the required entries are added to the module with EvaluationModule.add() or EvaluationModule.add_batch(), and the score is computed at the end with EvaluationModule.compute(). Alternatively, all entries can be passed at once to compute().
Let's look at these two approaches.
All in one
Once we have all the predictions and the ground truth, we can calculate the metric: with the module defined, we pass it the predictions and the ground truth through its compute() method.
accuracy = evaluate.load("accuracy")

predictions = [1, 0, 0, 1]
targets = [0, 1, 0, 1]

accuracy_value = accuracy.compute(predictions=predictions, references=targets)
accuracy_value
{'accuracy': 0.5}
Incremental
In many evaluation processes, predictions are built iteratively, as in a for loop. In that case, you could store the predictions and ground truth in a list and at the end pass them to compute().
However, with the add() and add_batch() methods you can avoid the step of storing the predictions.
If you generate the predictions one at a time, use the add() method.
for ref, pred in zip([0, 1, 0, 1], [1, 0, 0, 1]):
    accuracy.add(references=ref, predictions=pred)

accuracy_value = accuracy.compute()
accuracy_value
{'accuracy': 0.5}
However, when the predictions come in batches, use the add_batch() method.
for refs, preds in zip([[0, 1], [0, 1]], [[1, 0], [0, 1]]):
    accuracy.add_batch(references=refs, predictions=preds)

accuracy_value = accuracy.compute()
accuracy_value
{'accuracy': 0.5}
Combination of several evaluations
Often, one wants to evaluate not just a single metric but a variety of metrics that capture different aspects of a model. For example, for classification it is usually a good idea to calculate F1, recall, and precision in addition to accuracy to get a better picture of model performance. Evaluate allows you to load a bunch of metrics and call them sequentially, but the most convenient way is to use the combine() function to group them.
clasification_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

predictions = [0, 1, 0]
targets = [0, 1, 1]

clasification_metrics.compute(predictions=predictions, references=targets)
{'accuracy': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'precision': 1.0,
 'recall': 0.5}
Save results
We can save the evaluation results to a file with the save() method by passing it a path. We can also pass extra key-value pairs, such as an experiment identifier:
references = [0, 1, 0, 1]
targets = [1, 0, 0, 1]

result = accuracy.compute(references=references, predictions=targets)

hyperparams = {"model": "bert-base-uncased"}
evaluate.save("./results/", experiment="run 42", **result, **hyperparams)
PosixPath('results/result-2024_04_25-17_45_41.json')
As we can see, we created a hyperparams variable to pass to the save() method. This normally will not be necessary, because we will already have the hyperparameters of the model we are training.
This creates a JSON file with all the information:
import pathlib

path = pathlib.Path("./results/")
files = list(path.glob("*"))
files
[PosixPath('results/result-2024_04_25-17_45_41.json')]
import json

result_file = files[0]
result_json = pathlib.Path(result_file).read_text()
result_dict = json.loads(result_json)
result_dict
{'experiment': 'run 42',
 'accuracy': 0.5,
 'model': 'bert-base-uncased',
 '_timestamp': '2024-04-25T17:45:41.218287',
 '_git_commit_hash': '8725338b6bf9c97274685df41b2ee6e11319a735',
 '_evaluate_version': '0.4.1',
 '_python_version': '3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0]',
 '_interpreter_path': '/home/maximo.fernandez/miniconda3/envs/nlp/bin/python'}
Upload results to the hub
If we are training a model, we can upload the evaluation results to its model card with the push_to_hub() method, so that they appear on the model page.
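As an orientation, a call might look like the following sketch; the model_id and the values here are made up, and the exact parameter names should be checked against the Evaluate documentation:

import evaluate

# Hypothetical example: report an accuracy value on a model's card
evaluate.push_to_hub(
    model_id="my-user/my-finetuned-model",  # hypothetical model repository on the Hub
    metric_type="accuracy",                 # metric id
    metric_name="Accuracy",                 # display name of the metric
    metric_value=0.933,                     # value obtained in the evaluation
    dataset_type="imdb",                    # dataset id on the Hub
    dataset_name="IMDb",                    # display name of the dataset
    dataset_split="test",                   # split the evaluation was run on
    task_type="text-classification",        # task id
    task_name="Text Classification",        # display name of the task
)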
Evaluator
If we have a model, a dataset, and a metric, we could run inference over the whole dataset ourselves and then pass the predictions and the true labels to the metric to obtain the scores. Or we can hand everything to the library and let it do the work for us: with the evaluator() function we pass it the model, the dataset, and the metric, and it does everything for us.
First we define the model, the dataset and the metric
from transformers import pipeline
from datasets import load_dataset
from evaluate import evaluator
import evaluate

model_pipeline = pipeline("text-classification", model="lvwerra/distilbert-imdb", device=0)
dataset = load_dataset("imdb", split="test").shuffle().select(range(1000))
metric = evaluate.load("accuracy")
Now we create a task evaluator with evaluator() and pass everything to its compute() method:
task_evaluator = evaluator("text-classification")

results = task_evaluator.compute(
    model_or_pipeline=model_pipeline,
    data=dataset,
    metric=metric,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
results
{'accuracy': 0.933,
 'total_time_in_seconds': 29.43192940400013,
 'samples_per_second': 33.97670557962431,
 'latency_in_seconds': 0.02943192940400013}
Thanks to the evaluator, we obtained the model's metrics without having to run the inference ourselves.
Visualization
Sometimes we get different metrics for different models, which makes them difficult to compare, so plots make the comparison easier. The Evaluate library offers several visualizations through its visualization module, for example the radar_plot() function. We have to pass it the data as a list of dictionaries, where every dictionary must have the same keys.
In order to use this function, it is necessary to have the matplotlib library installed:
pip install matplotlib
import evaluate
from evaluate.visualization import radar_plot
data = [
{"accuracy": 0.99, "precision": 0.8, "f1": 0.95, "latency_in_seconds": 33.6},
{"accuracy": 0.98, "precision": 0.87, "f1": 0.91, "latency_in_seconds": 11.2},
{"accuracy": 0.98, "precision": 0.78, "f1": 0.88, "latency_in_seconds": 87.6},
{"accuracy": 0.88, "precision": 0.78, "f1": 0.81, "latency_in_seconds": 101.6}
]
model_names = ["Model 1", "Model 2", "Model 3", "Model 4"]
plot = radar_plot(data=data, model_names=model_names)
plot.show()
We can now visually compare the four models and choose the optimal one based on one or more metrics.
Evaluating the model on a set of tasks
We can evaluate a model on several datasets at once. For this we can use an EvaluationSuite. For example, we are going to create a suite that evaluates a model on the imdb and sst2 datasets. First, let's look at these datasets; for that we use the load_dataset_builder function, so we don't have to download the complete datasets.
from datasets import load_dataset_builder

imdb = load_dataset_builder("imdb")
imdb.info.features
{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}
from datasets import load_dataset_builder

sst2 = load_dataset_builder("sst2")
sst2.info.features
{'idx': Value(dtype='int32', id=None),
 'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None)}
As we can see, with the imdb dataset we need to take the text column to get the text and the label column to get the target, while with the sst2 dataset we need to take the sentence column to get the text and the label column to get the target.
We create the evaluation suite for the two datasets:
import evaluate
from evaluate.evaluation_suite import SubTask

class Suite(evaluate.EvaluationSuite):
    def __init__(self, name):
        super().__init__(name)
        self.suite = [
            SubTask(
                task_type="text-classification",
                data="imdb",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "text",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            ),
            SubTask(
                task_type="text-classification",
                data="sst2",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "sentence",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            )
        ]
Note that with split="test[:1]" we only take one example of the test split, so that running this notebook does not take too long.
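If we wanted to run the suite we have just defined, rather than loading an existing one from the Hub as in the next cell, a minimal sketch (assuming direct instantiation of the Suite class works as written above, with an arbitrary suite name) would be:

# Hypothetical usage of the Suite class defined above
suite = Suite("sentiment-suite")                # the name is arbitrary
results = suite.run("lvwerra/distilbert-imdb")  # any text-classification model from the Hub
results

Note that the label_mapping defined in each SubTask must match the labels actually returned by the chosen model.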
Now we evaluate with the huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli model, using a sentiment evaluation suite already available on the Hub:
from evaluate import EvaluationSuite
suite = EvaluationSuite.load('mathemakitten/sentiment-evaluation-suite')
results = suite.run("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")
results