Hugging Face evaluate
The Hugging Face Evaluate library makes it easy to evaluate models and datasets.
With a single line of code, you have access to dozens of evaluation methods for different domains (NLP, computer vision, reinforcement learning and more). Whether on your local machine, or in a distributed training setup, you can evaluate models in a consistent and reproducible manner.
A complete list of available metrics can be found on the Hugging Face evaluate page. Each metric has a dedicated Hugging Face Space with an interactive demonstration of how to use it and a documentation card detailing its use and limitations.
This notebook has been automatically translated to make it accessible to more people; please let me know if you see any typos.
Installation
To install the library, run the following:
pip install evaluate
Types of evaluations
Several types of evaluations are available; a short usage sketch of each follows the list.

- Metric: A metric is used to evaluate the performance of a model and usually takes model predictions and ground truth labels.
- Comparison: Used to compare two models. This can be done, for example, by comparing their predictions with ground truth labels.
- Measurement: The dataset is as important as the model trained on it. With measurements, the properties of a dataset can be investigated.
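As a rough sketch of how each type is used (exact_match and word_length are just illustrative modules here, and the exact argument names should be checked on each module's card):

import evaluate

# Metric: compare model predictions against ground truth labels
accuracy = evaluate.load("accuracy")
accuracy.compute(predictions=[0, 1, 1], references=[0, 1, 0])

# Comparison: compare the predictions of two models against each other
exact_match = evaluate.load("exact_match", module_type="comparison")
exact_match.compute(predictions1=[0, 1, 1], predictions2=[1, 1, 1])

# Measurement: inspect properties of a dataset rather than of a model
word_length = evaluate.load("word_length", module_type="measurement")
word_length.compute(data=["hello world", "evaluate makes evaluation easy"])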
Load
Each metric, comparison, or measurement can be loaded with the load method:
import evaluate

accuracy = evaluate.load("accuracy")
accuracy
EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`. A higher score means higher accuracy.

Examples:

    Example 1 - A simple example
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
        {'accuracy': 0.5}

    Example 2 - The same as Example 1, except with `normalize` set to `False`.
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
        >>> print(results)
        {'accuracy': 3.0}

    Example 3 - The same as Example 1, except with `sample_weight` set.
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
        >>> print(results)
        {'accuracy': 0.8778625954198473}
""", stored examples: 0)
If you want to make sure you are loading the right type of module, whether metric, comparison, or measurement, you can do so by adding the module_type parameter:
import evaluate

accuracy = evaluate.load("accuracy", module_type="metric")
word_length = evaluate.load("word_length", module_type="measurement")
[nltk_data] Downloading package punkt to
[nltk_data]     /home/maximo.fernandez/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Community module loading
In addition to the modules offered by the library, you can also load modules that other people have uploaded to the Hugging Face Hub.
element_count = evaluate.load("lvwerra/element_count", module_type="measurement")
List of available modules
If we want to get a list of all the available modules, we can use the list_evaluation_modules function, to which we can pass search filters:
evaluate.list_evaluation_modules(
    module_type="comparison",
    include_community=True,
    with_details=True
)
[{'name': 'ncoop57/levenshtein_distance',
  'type': 'comparison',
  'community': True,
  'likes': 0},
 {'name': 'kaleidophon/almost_stochastic_order',
  'type': 'comparison',
  'community': True,
  'likes': 1}]
Module attributes
All evaluation modules come with a variety of useful attributes that help you use the module.
| Attribute | Description |
|---|---|
| description | A brief description of the evaluation module. |
| citation | A BibTeX string to cite when available. |
| features | A Features object that defines the input format. |
| inputs_description | This is equivalent to the module's documentation string. |
| homepage | The home page of the module. |
| license | The license of the module. |
| codebase_urls | Links to the code behind the module. |
| reference_urls | Additional reference URLs. |
Let's take a look at some of them
accuracy = evaluate.load("accuracy")

print(f"description: {accuracy.description}")
print(f"\ncitation: {accuracy.citation}")
print(f"\nfeatures: {accuracy.features}")
print(f"\ninputs_description: {accuracy.inputs_description}")
print(f"\nhomepage: {accuracy.homepage}")
print(f"\nlicense: {accuracy.license}")
print(f"\ncodebase_urls: {accuracy.codebase_urls}")
print(f"\nreference_urls: {accuracy.reference_urls}")
description:
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative

citation:
@article{scikit-learn,
  title={Scikit-learn: Machine Learning in {P}ython},
  author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
          and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
          and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
          Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
  journal={Journal of Machine Learning Research},
  volume={12},
  pages={2825--2830},
  year={2011}
}

features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}

inputs_description:
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`. A higher score means higher accuracy.

Examples:

    Example 1 - A simple example
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
        {'accuracy': 0.5}

    Example 2 - The same as Example 1, except with `normalize` set to `False`.
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
        >>> print(results)
        {'accuracy': 3.0}

    Example 3 - The same as Example 1, except with `sample_weight` set.
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
        >>> print(results)
        {'accuracy': 0.8778625954198473}

homepage:

license:

codebase_urls: []

reference_urls: ['https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html']
Execution
Now that we know how the evaluation module works and what it should contain, let's use it. When it comes to calculating the evaluation, there are two main ways to do it:
- All in one
- Incremental
In the incremental approach, the required entries are added to the module with EvaluationModule.add() or EvaluationModule.add_batch(), and the score is computed at the end with EvaluationModule.compute(). Alternatively, all entries can be passed at once to compute().
Let's look at these two approaches.
All in one
Once we have all the predictions and the ground truth, we can calculate the metric: with the module defined, we pass it the predictions and the ground truth through its compute() method.
accuracy = evaluate.load("accuracy")

predictions = [1, 0, 0, 1]
targets = [0, 1, 0, 1]

accuracy_value = accuracy.compute(predictions=predictions, references=targets)
accuracy_value
{'accuracy': 0.5}
Incremental
In many evaluation processes, predictions are built iteratively, as in a for loop. In that case, you could store the predictions and ground truth in a list and at the end pass them to compute().
However, with the add() and add_batch() methods you can avoid the step of storing the predictions.
If you generate the predictions one at a time, use the add() method.
for ref, pred in zip([0, 1, 0, 1], [1, 0, 0, 1]):
    accuracy.add(references=ref, predictions=pred)

accuracy_value = accuracy.compute()
accuracy_value
{'accuracy': 0.5}
However, when the predictions come in batches, use the add_batch() method.
for refs, preds in zip([[0, 1], [0, 1]], [[1, 0], [0, 1]]):
    accuracy.add_batch(references=refs, predictions=preds)

accuracy_value = accuracy.compute()
accuracy_value
{'accuracy': 0.5}
Combination of several evaluations
Often, one wants to evaluate not just a single metric but a variety of metrics that capture different aspects of a model. For example, for classification it is usually a good idea to calculate F1, recall, and precision in addition to accuracy to get a better picture of model performance. Evaluate allows you to load a bunch of metrics and call them sequentially, but the most convenient way is to use the combine() function to group them.
clasification_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

predictions = [0, 1, 0]
targets = [0, 1, 1]

clasification_metrics.compute(predictions=predictions, references=targets)
{'accuracy': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'precision': 1.0,
 'recall': 0.5}
Save results
We can save the evaluation results to a file with the save() method by passing it a path. We can also pass extra key-value pairs, such as an experiment identifier:
references = [0, 1, 0, 1]
targets = [1, 0, 0, 1]

result = accuracy.compute(references=references, predictions=targets)

hyperparams = {"model": "bert-base-uncased"}
evaluate.save("./results/", experiment="run 42", **result, **hyperparams)
PosixPath('results/result-2024_04_25-17_45_41.json')
As we can see, we created a hyperparams variable to pass to the save() method. This normally will not be necessary, because we will already have the hyperparameters of the model we are training.
This creates a JSON file with all the information:
import pathlib

path = pathlib.Path("./results/")
files = list(path.glob("*"))
files
[PosixPath('results/result-2024_04_25-17_45_41.json')]
import json

result_file = files[0]
result_json = pathlib.Path(result_file).read_text()
result_dict = json.loads(result_json)
result_dict
{'experiment': 'run 42',
 'accuracy': 0.5,
 'model': 'bert-base-uncased',
 '_timestamp': '2024-04-25T17:45:41.218287',
 '_git_commit_hash': '8725338b6bf9c97274685df41b2ee6e11319a735',
 '_evaluate_version': '0.4.1',
 '_python_version': '3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0]',
 '_interpreter_path': '/home/maximo.fernandez/miniconda3/envs/nlp/bin/python'}
Upload results to the hub
If we are training a model, we can upload the evaluation results to its model card with the push_to_hub() method, so that they appear on the model page.
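As an orientation, a call might look like the following sketch; the model_id and the values here are made up, and the exact parameter names should be checked against the Evaluate documentation:

import evaluate

# Hypothetical example: report an accuracy value on a model's card
evaluate.push_to_hub(
    model_id="my-user/my-finetuned-model",  # hypothetical model repository on the Hub
    metric_type="accuracy",                 # metric id
    metric_name="Accuracy",                 # display name of the metric
    metric_value=0.933,                     # value obtained in the evaluation
    dataset_type="imdb",                    # dataset id on the Hub
    dataset_name="IMDb",                    # display name of the dataset
    dataset_split="test",                   # split the evaluation was run on
    task_type="text-classification",        # task id
    task_name="Text Classification",        # display name of the task
)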
Evaluator
If we have a model, a dataset, and a metric, we could run inference over the whole dataset ourselves and then pass the predictions and the true labels to the metric to obtain the scores. Or we can hand everything to the library and let it do the work for us: with the evaluator() function we pass it the model, the dataset, and the metric, and it does everything for us.
First we define the model, the dataset and the metric
from transformers import pipeline
from datasets import load_dataset
from evaluate import evaluator
import evaluate

model_pipeline = pipeline("text-classification", model="lvwerra/distilbert-imdb", device=0)
dataset = load_dataset("imdb", split="test").shuffle().select(range(1000))
metric = evaluate.load("accuracy")
Now we create a task evaluator with evaluator() and pass everything to its compute() method:
task_evaluator = evaluator("text-classification")

results = task_evaluator.compute(
    model_or_pipeline=model_pipeline,
    data=dataset,
    metric=metric,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
results
{'accuracy': 0.933,
 'total_time_in_seconds': 29.43192940400013,
 'samples_per_second': 33.97670557962431,
 'latency_in_seconds': 0.02943192940400013}
Thanks to the evaluator, we obtained the model's metrics without having to run the inference ourselves.
Visualization
Sometimes we get different metrics for different models, which makes them difficult to compare, so plots make the comparison easier. The Evaluate library offers several visualizations through its visualization module, for example the radar_plot() function. We have to pass it the data as a list of dictionaries, where every dictionary must have the same keys.
In order to use this function, it is necessary to have the matplotlib library installed:
pip install matplotlib
import evaluate
from evaluate.visualization import radar_plot
data = [
{"accuracy": 0.99, "precision": 0.8, "f1": 0.95, "latency_in_seconds": 33.6},
{"accuracy": 0.98, "precision": 0.87, "f1": 0.91, "latency_in_seconds": 11.2},
{"accuracy": 0.98, "precision": 0.78, "f1": 0.88, "latency_in_seconds": 87.6},
{"accuracy": 0.88, "precision": 0.78, "f1": 0.81, "latency_in_seconds": 101.6}
]
model_names = ["Model 1", "Model 2", "Model 3", "Model 4"]
plot = radar_plot(data=data, model_names=model_names)
plot.show()
We can now visually compare the four models and choose the optimal one based on one or more metrics.
Evaluating the model on a set of tasks
We can evaluate a model on several datasets at once. For this we can use an EvaluationSuite. For example, we are going to create a suite that evaluates a model on the imdb and sst2 datasets. First, let's look at these datasets; for that we use the load_dataset_builder function, so we don't have to download the complete datasets.
from datasets import load_dataset_builder

imdb = load_dataset_builder("imdb")
imdb.info.features
{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}
from datasets import load_dataset_builder

sst2 = load_dataset_builder("sst2")
sst2.info.features
{'idx': Value(dtype='int32', id=None),
 'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None)}
As we can see, with the imdb dataset we need to take the text column to get the text and the label column to get the target, while with the sst2 dataset we need to take the sentence column to get the text and the label column to get the target.
We create the evaluation suite for the two datasets:
import evaluate
from evaluate.evaluation_suite import SubTask

class Suite(evaluate.EvaluationSuite):
    def __init__(self, name):
        super().__init__(name)
        self.suite = [
            SubTask(
                task_type="text-classification",
                data="imdb",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "text",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            ),
            SubTask(
                task_type="text-classification",
                data="sst2",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "sentence",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            )
        ]
Note that with split="test[:1]" we only take one example of the test split, so that running this notebook does not take too long.
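If we wanted to run the suite we have just defined, rather than loading an existing one from the Hub as in the next cell, a minimal sketch (assuming direct instantiation of the Suite class works as written above, with an arbitrary suite name) would be:

# Hypothetical usage of the Suite class defined above
suite = Suite("sentiment-suite")                # the name is arbitrary
results = suite.run("lvwerra/distilbert-imdb")  # any text-classification model from the Hub
results

Note that the label_mapping defined in each SubTask must match the labels actually returned by the chosen model.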
Now we evaluate with the huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli model, using a sentiment evaluation suite already available on the Hub:
from evaluate import EvaluationSuite
suite = EvaluationSuite.load('mathemakitten/sentiment-evaluation-suite')
results = suite.run("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")
results