# bleuEvaluationScore

Evaluate translation or summarization with BLEU similarity score

## Syntax

```
score = bleuEvaluationScore(candidate,references)
score = bleuEvaluationScore(candidate,references,'NgramWeights',ngramWeights)
```

## Description

The BiLingual Evaluation Understudy (BLEU) scoring algorithm evaluates the similarity between a candidate document and a collection of reference documents. Use the BLEU score to evaluate the quality of document translation and summarization models.


`score = bleuEvaluationScore(candidate,references)` returns the BLEU similarity score between the specified candidate document and the reference documents. The function computes n-gram overlaps between `candidate` and `references` for n-gram lengths one through four, with equal weighting. For more information, see BLEU Score.


`score = bleuEvaluationScore(candidate,references,'NgramWeights',ngramWeights)` uses the specified n-gram weighting, where `ngramWeights(i)` corresponds to the weight for n-grams of length `i`. The length of the weight vector determines the range of n-gram lengths to use for the BLEU score evaluation.

## Examples

### Evaluate Summary

Create an array of tokenized documents and extract a summary using the `extractSummary` function.

```
str = [
    "The fox jumped over the dog."
    "The fast brown fox jumped over the lazy dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);
summary = extractSummary(documents)
```
```
summary =
  tokenizedDocument:

   10 tokens: The fast brown fox jumped over the lazy dog .
```

Specify the reference documents as a `tokenizedDocument` array.

```
str = [
    "The quick brown animal jumped over the lazy dog."
    "The quick brown fox jumped over the lazy dog."];
references = tokenizedDocument(str);
```

Calculate the BLEU score between the summary and the reference documents using the `bleuEvaluationScore` function.

```
score = bleuEvaluationScore(summary,references)
```
```
score = 0.7825
```

This score indicates a fairly good similarity. A BLEU score close to one indicates strong similarity.

### Specify N-Gram Weights

Create an array of tokenized documents and extract a summary using the `extractSummary` function.

```
str = [
    "The fox jumped over the dog."
    "The fast brown fox jumped over the lazy dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);
summary = extractSummary(documents)
```
```
summary =
  tokenizedDocument:

   10 tokens: The fast brown fox jumped over the lazy dog .
```

Specify the reference documents as a `tokenizedDocument` array.

```
str = [
    "The quick brown animal jumped over the lazy dog."
    "The quick brown fox jumped over the lazy dog."];
references = tokenizedDocument(str);
```

Calculate the BLEU score between the candidate document and the reference documents using the default options. The `bleuEvaluationScore` function, by default, uses n-grams of length one through four with equal weights.

```
score = bleuEvaluationScore(summary,references)
```
```
score = 0.7825
```

Given that the summary document differs from one of the reference documents by only one word, this score might suggest a lower similarity than expected. This behavior occurs because the function uses n-grams that are too long for such a short document.

To address this, use shorter n-grams by setting the `'NgramWeights'` option to a shorter vector. Calculate the BLEU score again using only unigrams and bigrams by specifying a two-element vector of equal weights.

```
score = bleuEvaluationScore(summary,references,'NgramWeights',[0.5 0.5])
```
```
score = 0.8367
```

This score suggests a better similarity than before.

## Input Arguments


`candidate` — Candidate document, specified as a `tokenizedDocument` scalar, a string array, or a cell array of character vectors. If `candidate` is not a `tokenizedDocument` scalar, then it must be a row vector representing a single document, where each element is a word.

`references` — Reference documents, specified as a `tokenizedDocument` array, a string array, or a cell array of character vectors. If `references` is not a `tokenizedDocument` array, then it must be a row vector representing a single document, where each element is a word. To evaluate against multiple reference documents, use a `tokenizedDocument` array.

`ngramWeights` — N-gram weights, specified as a row vector of finite nonnegative values, where `ngramWeights(i)` corresponds to the weight for n-grams of length `i`. The length of the weight vector determines the range of n-gram lengths to use for the BLEU score evaluation. The function normalizes the n-gram weights to sum to one.

Tip

If the number of words in `candidate` is smaller than the number of elements in `ngramWeights`, then the resulting BLEU score is zero. To ensure that `bleuEvaluationScore` returns nonzero scores for very short documents, set `ngramWeights` to a vector with fewer elements than the number of words in `candidate`.

Data Types: `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`
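Because the function normalizes the weights, scaling the weight vector does not change the score. For example, reusing the `summary` and `references` variables from the Examples section, the following calls are expected to be equivalent:

```
% [1 1] is normalized to [0.5 0.5] internally, so these
% two calls compute the same unigram-and-bigram BLEU score.
score1 = bleuEvaluationScore(summary,references,'NgramWeights',[1 1]);
score2 = bleuEvaluationScore(summary,references,'NgramWeights',[0.5 0.5]);
```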

## Output Arguments


`score` — BLEU score, returned as a scalar value in the range [0,1] or `NaN`.

A BLEU score close to zero indicates poor similarity between `candidate` and `references`. A BLEU score close to one indicates strong similarity. If `candidate` is identical to one of the reference documents, then `score` is 1. If `candidate` and `references` are both empty documents, then `score` is `NaN`. For more information, see BLEU Score.


## Algorithms


### BLEU Score

The BiLingual Evaluation Understudy (BLEU) scoring algorithm [1] evaluates the similarity between a candidate document and a collection of reference documents. Use the BLEU score to evaluate the quality of document translation and summarization models.

To compute the BLEU score, the algorithm uses n-gram counts, clipped n-gram counts, modified n-gram precision scores, and a brevity penalty.

The clipped n-gram counts function $\mathrm{Count}_{\mathrm{clip}}$, if necessary, truncates the count for each n-gram so that it does not exceed the largest count observed for that n-gram in any single reference document. The clipped counts function is given by

$$\mathrm{Count}_{\mathrm{clip}}(\text{n-gram}) = \min\bigl(\mathrm{Count}(\text{n-gram}),\ \mathrm{MaxRefCount}(\text{n-gram})\bigr),$$

where $\mathrm{Count}(\text{n-gram})$ denotes the count of an n-gram in the candidate and $\mathrm{MaxRefCount}(\text{n-gram})$ is the largest count of that n-gram observed in any single reference document.
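As a concrete illustration with hypothetical counts (not toolbox code), consider a candidate that repeats the unigram "the" four times while no single reference contains it more than once:

```
% Clipping the count of the unigram "the":
candidateCount = 4;                              % Count("the") in the candidate
maxRefCount = 1;                                 % largest count of "the" in any one reference
clippedCount = min(candidateCount,maxRefCount)   % clipped count is 1
```

Clipping prevents a candidate from earning credit for repeating a matching n-gram more often than any reference does.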

The modified n-gram precision scores are given by

$$p_n = \frac{\displaystyle\sum_{C \in \{\text{Candidates}\}}\ \sum_{\text{n-gram} \in C} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}{\displaystyle\sum_{C' \in \{\text{Candidates}\}}\ \sum_{\text{n-gram}' \in C'} \mathrm{Count}(\text{n-gram}')},$$

where $n$ corresponds to the n-gram length and $\{\text{Candidates}\}$ is the set of sentences in the candidate documents.

Given a vector of n-gram weights $w$, the BLEU score is given by

$$\text{bleuScore} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log \bar{p}_n\right),$$

where $N$ is the largest n-gram length, the entries in $\bar{p}$ correspond to the geometric averages of the modified n-gram precisions, and $\mathrm{BP}$ is the brevity penalty given by

$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1-r/c} & \text{if } c \le r, \end{cases}$$

where $c$ is the length of the candidate document and $r$ is the length of the reference document with length closest to the candidate length.
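The computation above can be sketched in MATLAB for a single tokenized candidate. This is a minimal illustration under stated assumptions, not the toolbox implementation: `localBleu` and `localNgramCounts` are hypothetical helper functions, the candidate and each reference are string arrays of tokens, and per-sentence aggregation over multiple candidate sentences is omitted.

```
function score = localBleu(candidate,references,weights)
% Minimal BLEU sketch: one candidate (string array of tokens) against
% a cell array of reference token arrays. Illustrative only.
N = numel(weights);
w = weights / sum(weights);            % normalize weights to sum to one
logP = -inf(1,N);
for n = 1:N
    candCounts = localNgramCounts(candidate,n);
    if isempty(candCounts.keys)
        continue                       % candidate shorter than n: score is zero
    end
    total = sum(cell2mat(candCounts.values));
    clipped = 0;
    for key = candCounts.keys
        maxRef = 0;                    % largest count in any single reference
        for r = 1:numel(references)
            refCounts = localNgramCounts(references{r},n);
            if refCounts.isKey(key{1})
                maxRef = max(maxRef,refCounts(key{1}));
            end
        end
        clipped = clipped + min(candCounts(key{1}),maxRef);
    end
    logP(n) = log(clipped/total);      % modified n-gram precision
end
% Brevity penalty: r is the reference length closest to the candidate length c.
c = numel(candidate);
refLens = cellfun(@numel,references);
[~,idx] = min(abs(refLens - c));
r = refLens(idx);
if c > r
    BP = 1;
else
    BP = exp(1 - r/c);
end
score = BP * exp(sum(w .* logP));
end

function counts = localNgramCounts(tokens,n)
% Count the n-grams of length n in a string array of tokens.
counts = containers.Map('KeyType','char','ValueType','double');
for i = 1:numel(tokens)-n+1
    key = char(strjoin(tokens(i:i+n-1)," "));
    if counts.isKey(key)
        counts(key) = counts(key) + 1;
    else
        counts(key) = 1;
    end
end
end
```

Note that a single zero clipped count drives the corresponding $\log \bar{p}_n$ to $-\infty$ and the overall score to zero, which is why short candidates score zero when the weight vector is longer than the document.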

## References

[1] Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: A Method for Automatic Evaluation of Machine Translation." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318. Association for Computational Linguistics, 2002.