Evaluations

BLEU

sentence_bleu

texar.torch.evals.sentence_bleu(references: List[Union[str, List[str]]], hypothesis: Union[str, List[str]], max_order: int = 4, lowercase: bool = False, smooth: bool = False, use_bp: bool = True, return_all: bool = False) → Union[float, List[float]][source]

Calculates BLEU score of a hypothesis sentence.

Parameters:
  • references – A list of references for the hypothesis. Each reference can be either a list of string tokens, or a string of tokens separated by whitespace. A list can also be a NumPy array.
  • hypothesis – A hypothesis sentence. The hypothesis can be either a list of string tokens, or a string of tokens separated by whitespace. A list can also be a NumPy array.
  • max_order (int) – Maximum n-gram order to use when computing the BLEU score.
  • lowercase (bool) – If True, lowercase reference and hypothesis tokens.
  • smooth (bool) – Whether to apply smoothing (Lin et al., 2004).
  • use_bp (bool) – Whether to apply the brevity penalty.
  • return_all (bool) – If True, returns BLEU and all n-gram precisions.
Returns:

If return_all is False (default), returns a float32 BLEU score.

If return_all is True, returns a list of float32 scores: [BLEU] + n-gram precisions, of length max_order + 1.
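For intuition, here is a minimal pure-Python sketch of the computation a sentence-level BLEU score involves: clipped n-gram precisions up to max_order, their geometric mean, optional add-one smoothing, and a brevity penalty. This is an illustration under simplifying assumptions (e.g. shortest-reference length for the brevity penalty), not texar's implementation, and texar may scale the final score differently:

```python
import math
from collections import Counter
from typing import List

def _ngrams(tokens: List[str], n: int) -> Counter:
    # Count all n-grams of the given order in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sketch_sentence_bleu(references: List[List[str]],
                         hypothesis: List[str],
                         max_order: int = 4,
                         smooth: bool = False,
                         use_bp: bool = True) -> float:
    if not hypothesis:
        return 0.0
    precisions = []
    for n in range(1, max_order + 1):
        hyp_counts = _ngrams(hypothesis, n)
        # Clip each hypothesis n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, cnt in _ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        overlap = sum(min(cnt, max_ref[gram]) for gram, cnt in hyp_counts.items())
        total = sum(hyp_counts.values())
        if smooth:
            precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
        else:
            precisions.append(overlap / total if total > 0 else 0.0)
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    bleu = math.exp(sum(math.log(p) for p in precisions) / max_order)
    if use_bp:
        # Simplified brevity penalty using the shortest reference length.
        ref_len = min(len(r) for r in references)
        if len(hypothesis) < ref_len:
            bleu *= math.exp(1 - ref_len / len(hypothesis))
    return bleu
```

A perfect four-token match scores 1.0, while a hypothesis sharing no unigrams with any reference scores 0.0 (the geometric mean collapses when any precision is zero).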

corpus_bleu

texar.torch.evals.corpus_bleu(list_of_references: List[List[Union[str, List[str]]]], hypotheses: List[Union[str, List[str]]], max_order: int = 4, lowercase: bool = False, smooth: bool = False, use_bp: bool = True, return_all: bool = False) → Union[float, List[float]][source]

Computes corpus-level BLEU score.

Parameters:
  • list_of_references – A list of lists of references, one list per hypothesis. Each reference can be either a list of string tokens, or a string of tokens separated by whitespace. A list can also be a NumPy array.
  • hypotheses – A list of hypothesis sentences. Each hypothesis can be either a list of string tokens, or a string of tokens separated by whitespace. A list can also be a NumPy array.
  • max_order (int) – Maximum n-gram order to use when computing the BLEU score.
  • lowercase (bool) – If True, lowercase reference and hypothesis tokens.
  • smooth (bool) – Whether to apply smoothing (Lin et al., 2004).
  • use_bp (bool) – Whether to apply the brevity penalty.
  • return_all (bool) – If True, returns BLEU and all n-gram precisions.
Returns:

If return_all is False (default), returns a float32 BLEU score.

If return_all is True, returns a list of float32 scores: [BLEU] + n-gram precisions, of length max_order + 1.
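The key difference from averaging per-sentence BLEU scores is that corpus-level BLEU pools n-gram match counts and sentence lengths over the whole corpus before applying the precision and brevity-penalty formulas. A hedged pure-Python sketch of that pooling (shortest-reference length, no smoothing; illustrative only, not texar's code):

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    # Count all n-grams of the given order in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sketch_corpus_bleu(list_of_references, hypotheses, max_order=4, use_bp=True):
    matches = [0] * max_order   # pooled clipped n-gram matches, per order
    totals = [0] * max_order    # pooled hypothesis n-gram counts, per order
    ref_len = hyp_len = 0
    for refs, hyp in zip(list_of_references, hypotheses):
        hyp_len += len(hyp)
        ref_len += min(len(r) for r in refs)  # simplified: shortest reference
        for n in range(1, max_order + 1):
            hyp_counts = _ngrams(hyp, n)
            max_ref = Counter()
            for ref in refs:
                for g, c in _ngrams(ref, n).items():
                    max_ref[g] = max(max_ref[g], c)
            matches[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Precisions are computed once, from the pooled counts.
    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0:
        return 0.0
    bleu = math.exp(sum(math.log(p) for p in precisions) / max_order)
    if use_bp and 0 < hyp_len < ref_len:
        bleu *= math.exp(1 - ref_len / hyp_len)
    return bleu
```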

sentence_bleu_moses

texar.torch.evals.sentence_bleu_moses(references: List[Union[str, List[str]]], hypothesis: Union[str, List[str]], lowercase: bool = False, return_all: bool = False) → Union[float, List[float]][source]

Calculates BLEU score of a hypothesis sentence using the MOSES `multi-bleu.perl` script.

Parameters:
  • references – A list of references for the hypothesis. Each reference can be either a string, or a list of string tokens. A list can also be a NumPy array.
  • hypothesis – A hypothesis sentence. The hypothesis can be either a string, or a list of string tokens. A list can also be a NumPy array.
  • lowercase (bool) – If True, pass the "-lc" flag to the multi-bleu script.
  • return_all (bool) – If True, returns BLEU and all n-gram precisions.
Returns:

If return_all is False (default), returns a float32 BLEU score.

If return_all is True, returns a list of 5 float32 scores: [BLEU, 1-gram precision, ..., 4-gram precision].

corpus_bleu_moses

texar.torch.evals.corpus_bleu_moses(list_of_references: List[List[Union[str, List[str]]]], hypotheses: List[Union[str, List[str]]], lowercase: bool = False, return_all: bool = False) → Union[float, List[float]][source]

Calculates corpus-level BLEU score using the MOSES `multi-bleu.perl` script.

Parameters:
  • list_of_references – A list of lists of references, one list per hypothesis. Each reference can be either a string, or a list of string tokens. A list can also be a NumPy array.
  • hypotheses – A list of hypothesis sentences. Each hypothesis can be either a string, or a list of string tokens. A list can also be a NumPy array.
  • lowercase (bool) – If True, pass the "-lc" flag to the multi-bleu script.
  • return_all (bool) – If True, returns BLEU and all n-gram precisions.
Returns:

If return_all is False (default), returns a float32 BLEU score.

If return_all is True, returns a list of 5 float32 scores: [BLEU, 1-gram precision, ..., 4-gram precision].

corpus_bleu_transformer

texar.torch.evals.corpus_bleu_transformer(reference_corpus: List[List[str]], translation_corpus: List[List[str]], max_order: int = 4, use_bp: bool = True) → float[source]

Computes BLEU score of translated segments against references.

This BLEU variant was used to evaluate the Transformer (Vaswani et al., “Attention Is All You Need”) on machine translation. The resulting BLEU scores are usually slightly higher than those from texar.torch.evals.corpus_bleu and texar.torch.evals.corpus_bleu_moses.

Parameters:
  • reference_corpus – A list of references, one per translation. Each reference should be tokenized into a list of tokens.
  • translation_corpus – A list of translations to score. Each translation should be tokenized into a list of tokens.
  • max_order – Maximum n-gram order to use when computing the BLEU score.
  • use_bp – Whether to apply the brevity penalty.
Returns:

BLEU score.

bleu_transformer_tokenize

texar.torch.evals.bleu_transformer_tokenize(string: str) → List[str][source]

Tokenize a string following the official BLEU implementation.

The BLEU scores produced by multi-bleu.perl depend on your tokenizer, which makes them hard to reproduce and to compare across users. This function provides a standard tokenization following mteval-v14.pl.

See https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v14.pl#L954-L983. In our case, the input string is expected to be a single line, and no de-escaping of HTML entities is needed. So we simply tokenize on punctuation and symbols, except when a punctuation mark is both preceded and followed by a digit (e.g., a comma or dot used as a thousands or decimal separator).

Note that a number (e.g., a year) followed by a dot at the end of a sentence is NOT tokenized, i.e., the dot stays with the number, because s/(\p{P})(\P{N})/ $1 $2/g does not match this case (unless we add a space after each sentence). However, this behavior is present in the original mteval-v14.pl as well, and we want to be consistent with it.

Parameters:
  • string – The input string.
Returns:

A list of tokens.
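As a rough illustration of the rule described above (split punctuation away from neighboring text, except punctuation with a digit on both sides), here is a simplified sketch. It uses only ASCII punctuation; the real mteval-v14.pl uses full Unicode punctuation (\p{P}) and symbol (\p{S}) classes, so this is not a drop-in replacement for bleu_transformer_tokenize:

```python
import re
import string

# ASCII punctuation class; the real script uses Unicode \p{P} and \p{S}.
_PUNCT = re.escape(string.punctuation)
# Split punctuation preceded by a non-digit, and punctuation followed by a
# non-digit; punctuation with digits on BOTH sides is left intact.
_NONDIGIT_PUNCT = re.compile(rf"([^\d])([{_PUNCT}])")
_PUNCT_NONDIGIT = re.compile(rf"([{_PUNCT}])([^\d])")

def sketch_bleu_tokenize(s: str):
    s = _NONDIGIT_PUNCT.sub(r"\1 \2 ", s)
    s = _PUNCT_NONDIGIT.sub(r" \1 \2", s)
    return s.split()
```

Here "1,234.56" survives intact because its comma and dot sit between digits, while sentence-final punctuation is split off as its own token.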

file_bleu

texar.torch.evals.file_bleu(ref_filename: str, hyp_filename: str, bleu_version: str = 'corpus_bleu_transformer', case_sensitive: bool = False) → float[source]

Compute BLEU for two files (reference and hypothesis translation).

Parameters:
  • ref_filename – Reference file path.
  • hyp_filename – Hypothesis file path.
  • bleu_version – Name of the BLEU computation method to use. Must be one of: corpus_bleu, corpus_bleu_moses, corpus_bleu_transformer.
  • case_sensitive – If False, lowercase reference and hypothesis tokens.
Returns:

BLEU score.
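A file-level wrapper like this can be structured as: read the two parallel files line by line, tokenize, optionally lowercase, and dispatch to a corpus-level scorer selected by name. The sketch below illustrates that shape only; the scorers dict, the scorer signature, and the single-reference-per-line pairing are assumptions, not texar's implementation:

```python
from typing import Callable, Dict, List

def sketch_file_bleu(ref_filename: str,
                     hyp_filename: str,
                     scorers: Dict[str, Callable],
                     bleu_version: str = "corpus_bleu_transformer",
                     case_sensitive: bool = False) -> float:
    def read_lines(path: str) -> List[List[str]]:
        # One tokenized sentence per line; lowercase unless case_sensitive.
        with open(path, encoding="utf-8") as f:
            return [(line if case_sensitive else line.lower()).split()
                    for line in f]
    # Pair each hypothesis line with a single reference line.
    refs = [[tokens] for tokens in read_lines(ref_filename)]
    hyps = read_lines(hyp_filename)
    if bleu_version not in scorers:
        raise ValueError(f"Unknown bleu_version: {bleu_version!r}")
    return scorers[bleu_version](refs, hyps)
```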

Accuracy

accuracy

texar.torch.evals.accuracy(labels: torch.Tensor, preds: torch.Tensor) → torch.Tensor[source]

Calculates the accuracy of predictions.

Parameters:
  • labels – The ground truth values. A Tensor of the same shape as preds.
  • preds – A Tensor of any shape containing the predicted values.
Returns:

A float scalar Tensor containing the accuracy.
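The semantics reduce to the fraction of positions where the prediction equals the ground truth. A plain-Python sketch of that computation (texar's version operates on tensors of any shape and returns a scalar Tensor):

```python
from typing import Sequence

def sketch_accuracy(labels: Sequence, preds: Sequence) -> float:
    # Fraction of positions where the prediction equals the ground truth.
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same shape")
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)
```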

binary_clas_accuracy

texar.torch.evals.binary_clas_accuracy(pos_preds: Optional[torch.Tensor] = None, neg_preds: Optional[torch.Tensor] = None) → Optional[torch.Tensor][source]

Calculates the accuracy of binary predictions.

Parameters:
  • pos_preds (optional) – A Tensor of any shape containing the predicted values on positive data (i.e., ground truth labels are 1).
  • neg_preds (optional) – A Tensor of any shape containing the predicted values on negative data (i.e., ground truth labels are 0).
Returns:

A float scalar Tensor containing the accuracy.
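A plausible reading of these semantics in plain Python: accuracy pooled over the positive examples (correct when predicted 1) and the negative examples (correct when predicted 0), returning None when neither set is given. The exact weighting in texar may differ; this is an illustrative sketch, not the library's implementation:

```python
from typing import Optional, Sequence

def sketch_binary_clas_accuracy(pos_preds: Optional[Sequence] = None,
                                neg_preds: Optional[Sequence] = None
                                ) -> Optional[float]:
    # Positive examples are correct when predicted 1, negative examples
    # when predicted 0; returns None if neither set of predictions is given.
    if pos_preds is None and neg_preds is None:
        return None
    correct = total = 0
    if pos_preds is not None:
        correct += sum(p == 1 for p in pos_preds)
        total += len(pos_preds)
    if neg_preds is not None:
        correct += sum(p == 0 for p in neg_preds)
        total += len(neg_preds)
    return correct / total if total else 0.0
```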