mbrs.metrics package#
Submodules#
Module contents#
- class mbrs.metrics.Metric(cfg: Config)[source]#
Bases:
MetricBaseBase metric class.
- corpus_score(hypotheses: list[str], references_lists: list[list[str]], sources: list[str] | None = None) float[source]#
Calculate the corpus-level score.
- expected_scores(hypotheses: list[str], references: list[str], source: str | None = None, reference_lprobs: Tensor | None = None) Tensor[source]#
Calculate the expected scores for each hypothesis.
- Parameters:
- Returns:
The expected scores for each hypothesis.
- Return type:
Tensor
- pairwise_scores(hypotheses: list[str], references: list[str], source: str | None = None) Tensor[source]#
Calculate the pairwise scores.
- abstract score(hypothesis: str, reference: str, source: str | None = None) float[source]#
Calculate the score of the given hypothesis.
- class mbrs.metrics.MetricAggregatable(cfg: Config)[source]#
Bases:
MetricBase class for aggregatable metrics.
This class supports reference aggregation.
- abstract expected_scores_reference_aggregation(hypotheses: list[str], references: list[str], source: str | None = None, reference_lprobs: Tensor | None = None) Tensor[source]#
Calculate the expected scores for each hypothesis.
- Parameters:
- Returns:
The expected scores for each hypothesis.
- Return type:
Tensor
- class mbrs.metrics.MetricAggregatableCache(cfg: Config)[source]#
Bases:
MetricAggregatable,MetricCacheableBase class for metrics that can aggregate the cache.
This class supports to aggregate intermediate representations of sentences.
- class Cache[source]#
Bases:
CacheIntermediate representations of sentences.
- abstract aggregate(reference_lprobs: Tensor | None = None) Cache[source]#
Aggregate the cached representations.
- Parameters:
reference_lprobs (Tensor, optional) – Log-probabilities for each reference sample. The shape must be (len(references),). See https://arxiv.org/abs/2311.05263.
- Returns:
An aggregated representation.
- Return type:
- class mbrs.metrics.MetricBERTScore(cfg: Config)[source]#
Bases:
MetricCacheableBERTScore metric class.
- class Cache(embeddings: list[Tensor], idf_weights: list[Tensor])[source]#
Bases:
CacheIntermediate representations of sentences.
- embeddings (list[Tensor]): A list of token embeddings of shape (T, D),
where T is the length of sequence, and D is a size of the embedding.
idf_weights (list[Tensor]): A list of IDF weights of shape (T,).
- class Config(score_type: BERTScoreScoreType = BERTScoreScoreType.f1, model_type: str | None = None, num_layers: int | None = None, batch_size: int = 64, nthreads: int = 4, idf: bool = False, idf_sents: list[str] | None = None, lang: str | None = None, rescale_with_baseline: bool = False, baseline_path: str | None = None, use_fast_tokenizer: bool = False, fp16: bool = False, bf16: bool = False, cpu: bool = False)[source]#
Bases:
ConfigBERTScore metric configuration.
- score_type (BERTScoreScoreType): The output score type, i.e.,
precision, recall, or f1.
- model_type (str): Contexual embedding model specification, default using the
suggested model for the target langauge; has to specify at least one of model_type or lang.
- num_layers (int): The layer of representation to use. Default using the number
of layer tuned on WMT16 correlation data.
- idf (bool): A booling to specify whether to use idf or not. (This should be
True even if idf_sents is given.)
idf_sents (list[str]): List of sentences used to compute the idf weights.
batch_size (int): Bert score processing batch size
nthreads (int): Number of threads.
- lang (str): Language of the sentences; has to specify at least one of
model_type or lang. lang needs to be specified when rescale_with_baseline is True.
rescale_with_baseline (bool): Rescale bertscore with pre-computed baseline.
baseline_path (str): Customized baseline file.
use_fast_tokenizer (bool): use_fast parameter passed to HF tokenizer.
fp16 (bool): Use float16 for the forward computation.
bf16 (bool): Use bfloat16 for the forward computation.
cpu (bool): Use CPU for the forward computation.
- score_type: BERTScoreScoreType = 2#
- corpus_score(hypotheses: list[str], references_lists: list[list[str]], sources: list[str] | None = None) float[source]#
Calculate the corpus-level score.
- property device: device#
Returns the device of the model.
- encode(sentences: list[str]) Cache[source]#
Encode the given sentences into their intermediate representations.
- out_proj(hypotheses_ir: Cache, references_ir: Cache, sources_ir: Cache | None = None) Tensor[source]#
Forward the output projection layer.
- class mbrs.metrics.MetricBLEU(cfg: Config)[source]#
Bases:
MetricAggregatableBLEU metric class.
- class AggregatedReference(ngrams: Counter[tuple[str, ...]], length: float)[source]#
Bases:
objectAggregated reference representation.
ngrams (Counter[tuple[str, …]]): Bags of expected n-gram counts.
length (float): Expected length of references.
- class Config(lowercase: bool = False, force: bool = False, tokenize: str | None = None, smooth_method: str = 'exp', smooth_value: float | None = None, max_ngram_order: int = 4, effective_order: bool = True, trg_lang: str = '', num_workers: int = 8)[source]#
Bases:
ConfigBLEU metric configuration.
lowercase (bool): If True, lowercased BLEU is computed.
force (bool): Ignore data that looks already tokenized.
tokenize (str, optional): The tokenizer to use. If None, defaults to language-specific tokenizers with ‘13a’ as the fallback default.
smooth_method (str): The smoothing method to use (‘floor’, ‘add-k’, ‘exp’ or ‘none’).
smooth_value (float, optional): The smoothing value for floor and add-k methods. None falls back to default value.
max_ngram_order (int): If given, it overrides the maximum n-gram order (default: 4) when computing precisions.
effective_order (bool): If True, stop including n-gram orders for which precision is 0. This should be True, if sentence-level BLEU will be computed. (default: True)
trg_lang (str): An optional language code to raise potential tokenizer warnings.
num_workers (int): Number of workers for multiprocessing.
- corpus_score(hypotheses: list[str], references_lists: list[list[str]], sources: list[str] | None = None) float[source]#
Calculate the corpus-level score.
- expected_scores_reference_aggregation(hypotheses: list[str], references: list[str], source: str | None = None, reference_lprobs: Tensor | None = None) Tensor[source]#
Calculate the expected scores for each hypothesis.
- Parameters:
- Returns:
The expected scores for each hypothesis.
- Return type:
Tensor
- pairwise_scores(hypotheses: list[str], references: list[str], *_, **__) Tensor[source]#
Calculate the pairwise scores.
- class mbrs.metrics.MetricBLEURT(cfg: Config)[source]#
Bases:
MetricBLEURT metric class.
We employ the PyTorch port version to implement this metric instead of the original version: lucadiliello/bleurt-pytorch (thanks to @lucadiliello)
Available checkpoints:
lucadiliello/BLEURT-20
lucadiliello/BLEURT-20-D12
lucadiliello/BLEURT-20-D3
lucadiliello/BLEURT-20-D6
lucadiliello/bleurt-base-128
lucadiliello/bleurt-base-512
lucadiliello/bleurt-large-128
lucadiliello/bleurt-large-512
lucadiliello/bleurt-tiny-128
lucadiliello/bleurt-tiny-512
- class Config(model: str = 'lucadiliello/BLEURT-20-D12', batch_size: int = 64, fp16: bool = False, bf16: bool = False, cpu: bool = False)[source]#
Bases:
ConfigBLEURT metric configuration.
model (str): Model name or path.
batch_size (int): Batch size.
fp16 (bool): Use float16 for the forward computation.
bf16 (bool): Use bfloat16 for the forward computation.
cpu (bool): Use CPU for the forward computation.
- corpus_score(hypotheses: list[str], references_lists: list[list[str]], sources: list[str] | None = None) float[source]#
Calculate the corpus-level score.
- property device: device#
Returns the device of the model.
- pairwise_scores(hypotheses: list[str], references: list[str], *_, **__) Tensor[source]#
Calculate the pairwise scores.
- score(hypothesis: str, reference: str, *_, **__) float[source]#
Calculate the score of the given hypothesis.
- scorer: BleurtForSequenceClassification#
- class mbrs.metrics.MetricBase(cfg: Config)[source]#
Bases:
ABCBase metric class.
- property device: device#
Returns the device of the metric object.
- class mbrs.metrics.MetricCOMET(cfg: Config)[source]#
Bases:
MetricAggregatableCacheCOMET metric class.
- class Cache(embeddings: Tensor)[source]#
Bases:
CacheIntermediate representations of sentences.
- embeddings (Tensor): Sentence embeddings of shape (N, D), where N
is the number of sentences and D is a size of the embedding dimension.
- aggregate(reference_lprobs: Tensor | None = None) Cache[source]#
Aggregate the cached representations.
- Parameters:
reference_lprobs (Tensor, optional) – Log-probabilities for each reference sample. The shape must be (len(references),). See https://arxiv.org/abs/2311.05263.
- Returns:
An aggregated representation.
- Return type:
- embeddings: Tensor#
- class Config(model: str = 'Unbabel/wmt22-comet-da', batch_size: int = 64, fp16: bool = False, bf16: bool = False, cpu: bool = False)[source]#
Bases:
ConfigCOMET metric configuration.
model (str): Model name or path.
batch_size (int): Batch size.
fp16 (bool): Use float16 for the forward computation.
bf16 (bool): Use bfloat16 for the forward computation.
cpu (bool): Use CPU for the forward computation.
- corpus_score(hypotheses: list[str], references_lists: list[list[str]], sources: list[str] | None = None) float[source]#
Calculate the corpus-level score.
- property device: device#
Returns the device of the model.
- encode(sentences: list[str]) Cache[source]#
Encode the given sentences into their intermediate representations.
- Parameters:
- Returns:
Intermediate representations.
- Return type:
- class mbrs.metrics.MetricCOMETkiwi(cfg: Config)[source]#
Bases:
MetricReferencelessCOMETkiwi metric class.
- class Config(model: str = 'Unbabel/wmt22-cometkiwi-da', batch_size: int = 64, fp16: bool = False, bf16: bool = False, cpu: bool = False)[source]#
Bases:
ConfigCOMETkiwi metric configuration.
model (str): Model name or path.
batch_size (int): Batch size.
fp16 (bool): Use float16 for the forward computation.
bf16 (bool): Use bfloat16 for the forward computation.
cpu (bool): Use CPU for the forward computation.
- corpus_score(hypotheses: list[str], sources: list[str]) float[source]#
Calculate the corpus-level score.
- property device: device#
Returns the device of the model.
- class mbrs.metrics.MetricCacheable(cfg: Config)[source]#
Bases:
MetricBase class for cacheable metrics.
This class supports to cache intermediate representations of sentences.
- abstract encode(sentences: list[str]) Cache[source]#
Encode the given sentences into their intermediate representations.
- Parameters:
- Returns:
Intermediate representations.
- Return type:
- abstract out_proj(hypotheses_ir: Cache, references_ir: Cache, sources_ir: Cache | None = None) Tensor[source]#
Forward the output projection layer.
- pairwise_scores(hypotheses: list[str], references: list[str], source: str | None = None) Tensor[source]#
Calculate the pairwise scores.
- pairwise_scores_from_ir(hypotheses_ir: Cache, references_ir: Cache, source_ir: Cache | None = None) Tensor[source]#
Calculate the pairwise scores from the intermediate representations.
- score(hypothesis: str, reference: str, source: str | None = None) float[source]#
Calculate the score of the given hypothesis.
- scores(hypotheses: list[str], references: list[str], sources: list[str] | None = None) Tensor[source]#
Calculate the scores of the given hypotheses.
- class mbrs.metrics.MetricChrF(cfg: Config)[source]#
Bases:
MetricAggregatableChrF metric class.
- class AggregatedReference(ngrams: list[Counter])[source]#
Bases:
objectAggregated reference representation.
ngrams (list[Counter]]): Bags of n-grams for each order.
- class Config(char_order: int = 6, word_order: int = 0, beta: int = 2, lowercase: bool = False, whitespace: bool = False, eps_smoothing: bool = False, num_workers: int = 8, fastchrf: bool = False)[source]#
Bases:
ConfigChrF metric configuration.
char_order (int): Character n-gram order.
word_order (int): Word n-gram order. If equals to 2, the metric is referred to as chrF++.
beta (int): Determine the importance of recall w.r.t precision.
lowercase (bool): Enable case-insensitivity.
whitespace (bool): If True, include whitespaces when extracting character n-grams.
- eps_smoothing (bool): If True, applies epsilon smoothing similar to reference chrF++.py, NLTK and Moses implementations.
Otherwise, it takes into account effective match order similar to sacreBLEU < 2.0.0.
num_workers (int): Number of workers for multiprocessing.
fastchrf (bool): Use the rust implementation of chrF.
- corpus_score(hypotheses: list[str], references_lists: list[list[str]], sources: list[str] | None = None) float[source]#
Calculate the corpus-level score.
- expected_scores_reference_aggregation(hypotheses: list[str], references: list[str], source: str | None = None, reference_lprobs: Tensor | None = None) Tensor[source]#
Calculate the expected scores for each hypothesis.
- Parameters:
- Returns:
The expected scores for each hypothesis.
- Return type:
Tensor
- pairwise_scores(hypotheses: list[str], references: list[str], *_, **__) Tensor[source]#
Calculate the pairwise scores.
- class mbrs.metrics.MetricMetricX(cfg: Config)[source]#
Bases:
MetricMetricX metric class.
References: - MetricX-23: https://aclanthology.org/2023.wmt-1.63 - MetricX-24: https://aclanthology.org/2024.wmt-1.35
Available checkpoints:
google/metricx-24-hybrid-xxl-v2p6
google/metricx-24-hybrid-xl-v2p6
google/metricx-24-hybrid-large-v2p6
google/metricx-23-xxl-v2p0
google/metricx-23-xl-v2p0
google/metricx-23-large-v2p0
google/metricx-23-qe-xxl-v2p0
google/metricx-23-qe-xl-v2p0
google/metricx-23-qe-large-v2p0
- class Config(model: str = 'google/metricx-24-hybrid-xxl-v2p6', batch_size: int = 8, fp16: bool = False, bf16: bool = False, cpu: bool = False)[source]#
Bases:
ConfigMetricX metric configuration.
model (str): Model name or path.
batch_size (int): Batch size.
fp16 (bool): Use float16 for the forward computation.
bf16 (bool): Use bfloat16 for the forward computation.
cpu (bool): Use CPU for the forward computation.
- METRICX23_QE_MODELS = {'google/metricx-23-qe-large-v2p0', 'google/metricx-23-qe-xl-v2p0', 'google/metricx-23-qe-xxl-v2p0'}#
- METRICX_INPUT_LENGTH_MAP = {MetricXVersion.metricx_23: 1024, MetricXVersion.metricx_24: 1536}#
- METRICX_INPUT_PREFIX_MAP = {MetricXVersion.metricx_23: MetricMetricX.InputPrefix(hypothesis='candidate: ', reference=' reference: ', source=' source: '), MetricXVersion.metricx_24: MetricMetricX.InputPrefix(hypothesis=' candidate: ', reference=' reference: ', source='source: ')}#
- METRICX_VERSION_MAP = {'google/metricx-23-large-v2p0': MetricXVersion.metricx_23, 'google/metricx-23-qe-large-v2p0': MetricXVersion.metricx_23, 'google/metricx-23-qe-xl-v2p0': MetricXVersion.metricx_23, 'google/metricx-23-qe-xxl-v2p0': MetricXVersion.metricx_23, 'google/metricx-23-xl-v2p0': MetricXVersion.metricx_23, 'google/metricx-23-xxl-v2p0': MetricXVersion.metricx_23, 'google/metricx-24-hybrid-large-v2p6': MetricXVersion.metricx_24, 'google/metricx-24-hybrid-xl-v2p6': MetricXVersion.metricx_24, 'google/metricx-24-hybrid-xxl-v2p6': MetricXVersion.metricx_24}#
- class MetricXVersion(value)[source]#
-
An enumeration.
- metricx_23 = 'metricx_23'#
- metricx_24 = 'metricx_24'#
- corpus_score(hypotheses: list[str], references_lists: list[list[str]] | None = None, sources: list[str] | None = None) float[source]#
Calculate the corpus-level score.
- property device: device#
Returns the device of the model.
- pairwise_scores(hypotheses: list[str], references: list[str], source: str | None = None) Tensor[source]#
Calculate the pairwise scores.
- score(hypothesis: str, reference: str | None = None, source: str | None = None) float[source]#
Calculate the score of the given hypothesis.
- scorer: MT5ForRegression#
- class mbrs.metrics.MetricReferenceless(cfg: Config)[source]#
Bases:
MetricBaseBase class for reference-less metrics like quality estimation.
- corpus_score(hypotheses: list[str], sources: list[str]) float[source]#
Calculate the corpus-level score.
- class mbrs.metrics.MetricTER(cfg: Config)[source]#
Bases:
MetricTER metric class.
- class Config(normalized: bool = False, no_punct: bool = False, asian_support: bool = False, case_sensitive: bool = False, num_workers: int = 8)[source]#
Bases:
ConfigTER metric configuration.
normalized (bool): Enable character normalization. By default, normalizes a couple of things such as newlines being stripped, retrieving XML encoded characters, and fixing tokenization for punctuation. When ‘asian_support’ is enabled, also normalizes specific Asian (CJK) character sequences, i.e. split them down to the character level.
no_punct (bool): Remove punctuation. Can be used in conjunction with ‘asian_support’ to also remove typical punctuation markers in Asian languages (CJK).
asian_support (bool): Enable special treatment of Asian characters. This option only has an effect when ‘normalized’ and/or ‘no_punct’ is enabled. If ‘normalized’ is also enabled, then Asian (CJK) characters are split down to the character level. If ‘no_punct’ is enabled alongside ‘asian_support’, specific unicode ranges for CJK and full-width punctuations are also removed.
case_sensitive (bool): If True, does not lowercase sentences.
num_workers (int): Number of workers for multiprocessing.
- corpus_score(hypotheses: list[str], references_lists: list[list[str]], sources: list[str] | None = None) float[source]#
Calculate the corpus-level score.
- pairwise_scores(hypotheses: list[str], references: list[str], *_, **__) Tensor[source]#
Calculate the pairwise scores.
- class mbrs.metrics.MetricXCOMET(cfg: Config)[source]#
Bases:
MetricXCOMET metric class.
Both XCOMET (Guerreiro et al., 2024) and XCOMET-lite (Larionov et al., 2024) are supported.
- Supported models:
Unbabel/XCOMET-XL
Unbabel/XCOMET-XXL
myyycroft/XCOMET-lite
- class Config(model: str = 'Unbabel/XCOMET-XL', batch_size: int = 8, fp16: bool = False, bf16: bool = False, cpu: bool = False)[source]#
Bases:
ConfigXCOMET metric configuration.
model (str): Model name or path.
batch_size (int): Batch size.
fp16 (bool): Use float16 for the forward computation.
bf16 (bool): Use bfloat16 for the forward computation.
cpu (bool): Use CPU for the forward computation.
- corpus_score(hypotheses: list[str], references_lists: list[list[str]] | None = None, sources: list[str] | None = None) float[source]#
Calculate the corpus-level score.
- property device: device#
Returns the device of the model.
- pairwise_scores(hypotheses: list[str], references: list[str], source: str | None = None) Tensor[source]#
Calculate the pairwise scores.
- score(hypothesis: str, reference: str | None = None, source: str | None = None) float[source]#
Calculate the score of the given hypothesis.
- scorer: XCOMETMetric#