FScore metrics

BinaryF1Score

Bases: BinaryFBetaScore

Computes F-1 Score on binary structures.

Formula:

f1_score = 2 * (precision * recall) / (precision + recall)

This is the harmonic mean of precision and recall. Its output range is [0, 1]. It operates at a field level and can be used for multi-class and multi-label classification.

Each field of y_true and y_pred should be booleans or floats between [0, 1]. If the fields are floats, the threshold is used to decide whether the values are 0 or 1.
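
For intuition, here is a minimal plain-Python sketch of the formula above (not the synalinks API): it binarizes float fields with the threshold and computes precision, recall and F1.

```python
# Illustrative sketch only, not the synalinks implementation.
def binary_f1(y_true, y_pred, threshold=0.5):
    # Binarize: values strictly greater than the threshold become 1, the rest 0.
    t = [1 if v > threshold else 0 for v in y_true]
    p = [1 if v > threshold else 0 for v in y_pred]
    tp = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
    fp = sum(1 for a, b in zip(t, p) if a == 0 and b == 1)
    fn = sum(1 for a, b in zip(t, p) if a == 1 and b == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

binary_f1([1.0, 0.0, 1.0, 1.0], [0.9, 0.2, 0.4, 0.8])  # precision=1.0, recall=2/3 -> 0.8
```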

Parameters:

- average (str): Type of averaging to be performed across per-class results in the multi-class case. Acceptable values are None, "micro", "macro" and "weighted". Defaults to None. If None, no averaging is performed and result() will return the score for each class. If "micro", compute metrics globally by counting the total true positives, false negatives and false positives. If "macro", compute metrics for each label and return their unweighted mean; this does not take label imbalance into account. If "weighted", compute metrics for each label and return their average weighted by support (the number of true instances for each label); this alters "macro" to account for label imbalance and can result in a score that is not between precision and recall. The averaging modes are illustrated in the sketch after this list.
- threshold (float): (Optional) Threshold for deciding whether prediction values are 1 or 0. Elements of y_pred and y_true greater than threshold are converted to 1, and the rest to 0. Defaults to 0.5.
- name (str): (Optional) String name of the metric instance. Defaults to "binary_f1_score".
- in_mask (list): (Optional) List of keys to keep to compute the metric. Defaults to None.
- out_mask (list): (Optional) List of keys to remove to compute the metric. Defaults to None.
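
To make the averaging options concrete, the NumPy sketch below (independent of synalinks, with made-up per-class counts) computes the per-class, micro, macro and weighted scores:

```python
import numpy as np

# Hypothetical per-class counts for a 3-class problem.
tp = np.array([8.0, 2.0, 5.0])
fp = np.array([2.0, 1.0, 5.0])
fn = np.array([2.0, 6.0, 0.0])
support = tp + fn  # number of true instances per class

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

per_class = f1(tp, fp, fn)                    # average=None: one score per class
micro = f1(tp.sum(), fp.sum(), fn.sum())      # average="micro": global counts
macro = per_class.mean()                      # average="macro": unweighted mean
weighted = (per_class * support / support.sum()).sum()  # average="weighted"
```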
Source code in synalinks/src/metrics/f_score_metrics.py
@synalinks_export("synalinks.metrics.BinaryF1Score")
class BinaryF1Score(BinaryFBetaScore):
    """Computes F-1 Score on binary structures.

    Formula:

    ```python
    f1_score = 2 * (precision * recall) / (precision + recall)
    ```

    This is the harmonic mean of precision and recall.
    Its output range is `[0, 1]`. It operates at a field level
    and can be used for **multi-class and multi-label classification**.

    Each field of `y_true` and `y_pred` should be booleans or floats between [0, 1].
    If the fields are floats, it uses the threshold for deciding
    if the values are 0 or 1.

    Args:
        average (str): Type of averaging to be performed across per-class results
            in the multi-class case.
            Acceptable values are `None`, `"micro"`, `"macro"` and
            `"weighted"`. Defaults to `None`.
            If `None`, no averaging is performed and `result()` will return
            the score for each class.
            If `"micro"`, compute metrics globally by counting the total
            true positives, false negatives and false positives.
            If `"macro"`, compute metrics for each label,
            and return their unweighted mean.
            This does not take label imbalance into account.
            If `"weighted"`, compute metrics for each label,
            and return their average weighted by support
            (the number of true instances for each label).
            This alters `"macro"` to account for label imbalance.
            It can result in a score that is not between precision and recall.
        threshold (float): (Optional) Float representing the threshold for deciding
            whether prediction values are 1 or 0. Elements of `y_pred` and `y_true`
            greater than `threshold` are converted to be 1, and the rest 0.
        name (str): (Optional) string name of the metric instance.
        in_mask (list): (Optional) list of keys to keep to compute the metric.
        out_mask (list): (Optional) list of keys to remove to compute the metric.
    """

    def __init__(
        self,
        average=None,
        threshold=0.5,
        name="binary_f1_score",
        in_mask=None,
        out_mask=None,
    ):
        super().__init__(
            average=average,
            beta=1.0,
            threshold=threshold,
            name=name,
            in_mask=in_mask,
            out_mask=out_mask,
        )

    def get_config(self):
        """Return the serializable config of the metric.

        Returns:
            (dict): The config dict.
        """
        base_config = super().get_config()
        del base_config["beta"]
        return base_config

get_config()

Return the serializable config of the metric.

Returns:

(dict): The config dict.

Source code in synalinks/src/metrics/f_score_metrics.py
def get_config(self):
    """Return the serializable config of the metric.

    Returns:
        (dict): The config dict.
    """
    base_config = super().get_config()
    del base_config["beta"]
    return base_config

BinaryFBetaScore

Bases: FBetaScore

Computes F-Beta score on binary structures.

Formula:

b2 = beta ** 2
f_beta_score = (1 + b2) * (precision * recall) / (precision * b2 + recall)

This is the weighted harmonic mean of precision and recall. Its output range is [0, 1]. It operates at a field level and can be used for multi-class and multi-label classification.

Each field of y_true and y_pred should be booleans or floats between [0, 1]. If the fields are floats, it uses the threshold for deciding if the values are 0 or 1.

Parameters:

- average (str): Type of averaging to be performed across per-class results in the multi-class case. Acceptable values are None, "micro", "macro" and "weighted". Defaults to None. If None, no averaging is performed and result() will return the score for each class. If "micro", compute metrics globally by counting the total true positives, false negatives and false positives. If "macro", compute metrics for each label and return their unweighted mean; this does not take label imbalance into account. If "weighted", compute metrics for each label and return their average weighted by support (the number of true instances for each label); this alters "macro" to account for label imbalance and can result in a score that is not between precision and recall.
- beta (float): Determines the weight given to recall in the harmonic mean between precision and recall (see the pseudocode equation above). Defaults to 1.0.
- threshold (float): (Optional) Threshold for deciding whether prediction values are 1 or 0. Elements of y_pred and y_true greater than threshold are converted to 1, and the rest to 0. Defaults to 0.5.
- name (str): (Optional) String name of the metric instance. Defaults to "binary_fbeta_score".
- in_mask (list): (Optional) List of keys to keep to compute the metric. Defaults to None.
- out_mask (list): (Optional) List of keys to remove to compute the metric. Defaults to None.
Source code in synalinks/src/metrics/f_score_metrics.py
@synalinks_export("synalinks.metrics.BinaryFBetaScore")
class BinaryFBetaScore(FBetaScore):
    """Computes F-Beta score on binary structures.

    Formula:

    ```python
    b2 = beta ** 2
    f_beta_score = (1 + b2) * (precision * recall) / (precision * b2 + recall)
    ```

    This is the weighted harmonic mean of precision and recall.
    Its output range is `[0, 1]`. It operates at a field level
    and can be used for **multi-class and multi-label classification**.

    Each field of `y_true` and `y_pred` should be booleans or floats between [0, 1].
    If the fields are floats, it uses the threshold for deciding
    if the values are 0 or 1.

    Args:
        average (str): Type of averaging to be performed across per-class results
            in the multi-class case.
            Acceptable values are `None`, `"micro"`, `"macro"` and
            `"weighted"`. Defaults to `None`.
            If `None`, no averaging is performed and `result()` will return
            the score for each class.
            If `"micro"`, compute metrics globally by counting the total
            true positives, false negatives and false positives.
            If `"macro"`, compute metrics for each label,
            and return their unweighted mean.
            This does not take label imbalance into account.
            If `"weighted"`, compute metrics for each label,
            and return their average weighted by support
            (the number of true instances for each label).
            This alters `"macro"` to account for label imbalance.
            It can result in a score that is not between precision and recall.
        beta (float): Determines the weight given to recall
            in the harmonic mean between precision and recall (see pseudocode
            equation above). Defaults to `1`.
        threshold (float): (Optional) Float representing the threshold for deciding
            whether prediction values are 1 or 0. Elements of `y_pred` and `y_true`
            greater than `threshold` are converted to be 1, and the rest 0.
        name (str): (Optional) string name of the metric instance.
        in_mask (list): (Optional) list of keys to keep to compute the metric.
        out_mask (list): (Optional) list of keys to remove to compute the metric.
    """

    def __init__(
        self,
        average=None,
        beta=1.0,
        threshold=0.5,
        name="binary_fbeta_score",
        in_mask=None,
        out_mask=None,
    ):
        super().__init__(
            average=average,
            beta=beta,
            name=name,
            in_mask=in_mask,
            out_mask=out_mask,
        )
        if not isinstance(threshold, float):
            raise ValueError(
                "Invalid `threshold` argument value. "
                "It should be a Python float. "
                f"Received: threshold={threshold} "
                f"of type '{type(threshold)}'"
            )
        if threshold > 1.0 or threshold <= 0.0:
            raise ValueError(
                "Invalid `threshold` argument value. "
                "It should verify 0 < threshold <= 1. "
                f"Received: threshold={threshold}"
            )
        self.threshold = threshold

    async def update_state(self, y_true, y_pred):
        y_pred = tree.map_structure(lambda x: ops.convert_to_json_data_model(x), y_pred)
        y_true = tree.map_structure(lambda x: ops.convert_to_json_data_model(x), y_true)

        if self.in_mask:
            y_pred = tree.map_structure(lambda x: x.in_mask(mask=self.in_mask), y_pred)
            y_true = tree.map_structure(lambda x: x.in_mask(mask=self.in_mask), y_true)
        if self.out_mask:
            y_pred = tree.map_structure(lambda x: x.out_mask(mask=self.out_mask), y_pred)
            y_true = tree.map_structure(lambda x: x.out_mask(mask=self.out_mask), y_true)

        def convert_to_binary(x):
            if isinstance(x, bool):
                return 1.0 if x is True else 0.0
            elif isinstance(x, float):
                return 1.0 if x > self.threshold else 0.0
            else:
                raise ValueError(
                    "All `y_true` and y_pred` fields should be booleans or floats. "
                    "Use `in_mask` or `out_mask` to remove the other fields."
                )

        y_true = tree.flatten(
            tree.map_structure(lambda x: convert_to_binary(x), y_true.json())
        )
        y_pred = tree.flatten(
            tree.map_structure(lambda x: convert_to_binary(x), y_pred.json())
        )
        y_true = np.convert_to_tensor(y_true)
        y_pred = np.convert_to_tensor(y_pred)

        true_positives = y_pred * y_true
        false_positives = y_pred * (1 - y_true)
        false_negatives = (1 - y_pred) * y_true
        intermediate_weights = y_true

        current_true_positives = self.state.get("true_positives")
        if current_true_positives:
            true_positives = np.add(current_true_positives, true_positives)

        current_false_positives = self.state.get("false_positives")
        if current_false_positives:
            false_positives = np.add(current_false_positives, false_positives)

        current_false_negatives = self.state.get("false_negatives")
        if current_false_negatives:
            false_negatives = np.add(current_false_negatives, false_negatives)

        current_intermediate_weights = self.state.get("intermediate_weights")
        if current_intermediate_weights:
            intermediate_weights = np.add(
                current_intermediate_weights, intermediate_weights
            )

        self.state.update(
            {
                "true_positives": true_positives.tolist(),
                "false_positives": false_positives.tolist(),
                "false_negatives": false_negatives.tolist(),
                "intermediate_weights": intermediate_weights.tolist(),
            }
        )

    def get_config(self):
        """Return the serializable config of the metric.

        Returns:
            (dict): The config dict.
        """
        config = {
            "beta": self.beta,
            "threshold": self.threshold,
            "name": self.name,
        }
        base_config = super().get_config()
        return {**base_config, **config}

get_config()

Return the serializable config of the metric.

Returns:

(dict): The config dict.

Source code in synalinks/src/metrics/f_score_metrics.py
def get_config(self):
    """Return the serializable config of the metric.

    Returns:
        (dict): The config dict.
    """
    config = {
        "beta": self.beta,
        "threshold": self.threshold,
        "name": self.name,
    }
    base_config = super().get_config()
    return {**base_config, **config}

F1Score

Bases: FBetaScore

Computes F-1 Score.

Formula:

f1_score = 2 * (precision * recall) / (precision + recall)

This is the harmonic mean of precision and recall. Its output range is [0, 1]. It operates at a word level and can be used for QA systems.

If y_true and y_pred contain multiple fields, the JSON object's fields are flattened and the score is computed for each one independently before being averaged.
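
As a rough sketch of the word-level computation (naive lower-case whitespace tokenization, which only approximates the normalization used by synalinks):

```python
# Illustrative sketch only: token-level F1 between a reference and a predicted answer.
def token_f1(y_true, y_pred):
    true_tokens = y_true.lower().split()
    pred_tokens = y_pred.lower().split()
    common = set(true_tokens) & set(pred_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

token_f1("the Eiffel Tower in Paris", "Eiffel Tower Paris France")  # 3 common tokens -> ~0.667
```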

Parameters:

- average (str): Type of averaging to be performed across per-field results in the multi-field case. Acceptable values are None, "micro", "macro" and "weighted". Defaults to None. If None, no averaging is performed and result() will return the score for each class. If "micro", compute metrics globally by counting the total true positives, false negatives and false positives. If "macro", compute metrics for each label and return their unweighted mean; this does not take label imbalance into account. If "weighted", compute metrics for each label and return their average weighted by support (the number of true instances for each label); this alters "macro" to account for label imbalance and can result in a score that is not between precision and recall.
- name (str): (Optional) String name of the metric instance. Defaults to "f1_score".
- in_mask (list): (Optional) List of keys to keep to compute the metric. Defaults to None.
- out_mask (list): (Optional) List of keys to remove to compute the metric. Defaults to None.
Source code in synalinks/src/metrics/f_score_metrics.py
@synalinks_export("synalinks.metrics.F1Score")
class F1Score(FBetaScore):
    """Computes F-1 Score.

    Formula:

    ```python
    f1_score = 2 * (precision * recall) / (precision + recall)
    ```

    This is the harmonic mean of precision and recall.
    Its output range is `[0, 1]`. It operates at a word level
    and can be used for **QA systems**.

    If `y_true` and `y_pred` contain multiple fields,
    the JSON object's fields are flattened and the score
    is computed for each one independently before being averaged.

    Args:
        average (str): Type of averaging to be performed across per-field results
            in the multi-field case.
            Acceptable values are `None`, `"micro"`, `"macro"` and
            `"weighted"`. Defaults to `None`.
            If `None`, no averaging is performed and `result()` will return
            the score for each class.
            If `"micro"`, compute metrics globally by counting the total
            true positives, false negatives and false positives.
            If `"macro"`, compute metrics for each label,
            and return their unweighted mean.
            This does not take label imbalance into account.
            If `"weighted"`, compute metrics for each label,
            and return their average weighted by support
            (the number of true instances for each label).
            This alters `"macro"` to account for label imbalance.
            It can result in a score that is not between precision and recall.
        name (str): (Optional) string name of the metric instance.
        in_mask (list): (Optional) list of keys to keep to compute the metric.
        out_mask (list): (Optional) list of keys to remove to compute the metric.
    """

    def __init__(
        self,
        average=None,
        name="f1_score",
        in_mask=None,
        out_mask=None,
    ):
        super().__init__(
            average=average,
            beta=1.0,
            name=name,
            in_mask=in_mask,
            out_mask=out_mask,
        )

    def get_config(self):
        """Return the serializable config of the metric.

        Returns:
            (dict): The config dict.
        """
        base_config = super().get_config()
        del base_config["beta"]
        return base_config

get_config()

Return the serializable config of the metric.

Returns:

(dict): The config dict.

Source code in synalinks/src/metrics/f_score_metrics.py
def get_config(self):
    """Return the serializable config of the metric.

    Returns:
        (dict): The config dict.
    """
    base_config = super().get_config()
    del base_config["beta"]
    return base_config

FBetaScore

Bases: Metric

Computes F-Beta score.

Formula:

b2 = beta ** 2
f_beta_score = (1 + b2) * (precision * recall) / (precision * b2 + recall)

This is the weighted harmonic mean of precision and recall. Its output range is [0, 1]. It operates at a word level and can be used for QA systems.

If y_true and y_pred contain multiple fields, the JSON object's fields are flattened and the score is computed for each one independently.
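
The sketch below (same naive tokenization assumption as the F1Score example above, not the actual synalinks state handling) shows how token counts accumulate over several samples before a micro-averaged F-beta is computed from the totals:

```python
# Illustrative sketch only: accumulate token-level counts over (y_true, y_pred)
# pairs, then compute a micro-averaged F-beta from the totals.
def micro_f_beta(pairs, beta=1.0):
    tp = fp = fn = 0
    for y_true, y_pred in pairs:
        true_tokens = set(y_true.lower().split())
        pred_tokens = set(y_pred.lower().split())
        common = true_tokens & pred_tokens
        tp += len(common)
        fp += len(pred_tokens) - len(common)
        fn += len(true_tokens) - len(common)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

micro_f_beta([("blue whale", "a blue whale"), ("paris", "london")])  # tp=2, fp=2, fn=1
```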

Parameters:

- average (str): Type of averaging to be performed across per-field results in the multi-field case. Acceptable values are None, "micro", "macro" and "weighted". Defaults to None. If None, no averaging is performed and result() will return the score for each class. If "micro", compute metrics globally by counting the total true positives, false negatives and false positives. If "macro", compute metrics for each label and return their unweighted mean; this does not take label imbalance into account. If "weighted", compute metrics for each label and return their average weighted by support (the number of true instances for each label); this alters "macro" to account for label imbalance and can result in a score that is not between precision and recall.
- beta (float): Determines the weight given to recall in the harmonic mean between precision and recall (see the pseudocode equation above). Defaults to 1.0.
- name (str): (Optional) String name of the metric instance. Defaults to "fbeta_score".
- in_mask (list): (Optional) List of keys to keep to compute the metric. Defaults to None.
- out_mask (list): (Optional) List of keys to remove to compute the metric. Defaults to None.
Source code in synalinks/src/metrics/f_score_metrics.py
@synalinks_export("synalinks.metrics.FBetaScore")
class FBetaScore(Metric):
    """Computes F-Beta score.

    Formula:

    ```python
    b2 = beta ** 2
    f_beta_score = (1 + b2) * (precision * recall) / (precision * b2 + recall)
    ```

    This is the weighted harmonic mean of precision and recall.
    Its output range is `[0, 1]`. It operates at a word level
    and can be used for **QA systems**.

    If `y_true` and `y_pred` contain multiple fields,
    the JSON object's fields are flattened and the score
    is computed for each one independently.

    Args:
        average (str): Type of averaging to be performed across per-field results
            in the multi-field case.
            Acceptable values are `None`, `"micro"`, `"macro"` and
            `"weighted"`. Defaults to `None`.
            If `None`, no averaging is performed and `result()` will return
            the score for each class.
            If `"micro"`, compute metrics globally by counting the total
            true positives, false negatives and false positives.
            If `"macro"`, compute metrics for each label,
            and return their unweighted mean.
            This does not take label imbalance into account.
            If `"weighted"`, compute metrics for each label,
            and return their average weighted by support
            (the number of true instances for each label).
            This alters `"macro"` to account for label imbalance.
            It can result in a score that is not between precision and recall.
        beta (float): Determines the weight given to recall
            in the harmonic mean between precision and recall (see pseudocode
            equation above). Defaults to `1`.
        name (str): (Optional) string name of the metric instance.
        in_mask (list): (Optional) list of keys to keep to compute the metric.
        out_mask (list): (Optional) list of keys to remove to compute the metric.
    """

    def __init__(
        self,
        average=None,
        beta=1.0,
        name="fbeta_score",
        in_mask=None,
        out_mask=None,
    ):
        super().__init__(
            name=name,
            in_mask=in_mask,
            out_mask=out_mask,
        )
        if average not in (None, "micro", "macro", "weighted"):
            raise ValueError(
                "Invalid `average` argument value. Expected one of: "
                "[None, 'micro', 'macro', 'weighted']. "
                f"Received: average={average}"
            )

        if not isinstance(beta, float):
            raise ValueError(
                "Invalid `beta` argument value. "
                "It should be a Python float. "
                f"Received: beta={beta} of type '{type(beta)}'"
            )
        self.state = self.add_variable(
            data_model=FBetaState,
            name=self.name + "_state",
        )
        self.average = average
        self.beta = beta
        self.axis = None
        if self.average != "micro":
            self.axis = 0

    async def update_state(self, y_true, y_pred):
        y_pred = tree.map_structure(lambda x: ops.convert_to_json_data_model(x), y_pred)
        y_true = tree.map_structure(lambda x: ops.convert_to_json_data_model(x), y_true)

        if self.in_mask:
            y_pred = tree.map_structure(lambda x: x.in_mask(mask=self.in_mask), y_pred)
            y_true = tree.map_structure(lambda x: x.in_mask(mask=self.in_mask), y_true)
        if self.out_mask:
            y_pred = tree.map_structure(lambda x: x.out_mask(mask=self.out_mask), y_pred)
            y_true = tree.map_structure(lambda x: x.out_mask(mask=self.out_mask), y_true)

        y_true = tree.flatten(tree.map_structure(lambda x: str(x), y_true.json()))
        y_pred = tree.flatten(tree.map_structure(lambda x: str(x), y_pred.json()))

        true_positives = []
        false_positives = []
        false_negatives = []
        intermediate_weights = []
        # For each field of y_true and y_pred
        for yt, yp in zip(y_true, y_pred):
            y_true_tokens = nlp_utils.normalize_and_tokenize(yt)
            y_pred_tokens = nlp_utils.normalize_and_tokenize(yp)
            common_tokens = set(y_true_tokens) & set(y_pred_tokens)
            true_positives.append(len(common_tokens))
            false_positives.append(len(y_pred_tokens) - len(common_tokens))
            false_negatives.append(len(y_true_tokens) - len(common_tokens))
            intermediate_weights.append(len(y_true_tokens))

        true_positives = np.convert_to_numpy(true_positives)
        false_positives = np.convert_to_numpy(false_positives)
        false_negatives = np.convert_to_numpy(false_negatives)
        intermediate_weights = np.convert_to_numpy(intermediate_weights)

        current_true_positives = self.state.get("true_positives")
        if current_true_positives:
            true_positives = np.add(current_true_positives, true_positives)

        current_false_positives = self.state.get("false_positives")
        if current_false_positives:
            false_positives = np.add(current_false_positives, false_positives)

        current_false_negatives = self.state.get("false_negatives")
        if current_false_negatives:
            false_negatives = np.add(current_false_negatives, false_negatives)

        current_intermediate_weights = self.state.get("intermediate_weights")
        if current_intermediate_weights:
            intermediate_weights = np.add(
                current_intermediate_weights, intermediate_weights
            )

        self.state.update(
            {
                "true_positives": true_positives.tolist(),
                "false_positives": false_positives.tolist(),
                "false_negatives": false_negatives.tolist(),
                "intermediate_weights": intermediate_weights.tolist(),
            }
        )

    def result(self):
        if (
            self.state.get("true_positives") is None
            and self.state.get("false_positives") is None
            and self.state.get("false_negatives") is None
        ):
            return 0.0
        precision = np.divide(
            self.state.get("true_positives"),
            np.add(
                self.state.get("true_positives"),
                self.state.get("false_positives"),
            )
            + backend.epsilon(),
        )
        recall = np.divide(
            self.state.get("true_positives"),
            np.add(
                self.state.get("true_positives"),
                self.state.get("false_negatives"),
            )
            + backend.epsilon(),
        )
        precision = np.convert_to_tensor(precision)
        recall = np.convert_to_tensor(recall)

        mul_value = precision * recall
        add_value = ((self.beta**2) * precision) + recall
        mean = np.divide(mul_value, add_value + backend.epsilon())
        f1_score = mean * (1 + (self.beta**2))
        if self.average == "weighted":
            intermediate_weights = self.state.get("intermediate_weights")
            weights = np.divide(
                intermediate_weights,
                np.sum(intermediate_weights) + backend.epsilon(),
            )
            f1_score = np.sum(f1_score * weights)

        elif self.average is not None:  # [micro, macro]
            f1_score = np.mean(f1_score, self.axis)

        try:
            return float(f1_score)
        except Exception:
            return list(f1_score)

    def get_config(self):
        """Return the serializable config of the metric.

        Returns:
            (dict): The config dict.
        """
        config = {
            "name": self.name,
            "beta": self.beta,
        }
        base_config = super().get_config()
        return {**base_config, **config}

get_config()

Return the serializable config of the metric.

Returns:

(dict): The config dict.

Source code in synalinks/src/metrics/f_score_metrics.py
def get_config(self):
    """Return the serializable config of the metric.

    Returns:
        (dict): The config dict.
    """
    config = {
        "name": self.name,
        "beta": self.beta,
    }
    base_config = super().get_config()
    return {**base_config, **config}