
GSM8K

get_input_data_model()

Returns GSM8K input data_model for pipeline configurations.

Returns:

| Type | Description |
| --- | --- |
| `DataModel` | The GSM8K input data_model |

Source code in synalinks/src/datasets/built_in/gsm8k.py
@synalinks_export("synalinks.datasets.gsm8k.get_input_data_model")
def get_input_data_model():
    """
    Returns GSM8K input data_model for pipeline configurations.

    Returns:
        (DataModel): The GSM8K input data_model
    """
    return MathQuestion

get_output_data_model()

Returns GSM8K output data_model for pipeline configurations.

Returns:

| Type | Description |
| --- | --- |
| `DataModel` | The GSM8K output data_model |

Source code in synalinks/src/datasets/built_in/gsm8k.py
@synalinks_export("synalinks.datasets.gsm8k.get_output_data_model")
def get_output_data_model():
    """
    Returns GSM8K output data_model for pipeline configurations.

    Returns:
        (DataModel): The GSM8K output data_model
    """
    return NumericalAnswerWithThinking
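
The name `NumericalAnswerWithThinking` suggests a record that pairs a reasoning trace with a final number. As a rough illustration of how a raw GSM8K answer string could be separated into those two parts, the sketch below relies on the dataset's convention of ending each answer with a `#### <number>` line. The function `split_gsm8k_answer` and the keys `thinking` and `answer` are hypothetical names chosen for this example; the actual field names and the `_OUTPUT_TEMPLATE` used by synalinks are not shown on this page and may differ.

```python
def split_gsm8k_answer(raw_answer):
    """Split a raw GSM8K-style answer into reasoning text and a number.

    Hypothetical helper for illustration; not part of synalinks.
    """
    # Everything before the last "####" marker is the worked reasoning.
    thinking, sep, final = raw_answer.rpartition("####")
    if not sep:
        raise ValueError("expected a '#### <number>' line at the end")
    # Strip thousands separators such as "1,200" before converting.
    number = float(final.strip().replace(",", ""))
    return {"thinking": thinking.strip(), "answer": number}


# A made-up record written in the GSM8K answer style.
raw = (
    "She sold 48 / 2 = 24 clips in May.\n"
    "She sold 48 + 24 = 72 clips in total.\n"
    "#### 72"
)
parsed = split_gsm8k_answer(raw)
print(parsed["thinking"])
print(parsed["answer"])
```

Using `rpartition` rather than `split` keeps the reasoning intact even if the marker happens to appear earlier in the text.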

iterable_dataset(repeat=1, batch_size=1, limit=None, split='train')

Streaming dataset for RL-style training.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `repeat` | `int` | Number of consecutive copies of each row; set equal to `batch_size` for GRPO-style rollouts. | `1` |
| `batch_size` | `int` | Examples per yielded batch. | `1` |
| `limit` | `int` | Optional cap on raw rows (useful for smoke tests). | `None` |
| `split` | `str` | HF split to stream. Defaults to `"train"`. | `'train'` |

Returns:

| Type | Description |
| --- | --- |
| `HuggingFaceDataset` | A streaming, iterable dataset. |

Source code in synalinks/src/datasets/built_in/gsm8k.py
@synalinks_export("synalinks.datasets.gsm8k.iterable_dataset")
def iterable_dataset(repeat=1, batch_size=1, limit=None, split="train"):
    """
    Streaming dataset for RL-style training.

    Args:
        repeat (int): Number of consecutive copies of each row — set
            equal to ``batch_size`` for GRPO-style rollouts.
        batch_size (int): Examples per yielded batch.
        limit (int): Optional cap on raw rows (useful for smoke tests).
        split (str): HF split to stream. Defaults to ``"train"``.

    Returns:
        (HuggingFaceDataset): A streaming, iterable dataset.
    """
    return HuggingFaceDataset(
        path="gsm8k",
        name="main",
        split=split,
        streaming=True,
        input_data_model=MathQuestion,
        input_template=_INPUT_TEMPLATE,
        output_data_model=NumericalAnswerWithThinking,
        output_template=_OUTPUT_TEMPLATE,
        batch_size=batch_size,
        limit=limit,
        repeat=repeat,
    )
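
To make the interaction between `repeat`, `batch_size`, and `limit` concrete, here is a small pure-Python sketch of the behavior the parameter descriptions imply. It is not the `HuggingFaceDataset` implementation: `repeat_and_batch` is a hypothetical stand-in, and it assumes that `limit` is applied to raw rows before repetition and that a trailing partial batch is still yielded.

```python
from itertools import islice


def repeat_and_batch(rows, repeat=1, batch_size=1, limit=None):
    """Illustrative stand-in for the described semantics; not synalinks code."""
    if limit is not None:
        # Assumption: the cap counts raw rows, before any repetition.
        rows = islice(rows, limit)
    batch = []
    for row in rows:
        # Emit `repeat` consecutive copies of the same row.
        for _ in range(repeat):
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        # Assumption: a final, smaller batch is kept rather than dropped.
        yield batch


questions = ["q1", "q2", "q3"]
batches = list(repeat_and_batch(questions, repeat=4, batch_size=4, limit=2))
for batch in batches:
    print(batch)
```

With `repeat` equal to `batch_size`, every batch holds copies of a single question, which is the grouping a GRPO-style rollout needs in order to compare several completions of the same prompt.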

load_data()

Load and format data from HuggingFace.

Example:

(x_train, y_train), (x_test, y_test) = synalinks.datasets.gsm8k.load_data()

Returns:

| Type | Description |
| --- | --- |
| `tuple` | The train and test data ready for training |

Source code in synalinks/src/datasets/built_in/gsm8k.py
@synalinks_export("synalinks.datasets.gsm8k.load_data")
def load_data():
    """
    Load and format data from HuggingFace.

    Example:

    ```python
    (x_train, y_train), (x_test, y_test) = synalinks.datasets.gsm8k.load_data()
    ```

    Returns:
        (tuple): The train and test data ready for training
    """
    x_train, y_train = _split_dataset("train")
    x_test, y_test = _split_dataset("test")
    return (x_train, y_train), (x_test, y_test)