# MMLU

## get_input_data_model()

Returns the MMLU input data model for pipeline configurations.

Returns:

| Type | Description |
|---|---|
| `DataModel` | The MMLU input data model |

Source code in `synalinks/src/datasets/built_in/mmlu.py`
## get_output_data_model()

Returns the MMLU output data model for pipeline configurations.

Returns:

| Type | Description |
|---|---|
| `DataModel` | The MMLU output data model |

Source code in `synalinks/src/datasets/built_in/mmlu.py`
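A minimal sketch of retrieving both schemas. The import path is an assumption based on the source file `synalinks/src/datasets/built_in/mmlu.py`; both calls are local and need no network access:

```python
# Hedged sketch: the import path is an assumption inferred from the
# source-code reference above, not a confirmed public API path.
from synalinks.datasets import mmlu

input_model = mmlu.get_input_data_model()    # DataModel describing pipeline inputs
output_model = mmlu.get_output_data_model()  # DataModel describing pipeline outputs
```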
## iterable_dataset(repeat=1, batch_size=1, limit=None, split='validation')

Streaming dataset for RL-style training.

The default split is `"validation"` (1.5k rows), the conventional training pool for MMLU. Pass `split="auxiliary_train"` for the larger 99k-row auxiliary corpus, or `split="test"` for evaluation.

Returns:

| Type | Description |
|---|---|
| `HuggingFaceDataset` | A streaming, iterable dataset. |

Source code in `synalinks/src/datasets/built_in/mmlu.py`
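The `repeat` / `batch_size` / `limit` semantics can be sketched generically in plain Python. This is an illustrative stand-in for how such parameters typically compose, not the synalinks implementation:

```python
from itertools import islice

def iterable_batches(rows, repeat=1, batch_size=1, limit=None):
    """Illustrative sketch (not the synalinks implementation):
    stream `rows` for `repeat` passes, truncating each pass to
    `limit` items, and yield lists of `batch_size` items."""
    def stream():
        for _ in range(repeat):
            yield from (rows if limit is None else islice(rows, limit))

    batch = []
    for row in stream():
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # trailing partial batch
        yield batch

# 3 rows repeated twice, in batches of 2
print(list(iterable_batches([1, 2, 3], repeat=2, batch_size=2)))
# → [[1, 2], [3, 1], [2, 3]]
```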
## load_data()

Load and format data from HuggingFace.

MMLU is an evaluation-only benchmark; the conventional split used here takes the validation set (1.5k examples, useful for few-shot prompt tuning) as train and the test set (14k examples) as test.

Returns:

| Type | Description |
|---|---|
| `tuple` | The train and test data, ready for training |

Source code in `synalinks/src/datasets/built_in/mmlu.py`
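A usage sketch. The import path is an assumption based on the source file `synalinks/src/datasets/built_in/mmlu.py`, and the destructuring assumes the Keras-style convention of returning `(x_train, y_train), (x_test, y_test)` pairs; downloading requires network access to HuggingFace:

```python
# Hedged sketch: import path and return shape are assumptions,
# not confirmed API details. Requires network access to HuggingFace.
from synalinks.datasets import mmlu

# validation split (1.5k examples) as train, test split (14k) as test
(x_train, y_train), (x_test, y_test) = mmlu.load_data()
print(len(x_train), len(x_test))
```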