Datasets
Up to this point we have hand-built every input the program sees —
typing Query(question="...") in a script, or pasting examples
into a numpy array. That works for a tutorial. It does not work
when your training set is ten thousand rows, your data lives on
disk or on the Hugging Face Hub, and your validation split is
another five thousand rows you do not want to load into memory all
at once.
Synalinks' answer is the Dataset class — a small but principled
streaming interface that hands batches of (x, y) pairs to the
trainer one at a time, in exactly the shape program.fit(...)
expects. By the end of this guide you will have:
- loaded a public Hugging Face dataset into a Synalinks program without any ad-hoc parsing,
- understood how Jinja2 templates convert raw rows into your DataModel types,
- chosen between streaming and materialized loading, and known when each one is right,
- learned the convenience helpers (load_split, split_train_test) and the catalog of built-in datasets Synalinks ships out of the box.
The Picture: What the Trainer Actually Wants
Before we look at HuggingFaceDataset, it helps to understand
what program.fit(...) consumes. The trainer takes a Python
generator that yields one batch at a time. A batch is either:
- a one-tuple (x,) for inference-only use, or
- a two-tuple (x, y) for training.
Both x and y are NumPy object arrays whose elements are
DataModel instances. In Guide 14 (Training) we will build such
arrays by hand — e.g.
np.array([Question(...), Question(...)], dtype="object") — to
keep the training example self-contained. A Dataset produces
arrays of exactly the same shape, just one batch at a time and
from a real source instead of a Python literal.
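To make that contract concrete, here is a toy generator that yields batches in the same shape. The Question and Answer classes are hypothetical stand-ins for real synalinks.DataModel subclasses; this sketches the shape the trainer expects, not Synalinks internals:

```python
import numpy as np

# Hypothetical stand-ins for DataModel subclasses; real code would
# declare synalinks.DataModel types with typed fields instead.
class Question:
    def __init__(self, question):
        self.question = question

class Answer:
    def __init__(self, answer):
        self.answer = answer

def toy_batches(rows, batch_size=2):
    """Yield (x, y) two-tuples of NumPy object arrays, one batch at a time."""
    buf_x, buf_y = [], []
    for q, a in rows:
        buf_x.append(Question(q))
        buf_y.append(Answer(a))
        if len(buf_x) == batch_size:
            yield np.array(buf_x, dtype="object"), np.array(buf_y, dtype="object")
            buf_x, buf_y = [], []
    if buf_x:  # flush the final partial batch
        yield np.array(buf_x, dtype="object"), np.array(buf_y, dtype="object")

rows = [("2+2?", "4"), ("3*3?", "9"), ("10-7?", "3")]
batches = list(toy_batches(rows, batch_size=2))
# Two batches: object arrays of shape (2,), then a partial one of shape (1,).
```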
flowchart LR
SRC["raw rows<br/>(HF / CSV / API / ...)"] --> T["Jinja2 template renders<br/>each row to JSON"]
T --> V["Pydantic validates<br/>JSON → DataModel"]
V --> B["Buffered into batches"]
B --> P["program.fit(x=dataset())"]
A Dataset subclass plugs in at the leftmost box (where do the
rows come from?). The rest — templating, validation, batching — is
inherited from the synalinks.Dataset base class and is the same
across every source.
Loading a Hugging Face Dataset
The Hugging Face Hub is the largest public catalog of text
datasets — millions of rows on tens of thousands of tasks, all
accessible through one library. Synalinks wraps it in
synalinks.HuggingFaceDataset. You give the wrapper a path on the
Hub, two Jinja2 templates that explain how to convert one HF row
into your input/output DataModels, and a batch size; it yields
fit-ready batches.
Here is the minimal recipe, against the gsm8k math-word-problem
dataset:
import synalinks
class MathQuestion(synalinks.DataModel):
question: str = synalinks.Field(description="The math word problem")
class NumericalAnswer(synalinks.DataModel):
answer: float = synalinks.Field(description="The numerical answer")
ds = synalinks.HuggingFaceDataset(
path="gsm8k",
name="main",
split="train",
input_data_model=MathQuestion,
input_template='{"question": {{ question | tojson }}}',
output_data_model=NumericalAnswer,
output_template=(
'{"answer": {{ answer.split("####")[-1].strip().replace(",", "")'
" | float }}}"
),
batch_size=8,
)
# `ds()` returns a fresh generator each time the trainer asks for one.
# program.fit(x=ds(), ...)
Walking through the arguments:
- path is the dataset's repo name on the Hub (the first positional argument of datasets.load_dataset). "gsm8k" here; for a community dataset you would use the full "owner/name" form, as in "dair-ai/emotion" (Guide 17).
- name is the configuration name when a dataset ships several variants. gsm8k has a configuration called "main" and one called "socratic"; we pick "main".
- split is the slice of the dataset you want — typically "train", "validation", or "test". Passing None iterates every split in order.
- input_data_model + input_template describe the x side of each batch.
- output_data_model + output_template describe the y side. Omit both for an inputs-only dataset (useful at inference time).
- batch_size is the number of examples accumulated before yielding one batch. Default 1 — bump it up to give the trainer larger batches and the optimizer better statistics.
The two template arguments are doing the heavy lifting. Let's
look at them more closely.
Templates: From Raw Row to DataModel
A Hugging Face row arrives as a plain Python dict whose keys are
whatever the dataset chose to name them. For gsm8k those keys
are question (a string) and answer (a string in the awkward
form "<chain of thought>\n#### 42"). Your DataModel does not
care about that shape — it just wants typed Python fields. The
two templates convert from one to the other.
Jinja2 is the standard Python templating language; the
double-curly-brace syntax {{ ... }} evaluates an expression
against the row's keys and substitutes its value. Each template
should render to valid JSON that matches the corresponding
DataModel schema, because under the hood Synalinks runs
DataModel.model_validate_json(rendered_string).
Two filters you will use over and over:
- | tojson — the Jinja2 filter that quotes and escapes a Python value into a JSON literal. Always use it around any string field. Skipping tojson is the templating equivalent of forgetting to parameterize a SQL query (Guide 6) — quotes, backslashes, and Unicode in the source row will quietly break your output.
- | float (and | int, etc.) — coerce the value to a number before it lands in JSON, so Pydantic can validate a numeric field.
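You can see what | tojson buys you with plain Jinja2, the same library doing the work inside Synalinks; nothing here is Synalinks-specific:

```python
import json
from jinja2 import Environment, StrictUndefined

env = Environment(undefined=StrictUndefined)
template = env.from_string('{"question": {{ question | tojson }}}')

# A value containing quotes and a backslash, exactly the kind of
# string that tears apart naively interpolated JSON.
raw = 'She wrote "C:\\tmp" and moved on'
rendered = template.render(question=raw)
parsed = json.loads(rendered)  # valid JSON; round-trips the original string
```

Without the filter, the inner quotes would land unescaped in the output and json.loads would fail.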
A complete gsm8k input template:

{"question": {{ question | tojson }}}

The output template is more interesting because gsm8k encodes
its answer as text, in the form "<chain of thought>\n#### 42".
We split on "####", take the last piece, strip whitespace,
remove thousands-separator commas, and coerce to float:

{"answer": {{ answer.split("####")[-1].strip().replace(",", "") | float }}}
The whole rendered string then validates cleanly against
NumericalAnswer. No bespoke parser, no try/except, no
post-processing in your training code.
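You can check the whole conversion end to end with plain Jinja2 before committing to a long run: render the output template against one raw gsm8k-style answer string and parse the result.

```python
import json
from jinja2 import Environment, StrictUndefined

env = Environment(undefined=StrictUndefined)
output_template = env.from_string(
    '{"answer": {{ answer.split("####")[-1].strip().replace(",", "") | float }}}'
)

# A gsm8k-style raw answer: chain of thought, then "#### <number>".
raw = "48 / 2 = 24. 24 + 1,210 = 1,234.\n#### 1,234"
rendered = output_template.render(answer=raw)
parsed = json.loads(rendered)  # {"answer": 1234.0}
```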
Streaming vs Materialized
HuggingFaceDataset accepts a streaming= flag (default
True). The two modes have very different trade-offs:
- Streaming (streaming=True). Rows are downloaded on demand from the Hub. The generator naturally terminates when the source is exhausted. Required when the dataset does not fit on disk (e.g. c4, RedPajama). Length is unknown ahead of time, so len(ds) raises — unless you also pass limit=N, in which case the size is capped and known.
- Materialized (streaming=False). The entire split is downloaded once, then iterated locally. Use it for small benchmark datasets where you want fast random access, reliable len, and reproducibility across runs.
For a 24-row evaluation slice, materialized is usually fine. For
a 1 M-row pretraining shard, streaming is the only option that
won't fill your disk. When in doubt, start with streaming plus
limit= and only switch to materialized if you measure a real
benefit.
Three Convenience Helpers
Three pieces of API turn the common streaming-to-arrays patterns
into one-liners. The first lives on every Dataset, the other
two are module-level functions.
ds.materialize() — stream → in-memory arrays
For evaluation or a small experiment, you usually want the
whole dataset sitting in memory as one NumPy object array, not
a stream of batches. The Dataset.materialize() method does
exactly that: iterate to exhaustion, concatenate every batch,
return a single (x, y) pair.
ds = synalinks.HuggingFaceDataset(
path="gsm8k",
name="main",
split="test",
streaming=False,
input_data_model=MathQuestion,
input_template='{"question": {{ question | tojson }}}',
output_data_model=NumericalAnswer,
output_template=(
'{"answer": {{ answer.split("####")[-1].strip().replace(",", "")'
" | float }}}"
),
limit=200,
)
x, y = ds.materialize()
# x and y are now NumPy object arrays you can hand straight to
# program.evaluate(x=x, y=y).
materialize() works for any Dataset subclass — HuggingFaceDataset,
your own CSV loader, anything — because it is defined on the base
class. Use it for small benchmark datasets that fit in memory;
for huge sources, iterate via ds() instead so rows stream on
demand.
synalinks.datasets.load_split — one HF split → one (x, y)
When the source is Hugging Face, the construct-and-materialize pattern above is so common that Synalinks ships a one-line convenience around it:
x, y = synalinks.datasets.load_split(
path="gsm8k",
name="main",
split="test",
input_data_model=MathQuestion,
input_template='{"question": {{ question | tojson }}}',
output_data_model=NumericalAnswer,
output_template=(
'{"answer": {{ answer.split("####")[-1].strip().replace(",", "")'
" | float }}}"
),
limit=200,
)
This is exactly equivalent to constructing the dataset with
streaming=False and calling materialize() on it; under the
hood, that is precisely what load_split does.
synalinks.datasets.split_train_test — head/tail split
Some benchmark datasets ship a single labeled split (HumanEval,
IFEval, BBH, BBQ, TruthfulQA, ...). When you need a train/eval
cut from one such split, a deterministic head/tail slice is the
standard recipe — the same convention Keras uses with
validation_split=:
(x_train, y_train), (x_test, y_test) = synalinks.datasets.split_train_test(
x, y, validation_split=0.2,
)
There is no shuffling here; the trade-off is reproducibility
across runs in exchange for the risk that the head/tail order is
biased. If the source dataset is already shuffled (most HF
benchmark splits are), head/tail is fine; if not, shuffle x
and y together with a fixed seed before calling this helper.
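A minimal sketch of that pre-shuffle, using NumPy directly; shuffle_together is our own helper name, not part of Synalinks:

```python
import numpy as np

def shuffle_together(x, y, seed=0):
    """Apply one shared permutation to two parallel object arrays,
    so each input stays paired with its own label."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    return x[perm], y[perm]

x = np.array([f"q{i}" for i in range(6)], dtype="object")
y = np.array([f"a{i}" for i in range(6)], dtype="object")
x_shuf, y_shuf = shuffle_together(x, y, seed=42)
# Each question still sits next to its own answer after the shuffle.
```

The fixed seed keeps the shuffle, and hence the head/tail split that follows, reproducible across runs.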
Built-in Datasets: a Pre-built Catalog
For the standard LM-evaluation benchmarks, Synalinks ships
ready-made loaders under synalinks.datasets.* so you do not
have to write the templates yourself. Each one wraps
HuggingFaceDataset and exposes a load_data() function plus
get_input_data_model() / get_output_data_model() helpers.
(x_train, y_train), (x_test, y_test) = synalinks.datasets.gsm8k.load_data()
print(x_train.shape, y_train.shape)
# (7473,) (7473,) — NumPy object arrays of DataModels
The catalog at the time of writing — most of these are the canonical reasoning/QA benchmarks you will see in the LM literature:
- synalinks.datasets.gsm8k — grade-school math word problems
- synalinks.datasets.hotpotqa — multi-hop question answering
- synalinks.datasets.squad — reading-comprehension QA
- synalinks.datasets.mmlu — multitask multiple-choice
- synalinks.datasets.bbh — Big-Bench-Hard
- synalinks.datasets.hellaswag — commonsense completion
- synalinks.datasets.humaneval — code generation
- synalinks.datasets.ifeval — instruction following
- synalinks.datasets.truthfulqa, synalinks.datasets.bbq, synalinks.datasets.arc_challenge, synalinks.datasets.arcagi, synalinks.datasets.boolq, synalinks.datasets.drop, synalinks.datasets.lambada, synalinks.datasets.logiqa, synalinks.datasets.winogrande
If your task is on this list, prefer the built-in loader. If it
is not, write a HuggingFaceDataset directly — the same
machinery, just with templates you choose.
Other Knobs Worth Knowing
A few HuggingFaceDataset arguments you may need later:
- limit=N — cap how many raw rows are consumed across all splits. Useful for smoke tests; also makes len(ds) available on streaming datasets.
- repeat=K — emit each raw example K times in a row. Setting repeat == batch_size produces "group of K rollouts of the same prompt" batches, which is the layout GRPO-style RL training expects.
- revision=... — pin to a specific dataset commit or branch. Important for reproducibility on a moving Hub.
- **kwargs are forwarded straight to datasets.load_dataset, so anything that library accepts (data_files, token, trust_remote_code, ...) works here too.
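The repeat layout is easy to picture with a pure-Python sketch (no Synalinks involved) that emits each row K times and groups the copies into batches:

```python
def repeated_batches(rows, repeat, batch_size):
    """Emit each raw row `repeat` times in a row, grouped into batches.

    A sketch of the repeat= idea: with repeat == batch_size, every
    batch holds K copies of one prompt (the GRPO rollout layout).
    """
    buf = []
    for row in rows:
        for _ in range(repeat):
            buf.append(row)
            if len(buf) == batch_size:
                yield list(buf)
                buf.clear()
    if buf:
        yield list(buf)

batches = list(repeated_batches(["prompt-0", "prompt-1"], repeat=4, batch_size=4))
# batches == [["prompt-0"] * 4, ["prompt-1"] * 4]
```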
Custom Sources: Subclassing Dataset
When the data does not live on the Hub — your own SQLite
database, a CSV file, an internal API — subclass
synalinks.Dataset and implement one method, _iter_rows(),
which yields raw row dicts. The base class handles templates,
validation, batching, and the repeat / limit knobs. A
sketch:
class CsvDataset(synalinks.Dataset):
def __init__(self, csv_path, **kwargs):
super().__init__(**kwargs)
self.csv_path = csv_path
def _iter_rows(self):
import csv
with open(self.csv_path) as f:
for row in csv.DictReader(f):
yield row # dict keyed by column name
ds = CsvDataset(
csv_path="data.csv",
input_data_model=Question,
input_template='{"question": {{ question | tojson }}}',
output_data_model=Answer,
output_template='{"answer": {{ answer | tojson }}}',
batch_size=8,
)
HuggingFaceDataset is itself just such a subclass — its
_iter_rows() is a tiny wrapper around the HF library.
Failure Modes Worth Watching For
- Template rendering errors. Synalinks uses jinja2.StrictUndefined, so a typo in a template variable raises a clean UndefinedError rather than producing silently-wrong JSON. Read the error; the missing variable name is in it.
- Schema validation errors. If your template renders JSON that does not match the declared DataModel, Pydantic raises on the first bad row. Print one rendered string before starting a long run to confirm shapes match.
- tojson omissions. A field that contains quotes, backslashes, or newlines will tear your output JSON apart if you skip tojson. The error usually surfaces as a JSONDecodeError. Always wrap string fields in | tojson.
- Forgetting that ds() returns a fresh generator. Pass x=ds() to fit(), not x=ds. A Dataset instance is the configuration; calling it produces the iterator the trainer consumes.
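The first failure mode is easy to reproduce with plain Jinja2: a deliberately typo'd variable under StrictUndefined raises an UndefinedError whose message names the missing key.

```python
from jinja2 import Environment, StrictUndefined
from jinja2.exceptions import UndefinedError

env = Environment(undefined=StrictUndefined)
# "quesiton" is a deliberate typo; the row only provides "question".
template = env.from_string('{"question": "{{ quesiton }}"}')

message = None
try:
    template.render(question="What is 2 + 2?")
except UndefinedError as exc:
    message = str(exc)  # the missing variable name appears in the message
```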
Take-Home Summary
- A Dataset feeds batches of (x, y) DataModel arrays to the trainer. The shape is identical to the hand-built NumPy arrays you have used so far; the dataset just produces them lazily.
- HuggingFaceDataset is the standard wrapper for the Hugging Face Hub. You give it a path, two Jinja2 templates, and a batch size; it yields fit-ready batches.
- Templates render raw rows to JSON that validates against your DataModel. Use | tojson for safe string escaping and | float / | int for type coercion.
- Streaming vs materialized: streaming downloads on demand and is required for huge datasets; materialized loads the whole split once and gives reliable len. When in doubt, start streaming with limit=.
- Three convenience helpers wrap the common patterns: Dataset.materialize() (any source → in-memory arrays), synalinks.datasets.load_split (one HF split in one call), and synalinks.datasets.split_train_test (deterministic head/tail split).
- The built-in catalog (synalinks.datasets.gsm8k, ...) ships pre-templated loaders for the standard LM benchmarks. Prefer it when your task is on the list.
- Pass x=ds() (calling the dataset to get a fresh generator), not x=ds (the configuration object).
API References
- synalinks.Dataset
- synalinks.HuggingFaceDataset
- Built-in datasets
- Hugging Face datasets library
- Jinja2 template language