Multi-Objective LM Selection
Picking the Best Language Model: Multi-Objective Search
Guide 16 introduced hyperparameter search and tuned three knobs of a single language model: chain-of-thought on or off, sampling temperature, and reasoning effort. In this guide we change the kind of question we ask. Instead of tuning one model, we compare several models on the same task — and we score each one on two metrics at once, not just a single reward.
This is a very practical question. Modern LM providers ship whole families of models — a small "lite" version, a medium "flash" version, a large "pro" version. Before you deploy, you usually have to pick one. The cheap model is fast and inexpensive but may miss edge cases; the expensive model is accurate but slow and pricey. What you actually want is the trade-off curve, not just the single winner. The winner depends on how you weight cost against accuracy, and that weighting is a decision you should make with the data in front of you.
Two ideas show up in this guide for the first time:
- Multi-objective optimization. Instead of a single Objective("val_reward", "max"), we hand the tuner a list of objectives. Keras-Tuner aggregates them into a single scalar per trial and ranks trials by that aggregate, while keeping each individual metric recorded so you can inspect the trade-off after the fact.
- Grid search. When the search space is small and discrete — for example, "try each of these three models exactly once" — GridSearch enumerates the combinations deterministically. RandomSearch would sample with replacement and could visit the same model twice while skipping another; GridSearch is guaranteed not to.
The Task: Six-Way Emotion Classification
The dataset we use is
dair-ai/emotion:
short tweets each labeled with one of six emotions — sadness,
joy, love, anger, fear, or surprise. What makes this
dataset interesting is that the label distribution is
imbalanced: joy and sadness together cover most of the
corpus, while love and surprise are rare.
That imbalance is the entire point. A lazy model that always predicts the majority class can post a respectable accuracy and still be useless — it gets the easy half right by accident, and flunks the rare half completely. To surface that failure mode we need a metric that scores per class, not in aggregate.
Why Two Metrics, and Which Two
We compile each candidate with two metrics:
- ExactMatch — strict accuracy. The reward is 1.0 if and only if every class field on the predicted output matches the truth, else 0.0. It captures "did we get the label exactly right?" but it does not distinguish between a model that gets every class right occasionally and one that gets only the majority class right but does so reliably.
- BinaryF1Score(average="macro") — averaged per-class F1. F1 is the harmonic mean of precision and recall; intuitively, it punishes a classifier that wins on one of those at the expense of the other. Macro averaging computes the F1 of each class separately and then averages, so a class with only a handful of examples (like love or surprise) contributes the same weight as a frequent one (like joy). A model that collapses onto the majority class will score high on accuracy and low on macro-F1 — exactly the trade-off we are trying to see. The short sketch after this list puts numbers on that gap.
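A quick way to see the gap is to score a lazy majority-class classifier on a tiny imbalanced toy set. The snippet below is plain Python with made-up data, nothing from the guide's script; it computes accuracy and macro-F1 by hand from their definitions:

```python
# Toy illustration: a majority-class classifier on an imbalanced label set.
# 8 examples: 6 "joy", 1 "love", 1 "surprise". The lazy model always says "joy".
truth = ["joy"] * 6 + ["love", "surprise"]
preds = ["joy"] * 8

def f1_for(label, truth, preds):
    tp = sum(t == label and p == label for t, p in zip(truth, preds))
    fp = sum(t != label and p == label for t, p in zip(truth, preds))
    fn = sum(t == label and p != label for t, p in zip(truth, preds))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

accuracy = sum(t == p for t, p in zip(truth, preds)) / len(truth)
macro_f1 = sum(f1_for(c, truth, preds) for c in ["joy", "love", "surprise"]) / 3

print(accuracy)   # 0.75, which looks respectable
print(macro_f1)   # ~0.29, because the two rare classes contribute an F1 of 0 each
```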
A footnote you will want before reading the source:
synalinks.metrics.F1Score (without Binary) is token-level
and oriented at open-text QA tasks. For a single categorical label
it collapses to accuracy and tells you nothing new. Whenever your
output has one boolean per class — the layout we are about to
introduce — the right metric is BinaryF1Score.
The Output Schema Is Doing Real Work
Because BinaryF1Score operates per class, the output DataModel
needs one boolean field per class. The LM emits a 0/1 for
each:
```python
class Emotion(synalinks.DataModel):
    sadness: bool = synalinks.Field(description="True if the dominant emotion is sadness")
    joy: bool = synalinks.Field(description="True if the dominant emotion is joy")
    love: bool = synalinks.Field(description="True if the dominant emotion is love")
    anger: bool = synalinks.Field(description="True if the dominant emotion is anger")
    fear: bool = synalinks.Field(description="True if the dominant emotion is fear")
    surprise: bool = synalinks.Field(description="True if the dominant emotion is surprise")
```
It is tempting to call this layout "one-hot" — and the ground
truth is one-hot in this dataset, because dair-ai/emotion
assigns exactly one class label per tweet. But the schema itself
does not enforce that. Six independent booleans can take any
of 2⁶ = 64 combinations; the LM is free to mark two as True if
two emotions seem present, or none at all if it cannot decide.
That is a feature, not a bug: the same layout doubles as
multi-label classification — many real-world tasks (genre
tagging, content moderation, symptom checklists) genuinely have
more than one positive class at once.
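For example, this prediction is a perfectly valid Emotion instance even though the dataset's ground truth never produces it, with two classes marked True at once:

```json
{"sadness": false, "joy": true, "love": true, "anger": false, "fear": false, "surprise": false}
```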
If you actually need to enforce "exactly one label," that is a
different schema: a single field with a Literal of the class
names, paired with the Categorical* F1 family (which compares
label sets instead of per-field booleans). Sketch:
```python
from typing import Literal

class EmotionCategorical(synalinks.DataModel):
    label: Literal["sadness", "joy", "love", "anger", "fear", "surprise"] = (
        synalinks.Field(description="The dominant emotion")
    )

# Pair with the Categorical F1 family in compile():
program.compile(
    reward=synalinks.rewards.ExactMatch(),
    metrics=[synalinks.metrics.CategoricalF1Score(average="macro")],
    optimizer=synalinks.optimizers.RandomFewShot(),
)
```
So there are three reasonable layouts for a classification task, each pairing with a different metric family:
| Layout | Schema | Metric family | When to use |
|---|---|---|---|
| Booleans | one bool per class | Binary* | Multi-label; or multi-class where multiple labels are allowed in principle. |
| Scores | one Score per class | Binary* (threshold=…) | Same as booleans, plus you want the LM to express confidence. |
| Categorical | single Literal field | Categorical* | Strictly one label per example, enforced by the schema. |
The take-away worth internalizing: in Synalinks, the schema you declare and the metric you measure have to be designed together. This guide uses the boolean layout because it makes the F1-vs-accuracy trade-off vivid even on a single-label dataset; the variant section below shows the Score version of the same story.
A Variant: Confidence Scores Instead of Booleans
BinaryF1Score accepts not just bool fields but also floats in
[0, 1] — it just thresholds them at runtime to decide which
side of 0/1 each prediction lands on. That opens up a second
output layout: instead of independent booleans per class, ask the
LM for a confidence per class. synalinks.Score (Guide 2) is the
natural type — a discretized [0, 1] enum from which the LM is
constrained to pick one of eleven values (0.0, 0.1, ..., 1.0).
```python
class EmotionScore(synalinks.DataModel):
    sadness: synalinks.Score = synalinks.Field(description="Confidence that the dominant emotion is sadness")
    joy: synalinks.Score = synalinks.Field(description="Confidence that the dominant emotion is joy")
    love: synalinks.Score = synalinks.Field(description="Confidence that the dominant emotion is love")
    anger: synalinks.Score = synalinks.Field(description="Confidence that the dominant emotion is anger")
    fear: synalinks.Score = synalinks.Field(description="Confidence that the dominant emotion is fear")
    surprise: synalinks.Score = synalinks.Field(description="Confidence that the dominant emotion is surprise")
```
The ground-truth template now emits the literal floats 1.0 /
0.0 instead of the JSON booleans true / false — both are
valid Score values, so Pydantic accepts them:
```python
OUTPUT_TEMPLATE_SCORE = (
    "{"
    + ", ".join(
        f'"{name}": {{{{ 1.0 if label == {i} else 0.0 }}}}'
        for i, name in enumerate(EMOTION_LABELS)
    )
    + "}"
)
```
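Concretely, for a row whose integer label is 2, the template renders to the following record (assuming EMOTION_LABELS lists the six class names in the schema's field order, so that index 2 is love):

```json
{"sadness": 0.0, "joy": 0.0, "love": 1.0, "anger": 0.0, "fear": 0.0, "surprise": 0.0}
```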
The metric line changes by exactly one keyword argument — add
threshold=0.5 so BinaryF1Score knows where to cut the
confidence values:
```python
program.compile(
    reward=synalinks.rewards.ExactMatch(),  # see caveat below
    metrics=[synalinks.metrics.BinaryF1Score(average="macro", threshold=0.5)],
    optimizer=synalinks.optimizers.RandomFewShot(),
)
```
When to prefer which layout. A short decision guide:
- Booleans when the task really is multi-class with one winner. The LM has nothing to express beyond "this one." The reward ExactMatch is the natural strict accuracy.
- Scores when you want the LM to express uncertainty ("joy: 0.7, love: 0.3"), when more than one class can be simultaneously true (multi-label), or when you plan to use the confidence values downstream (e.g. as a ranking signal). The reward ExactMatch is too strict here because 0.9 ≠ 1.0; BinaryF1Score itself (with a threshold) is the usual reward choice in this regime.
A small reward-side caveat worth flagging: if you switch to
Score-typed labels, ExactMatch will give you 0.0 whenever the
LM's confidence is anything other than the exact ground-truth
value — even 0.99 against 1.0. Use BinaryF1Score as both the
reward and the metric in that case (or write your own
threshold-based reward via RewardFunctionWrapper, Guide 12).
The runnable example below demonstrates both layouts behind a
USE_SCORE_LABELS toggle so you can see the wiring of each.
Loading a Hugging Face Dataset Through Templates
The dair-ai/emotion dataset ships its labels as integers 0
through 5. Each row has exactly one label, so the rendered
ground-truth record happens to have exactly one True field —
the dataset itself is single-label, even though our schema
would tolerate multi-label. Before we can train on it, we have to
convert each row into the boolean record our Emotion data model
expects.
synalinks.HuggingFaceDataset handles this with two Jinja2
templates — one for the input side of each row, one for the
output side — that render each Hugging Face row into a JSON
snippet matching the target DataModel.
(Jinja2 is the standard Python templating language; you can
think of it as "string formatting on steroids." The
double-curly-brace syntax {{ ... }} evaluates an expression and
substitutes its value into the surrounding text.)
```python
INPUT_TEMPLATE = '{"text": {{ text | tojson }}}'

OUTPUT_TEMPLATE = (
    "{"
    + ",".join(
        f'"{name}": {{{{ (label == {i}) | tojson }}}}'
        for i, name in enumerate(EMOTION_LABELS)
    )
    + "}"
)
```
The tojson filter is the safe way to embed a value into JSON —
it escapes quotes, backslashes, and control characters so they
cannot accidentally break the JSON output. Skipping it is the
templating equivalent of forgetting to parameterize a SQL query
(Guide 6); the bugs it prevents are unpleasant in exactly the
same way. For this dataset, the template emits true for the
matching class and false for every other one — which yields a
single-label record because the source dataset is single-label.
The Emotion schema would happily accept multiple True fields
if the data ever called for it.
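If you want to see exactly what the loader feeds the DataModels, you can render the two templates yourself with plain Jinja2. A minimal sketch: the sample row is made up, and EMOTION_LABELS is assumed to list the six class names in the schema's field order:

```python
import jinja2

EMOTION_LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]

INPUT_TEMPLATE = '{"text": {{ text | tojson }}}'
OUTPUT_TEMPLATE = (
    "{"
    + ",".join(
        f'"{name}": {{{{ (label == {i}) | tojson }}}}'
        for i, name in enumerate(EMOTION_LABELS)
    )
    + "}"
)

# A made-up row in the dataset's raw format: raw text plus an integer label (1, i.e. joy).
row = {"text": 'i feel like i am "finally" home', "label": 1}

print(jinja2.Template(INPUT_TEMPLATE).render(**row))
# {"text": "i feel like i am \"finally\" home"}   (tojson escapes the inner quotes)
print(jinja2.Template(OUTPUT_TEMPLATE).render(**row))
# {"sadness": false,"joy": true,"love": false,"anger": false,"fear": false,"surprise": false}
```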
Dataset Helpers: load_split, materialize, split_train_test
Three helpers from synalinks.datasets (introduced in Guide 10)
do all of the heavy lifting:
- synalinks.datasets.load_split(...) — a one-call shortcut that builds a HuggingFaceDataset with streaming=False, iterates it to exhaustion, and hands you back numpy object arrays. The return shape is (x, y) when an output template is set (as here), or (x,) for inputs-only datasets.
- Dataset.materialize() — the underlying method on the Dataset base class that load_split calls into. Any Dataset subclass (HuggingFace, a custom CSV loader, your own generator) gets this method for free. Use it when you want to build the dataset object explicitly — for instance, to inspect it before materializing, or to swap streaming on and off.
- synalinks.datasets.split_train_test(x, y, validation_split=0.2) — a deterministic head/tail slicer. It returns ((x_train, y_train), (x_val, y_val)) after cutting at int(n * (1 - validation_split)). It is the standard way to carve a validation slice out of a single labeled split — useful when the source dataset does not ship its own validation split (HumanEval, IFEval, BBH, TruthfulQA, BBQ all fit that pattern). The plain-Python sketch after this list spells out the head/tail semantics.
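Since split_train_test is just a deterministic cut, its behavior is easy to spell out in plain Python. This is an illustrative re-implementation of the semantics described above, not the library's source:

```python
# What split_train_test does, re-implemented to make the head/tail
# semantics explicit (illustrative only; use the real helper).
def head_tail_split(x, y, validation_split=0.2):
    cut = int(len(x) * (1 - validation_split))
    return (x[:cut], y[:cut]), (x[cut:], y[cut:])

# With 1000 rows and validation_split=0.2, the first 800 rows become train
# and the last 200 become validation. No shuffling happens here.
```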
For this guide we do have a native validation split on
dair-ai/emotion, but we use split_train_test against the
train split anyway — partly because it is the more general
recipe (it works on any single-split source), and partly so you
can see the helper in action:
```python
(x_trainval, y_trainval) = synalinks.datasets.load_split(
    path="dair-ai/emotion",
    split="train",
    input_data_model=Tweet,
    input_template=INPUT_TEMPLATE,
    output_data_model=EmotionLabel,
    output_template=OUTPUT_TEMPLATE,
    limit=NB_TRAINVAL_SAMPLES,
)

(x_train, y_train), (x_val, y_val) = synalinks.datasets.split_train_test(
    x_trainval, y_trainval, validation_split=VALIDATION_SPLIT,
)

(x_test, y_test) = synalinks.datasets.load_split(
    path="dair-ai/emotion", split="test", ..., limit=NB_TEST_SAMPLES,
)
```
The shuffle question is worth flagging: split_train_test is
deterministic and order-preserving — it slices the head off for
train and the tail off for val. If your source rows are not
already shuffled (HumanEval's prompts, for instance, are sorted
by task ID), shuffle before you split. The simplest remedy is to
permute the materialized arrays with a fixed seed before the
split, as in the sketch below.
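A minimal sketch of that permutation step; it assumes numpy is imported as np (the materialized arrays are numpy object arrays, so they support fancy indexing):

```python
import numpy as np

# Shuffle x and y with the same permutation so inputs and labels stay aligned,
# then hand the shuffled arrays to split_train_test as before.
rng = np.random.default_rng(seed=42)
perm = rng.permutation(len(x_trainval))
x_shuffled, y_shuffled = x_trainval[perm], y_trainval[perm]

(x_train, y_train), (x_val, y_val) = synalinks.datasets.split_train_test(
    x_shuffled, y_shuffled, validation_split=VALIDATION_SPLIT,
)
```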
A Multi-Objective GridSearch
The tuner construction is almost the same as in Guide 16, with two differences:
- objective is a list. When you pass more than one objective, Keras-Tuner internally wraps the list in a MultiObjective object and aggregates the metrics into a single scalar (a weighted sum, by default with equal weights) for the oracle's ranking. The individual metrics stay recorded on each trial, so you can inspect the trade-off later — the aggregate is just there to let the oracle compare apples to apples.
- We use GridSearch instead of RandomSearch, because we want each model evaluated exactly once. (More on this below.)
```python
tuner = synalinks.tuners.GridSearch(
    build_program,
    objective=[
        synalinks.tuners.Objective("val_reward", direction="max"),
        synalinks.tuners.Objective("val_binary_f1_score", direction="max"),
    ],
    max_trials=len(CANDIDATE_MODELS),
    directory="examples",
    project_name="emotion_lm_selection",
    overwrite=True,
)
```
Both objectives are maximized in this example, but you can mix
directions freely. A common third axis is a min objective on
cost or latency — for instance, if you record
val_tokens_per_request as a metric, you can ask the tuner to
maximize accuracy and minimize tokens at the same time, and the
ranking will reflect both. That is exactly the "I want a model
that is accurate and cheap" question we set out to answer.
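As a sketch of that mixed-direction setup: val_tokens_per_request is a hypothetical metric name, not something the guide's script records, so you would have to log such a value per trial yourself before the oracle can rank on it.

```python
# Hypothetical mixed-direction objective list: accuracy and F1 up, token cost down.
# The third Objective only works if your program reports a tokens-per-request
# metric under this (assumed) name.
objective = [
    synalinks.tuners.Objective("val_reward", direction="max"),
    synalinks.tuners.Objective("val_binary_f1_score", direction="max"),
    synalinks.tuners.Objective("val_tokens_per_request", direction="min"),
]
```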
Reading the Pareto Trade-off
When the search finishes, the winning trial is simply the one with the highest aggregated score. But the genuinely interesting output of a multi-objective search is the per-model table that lists every model's individual scores side by side. That table is your view of the Pareto frontier — a term worth pinning down.
The Pareto frontier is the set of configurations where you cannot improve one metric without hurting another. A configuration is on the frontier if no other configuration is strictly better than it on both metrics at once. The frontier is not a single point but a curve (or, in higher dimensions, a surface) of "acceptable" trade-offs. Choosing one point on the frontier over another is a business decision — how much accuracy am I willing to give up for a fraction of the cost? — not a math problem.
```
trial  model                                  reward  f1
0001   gemini/gemini-3.1-flash-lite-preview   0.500   0.402
0002   gemini/gemini-3.1-flash-preview        0.667   0.531
0003   gemini/gemini-3.1-pro-preview          0.750   0.624
```
(Numbers shown here are illustrative; your run will differ.)
A model that scores high on reward but low on f1 is
overfitting to the majority classes — it gets the easy half right
and quietly fails on the rare half. A model that scores about the
same on both metrics, even at a lower absolute level, is making
balanced mistakes — and that is often what you want in
practice, because the rare classes are usually the ones the user
notices.
For a really honest comparison you would run multiple seeds per model and plot the Pareto frontier in two dimensions, so the trade-off has uncertainty bars on it. The script below does a single seed per model so that one search fits in a few minutes on a laptop — but in production work, do not skip the multiple-seed step.
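If you want to extract the frontier programmatically rather than eyeball the table, a small helper does it. This is a minimal sketch in plain Python, fed with the illustrative (reward, f1) pairs from the table above:

```python
# Keep only the configurations that no other configuration dominates
# (i.e. beats or ties on every metric and strictly beats on at least one).
def pareto_frontier(results):
    frontier = []
    for name, metrics in results:
        dominated = any(
            all(o >= m for o, m in zip(other, metrics))
            and any(o > m for o, m in zip(other, metrics))
            for _, other in results
        )
        if not dominated:
            frontier.append(name)
    return frontier

results = [
    ("flash-lite", (0.500, 0.402)),
    ("flash", (0.667, 0.531)),
    ("pro", (0.750, 0.624)),
]
print(pareto_frontier(results))  # ['pro'] because the largest model dominates both others here
```

With a cost or latency axis added (sign-flipped so that higher is still better), the cheaper models would typically rejoin the frontier, which is exactly when the business decision kicks in.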
Take-Home Summary
- Multi-objective search passes a list of Objectives. Keras-Tuner aggregates them per trial and ranks accordingly, while keeping the individual metrics recorded so you can inspect the trade-off.
- BinaryF1Score with average="macro" is the standard metric for imbalanced multi-class classification. Plain accuracy rewards majority-class collapse; macro-F1 does not.
- The output DataModel must have one boolean field per class for BinaryF1Score to read it. The shape of the schema is dictated by the metric.
- GridSearch is the right tuner when the search space is a small discrete sweep — for example, "each model in this list, exactly once."
- A HuggingFaceDataset plus Jinja2 input/output templates is how you pipe a public dataset's raw rows into your DataModel schemas.
- Always rebuild the winner from best_hp and evaluate on a held-out test split; the validation scores were used by the oracle and overstate generalization.
- The point of multi-objective search is the Pareto frontier, not a single winner. The frontier is your engineering tool for the cost-vs-accuracy decision.
API References
- synalinks.tuners.GridSearch
- synalinks.metrics.BinaryF1Score
- synalinks.rewards.ExactMatch
- synalinks.HuggingFaceDataset
- keras-tuner Multi-objective
- dair-ai/emotion
Emotion
Bases: DataModel
Per-class boolean labels.
One independent bool per class — the layout BinaryF1Score
expects. The schema does not enforce "exactly one True":
six independent booleans can take any of 2⁶ combinations, so
this same layout naturally generalizes to multi-label tasks
where more than one class can be positive at once. For
strict single-label enforcement, use a Literal field with
the Categorical* metrics instead (see the guide prose).
On dair-ai/emotion each row happens to have exactly one
True field in the ground truth, but the LM is technically
free to predict any combination — the field descriptions are
a soft hint, not a hard constraint.
Source code in guides/17_multi_objective_lm_selection.py
EmotionScore
Bases: DataModel
Per-class confidence scores.
One synalinks.Score per class — a discretized [0, 1] enum from
which the LM is constrained to pick one of eleven values (0.0,
0.1, ..., 1.0). Like the boolean layout, each class is an
independent field; the schema does not enforce that exactly one
field is high. BinaryF1Score(threshold=0.5) binarizes the
confidences at evaluation time. Use this layout when you want
the LM to express uncertainty, or for multi-label cases where
several classes can be simultaneously positive.
Source code in guides/17_multi_objective_lm_selection.py
Tweet
Bases: DataModel
A short piece of text whose emotion we want to classify.
Source code in guides/17_multi_objective_lm_selection.py
build_program(hp) (async)
Sample hyperparameters and return a compiled classifier.
The only HP here is which language model to use.
Source code in guides/17_multi_objective_lm_selection.py
load_emotion_split(split, limit)
Materialize one HF split into (x, y) arrays of DataModels.
synalinks.datasets.load_split is the one-call shortcut for
"build a non-streaming HuggingFaceDataset and materialize it" —
see Guide 10.