Multimodal Data Models
Multimodal content types.
A modality is just a content part of a Message: the OpenAI/litellm
chat-completion content is either a string or a list of parts, and an image
or an audio clip is one of those parts. So multimodal input is plain
chat-completion — Image and Audio are special types you drop straight into
a ChatMessage's content list, mixed with the text that usually goes with
them:
synalinks.ChatMessage(
role="user",
content=[
"What is in this picture? Answer in one sentence.",
synalinks.Image(url="https://example.com/cat.png"),
],
)
ChatMessage normalizes each element to its wire part (a plain str becomes a
text part, an Image/Audio becomes its own part), so the message stays a
strict chat-completion message and flows through the normal generator path.
Two ways a source becomes a payload
A url or path is just a reference; at some point the actual bytes have to
be read and inlined as a base64 data: URI so every provider — including
local ones that cannot fetch a URL themselves — receives the same self-contained
payload. There are two moments that can happen, for two different use cases:
-
At construction, when you build an
Image(url=...)/Audio(path=...)in Python. The bytes are fetched/read immediately, so an unreachable source fails loudly right where it is written and the object is reusable without re-fetching. This is the right default for hand-written, interactive use. -
Per batch, at inference, for content built from raw JSON rather than the Python constructor — most importantly
Datasetrows, which go throughChatMessages.model_validate_json(...). There theimage_url/input_audioparts stay as lightweight references (aurl/path), so a dataset of a million images never inlines a million payloads into memory. Resolution is deferred toresolve_content_media, which the language model calls on the messages it is about to send — i.e. only the current batch's media is ever resolved at once, and it is freed as soon as the request goes out.
A payload already inline — base64 data, or a data: URI — is left untouched
by both paths, so the per-batch resolver is a no-op on construction-resolved
content.
Adding a modality (video, documents, ...) is a new DataModel here with a
to_content_part() method; nothing else in the stack needs to change.
Audio
Bases: DataModel
An audio content part for a chat message.
Provide a source — a url, a local file path, or raw base64 data — and
a container format (e.g. "wav" or "mp3"). The chat-completion
input_audio part carries inline base64 only, so a url/path is fetched
and inlined at construction.
Source code in synalinks/src/backend/pydantic/media.py
to_content_part()
Render this audio as an OpenAI/litellm input_audio content part.
Image
Bases: DataModel
An image content part for a chat message.
Provide exactly one source: a url (http(s):// or a data: URI), a
local file path, or raw base64 data (with its mime_type). Drop it
into a ChatMessage's content list next to the accompanying text. A
url/path is resolved to the actual payload at construction (see the
module docstring).
Source code in synalinks/src/backend/pydantic/media.py
to_content_part()
Render this image as an OpenAI/litellm image_url content part.
A resolved image emits an inline base64 data: URI; a bare data:
URI url is passed through as-is.
Source code in synalinks/src/backend/pydantic/media.py
normalize_content(content)
Normalize a ChatMessage.content value to its chat-completion shape.
A plain string or a non-list value is returned unchanged; a list has each
element mapped through _normalize_content_element.
Source code in synalinks/src/backend/pydantic/media.py
resolve_content_media(messages)
async
Resolve every deferred media reference in messages to an inline payload.
messages is a list of chat-completion wire dicts (one batch's worth). Any
image_url part pointing at an http(s)/file:// source, and any
input_audio part carrying a deferred url/path/marker source, is
fetched/read and inlined as base64. Parts that are already inline — a
data: URI or base64 data, e.g. content built from a constructed
Image/Audio — are left untouched, so construction-resolved messages and
text-only conversations pay nothing.