Skip to content

ImageFolderDataset

ImageFolderDataset

Bases: Dataset

Streaming dataset over a directory of image files.

Walks root (optionally recursively), matches every file whose extension is in extensions (default common image types, case-insensitive), and yields one row per image — shaped, by default, as a ChatMessages with a text prompt next to the image.

Crucially, an image is carried as a reference (a file:// URI), never as bytes: iterating the dataset — even materializing it — does not load a single pixel into memory. The file is read and inlined as a base64 data: URI only when its batch is actually sent to the model, one batch at a time. A folder of a million photos therefore costs a list of paths to iterate, not a million decoded images.

import synalinks

ds = synalinks.ImageFolderDataset(
    root="./photos",
    prompt="Describe this image in one short sentence.",
    batch_size=8,
)
# program.predict(x=ds())  # each batch's files are read on demand

Each raw row exposes four template variables, so a custom input_template / output_template can reshape freely:

  • file_uri: the image as file:///abs/path (drop into an image_url part — the resolver reads it per batch).
  • image_path: the path relative to root.
  • name: the filename without extension.
  • label: the immediate parent directory name ("" for images directly under root) — the torchvision ImageFolder convention, handy as a classification target.

For a supervised (image, label) dataset, pass an output_template:

ds = synalinks.ImageFolderDataset(
    root="./photos",            # e.g. photos/cat/*.jpg, photos/dog/*.jpg
    output_template='{"label": {{ label | tojson }}}',
    output_data_model=MyLabel,
    batch_size=8,
)

Parameters:

Name Type Description Default
root str

Directory to walk. Must exist.

required
prompt str

Text paired with each image by the default input_template. Ignored if you pass your own input_template. Defaults to "Describe this image.".

'Describe this image.'
recursive bool

When True (default), descend into subdirectories. When False, only direct children of root.

True
extensions tuple

Image extensions to match (case-insensitive). Defaults to (".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp").

_DEFAULT_EXTENSIONS
input_data_model DataModel

See Dataset. Defaults to synalinks.ChatMessages.

None
input_schema dict | str

See Dataset.

None
input_template str

See Dataset. Defaults to a single user turn with the prompt and the image.

None
output_data_model DataModel

See Dataset. Optional target shape.

None
output_schema dict | str

See Dataset.

None
output_template str

See Dataset. Omit for an inputs-only dataset (e.g. captioning at inference); provide it to build a supervised (image, target) dataset for training.

None
batch_size int

Examples per yielded batch. Defaults to 8.

8
limit int

Optional cap on the number of images consumed. With a limit set, __len__ is also available.

None
repeat int

See Dataset.

1
Source code in synalinks/src/datasets/image_folder_dataset.py
@synalinks_export(
    [
        "synalinks.ImageFolderDataset",
        "synalinks.datasets.ImageFolderDataset",
    ]
)
class ImageFolderDataset(Dataset):
    """Streaming dataset over a directory of image files.

    Walks ``root`` (optionally recursively), matches every file whose
    extension is in ``extensions`` (default common image types,
    case-insensitive), and yields one row per image — shaped, by default,
    as a `ChatMessages` with a text ``prompt`` next to the image.

    Crucially, an image is carried as a **reference** (a ``file://`` URI),
    never as bytes: iterating the dataset — even materializing it — does
    not load a single pixel into memory. The file is read and inlined as a
    base64 ``data:`` URI only when its batch is actually sent to the model,
    one batch at a time. A folder of a million photos therefore costs a list
    of paths to iterate, not a million decoded images.

    ```python
    import synalinks

    ds = synalinks.ImageFolderDataset(
        root="./photos",
        prompt="Describe this image in one short sentence.",
        batch_size=8,
    )
    # program.predict(x=ds())  # each batch's files are read on demand
    ```

    Each raw row exposes four template variables, so a custom
    ``input_template`` / ``output_template`` can reshape freely:

    - ``file_uri``: the image as ``file:///abs/path`` (drop into an
      ``image_url`` part — the resolver reads it per batch).
    - ``image_path``: the path relative to ``root``.
    - ``name``: the filename without extension.
    - ``label``: the immediate parent directory name (``""`` for images
      directly under ``root``) — the torchvision ``ImageFolder`` convention,
      handy as a classification target.

    For a supervised ``(image, label)`` dataset, pass an ``output_template``:

    ```python
    ds = synalinks.ImageFolderDataset(
        root="./photos",            # e.g. photos/cat/*.jpg, photos/dog/*.jpg
        output_template='{"label": {{ label | tojson }}}',
        output_data_model=MyLabel,
        batch_size=8,
    )
    ```

    Args:
        root (str): Directory to walk. Must exist.
        prompt (str): Text paired with each image by the default
            ``input_template``. Ignored if you pass your own
            ``input_template``. Defaults to ``"Describe this image."``.
        recursive (bool): When True (default), descend into
            subdirectories. When False, only direct children of ``root``.
        extensions (tuple): Image extensions to match (case-insensitive).
            Defaults to ``(".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp")``.
        input_data_model (DataModel): See `Dataset`. Defaults to
            `synalinks.ChatMessages`.
        input_schema (dict | str): See `Dataset`.
        input_template (str): See `Dataset`. Defaults to a single user turn
            with the ``prompt`` and the image.
        output_data_model (DataModel): See `Dataset`. Optional target shape.
        output_schema (dict | str): See `Dataset`.
        output_template (str): See `Dataset`. Omit for an inputs-only
            dataset (e.g. captioning at inference); provide it to build a
            supervised ``(image, target)`` dataset for training.
        batch_size (int): Examples per yielded batch. Defaults to 8.
        limit (int): Optional cap on the number of images consumed. With a
            limit set, ``__len__`` is also available.
        repeat (int): See `Dataset`.
    """

    def __init__(
        self,
        root: str,
        *,
        prompt: str = "Describe this image.",
        recursive: bool = True,
        extensions=_DEFAULT_EXTENSIONS,
        input_data_model=None,
        input_schema=None,
        input_template: Optional[str] = None,
        output_data_model=None,
        output_schema=None,
        output_template: Optional[str] = None,
        batch_size: int = 8,
        limit: Optional[int] = None,
        repeat: int = 1,
    ):
        if input_template is None:
            input_template = _DEFAULT_INPUT_TEMPLATE
        super().__init__(
            input_data_model=input_data_model,
            input_schema=input_schema,
            input_template=input_template,
            output_data_model=output_data_model,
            output_schema=output_schema,
            output_template=output_template,
            batch_size=batch_size,
            limit=limit,
            repeat=repeat,
        )

        if not os.path.isdir(root):
            raise FileNotFoundError(f"Image folder not found: {root}")
        self.root = root
        self.prompt = prompt
        self.recursive = recursive
        self.extensions = tuple(ext.lower() for ext in extensions)

    def _iter_files(self) -> Iterator[str]:
        if self.recursive:
            # os.walk's order is filesystem-dependent — sort within each
            # directory so the dataset is deterministic across reruns.
            for dirpath, _, filenames in os.walk(self.root):
                for name in sorted(filenames):
                    if name.lower().endswith(self.extensions):
                        yield os.path.join(dirpath, name)
        else:
            for name in sorted(os.listdir(self.root)):
                full = os.path.join(self.root, name)
                if os.path.isfile(full) and name.lower().endswith(self.extensions):
                    yield full

    def _iter_rows(self):
        for path in self._iter_files():
            abspath = os.path.abspath(path)
            parent = os.path.basename(os.path.dirname(abspath))
            root_name = os.path.basename(os.path.abspath(self.root))
            yield {
                "file_uri": f"file://{abspath}",
                "image_path": os.path.relpath(path, self.root),
                "name": os.path.splitext(os.path.basename(path))[0],
                "label": "" if parent == root_name else parent,
                "prompt": self.prompt,
            }

    def __len__(self):
        if self.limit is None:
            raise NotImplementedError(
                "ImageFolderDataset has unknown length without `limit=...`. "
                "Pass a limit if you need __len__ (e.g. for a progress bar)."
            )
        return self._total_batches(self.limit)