ImageFolderDataset
Bases: Dataset
Streaming dataset over a directory of image files.
Walks root (optionally recursively), matches every file whose
extension is in extensions (default common image types,
case-insensitive), and yields one row per image — shaped, by default,
as a ChatMessages with a text prompt next to the image.
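The directory walk described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: `iter_image_paths` and `EXAMPLE_EXTENSIONS` are hypothetical names, and the real `_DEFAULT_EXTENSIONS` may contain a different set of suffixes.

```python
from pathlib import Path

# Hypothetical stand-in; the library's real _DEFAULT_EXTENSIONS may differ.
EXAMPLE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")


def iter_image_paths(root, extensions=EXAMPLE_EXTENSIONS, recursive=True):
    """Yield image files under root, matching extensions case-insensitively."""
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(f"root directory does not exist: {root}")
    wanted = {ext.lower() for ext in extensions}
    # rglob descends into subdirectories; iterdir lists direct children only.
    candidates = root.rglob("*") if recursive else root.iterdir()
    # Sorting is an assumption made here to get a deterministic order.
    for path in sorted(candidates):
        if path.is_file() and path.suffix.lower() in wanted:
            yield path
```

Only paths are produced here; no file content is opened, which is what keeps iteration cheap.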
Crucially, an image is carried as a reference (a file:// URI),
never as bytes: iterating the dataset — even materializing it — does
not load a single pixel into memory. The file is read and inlined as a
base64 data: URI only when its batch is actually sent to the model,
one batch at a time. A folder of a million photos therefore costs a list
of paths to iterate, not a million decoded images.
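To make the reference-versus-bytes distinction concrete, here is a minimal sketch of per-batch inlining. It is not the library's resolver: `file_uri_to_data_uri` and `resolve_batches` are hypothetical helpers written for illustration, using only the standard library.

```python
import base64
import mimetypes
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def file_uri_to_data_uri(file_uri):
    """Read the file behind a file:// URI and return a base64 data: URI."""
    parsed = urlparse(file_uri)
    if parsed.scheme != "file":
        raise ValueError(f"expected a file:// URI, got: {file_uri}")
    path = Path(url2pathname(parsed.path))
    mime, _ = mimetypes.guess_type(path.name)
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


def resolve_batches(file_uris, batch_size=8):
    """Yield lists of data: URIs, reading files only when a batch is consumed."""
    for start in range(0, len(file_uris), batch_size):
        batch = file_uris[start:start + batch_size]
        # File bytes are read here, one batch at a time, not up front.
        yield [file_uri_to_data_uri(uri) for uri in batch]
```

Because `resolve_batches` is a generator, holding a million URIs costs only the strings themselves; at most one batch of encoded images exists at a time.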
import synalinks

ds = synalinks.ImageFolderDataset(
    root="./photos",
    prompt="Describe this image in one short sentence.",
    batch_size=8,
)
# program.predict(x=ds())  # each batch's files are read on demand
Each raw row exposes four template variables, so a custom
input_template / output_template can reshape freely:
- file_uri: the image as file:///abs/path (drop into an image_url part; the resolver reads it per batch).
- image_path: the path relative to root.
- name: the filename without extension.
- label: the immediate parent directory name ("" for images directly under root). This is the torchvision ImageFolder convention, handy as a classification target.
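The four variables can be derived from a path with `pathlib`, as the following sketch shows. `row_variables` is a hypothetical helper, not part of synalinks, and the use of forward slashes in `image_path` is an assumption made for the example.

```python
from pathlib import Path


def row_variables(root, image_path):
    """Build the four template variables for one image (illustrative only)."""
    root = Path(root).resolve()
    image_path = Path(image_path).resolve()
    relative = image_path.relative_to(root)
    # An image directly under root has "." as its relative parent: empty label.
    label = "" if relative.parent == Path(".") else relative.parent.name
    return {
        "file_uri": image_path.as_uri(),     # file:///abs/path
        "image_path": relative.as_posix(),   # path relative to root
        "name": image_path.stem,             # filename without extension
        "label": label,                      # immediate parent directory
    }
```

For a layout such as photos/cat/tom.jpg with root="./photos", this yields name "tom" and label "cat"; nested folders contribute only their innermost directory as the label.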
For a supervised (image, label) dataset, pass an output_template:
# MyLabel is assumed to be a user-defined DataModel with a `label` field.
ds = synalinks.ImageFolderDataset(
    root="./photos",  # e.g. photos/cat/*.jpg, photos/dog/*.jpg
    output_template='{"label": {{ label | tojson }}}',
    output_data_model=MyLabel,
    batch_size=8,
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `root` | `str` | Directory to walk. Must exist. | required |
| `prompt` | `str` | Text paired with each image by the default | `'Describe this image.'` |
| `recursive` | `bool` | When True (default), descend into subdirectories. When False, only direct children of | `True` |
| `extensions` | `tuple` | Image extensions to match (case-insensitive). Defaults to | `_DEFAULT_EXTENSIONS` |
| `input_data_model` | `DataModel` | See | `None` |
| `input_schema` | `dict \| str` | See | `None` |
| `input_template` | `str` | See | `None` |
| `output_data_model` | `DataModel` | See | `None` |
| `output_schema` | `dict \| str` | See | `None` |
| `output_template` | `str` | See | `None` |
| `batch_size` | `int` | Examples per yielded batch. Defaults to 8. | `8` |
| `limit` | `int` | Optional cap on the number of images consumed. With a limit set, | `None` |
| `repeat` | `int` | See | `1` |
Source code in synalinks/src/datasets/image_folder_dataset.py