Knowledge Bases API

KnowledgeBase

Bases: SynalinksSaveable

A knowledge base for storing and retrieving structured data.

The KnowledgeBase provides a unified interface for storing structured data with support for full-text search and optional vector similarity search. It uses DuckDB as the underlying storage engine.

Basic Usage
import synalinks

class Document(synalinks.DataModel):
    id: str
    title: str
    content: str

# Create a knowledge base without embeddings (full-text search only)
knowledge_base = synalinks.KnowledgeBase(
    uri="duckdb://my_database.db",
    data_models=[Document],
)

# Store a document
doc = Document(id="1", title="Hello", content="Hello World!")
await knowledge_base.update(doc.to_json_data_model())

# Retrieve by ID
result = await knowledge_base.get("1", table_name="Document")

# Full-text search
results = await knowledge_base.fulltext_search(
    "Hello", table_name="Document", k=10
)
With Vector Similarity Search
embedding_model = synalinks.EmbeddingModel(
    model="ollama/mxbai-embed-large"
)

knowledge_base = synalinks.KnowledgeBase(
    uri="duckdb://./my_database.db",
    data_models=[Document],
    embedding_model=embedding_model,
    metric="cosine",
)

# Hybrid search (combines BM25 fulltext + vector similarity, fused by RRF)
results = await knowledge_base.hybrid_fts_search(
    "semantic query", table_name="Document", k=10
)
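
The hybrid search above is described as fusing a BM25 ranking and a vector ranking with Reciprocal Rank Fusion (RRF). The sketch below shows the textbook RRF formula in plain Python, using the documented `k_rank=60` default as the smoothing constant. The function name `rrf_fuse` and the sample ids are hypothetical, and the adapter's actual implementation may differ in details such as tie-breaking.

```python
from collections import defaultdict


def rrf_fuse(rankings, k_rank=60, k=10):
    """Fuse several ranked id lists with Reciprocal Rank Fusion.

    Hypothetical helper for illustration only; not part of synalinks.
    Each id scores sum(1 / (k_rank + rank)) over the lists it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k_rank + rank)
    # Highest fused score first; ties are broken by id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:k]


# Made-up rankings standing in for the BM25 and vector result lists.
bm25_ranking = ["doc3", "doc1", "doc2"]
vector_ranking = ["doc1", "doc4", "doc3"]

fused = rrf_fuse([bm25_ranking, vector_ranking], k_rank=60, k=3)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

A smaller `k_rank` makes the top ranks of each list count for more, which matches the description of `k_rank` given for `hybrid_fts_search`.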
Retrieving Table Definitions
# Get all symbolic data models (table definitions) from the database
symbolic_models = knowledge_base.get_symbolic_data_models()

for model in symbolic_models:
    print(model.get_schema())
    # {'title': 'Document', 'type': 'object', 'properties': {...}, ...}

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| uri | str | The database connection URI. Use "duckdb://path/to/db.db" for DuckDB. If not provided, uses an in-memory database. | None |
| data_models | list | Optional list of DataModel or SymbolicDataModel classes to create tables for. | None |
| embedding_model | EmbeddingModel | Optional embedding model for vector similarity search. | None |
| metric | str | The distance metric for vector search. Options: "cosine", "l2seq", "ip". | 'cosine' |
| wipe_on_start | bool | Whether to clear the database on initialization. | False |
| name | str | Optional name for the knowledge base (used for serialization). | None |
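
To make the `metric` options more concrete, here is a minimal plain-Python sketch of the three quantities these names usually refer to. It assumes that "l2seq" denotes squared Euclidean distance and that "ip" denotes the inner product, which the documentation above does not spell out. The function names are hypothetical and this is not the storage engine's implementation.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def squared_l2_distance(a, b):
    """Squared Euclidean distance (assumed meaning of "l2seq")."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a, b):
    """Plain dot product (assumed meaning of "ip"); larger means closer."""
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 0.0]
orthogonal = [0.0, 1.0]
scaled = [2.0, 0.0]

print(cosine_distance(query, orthogonal), cosine_distance(query, scaled))
print(squared_l2_distance(query, orthogonal), inner_product(query, scaled))
```

Cosine distance ignores vector length, while the other two do not, so the choice matters mainly when the embedding model does not return normalized vectors.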
Source code in synalinks/src/knowledge_bases/knowledge_base.py
@synalinks_export("synalinks.KnowledgeBase")
class KnowledgeBase(SynalinksSaveable):
    """A knowledge base for storing and retrieving structured data.

    The KnowledgeBase provides a unified interface for storing structured data
    with support for full-text search and optional vector similarity search.
    It uses DuckDB as the underlying storage engine.

    ### Basic Usage

    ```python
    import synalinks

    class Document(synalinks.DataModel):
        id: str
        title: str
        content: str

    # Create a knowledge base without embeddings (full-text search only)
    knowledge_base = synalinks.KnowledgeBase(
        uri="duckdb://my_database.db",
        data_models=[Document],
    )

    # Store a document
    doc = Document(id="1", title="Hello", content="Hello World!")
    await knowledge_base.update(doc.to_json_data_model())

    # Retrieve by ID
    result = await knowledge_base.get("1", table_name="Document")

    # Full-text search
    results = await knowledge_base.fulltext_search(
        "Hello", table_name="Document", k=10
    )
    ```

    ### With Vector Similarity Search

    ```python
    embedding_model = synalinks.EmbeddingModel(
        model="ollama/mxbai-embed-large"
    )

    knowledge_base = synalinks.KnowledgeBase(
        uri="duckdb://./my_database.db",
        data_models=[Document],
        embedding_model=embedding_model,
        metric="cosine",
    )

    # Hybrid search (combines BM25 fulltext + vector similarity, fused by RRF)
    results = await knowledge_base.hybrid_fts_search(
        "semantic query", table_name="Document", k=10
    )
    ```

    ### Retrieving Table Definitions

    ```python
    # Get all symbolic data models (table definitions) from the database
    symbolic_models = knowledge_base.get_symbolic_data_models()

    for model in symbolic_models:
        print(model.get_schema())
        # {'title': 'Document', 'type': 'object', 'properties': {...}, ...}
    ```

    Args:
        uri (str): The database connection URI. Use "duckdb://path/to/db.db"
            for DuckDB. If not provided, uses an in-memory database.
        data_models (list): Optional list of DataModel or SymbolicDataModel
            classes to create tables for.
        embedding_model (EmbeddingModel): Optional embedding model for
            vector similarity search.
        metric (str): The distance metric for vector search.
            Options: "cosine", "l2seq", "ip" (default: "cosine").
        wipe_on_start (bool): Whether to clear the database on initialization
            (default: False).
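        name (str): Optional name for the knowledge base (used for serialization).
        encryption_key: Optional key for an encrypted database file. It is
            passed to the adapter only and is never stored on the instance
            or included in `get_config()`; re-supply it when reopening the file.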
    """

    def __init__(
        self,
        *,
        uri=None,
        data_models=None,
        embedding_model=None,
        metric="cosine",
        wipe_on_start=False,
        name=None,
        encryption_key=None,
        **kwargs,
    ):
        self.adapter = database_adapters.get(uri)(
            uri=uri,
            data_models=data_models,
            embedding_model=embedding_model,
            metric=metric,
            wipe_on_start=wipe_on_start,
            name=name,
            encryption_key=encryption_key,
            **kwargs,
        )
        self.uri = uri
        self.data_models = data_models or []
        self.embedding_model = _get_em(embedding_model)
        self.metric = metric
        self.wipe_on_start = wipe_on_start
        if not name:
            self.name = auto_name("knowledge_base")
        else:
            self.name = name
        # `encryption_key` is deliberately NOT stored on `self` — it
        # lives only inside the adapter, and only as long as the
        # adapter does. This keeps the secret out of `get_config()`,
        # off-screen during repr/print, and unreferenced by any
        # serialization path. Callers must re-supply the key when
        # constructing a new KnowledgeBase against an encrypted file.

    async def update(
        self,
        data_model_or_data_models: Union[Any, List[Any], Dataset],
        *,
        verbose="auto",
    ) -> Union[Any, List[Any]]:
        """Insert or update records in the knowledge base.

        Args:
            data_model_or_data_models (JsonDataModel | List[JsonDataModel] | Dataset):
                A single ``JsonDataModel``, a list of ``JsonDataModel`` /
                ``DataModel`` instances, or a synalinks ``Dataset``.
                The ``Dataset`` form streams the source batch-by-batch
                (one ``adapter.update`` call per yielded batch) so memory
                stays bounded for large CSV / Parquet / HuggingFace
                sources. The dataset must be inputs-only — no
                ``output_template`` — because the knowledge base stores
                records, not ``(input, target)`` pairs; pass a
                labeled dataset and you'll get a ``ValueError``.

                Uses the first field as the primary key for upserts.
            verbose (int | str): ``"auto"``, ``0``, ``1``, or ``2``.
                Verbosity for the ``Dataset`` path; matches the
                trainer's ``fit()`` semantics. ``"auto"`` (default)
                resolves to ``1`` when a ``Dataset`` is passed (a
                per-batch progress bar — same widget ``fit()`` uses,
                with ETA when ``len(dataset)`` is known) and is a
                no-op for the scalar / list forms, which finish in a
                single adapter call.

        Returns:
            The primary key value(s) of the inserted/updated records.
            Scalar in / scalar out; list in / list out; ``Dataset`` in /
            flat list of every batch's ids concatenated.
        """
        if isinstance(data_model_or_data_models, Dataset):
            return await self._update_from_dataset(
                data_model_or_data_models, verbose=verbose
            )
        return await self.adapter.update(data_model_or_data_models)

    async def _update_from_dataset(
        self, dataset: Dataset, *, verbose="auto"
    ) -> List[Any]:
        """Stream a ``Dataset`` into the adapter one batch at a time.

        Each batch yielded by the dataset is converted to a list of
        DataModel / JsonDataModel instances and handed to
        ``adapter.update``. The returned ids from every batch are
        accumulated into one flat list — same order as the dataset
        produced them.

        Inputs-only is enforced: a dataset configured with an
        ``output_template`` represents ``(input, target)`` training
        data, which isn't what the knowledge base stores. The check is
        the dataset's public ``output_template`` attribute, not the
        per-batch tuple length — so the rejection happens upfront,
        before any rows are consumed.
        """
        if dataset.output_template is not None:
            raise ValueError(
                "KnowledgeBase.update accepts only inputs-only datasets "
                "(no `output_template`). The knowledge base stores "
                "records, not (input, target) pairs."
            )

        # "auto" → 1 in the Dataset branch (we know there's iteration to
        # display). Outside this branch verbose is dead anyway.
        if verbose == "auto":
            verbose = 1

        progbar = None
        if verbose:
            try:
                target = len(dataset)
            except (TypeError, NotImplementedError):
                target = None
            progbar = Progbar(target=target, verbose=verbose, unit_name="batch")

        ids: List[Any] = []
        step = 0
        for batch in dataset:
            x = batch[0]
            if len(x) == 0:
                continue
            batch_ids = await self.adapter.update(list(x))
            if isinstance(batch_ids, list):
                ids.extend(batch_ids)
            else:
                ids.append(batch_ids)
            step += 1
            if progbar is not None:
                progbar.update(step, values=[("rows", len(ids))])
        if progbar is not None:
            progbar.update(step, values=[("rows", len(ids))], finalize=True)
        return ids

    async def from_csv(
        self,
        path: str,
        *,
        table_name: Optional[str] = None,
        table_description: Optional[str] = None,
        delimiter: str = ",",
        encoding: str = "utf-8",
        header: bool = True,
    ) -> Any:
        """Bulk-load a CSV file directly into the knowledge base.

        Skips the Python row pipeline entirely (no Pydantic, no Jinja,
        no per-row INSERT) and instead delegates to the database's
        native CSV reader. Roughly two orders of magnitude faster than
        ``update(CSVDataset(...))`` for non-trivial files — see
        ``benchmarks/bench_kb_ingest.py``.

        The target table's schema is inferred directly from the
        file's columns, with the first column promoted to PRIMARY
        KEY. The returned :class:`SymbolicDataModel` is the handle
        you pass to subsequent search / get calls — you don't need
        to pre-declare a ``DataModel`` for this table.

        Use the streaming ``update(<...>Dataset(...))`` path instead
        when source rows need transformation before storage (column
        renames, derived fields, HuggingFace datasets, etc.).

        Args:
            path: Path to the CSV file.
            table_name: Target table name. Defaults to the file's stem
                (``/data/my-docs.csv`` → ``MyDocs``). Whatever value
                lands here is always normalized to PascalCase.
            table_description: Optional natural-language description
                attached to the resulting schema.
            delimiter: Field delimiter. Defaults to ``","``.
            encoding: File encoding. Defaults to ``"utf-8"``.
            header: Whether the first row is a header. Defaults to
                ``True``.

        Returns:
            The :class:`SymbolicDataModel` for the loaded table.
        """
        return await self.adapter.from_csv(
            path,
            table_name=table_name,
            table_description=table_description,
            delimiter=delimiter,
            encoding=encoding,
            header=header,
        )

    async def from_parquet(
        self,
        path: str,
        *,
        table_name: Optional[str] = None,
        table_description: Optional[str] = None,
    ) -> Any:
        """Bulk-load a Parquet file directly into the knowledge base.

        Same trade-offs as :meth:`from_csv` — bypasses the Python row
        pipeline for native database ingestion. Parquet's schema is
        explicit in the file footer so there is no type-inference
        guesswork to worry about.

        Args:
            path: Path to the Parquet file.
            table_name: Target table name. Defaults to the file's stem
                coerced to PascalCase.
            table_description: Optional schema description.

        Returns:
            The :class:`SymbolicDataModel` for the loaded table.
        """
        return await self.adapter.from_parquet(
            path, table_name=table_name, table_description=table_description
        )

    async def from_json(
        self,
        path: str,
        *,
        table_name: Optional[str] = None,
        table_description: Optional[str] = None,
    ) -> Any:
        """Bulk-load a JSON file (top-level array of objects).

        Same trade-offs as :meth:`from_csv` / :meth:`from_parquet` —
        bypasses the Python row pipeline. The file must contain a
        top-level JSON array. Use :meth:`from_jsonl` for the
        one-object-per-line NDJSON format.

        Args:
            path: Path to the JSON file.
            table_name: Target table name. Defaults to the file's stem
                coerced to PascalCase.
            table_description: Optional schema description.

        Returns:
            The :class:`SymbolicDataModel` for the loaded table.
        """
        return await self.adapter.from_json(
            path, table_name=table_name, table_description=table_description
        )

    async def from_jsonl(
        self,
        path: str,
        *,
        table_name: Optional[str] = None,
        table_description: Optional[str] = None,
    ) -> Any:
        """Bulk-load a JSON Lines (NDJSON) file.

        Same trade-offs as :meth:`from_csv` / :meth:`from_parquet`,
        and the right call for very large JSON sources that aren't
        a single array.

        Args:
            path: Path to the JSONL file.
            table_name: Target table name. Defaults to the file's stem
                coerced to PascalCase.
            table_description: Optional schema description.

        Returns:
            The :class:`SymbolicDataModel` for the loaded table.
        """
        return await self.adapter.from_jsonl(
            path, table_name=table_name, table_description=table_description
        )

    async def rename(
        self,
        source: Any,
        *,
        table_name: Optional[str] = None,
        table_description: Optional[str] = None,
    ) -> Any:
        """Rename a table and/or update its description.

        Pass at least one of ``table_name`` / ``table_description``.
        When ``table_name`` is given the underlying table is
        renamed via ``ALTER TABLE …``, the FTS / vector indexes are
        rebuilt under the new name, and the adapter's known-models
        list is updated so subsequent default-table searches find
        the table under its new identity.

        Args:
            source: ``SymbolicDataModel`` or table-name string for
                the table to rename. The string form is itself
                PascalCase-normalized, so callers can pass the
                same input they used in :meth:`from_csv` (e.g.
                ``"my-docs"``).
            table_name: New table name. Always normalized to
                PascalCase.
            table_description: Optional natural-language description
                attached to the resulting schema.

        Returns:
            A fresh :class:`SymbolicDataModel` for the (possibly
            renamed) table.
        """
        return await self.adapter.rename(
            source,
            table_name=table_name,
            table_description=table_description,
        )

    async def get(
        self,
        id_or_ids: Union[Any, List[Any]],
        *,
        table_name: str,
    ) -> Union[Optional[Any], List[Optional[Any]]]:
        """Retrieve one or more records by primary key from a single table.

        Args:
            id_or_ids: A single primary key value, or a list of values.
            table_name: Target table.

        Returns:
            A single JsonDataModel (or ``None``) when called with one id;
            a list of JsonDataModels (with ``None`` in the slots that did
            not match) when called with a list.
        """
        return await self.adapter.get(id_or_ids, table_name=table_name)

    async def getall(
        self,
        *,
        table_name: str,
        limit: int = 50,
        offset: int = 0,
    ) -> List[Any]:
        """Retrieve all records from a table with pagination.

        Args:
            table_name: Target table.
            limit: Maximum number of records to return (default: 50).
            offset: Number of records to skip (default: 0).

        Returns:
            List of JsonDataModels.
        """
        return await self.adapter.getall(
            table_name=table_name, limit=limit, offset=offset
        )

    async def delete(
        self,
        id_or_ids: Union[Any, List[Any]],
        *,
        table_name: str,
    ) -> int:
        """Delete records by primary key from a single table.

        Pass a single id or a list. The FTS / vector indexes for the
        table are rebuilt afterwards so subsequent search calls
        don't return ghost rows.

        Args:
            id_or_ids: Primary key value, or a list of values.
            table_name: Target table.

        Returns:
            The number of rows actually deleted (0 if no id matched).
        """
        return await self.adapter.delete(
            id_or_ids, table_name=table_name
        )

    async def drop_table(self, table_name: str) -> bool:
        """Drop a table from the knowledge base.

        Removes the table's rows, FTS index, and HNSW vector index,
        then drops the table itself. Also forgets the table in the
        adapter's known-models list.

        Args:
            table_name: Target table.

        Returns:
            ``True`` if a table was dropped, ``False`` if it didn't
            exist to begin with.
        """
        return await self.adapter.drop_table(table_name)

    async def query(
        self,
        query: str,
        params: Optional[Dict[str, Any]] = None,
        output_format: str = "json",
        **kwargs,
    ) -> Union[List[Dict[str, Any]], str]:
        """Execute a raw SQL query against the knowledge base.

        Args:
            query (str): The SQL query to execute.
            params (dict): Optional dict of parameters for parameterized queries.
            output_format: ``"json"`` (default, list of dicts —
                JSON-shaped Python data) or ``"csv"`` (CSV string,
                useful when handing the result to an LM).
            **kwargs (Any): Additional options. The most important one is
                ``read_only=True/False``. When ``True`` (the DuckDB adapter's
                default) two layers of defence apply:

                1. The SQL is parsed with the engine's own parser and any
                   non-``SELECT`` statement is rejected. This catches
                   multi-statement injection (e.g. ``SELECT 1; DROP TABLE x``),
                   ``COPY ... TO 'file'`` exfiltration, ``ATTACH``, ``EXPORT``,
                   and other side-effecting statements. This is the only
                   layer that blocks writes — the adapter's underlying
                   connection is read-write (one connection per adapter,
                   reused across operations), so the parser check is what
                   keeps untrusted SQL read-only.
                2. ``enable_external_access`` is disabled on that connection
                   at construction time, so ``SELECT`` table functions that
                   touch the host filesystem or network — ``read_csv``,
                   ``read_parquet``, ``read_json``, ``read_blob``,
                   ``read_text``, ``glob`` and the httpfs/S3 variants —
                   return a permission error instead of leaking files.
                   Without this layer,
                   ``SELECT * FROM read_csv('/etc/passwd', ...)`` would pass
                   defence (1) because it is a syntactically valid ``SELECT``.

                Pass ``read_only=False`` only from trusted call sites that
                genuinely need to mutate state. Those paths still run on
                the same sandboxed connection (no external I/O), but they
                bypass the parser check, so any SQL is accepted — keep them
                out of the LM-tool-call surface.

        Returns:
            (Union[List[Dict[str, Any]], str]): A list of dicts when ``output_format="json"``, or a CSV string when ``output_format="csv"``.
        """
        return await self.adapter.query(
            query, params=params, output_format=output_format, **kwargs
        )

    async def similarity_search(
        self,
        text_or_texts: Union[str, List[str]],
        *,
        table_name: str,
        k: int = 10,
        threshold: Optional[float] = None,
        output_format: str = "json",
    ):
        """Vector similarity search against a single table.

        Args:
            text_or_texts: Query text or list of query texts.
            table_name: Target table (single-table search).
            k: Maximum number of results to return.
            threshold: Optional maximum vector-distance threshold.
            output_format: ``"json"`` (default, list of dicts —
                JSON-shaped Python data) or ``"csv"`` (CSV string,
                useful for handing results to an LM since CSV is
                ~30-50% fewer tokens than equivalent JSON).
        """
        return await self.adapter.similarity_search(
            text_or_texts,
            table_name=table_name,
            k=k,
            threshold=threshold,
            output_format=output_format,
        )

    async def fulltext_search(
        self,
        text_or_texts: Union[str, List[str]],
        *,
        table_name: str,
        k: int = 10,
        threshold: Optional[float] = None,
        output_format: str = "json",
    ):
        """BM25 full-text search against a single table.

        Args:
            text_or_texts: Query text or list of query texts.
            table_name: Target table.
            k: Maximum number of results.
            threshold: Optional minimum BM25 score.
            output_format: ``"json"`` (list of dicts, default) / ``"csv"`` (text).
        """
        return await self.adapter.fulltext_search(
            text_or_texts,
            table_name=table_name,
            k=k,
            threshold=threshold,
            output_format=output_format,
        )

    async def regex_search(
        self,
        pattern: str,
        *,
        table_name: str,
        fields: Optional[List[str]] = None,
        case_sensitive: bool = True,
        k: int = 10,
        output_format: str = "json",
    ):
        """Find rows whose string fields match a regular expression.

        DuckDB evaluates regexes with RE2, so patterns are linear-time
        and not vulnerable to catastrophic backtracking.

        Args:
            pattern: The regex pattern (RE2 syntax).
            table_name: Target table.
            fields: Field names to match against. Defaults to every
                string field on the schema. Names are snake_case-
                normalized to match stored column names.
            case_sensitive: When ``False``, match case-insensitively.
            k: Maximum number of results.
            output_format: ``"json"`` (list of dicts, default) / ``"csv"`` (text).
        """
        return await self.adapter.regex_search(
            pattern,
            table_name=table_name,
            fields=fields,
            case_sensitive=case_sensitive,
            k=k,
            output_format=output_format,
        )

    async def hybrid_fts_search(
        self,
        text_or_texts: Union[str, List[str]],
        *,
        table_name: str,
        k: int = 10,
        k_rank: int = 60,
        similarity_threshold: Optional[float] = None,
        fulltext_threshold: Optional[float] = None,
        output_format: str = "json",
    ):
        """Reciprocal-Rank-Fusion of vector similarity + BM25 fulltext.

        Falls back to full-text-only when no embedding model is
        configured. The regex-side sibling is
        :meth:`hybrid_regex_search`.

        Args:
            text_or_texts: Query text or list of query texts.
            table_name: Target table.
            k: Maximum results.
            k_rank: RRF smoothing constant. Lower emphasizes top
                ranks more strongly (default: 60).
            similarity_threshold: Optional vector-distance threshold.
            fulltext_threshold: Optional BM25 threshold.
            output_format: ``"json"`` (list of dicts, default) / ``"csv"`` (text).
        """
        return await self.adapter.hybrid_fts_search(
            text_or_texts,
            table_name=table_name,
            k=k,
            k_rank=k_rank,
            similarity_threshold=similarity_threshold,
            fulltext_threshold=fulltext_threshold,
            output_format=output_format,
        )

    async def hybrid_search(self, *args, **kwargs):
        """Deprecated alias of :meth:`hybrid_fts_search`.

        Kept for backwards compatibility. The new name is symmetric
        with :meth:`hybrid_regex_search`; prefer it in new code.
        """
        return await self.hybrid_fts_search(*args, **kwargs)

    async def hybrid_regex_search(
        self,
        text_or_texts: Union[str, List[str]],
        pattern_or_patterns: Union[str, List[str], None] = None,
        *,
        table_name: str,
        k: int = 10,
        k_rank: int = 60,
        similarity_threshold: Optional[float] = None,
        fields: Optional[List[str]] = None,
        case_sensitive: bool = True,
        output_format: str = "json",
    ):
        """Reciprocal-Rank-Fusion of vector similarity + regex.

        The regex-side counterpart to :meth:`hybrid_fts_search` (which
        pairs vector with BM25 fulltext). The two signals are
        orthogonal: vectors capture semantic similarity, regex
        captures exact textual shape. Ranks are fused with the same
        RRF formula.

        Args:
            text_or_texts: Natural-language query (or list) for the
                vector side.
            pattern_or_patterns: RE2 pattern (or list) for the regex
                side. ``None`` falls back to plain similarity search.
            table_name: Target table.
            k: Maximum results.
            k_rank: RRF smoothing constant.
            similarity_threshold: Vector-distance threshold.
            fields: Forwarded to the regex side.
            case_sensitive: Forwarded to the regex side.
            output_format: ``"json"`` (list of dicts, default) / ``"csv"`` (text).
        """
        return await self.adapter.hybrid_regex_search(
            text_or_texts,
            pattern_or_patterns=pattern_or_patterns,
            table_name=table_name,
            k=k,
            k_rank=k_rank,
            similarity_threshold=similarity_threshold,
            fields=fields,
            case_sensitive=case_sensitive,
            output_format=output_format,
        )

    def get_symbolic_data_models(self) -> List[Any]:
        """Retrieve all symbolic data models (table definitions) from the database.

        Returns a list of SymbolicDataModel objects representing each table
        in the database. This is useful for introspecting the database schema
        or for passing to search methods to limit the search scope.

        Returns:
            list: List of symbolic data models representing the database tables.

        Example:
            ```python
            symbolic_models = knowledge_base.get_symbolic_data_models()
            for model in symbolic_models:
                schema = model.get_schema()
                print(f"Table: {schema['title']}")
                print(f"Fields: {list(schema['properties'].keys())}")
            ```
        """
        return self.adapter.get_symbolic_data_models()

    def get_config(self):
        config = {
            "uri": self.uri,
            "name": self.name,
            "metric": self.metric,
            "wipe_on_start": self.wipe_on_start,
        }
        data_models_config = {
            "data_models": [
                (
                    serialization_lib.serialize_synalinks_object(
                        data_model.to_symbolic_data_model(
                            name="data_model" + (f"_{i}_" if i > 0 else "_") + self.name
                        )
                    )
                    if not is_symbolic_data_model(data_model)
                    else serialization_lib.serialize_synalinks_object(data_model)
                )
                for i, data_model in enumerate(self.data_models)
            ]
        }
        embedding_model_config = {}
        if self.embedding_model:
            embedding_model_config = {
                "embedding_model": serialization_lib.serialize_synalinks_object(
                    self.embedding_model,
                )
            }
        return {
            **data_models_config,
            **embedding_model_config,
            **config,
        }

    @classmethod
    def from_config(cls, config):
        data_models_config = config.pop("data_models", [])
        data_models = [
            serialization_lib.deserialize_synalinks_object(data_model)
            for data_model in data_models_config
        ]
        embedding_model = None
        if "embedding_model" in config:
            embedding_model = serialization_lib.deserialize_synalinks_object(
                config.pop("embedding_model"),
            )
        return cls(
            data_models=data_models,
            embedding_model=embedding_model,
            **config,
        )

delete(id_or_ids, *, table_name) async

Delete records by primary key from a single table.

Pass a single id or a list. The FTS / vector indexes for the table are rebuilt afterwards so subsequent search calls don't return ghost rows.

Parameters:

Name Type Description Default
id_or_ids Union[Any, List[Any]]

Primary key value, or a list of values.

required
table_name str

Target table.

required

Returns:

Type Description
int

The number of rows actually deleted (0 if no id matched).

Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def delete(
    self,
    id_or_ids: Union[Any, List[Any]],
    *,
    table_name: str,
) -> int:
    """Delete records by primary key from a single table.

    Pass a single id or a list. The FTS / vector indexes for the
    table are rebuilt afterwards so subsequent search calls
    don't return ghost rows.

    Args:
        id_or_ids: Primary key value, or a list of values.
        table_name: Target table.

    Returns:
        The number of rows actually deleted (0 if no id matched).
    """
    return await self.adapter.delete(
        id_or_ids, table_name=table_name
    )

drop_table(table_name) async

Drop a table from the knowledge base.

Removes the table's rows, FTS index, and HNSW vector index, then drops the table itself. Also forgets the table in the adapter's known-models list.

Parameters:

Name Type Description Default
table_name str

Target table.

required

Returns:

Type Description
bool

True if a table was dropped, False if it didn't exist to begin with.

Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def drop_table(self, table_name: str) -> bool:
    """Drop a table from the knowledge base.

    Removes the table's rows, FTS index, and HNSW vector index,
    then drops the table itself. Also forgets the table in the
    adapter's known-models list.

    Args:
        table_name: Target table.

    Returns:
        ``True`` if a table was dropped, ``False`` if it didn't
        exist to begin with.
    """
    return await self.adapter.drop_table(table_name)

from_csv(path, *, table_name=None, table_description=None, delimiter=',', encoding='utf-8', header=True) async

Bulk-load a CSV file directly into the knowledge base.

Skips the Python row pipeline entirely (no Pydantic, no Jinja, no per-row INSERT) and instead delegates to the database's native CSV reader. Roughly two orders of magnitude faster than update(CSVDataset(...)) for non-trivial files — see benchmarks/bench_kb_ingest.py.

The target table's schema is inferred directly from the file's columns, with the first column promoted to PRIMARY KEY. The returned SymbolicDataModel is the handle you pass to subsequent search / get calls, so you don't need to pre-declare a DataModel for this table.

Use the streaming update(<...>Dataset(...)) path instead when source rows need transformation before storage (column renames, derived fields, HuggingFace datasets, etc.).

Parameters:

Name Type Description Default
path str

Path to the CSV file.

required
table_name Optional[str]

Target table name. Defaults to the file's stem (/data/my-docs.csv → MyDocs). Whatever value lands here is always normalized to PascalCase.

None
table_description Optional[str]

Optional natural-language description attached to the resulting schema.

None
delimiter str

Field delimiter. Defaults to ",".

','
encoding str

File encoding. Defaults to "utf-8".

'utf-8'
header bool

Whether the first row is a header. Defaults to True.

True

Returns:

Type Description
Any

The SymbolicDataModel for the loaded table.

Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def from_csv(
    self,
    path: str,
    *,
    table_name: Optional[str] = None,
    table_description: Optional[str] = None,
    delimiter: str = ",",
    encoding: str = "utf-8",
    header: bool = True,
) -> Any:
    """Bulk-load a CSV file directly into the knowledge base.

    Skips the Python row pipeline entirely (no Pydantic, no Jinja,
    no per-row INSERT) and instead delegates to the database's
    native CSV reader. Roughly two orders of magnitude faster than
    ``update(CSVDataset(...))`` for non-trivial files — see
    ``benchmarks/bench_kb_ingest.py``.

    The target table's schema is inferred directly from the
    file's columns, with the first column promoted to PRIMARY
    KEY. The returned :class:`SymbolicDataModel` is the handle
    you pass to subsequent search / get calls — you don't need
    to pre-declare a ``DataModel`` for this table.

    Use the streaming ``update(<...>Dataset(...))`` path instead
    when source rows need transformation before storage (column
    renames, derived fields, HuggingFace datasets, etc.).

    Args:
        path: Path to the CSV file.
        table_name: Target table name. Defaults to the file's stem
            (``/data/my-docs.csv`` → ``MyDocs``). Whatever value
            lands here is always normalized to PascalCase.
        table_description: Optional natural-language description
            attached to the resulting schema.
        delimiter: Field delimiter. Defaults to ``","``.
        encoding: File encoding. Defaults to ``"utf-8"``.
        header: Whether the first row is a header. Defaults to
            ``True``.

    Returns:
        The :class:`SymbolicDataModel` for the loaded table.
    """
    return await self.adapter.from_csv(
        path,
        table_name=table_name,
        table_description=table_description,
        delimiter=delimiter,
        encoding=encoding,
        header=header,
    )
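
The default table name described above (file stem, normalized to PascalCase) can be sketched in a few lines. This is only an illustration: `to_pascal_case` and `default_table_name` are hypothetical helpers, not functions from the library, and the real normalization rules may treat edge cases (digits, existing camelCase, Unicode) differently.

```python
import re
from pathlib import Path


def to_pascal_case(name: str) -> str:
    """Hypothetical helper: split on non-alphanumerics, capitalize each part."""
    parts = [part for part in re.split(r"[^0-9A-Za-z]+", name) if part]
    return "".join(part[:1].upper() + part[1:] for part in parts)


def default_table_name(path: str) -> str:
    """Derive a table name from the stem of a file path (assumed behaviour)."""
    return to_pascal_case(Path(path).stem)


print(default_table_name("/data/my-docs.csv"))  # MyDocs
```

Because the name is normalized, the table should be addressed as `MyDocs` in later calls, whatever spelling the file name used.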

from_json(path, *, table_name=None, table_description=None) async

Bulk-load a JSON file (top-level array of objects).

Same trade-offs as from_csv / from_parquet: it bypasses the Python row pipeline. The file must contain a top-level JSON array. Use from_jsonl for the one-object-per-line NDJSON format.

Parameters:

Name Type Description Default
path str

Path to the JSON file.

required
table_name Optional[str]

Target table name. Defaults to the file's stem coerced to PascalCase.

None
table_description Optional[str]

Optional schema description.

None

Returns:

Type Description
Any

The SymbolicDataModel for the loaded table.

Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def from_json(
    self,
    path: str,
    *,
    table_name: Optional[str] = None,
    table_description: Optional[str] = None,
) -> Any:
    """Bulk-load a JSON file (top-level array of objects).

    Same trade-offs as :meth:`from_csv` / :meth:`from_parquet` —
    bypasses the Python row pipeline. The file must contain a
    top-level JSON array. Use :meth:`from_jsonl` for the
    one-object-per-line NDJSON format.

    Args:
        path: Path to the JSON file.
        table_name: Target table name. Defaults to the file's stem
            coerced to PascalCase.
        table_description: Optional schema description.

    Returns:
        The :class:`SymbolicDataModel` for the loaded table.
    """
    return await self.adapter.from_json(
        path, table_name=table_name, table_description=table_description
    )

from_jsonl(path, *, table_name=None, table_description=None) async

Bulk-load a JSON Lines (NDJSON) file.

Same trade-offs as from_csv / from_parquet, and the right call for very large JSON sources that aren't a single array.

Parameters:

Name Type Description Default
path str

Path to the JSONL file.

required
table_name Optional[str]

Target table name. Defaults to the file's stem coerced to PascalCase.

None
table_description Optional[str]

Optional schema description.

None

Returns:

Type Description
Any

The SymbolicDataModel for the loaded table.

Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def from_jsonl(
    self,
    path: str,
    *,
    table_name: Optional[str] = None,
    table_description: Optional[str] = None,
) -> Any:
    """Bulk-load a JSON Lines (NDJSON) file.

    Same trade-offs as :meth:`from_csv` / :meth:`from_parquet`,
    and the right call for very large JSON sources that aren't
    a single array.

    Args:
        path: Path to the JSONL file.
        table_name: Target table name. Defaults to the file's stem
            coerced to PascalCase.
        table_description: Optional schema description.

    Returns:
        The :class:`SymbolicDataModel` for the loaded table.
    """
    return await self.adapter.from_jsonl(
        path, table_name=table_name, table_description=table_description
    )
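
To make the difference between the two JSON layouts concrete, the sketch below writes the same records once as a top-level array (the shape `from_json` expects) and once as one object per line (the shape `from_jsonl` expects). It uses only the standard library; the file names and records are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

records = [
    {"id": "1", "title": "Hello"},
    {"id": "2", "title": "World"},
]

tmp_dir = Path(tempfile.mkdtemp())

# JSON: a single top-level array of objects.
json_path = tmp_dir / "docs.json"
json_path.write_text(json.dumps(records), encoding="utf-8")

# JSON Lines (NDJSON): one object per line, no enclosing array.
jsonl_path = tmp_dir / "docs.jsonl"
jsonl_path.write_text(
    "\n".join(json.dumps(record) for record in records) + "\n",
    encoding="utf-8",
)

# Both files carry the same records; only the framing differs.
from_array = json.loads(json_path.read_text(encoding="utf-8"))
from_lines = [
    json.loads(line)
    for line in jsonl_path.read_text(encoding="utf-8").splitlines()
    if line.strip()
]
```

The line-oriented form can be produced and consumed incrementally, which is why it suits sources too large to hold as one array.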

from_parquet(path, *, table_name=None, table_description=None) async

Bulk-load a Parquet file directly into the knowledge base.

Same trade-offs as from_csv: it bypasses the Python row pipeline for native database ingestion. Parquet's schema is explicit in the file footer, so there is no type-inference guesswork to worry about.

Parameters:

Name Type Description Default
path str

Path to the Parquet file.

required
table_name Optional[str]

Target table name. Defaults to the file's stem coerced to PascalCase.

None
table_description Optional[str]

Optional schema description.

None

Returns:

Type Description
Any

The SymbolicDataModel for the loaded table.

Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def from_parquet(
    self,
    path: str,
    *,
    table_name: Optional[str] = None,
    table_description: Optional[str] = None,
) -> Any:
    """Bulk-load a Parquet file directly into the knowledge base.

    Same trade-offs as :meth:`from_csv` — bypasses the Python row
    pipeline for native database ingestion. Parquet's schema is
    explicit in the file footer so there is no type-inference
    guesswork to worry about.

    Args:
        path: Path to the Parquet file.
        table_name: Target table name. Defaults to the file's stem
            coerced to PascalCase.
        table_description: Optional schema description.

    Returns:
        The :class:`SymbolicDataModel` for the loaded table.
    """
    return await self.adapter.from_parquet(
        path, table_name=table_name, table_description=table_description
    )

fulltext_search(text_or_texts, *, table_name, k=10, threshold=None, output_format='json') async

BM25 full-text search against a single table.

Parameters:

Name Type Description Default
text_or_texts Union[str, List[str]]

Query text or list of query texts.

required
table_name str

Target table.

required
k int

Maximum number of results.

10
threshold Optional[float]

Optional minimum BM25 score.

None
output_format str

"json" (list of dicts, default) / "csv" (text).

'json'
Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def fulltext_search(
    self,
    text_or_texts: Union[str, List[str]],
    *,
    table_name: str,
    k: int = 10,
    threshold: Optional[float] = None,
    output_format: str = "json",
):
    """BM25 full-text search against a single table.

    Args:
        text_or_texts: Query text or list of query texts.
        table_name: Target table.
        k: Maximum number of results.
        threshold: Optional minimum BM25 score.
        output_format: ``"json"`` (list of dicts, default) / ``"csv"`` (text).
    """
    return await self.adapter.fulltext_search(
        text_or_texts,
        table_name=table_name,
        k=k,
        threshold=threshold,
        output_format=output_format,
    )

get(id_or_ids, *, table_name) async

Retrieve one or more records by primary key from a single table.

Parameters:

Name Type Description Default
id_or_ids Union[Any, List[Any]]

A single primary key value, or a list of values.

required
table_name str

Target table.

required

Returns:

Type Description
Union[Optional[Any], List[Optional[Any]]]

A single JsonDataModel (or None) when called with one id; a list of JsonDataModels (with None in the slots that did not match) when called with a list.

Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def get(
    self,
    id_or_ids: Union[Any, List[Any]],
    *,
    table_name: str,
) -> Union[Optional[Any], List[Optional[Any]]]:
    """Retrieve one or more records by primary key from a single table.

    Args:
        id_or_ids: A single primary key value, or a list of values.
        table_name: Target table.

    Returns:
        A single JsonDataModel (or ``None``) when called with one id;
        a list of JsonDataModels (with ``None`` in the slots that did
        not match) when called with a list.
    """
    return await self.adapter.get(id_or_ids, table_name=table_name)
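
The scalar-versus-list return shape can be mimicked with a plain dictionary. In this minimal sketch, `rows` is a hypothetical in-memory stand-in for a table and `get` is a local function, not the library method. It assumes, as the wording "slots" suggests, that results line up positionally with the requested ids.

```python
from typing import Any, Dict, List, Optional, Union

# Hypothetical in-memory stand-in for a table keyed by primary key.
rows: Dict[str, Dict[str, Any]] = {
    "1": {"id": "1", "title": "Hello"},
    "3": {"id": "3", "title": "World"},
}


def get(
    id_or_ids: Union[Any, List[Any]],
) -> Union[Optional[dict], List[Optional[dict]]]:
    """Scalar in, scalar out; list in, list out with None for misses."""
    if isinstance(id_or_ids, list):
        return [rows.get(key) for key in id_or_ids]
    return rows.get(id_or_ids)


results = get(["1", "2", "3"])
# Filter out the misses before using the records.
found = [record["id"] for record in results if record is not None]
```

Callers that pass a list should therefore check each slot for `None` rather than assume the result is shorter when ids are missing.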

get_symbolic_data_models()

Retrieve all symbolic data models (table definitions) from the database.

Returns a list of SymbolicDataModel objects representing each table in the database. This is useful for introspecting the database schema or for passing to search methods to limit the search scope.

Returns:

Name Type Description
list List[Any]

List of symbolic data models representing the database tables.

Example
symbolic_models = knowledge_base.get_symbolic_data_models()
for model in symbolic_models:
    schema = model.get_schema()
    print(f"Table: {schema['title']}")
    print(f"Fields: {list(schema['properties'].keys())}")
Source code in synalinks/src/knowledge_bases/knowledge_base.py
def get_symbolic_data_models(self) -> List[Any]:
    """Retrieve all symbolic data models (table definitions) from the database.

    Returns a list of SymbolicDataModel objects representing each table
    in the database. This is useful for introspecting the database schema
    or for passing to search methods to limit the search scope.

    Returns:
        list: List of symbolic data models representing the database tables.

    Example:
        ```python
        symbolic_models = knowledge_base.get_symbolic_data_models()
        for model in symbolic_models:
            schema = model.get_schema()
            print(f"Table: {schema['title']}")
            print(f"Fields: {list(schema['properties'].keys())}")
        ```
    """
    return self.adapter.get_symbolic_data_models()

getall(*, table_name, limit=50, offset=0) async

Retrieve all records from a table with pagination.

Parameters:

Name Type Description Default
table_name str

Target table.

required
limit int

Maximum number of records to return (default: 50).

50
offset int

Number of records to skip (default: 0).

0

Returns:

Type Description
List[Any]

List of JsonDataModels.

Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def getall(
    self,
    *,
    table_name: str,
    limit: int = 50,
    offset: int = 0,
) -> List[Any]:
    """Retrieve all records from a table with pagination.

    Args:
        table_name: Target table.
        limit: Maximum number of records to return (default: 50).
        offset: Number of records to skip (default: 0).

    Returns:
        List of JsonDataModels.
    """
    return await self.adapter.getall(
        table_name=table_name, limit=limit, offset=offset
    )
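
Since `getall` returns at most `limit` records per call, reading a whole table means advancing `offset` until a short page comes back. The sketch below shows that loop against a fake page function; `iter_all` and `fake_getall` are hypothetical names. It also assumes rows come back in a stable order between calls, which the text above does not state.

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable, List

FetchPage = Callable[[int, int], Awaitable[List[Any]]]


async def iter_all(fetch_page: FetchPage, page_size: int = 50) -> AsyncIterator[Any]:
    """Yield every record, advancing the offset until a short page is returned."""
    offset = 0
    while True:
        page = await fetch_page(page_size, offset)
        for record in page:
            yield record
        if len(page) < page_size:
            break
        offset += page_size


# Hypothetical stand-in for a call such as
# knowledge_base.getall(table_name="Document", limit=limit, offset=offset).
_FAKE_TABLE = [{"id": str(i)} for i in range(120)]


async def fake_getall(limit: int, offset: int) -> List[Any]:
    return _FAKE_TABLE[offset : offset + limit]


async def collect() -> List[Any]:
    return [record async for record in iter_all(fake_getall, page_size=50)]


all_records = asyncio.run(collect())
```

With a real knowledge base, `fake_getall` would be replaced by a small wrapper that forwards `limit` and `offset` to `getall` for a fixed `table_name`.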

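hybrid_fts_search(text_or_texts, *, table_name, k=10, k_rank=60, similarity_threshold=None, fulltext_threshold=None, output_format='json') async

Reciprocal-Rank-Fusion of vector similarity + BM25 fulltext.

Falls back to full-text-only when no embedding model is configured. The regex-side sibling is hybrid_regex_search.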

Parameters:

Name Type Description Default
text_or_texts Union[str, List[str]]

Query text or list of query texts.

required
table_name str

Target table.

required
k int

Maximum results.

10
k_rank int

RRF smoothing constant. Lower emphasizes top ranks more strongly (default: 60).

60
similarity_threshold Optional[float]

Optional vector-distance threshold.

None
fulltext_threshold Optional[float]

Optional BM25 threshold.

None
output_format str

"json" (list of dicts, default) / "csv" (text).

'json'
Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def hybrid_fts_search(
    self,
    text_or_texts: Union[str, List[str]],
    *,
    table_name: str,
    k: int = 10,
    k_rank: int = 60,
    similarity_threshold: Optional[float] = None,
    fulltext_threshold: Optional[float] = None,
    output_format: str = "json",
):
    """Reciprocal-Rank-Fusion of vector similarity + BM25 fulltext.

    Falls back to full-text-only when no embedding model is
    configured. The regex-side sibling is
    :meth:`hybrid_regex_search`.

    Args:
        text_or_texts: Query text or list of query texts.
        table_name: Target table.
        k: Maximum results.
        k_rank: RRF smoothing constant. Lower emphasizes top
            ranks more strongly (default: 60).
        similarity_threshold: Optional vector-distance threshold.
        fulltext_threshold: Optional BM25 threshold.
        output_format: ``"json"`` (list of dicts, default) / ``"csv"`` (text).
    """
    return await self.adapter.hybrid_fts_search(
        text_or_texts,
        table_name=table_name,
        k=k,
        k_rank=k_rank,
        similarity_threshold=similarity_threshold,
        fulltext_threshold=fulltext_threshold,
        output_format=output_format,
    )
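
For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the commonly used formula, where each document scores the sum of `1 / (k_rank + rank)` over the ranked lists it appears in. It is not the adapter's implementation: `rrf_fuse` is a hypothetical helper, ranks are assumed to start at 1, and the library may differ in details such as tie-breaking.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def rrf_fuse(
    rankings: Sequence[Sequence[Hashable]],
    k_rank: int = 60,
    k: int = 10,
) -> List[Hashable]:
    """Fuse best-first id lists: score(d) = sum of 1 / (k_rank + rank)."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k_rank + rank)
    # Highest fused score first; ties keep first-seen order.
    ordered = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return ordered[:k]


vector_hits = ["a", "b", "c"]    # best-first ids from the vector side
fulltext_hits = ["b", "d", "a"]  # best-first ids from the BM25 side
fused = rrf_fuse([vector_hits, fulltext_hits])
print(fused)  # ['b', 'a', 'd', 'c']
```

Only ranks enter the score, so the two sides never need comparable raw scores. A smaller `k_rank` widens the gap between rank 1 and rank 2, which matches the note that lower values emphasize top ranks.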

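hybrid_regex_search(text_or_texts, pattern_or_patterns=None, *, table_name, k=10, k_rank=60, similarity_threshold=None, fields=None, case_sensitive=True, output_format='json') async

Reciprocal-Rank-Fusion of vector similarity + regex.

The regex-side counterpart to hybrid_fts_search (which pairs vector with BM25 fulltext). The two signals are orthogonal: vectors capture semantic similarity, regex captures exact textual shape. Ranks are fused with the same RRF formula.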

Parameters:

Name Type Description Default
text_or_texts Union[str, List[str]]

Natural-language query (or list) for the vector side.

required
pattern_or_patterns Union[str, List[str], None]

RE2 pattern (or list) for the regex side. None falls back to plain similarity search.

None
table_name str

Target table.

required
k int

Maximum results.

10
k_rank int

RRF smoothing constant.

60
similarity_threshold Optional[float]

Vector-distance threshold.

None
fields Optional[List[str]]

Forwarded to the regex side.

None
case_sensitive bool

Forwarded to the regex side.

True
output_format str

"json" (list of dicts, default) / "csv" (text).

'json'
Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def hybrid_regex_search(
    self,
    text_or_texts: Union[str, List[str]],
    pattern_or_patterns: Union[str, List[str], None] = None,
    *,
    table_name: str,
    k: int = 10,
    k_rank: int = 60,
    similarity_threshold: Optional[float] = None,
    fields: Optional[List[str]] = None,
    case_sensitive: bool = True,
    output_format: str = "json",
):
    """Reciprocal-Rank-Fusion of vector similarity + regex.

    The regex-side counterpart to :meth:`hybrid_fts_search` (which
    pairs vector with BM25 fulltext). The two signals are
    orthogonal: vectors capture semantic similarity, regex
    captures exact textual shape. Ranks are fused with the same
    RRF formula.

    Args:
        text_or_texts: Natural-language query (or list) for the
            vector side.
        pattern_or_patterns: RE2 pattern (or list) for the regex
            side. ``None`` falls back to plain similarity search.
        table_name: Target table.
        k: Maximum results.
        k_rank: RRF smoothing constant.
        similarity_threshold: Vector-distance threshold.
        fields: Forwarded to the regex side.
        case_sensitive: Forwarded to the regex side.
        output_format: ``"json"`` (list of dicts, default) / ``"csv"`` (text).
    """
    return await self.adapter.hybrid_regex_search(
        text_or_texts,
        pattern_or_patterns=pattern_or_patterns,
        table_name=table_name,
        k=k,
        k_rank=k_rank,
        similarity_threshold=similarity_threshold,
        fields=fields,
        case_sensitive=case_sensitive,
        output_format=output_format,
    )

hybrid_search(*args, **kwargs) async

Deprecated alias of hybrid_fts_search.

Kept for backwards compatibility. The new name, hybrid_fts_search, is symmetric with hybrid_regex_search; prefer it in new code.

Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def hybrid_search(self, *args, **kwargs):
    """Deprecated alias of :meth:`hybrid_fts_search`.

    Kept for backwards compatibility. The new name is symmetric
    with :meth:`hybrid_regex_search`; prefer it in new code.
    """
    return await self.hybrid_fts_search(*args, **kwargs)

query(query, params=None, output_format='json', **kwargs) async

Execute a raw SQL query against the knowledge base.

Parameters:

Name Type Description Default
query str

The SQL query to execute.

required
params dict

Optional dict of parameters for parameterized queries.

None
output_format str

"json" (default, list of dicts — JSON-shaped Python data) or "csv" (CSV string, useful when handing the result to an LM).

'json'
**kwargs Any

Additional options. The most important one is read_only=True/False. When True (the DuckDB adapter's default) two layers of defence apply:

  1. The SQL is parsed with the engine's own parser and any non-SELECT statement is rejected. This catches multi-statement injection (e.g. SELECT 1; DROP TABLE x), COPY ... TO 'file' exfiltration, ATTACH, EXPORT, and other side-effecting statements. This is the only layer that blocks writes — the adapter's underlying connection is read-write (one connection per adapter, reused across operations), so the parser check is what keeps untrusted SQL read-only.
  2. enable_external_access is disabled on that connection at construction time, so SELECT table functions that touch the host filesystem or network — read_csv, read_parquet, read_json, read_blob, read_text, glob and the httpfs/S3 variants — return a permission error instead of leaking files. Without this layer, SELECT * FROM read_csv('/etc/passwd', ...) would pass defence (1) because it is a syntactically valid SELECT.

Pass read_only=False only from trusted call sites that genuinely need to mutate state. Those paths still run on the same sandboxed connection (no external I/O), but they bypass the parser check, so any SQL is accepted — keep them out of the LM-tool-call surface.

{}

Returns:

Type Description
Union[List[Dict[str, Any]], str]

A list of dicts when output_format="json", or a CSV string when output_format="csv".

Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def query(
    self,
    query: str,
    params: Optional[Dict[str, Any]] = None,
    output_format: str = "json",
    **kwargs,
) -> Union[List[Dict[str, Any]], str]:
    """Execute a raw SQL query against the knowledge base.

    Args:
        query (str): The SQL query to execute.
        params (dict): Optional dict of parameters for parameterized queries.
        output_format: ``"json"`` (default, list of dicts —
            JSON-shaped Python data) or ``"csv"`` (CSV string,
            useful when handing the result to an LM).
        **kwargs (Any): Additional options. The most important one is
            ``read_only=True/False``. When ``True`` (the DuckDB adapter's
            default) two layers of defence apply:

            1. The SQL is parsed with the engine's own parser and any
               non-``SELECT`` statement is rejected. This catches
               multi-statement injection (e.g. ``SELECT 1; DROP TABLE x``),
               ``COPY ... TO 'file'`` exfiltration, ``ATTACH``, ``EXPORT``,
               and other side-effecting statements. This is the only
               layer that blocks writes — the adapter's underlying
               connection is read-write (one connection per adapter,
               reused across operations), so the parser check is what
               keeps untrusted SQL read-only.
            2. ``enable_external_access`` is disabled on that connection
               at construction time, so ``SELECT`` table functions that
               touch the host filesystem or network — ``read_csv``,
               ``read_parquet``, ``read_json``, ``read_blob``,
               ``read_text``, ``glob`` and the httpfs/S3 variants —
               return a permission error instead of leaking files.
               Without this layer,
               ``SELECT * FROM read_csv('/etc/passwd', ...)`` would pass
               defence (1) because it is a syntactically valid ``SELECT``.

            Pass ``read_only=False`` only from trusted call sites that
            genuinely need to mutate state. Those paths still run on
            the same sandboxed connection (no external I/O), but they
            bypass the parser check, so any SQL is accepted — keep them
            out of the LM-tool-call surface.

    Returns:
        (Union[List[Dict[str, Any]], str]): A list of dicts when ``output_format="json"``, or a CSV string when ``output_format="csv"``.
    """
    return await self.adapter.query(
        query, params=params, output_format=output_format, **kwargs
    )
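
The two `output_format` shapes can be compared with the standard library alone. The sketch below renders the same rows as JSON text and as CSV text; `rows_to_csv` is a hypothetical helper, and the exact CSV dialect the adapter emits is an assumption. Character counts are only a rough proxy for LM tokens.

```python
import csv
import io
import json
from typing import Any, Dict, List


def rows_to_csv(rows: List[Dict[str, Any]]) -> str:
    """Render a list of dicts as CSV text, with the header taken from the first row."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"id": "1", "title": "Hello", "content": "Hello World!"},
    {"id": "2", "title": "Bye", "content": "Goodbye World!"},
]

as_json = json.dumps(rows)
as_csv = rows_to_csv(rows)

# CSV states the column names once; JSON repeats every key in every row.
print(len(as_json), len(as_csv))
```

The saving grows with the number of rows, because the per-row key overhead of JSON is paid only once in the CSV header.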

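regex_search(pattern, *, table_name, fields=None, case_sensitive=True, k=10, output_format='json') async

Find rows whose string fields match a regular expression.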

DuckDB evaluates regexes with RE2, so patterns are linear-time and not vulnerable to catastrophic backtracking.

Parameters:

Name Type Description Default
pattern str

The regex pattern (RE2 syntax).

required
table_name str

Target table.

required
fields Optional[List[str]]

Field names to match against. Defaults to every string field on the schema. Names are snake_case-normalized to match stored column names.

None
case_sensitive bool

When False, match case-insensitively.

True
k int

Maximum number of results.

10
output_format str

"json" (list of dicts, default) / "csv" (text).

'json'
Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def regex_search(
    self,
    pattern: str,
    *,
    table_name: str,
    fields: Optional[List[str]] = None,
    case_sensitive: bool = True,
    k: int = 10,
    output_format: str = "json",
):
    """Find rows whose string fields match a regular expression.

    DuckDB evaluates regexes with RE2, so patterns are linear-time
    and not vulnerable to catastrophic backtracking.

    Args:
        pattern: The regex pattern (RE2 syntax).
        table_name: Target table.
        fields: Field names to match against. Defaults to every
            string field on the schema. Names are snake_case-
            normalized to match stored column names.
        case_sensitive: When ``False``, match case-insensitively.
        k: Maximum number of results.
        output_format: ``"json"`` (list of dicts, default) / ``"csv"`` (text).
    """
    return await self.adapter.regex_search(
        pattern,
        table_name=table_name,
        fields=fields,
        case_sensitive=case_sensitive,
        k=k,
        output_format=output_format,
    )
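
The roles of `fields`, `case_sensitive`, and `k` can be illustrated on in-memory rows. This sketch uses Python's `re` module, which is a backtracking engine and not RE2, so it mirrors only the filtering semantics, not the linear-time guarantee. `regex_filter` is a hypothetical helper, and partial (search-style) matching is an assumption.

```python
import re
from typing import Any, Dict, List, Optional


def regex_filter(
    rows: List[Dict[str, Any]],
    pattern: str,
    fields: Optional[List[str]] = None,
    case_sensitive: bool = True,
    k: int = 10,
) -> List[Dict[str, Any]]:
    """Return up to k rows where any selected string field matches the pattern."""
    flags = 0 if case_sensitive else re.IGNORECASE
    compiled = re.compile(pattern, flags)
    matches = []
    for row in rows:
        # With no explicit fields, consider every field; non-strings are skipped.
        names = fields if fields is not None else list(row.keys())
        values = (row.get(name) for name in names)
        if any(isinstance(v, str) and compiled.search(v) for v in values):
            matches.append(row)
    return matches[:k]


rows = [
    {"id": "1", "title": "Invoice INV-2024-001", "content": "paid"},
    {"id": "2", "title": "Note", "content": "see inv-2023-117"},
    {"id": "3", "title": "Memo", "content": "nothing here"},
]

strict = regex_filter(rows, r"INV-\d{4}-\d{3}")
relaxed = regex_filter(rows, r"INV-\d{4}-\d{3}", case_sensitive=False)
content_only = regex_filter(
    rows, r"INV-\d{4}-\d{3}", fields=["content"], case_sensitive=False
)
```

Restricting `fields` is the simplest way to avoid accidental hits in columns such as ids or titles when only the body text matters.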

rename(source, *, table_name=None, table_description=None) async

Rename a table and/or update its description.

Pass at least one of table_name / table_description. When table_name is given the underlying table is renamed via ALTER TABLE …, the FTS / vector indexes are rebuilt under the new name, and the adapter's known-models list is updated so subsequent default-table searches find the table under its new identity.

Parameters:

Name Type Description Default
source Any

SymbolicDataModel or table-name string for the table to rename. The string form is itself PascalCase-normalized, so callers can pass the same input they used in from_csv (e.g. "my-docs").

required
table_name Optional[str]

New table name. Always normalized to PascalCase.

None
table_description Optional[str]

Optional natural-language description attached to the resulting schema.

None

Returns:

Type Description
Any

A fresh SymbolicDataModel for the (possibly renamed) table.

Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def rename(
    self,
    source: Any,
    *,
    table_name: Optional[str] = None,
    table_description: Optional[str] = None,
) -> Any:
    """Rename a table and/or update its description.

    Pass at least one of ``table_name`` / ``table_description``.
    When ``table_name`` is given the underlying table is
    renamed via ``ALTER TABLE …``, the FTS / vector indexes are
    rebuilt under the new name, and the adapter's known-models
    list is updated so subsequent default-table searches find
    the table under its new identity.

    Args:
        source: ``SymbolicDataModel`` or table-name string for
            the table to rename. The string form is itself
            PascalCase-normalized, so callers can pass the
            same input they used in :meth:`from_csv` (e.g.
            ``"my-docs"``).
        table_name: New table name. Always normalized to
            PascalCase.
        table_description: Optional natural-language description
            attached to the resulting schema.

    Returns:
        A fresh :class:`SymbolicDataModel` for the (possibly
        renamed) table.
    """
    return await self.adapter.rename(
        source,
        table_name=table_name,
        table_description=table_description,
    )

similarity_search(text_or_texts, *, table_name, k=10, threshold=None, output_format='json') async

Vector similarity search against a single table.

Parameters:

Name Type Description Default
text_or_texts Union[str, List[str]]

Query text or list of query texts.

required
table_name str

Target table (single-table search).

required
k int

Maximum number of results to return.

10
threshold Optional[float]

Optional maximum vector-distance threshold.

None
output_format str

"json" (default, list of dicts — JSON-shaped Python data) or "csv" (CSV string, useful for handing results to an LM since CSV is ~30-50% fewer tokens than equivalent JSON).

'json'
Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def similarity_search(
    self,
    text_or_texts: Union[str, List[str]],
    *,
    table_name: str,
    k: int = 10,
    threshold: Optional[float] = None,
    output_format: str = "json",
):
    """Vector similarity search against a single table.

    Args:
        text_or_texts: Query text or list of query texts.
        table_name: Target table (single-table search).
        k: Maximum number of results to return.
        threshold: Optional maximum vector-distance threshold.
        output_format: ``"json"`` (default, list of dicts —
            JSON-shaped Python data) or ``"csv"`` (CSV string,
            useful for handing results to an LM since CSV is
            ~30-50% fewer tokens than equivalent JSON).
    """
    return await self.adapter.similarity_search(
        text_or_texts,
        table_name=table_name,
        k=k,
        threshold=threshold,
        output_format=output_format,
    )
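
The `output_format` trade-off described above comes from JSON repeating every field name in every record, whereas CSV states the field names once in a header row. The sketch below illustrates this with the standard library only. The `to_csv` helper and the sample records are hypothetical, and character count is used as a rough stand-in for token count, which depends on the tokenizer.

```python
import csv
import io
import json
from typing import Dict, List


def to_csv(records: List[Dict[str, str]]) -> str:
    """Serialize a list of flat dicts to a CSV string with one header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical search results shaped like the "json" output (list of dicts).
records = [
    {"id": "1", "title": "Hello", "content": "Hello World!"},
    {"id": "2", "title": "Bye", "content": "Goodbye World!"},
]

as_json = json.dumps(records)
as_csv = to_csv(records)

print(as_csv)
print("JSON chars:", len(as_json), "CSV chars:", len(as_csv))
```

The saving grows with the number of rows, because the per-record key overhead in JSON is paid once per row while the CSV header is paid once per result set.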

update(data_model_or_data_models, *, verbose='auto') async

Insert or update records in the knowledge base.

Parameters:

Name Type Description Default
data_model_or_data_models JsonDataModel | List[JsonDataModel] | Dataset

A single JsonDataModel, a list of JsonDataModel / DataModel instances, or a synalinks Dataset. The Dataset form streams the source batch by batch (one adapter.update call per yielded batch), so memory stays bounded for large CSV / Parquet / HuggingFace sources. The dataset must be inputs-only (no output_template) because the knowledge base stores records, not (input, target) pairs; passing a labeled dataset raises a ValueError.

The first field of each record is used as the primary key for upserts.

required
verbose int | str

One of "auto", 0, 1, or 2. Verbosity for the Dataset path; matches the trainer's fit() semantics. "auto" (default) resolves to 1 when a Dataset is passed (a per-batch progress bar, the same widget fit() uses, with an ETA when len(dataset) is known) and is a no-op for the scalar / list forms, which finish in a single adapter call.

'auto'

Returns:

Type Description
Union[Any, List[Any]]

The primary key value(s) of the inserted/updated records. Scalar in / scalar out; list in / list out; Dataset in / flat list of every batch's ids concatenated.
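
The upsert and return-shape rules above can be sketched with a small in-memory store. This is an illustration under stated assumptions, not the DuckDB adapter: `ToyStore`, its plain-dict records, and `update_from_batches` (standing in for the Dataset path) are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Union

Record = Dict[str, Any]


class ToyStore:
    """In-memory stand-in for one table; keyed by each record's first field."""

    def __init__(self) -> None:
        self.rows: Dict[Any, Record] = {}

    def _upsert(self, record: Record) -> Any:
        # The first field's value acts as the primary key:
        # an existing key is overwritten, a new key is inserted.
        key = next(iter(record.values()))
        self.rows[key] = record
        return key

    def update(self, record_or_records: Union[Record, List[Record]]) -> Any:
        # Scalar in / scalar out; list in / list out.
        if isinstance(record_or_records, list):
            return [self._upsert(record) for record in record_or_records]
        return self._upsert(record_or_records)

    def update_from_batches(self, batches: Iterable[List[Record]]) -> List[Any]:
        # One update call per batch; ids are concatenated into a flat list.
        ids: List[Any] = []
        for batch in batches:
            ids.extend(self.update(batch))
        return ids


store = ToyStore()
single = store.update({"id": "1", "title": "Hello"})
many = store.update([{"id": "1", "title": "Hi"}, {"id": "2", "title": "Bye"}])
streamed = store.update_from_batches([[{"id": "3", "title": "A"}], [{"id": "4", "title": "B"}]])

print(single, many, streamed)
print(store.rows["1"])  # the second write to id "1" replaced the first
```

Processing one batch at a time is what keeps memory bounded in the streaming case: only the returned ids accumulate, never the full source.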

Source code in synalinks/src/knowledge_bases/knowledge_base.py
async def update(
    self,
    data_model_or_data_models: Union[Any, List[Any], Dataset],
    *,
    verbose="auto",
) -> Union[Any, List[Any]]:
    """Insert or update records in the knowledge base.

    Args:
        data_model_or_data_models (JsonDataModel | List[JsonDataModel] | Dataset):
            A single ``JsonDataModel``, a list of ``JsonDataModel`` /
            ``DataModel`` instances, or a synalinks ``Dataset``.
            The ``Dataset`` form streams the source batch-by-batch
            (one ``adapter.update`` call per yielded batch) so memory
            stays bounded for large CSV / Parquet / HuggingFace
            sources. The dataset must be inputs-only — no
            ``output_template`` — because the knowledge base stores
            records, not ``(input, target)`` pairs; pass a
            labeled dataset and you'll get a ``ValueError``.

            Uses the first field as the primary key for upserts.
        verbose (int | str): ``"auto"``, ``0``, ``1``, or ``2``.
            Verbosity for the ``Dataset`` path; matches the
            trainer's ``fit()`` semantics. ``"auto"`` (default)
            resolves to ``1`` when a ``Dataset`` is passed (a
            per-batch progress bar — same widget ``fit()`` uses,
            with ETA when ``len(dataset)`` is known) and is a
            no-op for the scalar / list forms, which finish in a
            single adapter call.

    Returns:
        The primary key value(s) of the inserted/updated records.
        Scalar in / scalar out; list in / list out; ``Dataset`` in /
        flat list of every batch's ids concatenated.
    """
    if isinstance(data_model_or_data_models, Dataset):
        return await self._update_from_dataset(
            data_model_or_data_models, verbose=verbose
        )
    return await self.adapter.update(data_model_or_data_models)