
Architecture

The core problem

USearch requires all vectors in an index to have the same dimensionality, but ISCC codes are variable-length (64-bit to 256-bit). iscc-usearch resolves this with length-prefixed padding and the native NPHD (normalized prefix Hamming distance) metric, which ignores padding during distance computation.

Length-prefixed padding

[Diagram: NphdIndex architecture overview]

Every vector is padded to a uniform size before storage. The first byte holds the original vector length (in bytes), then the vector data, then zero-padding.

The padded vector is stored in USearch as a ScalarKind.B1 (binary) vector with ndim = max_dim + 8 bits (the extra 8 bits account for the length byte).

On retrieval, unpad_vectors reads the length byte and returns only the valid data bytes.

Both pad_vectors and unpad_vectors are plain Python/NumPy functions. pad_vectors uses vectorized array slicing for 2D ndarray input (uniform-length vectors) and a Python loop for list-of-arrays input (variable-length vectors).
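The padding scheme can be sketched in plain NumPy. The single-vector helpers below are illustrative analogues of pad_vectors/unpad_vectors, not the library's actual API:

```python
import numpy as np

MAX_DIM = 256                     # bits; maximum ISCC code resolution
PADDED_BYTES = MAX_DIM // 8 + 1   # 32 data bytes + 1 length byte = 33

def pad_vector(vec: bytes) -> np.ndarray:
    """Prefix the vector with its byte length, then zero-pad to a fixed size."""
    assert 1 <= len(vec) <= MAX_DIM // 8
    out = np.zeros(PADDED_BYTES, dtype=np.uint8)
    out[0] = len(vec)                                   # length byte
    out[1 : 1 + len(vec)] = np.frombuffer(vec, dtype=np.uint8)  # data
    return out                                          # tail stays zero-padded

def unpad_vector(padded: np.ndarray) -> bytes:
    """Read the length byte and return only the valid data bytes."""
    n = int(padded[0])
    return padded[1 : 1 + n].tobytes()

code = bytes.fromhex("aabbccdd11223344")   # a 64-bit ISCC body
padded = pad_vector(code)                  # shape (33,), padded[0] == 8
```

The round trip `unpad_vector(pad_vector(code)) == code` holds for any valid input length, which is what makes the padding transparent to callers.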

Maximum vector size

The NPHD metric supports vectors up to 33 bytes (1 length byte + 32 data bytes), so max_dim is capped at 256 bits. This matches the maximum resolution of ISCC content fingerprints. NphdIndex validates this at construction time.
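For intuition, the metric can be sketched in pure Python. This is one plausible formulation (Hamming distance over the common prefix, normalized by the prefix length in bits) and an assumption on my part; the real metric runs natively inside USearch's C++ core:

```python
def nphd(a: bytes, b: bytes) -> float:
    """Normalized prefix Hamming distance over length-prefixed vectors.

    Byte 0 holds the valid data length in bytes; zero-padding beyond
    that length is ignored, which is why uniform padding is harmless.
    """
    n = min(a[0], b[0])   # common prefix, in bytes
    bits = sum(bin(x ^ y).count("1") for x, y in zip(a[1 : 1 + n], b[1 : 1 + n]))
    return bits / (n * 8)  # distance normalized into [0.0, 1.0]

same = bytes([2, 0xAB, 0xCD, 0, 0])                    # 2 valid bytes + padding
flip = bytes([2, 0xAB ^ 0xFF, 0xCD ^ 0xFF, 0, 0])      # every prefix bit flipped
```

Under this formulation `nphd(same, same)` is 0.0 and `nphd(same, flip)` is 1.0, matching the [0.0, 1.0] distance range described below.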

Index class hierarchy

iscc-usearch provides six index classes. Two are single-file, four are sharded:

Index                            (usearch wrapper, uint64 keys)
└── NphdIndex                    (variable-length + NPHD metric)

ShardedIndex                     (composition-based sharding, uint64 keys)
├── ShardedIndex128              (128-bit UUID keys via _UuidKeyMixin)
└── ShardedNphdIndex             (variable-length + NPHD metric)
    └── ShardedNphdIndex128      (128-bit UUID keys via _UuidKeyMixin)

NphdIndex inherits from Index (which extends USearch's Index) and adds padding and the NPHD metric. ShardedIndex is a standalone composition-based class that manages multiple USearch indexes as shards. ShardedNphdIndex extends ShardedIndex with variable-length vector support and NPHD.

The 128 variants add 128-bit key support using usearch's key_kind="uuid" mode. Keys are 16-byte bytes objects for single operations and np.dtype('V16') arrays for batches. All 128-bit key handling is concentrated in a _UuidKeyMixin that overrides key-handling hooks on ShardedIndex, so the base classes stay clean.
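Constructing keys in the two shapes described above looks like this (only the key construction is shown; the surrounding add/search calls are omitted):

```python
import uuid
import numpy as np

# Single operations use a 16-byte bytes key:
key = uuid.uuid4().bytes           # bytes, length 16

# Batch operations use arrays of raw 16-byte records (np.dtype('V16')).
# np.frombuffer reinterprets concatenated keys without copying per element:
raw = b"".join(uuid.uuid4().bytes for _ in range(3))
keys = np.frombuffer(raw, dtype=np.dtype("V16"))   # shape (3,)
```

The 'V16' (void) dtype carries the 16 bytes opaquely, so no endianness or integer-width conversion is applied to the key material.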

Choosing an index class

Class                 Var-len  Keys     Shards  Upsert  Remove  Compact  Use case
NphdIndex             yes      uint64   no      yes     yes     no       ISCC codes, fits in RAM
ShardedIndex          no       uint64   yes     yes     yes     yes      Fixed-length vectors, large scale
ShardedIndex128       no       128-bit  yes     yes     yes     yes      Fixed-length vectors, 128-bit keys
ShardedNphdIndex      yes      uint64   yes     yes     yes     yes      ISCC codes, large scale (production)
ShardedNphdIndex128   yes      128-bit  yes     yes     yes     yes      ISCC codes, large scale, 128-bit keys

About Index

Index is an internal base class that wraps USearch's Index with upsert support. It is not part of the public API and not exported in __all__. Use NphdIndex instead — it inherits all of Index's functionality and adds NPHD support.

For most ISCC workloads, use NphdIndex for datasets that fit in RAM, or ShardedNphdIndex for datasets that exceed RAM or need consistent insert throughput. Use the 128 variants when you need keys beyond 64 bits — see the UUID keys how-to.

Data flow

Write path (add)

For NphdIndex, the application calls add(key, vector), which pads the vector (prepends a length byte, zero-fills to max_dim + 8 bits) via pad_vectors and stores it in the USearch HNSW graph.

For ShardedNphdIndex, the write path adds bloom filter updates, dirty counter increments, and automatic shard rotation: once the active shard exceeds shard_size, it is saved to disk and reopened as a memory-mapped view while a fresh active shard is created.
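The rotation logic can be illustrated with a stand-in class. Dicts replace the real shards, and _rotate only freezes the active shard in memory, where the library would save it to disk and reopen it as a memory-mapped view; the class and its names are hypothetical, not the library's API:

```python
class RotatingShards:
    """Illustrative sketch of automatic shard rotation."""

    def __init__(self, shard_size: int):
        self.shard_size = shard_size
        self.active: dict[int, bytes] = {}       # stand-in for the RAM shard
        self.views: list[dict[int, bytes]] = []  # stand-ins for mmap view shards

    def add(self, key: int, vector: bytes) -> None:
        self.active[key] = vector
        if len(self.active) >= self.shard_size:
            self._rotate()

    def _rotate(self) -> None:
        # Real library: save active shard to disk, reopen memory-mapped,
        # then start a fresh active shard for subsequent writes.
        self.views.append(self.active)
        self.active = {}

idx = RotatingShards(shard_size=2)
for k in range(5):
    idx.add(k, b"\x00")
```

After five adds with shard_size=2, two shards have rotated into views and one entry sits in the new active shard, which mirrors how write throughput stays bounded by the active shard's size.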

Delete path (remove)

For ShardedNphdIndex, remove() checks the bloom filter first for fast rejection. Active shard entries are removed immediately via USearch's lazy deletion. View shard entries are tombstoned — tracked in a _tombstones set and persisted as tombstones.npy. Tombstoned entries are suppressed in search results and iterators. compact() rebuilds view shards to physically remove tombstoned entries and reclaim disk space.
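The fast-rejection step relies on the Bloom filter's one-sided error: a negative answer is definitive, so remove() can bail out without touching any shard. A toy filter shows the idea (this is not the library's implementation, and real filters use stronger hash mixing):

```python
class TinyBloom:
    """Minimal Bloom filter for fast membership rejection."""

    def __init__(self, size_bits: int = 1 << 16, hashes: int = 3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: int):
        for i in range(self.hashes):
            yield hash((key, i)) % self.size     # simple per-slot hashing

    def add(self, key: int) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: int) -> bool:
        # False -> definitely absent: remove() can return immediately.
        # True  -> possibly present: the shards must be consulted.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bloom = TinyBloom()
bloom.add(42)
```

False positives only cost an unnecessary shard lookup; false negatives never occur, which is what makes the filter safe as a pre-check.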

Read path (search)

The query is padded and searched across all shards in parallel (active shard in RAM plus memory-mapped view shards). Each shard invokes the NPHD metric for distance computations, returning distances in [0.0, 1.0]. Results are merged via argsort and top-k selection. When tombstones or cross-shard duplicates exist, view shard results are oversampled and filtered to exclude stale entries before merging.
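The merge step can be sketched in NumPy. This simplified version filters tombstones and takes a global top-k; it omits the oversampling and cross-shard deduplication the library performs, and the function name is illustrative:

```python
import numpy as np

def merge_topk(keys_per_shard, dists_per_shard, k, tombstones=frozenset()):
    """Merge per-shard search results: drop tombstones, argsort, take top-k."""
    keys = np.concatenate(keys_per_shard)
    dists = np.concatenate(dists_per_shard)
    # Suppress entries deleted from view shards but still present on disk.
    alive = np.array([key not in tombstones for key in keys])
    keys, dists = keys[alive], dists[alive]
    order = np.argsort(dists)[:k]          # global top-k by ascending distance
    return keys[order], dists[order]

keys, dists = merge_topk(
    [np.array([10, 11]), np.array([20, 21])],   # hits from two shards
    [np.array([0.4, 0.1]), np.array([0.2, 0.3])],
    k=2,
    tombstones={11},                             # 11 was deleted in a view shard
)
```

Here key 11 is the closest raw hit (distance 0.1) but is tombstoned, so the merged top-2 is keys 20 and 21.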

Concurrency model

iscc-usearch is designed for single-process access. The underlying .usearch files have no file locking or multi-process coordination.

Warning

Running multiple processes against the same index files may corrupt data.

Within a single process, use async/await for concurrent connections (e.g., serving search queries from an async web framework). The index objects themselves are not thread-safe; guard concurrent access with a lock if using threads.
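One way to apply the lock guidance is a thin wrapper that serializes every call. The wrapper and the toy dict-backed index below are illustrative stand-ins, not part of iscc-usearch:

```python
import threading

class LockedIndex:
    """Serialize access to a non-thread-safe index with a single lock."""

    def __init__(self, inner):
        self._inner = inner               # stands in for an NphdIndex-like object
        self._lock = threading.Lock()

    def add(self, key, vector):
        with self._lock:
            self._inner.add(key, vector)

    def search(self, vector, count=10):
        with self._lock:
            return self._inner.search(vector, count)

class _DictIndex:
    """Toy stand-in index: a plain dict of key -> vector."""
    def __init__(self):
        self.data = {}
    def add(self, key, vector):
        self.data[key] = vector
    def search(self, vector, count=10):
        return list(self.data)[:count]

idx = LockedIndex(_DictIndex())
threads = [threading.Thread(target=idx.add, args=(i, b"\x01")) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A single coarse lock is the simplest correct choice here; finer-grained locking would have to reason about USearch's internal state and is rarely worth it for index-sized critical sections.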

For sharded indexes, shard rotation (save + reopen) is not atomic. Concurrent reads during rotation are safe because completed view shards are immutable, but concurrent writes must be serialized.

Why a thin wrapper

iscc-usearch does not fork USearch's index logic. It wraps the existing Index class and adds padding at the boundary. The NPHD metric is provided natively by usearch-iscc as MetricKind.NPHD, so no metric restoration is needed after persistence operations. This keeps the wrapper small and lets it track upstream USearch improvements without merge conflicts.

The one exception is the patched usearch fork, which modifies USearch's C++ core for performance. Those patches are confined to view/load paths and don't change the index format.