iscc-usearch
Scalable approximate nearest neighbor search for variable-length binary bit-vectors.
iscc-usearch extends USearch with capabilities
purpose-built for ISCC (ISO 24138) content fingerprints: indexing
binary vectors of mixed bit-lengths in a single index, and scaling beyond available RAM through
transparent sharding.
Why not plain USearch?
USearch is a fast, general-purpose vector index -- but it assumes all vectors have the same dimensionality, and a single index must fit in memory for writes. ISCC codes break both assumptions:
- Variable-length codes. An ISCC content fingerprint can be 64, 128, or 256 bits depending on resolution. Shorter codes are prefixes of longer ones -- a design shared with Matryoshka Representation Learning. A useful index must store and compare all resolutions together.
- Large-scale collections. Real-world content registries grow to hundreds of millions of fingerprints. Write throughput in HNSW graphs degrades as the graph grows, and the full graph must be loaded into RAM for inserts.
iscc-usearch solves both problems with two core additions:
flowchart TB
subgraph VL ["Variable-length indexing"]
direction LR
A["64-bit code"] ---|prefix of| B["128-bit code"]
B ---|prefix of| C["256-bit code"]
end
subgraph NP ["NPHD metric"]
direction LR
D["Compare common prefix"] --- E["Normalize to [0.0, 1.0]"]
end
subgraph SS ["Sharded storage"]
direction LR
F["Active shard<br/>(RAM, read-write)"] -.->|rotate when full| G["View shards<br/>(mmap, read-only)"]
end
VL --> NP --> SS
Normalized Prefix Hamming Distance (NPHD) compares only the bits that both vectors share and
normalizes the result to [0.0, 1.0]. A 64-bit query can find its nearest neighbors among
256-bit vectors -- distances remain comparable across resolutions.
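To make the metric concrete, here is a minimal NumPy sketch of the idea described above. It is not the library's implementation: the function name `nphd` is hypothetical, and the normalization (differing bits divided by the number of bits in the common prefix) is an assumption based on the description of NPHD given here.

```python
import numpy as np


def nphd(a: np.ndarray, b: np.ndarray) -> float:
    """Illustrative NPHD between two packed uint8 bit-vectors (assumed definition)."""
    n = min(a.size, b.size)  # length of the common prefix, in bytes
    if n == 0:
        raise ValueError("both vectors must contain at least one byte")
    # XOR marks the differing bits within the common prefix only
    diff = np.bitwise_xor(a[:n], b[:n])
    differing_bits = int(np.unpackbits(diff).sum())
    # Normalize by the number of compared bits so the result lies in [0.0, 1.0]
    return differing_bits / (n * 8)


code_64 = np.array([255, 128, 64, 32, 16, 8, 4, 2], dtype=np.uint8)
# A 128-bit code whose first 64 bits equal code_64
code_128 = np.concatenate([code_64, np.array([1, 0, 255, 128, 64, 32, 16, 8], dtype=np.uint8)])

# Flip the four lowest bits of the first byte: 4 of the 64 shared bits now differ
near_64 = code_64.copy()
near_64[0] ^= 0x0F

d_prefix = nphd(code_64, code_128)  # identical common prefix
d_near = nphd(near_64, code_128)    # 4 differing bits out of 64
d_far = nphd(code_64, ~code_64)     # every bit differs
```

Because the distance is divided by the number of compared bits, a 64-bit query and a 256-bit query against the same stored vector yield values on the same scale.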
Transparent sharding keeps a single active shard in RAM for writes while completed shards are memory-mapped for reads. This maintains consistent insert throughput regardless of index size and keeps the memory footprint bounded.
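The rotation pattern can be sketched with a toy model. Everything below is hypothetical and for illustration only: `ToyShardedIndex`, `shard_capacity`, and the helper `_prefix_distance` are not part of iscc-usearch. Plain dictionaries stand in for HNSW shards, a brute-force scan stands in for graph search, and "freezing" a shard stands in for saving it to disk and memory-mapping it.

```python
import numpy as np


def _prefix_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Hamming distance over the common prefix, normalized to [0.0, 1.0]."""
    n = min(a.size, b.size)
    diff = np.bitwise_xor(a[:n], b[:n])
    return int(np.unpackbits(diff).sum()) / (n * 8)


class ToyShardedIndex:
    """Toy model: one writable shard plus a list of frozen, read-only shards."""

    def __init__(self, shard_capacity: int = 2):
        self.shard_capacity = shard_capacity
        self.active = {}  # stands in for the in-RAM, read-write shard
        self.frozen = []  # stands in for memory-mapped, read-only shards

    def add(self, key: int, vector) -> None:
        # Rotate before the active shard grows beyond its capacity
        if len(self.active) >= self.shard_capacity:
            self.frozen.append(self.active)
            self.active = {}
        self.active[key] = np.asarray(vector, dtype=np.uint8)

    def search(self, query, count: int = 1):
        query = np.asarray(query, dtype=np.uint8)
        hits = []
        # Query every shard, then merge the per-shard results by distance
        for shard in [*self.frozen, self.active]:
            for key, vector in shard.items():
                hits.append((_prefix_distance(query, vector), key))
        hits.sort()
        return hits[:count]


toy = ToyShardedIndex(shard_capacity=2)
toy.add(1, [0] * 8)     # 64-bit vector
toy.add(2, [255] * 8)   # 64-bit vector
toy.add(3, [0] * 16)    # 128-bit vector; triggers rotation of the full shard
results = toy.search([0] * 8, count=2)
```

Writes only ever touch the small active shard, which is why insert cost in this model does not depend on how many frozen shards exist; reads pay for fanning out across all shards and merging.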
Quick start
import numpy as np
from iscc_usearch import NphdIndex
index = NphdIndex(max_dim=256)
# Mix 64-bit and 128-bit vectors in the same index
index.add(1, np.array([255, 128, 64, 32, 16, 8, 4, 2], dtype=np.uint8))
index.add(2, np.array([255, 128, 64, 32, 16, 8, 4, 2, 1, 0, 255, 128, 64, 32, 16, 8], dtype=np.uint8))
# Search with a 64-bit query -- NPHD compares the common prefix
query = np.array([255, 128, 64, 32, 16, 8, 4, 2], dtype=np.uint8)
matches = index.search(query, count=2)
print(matches.keys) # Nearest neighbor keys
print(matches.distances) # NPHD distances in [0.0, 1.0]
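The vectors in the quick start are packed bytes: 8 `uint8` values hold 64 bits and 16 values hold 128 bits. If a fingerprint body is available as a hex string (an assumption about your input format), a small helper along these lines can produce such an array. `hex_to_vector` is a hypothetical name, not part of iscc-usearch.

```python
import numpy as np


def hex_to_vector(digest_hex: str) -> np.ndarray:
    """Convert a hex-encoded bit-string into a packed uint8 vector."""
    raw = bytes.fromhex(digest_hex)
    # frombuffer returns a read-only view; copy to get a writable array
    return np.frombuffer(raw, dtype=np.uint8).copy()


vec64 = hex_to_vector("ff80402010080402")  # 16 hex characters = 8 bytes = 64 bits
bit_length = vec64.size * 8
```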
Documentation
- Tutorials -- Learn the basics. Hands-on guides from installation to working code.
- How-to guides -- Solve specific problems. Recipes for persistence, sharding, upsert, and bloom filters.
- Explanation -- Understand the design. Background on NPHD, architecture, sharding, and performance.
- Reference -- API details. Auto-generated API documentation for all public classes.
- Development -- Contribute. Dev setup, testing, and contribution guidelines.