Variable-length vectors¶
This tutorial builds on the Getting Started guide. You will add vectors of different bit-lengths to the same index and see how NPHD handles prefix matching.
The prefix relationship¶
In ISCC content fingerprinting, a shorter code is always a prefix of a longer one. A 64-bit code holds the most significant bits. A 128-bit code extends it with finer detail. A 256-bit code is the full-resolution fingerprint.
graph LR
A["64-bit<br/>(8 bytes)"] -->|prefix of| B["128-bit<br/>(16 bytes)"]
B -->|prefix of| C["256-bit<br/>(32 bytes)"]
iscc-usearch stores all these lengths in a single index and compares them with the Normalized Prefix Hamming Distance (NPHD).
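The prefix relationship can be checked with plain byte slicing. The following is a minimal sketch that uses Python's built-in `bytes` rather than iscc-usearch; the helper name `is_prefix` and the sample values are illustrative assumptions, not part of the library.

```python
def is_prefix(short: bytes, long: bytes) -> bool:
    """Return True if `short` equals the leading bytes of `long`."""
    return len(short) <= len(long) and long[:len(short)] == short


# Hypothetical 256-bit code (32 bytes); the values are arbitrary sample data.
code_256 = bytes(range(32))
code_128 = code_256[:16]  # first 128 bits
code_64 = code_256[:8]    # first 64 bits

print(is_prefix(code_64, code_128))   # True
print(is_prefix(code_128, code_256))  # True
print(is_prefix(code_256, code_64))   # False: a longer code cannot be a prefix of a shorter one
```

Truncating the longest code to 16 or 8 bytes yields the shorter codes, which is why all three lengths can be compared in one index.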
Create an index with mixed-length vectors¶
import numpy as np
from iscc_usearch import NphdIndex
index = NphdIndex(max_dim=256)
# 64-bit vector (8 bytes)
v64 = np.array([255, 128, 64, 32, 16, 8, 4, 2], dtype=np.uint8)
# 128-bit vector (16 bytes) -- first 8 bytes match v64
v128 = np.array([255, 128, 64, 32, 16, 8, 4, 2, 1, 0, 255, 128, 64, 32, 16, 8], dtype=np.uint8)
# fmt: off
# 256-bit vector (32 bytes) -- first 16 bytes match v128
v256 = np.array([255, 128, 64, 32, 16, 8, 4, 2, 1, 0, 255, 128, 64, 32, 16, 8, 7, 6, 5, 4, 3, 2, 1, 0, 255, 254, 253, 252, 251, 250, 249, 248], dtype=np.uint8)
# fmt: on
index.add(1, v64)
index.add(2, v128)
index.add(3, v256)
Search with a short query¶
A 64-bit query causes NPHD to compare only the first 64 bits of each stored vector:
query = v64.copy()
matches = index.search(query, count=3)
for key, dist in zip(matches.keys, matches.distances):
    print(f"Key {key}: distance = {dist:.4f}")
Expected result: all three vectors share the same first 8 bytes, so every distance prints as 0.0000.
Search with a longer query¶
Now search with the 128-bit vector. NPHD compares 128 bits against vectors that are at least that long, but only 64 bits against the shorter vector:
query = v128.copy()
matches = index.search(query, count=3)
for key, dist in zip(matches.keys, matches.distances):
    print(f"Key {key}: distance = {dist:.4f}")
The 128-bit and 256-bit vectors match over all 128 compared bits, and the 64-bit vector still matches over its 64-bit prefix, so all three distances are again 0.0.
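To make the compared lengths explicit, the sketch below walks over the tutorial's three vectors as plain `bytes` and reports how many bits a 128-bit query shares with each one. It does not call iscc-usearch; the `stored` dictionary and the loop are illustrative assumptions about what the comparison covers, not the library's internals.

```python
v64 = bytes([255, 128, 64, 32, 16, 8, 4, 2])
v128 = v64 + bytes([1, 0, 255, 128, 64, 32, 16, 8])
v256 = v128 + bytes([7, 6, 5, 4, 3, 2, 1, 0, 255, 254, 253, 252, 251, 250, 249, 248])

stored = {1: v64, 2: v128, 3: v256}
query = v128

report = {}
for key, vec in stored.items():
    n = min(len(query), len(vec))       # bytes in the common prefix
    prefix_equal = query[:n] == vec[:n]
    report[key] = (n * 8, prefix_equal)
    print(f"Key {key}: compared bits = {n * 8}, prefix equal = {prefix_equal}")
```

Key 1 is compared over 64 bits, keys 2 and 3 over 128 bits, and the prefixes are equal in every case.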
How NPHD distances work¶
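NPHD computes the Hamming distance over the common prefix and divides it by the length of the shorter vector in bits:

$$\mathrm{NPHD}(a, b) = \frac{\mathrm{Hamming}(a_{1..n},\ b_{1..n})}{n}, \qquad n = \min(|a|, |b|)$$

Here $|a|$ and $|b|$ are the vector lengths in bits, and $a_{1..n}$ denotes the first $n$ bits of $a$.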
Properties:
- Range: Always [0.0, 1.0], regardless of vector lengths.
- Prefix compatibility: A 64-bit vector matching the first 64 bits of a 256-bit vector has distance 0.0.
- Symmetry: NPHD(a, b) == NPHD(b, a).
For the math behind this, see the NPHD metric explanation.
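As a rough reference, the description above can be written in a few lines of pure Python. This is a sketch under the assumption that NPHD is the Hamming distance over the common prefix divided by the prefix length in bits; the function name `nphd` and its handling of empty input are choices made here, and the library's own implementation may differ.

```python
def nphd(a: bytes, b: bytes) -> float:
    """Hamming distance over the common prefix, divided by its length in bits."""
    n = min(len(a), len(b))  # common prefix length in bytes
    if n == 0:
        raise ValueError("both vectors must contain at least one byte")
    # XOR each byte pair and count the differing bits.
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a[:n], b[:n]))
    return differing / (n * 8)


short = bytes([0b11110000] * 8)                # 64-bit vector
long_same = short + bytes([0b10101010] * 24)   # 256-bit vector sharing the 64-bit prefix
long_flip = bytes([0b00001111] * 32)           # 256-bit vector with every prefix bit inverted

print(nphd(short, long_same))                            # 0.0
print(nphd(short, long_flip))                            # 1.0
print(nphd(short, long_flip) == nphd(long_flip, short))  # True
```

The three calls illustrate prefix compatibility, the upper end of the range, and symmetry in turn.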
Introduce a difference¶
Add a vector that differs by one bit in the first byte:
v64_diff = np.array([254, 128, 64, 32, 16, 8, 4, 2], dtype=np.uint8) # 254 vs 255
index.add(4, v64_diff)
matches = index.search(v64, count=4)
for key, dist in zip(matches.keys, matches.distances):
    print(f"Key {key}: distance = {dist:.6f}")
The distance between v64 and v64_diff is 1/64 = 0.015625 -- one differing bit out of 64.
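The arithmetic can be checked by hand: XOR the two first bytes, count the set bits, and divide by the 64 compared bits. The snippet uses only Python built-ins and takes its values from the example above; the variable names are illustrative.

```python
# First bytes of v64 and v64_diff; the remaining 7 bytes are identical.
a_first, b_first = 255, 254

differing_bits = bin(a_first ^ b_first).count("1")  # 255 ^ 254 == 1, so one bit differs
distance = differing_bits / 64                      # 64 bits are compared

print(differing_bits)     # 1
print(f"{distance:.6f}")  # 0.015625
```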
Next steps¶
- NPHD metric -- Mathematical properties of the distance function.
- Architecture -- How variable-length vectors are stored internally.