Architecture

The core problem

USearch requires all vectors in an index to have the same dimensionality. ISCC codes are variable-length (64-bit to 256-bit). iscc-usearch solves this with length-prefixed padding and a custom metric that ignores padding during distance computation.

Length-prefixed padding

Every vector is padded to a uniform size before storage. The first byte holds the original vector length (in bytes), followed by the vector data and then zero-padding up to the fixed size:

flowchart TB
    subgraph RAW ["Raw vector (3 bytes)"]
        direction LR
        R1["0xFF"] --- R2["0x80"] --- R3["0x40"]
    end

    subgraph PAD ["Padded to 33 bytes"]
        direction LR
        P0["**0x03** length"] --- P1["0xFF"] --- P2["0x80"] --- P3["0x40"] --- P4["0x00 ..."] --- P32["0x00"]
    end

    RAW -->|pad_vectors| PAD

The padded vector is stored in USearch as a ScalarKind.B1 (binary) vector with ndim = max_dim + 8 bits (the extra 8 bits account for the length byte).
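The layout above can be sketched in plain Python. This is a minimal illustration, not the library's pad_vectors: the name pad_vector, its max_bytes parameter, and the use of bytes instead of NumPy arrays are assumptions made for readability.

```python
def pad_vector(data: bytes, max_bytes: int = 32) -> bytes:
    """Return a (1 + max_bytes)-byte buffer: length byte, data, zero padding.

    Hypothetical helper for illustration; the real pad_vectors works on
    NumPy arrays and is compiled with Numba.
    """
    if len(data) > max_bytes:
        raise ValueError(f"vector has {len(data)} bytes, limit is {max_bytes}")
    # One length byte, then the payload, then zeros up to the fixed size.
    return bytes([len(data)]) + data + bytes(max_bytes - len(data))


padded = pad_vector(bytes([0xFF, 0x80, 0x40]))
print(len(padded), padded[:5].hex())
```

With the default max_bytes of 32 the result is 33 bytes long, which matches the diagram: one length byte, three data bytes, and 29 zero bytes.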

On retrieval, unpad_vectors reads the length byte and returns only the valid data bytes:

flowchart TB
    subgraph STO ["Stored (33 bytes)"]
        direction LR
        S0["**0x03**"] --- S1["0xFF"] --- S2["0x80"] --- S3["0x40"] --- S4["0x00..."]
    end

    subgraph RET ["Returned (3 bytes)"]
        direction LR
        O1["0xFF"] --- O2["0x80"] --- O3["0x40"]
    end

    STO -->|unpad_vectors| RET

Both pad_vectors and unpad_vectors are compiled with Numba @njit(cache=True) for native speed.
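The reverse direction is equally small. The following sketch assumes the stored layout described above (length byte first); unpad_vector is a hypothetical single-vector stand-in for the library's batched, Numba-compiled unpad_vectors.

```python
def unpad_vector(stored: bytes) -> bytes:
    """Recover the original data from a length-prefixed, zero-padded buffer."""
    length = stored[0]             # first byte: number of valid data bytes
    return stored[1:1 + length]    # drop the length byte and trailing padding


# A stored 33-byte buffer holding the 3-byte vector from the diagram.
stored = bytes([0x03, 0xFF, 0x80, 0x40]) + bytes(29)
restored = unpad_vector(stored)
print(restored.hex())
```

Because the length is stored explicitly, trailing zero bytes that belong to the original vector are preserved rather than being mistaken for padding.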

Custom metric restoration

USearch serializes the metric kind (e.g., Hamming) but not the custom function pointer. When an index is loaded or viewed, USearch substitutes the standard metric for that kind. Since NPHD is registered as MetricKind.Hamming (the closest built-in kind), a loaded index would use standard Hamming distance instead of NPHD.

NphdIndex.load() and NphdIndex.view() call change_metric() immediately after loading or viewing to restore the NPHD function pointer. Callers never need to do this themselves.
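The restore-after-load pattern can be illustrated with toy classes. Everything below is a hypothetical stand-in: ToyIndex is not the USearch API, the change_metric signature shown is assumed, and the two metric functions are placeholders. Only the control flow mirrors the description above.

```python
def builtin_hamming(a, b):
    """Placeholder for the standard metric substituted on load."""
    return 0.0


def custom_nphd(a, b):
    """Placeholder for the custom metric that must be restored."""
    return 0.0


class ToyIndex:
    """Stand-in index whose load() forgets any custom metric function."""

    def __init__(self, metric):
        self.metric = metric

    def load(self, path):
        # Mimics the described behaviour: only the metric kind is persisted,
        # so the built-in function for that kind replaces the custom one.
        self.metric = builtin_hamming

    def change_metric(self, metric):
        self.metric = metric


class ToyNphdIndex(ToyIndex):
    """Wrapper that re-installs the custom metric after every load."""

    def __init__(self):
        super().__init__(custom_nphd)

    def load(self, path):
        super().load(path)
        self.change_metric(custom_nphd)  # restore the function pointer


plain = ToyIndex(custom_nphd)
plain.load("example.idx")
wrapped = ToyNphdIndex()
wrapped.load("example.idx")
print(plain.metric.__name__, wrapped.metric.__name__)
```

The plain index ends up with the built-in metric after loading, while the wrapper ends up with the custom one again, which is the guarantee the real wrapper provides to its callers.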

Numba compilation strategy

Two compilation modes are used:

  • @cfunc for the NPHD metric: produces a C-callable function pointer compatible with USearch's CompiledMetric interface. USearch calls it from C++ during graph traversal without crossing the Python/C boundary.

  • @njit(cache=True) for padding functions: JIT-compiled with on-disk caching, so the compilation cost is paid only once per environment. These functions operate on NumPy arrays.
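To make concrete what a padding-aware metric over these buffers might compute, here is a pure-Python sketch. It assumes the metric counts differing bits over the prefix shared by both vectors (the shorter of the two lengths) and normalizes by that prefix's bit count. That formula is an assumption for illustration. The compiled metric in the library is a Numba @cfunc working on raw pointers, not this function.

```python
def nphd_sketch(a: bytes, b: bytes) -> float:
    """Hamming distance over the shared prefix of two padded vectors.

    Both inputs use the length-prefixed layout: a[0] and b[0] hold the
    number of valid data bytes. Padding bytes are never inspected.
    """
    prefix = min(a[0], b[0])       # compare only bytes valid in both vectors
    if prefix == 0:
        return 0.0                 # assumption: empty prefix means distance 0
    differing_bits = sum(
        bin(x ^ y).count("1")
        for x, y in zip(a[1:1 + prefix], b[1:1 + prefix])
    )
    return differing_bits / (prefix * 8)


# 3-byte vector FF 80 40 and 2-byte vector FF 00, both padded to 33 bytes.
vec_a = bytes([3, 0xFF, 0x80, 0x40]) + bytes(29)
vec_b = bytes([2, 0xFF, 0x00]) + bytes(30)
print(nphd_sketch(vec_a, vec_b))
```

In this example only the first two data bytes are compared. They differ in one bit out of 16, so the sketch returns 0.0625, and the third byte of the longer vector has no effect on the result.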

Maximum vector size

The NPHD metric is compiled with a fixed buffer size of 33 bytes (1 length byte + 32 data bytes), so max_dim is capped at 256 bits. This matches the maximum resolution of ISCC content fingerprints. NphdIndex validates this at construction time.
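The size arithmetic can be written out as a short check. This is a sketch under stated assumptions: stored_ndim and the constant names are hypothetical, and the actual validation in NphdIndex may differ in its details and error messages.

```python
MAX_DIM_BITS = 256       # cap stated above: 32 data bytes
LENGTH_PREFIX_BITS = 8   # one length byte


def stored_ndim(max_dim: int) -> int:
    """Return the bit width actually stored for a given max_dim."""
    if not 0 < max_dim <= MAX_DIM_BITS:
        raise ValueError(f"max_dim must be in 1..{MAX_DIM_BITS}, got {max_dim}")
    return max_dim + LENGTH_PREFIX_BITS


ndim = stored_ndim(256)
print(ndim, ndim // 8)
```

For the maximum of 256 bits this gives 264 stored bits, which is exactly the 33-byte buffer the metric is compiled for.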

Why a thin wrapper

iscc-usearch does not fork USearch's index logic. It wraps the existing Index class, adds padding at the boundary, and restores the metric after persistence operations. This keeps the wrapper small and lets it track upstream USearch improvements without merge conflicts.

The one exception is the patched usearch fork, which modifies USearch's C++ core for performance. Those patches are confined to view/load paths and don't change the index format.