Skip to content

Normalized Prefix Hamming Distance (NPHD)

The problem with standard Hamming distance

Standard Hamming distance counts the positions where two equal-length bit-strings differ. It is undefined for vectors of different lengths -- you cannot compare a 64-bit code with a 256-bit code without deciding what to do about the 192 extra bits.

Common workarounds like zero-padding or truncation either introduce artificial differences or discard information. Neither preserves the prefix relationship that ISCC codes rely on.

NPHD definition

NPHD compares only the common prefix (the shorter of the two vectors) and normalizes by its length:

\[ \text{NPHD}(a, b) = \frac{\text{hamming}(a_{1..m},\; b_{1..m})}{m \times 8} \]

where \(m = \min(\text{len}(a),\; \text{len}(b))\) in bytes.

  • Numerator: Hamming distance over the first \(m\) bytes (the common prefix).
  • Denominator: \(m \times 8\) (the number of bits in the prefix).

Worked example

Take two vectors:

  • a = [0xFF, 0x80, 0x40] (3 bytes, 24 bits)
  • b = [0xFF, 0x81] (2 bytes, 16 bits)
graph TD
    A["<b>Step 1:</b> Determine prefix length<br/>min(3, 2) = 2 bytes = 16 bits"] --> B
    B["<b>Step 2:</b> Compare first 2 bytes<br/>a: 0xFF 0x80<br/>b: 0xFF 0x81"] --> C
    C["<b>Step 3:</b> XOR and count bits<br/>0xFF ^ 0xFF = 0x00 → 0 bits<br/>0x80 ^ 0x81 = 0x01 → 1 bit"] --> D
    D["<b>Step 4:</b> Normalize<br/>NPHD = 1 / 16 = 0.0625"]

The third byte of a (0x40) is ignored because b is only 2 bytes long.

Metric properties

NPHD satisfies identity, symmetry, and non-negativity. However, the triangle inequality does not hold in general when vectors have different lengths, because normalization by the shorter prefix length can inflate distances between vectors of mismatched lengths.

Property Holds? Explanation
Identity Yes \(\text{NPHD}(a, a) = 0\) for any vector \(a\)
Symmetry Yes \(\text{NPHD}(a, b) = \text{NPHD}(b, a)\) by commutativity of XOR
Triangle inequality No Violated when vectors have different lengths (see below)
Non-negativity Yes Hamming distance is always \(\geq 0\)

Triangle inequality counterexample

Consider three vectors:

  • a = [0x00] (1 byte)
  • b = [0x00, 0x00] (2 bytes)
  • c = [0xFF, 0x00] (2 bytes)
Pair Prefix length Hamming bits NPHD
(a, b) 1 byte 0 0 / 8 = 0.0
(b, c) 2 bytes 8 8 / 16 = 0.5
(a, c) 1 byte 8 8 / 8 = 1.0

\(\text{NPHD}(a, c) = 1.0 > 0.0 + 0.5 = \text{NPHD}(a, b) + \text{NPHD}(b, c)\)

This happens because a and c are compared over only 1 byte (maximizing the normalized distance), while b and c are compared over 2 bytes (diluting the same 8-bit difference).

Note

NPHD is technically a semimetric (or dissimilarity), not a full metric. This does not affect approximate nearest neighbor search -- USearch's HNSW graph construction and traversal work correctly with semimetrics, and NPHD produces meaningful similarity rankings in practice.

Distance range

NPHD values fall in \([0.0,\; 1.0]\) regardless of vector lengths:

  • 0.0 -- The common prefix is identical.
  • 1.0 -- Every bit in the common prefix differs.

Distances are directly comparable even when the vectors have different lengths.

Connection to ISCC (ISO 24138)

ISCC content codes are prefix-compatible: a 64-bit ISCC code is the most significant portion of a 256-bit code. Searching a mixed-resolution index with NPHD works because the metric respects this nesting.

Connection to Matryoshka Representation Learning

ISCC's prefix-compatible design shares a conceptual basis with Matryoshka Representation Learning (MRL), where embeddings are structured so that truncating to a shorter prefix preserves the most important information. NPHD is the metric counterpart: it compares representations at any truncation level with distances that are comparable across resolutions.

Implementation

The NPHD metric is compiled to a C-callable function with Numba's @cfunc decorator and passed to USearch's C++ core as a function pointer. This avoids Python callback overhead on every distance computation during graph traversal. See iscc_usearch.metrics for the source.