Sharding

Use ShardedNphdIndex when your dataset needs to scale beyond a single index file.

When to use sharding

Switch to ShardedNphdIndex when:

  • Your dataset exceeds available RAM.
  • Insert throughput degrades as the index grows because HNSW graph construction slows with size.
  • You need persistent storage with transparent shard rotation.

Create a sharded index

from iscc_usearch import ShardedNphdIndex

index = ShardedNphdIndex(
    max_dim=256,
    path="./my_shards",
    shard_size=512 * 1024 * 1024,  # 512 MB per shard
)

The path directory is created automatically. After adding data and saving, the directory looks like this:

my_shards/
    shard_000.usearch   # view shard (memory-mapped, read-only)
    shard_001.usearch   # view shard (memory-mapped, read-only)
    shard_002.usearch   # active shard (RAM, read-write)
    bloom.isbf          # bloom filter state
    tombstones.npy      # deletion/dedup state (if any removals or upserts pending)

Completed shards are read-only. The highest-numbered shard is the active shard.

Add data

The API is the same as NphdIndex. Shard rotation happens automatically:

import numpy as np

keys = list(range(1000))
vectors = [np.random.randint(0, 256, size=32, dtype=np.uint8) for _ in range(1000)]

for key, vec in zip(keys, vectors):
    index.add(key, vec)

When the active shard exceeds shard_size, it is saved to disk and reopened in view mode (read-only, memory-mapped). A new active shard is then created.
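The rotation rule can be sketched in plain Python. This is an illustrative toy model, not the library's internals; the `Shard` class and `add_with_rotation` helper are hypothetical names for this sketch:

```python
class Shard:
    """Toy stand-in for one shard: tracks bytes written and a frozen flag."""
    def __init__(self):
        self.nbytes = 0
        self.read_only = False

def add_with_rotation(shards, vec_nbytes, shard_size):
    """Append to the active (last) shard; rotate when it reaches shard_size."""
    active = shards[-1]
    active.nbytes += vec_nbytes
    if active.nbytes >= shard_size:
        active.read_only = True   # saved to disk, reopened as a view
        shards.append(Shard())    # fresh in-RAM active shard
    return shards

shards = [Shard()]
for _ in range(10):
    add_with_rotation(shards, vec_nbytes=300, shard_size=1000)

print(len(shards))                      # 3
print([s.read_only for s in shards])    # [True, True, False]
```

The key invariant: only the last shard is ever writable, so completed shards can be safely memory-mapped.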

Search across shards

Queries run across all shards automatically:

query = np.random.randint(0, 256, size=32, dtype=np.uint8)
matches = index.search(query, count=10)
print(matches.keys, matches.distances)

Results from all shards are merged and sorted by distance.
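Conceptually, the merge is a concatenate-and-sort over per-shard result lists. A minimal NumPy sketch of that step (illustrative only, not the library's code):

```python
import numpy as np

# Per-shard results: (keys, distances), each already sorted by distance
shard_results = [
    (np.array([3, 7]), np.array([0.1, 0.4])),
    (np.array([12, 9]), np.array([0.2, 0.3])),
]

keys = np.concatenate([k for k, _ in shard_results])
dists = np.concatenate([d for _, d in shard_results])

order = np.argsort(dists, kind="stable")[:10]  # global top-k (count=10)
merged_keys, merged_dists = keys[order], dists[order]
print(merged_keys)   # [ 3 12  9  7]
print(merged_dists)  # [0.1 0.2 0.3 0.4]
```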

Save and reopen

# Save current state (active shard + bloom filter)
index.save()

# Reopen later -- auto-detects existing shards and max_dim
index = ShardedNphdIndex(path="./my_shards")

Read-only mode

Open an existing index for read-only access with read_only=True. All shards are memory-mapped (no active shard in RAM), and write operations raise RuntimeError:

index = ShardedNphdIndex(path="./my_shards", read_only=True)

# Search and retrieve work normally
matches = index.search(query, count=10)
vec = index.get(42)

# Writes are blocked
index.add(99, vec)  # RuntimeError: index is read-only

Read-only mode requires existing shards on disk — passing read_only=True to an empty path raises ValueError.

Use read-only mode when:

  • Serving search queries from a pre-built index without risk of accidental writes.
  • Running multiple read-only instances against the same shard directory (each process opens its own memory-mapped views).

Skip-if-exists with add_once()

add_once() adds vectors only when their keys do not already exist. Existing keys are silently skipped (first-write-wins):

import numpy as np

vec_a = np.random.randint(0, 256, size=32, dtype=np.uint8)
vec_b = np.random.randint(0, 256, size=32, dtype=np.uint8)

# First add succeeds
index.add_once(1, vec_a)

# Second add is silently skipped — vec_a is kept
index.add_once(1, vec_b)
assert np.array_equal(index.get(1), vec_a)

Batch add_once() deduplicates within the batch (first occurrence wins) and skips keys already in the index:

keys = [10, 11, 10]  # duplicate key 10
vecs = np.random.randint(0, 256, size=(3, 32), dtype=np.uint8)
index.add_once(keys, vecs)  # Adds keys 10 and 11 (second key=10 skipped)
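The first-write-wins rule can be expressed as a small pure-Python filter. A hypothetical sketch (the `dedupe_first_wins` helper is not part of the library):

```python
def dedupe_first_wins(keys, vecs, existing):
    """Keep the first occurrence of each key; skip keys already indexed."""
    out_keys, out_vecs, seen = [], [], set(existing)
    for k, v in zip(keys, vecs):
        if k not in seen:
            out_keys.append(k)
            out_vecs.append(v)
            seen.add(k)
    return out_keys, out_vecs

# Mirrors the batch above against an empty index
keys, vecs = dedupe_first_wins([10, 11, 10], ["a", "b", "c"], existing=set())
print(keys, vecs)  # [10, 11] ['a', 'b']
```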

Note

add_once() requires explicit keys — keys=None raises ValueError.

Reset the index

reset() releases all resources (view shards, active shard, bloom filter) without deleting files on disk. After reset, the index is empty and ready for new add() calls:

print(index.size)  # e.g. 1000
index.reset()
print(index.size)  # 0

# Add fresh data with the same configuration
index.add(1, vec_a)

Note

reset() does not delete shard files. Call it when you want to release memory and start fresh in-process without removing the persisted data.

Choosing shard_size

Workload       Recommended shard size    Rationale
Write-heavy    ⅛ of available RAM        More shards, consistent add throughput
Read-heavy     ½ of available RAM        Fewer shards, lower query latency
Balanced       ¼ of available RAM        Default recommendation

The default is 1 GB. Smaller shards keep insert throughput high but increase query latency because more shards need to be searched. See Sharding design for trade-off details.
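The table above maps directly to a small helper. A hypothetical sketch (the function name and workload labels are not part of the library API):

```python
def recommended_shard_size(available_ram_bytes, workload="balanced"):
    """Map a workload label to the RAM fraction from the table above."""
    fractions = {"write-heavy": 1 / 8, "read-heavy": 1 / 2, "balanced": 1 / 4}
    return int(available_ram_bytes * fractions[workload])

ram = 32 * 1024**3  # 32 GB of available RAM
print(recommended_shard_size(ram, "balanced") // 1024**3)     # 8  (GB)
print(recommended_shard_size(ram, "write-heavy") // 1024**3)  # 4  (GB)
```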

Track unsaved changes with dirty

The dirty property counts unsaved key mutations (adds and removes). Use it to implement caller-driven flush policies:

flush_threshold = 1000

for i, vec in enumerate(vectors):
    index.add(i, vec)
    if index.dirty >= flush_threshold:
        index.save()  # resets dirty to 0
        print(f"Flushed at {i + 1} vectors")

print(index.dirty)  # Mutations since last save

dirty resets to 0 on save() and reset(). Shard rotation does not reset it — unsaved bloom filter and tombstone state may still need flushing. Read-only indexes always return 0.

Properties

print(index.size)  # Logical vector count (excludes tombstoned entries)
print(index.shard_count)  # Number of shard files
print(index.max_dim)  # Maximum bits per vector
print(index.dirty)  # Unsaved key mutations since last save

# Lazy iterators (memory-efficient)
for key in index.keys:
    pass
for vec in index.vectors:
    pass

128-bit key variants

If your keys exceed 64 bits (e.g., composite (iscc_id_body, chunk_index) keys for simprint indexing), use the 128-bit variants:

  • ShardedIndex128 — same as ShardedIndex but with bytes(16) keys
  • ShardedNphdIndex128 — same as ShardedNphdIndex but with bytes(16) keys

The API is identical except that keys are bytes of length 16 (single) or np.dtype('V16') arrays (batch) instead of integers. See the UUID keys how-to for details.
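One way to build such composite keys is to pack two 64-bit integers into 16 bytes with the standard library. A sketch, assuming both components fit in an unsigned 64-bit field (the packing layout shown here is an illustrative choice, not mandated by the library):

```python
import struct
import numpy as np

# Pack a composite (iscc_id_body, chunk_index) pair into a 16-byte key
iscc_id_body = 0x1234_5678_9ABC_DEF0
chunk_index = 7
key = struct.pack(">QQ", iscc_id_body, chunk_index)  # 2 x uint64, big-endian
print(len(key))  # 16

# Batch form: an np.dtype('V16') array of such keys
buf = b"".join(struct.pack(">QQ", iscc_id_body, i) for i in range(3))
keys = np.frombuffer(buf, dtype="V16")
print(keys.dtype, keys.shape)  # |V16 (3,)
```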

Remove vectors

remove() deletes vectors by key. Active shard entries are removed immediately via USearch's lazy deletion. View shard entries are tombstoned — suppressed on reads and cleaned on compact():

# Single remove
index.remove(42)

# Batch remove
index.remove([10, 11, 12])

# Python del syntax
del index[42]

Keys not found in the index are silently ignored.
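The read-side suppression of tombstoned view-shard entries amounts to filtering hits against a tombstone set. An illustrative sketch, not the library's code:

```python
# Tombstoned keys: removed from view shards but not yet compacted away
tombstones = {11, 42}

# Raw (key, distance) hits gathered from all shards for one query
shard_hits = [(42, 0.1), (10, 0.2), (11, 0.3), (12, 0.5)]

# Tombstoned keys never reach search results or iterators
visible = [(k, d) for k, d in shard_hits if k not in tombstones]
print(visible)  # [(10, 0.2), (12, 0.5)]
```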

Note

remove() does not support multi-label indexes (multi=True). Passing multi=True raises ValueError.

Upsert (insert-or-update)

upsert() ensures each key maps to the given vector:

  • Key is new: inserts the vector.
  • Key exists: removes the old entry and inserts the new vector.

import numpy as np

vec = np.array([255, 128, 64, 32], dtype=np.uint8)
index.upsert(1, vec)

# Update with different vector
vec_new = np.array([0, 0, 0, 0], dtype=np.uint8)
index.upsert(1, vec_new)
print(index.get(1))  # array([0, 0, 0, 0], dtype=uint8)

Batch upsert deduplicates within the batch (last occurrence wins):

keys = [1, 2, 1]  # duplicate key 1
vecs = np.random.randint(0, 256, size=(3, 8), dtype=np.uint8)
index.upsert(keys, vecs)  # key 1 gets the third vector
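The last-occurrence-wins rule is equivalent to building a dict from the batch, since later entries overwrite earlier ones. A hypothetical sketch (the `dedupe_last_wins` helper is not part of the library):

```python
def dedupe_last_wins(keys, vecs):
    """Keep only the last occurrence of each key, first-seen order preserved."""
    latest = dict(zip(keys, vecs))  # later entries overwrite earlier ones
    return list(latest.keys()), list(latest.values())

keys, vecs = dedupe_last_wins([1, 2, 1], ["a", "b", "c"])
print(keys, vecs)  # [1, 2] ['c', 'b']
```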

Note

upsert() does not support multi-label indexes (multi=True). Passing multi=True raises ValueError.

See the Upsert how-to for details on upsert() vs add() vs add_once().

Compact the index

After removing or upserting vectors, tombstoned entries still occupy space in view shard files. compact() rebuilds view shards to reclaim that space:

removed = index.compact()
print(f"Compaction removed {removed} entries")

Compaction processes shards newest-to-oldest, dropping tombstoned entries and cross-shard duplicates. It then saves the rebuilt index (including an updated bloom filter).
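The newest-to-oldest pass can be modeled with shards as dicts: the first (newest) copy of each key wins, and tombstoned or stale entries are dropped. An illustrative toy model, not the library's implementation:

```python
def compact(shards, tombstones):
    """shards: oldest-to-newest list of {key: vector} dicts.
    Walk newest-to-oldest; keep the newest copy of each key, drop tombstones."""
    seen, removed, rebuilt = set(), 0, []
    for shard in reversed(shards):
        kept = {}
        for key, vec in shard.items():
            if key in tombstones or key in seen:
                removed += 1          # tombstoned or superseded by a newer shard
            else:
                kept[key] = vec
                seen.add(key)
        rebuilt.append(kept)
    rebuilt.reverse()
    return rebuilt, removed

shards = [{1: "old", 2: "x"}, {1: "new", 3: "y"}]
rebuilt, removed = compact(shards, tombstones={2})
print(removed)   # 2  (tombstoned key 2 + stale duplicate of key 1)
print(rebuilt)   # [{}, {1: 'new', 3: 'y'}]
```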

Tip

Compaction is optional. Tombstoned entries are already filtered from search results and iterators. Compact when disk space matters or tombstone density is high.

Shard directory layout

After adding, removing, and saving data, the directory looks like this:

my_shards/
    shard_000.usearch     # view shard (memory-mapped, read-only)
    shard_001.usearch     # view shard (memory-mapped, read-only)
    shard_002.usearch     # active shard (RAM, read-write)
    bloom.isbf            # bloom filter state
    tombstones.npy        # deletion/dedup state (present after remove or upsert)

The tombstones.npy file persists tombstoned keys and signals that cross-shard deduplication filtering is needed. It is created on first remove() or upsert() and cleared after compact().

Limitations

The following operations raise NotImplementedError:

  • copy() / clear() — would require handling multiple files.
  • join() / cluster() / pairwise_distance() — not applicable to sharded storage.
  • rename() — not supported.

Single-process only

Running multiple processes against the same index files may corrupt data. See Architecture for details.

Required parameters

When creating a new sharded index (no existing shards on disk), max_dim is required. Omitting it raises ValueError. When reopening an existing index, max_dim is auto-detected from the shard metadata.