# MinIO MemKV RELEASE.2026-05-26T06-41-19Z

Released: 2026-05-26

A major reliability and throughput release for the Dynamo / KVBM and NIXL
data paths. The batch control plane moves off RC SEND/RECV onto a
TCP-multiplexed control channel (DC RDMA still carries the bulk bytes),
eliminating the per-thread RC-QP fan-in lock that was timing out 88% of
high-concurrency onboards. The client and NIXL plugin gain GPU Direct
RDMA — GPU device pointers are now registered with the NIC via dma-buf,
removing the host bounce + cudaMemcpy H2D from the onboard hot path. A
new two-stage request batch optimizer in the NIXL plugin collapses
fragmented descriptor lists into the minimum number of wire ops and
fans them out to multiple destinations via DRAM or CUDA-aware scatter.

---

## Downloads

### Server Binary

| Platform | Architecture | Download |
| -------- | ------------ | -------- |
| Linux    | amd64        | [memkv](https://dl.min.io/aistor/memkv/release/linux-amd64/memkv) |
| Linux    | arm64        | [memkv](https://dl.min.io/aistor/memkv/release/linux-arm64/memkv) |

### NIXL Plugin (for Dynamo / KVBM integrations)

| Platform | Architecture | Download |
| -------- | ------------ | -------- |
| Linux    | amd64        | [libplugin_MEMKV.so](https://dl.min.io/aistor/memkv/release/linux-amd64/libplugin_MEMKV.so) |
| Linux    | arm64        | [libplugin_MEMKV.so](https://dl.min.io/aistor/memkv/release/linux-arm64/libplugin_MEMKV.so) |

### LD_PRELOAD Shim (for MLPerf-Storage kvcache workloads)

| Platform | Architecture | Download |
| -------- | ------------ | -------- |
| Linux    | amd64        | [libmemkv_preload.so](https://dl.min.io/aistor/memkv/release/linux-amd64/libmemkv_preload.so) |
| Linux    | arm64        | [libmemkv_preload.so](https://dl.min.io/aistor/memkv/release/linux-arm64/libmemkv_preload.so) |

### Packages

`.deb`, `.rpm`, and `.apk` packages bundle the server + both `.so` sidecars + the LMCache and sglang Python wheels into a single per-arch install.

| Format | Architecture | Download |
| ------ | ------------ | -------- |
| DEB    | amd64        | [memkv\_20260526064119.0.0_amd64.deb](https://dl.min.io/aistor/memkv/release/linux-amd64/memkv_20260526064119.0.0_amd64.deb) |
| DEB    | arm64        | [memkv\_20260526064119.0.0_arm64.deb](https://dl.min.io/aistor/memkv/release/linux-arm64/memkv_20260526064119.0.0_arm64.deb) |
| RPM    | amd64        | [memkv-20260526064119.0.0-1.x86_64.rpm](https://dl.min.io/aistor/memkv/release/linux-amd64/memkv-20260526064119.0.0-1.x86_64.rpm) |
| RPM    | arm64        | [memkv-20260526064119.0.0-1.aarch64.rpm](https://dl.min.io/aistor/memkv/release/linux-arm64/memkv-20260526064119.0.0-1.aarch64.rpm) |
| APK    | amd64        | [memkv\_20260526064119.0.0_x86_64.apk](https://dl.min.io/aistor/memkv/release/linux-amd64/memkv_20260526064119.0.0_x86_64.apk) |
| APK    | arm64        | [memkv\_20260526064119.0.0_aarch64.apk](https://dl.min.io/aistor/memkv/release/linux-arm64/memkv_20260526064119.0.0_aarch64.apk) |

After installing the deb/rpm, the Python plugin wheels land at `/usr/share/memkv/wheels/`:

```bash
pip install /usr/share/memkv/wheels/memkv_lmcache-*.whl
pip install /usr/share/memkv/wheels/memkv_sglang-*.whl
```

The NIXL plugin is auto-symlinked to `/opt/nvidia/nvda_nixl/lib/plugins/` when that directory exists (postinstall hook).

### Container Image

```bash
docker pull quay.io/minio/memkv:RELEASE.2026-05-26T06-41-19Z
docker pull quay.io/minio/memkv:latest
```

Container ships the server + the NIXL plugin (under `/usr/local/lib/plugins/`). The LD_PRELOAD shim and Python wheels are not included in the container image — use the deb/rpm for those.

### Verification

Each binary is signed with both minisign (preferred) and GPG; sha256sums are published alongside.

```bash
# minisign
minisign -Vm memkv -P RWTx5Zr1tiHQLwG9keckT0c45M3AGeHD6IvimQHpyRywVWGbP1aVSGav

# sha256
sha256sum -c memkv.sha256sum
```

(The minisign public key above is the MinIO release-signing key — same key used across the AIStor product line.)

---

## Breaking Changes

### `admin drives reset` default is now BLKDISCARD-only

The `POST /v1/drives/{id}/reset` admin endpoint, and the new
`memkv drive reset` CLI, default to **BLKDISCARD-only** for the data
region. The previous behavior — full zero-fill of the data region on
every reset — is now opt-in via the new `--secure` CLI flag or the
`?secure=true` query parameter.

Reset metadata (journal, bitmap, btree) is always cleared regardless of
the `secure` flag, so this only affects how aggressively the *user data*
region is wiped. The change makes the fast path the default; callers
that depend on guaranteed zero-fill should pass `secure=true`
explicitly.

---

## New Features

### TCP-multiplexed control plane for batch RDMA operations

The bulk data path stays on DC RDMA — block bytes still move via
`RDMA WRITE` (reads) and `RDMA READ` (writes) between server slabs and
client-registered buffers. What moves is the **control plane**: the
`BatchRead` / `BatchWrite` RPCs that announce which keys to fetch and
where to land them.

Previously the control plane rode per-client RC connections. Each new
caller thread allocated a fresh RC QP + CQ + ~4 MiB of pinned control
buffers (~30 MiB pinned per thread), and an outer
`Arc<Mutex<RdmaConnection>>` serialized every RPC round trip. Under
Dynamo's KVBM at TP=8 — eight plugin processes, each with rotating
Tokio worker threads — every onboard collided on a small RC slot pool
and stranded state when threads exited.

The new path opens one long-lived TCP socket per (client, server, rail)
that multiplexes arbitrarily many in-flight RPCs by request ID, with
the writer mutex held only for the bytes of a single socket write. Per
connection state drops from ~30 MiB to ~80 KiB, and socket close is an
unambiguous "client gone" signal so server-side DC peer state is reaped
on disconnect instead of leaking.

New wire messages: `TcpAttach`, `TcpDetach`, `TcpBatchRead`,
`TcpBatchWrite` (response payloads reuse the existing `BatchRead` /
`BatchWrite` shapes). The TCP frames use a 16 MiB payload cap so large
batches (typical KVBM onboards are 7500+ entries) ride a single frame
where the RC path had to chunk them.

Measured impact on Dynamo / KVBM (gpu4 + coe02/coe04, dual 400G,
TP=8, `gpt-oss-120b`, ISL=120 K):

| Metric                  | Before (RC control) | After (TCP control) |
| ----------------------- | ------------------- | ------------------- |
| Onboard success rate    | 16 / 136 (12%)      | 100%                |
| Avg successful TTFT     | 10.6 s              | 3.23 s              |
| Cold-recompute baseline | 16.91 s             | 16.91 s             |
| Speedup vs baseline     | (mostly timeouts)   | 4.5× TTFT           |

### NIXL plugin: two-stage request batch optimizer with CUDA-aware scatter

NIXL hands plugins a fragmented descriptor list of arbitrary length
and shape; without an optimizer layer the plugin issues one wire op
per descriptor and gets bound by per-op framework overhead long
before the network is saturated. This release ports the AIStor-style
two-stage optimizer into the MemKV NIXL plugin:

- **Stage 1** — incremental on-add merge per (server, key) group:
  reads merge across contiguous-or-contained ranges; writes only merge
  strict-contiguous on both object offset and source memory.
- **Stage 2** — global sort and fold across all descriptors; for
  reads, multi-destination lists collapse and any destination spanning
  the full chunk is hoisted as the wire target ("cover"). A per-chunk
  descriptor count preserves NIXL's per-descriptor completion
  accounting after merges.

The plugin then either emits one wire op per chunk into the cover
destination and fans the remaining bytes out via `memcpy` (host
scatter) or `cudaMemcpy(cudaMemcpyDefault)` (CUDA-aware scatter, new
in this release). The CUDA scatter path lazily `dlopen`s `libcudart.so`
at first use — no build-time CUDA dependency — and falls back to
independent wire ops on hosts without CUDA installed. KVBM stores its
KV blocks in plain `cudaMemoryTypeDevice` memory, so this path is
load-bearing for Dynamo onboards.

nixlbench READ pass at `batch_size=16`, dual coe02 + coe04 (32
threads, `num_files=32`, `--check_consistency=1`):

| Block | Avg latency | Throughput   |
| ----- | ----------- | ------------ |
| 4 KB  | sub-µs      | 26.97 GB/s   |
| 64 KB | 4.0 µs      | 16.24 GB/s   |
| 256 KB| 5.2 µs      | 50.55 GB/s   |
| 1 MB  | 12.5 µs     | 83.67 GB/s   |
| 4 MB  | 37.5 µs     | 111.88 GB/s  |
| 8 MB  | 48.7 µs     | 172.17 GB/s  |
| 16 MB | 119.1 µs    | 140.85 GB/s  |

The reported throughput above the dual-400 G wire ceiling at 8 / 16 MB
is a nixlbench accounting artifact (it counts logical bytes per
descriptor while the optimizer collapses many descriptors into one
wire op). The honest reading is that the optimizer drives the wire to
saturation; per-descriptor latency is the more meaningful user-facing
metric and drops 15–20× vs `batch_size=1` across 16–256 KB blocks.

### GPU Direct RDMA via dma-buf for GPU device pointers

`RdmaConnectionPool::get_or_register_mr` now classifies the incoming
address via the CUDA Driver API. When it sees a GPU device pointer it
exports a dma-buf fd via `cuMemGetHandleForAddressRange` and registers
it directly with the NIC through `ibv_reg_dmabuf_mr`. Combined with
the NIXL plugin's new CUDA-aware scatter path, this is the
infrastructure required for the direct-to-device (R2D) onboard path
that lets KV blocks land in HBM without a host bounce + separate
`cudaMemcpy` H2D.

If the dma-buf export is unavailable (older CUDA / kernel) the path
falls back to plain `ibv_reg_mr` on the GPU address; on mlx5 hardware
this still works when the `nvidia_peermem` kernel module is loaded.

The classify cost is sub-microsecond per cache miss and skipped on
cache hits. On hosts without `libcuda.so` the Driver API loader
returns `None` and every address classifies as host, so the existing
host-only paths take the same code path as before.

### Offline `memkv drive reset` CLI

The admin HTTP endpoint `POST /v1/drives/{id}/reset` had no CLI
equivalent for scripted recovery flows — operators had to shell out to
`curl` or run `blkdiscard` manually. The new subcommand:

```bash
memkv drive reset <path> [--secure]
```

resets a drive while the server is offline. It reads the existing
superblock to recover the `device_id` and the JBOF's `num_devices` so
operators do not need to track them externally, and falls back to the
mirror superblock if the primary is unreadable. The metadata region
(journal + bitmap + btree) is always zero-filled before the fresh
superblock is written, so stale journal entries are never replayed on
the next open.

Both the CLI and the admin endpoint accept the new `--secure` /
`?secure=true` flag described under **Breaking Changes**.

### Startup warning when `MALLOC_CONF` leaves jemalloc heap-profile sampling on

If `MALLOC_CONF` is set in the server's environment with
`prof_active:true` (or any related sampling flag), the server now logs
a warning at startup. The combination silently allocates per-call
backtraces and is responsible for steady RSS growth that looks like a
memory leak. The warning makes the misconfiguration visible without
silently disabling it.

### `Engine::route_key` public API + `Key` re-export

`memkv_client::Engine::route_key(&Key) -> String` is a new public
wrapper around the HRW router for use by external callers that need
to group arbitrary-keyed operations by server before invoking
per-server primitives (`Engine::exists`, `Engine::batch_xfer_for_server`,
etc). The existing `route_block` only accepts `u64` block IDs; the new
method takes the full `Key`. `memkv_proto::Key` is now re-exported from
the `memkv-client` crate root so downstream callers no longer need a
separate `memkv-proto` dependency.

---

## Performance

### MR registration: dedup intents, lock-release refactor

Two related optimizations on the MR registration path in the NIXL
plugin and `memkv-client`:

- `engine.track_register` now uses a `HashSet` with a read-locked fast
  path; duplicate `register_mem` calls from NIXL short-circuit before
  touching the write lock. Previously the registration list grew
  without bound under high-frequency calls from
  `execute_memkv_transfer` and the write lock thrashed under
  concurrency.
- `with_thread_rails` now snapshots the pending intent set under the
  shared lock, drops the lock for the slow `ibv_reg_mr` calls (which
  can run in parallel across threads), then reacquires the lock to
  record what was replayed. Concurrent first callers now parallelize
  their MR setup instead of queuing behind one another.

The per-block `register_mem` log was demoted from `info!` to `debug!`
— the synchronous tracing writer serializes worker threads through a
single mutex, and at INFO level the volume of `O(blocks * transfers)`
events was a measurable chunk of TTFT under fan-in.

### Updated headline performance numbers — 96.7 GiB/s peak read

`memkv-bench` dual-server (coe02 + coe04 from gpu4, 64 threads,
non-batch RDMA, 1000 iterations per block size) measures:

- Read peak: 96.70 GiB/s at 16 MB (was 89.98)
- Write peak: 95.77 GiB/s at 16 MB (was 89.86)
- Both directions clear 96 GiB/s at the 8 MB and 16 MB block sizes

Same hardware as the previous published numbers (2× ConnectX-7 400 GbE
on gpu4, 24× PCIe Gen4 QLC across coe02 + coe04). No methodology
change; the gain is from server-side optimizations already in tree.

---

## Hardware / Compatibility

- **Cross-architecture build fix.** `cudaGetErrorString`'s return type
  in the plugin's CUDA bindings was hardcoded to `*const i8`, which
  built fine on x86_64 (where `c_char = i8`) but failed the aarch64
  cross-build (where `c_char = u8`) at `CStr::from_ptr`. The binding
  now uses `*const c_char` and is portable across both targets, so the
  arm64 `.deb` / `.rpm` / `.apk` artifacts in this release ship with
  the same CUDA-aware scatter path as the amd64 build.
- **GPU Direct RDMA requirements.** The new dma-buf MR path is best on
  mlx5 hardware with CUDA ≥ 11.7 and a kernel that exposes
  `cuMemGetHandleForAddressRange`. The `nvidia_peermem` fallback covers
  older stacks transparently.

---

## Build and Packaging

- **Release binary hardening.** Release builds now apply additional
  compiler / linker options (symbol stripping, internal debug-info
  scrubbing) to reduce the surface that automated `strings`-style
  extractors and LLM-driven reverse-engineering tools can reconstruct
  from a shipped binary. No public API or behavior change.

---

## Known Limitations

- **Per-socket TCP session reap is not yet automatic.** When a client
  drops a TCP control connection without sending `TcpDetach`, the
  server leaks one `DcClientHandle` (~tens of bytes) per session until
  the server restarts. This is six orders of magnitude smaller than
  the RC-connection leak the TCP control plane replaces, but it is
  not zero. Automatic socket-close reaping is queued for the next
  release.
- **R2D wiring on the Dynamo side.** The plugin- and client-side
  ground work for direct-to-device onboards (dma-buf MR registration
  + CUDA-aware scatter) is complete in this release, but the matching
  Dynamo connector contract still has to be taught to recognize R2D-
  completed onboards before KVBM will dispatch through it. Until that
  lands upstream, the host-bounce + `cudaMemcpy` H2D remains the
  default path for end-to-end TTFT measurements.

---

## Documentation

- Hosted docs: <https://docs.min.io/memkv/>
- Embedded docs (in the binary): `memkv doc` serves the same site locally.

## Support

- Issues: filed privately by licensed customers and contributors via the support channel
- Security disclosures: security@min.io