High-performance RWKV7 inference kernel for x86_64.
This module provides an RWKV7 implementation optimized specifically for x86_64 CPUs with AVX2 and FMA support. There are no portability fallbacks.
§Architecture
- All matrix operations are SIMD-vectorized (AVX2 + FMA)
- State updates use a hand-tuned kernel for head dimension N=64
- Memory layout optimized for cache efficiency
- No external BLAS dependencies
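As a rough illustration of the AVX2 + FMA vectorization described above, the following is a minimal sketch of a dot product built on `std::arch` intrinsics. The names `dot`, `dot_scalar`, and `dot_avx2_fma` are hypothetical and not part of this module's API. The runtime feature check and scalar path are included only so the sketch runs on any machine; the module itself states that it ships no such fallback.

```rust
/// Scalar reference; also the path taken when AVX2/FMA are unavailable.
pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Hypothetical AVX2 + FMA dot product: 8 f32 lanes per iteration.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2_fma(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let n = a.len().min(b.len());
    let chunks = n / 8;
    let mut lanes = [0.0f32; 8];
    unsafe {
        let mut acc = _mm256_setzero_ps();
        for i in 0..chunks {
            // Unaligned loads: the input slices carry no alignment guarantee.
            let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
            // acc = va * vb + acc, fused into one instruction.
            acc = _mm256_fmadd_ps(va, vb, acc);
        }
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    }
    // Horizontal reduction of the 8 lanes, then the scalar tail.
    let mut sum: f32 = lanes.iter().sum();
    for i in chunks * 8..n {
        sum += a[i] * b[i];
    }
    sum
}

/// Dispatches to the SIMD path when the CPU supports it.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: the required CPU features were verified just above.
            return unsafe { dot_avx2_fma(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

A matrix-vector product can be assembled from such a routine by calling it once per row of a row-major matrix. A real kernel would likely use several independent accumulators to hide FMA latency.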
Modules§
- training
- RWKV7 training (byte-level) implemented in Rust.
Structs§
- Config
- Model configuration.
- LayerProfiler
- Collects wall-clock timings for each transformer block.
- LayerTiming
- Timing data for a single transformer block.
- Model
- RWKV7 model.
- NullProfiler
- No-op profiler used by default to keep the fast path branch-free.
- ScratchBuffers
- Pre-allocated scratch buffers to avoid allocations in hot path.
- State
- Full model state.
- Tensor1D
- Owned 1D tensor with aligned memory.
- Tensor2D
- Owned 2D tensor with aligned memory (row-major).
- TensorView1D
- View into external f32 data (for weights).
- TensorView2D
- View into external f32 data (for weights), row-major.
- Weights
- Container for all loaded RWKV7 model weights.
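The owned tensor types above are described as holding aligned, row-major memory. One way such a buffer could be built is sketched below with `std::alloc`. `AlignedMatrix`, its methods, and the 32-byte alignment (the width of one AVX2 register) are assumptions made for illustration; the actual fields and alignment of `Tensor1D` and `Tensor2D` are not documented here.

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};
use std::ptr::NonNull;

/// Assumed alignment: 32 bytes, one AVX2 register.
const ALIGN: usize = 32;

/// Hypothetical owned, zero-initialized, row-major f32 matrix.
pub struct AlignedMatrix {
    ptr: NonNull<f32>,
    rows: usize,
    cols: usize,
}

impl AlignedMatrix {
    fn layout(len: usize) -> Layout {
        Layout::from_size_align(len * std::mem::size_of::<f32>(), ALIGN)
            .expect("invalid layout")
    }

    pub fn zeros(rows: usize, cols: usize) -> Self {
        let len = rows * cols;
        assert!(len > 0, "zero-sized matrices are not handled in this sketch");
        let layout = Self::layout(len);
        // SAFETY: the layout has non-zero size; all-zero bits are valid f32 values.
        let raw = unsafe { alloc_zeroed(layout) } as *mut f32;
        let ptr = NonNull::new(raw).unwrap_or_else(|| handle_alloc_error(layout));
        Self { ptr, rows, cols }
    }

    pub fn as_slice(&self) -> &[f32] {
        // SAFETY: ptr points to rows * cols initialized f32 values owned by self.
        unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.rows * self.cols) }
    }

    pub fn as_mut_slice(&mut self) -> &mut [f32] {
        // SAFETY: as above, and &mut self guarantees exclusive access.
        unsafe { std::slice::from_raw_parts_mut(self.ptr.as_ptr(), self.rows * self.cols) }
    }

    /// Row r occupies the contiguous range [r * cols, (r + 1) * cols).
    pub fn row(&self, r: usize) -> &[f32] {
        let cols = self.cols;
        &self.as_slice()[r * cols..(r + 1) * cols]
    }

    pub fn row_mut(&mut self, r: usize) -> &mut [f32] {
        let cols = self.cols;
        &mut self.as_mut_slice()[r * cols..(r + 1) * cols]
    }
}

impl Drop for AlignedMatrix {
    fn drop(&mut self) {
        // SAFETY: same pointer and layout that were used for the allocation.
        unsafe { dealloc(self.ptr.as_ptr() as *mut u8, Self::layout(self.rows * self.cols)) }
    }
}
```

Because each row is contiguous, a row can be handed directly to a SIMD routine that walks it sequentially. Note that only the base pointer is guaranteed to be 32-byte aligned in this sketch; individual rows are aligned only when `cols` is a multiple of 8.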
Traits§
- ProfilerSink
- Sink trait used by the model to surface per-layer timings without committing to a particular profiler implementation.
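The sink pattern described above can be sketched as follows. This is an illustration under assumed names, not the crate's actual trait: `TimingSink`, `NoopSink`, `CollectingSink`, `layer_start`, `layer_end`, and `run_layers` are all hypothetical. The idea shown is that code generic over the sink is monomorphized per implementation, so the calls into a no-op sink have empty bodies that the compiler can inline away rather than guarding with a runtime "is profiling enabled" check.

```rust
use std::time::{Duration, Instant};

/// Hypothetical sink interface: the model reports layer boundaries,
/// and the implementation decides what, if anything, to measure.
pub trait TimingSink {
    fn layer_start(&mut self, layer: usize);
    fn layer_end(&mut self, layer: usize);
}

/// No-op sink: both hooks are empty.
pub struct NoopSink;

impl TimingSink for NoopSink {
    #[inline(always)]
    fn layer_start(&mut self, _layer: usize) {}
    #[inline(always)]
    fn layer_end(&mut self, _layer: usize) {}
}

/// Collecting sink: records one wall-clock duration per layer.
#[derive(Default)]
pub struct CollectingSink {
    started: Option<Instant>,
    pub timings: Vec<(usize, Duration)>,
}

impl TimingSink for CollectingSink {
    fn layer_start(&mut self, _layer: usize) {
        self.started = Some(Instant::now());
    }
    fn layer_end(&mut self, layer: usize) {
        if let Some(t) = self.started.take() {
            self.timings.push((layer, t.elapsed()));
        }
    }
}

/// Stand-in for a forward pass: generic over the sink type.
pub fn run_layers<S: TimingSink>(n_layers: usize, sink: &mut S) -> u64 {
    let mut acc = 0u64;
    for layer in 0..n_layers {
        sink.layer_start(layer);
        acc = acc.wrapping_add(layer as u64); // placeholder for the layer's work
        sink.layer_end(layer);
    }
    acc
}
```

Calling `run_layers(n, &mut NoopSink)` and `run_layers(n, &mut CollectingSink::default())` produces the same result; only the second pays for the `Instant::now()` calls.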