Crate infotheory


§InfoTheory: Information Theoretic Estimators & Metrics

This crate provides a comprehensive suite of information-theoretic primitives for quantifying complexity, dependence, and similarity between data sequences.

It implements two primary classes of estimators:

  1. Compression-based (Kolmogorov Complexity): Using the ZPAQ compression algorithm to estimate Normalized Compression Distance (NCD).
  2. Entropy-based (Shannon Information): Using both exact marginal histograms (for i.i.d. data) and the ROSA (Rapid Online Suffix Automaton) predictive language model (for sequential data) to estimate Entropy, Mutual Information, and related distances.

§Mathematical Primitives

The library implements the following core measures. For sequential data, “Rate” variants use the ROSA model to estimate Ĥ(X) (entropy rate), while “Marginal” variants treat data as a bag-of-bytes (i.i.d.) and compute H(X) from histograms.

§1. Normalized Compression Distance (NCD)

Approximates the Normalized Information Distance (NID) using a compressor C.

NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
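
Once the compressed sizes C(x), C(y), and C(xy) are known, NCD is simple arithmetic. A minimal sketch with hypothetical sizes standing in for real compressor output (the crate obtains sizes from ZPAQ):

```rust
/// NCD from precomputed compressed sizes C(x), C(y), C(xy).
/// The sizes below are illustrative stand-ins, not real ZPAQ output.
fn ncd(cx: usize, cy: usize, cxy: usize) -> f64 {
    let min = cx.min(cy) as f64;
    let max = cx.max(cy) as f64;
    (cxy as f64 - min) / max
}

fn main() {
    // Hypothetical sizes: C(x)=100, C(y)=120, C(xy)=150.
    // NCD = (150 - 100) / 120 ≈ 0.4167: moderately similar inputs.
    println!("{:.4}", ncd(100, 120, 150));
    // Identical inputs compress jointly to about C(x), so NCD ≈ 0.
    println!("{:.4}", ncd(100, 100, 100));
}
```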

§2. Normalized Entropy Distance (NED)

An entropic analogue to NCD, defined using Shannon entropy H.

NED(X,Y) = (H(X,Y) - min(H(X), H(Y))) / max(H(X), H(Y))
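
A sketch of the NED arithmetic from entropy estimates (the crate supplies both marginal and rate-based H; the values here are hypothetical):

```rust
/// NED from entropy estimates H(X), H(Y), H(X,Y).
fn ned(h_x: f64, h_y: f64, h_xy: f64) -> f64 {
    (h_xy - h_x.min(h_y)) / h_x.max(h_y)
}

fn main() {
    // Identical sequences: H(X,Y) = H(X) = H(Y), so NED = 0.
    println!("{}", ned(1.0, 1.0, 1.0));
    // Independent, equal-entropy sequences: H(X,Y) = 2H(X), so NED = 1.
    println!("{}", ned(1.0, 1.0, 2.0));
}
```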

§3. Normalized Transform Effort (NTE)

Based on the Variation of Information (VI), normalized by the maximum entropy.

NTE(X,Y) = (H(X|Y) + H(Y|X)) / max(H(X), H(Y)) = (2H(X,Y) - H(X) - H(Y)) / max(H(X), H(Y))
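
The two forms are algebraically identical, since H(X|Y) = H(X,Y) − H(Y) and H(Y|X) = H(X,Y) − H(X). A sketch checking both on hypothetical entropy values:

```rust
/// NTE via conditional entropies H(X|Y) + H(Y|X).
fn nte_conditional(h_x: f64, h_y: f64, h_xy: f64) -> f64 {
    ((h_xy - h_y) + (h_xy - h_x)) / h_x.max(h_y)
}

/// NTE via the joint-entropy form 2H(X,Y) - H(X) - H(Y).
fn nte_joint(h_x: f64, h_y: f64, h_xy: f64) -> f64 {
    (2.0 * h_xy - h_x - h_y) / h_x.max(h_y)
}

fn main() {
    // Hypothetical estimates: H(X)=1.0, H(Y)=1.5, H(X,Y)=2.0.
    let (h_x, h_y, h_xy) = (1.0, 1.5, 2.0);
    // Both forms give (2*2.0 - 1.0 - 1.5) / 1.5 = 1.0.
    println!("{}", nte_conditional(h_x, h_y, h_xy));
    println!("{}", nte_joint(h_x, h_y, h_xy));
}
```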

§4. Mutual Information (MI)

Measures the amount of information obtained about one random variable by observing another.

I(X;Y) = H(X) + H(Y) - H(X,Y)
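
A histogram-based sketch of this identity in the "Marginal" (i.i.d.) view, estimating H from byte counts over aligned sequences; the rate variants use the ROSA model instead, so this is only one of the two estimator families:

```rust
use std::collections::HashMap;

/// Marginal (histogram) entropy H(X) in bits/symbol.
fn entropy(data: &[u8]) -> f64 {
    let mut counts = [0usize; 256];
    for &b in data { counts[b as usize] += 1; }
    let n = data.len() as f64;
    counts.iter().filter(|&&c| c > 0)
        .map(|&c| { let p = c as f64 / n; -p * p.log2() })
        .sum()
}

/// Joint entropy H(X,Y) over aligned byte pairs (x_i, y_i).
fn joint_entropy(x: &[u8], y: &[u8]) -> f64 {
    assert_eq!(x.len(), y.len(), "sequences must be aligned");
    let mut counts: HashMap<(u8, u8), usize> = HashMap::new();
    for (&a, &b) in x.iter().zip(y) { *counts.entry((a, b)).or_insert(0) += 1; }
    let n = x.len() as f64;
    counts.values()
        .map(|&c| { let p = c as f64 / n; -p * p.log2() })
        .sum()
}

/// I(X;Y) = H(X) + H(Y) - H(X,Y).
fn mutual_information(x: &[u8], y: &[u8]) -> f64 {
    entropy(x) + entropy(y) - joint_entropy(x, y)
}

fn main() {
    // A sequence shares all of its information with itself: I(X;X) = H(X) = 1 bit.
    println!("{}", mutual_information(b"abababab", b"abababab"));
    // Pairs (a,a),(a,b),(b,a),(b,b) occur once each, so X and Y look
    // independent: H(X,Y) = H(X) + H(Y) and I(X;Y) = 0.
    println!("{}", mutual_information(b"aabb", b"abab"));
}
```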

§5. Divergences & Distances

  • Total Variation Distance (TVD): δ(P,Q) = 0.5 * Σ |P(x) - Q(x)|
  • Normalized Hellinger Distance (NHD): sqrt(1 - Σ sqrt(P(x)Q(x)))
  • Kullback-Leibler Divergence (KL): D_KL(P||Q) = Σ P(x) log(P(x)/Q(x))
  • Jensen-Shannon Divergence (JSD): Symmetrized and smoothed KL divergence.
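
These four measures all compare two byte distributions. A self-contained sketch over empirical histograms, in the spirit of the crate's tvd_bytes, nhd_bytes, d_kl_bytes, and js_div_bytes (the implementations here are illustrative, not the crate's own):

```rust
/// Empirical byte distribution P(x) from a byte slice.
fn dist(data: &[u8]) -> [f64; 256] {
    let mut p = [0.0f64; 256];
    for &b in data { p[b as usize] += 1.0; }
    let n = data.len() as f64;
    for v in p.iter_mut() { *v /= n; }
    p
}

/// Total Variation Distance: 0.5 * Σ |P(x) - Q(x)|.
fn tvd(p: &[f64; 256], q: &[f64; 256]) -> f64 {
    0.5 * p.iter().zip(q).map(|(a, b)| (a - b).abs()).sum::<f64>()
}

/// Normalized Hellinger Distance: sqrt(1 - BC), BC = Σ sqrt(P(x)Q(x)).
fn nhd(p: &[f64; 256], q: &[f64; 256]) -> f64 {
    let bc: f64 = p.iter().zip(q).map(|(a, b)| (a * b).sqrt()).sum();
    (1.0 - bc).max(0.0).sqrt()
}

/// KL divergence in bits; +inf when Q(x) = 0 somewhere on P's support.
fn kl(p: &[f64; 256], q: &[f64; 256]) -> f64 {
    p.iter().zip(q)
        .filter(|(a, _)| **a > 0.0)
        .map(|(a, b)| a * (a / b).log2())
        .sum()
}

/// Jensen-Shannon divergence: symmetric, and always finite because the
/// mixture M is positive wherever P or Q is.
fn jsd(p: &[f64; 256], q: &[f64; 256]) -> f64 {
    let mut m = [0.0f64; 256];
    for i in 0..256 { m[i] = 0.5 * (p[i] + q[i]); }
    0.5 * kl(p, &m) + 0.5 * kl(q, &m)
}

fn main() {
    // Disjoint supports: every bounded distance saturates at its maximum.
    let (p, q) = (dist(b"aaaa"), dist(b"bbbb"));
    println!("TVD={} NHD={} JSD={}", tvd(&p, &q), nhd(&p, &q), jsd(&p, &q));
}
```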

§6. Intrinsic Dependence (ID)

Measures the redundancy within a sequence, comparing marginal entropy to entropy rate.

ID(X) = (H_marginal(X) - H_rate(X)) / H_marginal(X)
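
To illustrate the redundancy ratio, the sketch below stands in a crude order-1 (bigram) plug-in estimate for the entropy rate. This is purely an assumption for demonstration: the crate's rate estimators use the ROSA model, not bigrams.

```rust
use std::collections::HashMap;

/// Marginal (histogram) entropy H(X) in bits/symbol.
fn marginal_entropy(data: &[u8]) -> f64 {
    let mut counts = [0usize; 256];
    for &b in data { counts[b as usize] += 1; }
    let n = data.len() as f64;
    counts.iter().filter(|&&c| c > 0)
        .map(|&c| { let p = c as f64 / n; -p * p.log2() })
        .sum()
}

/// Crude order-1 plug-in entropy rate H(X_n | X_{n-1}) from bigram counts.
/// A toy stand-in for the crate's ROSA-based rate estimators.
fn order1_rate(data: &[u8]) -> f64 {
    let mut pair: HashMap<(u8, u8), usize> = HashMap::new();
    let mut ctx: HashMap<u8, usize> = HashMap::new();
    for w in data.windows(2) {
        *pair.entry((w[0], w[1])).or_insert(0) += 1;
        *ctx.entry(w[0]).or_insert(0) += 1;
    }
    let n = (data.len() - 1) as f64;
    pair.iter()
        .map(|(&(a, _), &c)| {
            let p_joint = c as f64 / n;             // p(x_{n-1}, x_n)
            let p_cond = c as f64 / ctx[&a] as f64; // p(x_n | x_{n-1})
            -p_joint * p_cond.log2()
        })
        .sum()
}

/// ID(X) = (H_marginal(X) - H_rate(X)) / H_marginal(X).
fn intrinsic_dependence(data: &[u8]) -> f64 {
    let h = marginal_entropy(data);
    (h - order1_rate(data)) / h
}

fn main() {
    // "abab..." is fully predictable from one symbol of context:
    // marginal entropy = 1 bit, rate = 0 bits, so ID = 1 (fully redundant).
    println!("{}", intrinsic_dependence(b"abababababababab"));
}
```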

§7. Resistance to Transformation

Quantifies how much information is preserved after a transformation T is applied.

R(X, T) = I(X; T(X)) / H(X)
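
A histogram-based sketch of this ratio for byte-wise transforms T, using the marginal MI identity; it is an illustration only (resistance_to_transformation_bytes is the crate's actual entry point, and its estimators may differ):

```rust
use std::collections::HashMap;

/// Marginal (histogram) entropy H(X) in bits/symbol.
fn entropy(data: &[u8]) -> f64 {
    let mut counts = [0usize; 256];
    for &b in data { counts[b as usize] += 1; }
    let n = data.len() as f64;
    counts.iter().filter(|&&c| c > 0)
        .map(|&c| { let p = c as f64 / n; -p * p.log2() })
        .sum()
}

/// Joint entropy H(X,Y) over aligned byte pairs (x_i, y_i).
fn joint_entropy(x: &[u8], y: &[u8]) -> f64 {
    let mut counts: HashMap<(u8, u8), usize> = HashMap::new();
    for (&a, &b) in x.iter().zip(y) { *counts.entry((a, b)).or_insert(0) += 1; }
    let n = x.len() as f64;
    counts.values()
        .map(|&c| { let p = c as f64 / n; -p * p.log2() })
        .sum()
}

/// R(X, T) = I(X; T(X)) / H(X), with T applied byte-wise.
fn resistance(x: &[u8], t: impl Fn(u8) -> u8) -> f64 {
    let tx: Vec<u8> = x.iter().map(|&b| t(b)).collect();
    let mi = entropy(x) + entropy(&tx) - joint_entropy(x, &tx);
    mi / entropy(x)
}

fn main() {
    // A bijection (XOR mask) preserves all information: R = 1.
    println!("{}", resistance(b"hello world", |b| b ^ 0xFF));
    // A constant map destroys all information: R = 0.
    println!("{}", resistance(b"hello world", |_| 0));
}
```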

§Usage

use infotheory::{ncd_vitanyi, mutual_information_bytes, NcdVariant};

let x = b"some data sequence";
let y = b"another data sequence";

// Compression-based distance
let ncd = ncd_vitanyi("file1.txt", "file2.txt", "5");

// Entropy-based mutual information (Marginal / i.i.d.)
let mi_marg = mutual_information_bytes(x, y, 0);

// Entropy-based mutual information (Rate / Sequential, max_order=8)
let mi_rate = mutual_information_bytes(x, y, 8);

Re-exports§

pub use backends::ctw;
pub use backends::mambazip;
pub use backends::match_model;
pub use backends::particle;
pub use backends::ppmd;
pub use backends::rosaplus;
pub use backends::rwkvzip;
pub use backends::sparse_match;
pub use backends::zpaq_rate;

Modules§

aixi
AIXI planning components, environments, and model abstractions.
axioms
Core information-theoretic axioms and validation helpers.
backends
Entropy/compression backend implementations, backend discovery helpers, and canonical backend naming.
coders
Entropy coder implementations (AC and rANS) for rwkvzip.
compression
Rate-coded compression helpers (AC/rANS) built on generic rate backends, with optional framing.
datagen
Synthetic data generators for information-theory experiments.
mixture
Online mixtures of probabilistic predictors (log-loss Hedge / Bayes, switching, MDL).
search
Information-theoretic code search pipeline (3-stage: prefilter, filter, KMI rerank).

Structs§

CalibratedSpec
Configuration for a calibrated wrapper rate backend.
GenerationConfig
Generation options shared by the library API and CLI.
InfotheoryCtx
Reusable execution context holding default rate and compression backends.
MixtureExpertSpec
Expert specification for mixture backends.
MixtureSpec
Mixture specification for rate-backend mixtures.
ParticleSpec
Configuration for a particle-latent filter ensemble rate backend.
RateBackendSession
Stateful rate-backend session for fitting, conditioning, and continuation.

Enums§

CalibrationContextKind
Fixed context families for calibrated PDF wrappers.
CompressionBackend
Compression backend used by NCD/compression-size operations.
GenerationStrategy
How to pick the next byte from the model distribution.
GenerationUpdateMode
How generated symbols should update the model state.
MixtureKind
Mixture policy kind for rate-backend mixtures.
NcdVariant
Variant of the NCD (Normalized Compression Distance) formula (Vitanyi, conservative, and their symmetric forms).
RateBackend
Sequential entropy/rate backend used by context-aware metrics.

Functions§

biased_entropy_rate_backend
Estimate biased/plugin entropy rate of data using the explicit rate backend.
biased_entropy_rate_bytes
Compute biased entropy rate Ĥ_biased(X) in bits per symbol.
compress_bytes_backend
Compress bytes with a selected compression backend.
compress_size_backend
Compute compressed size of a single byte slice with a selected compression backend.
compress_size_chain_backend
Compute compressed size of a chain of byte slices with a selected compression backend.
conditional_entropy_bytes
Compute conditional entropy H(X|Y) = H(X,Y) − H(Y)
conditional_entropy_paths
Conditional Entropy for files.
conditional_entropy_rate_bytes
Compute conditional entropy rate Ĥ(X|Y).
cross_entropy_bytes
Compute cross-entropy H_{train}(test): score test_data under a model trained on train_data.
cross_entropy_paths
Cross-Entropy for files.
cross_entropy_rate_backend
Cross-entropy H_{train}(test): score test_data under a model trained on train_data.
cross_entropy_rate_bytes
Compute cross-entropy rate using ROSA/CTW/RWKV: trains a model on train_data and evaluates the probability of test_data.
d_kl_bytes
Kullback-Leibler Divergence D_KL(P || Q) = Σ p(x) log(p(x) / q(x))
decompress_bytes_backend
Decompress bytes with a selected compression backend.
entropy_rate_backend
Estimate entropy rate of data using the explicit rate backend.
entropy_rate_bytes
Compute entropy rate Ĥ(X) in bits/symbol using ROSA LM.
generate_bytes
Generate a continuation from prompt using the current default context and GenerationConfig::default().
generate_bytes_conditional_chain
Generate a continuation after conditioning on an explicit chain of prefix parts using the current default context and GenerationConfig::default().
generate_bytes_conditional_chain_with_config
Generate a continuation after conditioning on an explicit chain of prefix parts using the current default context.
generate_bytes_with_config
Generate a continuation from prompt using the current default context.
get_bytes_from_paths
Read all files in paths in parallel and return their byte contents.
get_compressed_size
Compute compressed size for a file path using ZPAQ.
get_compressed_size_parallel
Compute compressed size for a file path with an explicit ZPAQ thread count.
get_compressed_sizes_from_paths
Compute compressed sizes for all paths with optimized parallelization.
get_default_ctx
Returns the current default information theory context for the thread.
get_parallel_compressed_sizes_from_parallel_paths
Compress all paths directly from disk using per-file multi-thread ZPAQ.
get_parallel_compressed_sizes_from_sequential_paths
Compress all paths after preloading bytes, using per-file parallel ZPAQ compression.
get_sequential_compressed_sizes_from_parallel_paths
Compress all paths directly from disk using single-thread ZPAQ per file.
get_sequential_compressed_sizes_from_sequential_paths
Compress all paths after preloading bytes, using single-thread ZPAQ per file.
intrinsic_dependence_bytes
Primitive 6: Intrinsic Dependence (Redundancy Ratio).
joint_entropy_rate_backend
Estimate joint entropy rate H(X,Y) using an explicit backend.
joint_entropy_rate_bytes
Compute joint entropy rate Ĥ(X,Y).
joint_marginal_entropy_bytes
Compute joint marginal entropy H(X,Y) = −Σ p(x,y) log₂ p(x,y) in bits/symbol-pair.
js_div_bytes
Jensen-Shannon Divergence JSD(P || Q) = 1/2 D_KL(P || M) + 1/2 D_KL(Q || M) where M = 1/2 (P + Q)
js_divergence_paths
Jensen-Shannon Divergence for files.
kl_divergence_paths
KL Divergence for files.
load_mamba_model_from_path
Load a Mamba-1 model from .safetensors path.
load_rwkv7_model_from_path
Load an RWKV7 model from .safetensors path.
marginal_entropy_bytes
Compute marginal (Shannon) entropy H(X) = −Σ p(x) log₂ p(x) in bits/symbol.
mutual_information_bytes
Compute mutual information I(X;Y) = H(X) + H(Y) - H(X,Y).
mutual_information_marg_bytes
Marginal Mutual Information (exact/histogram)
mutual_information_paths
Mutual Information for files.
mutual_information_rate_backend
Mutual information rate estimate under an explicit backend.
mutual_information_rate_bytes
Entropy Rate Mutual Information (ROSA predictive)
ncd_bytes
Compute NCD for in-memory byte slices using the given ZPAQ method and variant.
ncd_bytes_backend
Compute NCD for in-memory byte slices using an explicit compression backend.
ncd_bytes_default
NCD with bytes using the default context.
ncd_cons
Convenience wrapper for conservative NCD on file paths.
ncd_matrix_bytes
Computes an NCD matrix (row-major, len = n*n) for in-memory byte blobs.
ncd_matrix_paths
Computes an NCD matrix (row-major, len = n*n) for files (preloads all files into memory once).
ncd_paths
Compute NCD for two file paths using a ZPAQ method and variant.
ncd_paths_backend
Compute NCD for two file paths using an explicit compression backend.
ncd_sym_cons
Convenience wrapper for symmetric-conservative NCD on file paths.
ncd_sym_vitanyi
Convenience wrapper for symmetric-Vitanyi NCD on file paths.
ncd_vitanyi
Convenience wrapper for Vitanyi NCD on file paths (back-compat).
ned_bytes
NED(X,Y) = (H(X,Y) - min(H(X), H(Y))) / max(H(X), H(Y))
ned_cons_bytes
NED_cons(X,Y) = (H(X,Y) - min(H(X), H(Y))) / H(X,Y)
ned_cons_marg_bytes
Conservative marginal NED using histogram entropy estimates.
ned_cons_rate_bytes
Conservative rate NED using the current default context backend.
ned_marg_bytes
Marginal NED (exact/histogram)
ned_paths
NED for files.
ned_rate_backend
Normalized entropy distance under an explicit backend.
ned_rate_bytes
Normalized Entropy Distance (Rate-based)
nhd_bytes
NHD(X,Y) = sqrt(1 - BC(X,Y)) where BC = Σᵢ sqrt(p_X(i) · p_Y(i))
nhd_paths
NHD for files.
nte_bytes
NTE(X,Y) = VI(X,Y) / max(H(X), H(Y)) where VI(X,Y) = H(X|Y) + H(Y|X) = 2H(X,Y) - H(X) - H(Y).
nte_marg_bytes
Marginal NTE using histogram entropy estimates.
nte_paths
NTE for files.
nte_rate_backend
Normalized transform effort (variation-of-information form) under an explicit backend.
nte_rate_bytes
Rate NTE using the current default context backend.
resistance_to_transformation_bytes
Primitive 7: Resistance under Allowed Transformations.
set_default_ctx
Sets the default information theory context for the thread.
tvd_bytes
TVD_marg(X,Y) = (1/2) Σᵢ |p_X(i) - p_Y(i)|
tvd_paths
TVD for files.
validate_zpaq_rate_method
Validate that a ZPAQ method string is supported for rate estimation.