§InfoTheory: Information Theoretic Estimators & Metrics
This crate provides a comprehensive suite of information-theoretic primitives for quantifying complexity, dependence, and similarity between data sequences.
It implements two primary classes of estimators:
- Compression-based (Kolmogorov Complexity): Using the ZPAQ compression algorithm to estimate Normalized Compression Distance (NCD).
- Entropy-based (Shannon Information): Using both exact marginal histograms (for i.i.d. data) and the ROSA (Rapid Online Suffix Automaton) predictive language model (for sequential data) to estimate Entropy, Mutual Information, and related distances.
§Mathematical Primitives
The library implements the following core measures. For sequential data, “Rate” variants
use the ROSA model to estimate Ĥ(X) (entropy rate), while “Marginal” variants
treat data as a bag-of-bytes (i.i.d.) and compute H(X) from histograms.
§1. Normalized Compression Distance (NCD)
Approximates the Normalized Information Distance (NID) using a compressor C.
NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
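The distance depends only on three compressed lengths. As a minimal illustration (not the crate's internal implementation), assuming a hypothetical compressed_len closure standing in for the ZPAQ-based size function:

// Sketch only: `compressed_len` is a hypothetical stand-in for a compressor
// that returns the compressed size in bytes.
fn ncd_from_sizes(compressed_len: impl Fn(&[u8]) -> usize, x: &[u8], y: &[u8]) -> f64 {
    let (cx, cy) = (compressed_len(x), compressed_len(y));
    // C(xy): compressed size of the concatenation.
    let xy: Vec<u8> = x.iter().chain(y.iter()).copied().collect();
    (compressed_len(&xy) as f64 - cx.min(cy) as f64) / cx.max(cy) as f64
}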
§2. Normalized Entropy Distance (NED)
An entropic analogue to NCD, defined using Shannon entropy H.
NED(X,Y) = (H(X,Y) - min(H(X), H(Y))) / max(H(X), H(Y))
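For example, with H(X) = 4, H(Y) = 3 and H(X,Y) = 5 bits, NED(X,Y) = (5 − 3) / 4 = 0.5.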
§3. Normalized Transform Effort (NTE)
Based on the Variation of Information (VI), normalized by the maximum entropy.
NTE(X,Y) = (H(X|Y) + H(Y|X)) / max(H(X), H(Y)) = (2H(X,Y) - H(X) - H(Y)) / max(H(X), H(Y))
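With the same example values, VI(X,Y) = 2·5 − 4 − 3 = 3 bits and NTE(X,Y) = 3 / 4 = 0.75.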
§4. Mutual Information (MI)
Measures the amount of information obtained about one random variable by observing another.
I(X;Y) = H(X) + H(Y) - H(X,Y)
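With the same example values, I(X;Y) = 4 + 3 − 5 = 2 bits. For the marginal (i.i.d.) path, all three entropies can be estimated from byte and byte-pair histograms; the following self-contained sketch illustrates that computation (pairing bytes position-wise is an assumption of the sketch, not necessarily how the crate pairs symbols):

use std::collections::HashMap;

// Empirical Shannon entropy in bits per symbol.
fn entropy<T: std::hash::Hash + Eq>(symbols: impl Iterator<Item = T>) -> f64 {
    let mut counts: HashMap<T, usize> = HashMap::new();
    let mut n = 0usize;
    for s in symbols {
        *counts.entry(s).or_insert(0) += 1;
        n += 1;
    }
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n as f64;
            -p * p.log2()
        })
        .sum()
}

// Marginal MI from histograms: I(X;Y) = H(X) + H(Y) - H(X,Y).
// The zip pairs bytes position-wise and truncates to the shorter sequence.
fn mi_marginal(x: &[u8], y: &[u8]) -> f64 {
    let h_x = entropy(x.iter().copied());
    let h_y = entropy(y.iter().copied());
    let h_xy = entropy(x.iter().copied().zip(y.iter().copied()));
    h_x + h_y - h_xy
}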
§5. Divergences & Distances
- Total Variation Distance (TVD): δ(P,Q) = 0.5 * Σ |P(x) - Q(x)|
- Normalized Hellinger Distance (NHD): NHD(P,Q) = sqrt(1 - Σ sqrt(P(x)Q(x)))
- Kullback-Leibler Divergence (KL): D_KL(P||Q) = Σ P(x) log(P(x)/Q(x))
- Jensen-Shannon Divergence (JSD): Symmetrized and smoothed KL divergence.
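All four operate on a pair of probability distributions. A minimal sketch over two already-normalized histograms (illustrative only, independent of the crate's tvd_bytes / nhd_bytes / d_kl_bytes / js_div_bytes API):

// P and Q are assumed normalized distributions over the same alphabet; log base 2.
fn tvd(p: &[f64], q: &[f64]) -> f64 {
    0.5 * p.iter().zip(q).map(|(a, b)| (a - b).abs()).sum::<f64>()
}

fn hellinger(p: &[f64], q: &[f64]) -> f64 {
    let bc: f64 = p.iter().zip(q).map(|(a, b)| (a * b).sqrt()).sum();
    (1.0 - bc).max(0.0).sqrt()
}

fn kl(p: &[f64], q: &[f64]) -> f64 {
    // Terms with p(x) = 0 contribute nothing; q(x) = 0 with p(x) > 0 yields +inf.
    p.iter()
        .zip(q)
        .filter(|(a, _)| **a > 0.0)
        .map(|(a, b)| a * (a / b).log2())
        .sum()
}

fn jsd(p: &[f64], q: &[f64]) -> f64 {
    let m: Vec<f64> = p.iter().zip(q).map(|(a, b)| 0.5 * (a + b)).collect();
    0.5 * kl(p, &m) + 0.5 * kl(q, &m)
}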
§6. Intrinsic Dependence (ID)
Measures the redundancy within a sequence, comparing marginal entropy to entropy rate.
ID(X) = (H_marginal(X) - H_rate(X)) / H_marginal(X)
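For instance, if a sequence has H_marginal(X) = 6 bits/symbol but an entropy rate of only Ĥ_rate(X) = 2 bits/symbol, then ID(X) = (6 − 2) / 6 ≈ 0.67, indicating strong sequential redundancy; an i.i.d. sequence has ID(X) ≈ 0.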
§7. Resistance to Transformation
Quantifies how much information is preserved after a transformation T is applied.
R(X, T) = I(X; T(X)) / H(X)
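For instance, if H(X) = 4 bits and the transformed output still shares I(X; T(X)) = 3 bits with X, then R(X, T) = 3/4 = 0.75. A lossless, invertible T gives R = 1, while a T that destroys all information about X gives R = 0.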
§Usage
use infotheory::{ncd_vitanyi, mutual_information_bytes, NcdVariant};
let x = b"some data sequence";
let y = b"another data sequence";
// Compression-based distance
let ncd = ncd_vitanyi("file1.txt", "file2.txt", "5");
// Entropy-based mutual information (Marginal / i.i.d.)
let mi_marg = mutual_information_bytes(x, y, 0);
// Entropy-based mutual information (Rate / Sequential, max_order=8)
let mi_rate = mutual_information_bytes(x, y, 8);
Modules§
- aixi - MC-AIXI Implementation
- axioms - Axioms: Mathematical Property Verifiers
- ctw - Context Tree Weighting (CTW) and Factorized Action-Conditional CTW (FAC-CTW).
- datagen - Datagen: Synthetic Data Generators for Validation
- mixture - Online mixtures of probabilistic predictors (log-loss Hedge / Bayes, switching, MDL).
Structs§
- InfotheoryCtx
- MixtureExpertSpec - Expert specification for mixture backends.
- MixtureSpec - Mixture specification for rate-backend mixtures.
Enums§
- MixtureKind - Mixture policy kind for rate-backend mixtures.
- NcdBackend
- NcdVariant - ––– NCD (Normalized Compression Distance) –––
- RateBackend
Functions§
- biased_entropy_rate_backend
- biased_entropy_rate_bytes - Compute biased entropy rate Ĥ_biased(X) in bits per symbol.
- compress_size_backend
- compress_size_chain_backend
- conditional_entropy_bytes - Compute conditional entropy H(X|Y) = H(X,Y) − H(Y)
- conditional_entropy_paths - Conditional Entropy for files.
- conditional_entropy_rate_bytes - Compute conditional entropy rate Ĥ(X|Y).
- cross_entropy_bytes - Compute cross-entropy H_{train}(test): score test_data under a model trained on train_data.
- cross_entropy_paths - Cross-Entropy for files.
- cross_entropy_rate_backend - Cross-entropy H_{train}(test): score test_data under a model trained on train_data.
- cross_entropy_rate_bytes - Compute cross-entropy rate using ROSA/CTW/RWKV: trains a model on train_data and evaluates the probability of test_data.
- d_kl_bytes - Kullback-Leibler Divergence D_KL(P || Q) = Σ p(x) log(p(x) / q(x))
- entropy_rate_backend
- entropy_rate_bytes - Compute entropy rate Ĥ(X) in bits/symbol using the ROSA LM.
- get_bytes_from_paths
- get_compressed_size - ––– Base Compression Functions –––
- get_compressed_size_parallel
- get_compressed_sizes_from_paths - Optimizes parallelization
- get_default_ctx - Returns the current default information theory context for the thread.
- get_parallel_compressed_sizes_from_parallel_paths
- get_parallel_compressed_sizes_from_sequential_paths
- get_sequential_compressed_sizes_from_parallel_paths
- get_sequential_compressed_sizes_from_sequential_paths - ––– Bulk File Compression Functions –––
- intrinsic_dependence_bytes - Primitive 6: Intrinsic Dependence (Redundancy Ratio).
- joint_entropy_rate_backend
- joint_entropy_rate_bytes - Compute joint entropy rate Ĥ(X,Y).
- joint_marginal_entropy_bytes - Compute joint marginal entropy H(X,Y) = −Σ p(x,y) log₂ p(x,y) in bits/symbol-pair.
- js_div_bytes - Jensen-Shannon Divergence JSD(P || Q) = 1/2 D_KL(P || M) + 1/2 D_KL(Q || M) where M = 1/2 (P + Q)
- js_divergence_paths - Jensen-Shannon Divergence for files.
- kl_divergence_paths - KL Divergence for files.
- load_rwkv7_model_from_path
- marginal_entropy_bytes - Compute marginal (Shannon) entropy H(X) = −Σ p(x) log₂ p(x) in bits/symbol.
- mutual_information_bytes - Compute mutual information I(X;Y) = H(X) + H(Y) − H(X,Y).
- mutual_information_marg_bytes - Marginal Mutual Information (exact/histogram)
- mutual_information_paths - Mutual Information for files.
- mutual_information_rate_backend
- mutual_information_rate_bytes - Entropy Rate Mutual Information (ROSA predictive)
- ncd_bytes
- ncd_bytes_backend
- ncd_bytes_default - NCD with bytes using the default context.
- ncd_cons
- ncd_matrix_bytes - Computes an NCD matrix (row-major, len = n*n) for in-memory byte blobs.
- ncd_matrix_paths - Computes an NCD matrix (row-major, len = n*n) for files (preloads all files into memory once).
- ncd_paths
- ncd_paths_backend
- ncd_sym_cons
- ncd_sym_vitanyi
- ncd_vitanyi - Back-compat convenience wrappers (operate on file paths).
- ned_bytes - NED(X,Y) = (H(X,Y) − min(H(X), H(Y))) / max(H(X), H(Y))
- ned_cons_bytes - NED_cons(X,Y) = (H(X,Y) − min(H(X), H(Y))) / H(X,Y)
- ned_cons_marg_bytes
- ned_cons_rate_bytes
- ned_marg_bytes - Marginal NED (exact/histogram)
- ned_paths - NED for files.
- ned_rate_backend
- ned_rate_bytes - Normalized Entropy Distance (Rate-based)
- nhd_bytes - NHD(X,Y) = sqrt(1 − BC(X,Y)) where BC = Σᵢ sqrt(p_X(i) · p_Y(i))
- nhd_paths - NHD for files.
- nte_bytes - NTE(X,Y) = VI(X,Y) / max(H(X), H(Y)) where VI(X,Y) = H(X|Y) + H(Y|X) = 2H(X,Y) − H(X) − H(Y).
- nte_marg_bytes
- nte_paths - NTE for files.
- nte_rate_backend
- nte_rate_bytes
- resistance_to_transformation_bytes - Primitive 7: Resistance under Allowed Transformations.
- set_default_ctx - Sets the default information theory context for the thread.
- tvd_bytes - TVD_marg(X,Y) = (1/2) Σᵢ |p_X(i) − p_Y(i)|
- tvd_paths - TVD for files.
- validate_zpaq_rate_method