k-spaces fits mixtures of low dimensional Gaussian latent variable models for data modeling, dimension reduction, and subspace learning https://www.biorxiv.org/content/10.1101/2025.11.24.690254v1. It provides a multipurpose toolkit for exploratory data analysis and probabilistic modeling.
More examples will be uploaded soon but model_fitting_examples.ipynb has some basic usage examples and documentation, and notebooks from the paper can be found at https://github.com/pachterlab/MEPSP_2024/tree/main/notebooks_paper, with minor differences in function calls.
Functions intended for general usage:

| module | function or class | short description |
|---|---|---|
| EM | run_EM | given data, construct and fit a model. |
| EM | E_step | given a fitted model and some data, perform assignments. |
| EM | fit_single_space | given data, construct and fit a single affine_subspace. |
| EM | run_EM_parallelized | parallelized version of run_EM using multiprocessing. |
| affine_subspace_ | affine_subspace | class defining an affine subspace with a probability density for data. |
| affine_subspace_ | affine_subspace.transform | linear dimensionality reduction of points onto the space. |
| affine_subspace_ | affine_subspace.projection | projection of points onto the space, still in high dimension. |
| affine_subspace_ | affine_subspace.probability | compute P(points \| space). |
| affine_subspace_ | fixed_space | class that only updates noise and component weight in EM. Inherits from affine_subspace. |
| affine_subspace_ | bg_space | class, even more constrained than fixed_space, that models a fixed level of background noise and only updates component weight in EM. Inherits from affine_subspace. |
| affine_subspace_ | ensure_vector_directions | PCA, SVD, and k-spaces vectors are sign indeterminate. Flips basis vectors to point in the positive direction. |
| affine_subspace_ | write_spaces | writes spaces to a csv file. |
| affine_subspace_ | read_spaces | reads spaces from a csv file. |
| model_selection_ | total_log_likelihood | compute the observed log likelihood of the data. |
| model_selection_ | model_selection | perform model selection using BIC or ICL. |
| model_selection_ | get_BIC | compute BIC for a custom model selection pipeline. |
| model_selection_ | get_ICL | compute ICL for a custom model selection pipeline. |
| generate_ | generate | generate synthetic data from a k-spaces model. |
| plotting | view_3D_pca_mpl | plots points and the projections of spaces in a 3D PCA space with matplotlib. |
| plotting | view_3D_pretransformed_mpl | plots pretransformed points in a 3D subspace, rescaling point sizes by distance, with matplotlib. |
| plotting | project_space | projects one space onto another (or onto an sklearn PCA space). |
These functions can also be accessed without the module name (e.g., kspaces.run_EM(...) instead of kspaces.EM.run_EM(...)).
Install with pip: pip install kspaces

Given a data matrix of N observations x D features, construct and fit a model with kspaces.EM.run_EM.
kspaces.EM.run_EM requires specification of kd, the subspace dimensions, and accepts a list to specify them. For example, [1,0,2,3] specifies a model with four subspaces of dimensions 1, 0, 2, and 3: a line, a point, a plane, and a 3-dimensional subspace.
kspaces.EM.run_EM does not require specifying assignment = 'soft' or assignment = 'hard', but this choice matters. Soft assignment allows points to be probabilistically 'shared' by subspaces, as described in the paper, while hard assignment is all-or-nothing. In some contexts, hard assignment will be faster because it uses singular value decomposition. However, soft assignment has smoother convergence. assignment = 'closest' is a form of hard assignment that does not use a probabilistic model and simply assigns points to their closest subspaces. It is faster but generally not recommended.
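The two probabilistic assignment modes can be illustrated with a small NumPy sketch (illustrative only — made-up likelihood values, not the library's internals):

```python
import numpy as np

# Hypothetical per-point likelihoods P(point | space_j) for 4 points and
# 2 subspaces, with equal component weights P(space_j).
likelihoods = np.array([[0.9, 0.1],
                        [0.6, 0.4],
                        [0.2, 0.8],
                        [0.5, 0.5]])

# Soft assignment: responsibilities P(space_j | point_i) via Bayes' rule,
# so each point can be probabilistically 'shared' by subspaces.
soft = likelihoods / likelihoods.sum(axis=1, keepdims=True)

# Hard assignment: all-or-nothing, each point goes to its most probable subspace.
hard = np.zeros_like(soft)
hard[np.arange(len(soft)), soft.argmax(axis=1)] = 1.0

print(soft[1])  # point shared: [0.6 0.4]
print(hard[1])  # all-or-nothing: [1. 0.]
```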
kspaces.EM.run_EM does not require specifying initializations or max_iter, but depending on how hard the globally optimal solution is to find, more initializations than the default of 10 may be needed; decreasing the tolerance, tol, and increasing the number of allowed iterations per EM run, max_iter, will also help.
Here is a basic example:
spaces, responsibilities = kspaces.EM.run_EM(data, [1,1,2], assignment = 'soft', initializations = 50)

k-spaces fits two lines and a plane to the data, returns those affine_subspace objects in the list spaces, and also returns an N x k array of probabilities that each subspace 'generated' each point under the fitted probability distributions. The affine_subspaces in spaces and the columns of responsibilities are in the same order as the spaces were specified in kd.
To use these spaces as dimension reductions for our data, we can use the transform function. While each space was optimized for a subset of the data, the function for the dimension reduction is defined everywhere, so we can project all the points onto any of the spaces for a dimension reduction if we want. Let's use the third space, the plane, so we can get 2-D dimension-reduced data:
reduced_dim_data = spaces[2].transform(data)

We can visualize this with kspaces.plotting, which has functions for visualizing high dimensional data in 3-D dimension reductions. view_3D_pca_mpl and view_3D_pretransformed_mpl both use matplotlib. view_3D_pca_mpl computes a dimension reduction for the data using 3D PCA and can optionally project our fitted subspaces into the plot too. view_3D_pretransformed_mpl assumes we have already transformed the data using one of our subspaces (as above, although we'd need d = 3) and want to visualize it.
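What transform does geometrically can be sketched in NumPy (a toy example with a hypothetical orthonormal basis B and translation t — not the library's code):

```python
import numpy as np

# Hypothetical 2-D plane in 5-D ambient space: orthonormal basis B (d x D)
# and translation t (the plane's "origin").
B = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0, 0.0]])
t = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

points = np.random.default_rng(0).normal(size=(10, 5))

# transform: the low-dimensional (latent) coordinates of each point,
# i.e. the displacement from t expressed in the basis B.
latent = (points - t) @ B.T  # N x d

print(latent.shape)  # (10, 2)
```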
If we want to project the points onto their subspaces in the high dimensional space, we use projection instead.
projected_data = spaces[2].projection(data)

This data lies on the plane but is still represented in the high dimensional ambient space.
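Conceptually, projection maps the latent coordinates back into the ambient space; a toy sketch with a hypothetical line (orthonormal basis B, translation t):

```python
import numpy as np

# Hypothetical 1-D subspace (a line) in 3-D: orthonormal basis B (d x D)
# and translation t (a point the line passes through).
B = np.array([[0.6, 0.8, 0.0]])  # a single unit-length basis vector
t = np.array([1.0, 1.0, 1.0])

points = np.array([[2.0, 3.0, 4.0],
                   [0.0, 0.0, 0.0]])

# projection: latent coordinates mapped back into the D-dimensional ambient
# space, so the result lies on the line but keeps D columns.
projected = t + ((points - t) @ B.T) @ B

# The residuals are orthogonal to the subspace.
residuals = points - projected
print(np.allclose(residuals @ B.T, 0.0))  # True
```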
We can save this with write_spaces and come back later to read it back in with read_spaces. The assignment matrix is not saved, only the model parameters, but we can use E_step to assign the points.
kspaces.affine_subspace_.write_spaces(spaces, 'myfile.csv')
spaces = kspaces.affine_subspace_.read_spaces('myfile.csv')
probabilities = kspaces.EM.E_step(points, spaces, assignment = 'soft')

EM - contains functions to fit a kspaces model using an EM algorithm. The option to use deterministic annealing (Ueda and Nakano 1998) can improve the odds of finding the global maximum and is particularly helpful when many or most initializations with run_EM fail.
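The deterministic annealing schedule works by starting at an inverse temperature beta_0 < 1 and multiplying by anneal_rate each round until beta reaches 1 (ordinary EM); a minimal sketch using the documented defaults:

```python
# Deterministic annealing starts "hot" (beta_0 < 1, softer responsibilities)
# and cools by multiplying beta by anneal_rate until beta reaches 1,
# at which point the final rounds are ordinary EM.
beta_0, anneal_rate = 0.5, 1.2  # the documented defaults
schedule = []
beta = beta_0
while beta < 1.0:
    schedule.append(beta)
    beta = min(1.0, beta * anneal_rate)
schedule.append(1.0)

print(schedule)  # five rounds, beta growing from 0.5 to 1.0
```

Higher beta_0 or a higher anneal_rate shortens this schedule, trading robustness for speed.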
model_selection - contains functions for calculating likelihoods and performing model selection with ICL or BIC. BIC suggests a model based on the number of parameters and the observed likelihood of the data. ICL additionally penalizes BIC by the clustering entropy, discouraging models whose cluster components have overlapping densities.
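A sketch of how these criteria relate, using one common convention (lower is better; the df, log-likelihood, and responsibility values below are made up, not output of the library):

```python
import numpy as np

def bic(df, n, log_likelihood):
    # One common convention: BIC = df * ln(N) - 2 * logL (lower is better).
    return df * np.log(n) - 2.0 * log_likelihood

def icl(df, n, log_likelihood, responsibilities):
    # ICL = BIC plus twice the clustering entropy of the soft assignments,
    # penalizing components whose densities overlap heavily.
    p = np.clip(responsibilities, 1e-300, 1.0)
    entropy = -np.sum(p * np.log(p))
    return bic(df, n, log_likelihood) + 2.0 * entropy

# Made-up responsibilities: confident assignments add little entropy penalty,
# ambiguous ones add a lot.
crisp = np.array([[0.99, 0.01], [0.01, 0.99]])
fuzzy = np.array([[0.5, 0.5], [0.5, 0.5]])

print(icl(10, 2, -100.0, crisp) < icl(10, 2, -100.0, fuzzy))  # True
```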
affine_subspace_ - implements the affine_subspace class.
generate - generates synthetic data from a k-spaces model (an affine subspace or list of affine subspaces).
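The generative model being sampled can be sketched as latent coordinates along the subspace plus Gaussian noise in the complementary space (a toy sketch with made-up parameters, mirroring the affine_subspace attributes vectors, translation, latent_sigmas, and sigma — not the library's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical line in 3-D, mirroring affine_subspace attributes:
B = np.array([[1.0, 0.0, 0.0]])  # vectors: orthonormal basis (d x D)
t = np.array([0.0, 5.0, 5.0])    # translation
latent_sigmas = np.array([2.0])  # spread of the data along the subspace
sigma = 0.1                      # orthogonal noise standard deviation

n = 1000
# Latent coordinates along the subspace...
z = rng.normal(scale=latent_sigmas, size=(n, 1))
# ...plus Gaussian noise restricted to the complementary (orthogonal) space.
eps = rng.normal(scale=sigma, size=(n, 3))
eps -= (eps @ B.T) @ B  # remove any component lying in the subspace

points = t + z @ B + eps
print(points.shape)  # (1000, 3)
```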
plotting - visualization functions using matplotlib. The view_3D functions output static views of 3D data from different angles.
All functions have docstrings that can be displayed in a notebook with Shift + Tab. Here are some functions intended for general use:
def run_EM(points,
kd,
assignment = 'soft',
max_iter=50,
tol=5e-2,
initializations = 10,
verbose = False,
silent = False,
print_solution = False,
randomize_init = False,
batch_size = np.inf,
batch_replace = True,
print_ownerships = False,
multiprocess_spaces = False,
init_spaces = [],
fixed_spaces = [],
min_variance = 1e-10,
return_if_failed = True,
set_noise_equal = False,
DA = False,
beta_0 = 0.5,
anneal_rate = 1.2):
""" Runs EM with multiple initializations and selects the maximum likelihood one.
The first initialization uses kmeans to get centroids and then passes lines through those and the origin.
returns: spaces (list of affine subspaces), probabilities (N x K np array of P(space | point))
kd: 1 x k list containing dimensions (d) for subspaces. i.e. [1,1,1] or [0,2,1]
assignment: default "soft". Other options: "hard" and "closest".
fixed_spaces: list of dicts {'vec':[basis],'tr':translation} where basis vectors and translation are all lists of length D
init_spaces: list of affine_subspaces (see affine_subspace_.py) to initialize with.
max_iter: maximum number of EM iterations
tol: default 0.05. tolerance for determining EM convergence.
initializations: default 10. Number of EM initializations to do.
verbose: default False (recommended). Optionally can be set to True to print out
information about spaces in each EM iteration as EM converges.
print_solution: default False. Print out the spaces. You can also print the spaces out with print(space),
and the space's principal axes, translation, latent space standard deviations, complementary space
noise standard deviation, and total ownership of points (prior) will be displayed.
multiprocess_spaces: default False. Process each space in parallel in the M step of EM. Useful if fitting
many spaces, but if doing many separate kspaces runs (i.e. running kspaces on 100 different pairs of genes)
it will be faster to write a wrapper to run kspaces itself in parallel as multiprocessing in python has overhead.
batch_size: default is np.inf (no batch; use full dataset). Batch size for EM iterations.
batch_replace: default is True. Sample with/without replacement if using batches.
min_variance: default is 1e-10. Minimum variance enforced to prevent singular covariance matrices in
"soft" and "hard" assignment mode.
return_if_failed: default True. Returns [spaces, probabilities] for last EM run if True. Returns [[],[]] if False.
set_noise_equal: default False. If true, enforces equal sigma_noise for each space after each M step.
DA: default False. if True, use deterministic annealing EM (Naonori Ueda and Ryohei Nakano.
Deterministic annealing EM algorithm. Neural Networks, 11(2):271–282, March 1998.) Will take longer to run.
higher beta_0 and higher anneal_rate lead to faster convergence.
beta_0: default 0.5. ignored if DA = False. Must be between 0 and 1. Inverse to initial annealing
"temperature." Lower beta_0 is "hotter"
anneal_rate: default 1.2. ignored if DA = False. Must be > 1. Factor to cool down temperature
per round (multiplied to beta_0 successively to reach beta = 1).
"""def E_step(points, spaces,assignment = 'hard',verbose = False):
""" caculates "ownership" of points by each space based on the probabilities of those spaces generating those points
P(space_i | point) = P(point | space_i)*P(space_i)/ sum over k spaces ( P(point | space_j) * P(space_j))
points: N x D np array (or less than N if EM is in batch mode)
spaces: list of affine subspaces
assignment: "hard" "closest" or "soft"
verbose: bool
returns: N x K matrix of probabilities P(space | point)"""
def fit_single_space(points,d, min_variance = 1e-10):
""" fits a single space with PCA
points: N x D array
d: int. dimension of space to fit
min_variance: float. minimum variance added if variance along a dimension is zero to avoid a singular covariance matrix
returns: affine_subspace"""
def run_EM_parallelized(points,
kd,
assignment='soft',
max_iter=50,
tol=5e-2,
initializations=10,
n_jobs=None,
randomize_init=False,
batch_size=np.inf,
batch_replace=True,
init_spaces=[],
fixed_spaces=[],
min_variance=1e-10,
set_noise_equal=False,
DA=False,
beta_0=0.5,
anneal_rate=1.2,
max_additional_init = 0,
print_seeds = False):
""" Multiprocessing for run_EM with joblib. Each worker calls `run_EM` separately, and any special initializations (i.e., non-random first initialization or init_spaces are passed only to one worker. Each worker is given a separate random seed. Certain arguments for `run_EM` are hard-coded, and the `multiprocess_spaces` argument is hard-coded to False to avoid oversubscription of resources. Memory is NOT shared by jobs, so very large datasets will be copied n_jobs times.
returns: spaces (list of affine subspaces), probabilities (N x K np array of P(space | point))
kd: 1 x k list containing dimensions (d) for subspaces. i.e. [1,1,1] or [0,2,1]
assignment: default "soft". Other options: "hard" and "closest".
fixed_spaces: list of dicts {'vec':[basis],'tr':translation} where basis vectors and translation are all lists of length D
init_spaces: list of affine_subspaces (see affine_subspace_.py) to initialize with.
max_iter: maximum number of EM iterations
tol: default 0.05. tolerance for determining EM convergence.
initializations: default 10. Number of EM initializations to do.
n_jobs: number of parallel processes requested. If None, n_jobs = min(4, initializations)
batch_size: default is np.inf (no batch; use full dataset). Batch size for EM iterations.
batch_replace: default is True. Sample with/without replacement if using batches.
min_variance: default is 1e-10. Minimum variance enforced to prevent singular covariance matrices in "soft" and "hard" assignment mode.
set_noise_equal: default False. If true, enforces equal sigma_noise for each space after each M step.
DA: default False. if True, use deterministic annealing EM (Naonori Ueda and Ryohei Nakano. Deterministic annealing EM algorithm. Neural Networks, 11(2):271–282, March 1998.) Will take longer to run. higher beta_0 and higher anneal_rate lead to faster convergence.
beta_0: default 0.5. ignored if DA = False. Must be between 0 and 1. Inverse to initial annealing "temperature." Lower beta_0 is "hotter"
anneal_rate: default 1.2. ignored if DA = False. Must be > 1. Factor to cool down temperature by per round (multiplied to beta_0 successively to reach beta = 1).
max_additional_init: default 0. (NOTE THIS IS DIFFERENT FROM run_EM). maximum number of additional initializations to attempt for each `run_EM` worker if all of the requested initializations fail
print_seeds: bool. default False. Whether to print the random seeds given to each worker `run_EM` run.
"""class affine_subspace:
def __init__(self,vectors, translation, sigma, latent_sigmas, prior):
""" initializes affine subspace
vectors: d x D list of lists
translation: list of length D
sigma: nonnegative scalar
latent_sigmas: list of length d
prior: scalar from 0 to 1
"""
self.vectors = self.vectors_to_orthonormal_basis(np.array(vectors)) #associated vector subspace is spanned by these basis vectors
self.translation = np.array(translation) #translation vector for "origin" of the subspace
self.sigma = sigma #standard deviation of orthogonal noise averaged over dimensions of complementary space
self.latent_sigmas = np.array(latent_sigmas) #standard deviations for data along each eigenvector of the latent space
self.D = len(translation) #dimensionality of ambient space
self.d = len(vectors) #dimensionality of subspace
self.prior = prior #mixture component weight for this subspace. All subspaces' priors add up to 1
def transform(self, points):
"""Alias for self.displacement to match the call for sklearn's pca.transform.
points: N x D array
returns: N x d array"""
return self.displacement(points)
def projection(self, points):
"""project point onto subspace. The subspace is low dimensional but the points are still in high dimensional space.
points: N x D array
returns: N x D array"""
def probability(self, points, log=False, true_val = False):
"""By default returns a value proportional to P(point | self) and ignores 1/sqrt(2 pi) term in normal pdf but can be made exact by multiplying result by 1/(2 pi)^ D/2 with true_val = True.
points: N x D array
log: bool. default False. Whether to return log probability or probability
true_val: bool. default False. If True, multiply the constant 1/(2 pi)^ D/2 to get the exact probability rather than a proportional value.
returns: (log) probability"""
def model_selection(points, model, null, print_solution = False, eq_noise = False, test = 'BIC'):
"""Perform model selection with BIC or ICL. ICL penalizes BIC with the entropy of cluster assignments. Accepts a list of affine_subspaces or a single affine_subspace for model and null, but whether a list or single space is passed in, it should be a kspaces model because likelihoods need to be calculated. In other words, if the list is not a full model fit by kspaces, affine_subspace.prior should add up to 1 over the list or should be 1 for a single space.
points: N x D array.
model: list of affine subspaces or single affine subspace.
null: list of affine subspaces or single affine subspace.
eq_noise: bool. should be True if assignment is "closest" or set_noise_equal == True
test: 'BIC' or 'ICL'. default 'BIC'. if ICL, assignments will be computed with a soft-assignment E_step as ICL with hard assignment is just BIC.
returns: 'model' or 'null'.
"""def ensure_vector_directions(spaces, inplace = True):
"""checks direction of each basis vector and flips them (reverses sign) as needed to point in a positive direction.
Specifically, checks whether dot product of each basis vector with <1,1,1,...,1> is positive. Flips if negative.
spaces: list of subspace objects
inplace: bool. default True. Whether to modify the spaces in place or not
returns: spaces (with basis vectors flipped as needed)
"""
def write_spaces(spaces, file):
"""writes spaces to a csv file
spaces: list of subspace objects
file: filename to write to"""
def read_spaces(file):
"""reads in a file written by 'write_spaces'
file: filename
returns: list of subspace objects"""
def total_log_likelihood(points, spaces, print_solution = False):
"""Calculate the Gaussian likelihood of the points given the lines using log sum exp.
points: N x D array
spaces: list of affine subspaces
print_solution: whether to print the spaces
returns: total log likelihood
"""def get_BIC(df,num_points,log_likelihood):
""" returns Bayesian Information Criterion
df: degrees of freedom. Can be obtained with 'get_df'
num_points: N
log_likelihood: observed log likelihood of data. Can be obtained with 'total_log_likelihood'
returns: BIC
"""def get_ICL(probs, points, spaces, eq_noise):
"""returns Integrated Completed Likelihood C. Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7):719–725, July 2000.
probs: N x k array of assignment probabilities
points: N x D array of points
spaces: list of affine_subspace objects
eq_noise: True/False, used to determine degrees of freedom. Was eq_noise set to True to fit the model?
returns: ICL """
def generate(spaces,size = 1, seed = None):
"""generates points given a list of affine_subspaces
spaces: list of affine_subspaces (or a single affine subspace)
size: default 1. number of data points
seed: default None. random seed for numpy
returns size x D array of points, where D is specified by the spaces.
"""def view_3D_pca_mpl(points,
aspect = 'equal',
subtypes = [],
title = '',
axis_labels = ['PC 1', 'PC 2', 'PC 3'],
color_dict = {},
cmap_dict = {},
s_points = 2,
alpha_points = 1,
elevation = [70,0],
azimuth = [20,90,160],
legend = True,
legend_loc = 'upper left',
save = '',
show = True,
markerscale = 5,
fontsize = 10,
return_figure = False):
"""**will apply PCA to data**
Plots points together in 3D PCA space with the projections of spaces (if space.d <= 3).
Optionally colors points by subtype if the subtypes argument and either color_dict or cmap_dict arguments are used.
points: N x D array
aspect: passed to ax.set_aspect
subtypes: length N list or array of subtype labels for points
title: passed to fig.suptitle
axis_labels: default ['PC 1', 'PC 2', 'PC 3']
color_dict: dict. key, val pair is subtype_name:color. Overridden by cmap_dict. Default of {} will plot black, while an
incomplete dictionary will color non-included subtypes gray.
cmap_dict: dict. key, val pair is subtype_name:(cmap_name, values_for_cmap). Default of {} will plot black.
alpha_points: passed to ax.scatter as alpha
elevation = list that sets viewing elevation(s). Elements passed to ax.view_init as elev
azimuth = list that sets viewing azimuth(s). Elements passed to ax.view_init as azim
legend = whether to plot legend.
legend_loc = passed to ax.legend.
save = filename to save plot to. default '' does not save.
show = if True, triggers plt.show()
markerscale = passed to ax.legend
fontsize = passed to ax.legend, fig.suptitle, axis labels
return_figure: if True, return fig, axs"""
def view_3D_pretransformed_mpl(points,
aspect = 'equal',
subtypes = [],
title = '',
axis_labels = ['Basis 1', 'Basis 2', 'Basis 3'],
color_dict = {},
cmap_dict = {},
s_points = 2,
scale_by_distance = 1,
distances = None,
alpha_points = 1,
elevation = [70,0],
azimuth = [20,90,160],
legend = True,
legend_loc = 'upper left',
save = '',
show = True,
markerscale = 5,
fontsize = 10,
return_figure = False):
"""**assumes pre-transformed data into a 3D space**
Plots points in pretransformed 3D space with matplotlib. affine_subspace.transform can be used to do the dimension reduction.
Optionally rescales points to shrink with distance according to s_points / ( log1p(distances[i]) ^ scale_by_distance ).
affine_subspace.orthogonal_distance can be used to obtain the distances.
Optionally colors points by subtype if the subtypes argument and either color_dict or cmap_dict arguments are used.
points: N x 3 array
aspect: passed to ax.set_aspect
subtypes: length N list or array of subtype labels for points
title: passed to fig.suptitle
axis_labels: default ['Basis 1', 'Basis 2', 'Basis 3']
color_dict: dict. key, val pair is subtype_name:color. Overridden by cmap_dict. Default of {} will plot black, while an
incomplete dictionary will color non-included subtypes gray.
cmap_dict: dict. key, val pair is subtype_name:(cmap_name, values_for_cmap). Default of {} will plot black.
s_points: default 2. size for points before rescaling by distance.
scale_by_distance: default 1 means point size shrinks proportionally to logarithmic distance from projection.
size for point_i is s_points / ( log1p(distances[i]) ^ scale_by_distance )
distances: If None, all points will be plotted the same size. Else, distances is used to rescale points to reflect distance from the projection.
alpha_points: passed to ax.scatter as alpha
elevation = list that sets viewing elevation(s). Elements passed to ax.view_init as elev
azimuth = list that sets viewing azimuth(s). Elements passed to ax.view_init as azim
legend = whether to plot legend.
legend_loc = passed to ax.legend.
save = filename to save plot to. default '' does not save.
show = if True, triggers plt.show()
markerscale = passed to ax.legend
fontsize = passed to ax.legend, fig.suptitle, axis labels
return_figure: if True, return fig, axs
"""
def project_space(s, vis):
"""projects k-spaces subspace into the 3D subspace of PCA or a 3D k-spaces subspace
s: a k-spaces subspace object
vis: a k-spaces subspace object or a sklearn.decomposition.PCA instance
returns: affine_subspace object for the projection of 's' onto 'vis'"""