ggml_ot.data.from_synth_gmm

ggml_ot.data.from_synth_gmm(*, representation='cells', adata=False, gmm_key=None, t=4, n_dim=10, n_patients=6, n_samples=250, signal_mass_ratio=0.2, n_modes=10, signal_means_offset=12.0, signal_means_jitter=0.75, noise_means_offset=3.0, noise_means_jitter=0.75, noise_subspace_rank=2, signal_weight_concentration=None, noise_weight_concentration=None, signal_mean_shift=1.0, signal_cov_scale=1.2, signal_anisotropy=12.0, cov_rotation_jitter=10.0, cov_scale_jitter=0.15, global_rotation=30.0, random_seed=42)

Create a GGML dataset from the synthetic GMM generator.

Wraps synth_gmm() and returns a dataset that can be used directly with training and evaluation functions.

Parameters:
representation Literal['cells', 'gmm'] (default: 'cells')

How patient distributions are represented in the dataset. "cells" (default) samples n_samples cells per patient and stores them as empirical point clouds. "gmm" stores the analytical per-patient GMM component parameters directly (means, covariances, weights).

adata bool (default: False)

If True, wrap the dataset in an AnnData_TripletDataset backed by an AnnData object. Must be True whenever gmm_key is set.

gmm_key Optional[str] (default: None)

Key under which the analytical raw-space ground-truth GMM is persisted, as dataset.adata.uns[gmm_key]. Requires adata=True; setting gmm_key without it raises a ValueError.

t int (default: 4)

Number of triplets sampled per anchor distribution.

**kwargs

All remaining keyword arguments (n_dim, n_patients, n_samples, etc.) are forwarded to synth_gmm(). See its documentation for details.
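To make the two representation modes concrete, here is a minimal sketch that does not use ggml_ot at all. It assumes a toy single-patient mixture with diagonal covariances. The names toy_gmm and sample_cells are hypothetical and the numbers are arbitrary. It only illustrates the difference between keeping the mixture parameters (the "gmm" idea) and keeping points sampled from them (the "cells" idea), not how synth_gmm() actually generates data.

```python
import random

# Toy stand-in for one patient's mixture: parameters stored directly,
# which is the idea behind representation="gmm". Diagonal covariances
# (per-dimension variances) are an assumption to keep this stdlib-only.
toy_gmm = {
    "weights": [0.8, 0.2],
    "means": [[0.0, 0.0], [12.0, 12.0]],
    "variances": [[1.0, 1.0], [1.5, 1.5]],
}


def sample_cells(gmm, n_samples, seed=42):
    """Draw an empirical point cloud from the mixture (the "cells" idea)."""
    rng = random.Random(seed)
    n_components = len(gmm["weights"])
    cells = []
    for _ in range(n_samples):
        # Pick a component according to the mixture weights.
        k = rng.choices(range(n_components), weights=gmm["weights"])[0]
        # Sample each coordinate independently from that component.
        point = [
            rng.gauss(mu, var ** 0.5)
            for mu, var in zip(gmm["means"][k], gmm["variances"][k])
        ]
        cells.append(point)
    return cells


cells = sample_cells(toy_gmm, n_samples=250)
print(len(cells), len(cells[0]))  # number of points, coordinates per point
```

The point cloud grows with n_samples, while the parametric form stays the same size regardless of how many cells would be drawn.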

Return type:

TripletDataset

Returns:

A TripletDataset, or an AnnData_TripletDataset when adata=True, ready for use with ggml_ot.train() or ggml_ot.train_gmm().
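A possible way to obtain both dataset flavours is sketched below. Every keyword argument comes from the signature documented on this page, but the key name "gmm_truth" is a hypothetical choice, and the import path is assumed to mirror the documented module path. The library call is guarded so the snippet still runs where ggml_ot is not installed.

```python
import importlib.util

# Keyword arguments below all appear in the documented signature.
cells_kwargs = dict(representation="cells", t=4, n_patients=6,
                    n_samples=250, random_seed=42)

# "gmm_truth" is a hypothetical key name chosen for this example.
adata_kwargs = dict(representation="cells", adata=True,
                    gmm_key="gmm_truth", random_seed=42)

# Only call into the library if it is installed in this environment.
if importlib.util.find_spec("ggml_ot") is not None:
    from ggml_ot.data import from_synth_gmm

    cells_ds = from_synth_gmm(**cells_kwargs)   # plain TripletDataset
    adata_ds = from_synth_gmm(**adata_kwargs)   # AnnData-backed dataset
    # Per the parameter description, the ground-truth GMM is stored here.
    truth = adata_ds.adata.uns[adata_kwargs["gmm_key"]]
```

Fixing random_seed in both calls keeps the generated data reproducible across runs.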

Raises:

ValueError – If representation is not "cells" or "gmm", or if gmm_key is set without adata=True.
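The two error conditions can be restated as a small standalone check. This is a hypothetical sketch, not the library's implementation: check_args is an invented helper, and treating "gmm_key is set" as "gmm_key is not None" is an assumption.

```python
def check_args(representation="cells", adata=False, gmm_key=None):
    """Hypothetical restatement of the documented ValueError conditions."""
    if representation not in ("cells", "gmm"):
        raise ValueError(
            f"representation must be 'cells' or 'gmm', got {representation!r}"
        )
    # Assumption: "set" means any value other than None.
    if gmm_key is not None and not adata:
        raise ValueError("gmm_key requires adata=True")


# Combinations the documentation describes as valid.
check_args(representation="gmm")
check_args(adata=True, gmm_key="gmm_truth")

# Combinations the documentation describes as errors.
for bad in ({"representation": "points"}, {"gmm_key": "gmm_truth"}):
    try:
        check_args(**bad)
    except ValueError as err:
        print("rejected:", err)
```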