ggml_ot.from_anndata
- ggml_ot.from_anndata(adata, patient_col='sample', label_col='patient_group', n_cells=250, n_triplets=3, use_rep=None, group_by=None, gmm_key=None, sample_gmm=False, gmm_weights_source='auto')
Dataset to train GGML based on AnnData.
This subclass of TripletDataset formats triplets of patient-level cell distributions from an AnnData object. The triplets capture the relative relationship between patient groups (e.g. disease state) that GGML aims to learn.
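To make the idea of a patient-level cell distribution concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the helper `sample_patient_distributions`, the toy `cells` list, sampling without replacement, and capping at the number of available cells are all assumptions made for illustration. The `n_cells` argument mirrors the parameter of the same name in the signature above.

```python
import random
from collections import defaultdict


def sample_patient_distributions(cells, n_cells=250, seed=0):
    """Hypothetical helper: group cells by patient and subsample each group.

    cells: list of (patient_id, feature_vector) pairs.
    Returns {patient_id: (support, weights)} with uniform weights, i.e. an
    empirical distribution over the sampled cells.
    """
    rng = random.Random(seed)
    by_patient = defaultdict(list)
    for patient_id, vector in cells:
        by_patient[patient_id].append(vector)

    distributions = {}
    for patient_id, vectors in by_patient.items():
        # Assumption: sample without replacement and cap at the cells available.
        k = min(n_cells, len(vectors))
        support = rng.sample(vectors, k)
        weights = [1.0 / k] * k  # every sampled cell carries equal mass
        distributions[patient_id] = (support, weights)
    return distributions


# Toy data: three cells for patient "p1", one cell for patient "p2".
cells = [
    ("p1", [0.0, 1.0]),
    ("p1", [1.0, 0.0]),
    ("p1", [0.5, 0.5]),
    ("p2", [2.0, 2.0]),
]
dists = sample_patient_distributions(cells, n_cells=2)
```

Each patient ends up as a set of support points with weights that sum to one, which is the kind of object an optimal-transport distance compares.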
By default, it represents the cells of a patient as an empirical distribution in the gene space of the AnnData (.X). Using the use_rep and/or group_by parameters, you can reduce the distribution to cell subtypes only and/or to low-dimensional gene representations.
This class exposes the dataset to the standardized interfaces used by ggml_ot.train(), ggml_ot.tune(), ggml_ot.test() and ggml_ot.train_test().
- Parameters:
- adata str | anndata.AnnData
The AnnData object.
- patient_col str, optional
Column in adata.obs that identifies the patient / sample (default: "sample").
- label_col str, optional
Column in adata.obs that contains the patient group, e.g., disease state (default: "patient_group").
- n_cells int, optional
Number of cells to sample per patient (default: 250).
- n_triplets int, optional
Number of generated triplets for each patient to capture the relative relationship of the patient group (default: 3). This will lead to n_patients * n_triplets * n_labels triplets being generated.
- group_by None | str, optional
Optional column in adata.obs to group cells and learn a ground metric between cell groups instead (default: None).
- use_rep None | str, optional
If provided, uses adata.obsm[use_rep] as the cell embedding representation; otherwise the raw .X matrix is used (default: None).
- gmm_key None | str, optional
If provided, loads a previously fitted GMM representation from adata.uns[gmm_key] (default: None).
- sample_gmm bool, optional
If True, samples empirical supports from fitted GMM mixtures instead of using parametric supports directly (default: False).
- gmm_weights_source {"auto", "stored", "components"}, optional
Controls how per-distribution GMM weights are reconstructed when gmm_key is provided. "auto" tries stored weights first, then hard assignments predicted from the stored GMM parameters (default: "auto").
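The triplet count stated for n_triplets can be checked with a few lines of arithmetic. The sketch below only evaluates the formula n_patients * n_triplets * n_labels from the parameter description; it does not reproduce how the library selects the members of each triplet. The helper `expected_triplet_count` and the toy `patient_labels` mapping are hypothetical.

```python
def expected_triplet_count(n_patients, n_labels, n_triplets=3):
    """Hypothetical helper: evaluate the documented triplet-count formula."""
    return n_patients * n_triplets * n_labels


# Toy cohort: four patients spread over two patient groups.
patient_labels = {
    "p1": "healthy",
    "p2": "healthy",
    "p3": "disease",
    "p4": "disease",
}

n_patients = len(patient_labels)
n_labels = len(set(patient_labels.values()))

total = expected_triplet_count(n_patients, n_labels)  # default n_triplets=3
print(total)  # 4 * 3 * 2 = 24
```

Because the count grows with the product of patients, labels, and n_triplets, a small n_triplets keeps the dataset manageable for large cohorts.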
- Return type:
Notes
Following scverse conventions, this class modifies the AnnData object in-place during dataset construction and training. In particular:
- adata.uns["ggml_params"]: stores dataset construction parameters.
- adata.uns["W_ggml"]: the learned linear map after training.
- adata.varm["W_ggml"]: gene-space loadings of the learned ground metric.
- adata.obsm["X_ggml"]: cells projected into the learned gene subspace.
If you need an unmodified copy, call adata.copy() before constructing the dataset.
See also
ggml_ot.data.generic.TripletDataset: base class providing triplet creation and dataset API.