scbulkde.pp.pseudobulk
- scbulkde.pp.pseudobulk(adata, group_key, query, reference='rest', *, replicate_key=None, min_cells=50, min_fraction=0.2, min_coverage=0.75, categorical_covariates=None, continuous_covariates=None, continuous_aggregation='mean', layer=None, layer_aggregation='sum', qualify_strategy='or', covariate_strategy='sequence_order', resolve_conflicts=True)
Perform pseudobulking on single-cell data to aggregate expression across cells.
This function aggregates single-cell expression data into pseudobulk samples by combining cells from specified groups (query vs. reference) across biological replicates, if present, and optional covariates. It creates a design matrix suitable for downstream differential expression analysis while filtering samples based on quality control metrics.
- Parameters:
adata (AnnData) – Annotated data matrix containing single-cell expression data.
group_key (str) – Column name in adata.obs that defines the cell groups for comparison (e.g., ‘cell_type’, ‘condition’, ‘cluster’).
query (str | Sequence[str]) – Cell group(s) to be used as the query/test condition. Must be present in adata.obs[group_key].
reference (str | Sequence[str] (default: 'rest')) – Cell group(s) to be used as the reference/control condition. If “rest”, all groups not in query are used as reference. Must be present in adata.obs[group_key].
replicate_key (str | None (default: None)) – Column name in adata.obs defining biological replicates (e.g., ‘sample_id’, ‘donor’, ‘batch’). Required for creating multiple pseudobulk samples per condition, but never included in the design. If None, cells are not stratified by replicate.
min_cells (int | None (default: 50)) – Minimum number of cells required per pseudobulk sample. Samples with fewer cells are excluded from analysis.
min_fraction (float | None (default: 0.2)) – Minimum fraction of cells of the condition in that pseudobulk sample for it to be considered valid. Samples with a lower fraction are excluded from analysis.
min_coverage (float | None (default: 0.75)) – Minimum coverage provided by all valid samples per condition. Conditions with lower coverage are collapsed. Range: [0.0, 1.0].
categorical_covariates (Sequence[str] | None (default: None)) – Column names in adata.obs representing categorical covariates to include in the design (e.g., [‘experiment’, ‘chemistry’, ‘batch’]). These are added as stratification factors along with replicate_key.
continuous_covariates (Sequence[str] | None (default: None)) – Column names in adata.obs representing continuous covariates to include in the design (e.g., [‘cellcycle’, ‘pct_mito’]). These are aggregated per pseudobulk sample.
continuous_aggregation (Union[Literal['mean', 'sum', 'median'], Callable, None] (default: 'mean')) – Method to aggregate continuous covariates across cells within each pseudobulk sample. Can be a string specifying a standard aggregation or a custom callable.
layer (str | None (default: None)) – Layer in adata.layers to use for aggregation. If None, uses adata.X.
layer_aggregation (Literal['sum', 'mean'] (default: 'sum')) – Method to aggregate expression values across cells.
qualify_strategy (Literal['and', 'or'] (default: 'or')) – Strategy for sample qualification when multiple criteria are specified:
  - “and”: A sample candidate must pass both the min_cells AND the min_fraction threshold
  - “or”: A sample candidate must pass either the min_cells OR the min_fraction threshold
covariate_strategy (Literal['sequence_order', 'most_levels'] (default: 'sequence_order')) – Strategy for ordering covariates in the design formula when conflicts arise:
  - “sequence_order”: Drop covariates from back to front in the provided list
  - “most_levels”: Prioritize covariates with more unique levels
resolve_conflicts (bool (default: True)) – If True, automatically resolve confounded covariates by iteratively removing them to ensure a full-rank design matrix. If False, raise an error when confounding is detected.
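The confounding that resolve_conflicts guards against can be illustrated with a small rank check. The sketch below is not scbulkde's implementation: it assumes a toy sample table, plain dummy coding with an intercept, and a hypothetical helper drop_confounded that mimics the “sequence_order” idea by dropping covariates from the back of the list until the design has full column rank.

```python
import numpy as np
import pandas as pd


def is_full_rank(table: pd.DataFrame, factors: list[str]) -> bool:
    """Return True if the dummy-coded design (with intercept) has full column rank."""
    design = pd.get_dummies(table[factors], drop_first=True, dtype=float)
    design.insert(0, "Intercept", 1.0)
    return np.linalg.matrix_rank(design.to_numpy()) == design.shape[1]


def drop_confounded(table, condition, covariates):
    """Hypothetical helper: drop covariates back to front until the design is full rank."""
    kept = list(covariates)
    while kept and not is_full_rank(table, [condition] + kept):
        kept.pop()
    return kept


# Toy sample table (illustrative values, not produced by scbulkde).
samples = pd.DataFrame(
    {
        "condition": ["query", "query", "reference", "reference"],
        "chemistry": ["v2", "v3", "v2", "v3"],
        "site": ["A", "A", "B", "B"],  # perfectly confounded with condition
    }
)

kept = drop_confounded(samples, "condition", ["chemistry", "site"])
print(kept)
```

Here site separates the samples exactly as condition does, so its dummy column duplicates the condition column and the design loses rank until site is dropped.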
- Returns:
PseudobulkResult – Container object with the following attributes:
- adata_sub (ad.AnnData) – Subset of the input AnnData containing only query and reference cells.
- pb_counts (pd.DataFrame) – Aggregated pseudobulk expression matrix (samples × genes). Empty if no valid strata exist (collapsed case).
- grouped (pd.api.typing.DataFrameGroupBy) – Grouped observation data for internal use.
- sample_table (pd.DataFrame) – Metadata for each pseudobulk sample, including covariates, cell counts, and quality metrics.
- design_matrix (pd.DataFrame) – Design matrix for statistical testing, created from design_formula.
- design_formula (str) – Patsy-style formula describing the statistical model.
- group_key (str) – Original group key parameter.
- group_key_internal (str) – Internal column name for query/reference labels (‘psbulk_condition’).
- query (str or list) – Query group(s) used.
- reference (str or list) – Reference group(s) used.
- strata (list of str) – Final stratification factors used (may be a subset of those requested, due to conflict resolution). An empty list indicates a collapsed pseudobulk.
- collapsed (bool) – True if insufficient replicates exist and the data were collapsed across all cells per condition.
- n_samples (int) – Number of pseudobulk samples created.
Warning
- If the min_cells, min_fraction, or min_coverage thresholds are not met, samples or entire conditions may be excluded or collapsed.
- Confounded covariates are automatically removed when resolve_conflicts=True.
- An empty pb_counts (collapsed case) indicates that no valid independent samples exist, and differential expression testing may require special handling.
Examples
n.a.
Notes
The pseudobulking approach aggregates cells from the same biological replicate and condition, reducing the computational burden and addressing the issue of pseudoreplication in single-cell data. This enables the use of standard bulk RNA-seq differential expression methods while accounting for biological variability.
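The aggregation described above can be sketched with pandas alone. This is a minimal illustration, not the library's code: the toy count matrix, the obs table, and the column names condition and donor are assumptions made for the example, and the sum mirrors the default layer_aggregation='sum'.

```python
import numpy as np
import pandas as pd

# Toy data: 6 cells x 2 genes, with per-cell condition and donor labels.
counts = pd.DataFrame(
    np.array([[1, 0], [2, 1], [0, 3], [4, 1], [1, 1], [0, 2]]),
    columns=["geneA", "geneB"],
)
obs = pd.DataFrame(
    {
        "condition": ["query", "query", "query", "reference", "reference", "reference"],
        "donor": ["d1", "d1", "d2", "d1", "d2", "d2"],
    }
)

# One pseudobulk sample per (condition, donor): sum raw counts across its cells.
pb_counts = counts.groupby([obs["condition"], obs["donor"]]).sum()

# Number of cells that went into each pseudobulk sample.
n_cells = obs.groupby(["condition", "donor"]).size().rename("n_cells")

print(pb_counts)
print(n_cells)
```

Each row of pb_counts is then treated as one independent sample, so the replicate (here donor), rather than the individual cell, becomes the unit of statistical testing.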
When collapsed=True, the result contains only aggregated condition-level information without independent replicates. In that case, use the tl.de function with a fallback strategy ('pseudoreplicates' or 'single_cell').
The function automatically:
- Filters cells to only the query and reference groups
- Validates stratification factors (replicates and covariates)
- Removes samples not meeting quality thresholds
- Resolves confounded covariates to ensure a full-rank design
- Creates both the count matrix and the metadata for downstream analysis
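The threshold step in the list above can be pictured roughly as follows. This sketch rests on assumptions rather than on the library's source: fraction is taken to be a sample's share of its condition's cells, coverage is taken to be the share of a condition's cells that sit in valid samples, and qualify and the toy sample_table are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical sample table: cell counts per candidate pseudobulk sample.
sample_table = pd.DataFrame(
    {
        "condition": ["query", "query", "query", "reference", "reference"],
        "donor": ["d1", "d2", "d3", "d1", "d2"],
        "n_cells": [120, 40, 10, 300, 30],
    }
)


def qualify(table, min_cells=50, min_fraction=0.2, min_coverage=0.75, strategy="or"):
    """Flag valid samples and check per-condition coverage (assumed definitions)."""
    out = table.copy()
    total = out.groupby("condition")["n_cells"].transform("sum")
    out["fraction"] = out["n_cells"] / total
    enough_cells = out["n_cells"] >= min_cells
    enough_fraction = out["fraction"] >= min_fraction
    if strategy == "and":
        out["valid"] = enough_cells & enough_fraction
    else:
        out["valid"] = enough_cells | enough_fraction
    # Coverage: share of a condition's cells that sit in valid samples.
    covered = (out["n_cells"] * out["valid"]).groupby(out["condition"]).sum()
    coverage = covered / out.groupby("condition")["n_cells"].sum()
    return out, coverage >= min_coverage


flagged, condition_ok = qualify(sample_table, strategy="or")
print(flagged)
print(condition_ok)
```

Under these assumptions, switching to strategy="and" invalidates the 40-cell query sample, which pulls the query coverage below 0.75; in the terms used on this page, that condition would then be collapsed.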