scbulkde.pp.pseudobulk


scbulkde.pp.pseudobulk(adata, group_key, query, reference='rest', *, replicate_key=None, min_cells=50, min_fraction=0.2, min_coverage=0.75, categorical_covariates=None, continuous_covariates=None, continuous_aggregation='mean', layer=None, layer_aggregation='sum', qualify_strategy='or', covariate_strategy='sequence_order', resolve_conflicts=True)

Perform pseudobulking on single-cell data to aggregate expression across cells.

This function aggregates single-cell expression data into pseudobulk samples by combining cells from specified groups (query vs. reference) across biological replicates, if present, and optional covariates. It creates a design matrix suitable for downstream differential expression analysis while filtering samples based on quality control metrics.
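The aggregation step described above can be illustrated with a minimal, library-free sketch. It is not the package's implementation: the cell records, the `donor` and `condition` keys, and the helper `pseudobulk_sum` are hypothetical, and it only mirrors the idea of summing counts (as with layer_aggregation='sum') over cells that share a replicate and a condition.

```python
from collections import defaultdict

# Toy input: one record per cell (hypothetical labels and counts for 3 genes).
cells = [
    {"donor": "d1", "condition": "query", "counts": [1, 0, 3]},
    {"donor": "d1", "condition": "query", "counts": [2, 1, 0]},
    {"donor": "d1", "condition": "reference", "counts": [0, 4, 1]},
    {"donor": "d2", "condition": "query", "counts": [5, 0, 0]},
    {"donor": "d2", "condition": "reference", "counts": [1, 1, 1]},
    {"donor": "d2", "condition": "reference", "counts": [0, 2, 2]},
]


def pseudobulk_sum(cells):
    """Sum per-gene counts over all cells sharing (donor, condition)."""
    totals = {}
    n_cells = defaultdict(int)
    for cell in cells:
        key = (cell["donor"], cell["condition"])
        if key not in totals:
            totals[key] = [0] * len(cell["counts"])
        # Element-wise addition of this cell's counts to the running total.
        totals[key] = [a + b for a, b in zip(totals[key], cell["counts"])]
        n_cells[key] += 1
    return totals, dict(n_cells)


pb_counts, n_cells = pseudobulk_sum(cells)
for key in sorted(pb_counts):
    print(key, pb_counts[key], "cells:", n_cells[key])
```

Each key of `pb_counts` corresponds to one pseudobulk sample (one row of a samples × genes matrix), and `n_cells` is the kind of per-sample cell count that the quality thresholds below operate on.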

Parameters:
  • adata (AnnData) – Annotated data matrix containing single-cell expression data.

  • group_key (str) – Column name in adata.obs that defines the cell groups for comparison (e.g., ‘cell_type’, ‘condition’, ‘cluster’).

  • query (str | Sequence[str]) – Cell group(s) to be used as the query/test condition. Must be present in adata.obs[group_key].

  • reference (str | Sequence[str] (default: 'rest')) – Cell group(s) to be used as the reference/control condition. If “rest”, all groups not in query are used as reference; otherwise, the group(s) must be present in adata.obs[group_key].

  • replicate_key (str | None (default: None)) – Column name in adata.obs defining biological replicates (e.g., ‘sample_id’, ‘donor’, ‘batch’). Required for creating multiple pseudobulk samples per condition, but never included in the design. If None, cells are not stratified by replicate.

  • min_cells (int | None (default: 50)) – Minimum number of cells required per pseudobulk sample. Samples with fewer cells are excluded from analysis.

  • min_fraction (float | None (default: 0.2)) – Minimum fraction of the condition's cells that a pseudobulk sample must contain to be considered valid. Samples with a lower fraction are excluded from analysis.

  • min_coverage (float | None (default: 0.75)) – Minimum coverage provided by all valid samples per condition. Conditions with lower coverage are collapsed. Range: [0.0, 1.0].

  • categorical_covariates (Sequence[str] | None (default: None)) – Column names in adata.obs representing categorical covariates to include in the design (e.g., [‘experiment’, ‘chemistry’, ‘batch’]). These are added as stratification factors along with replicate_key.

  • continuous_covariates (Sequence[str] | None (default: None)) – Column names in adata.obs representing continuous covariates to include in the design (e.g., [‘cellcycle’, ‘pct_mito’]). These are aggregated per pseudobulk sample.

  • continuous_aggregation (Union[Literal['mean', 'sum', 'median'], Callable, None] (default: 'mean')) – Method to aggregate continuous covariates across cells within each pseudobulk sample. Can be a string specifying a standard aggregation or a custom callable.

  • layer (str | None (default: None)) – Layer in adata.layers to use for aggregation. If None, uses adata.X.

  • layer_aggregation (Literal['sum', 'mean'] (default: 'sum')) – Method to aggregate expression values across cells.

  • qualify_strategy (Literal['and', 'or'] (default: 'or')) – Strategy for sample qualification when multiple criteria are specified:
    - “and”: a sample candidate must pass both the min_cells AND min_fraction thresholds
    - “or”: a sample candidate must pass either the min_cells OR the min_fraction threshold

  • covariate_strategy (Literal['sequence_order', 'most_levels'] (default: 'sequence_order')) – Strategy for ordering covariates in the design formula when conflicts arise:
    - “sequence_order”: drop covariates from back to front in the provided list
    - “most_levels”: prioritize covariates with more unique levels

  • resolve_conflicts (bool (default: True)) – If True, automatically resolve confounded covariates by iteratively removing them to ensure a full-rank design matrix. If False, raise an error when confounding is detected.
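How min_cells, min_fraction, min_coverage, and qualify_strategy interact can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the helpers `qualifies` and `filter_condition` are hypothetical, a sample's fraction is assumed to be its cell count divided by all cells of that condition, and coverage is assumed to be the share of the condition's cells that sit in valid samples. The sketch also ignores the case where a threshold is None.

```python
def qualifies(n_cells, fraction, min_cells=50, min_fraction=0.2, strategy="or"):
    """Apply the two per-sample thresholds under the chosen strategy."""
    passes_cells = n_cells >= min_cells
    passes_fraction = fraction >= min_fraction
    if strategy == "and":
        return passes_cells and passes_fraction
    if strategy == "or":
        return passes_cells or passes_fraction
    raise ValueError(f"unknown strategy: {strategy!r}")


def filter_condition(sample_sizes, min_cells=50, min_fraction=0.2,
                     min_coverage=0.75, strategy="or"):
    """Return (valid samples, coverage, collapsed flag) for one condition."""
    total = sum(sample_sizes.values())
    valid = {
        name: n
        for name, n in sample_sizes.items()
        if qualifies(n, n / total, min_cells, min_fraction, strategy)
    }
    # Assumed definition: share of the condition's cells kept in valid samples.
    coverage = sum(valid.values()) / total
    return valid, coverage, coverage < min_coverage


# Hypothetical cell counts per replicate for a single condition.
sizes = {"d1": 200, "d2": 55, "d3": 45}

valid_or, coverage_or, collapsed_or = filter_condition(sizes, strategy="or")
valid_and, coverage_and, collapsed_and = filter_condition(sizes, strategy="and")
print("or: ", sorted(valid_or), round(coverage_or, 3), collapsed_or)
print("and:", sorted(valid_and), round(coverage_and, 3), collapsed_and)
```

In this toy case the stricter “and” strategy keeps only one replicate, so the assumed coverage drops below min_coverage and the condition would be collapsed, whereas “or” keeps two replicates and stays above the threshold.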

Returns:

PseudobulkResult – Container object with the following attributes:

  • adata_sub (ad.AnnData) – Subset of input AnnData containing only query and reference cells

  • pb_counts (pd.DataFrame) – Aggregated pseudobulk expression matrix (samples × genes). Empty if no valid strata exist (collapsed case)

  • grouped (pd.api.typing.DataFrameGroupBy) – Grouped observation data for internal use

  • sample_table (pd.DataFrame) – Metadata for each pseudobulk sample, including covariates, cell counts, and quality metrics

  • design_matrix (pd.DataFrame) – Design matrix for statistical testing, created from design_formula

  • design_formula (str) – Patsy-style formula describing the statistical model

  • group_key (str) – Original group key parameter

  • group_key_internal (str) – Internal column name for query/reference labels (‘psbulk_condition’)

  • query (str or list) – Query group(s) used

  • reference (str or list) – Reference group(s) used

  • strata (list of str) – Final stratification factors used (may be a subset of those requested due to conflict resolution). An empty list indicates a collapsed pseudobulk

  • collapsed (bool) – True if insufficient replicates exist and data was collapsed across all cells per condition

  • n_samples (int) – Number of pseudobulk samples created

Warning

  • If min_cells, min_fraction, or min_coverage thresholds are not met, samples or entire conditions may be excluded or collapsed

  • Confounded covariates are automatically removed when resolve_conflicts=True

  • Empty pb_counts (collapsed case) indicates no valid independent samples exist and differential expression testing may require special handling

Examples

n.a.

Notes

The pseudobulking approach aggregates cells from the same biological replicate and condition, reducing the computational burden and addressing the issue of pseudoreplication in single-cell data. This enables the use of standard bulk RNA-seq differential expression methods while accounting for biological variability.

When collapsed=True, the result contains only aggregated condition-level information without independent replicates. In that case one needs to use the tl.de function with fallback strategies ('pseudoreplicates' or 'single_cell').
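A small sketch of how a caller might branch on the collapsed flag is shown below. The helper `summarize` is hypothetical, and the two `SimpleNamespace` objects are stand-ins that carry only attribute names documented above; their values (including the formula string) are made up for illustration and do not come from a real run.

```python
from types import SimpleNamespace


def summarize(result):
    """Describe a result using only the attributes documented above."""
    if result.collapsed:
        # No independent replicates survived filtering.
        return "collapsed: use tl.de with 'pseudoreplicates' or 'single_cell'"
    strata = ", ".join(result.strata)
    return f"{result.n_samples} samples stratified by [{strata}]: {result.design_formula}"


# With the package installed, a result would come from a call such as
# (column and group names are hypothetical):
#   result = scbulkde.pp.pseudobulk(adata, group_key="cell_type",
#                                   query="B cell", replicate_key="donor")

# Stand-in objects with made-up values, used here instead of a real result.
ok = SimpleNamespace(collapsed=False, strata=["donor"], n_samples=6,
                     design_formula="~ psbulk_condition")
collapsed = SimpleNamespace(collapsed=True, strata=[])

print(summarize(ok))
print(summarize(collapsed))
```

Checking `collapsed` before any downstream testing keeps the fallback decision explicit rather than relying on an empty pb_counts being noticed later.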

The function automatically:

  • Filters cells to only query and reference groups

  • Validates stratification factors (replicates and covariates)

  • Removes samples not meeting quality thresholds

  • Resolves confounded covariates to ensure full-rank design

  • Creates both count matrix and metadata for downstream analysis
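The conflict-resolution step in the list above can be sketched with a toy full-rank check. This is an assumption-laden illustration rather than the package's algorithm: the helpers `matrix_rank`, `design_matrix`, and `resolve_sequence_order` are hypothetical, covariates are treatment-coded with the first level dropped, and the loop mimics the “sequence_order” strategy by removing covariates from the back of the list until the design is full rank.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination on exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column: it is linearly dependent
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(samples, factors):
    """Intercept plus drop-first dummy columns for each categorical factor."""
    columns = [[1] * len(samples)]
    for factor in factors:
        levels = sorted({s[factor] for s in samples})
        for level in levels[1:]:
            columns.append([int(s[factor] == level) for s in samples])
    return [list(row) for row in zip(*columns)]  # transpose to sample rows


def resolve_sequence_order(samples, condition, covariates):
    """Drop covariates from back to front until the design is full rank."""
    kept = list(covariates)
    while True:
        X = design_matrix(samples, [condition] + kept)
        if matrix_rank(X) == len(X[0]) or not kept:
            return kept
        kept.pop()  # remove the last remaining covariate and retry


# Hypothetical sample table: 'site' is perfectly confounded with the condition.
samples = [
    {"condition": "query", "chemistry": "v2", "site": "A"},
    {"condition": "query", "chemistry": "v3", "site": "A"},
    {"condition": "reference", "chemistry": "v2", "site": "B"},
    {"condition": "reference", "chemistry": "v3", "site": "B"},
]

kept = resolve_sequence_order(samples, "condition", ["chemistry", "site"])
print(kept)
```

Because 'site' duplicates the condition column, the four-column design has rank 3; dropping it from the back of the list leaves a full-rank design with 'chemistry' retained.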