scbulkde.tl.de

scbulkde.tl.de(data, group_key=None, query=None, reference='rest', replicate_key=None, min_cells=50, min_fraction=0.2, min_coverage=0.75, categorical_covariates=None, continuous_covariates=None, continuous_aggregation='mean', layer=None, layer_aggregation='sum', qualify_strategy='or', covariate_strategy='sequence_order', resolve_conflicts=True, n_repetitions=3, resampling_fraction=0.33, min_samples=3, alpha=0.05, alpha_fallback=0.05, correction_method='fdr_bh', engine='anova', engine_kwargs=None, fallback_strategy='pseudoreplicates', seed=42)

Perform differential expression analysis on pseudobulked single-cell data.

This function integrates pseudobulking and differential expression testing with fallback strategies when insufficient biological replicates exist.

Parameters:
  • data (AnnData | PseudobulkResult) – Input data. Either an AnnData object (will be pseudobulked automatically) or a pre-computed PseudobulkResult from pp.pseudobulk().

  • group_key (str | None (default: None)) – Column name in adata.obs that defines the cell groups for comparison (e.g., ‘cell_type’, ‘condition’, ‘cluster’).

  • query (str | Sequence[str] | None (default: None)) – Cell group(s) to be used as the query/test condition. Must be present in adata.obs[group_key].

  • reference (str | Sequence[str] (default: 'rest')) – Cell group(s) to be used as the reference/control condition. If “rest”, all groups not in query are used as reference. Must be present in adata.obs[group_key].

  • replicate_key (str | None (default: None)) – Column name in adata.obs defining biological replicates. Required for creating multiple pseudobulk samples per condition, but never included in the design. If None, cells are not stratified by replicate.

  • min_cells (int | None (default: 50)) – Minimum number of cells required per pseudobulk sample. Samples with fewer cells are excluded from analysis.

  • min_fraction (float | None (default: 0.2)) – Minimum fraction of cells of the condition in that pseudobulk sample for it to be considered valid. Samples with a lower fraction are excluded from analysis.

  • min_coverage (float | None (default: 0.75)) – Minimum coverage provided by all valid samples per condition. Conditions with lower coverage are collapsed. Range: [0.0, 1.0].

  • categorical_covariates (Sequence[str] | None (default: None)) – Column names in adata.obs representing categorical covariates to include in the design (e.g., [‘experiment’, ‘chemistry’, ‘batch’]). These are added as stratification factors along with replicate_key.

  • continuous_covariates (Sequence[str] | None (default: None)) – Column names in adata.obs representing continuous covariates to include in the design (e.g., [‘cellcycle’, ‘pct_mito’]). These are aggregated per pseudobulk sample.

  • continuous_aggregation (Union[Literal['mean', 'sum', 'median'], Callable, None] (default: 'mean')) – Method to aggregate continuous covariates across cells within each pseudobulk sample. Can be a string specifying a standard aggregation or a custom callable.

  • layer (str | None (default: None)) – Layer in adata.layers to use for aggregation. If None, uses adata.X.

  • layer_aggregation (Literal['sum', 'mean'] (default: 'sum')) – Method to aggregate expression values across cells.

  • qualify_strategy (Literal['and', 'or'] (default: 'or')) – Strategy for sample qualification when multiple criteria are specified:

    • “and”: A sample candidate must pass both the min_cells AND the min_fraction threshold

    • “or”: A sample candidate must pass either the min_cells OR the min_fraction threshold

  • covariate_strategy (Literal['sequence_order', 'most_levels'] (default: 'sequence_order')) – Strategy for ordering covariates in the design formula when conflicts arise:

    • “sequence_order”: Drop covariates from back to front in the provided list

    • “most_levels”: Prioritize covariates with more unique levels

  • resolve_conflicts (bool (default: True)) – If True, automatically resolve confounded covariates by iteratively removing them to ensure a full-rank design matrix. If False, raise an error when confounding is detected.

  • n_repetitions (int (default: 3)) – Number of pseudoreplicate iterations to generate.

  • resampling_fraction (float (default: 0.33)) – Fraction of cells to sample (with replacement) from a valid pseudobulk to generate a pseudoreplicate.

  • min_samples (int (default: 3)) – Minimum number of pseudobulk samples required per condition for direct DE testing. If fewer exist, falls back according to fallback_strategy.

  • alpha (float (default: 0.05)) – Significance threshold for direct pseudobulk DE testing.

  • alpha_fallback (float | None (default: 0.05)) – Separate significance threshold for fallback methods (pseudoreplicates or single-cell). If None, uses alpha.

  • correction_method (str (default: 'fdr_bh')) – Multiple testing correction method. Options include:

    • ‘fdr_bh’: Benjamini-Hochberg FDR (recommended)

    • ‘bonferroni’: Bonferroni correction

    • Others supported by statsmodels.stats.multitest.multipletests

  • engine (str (default: 'anova')) – Statistical engine for DE testing. Available engines are ‘pydeseq2’ and ‘anova’.

  • engine_kwargs (dict | None (default: None)) – Additional keyword arguments passed to the DE engine.

  • fallback_strategy (Optional[Literal['pseudoreplicates', 'single_cell']] (default: 'pseudoreplicates')) – Strategy when fewer than min_samples exist per condition:

    • ‘pseudoreplicates’: Generate synthetic replicates by resampling cells and run multiple DE tests, aggregating results

    • ‘single_cell’: Perform DE at single-cell resolution using all cells

    • None: Raise an error if insufficient samples

  • seed (int (default: 42)) – Random seed for reproducibility of pseudoreplicate generation.
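The interplay of min_cells, min_fraction, qualify_strategy and min_coverage can be sketched in plain Python. This is an illustration only, not the library's code. It assumes that a sample's "fraction" is its share of the condition's cells and that "coverage" is the share of a condition's cells contained in valid samples. The helper names qualifies and evaluate_condition are hypothetical.

```python
def qualifies(n_cells, fraction, min_cells=50, min_fraction=0.2, strategy="or"):
    """Return True if a candidate pseudobulk sample passes the thresholds."""
    enough_cells = n_cells >= min_cells
    enough_fraction = fraction >= min_fraction
    if strategy == "and":
        return enough_cells and enough_fraction
    if strategy == "or":
        return enough_cells or enough_fraction
    raise ValueError(f"unknown qualify_strategy: {strategy!r}")


def evaluate_condition(sample_sizes, min_coverage=0.75, **thresholds):
    """Flag valid samples of one condition and compute their coverage.

    Assumption: coverage = cells in valid samples / all cells of the condition.
    """
    total = sum(sample_sizes)
    valid = [qualifies(n, n / total, **thresholds) for n in sample_sizes]
    covered = sum(n for n, ok in zip(sample_sizes, valid) if ok)
    coverage = covered / total
    return valid, coverage, coverage >= min_coverage


# One condition split into three candidate samples (cell counts per sample).
sizes = [120, 40, 40]

# "or": the 40-cell samples fail min_cells but reach min_fraction (40/200 = 0.2),
# so all three qualify and coverage is 1.0.
valid_or, coverage_or, keep_or = evaluate_condition(sizes, strategy="or")

# "and": only the 120-cell sample qualifies; coverage 0.6 is below 0.75, so
# under this reading the condition would be collapsed.
valid_and, coverage_and, keep_and = evaluate_condition(sizes, strategy="and")
```

Under these assumptions, switching from “or” to “and” can turn a fully covered condition into one that falls below min_coverage.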

Return type:

DEResult

Returns:

DEResult – Container object with differential expression results and metadata:

  • results (pd.DataFrame)

    Main results table with columns:

    • gene: Gene identifier

    • baseMean: Mean expression across samples

    • log2FoldChange: Log2 fold change (query vs reference)

    • lfcSE: Standard error of log2 fold change

    • stat: Test statistic

    • stat_sign: Signed statistic for ranking

    • pvalue: Raw p-value

    • padj: Adjusted p-value (FDR)

  • query (str or list)

    Query condition(s) tested

  • reference (str or list)

    Reference condition(s) tested

  • design (str)

    Design formula used for testing

  • engine (str)

    Statistical engine used

  • used_pseudoreplicates (bool)

    True if pseudoreplicates were generated

  • used_single_cell (bool)

    True if single-cell level testing was performed

  • n_repetitions (int)

    Number of repetitions (1 for direct testing, >1 for pseudoreplicates)

  • repetition_results (dict, optional)

    Individual results from each repetition (only for pseudoreplicates)
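The padj column above holds multiplicity-adjusted p-values. For the default correction_method='fdr_bh', the adjustment is the Benjamini-Hochberg procedure. The following stand-alone sketch shows that procedure in plain Python for illustration. It is not the library's code path, and benjamini_hochberg is a hypothetical helper name.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and taking a
    # running minimum so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.5]
padj = benjamini_hochberg(raw)

# Genes whose adjusted p-value falls below the significance threshold.
alpha = 0.05
significant = [i for i, p in enumerate(padj) if p < alpha]
```

In this toy input only the first p-value stays below 0.05 after adjustment, although three raw p-values were below it.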

Raises:

ValueError

  • If fallback_strategy=None and insufficient samples exist

  • If data is AnnData but group_key or query is not provided

  • If specified groups/keys don’t exist in the data
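The documented relationship between min_samples, fallback_strategy, alpha and alpha_fallback can be summarized as a small decision routine. This is a sketch of the behavior as described on this page, not the actual implementation. The names choose_mode and effective_alpha are hypothetical.

```python
def choose_mode(n_query, n_reference, min_samples=3,
                fallback_strategy="pseudoreplicates"):
    """Pick the testing mode from the per-condition pseudobulk sample counts."""
    # Direct testing needs at least min_samples in each condition.
    if min(n_query, n_reference) >= min_samples:
        return "direct"
    if fallback_strategy is None:
        raise ValueError(
            "insufficient pseudobulk samples and fallback_strategy=None"
        )
    if fallback_strategy not in ("pseudoreplicates", "single_cell"):
        raise ValueError(f"unknown fallback_strategy: {fallback_strategy!r}")
    return fallback_strategy


def effective_alpha(mode, alpha=0.05, alpha_fallback=0.05):
    """Return the significance threshold that applies to the chosen mode."""
    if mode == "direct" or alpha_fallback is None:
        return alpha
    return alpha_fallback


mode = choose_mode(n_query=2, n_reference=5)
threshold = effective_alpha(mode, alpha=0.05, alpha_fallback=0.01)
```

Here the query condition has only two samples, so the sketch falls back to pseudoreplicates and applies the stricter fallback threshold.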

Warning

  • Single-cell fallback testing treats each cell as an independent sample, which inflates test statistics.

  • Pseudoreplicate fallback is more conservative, but if a large fraction of cells is sampled, the independence assumption may still be violated.

  • Results from fallback strategies should be interpreted with caution and ideally validated with independent biological replicates.
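To make the pseudoreplicate idea concrete, the sketch below resamples cells with replacement from one pseudobulk and sums their counts per gene. It is a simplified illustration under stated assumptions: cells are plain lists of counts, and n_replicates is a hypothetical parameter that is not the same as n_repetitions. The library's actual resampling scheme may differ.

```python
import random


def make_pseudoreplicates(cell_counts, n_replicates=3,
                          resampling_fraction=0.33, seed=42):
    """Build pseudoreplicates from per-cell count vectors.

    cell_counts: list of cells, each a list of gene counts.
    Returns one summed count vector per pseudoreplicate.
    """
    rng = random.Random(seed)  # fixed seed for reproducible draws
    n_draw = max(1, round(resampling_fraction * len(cell_counts)))
    replicates = []
    for _ in range(n_replicates):
        # Sampling with replacement: the same cell can be drawn repeatedly.
        drawn = rng.choices(cell_counts, k=n_draw)
        # Sum counts gene by gene across the drawn cells.
        replicates.append([sum(gene) for gene in zip(*drawn)])
    return replicates


# Ten cells, two genes (toy counts).
cells = [[1, 0], [2, 1], [0, 3], [4, 4], [1, 1],
         [0, 0], [2, 2], [3, 0], [1, 5], [0, 1]]
reps = make_pseudoreplicates(cells)
```

Because every pseudoreplicate is drawn from the same pool of cells, a larger resampling_fraction makes the replicates share more cells, which is why the independence assumption weakens as that fraction grows.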

Examples

n.a.
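A minimal, unofficial sketch of a call, built only from the signature documented above. The column names ("cell_type", "donor", "batch") and the query label are hypothetical placeholders. The import path scbulkde is inferred from the function's qualified name and is an assumption.

```python
# Keyword arguments taken from the documented signature; the values for
# group_key, query, replicate_key and categorical_covariates are hypothetical
# and must match columns and labels in your own adata.obs.
DE_KWARGS = dict(
    group_key="cell_type",
    query="T cell",
    reference="rest",
    replicate_key="donor",
    categorical_covariates=["batch"],
    min_cells=50,
    min_samples=3,
    engine="anova",
    fallback_strategy="pseudoreplicates",
    seed=42,
)


def run_de(adata):
    """Run the DE test on an AnnData object with the settings above."""
    # Imported lazily so this module loads even where the package is absent.
    import scbulkde  # assumed import name

    return scbulkde.tl.de(adata, **DE_KWARGS)
```

Calling run_de(adata) should return the DEResult described under Returns. Checking its used_pseudoreplicates and used_single_cell fields first shows whether a fallback strategy was applied before the results table is interpreted.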