scbulkde.tl.de

scbulkde.tl.de(data, group_key=None, query=None, reference='rest', replicate_key=None, min_cells=50, min_fraction=0.2, min_coverage=0.75, categorical_covariates=None, continuous_covariates=None, continuous_aggregation='mean', layer=None, layer_aggregation='sum', qualify_strategy='or', covariate_strategy='sequence_order', resolve_conflicts=True, n_repetitions=3, resampling_fraction=0.33, min_samples=3, alpha=0.05, alpha_fallback=0.05, correction_method='fdr_bh', engine='anova', engine_kwargs=None, fallback_strategy='pseudoreplicates', seed=42)

Perform differential expression analysis on pseudobulked single-cell data.

This function integrates pseudobulking and differential expression testing with fallback strategies when insufficient biological replicates exist.

Parameters:

data (AnnData | PseudobulkResult) – Input data. Either an AnnData object (will be pseudobulked automatically) or a pre-computed PseudobulkResult from pp.pseudobulk().
group_key (str | None (default: None)) – Column name in adata.obs that defines the cell groups for comparison (e.g., ‘cell_type’, ‘condition’, ‘cluster’).
query (str | Sequence[str] | None (default: None)) – Cell group(s) to be used as the query/test condition. Must be present in adata.obs[group_key].
reference (str | Sequence[str] (default: 'rest')) – Cell group(s) to be used as the reference/control condition. If “rest”, all groups not in query are used as reference. Must be present in adata.obs[group_key].
replicate_key (str | None (default: None)) – Column name in adata.obs defining biological replicates. Required for creating multiple pseudobulk samples per condition, but never included in the design. If None, cells are not stratified by replicate.
min_cells (int | None (default: 50)) – Minimum number of cells required per pseudobulk sample. Samples with fewer cells are excluded from analysis.
min_fraction (float | None (default: 0.2)) – Minimum fraction of the condition's cells that a pseudobulk sample must contain to be considered valid. Samples with a lower fraction are excluded from analysis.
min_coverage (float | None (default: 0.75)) – Minimum coverage provided by all valid samples per condition. Conditions with lower coverage are collapsed. Range: [0.0, 1.0].
categorical_covariates (Sequence[str] | None (default: None)) – Column names in adata.obs representing categorical covariates to include in the design (e.g., [‘experiment’, ‘chemistry’, ‘batch’]). These are added as stratification factors along with replicate_key.
continuous_covariates (Sequence[str] | None (default: None)) – Column names in adata.obs representing continuous covariates to include in the design (e.g., [‘cellcycle’, ‘pct_mito’]). These are aggregated per pseudobulk sample.
continuous_aggregation (Union[Literal['mean', 'sum', 'median'], Callable, None] (default: 'mean')) – Method to aggregate continuous covariates across cells within each pseudobulk sample. Can be a string specifying a standard aggregation or a custom callable.
layer (str | None (default: None)) – Layer in adata.layers to use for aggregation. If None, uses adata.X.
layer_aggregation (Literal['sum', 'mean'] (default: 'sum')) – Method to aggregate expression values across cells.
qualify_strategy (Literal['and', 'or'] (default: 'or')) – Strategy for sample qualification when multiple criteria are specified:
- “and”: A sample candidate must pass both the min_cells AND min_fraction thresholds
- “or”: A sample candidate must pass either the min_cells OR the min_fraction threshold
covariate_strategy (Literal['sequence_order', 'most_levels'] (default: 'sequence_order')) – Strategy for ordering covariates in the design formula when conflicts arise:
- “sequence_order”: Drop covariates from back to front in the provided list
- “most_levels”: Prioritize covariates with more unique levels
resolve_conflicts (bool (default: True)) – If True, automatically resolve confounded covariates by iteratively removing them to ensure a full-rank design matrix. If False, raise an error when confounding is detected.
n_repetitions (int (default: 3)) – Number of pseudoreplicate iterations to generate.
resampling_fraction (float (default: 0.33)) – Fraction of cells to sample (with replacement) from a valid pseudobulk to generate a pseudoreplicate.
min_samples (int (default: 3)) – Minimum number of pseudobulk samples required per condition for direct DE testing. If fewer exist, falls back according to fallback_strategy.
alpha (float (default: 0.05)) – Significance threshold for direct pseudobulk DE testing.
alpha_fallback (float | None (default: 0.05)) – Separate significance threshold for fallback methods (pseudoreplicates or single-cell). If None, uses alpha.
correction_method (str (default: 'fdr_bh')) – Multiple testing correction method. Options include:
- ‘fdr_bh’: Benjamini-Hochberg FDR (recommended)
- ‘bonferroni’: Bonferroni correction
- Others supported by statsmodels.stats.multitest.multipletests
engine (str (default: 'anova')) – Statistical engine for DE testing. Available engines are ‘pydeseq2’ and ‘anova’.
engine_kwargs (dict | None (default: None)) – Additional keyword arguments passed to the DE engine.
fallback_strategy (Optional[Literal['pseudoreplicates', 'single_cell']] (default: 'pseudoreplicates')) – Strategy when fewer than min_samples exist per condition:
- ‘pseudoreplicates’: Generate synthetic replicates by resampling cells and run multiple DE tests, aggregating results
- ‘single_cell’: Perform DE at single-cell resolution using all cells
- None: Raise an error if insufficient samples
seed (int (default: 42)) – Random seed for reproducibility of pseudoreplicate generation.
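The interplay between the sample-qualification and fallback parameters can be sketched in plain Python. This is only an illustration of the behaviour described above, not the library's implementation: the helper names `qualifies`, `choose_path`, and `draw_pseudoreplicates` are hypothetical, and drawing exactly one resample per repetition is an assumption.

```python
import random


def qualifies(n_cells, fraction, min_cells=50, min_fraction=0.2, qualify_strategy="or"):
    """Decide whether one pseudobulk sample candidate is valid (hypothetical helper)."""
    enough_cells = n_cells >= min_cells
    enough_fraction = fraction >= min_fraction
    if qualify_strategy == "and":
        return enough_cells and enough_fraction
    if qualify_strategy == "or":
        return enough_cells or enough_fraction
    raise ValueError(f"unknown qualify_strategy: {qualify_strategy!r}")


def choose_path(samples_per_condition, min_samples=3, fallback_strategy="pseudoreplicates"):
    """Return 'direct' when every condition has enough samples, else the fallback."""
    if all(n >= min_samples for n in samples_per_condition.values()):
        return "direct"
    if fallback_strategy is None:
        raise ValueError("insufficient samples and fallback_strategy=None")
    return fallback_strategy


def draw_pseudoreplicates(cell_ids, n_repetitions=3, resampling_fraction=0.33, seed=42):
    """Draw one resample (with replacement) of the cells per repetition (assumption)."""
    rng = random.Random(seed)  # a fixed seed makes the draws reproducible
    n_draw = max(1, round(len(cell_ids) * resampling_fraction))
    return [rng.choices(cell_ids, k=n_draw) for _ in range(n_repetitions)]
```

For instance, a candidate with 60 cells that holds only 10% of its condition's cells is valid under “or” but not under “and”, and a condition with only two valid samples triggers the configured fallback when min_samples=3.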

Return type:

DEResult

Returns:

DEResult – Container object with differential expression results and metadata:

results (pd.DataFrame) – Main results table with columns:
- gene: Gene identifier
- baseMean: Mean expression across samples
- log2FoldChange: Log2 fold change (query vs reference)
- lfcSE: Standard error of log2 fold change
- stat: Test statistic
- stat_sign: Signed statistic for ranking
- pvalue: Raw p-value
- padj: Adjusted p-value (FDR)
query (str or list) – Query condition(s) tested
reference (str or list) – Reference condition(s) tested
design (str) – Design formula used for testing
engine (str) – Statistical engine used
used_pseudoreplicates (bool) – True if pseudoreplicates were generated
used_single_cell (bool) – True if single-cell level testing was performed
n_repetitions (int) – Number of repetitions (1 for direct testing, >1 for pseudoreplicates)
repetition_results (dict, optional) – Individual results from each repetition (only for pseudoreplicates)
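
To make the relationship between the pvalue and padj columns concrete, here is a minimal pure-Python sketch of the Benjamini-Hochberg adjustment that correction_method='fdr_bh' refers to. The function name `benjamini_hochberg` is hypothetical and written for illustration only; it is not taken from the library, and it is not claimed to reproduce the library's output exactly.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(raw)
```

Each adjusted value is at least as large as its raw p-value, and genes keep their relative ordering, which is why padj is the column to threshold against alpha.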

Raises:

ValueError –
- If fallback_strategy=None and insufficient samples exist
- If data is AnnData but group_key or query is not provided
- If specified groups/keys don’t exist in the data

Warning

Single-cell fallback testing treats each cell as an independent sample, which inflates test statistics.
Pseudoreplicate fallback is more conservative, but if a large fraction of cells is sampled, the independence assumption may still be violated.
Results from fallback strategies should be interpreted with caution and ideally validated with independent biological replicates.
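
The inflation mentioned in this warning can be illustrated with a small deterministic toy example. The numbers, the three-donor layout, and the use of a Welch t statistic are all assumptions chosen for illustration; they are unrelated to the engines the function actually uses.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic for two samples with possibly unequal variances."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# 100 cells per donor, spread evenly around the donor's mean expression.
offsets = [(i - 49.5) / 100 for i in range(100)]
query_donor_means = [1.5, 2.5, 3.5]      # three donors in the query condition
reference_donor_means = [1.0, 2.0, 3.0]  # three donors in the reference condition

query_cells_by_donor = [[m + o for o in offsets] for m in query_donor_means]
reference_cells_by_donor = [[m + o for o in offsets] for m in reference_donor_means]

# Single-cell view: every cell is treated as an independent observation.
t_cells = welch_t(
    [x for donor in query_cells_by_donor for x in donor],
    [x for donor in reference_cells_by_donor for x in donor],
)

# Pseudobulk view: one aggregated value per donor.
t_pseudobulk = welch_t(
    [mean(donor) for donor in query_cells_by_donor],
    [mean(donor) for donor in reference_cells_by_donor],
)
```

Both views see the same 0.5 difference in means, but the cell-level statistic is many times larger because donor-to-donor variability is spread over 300 "samples" instead of 3.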

Examples

n.a.
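
No official example is provided. The following is an unofficial sketch of what a call might look like, using only parameters and result attributes documented on this page. The column names 'cell_type' and 'donor' and the group label 'B cell' are placeholders, the wrapper `run_de` is hypothetical, and it is an assumption that `import scbulkde` exposes the `tl` submodule.

```python
def run_de(adata):
    """Compare one cell group against all others, using donors as replicates."""
    # Imported lazily so the sketch can be read without the package installed.
    import scbulkde

    result = scbulkde.tl.de(
        adata,
        group_key="cell_type",   # placeholder column in adata.obs
        query="B cell",          # placeholder group label
        reference="rest",
        replicate_key="donor",   # placeholder replicate column
        engine="anova",
        fallback_strategy="pseudoreplicates",
        seed=42,
    )

    # Keep genes below the default significance threshold, ranked by signed statistic.
    table = result.results
    significant = table[table["padj"] < 0.05]
    return significant.sort_values("stat_sign", ascending=False)
```

Checking `used_pseudoreplicates` and `used_single_cell` on the returned object before interpreting the table shows whether a fallback strategy was triggered.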
