scbulkde.tl.de
- scbulkde.tl.de(data, group_key=None, query=None, reference='rest', replicate_key=None, min_cells=50, min_fraction=0.2, min_coverage=0.75, categorical_covariates=None, continuous_covariates=None, continuous_aggregation='mean', layer=None, layer_aggregation='sum', qualify_strategy='or', covariate_strategy='sequence_order', resolve_conflicts=True, n_repetitions=3, resampling_fraction=0.33, min_samples=3, alpha=0.05, alpha_fallback=0.05, correction_method='fdr_bh', engine='anova', engine_kwargs=None, fallback_strategy='pseudoreplicates', seed=42)
Perform differential expression analysis on pseudobulked single-cell data.
This function integrates pseudobulking and differential expression testing with fallback strategies when insufficient biological replicates exist.
- Parameters:
data (AnnData | PseudobulkResult) – Input data. Either an AnnData object (pseudobulked automatically) or a pre-computed PseudobulkResult from pp.pseudobulk().
group_key (str | None (default: None)) – Column name in adata.obs that defines the cell groups for comparison (e.g., 'cell_type', 'condition', 'cluster').
query (str | Sequence[str] | None (default: None)) – Cell group(s) to be used as the query/test condition. Must be present in adata.obs[group_key].
reference (str | Sequence[str] (default: 'rest')) – Cell group(s) to be used as the reference/control condition. If 'rest', all groups not in query are used as reference. Must be present in adata.obs[group_key].
replicate_key (str | None (default: None)) – Column name in adata.obs defining biological replicates. Required for creating multiple pseudobulk samples per condition, but never included in the design. If None, cells are not stratified by replicate.
min_cells (int | None (default: 50)) – Minimum number of cells required per pseudobulk sample. Samples with fewer cells are excluded from analysis.
min_fraction (float | None (default: 0.2)) – Minimum fraction of cells of the condition in that pseudobulk sample for it to be considered valid. Samples with a lower fraction are excluded from analysis.
min_coverage (float | None (default: 0.75)) – Minimum coverage provided by all valid samples per condition. Conditions with lower coverage are collapsed. Range: [0.0, 1.0].
categorical_covariates (Sequence[str] | None (default: None)) – Column names in adata.obs representing categorical covariates to include in the design (e.g., ['experiment', 'chemistry', 'batch']). These are added as stratification factors along with replicate_key.
continuous_covariates (Sequence[str] | None (default: None)) – Column names in adata.obs representing continuous covariates to include in the design (e.g., ['cellcycle', 'pct_mito']). These are aggregated per pseudobulk sample.
continuous_aggregation (Union[Literal['mean', 'sum', 'median'], Callable, None] (default: 'mean')) – Method to aggregate continuous covariates across cells within each pseudobulk sample. Can be a string specifying a standard aggregation or a custom callable.
layer (str | None (default: None)) – Layer in adata.layers to use for aggregation. If None, uses adata.X.
layer_aggregation (Literal['sum', 'mean'] (default: 'sum')) – Method to aggregate expression values across cells.
qualify_strategy (Literal['and', 'or'] (default: 'or')) – Strategy for sample qualification when multiple criteria are specified:
  - 'and': Sample candidate must pass both min_cells AND min_fraction thresholds
  - 'or': Sample candidate must pass either the min_cells OR the min_fraction threshold
covariate_strategy (Literal['sequence_order', 'most_levels'] (default: 'sequence_order')) – Strategy for ordering covariates in the design formula when conflicts arise:
  - 'sequence_order': Drop covariates from back to front in the provided list
  - 'most_levels': Prioritize covariates with more unique levels
resolve_conflicts (bool (default: True)) – If True, automatically resolve confounded covariates by iteratively removing them to ensure a full-rank design matrix. If False, raise an error when confounding is detected.
n_repetitions (int (default: 3)) – Number of pseudoreplicate iterations to generate.
resampling_fraction (float (default: 0.33)) – Fraction of cells to sample (with replacement) from a valid pseudobulk to generate a pseudoreplicate.
min_samples (int (default: 3)) – Minimum number of pseudobulk samples required per condition for direct DE testing. If fewer exist, falls back according to fallback_strategy.
alpha (float (default: 0.05)) – Significance threshold for direct pseudobulk DE testing.
alpha_fallback (float | None (default: 0.05)) – Separate significance threshold for fallback methods (pseudoreplicates or single-cell). If None, uses alpha.
correction_method (str (default: 'fdr_bh')) – Multiple testing correction method. Options include:
  - 'fdr_bh': Benjamini-Hochberg FDR (recommended)
  - 'bonferroni': Bonferroni correction
  - Others supported by statsmodels.stats.multitest.multipletests
engine (str (default: 'anova')) – Statistical engine for DE testing. Available engines are 'pydeseq2' and 'anova'.
engine_kwargs (dict | None (default: None)) – Additional keyword arguments passed to the DE engine.
fallback_strategy (Optional[Literal['pseudoreplicates', 'single_cell']] (default: 'pseudoreplicates')) – Strategy when fewer than min_samples exist per condition:
  - 'pseudoreplicates': Generate synthetic replicates by resampling cells and run multiple DE tests, aggregating results
  - 'single_cell': Perform DE at single-cell resolution using all cells
  - None: Raise an error if insufficient samples
seed (int (default: 42)) – Random seed for reproducibility of pseudoreplicate generation.
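The interaction of min_cells, min_fraction, and qualify_strategy described above can be sketched in a few lines of plain Python. This is an illustration of the documented rule only, not the library's code: the helper name `qualifies` is hypothetical, the fraction is assumed to be computed beforehand, and treating a threshold of None as "not specified" is an assumption.

```python
def qualifies(n_cells, fraction, min_cells=50, min_fraction=0.2, strategy="or"):
    """Return True if a pseudobulk sample candidate passes the thresholds.

    Hypothetical helper for illustration; not part of scbulkde.
    """
    checks = []
    # Assumption: a threshold set to None is simply skipped.
    if min_cells is not None:
        checks.append(n_cells >= min_cells)
    if min_fraction is not None:
        checks.append(fraction >= min_fraction)
    if not checks:
        return True
    # 'and' requires every specified threshold, 'or' requires at least one.
    return all(checks) if strategy == "and" else any(checks)


# A candidate with enough cells but a low fraction:
passes_or = qualifies(60, 0.1, strategy="or")
passes_and = qualifies(60, 0.1, strategy="and")
```

With the default 'or' strategy, a candidate that meets only one threshold is kept; switching to 'and' makes the filter stricter.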
- Return type:
DEResult
- Returns:
Container object with differential expression results and metadata:
- results (pd.DataFrame) – Main results table with columns:
  - gene: Gene identifier
  - baseMean: Mean expression across samples
  - log2FoldChange: Log2 fold change (query vs reference)
  - lfcSE: Standard error of log2 fold change
  - stat: Test statistic
  - stat_sign: Signed statistic for ranking
  - pvalue: Raw p-value
  - padj: Adjusted p-value (FDR)
- query (str or list) – Query condition(s) tested
- reference (str or list) – Reference condition(s) tested
- design (str) – Design formula used for testing
- engine (str) – Statistical engine used
- used_pseudoreplicates (bool) – True if pseudoreplicates were generated
- used_single_cell (bool) – True if single-cell level testing was performed
- n_repetitions (int) – Number of repetitions (1 for direct testing, >1 for pseudoreplicates)
- repetition_results (dict, optional) – Individual results from each repetition (only for pseudoreplicates)
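To make the relationship between the pvalue and padj columns concrete, here is a minimal sketch of the standard Benjamini-Hochberg adjustment that the 'fdr_bh' option names. It is a pure-Python illustration of the textbook procedure, not the code path the library uses; the helper name `bh_adjust` is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Hypothetical helper for illustration; not part of scbulkde.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
padj = bh_adjust(raw)
# Genes would be called significant where padj falls below alpha.
significant = [p < 0.05 for p in padj]
```

Each adjusted value is at least as large as the raw p-value it came from, which is why a gene can pass the threshold on pvalue but not on padj.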
- Raises:
  - If fallback_strategy=None and insufficient samples exist
  - If data is AnnData but group_key or query is not provided
  - If specified groups/keys don't exist in the data
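The way min_samples and fallback_strategy decide between direct testing, a fallback, and an error can be sketched as follows. This is a simplified reading of the parameter descriptions, not the library's implementation: the helper name `choose_testing_mode` is hypothetical, and raising ValueError is an assumption, since the exception type is not stated here.

```python
def choose_testing_mode(samples_per_condition, min_samples=3,
                        fallback_strategy="pseudoreplicates"):
    """Pick 'direct', 'pseudoreplicates', or 'single_cell' testing.

    Hypothetical helper for illustration; not part of scbulkde.
    samples_per_condition maps a condition name to its number of
    valid pseudobulk samples.
    """
    if fallback_strategy not in ("pseudoreplicates", "single_cell", None):
        raise ValueError(f"unknown fallback_strategy: {fallback_strategy!r}")
    # Direct testing needs enough samples in every condition.
    if min(samples_per_condition.values()) >= min_samples:
        return "direct"
    if fallback_strategy is None:
        # Assumption: the exception type is a guess for this sketch.
        raise ValueError("insufficient samples and no fallback strategy set")
    return fallback_strategy


mode_enough = choose_testing_mode({"query": 4, "reference": 5})
mode_few = choose_testing_mode({"query": 2, "reference": 5})
```

In this sketch a single under-sampled condition is enough to trigger the fallback, which matches the "per condition" wording of min_samples.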
Warning
- Single-cell fallback testing treats each cell as an independent sample, which inflates test statistics.
- Pseudoreplicate fallback is more conservative, but if a large fraction of cells is sampled, the independence assumption may still be violated.
- Results from fallback strategies should be interpreted with caution and ideally validated with independent biological replicates.
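The independence caveat is easier to see with a small sketch of how pseudoreplicates could be drawn from one pseudobulk sample. This is a rough illustration under stated assumptions, not the library's resampling code: the helper name `draw_pseudoreplicates` is hypothetical, one draw per repetition is assumed, and rounding the draw size is an assumption.

```python
import random


def draw_pseudoreplicates(cell_ids, n_repetitions=3,
                          resampling_fraction=0.33, seed=42):
    """Draw cell subsets with replacement from one pseudobulk sample.

    Hypothetical helper for illustration; not part of scbulkde.
    """
    rng = random.Random(seed)  # fixed seed makes the draws reproducible
    # Assumption: the draw size is the rounded fraction, at least one cell.
    n_draw = max(1, round(len(cell_ids) * resampling_fraction))
    # random.Random.choices samples with replacement.
    return [rng.choices(cell_ids, k=n_draw) for _ in range(n_repetitions)]


cells = [f"cell_{i}" for i in range(100)]
replicates = draw_pseudoreplicates(cells)
# Cells that appear in more than one pseudoreplicate are shared evidence.
shared = set(replicates[0]) & set(replicates[1])
```

Because every draw comes from the same pool of cells, the pseudoreplicates can overlap, and the overlap grows with resampling_fraction. That shared material is what weakens the independence assumption.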
Examples
n.a.