scbulkde.tl.rank_genes_groups
- scbulkde.tl.rank_genes_groups(adata, groupby, *, mask_var=None, use_raw=None, groups='all', reference='rest', n_genes=None, rankby_abs=False, pts=False, key_added=None, layer=None, copy=False, **de_kwargs)
Pseudobulked differential expression analysis that interfaces with scanpy.
This is a drop-in replacement for scanpy.tl.rank_genes_groups that uses pseudobulk aggregation followed by differential expression testing instead of single-cell statistical tests. This approach is more statistically rigorous for single-cell RNA-seq data with biological replicates.
- Parameters:
adata (AnnData) – Annotated data matrix.
groupby (str) – The key of the observations grouping to consider (e.g., 'cell_type', 'cluster').
mask_var (ndarray | str | None (default: None)) – Select subset of genes to use in statistical tests. Can be a boolean array or a string key from adata.var.
use_raw (bool | None (default: None)) – Use the raw attribute of adata if present. The default behavior (None) is to use raw if present. Set to False to force use of normalized data.
groups (Union[Literal['all'], Iterable[str]] (default: 'all')) – Subset of groups, e.g. ['g1', 'g2', 'g3'], to which comparison shall be restricted, or 'all' (default), for all groups. Note that if reference='rest', all groups will still be used as the reference, not just those specified in groups.
reference (str (default: 'rest')) – If 'rest', compare each group to the union of the rest of the groups. If a group identifier, compare with respect to this specific group. When a specific reference is provided, it will not be tested against itself.
n_genes (int | None (default: None)) – The number of genes that appear in the returned tables. Defaults to all genes.
rankby_abs (bool (default: False)) – Rank genes by the absolute value of the log fold change, not by the log fold change itself. The returned scores are never the absolute values.
pts (bool (default: False)) – Compute the fraction of cells expressing the genes in each group.
key_added (str | None (default: None)) – The key in adata.uns where information is saved to. Defaults to 'rank_genes_groups'.
layer (str | None (default: None)) – Key from adata.layers whose value will be used to perform tests on. Cannot be used together with use_raw=True.
copy (bool (default: False)) – Whether to copy adata or modify it in place.
de_kwargs –
Keyword arguments passed to the underlying de() function. Must include pseudobulk-specific parameters such as replicate_key. Common parameters include:
  - replicate_key (str) – Column in adata.obs defining biological replicates (required for pseudobulk).
  - min_cells (int, default=50) – Minimum number of cells required per pseudobulk sample.
  - min_fraction (float, default=0.2) – Minimum fraction of cells per pseudobulk sample.
  - min_coverage (float, default=0.75) – Minimum coverage required per condition.
  - categorical_covariates (Sequence[str], optional) – Categorical covariates to include in the design.
  - continuous_covariates (Sequence[str], optional) – Continuous covariates to include in the design.
  - engine (str, default='anova') – Statistical engine for DE testing ('anova' or 'pydeseq2').
  - fallback_strategy ({'pseudoreplicates', 'single_cell', None}, default='pseudoreplicates') – Strategy when insufficient biological replicates exist.
  - min_samples (int, default=3) – Minimum number of pseudobulk samples required per condition for direct testing.
- Return type:
AnnData | None
- Returns:
Returns adata if copy=True, otherwise returns None and modifies adata in place. Results are stored in adata.uns[key_added] with the following structure:
  - names (numpy.recarray) – Structured array with top gene names for each group.
  - scores (numpy.recarray) – Structured array with test statistics for each group.
  - logfoldchanges (numpy.recarray) – Structured array with log2 fold changes for each group.
  - pvals (numpy.recarray) – Structured array with p-values for each group.
  - pvals_adj (numpy.recarray) – Structured array with adjusted p-values (FDR) for each group.
  - pts (pd.DataFrame, if pts=True) – Fraction of cells expressing each gene in each group.
  - pts_rest (pd.DataFrame, if pts=True and reference='rest') – Fraction of cells expressing each gene in the rest of the cells.
  - params (dict) – Dictionary containing parameters used for the analysis.
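The structured arrays listed above hold one field per group, so a single group's results can be pulled out by field name. The following is a minimal sketch on a hand-built mock of the result dictionary, not on real output; the group labels ("B", "T"), gene names, p-values, and the helper `significant_genes` are all illustrative assumptions.

```python
import numpy as np

# Mock of adata.uns[key_added] for two hypothetical groups, "B" and "T".
# Real results would come from the function; these values are made up.
groups = ["B", "T"]
names = np.rec.fromarrays(
    [np.array(["CD79A", "MS4A1"]), np.array(["CD3E", "IL7R"])], names=groups
)
pvals_adj = np.rec.fromarrays(
    [np.array([0.001, 0.20]), np.array([0.03, 0.04])], names=groups
)
result = {"names": names, "pvals_adj": pvals_adj}


def significant_genes(result, group, alpha=0.05):
    """Return gene names for one group whose adjusted p-value is below alpha."""
    mask = result["pvals_adj"][group] < alpha
    return [str(g) for g in result["names"][group][mask]]


print(significant_genes(result, "B"))  # ['CD79A']
print(significant_genes(result, "T"))  # ['CD3E', 'IL7R']
```

Because rows are ordered per group, the same row index in names, scores, logfoldchanges, pvals, and pvals_adj refers to the same gene within a given group field.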
Notes
Unlike scanpy.tl.rank_genes_groups which uses single-cell statistical tests (t-test, Wilcoxon, etc.), this implementation:
- Aggregates cells into pseudobulk samples based on biological replicates
- Performs differential expression testing on pseudobulk data
- Properly accounts for sample-level variation and biological replicates
This approach is more statistically appropriate for single-cell RNA-seq data with biological replicates and reduces false discoveries [Squair2021].
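To make the aggregation step concrete, here is a minimal sketch of pseudobulking a toy count matrix. It assumes aggregation by summing counts over cells that share a (replicate, group) pair; the source does not state which aggregation scbulkde uses, so this illustrates the general idea rather than the library's implementation. All labels and counts are made up.

```python
import numpy as np

# Toy count matrix: 6 cells x 2 genes (values are made up).
counts = np.array([[1, 0], [2, 1], [0, 3], [4, 0], [1, 1], [0, 2]])
replicate = np.array(["d1", "d1", "d2", "d2", "d1", "d2"])  # hypothetical replicate labels
group = np.array(["A", "A", "A", "A", "B", "B"])            # hypothetical group labels


def pseudobulk_sum(counts, replicate, group):
    """Sum counts over all cells sharing a (replicate, group) pair."""
    keys = sorted(set(zip(replicate.tolist(), group.tolist())))
    profiles = {}
    for rep, grp in keys:
        mask = (replicate == rep) & (group == grp)
        profiles[(rep, grp)] = counts[mask].sum(axis=0)
    return profiles


pb = pseudobulk_sum(counts, replicate, group)
for key, profile in pb.items():
    print(key, profile.tolist())
```

Each resulting profile is one pseudobulk sample, so the downstream test sees one observation per replicate and group instead of one per cell.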
When insufficient biological replicates are available, the function can fall back to pseudoreplicate generation or single-cell testing (controlled by fallback_strategy in de_kwargs).
References
[Squair2021] Squair, J.W., et al. (2021) “Confronting false discoveries in single-cell differential expression.” Nature Communications 12, 5692.
See also
de – Core differential expression function
pp.pseudobulk – Pseudobulk aggregation without DE testing
Examples
n.a.
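In the absence of an upstream example, the following is a hedged sketch of how a call might look, based only on the signature and parameters documented above. It assumes that de_kwargs entries are passed as ordinary keyword arguments, and the column names "cell_type" and "donor", the key "pb_de", and the wrapper `run_pseudobulk_de` are hypothetical.

```python
# Hypothetical column names: "cell_type" (grouping) and "donor" (replicates).
de_kwargs = {
    "replicate_key": "donor",                 # required for pseudobulk aggregation
    "min_cells": 50,                          # documented default
    "engine": "anova",                        # documented default; 'pydeseq2' is the alternative
    "fallback_strategy": "pseudoreplicates",  # documented default
}


def run_pseudobulk_de(adata, groupby="cell_type", key_added="pb_de"):
    """Call scbulkde.tl.rank_genes_groups with the options above.

    Assumes the de_kwargs entries are accepted as keyword arguments.
    """
    import scbulkde  # imported lazily so the sketch loads without the package

    scbulkde.tl.rank_genes_groups(
        adata,
        groupby,
        reference="rest",
        pts=True,
        key_added=key_added,
        **de_kwargs,
    )
    # With copy=False (the default), adata is modified in place.
    return adata.uns[key_added]
```

Writing to a non-default key_added keeps these results separate from any existing adata.uns['rank_genes_groups'] entry produced by scanpy.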