scbulkde.tl.rank_genes_groups

scbulkde.tl.rank_genes_groups(adata, groupby, *, mask_var=None, use_raw=None, groups='all', reference='rest', n_genes=None, rankby_abs=False, pts=False, key_added=None, layer=None, copy=False, **de_kwargs)

Pseudobulked differential expression analysis that interfaces with scanpy.

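This is a drop-in replacement for scanpy.tl.rank_genes_groups that uses pseudobulk aggregation followed by differential expression testing instead of single-cell statistical tests; pseudobulk-specific options such as replicate_key are supplied through de_kwargs. Because cells from the same sample are not independent observations, testing at the sample level is more statistically rigorous for single-cell RNA-seq data with biological replicates.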

Parameters:

adata (AnnData) – Annotated data matrix.
groupby (str) – The key of the observations grouping to consider (e.g., 'cell_type', 'cluster').
mask_var (ndarray | str | None (default: None)) – Select subset of genes to use in statistical tests. Can be a boolean array or a string key from adata.var.
use_raw (bool | None (default: None)) – Whether to use the raw attribute of adata. With the default (None), raw is used if present. Set to False to force use of normalized data.
groups (Union[Literal['all'], Iterable[str]] (default: 'all')) – Subset of groups, e.g. ['g1', 'g2', 'g3'], to which comparison shall be restricted, or 'all' (default), for all groups. Note that if reference='rest' all groups will still be used as the reference, not just those specified in groups.
reference (str (default: 'rest')) – If 'rest', compare each group to the union of the rest of the groups. If a group identifier, compare with respect to this specific group. When a specific reference is provided, it will not be tested against itself.
n_genes (int | None (default: None)) – The number of genes that appear in the returned tables. Defaults to all genes.
rankby_abs (bool (default: False)) – Rank genes by the absolute value of the log fold change, not by the log fold change itself. The returned scores are never the absolute values.
pts (bool (default: False)) – Compute the fraction of cells expressing the genes in each group.
key_added (str | None (default: None)) – The key in adata.uns where information is saved to. Defaults to 'rank_genes_groups'.
layer (str | None (default: None)) – Key from adata.layers whose value will be used to perform tests on. Cannot be used together with use_raw=True.
copy (bool (default: False)) – Whether to copy adata or modify it in place.
**de_kwargs – Keyword arguments passed to the underlying de() function. Must include pseudobulk-specific parameters such as replicate_key. Common parameters include:
- replicate_key : str
  Column in adata.obs defining biological replicates (required for pseudobulk).
- min_cells : int, default=50
  Minimum number of cells required per pseudobulk sample.
- min_fraction : float, default=0.2
  Minimum fraction of cells per pseudobulk sample.
- min_coverage : float, default=0.75
  Minimum coverage required per condition.
- categorical_covariates : Sequence[str], optional
  Categorical covariates to include in the design.
- continuous_covariates : Sequence[str], optional
  Continuous covariates to include in the design.
- engine : str, default='anova'
  Statistical engine for DE testing ('anova' or 'pydeseq2').
- fallback_strategy : {'pseudoreplicates', 'single_cell', None}, default='pseudoreplicates'
  Strategy when insufficient biological replicates exist.
- min_samples : int, default=3
  Minimum number of pseudobulk samples required per condition for direct testing.
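A minimal sketch of how the pseudobulk options might be assembled and forwarded. The column names 'cell_type' and 'donor_id', the key 'pseudobulk_de', and both helper functions are hypothetical; the default values are copied from the list above. The import of scbulkde is deferred into the second function so that the keyword assembly can be inspected on its own.

```python
def make_de_kwargs(replicate_key, **overrides):
    """Collect options to forward through **de_kwargs.

    Defaults mirror the documented values; replicate_key has no default
    because it is required for pseudobulking.
    """
    de_kwargs = {
        "replicate_key": replicate_key,
        "min_cells": 50,
        "min_fraction": 0.2,
        "min_coverage": 0.75,
        "engine": "anova",
        "fallback_strategy": "pseudoreplicates",
        "min_samples": 3,
    }
    de_kwargs.update(overrides)  # caller-supplied values win over defaults
    return de_kwargs


def run_pseudobulk_de(adata, groupby="cell_type", replicate_key="donor_id"):
    """Run the analysis and return the stored result (assumed usage)."""
    import scbulkde  # deferred: only needed when the analysis is actually run

    de_kwargs = make_de_kwargs(replicate_key, engine="pydeseq2")
    scbulkde.tl.rank_genes_groups(
        adata,
        groupby,
        reference="rest",
        key_added="pseudobulk_de",  # hypothetical key, avoids overwriting scanpy results
        **de_kwargs,
    )
    return adata.uns["pseudobulk_de"]


print(make_de_kwargs("donor_id", min_cells=20))
```

Storing results under a separate key_added keeps any existing scanpy.tl.rank_genes_groups output in adata.uns available for comparison.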

Return type:

AnnData | None

Returns:

Returns the modified copy of adata if copy=True; otherwise returns None and modifies adata in place. Results are stored in adata.uns[key_added] with the following structure:

- names : numpy.recarray
  Structured array with top gene names for each group.
- scores : numpy.recarray
  Structured array with test statistics for each group.
- logfoldchanges : numpy.recarray
  Structured array with log2 fold changes for each group.
- pvals : numpy.recarray
  Structured array with p-values for each group.
- pvals_adj : numpy.recarray
  Structured array with adjusted p-values (FDR) for each group.
- pts : pd.DataFrame (if pts=True)
  Fraction of cells expressing each gene in each group.
- pts_rest : pd.DataFrame (if pts=True and reference='rest')
  Fraction of cells expressing each gene in the rest of the cells.
- params : dict
  Dictionary containing parameters used for the analysis.
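The stored fields can be read per group by indexing first with the field name and then with the group label. The sketch below assumes that layout; the helper top_genes and the demo values are hypothetical and purely illustrative. A plain dict stands in for adata.uns[key_added] so that the example runs without any data; NumPy record arrays accept the same result[field][group] indexing.

```python
def top_genes(result, group, n=5, alpha=0.05):
    """Return (gene, log2 fold change, adjusted p-value) for the first `n`
    ranked genes of `group` whose adjusted p-value is below `alpha`."""
    rows = zip(
        result["names"][group],
        result["logfoldchanges"][group],
        result["pvals_adj"][group],
    )
    hits = [(str(g), float(lfc), float(p)) for g, lfc, p in rows if p < alpha]
    return hits[:n]


# Stand-in with the documented field names; the numbers are made up.
demo = {
    "names": {"T cell": ["CD3E", "IL7R", "MS4A1"]},
    "logfoldchanges": {"T cell": [3.1, 2.4, -2.8]},
    "pvals_adj": {"T cell": [1e-6, 3e-4, 0.2]},
}

for gene, lfc, padj in top_genes(demo, "T cell"):
    print(f"{gene}\tlog2FC={lfc:+.2f}\tpadj={padj:.1e}")
```

With a real result, the call would be top_genes(adata.uns[key_added], group) for any group label present in the result.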

Notes

Unlike scanpy.tl.rank_genes_groups, which uses single-cell statistical tests (t-test, Wilcoxon, etc.), this implementation:

- Aggregates cells into pseudobulk samples based on biological replicates
- Performs differential expression testing on pseudobulk data
- Properly accounts for sample-level variation and biological replicates

This approach is more statistically appropriate for single-cell RNA-seq data with biological replicates and reduces false discoveries [Squair2021].
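To make the aggregation step concrete, here is a minimal standard-library sketch that sums per-cell counts within each (replicate, group) pair. Summing raw counts is one common way to build pseudobulk samples; whether scbulkde sums or averages, and how it applies min_cells, is not stated here, so this is an illustration of the idea and not the library's implementation. The function name and toy data are hypothetical.

```python
from collections import defaultdict


def pseudobulk(counts, replicates, groups):
    """Sum per-cell count vectors within each (replicate, group) pair.

    counts     : list of per-cell count lists (cells x genes)
    replicates : replicate label for each cell
    groups     : group label for each cell
    Returns two dicts keyed by (replicate, group): summed counts and cell numbers.
    """
    n_genes = len(counts[0])
    sums = defaultdict(lambda: [0] * n_genes)
    n_cells = defaultdict(int)
    for row, rep, grp in zip(counts, replicates, groups):
        key = (rep, grp)
        n_cells[key] += 1
        acc = sums[key]
        for j, value in enumerate(row):
            acc[j] += value
    return dict(sums), dict(n_cells)


# Toy data: 4 cells, 3 genes, 2 donors, 2 groups.
counts = [[1, 0, 3], [2, 1, 0], [0, 4, 1], [5, 0, 0]]
replicates = ["d1", "d1", "d2", "d2"]
groups = ["A", "A", "A", "B"]

sums, n_cells = pseudobulk(counts, replicates, groups)
for key in sorted(sums):
    print(key, sums[key], "cells:", n_cells[key])
```

Each resulting row is one pseudobulk sample, so the downstream test sees one observation per replicate and group instead of one per cell.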

When insufficient biological replicates are available, the function can fall back to pseudoreplicate generation or single-cell testing (controlled by de_kwargs).
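The fallback decision can be pictured as a check of the number of pseudobulk samples per condition against min_samples. The sketch below is an assumption about that logic, not the library's code: the function name, the 'direct' label, and the choice to raise an error when fallback_strategy is None are all hypothetical.

```python
from collections import Counter


def choose_strategy(sample_conditions, min_samples=3,
                    fallback_strategy="pseudoreplicates"):
    """Decide between direct pseudobulk testing and a fallback.

    sample_conditions : one condition label per pseudobulk sample
    Returns 'direct' when every condition has at least `min_samples`
    samples, otherwise the configured fallback strategy.
    """
    per_condition = Counter(sample_conditions)
    if per_condition and all(n >= min_samples for n in per_condition.values()):
        return "direct"
    if fallback_strategy is None:
        # Assumed behavior: without a fallback there is nothing to run.
        raise ValueError("too few pseudobulk samples and no fallback configured")
    return fallback_strategy


print(choose_strategy(["group", "group", "group", "rest", "rest", "rest"]))
print(choose_strategy(["group", "group", "rest", "rest", "rest"]))
```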

References

[Squair2021] Squair, J.W., et al. (2021). “Confronting false discoveries in single-cell differential expression.” Nature Communications 12, 5692.
