scbulkde.tl.rank_genes_groups

scbulkde.tl.rank_genes_groups(adata, groupby, *, mask_var=None, use_raw=None, groups='all', reference='rest', n_genes=None, rankby_abs=False, pts=False, key_added=None, layer=None, copy=False, **de_kwargs)

Pseudobulked differential expression analysis that interfaces with scanpy.

This is a drop-in replacement for scanpy.tl.rank_genes_groups that uses pseudobulk aggregation followed by differential expression testing instead of single-cell statistical tests. Testing at the level of biological replicates, rather than individual cells, accounts for sample-to-sample variation in single-cell RNA-seq data.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • groupby (str) – The key of the observations grouping to consider (e.g., 'cell_type', 'cluster').

  • mask_var (ndarray | str | None (default: None)) – Select subset of genes to use in statistical tests. Can be a boolean array or a string key from adata.var.

  • use_raw (bool | None (default: None)) – Use raw attribute of adata if present. The default behavior (None) is to use raw if present. Set to False to force use of normalized data.

  • groups (Union[Literal['all'], Iterable[str]] (default: 'all')) – Subset of groups, e.g. ['g1', 'g2', 'g3'], to which comparison shall be restricted, or 'all' (default), for all groups. Note that if reference='rest' all groups will still be used as the reference, not just those specified in groups.

  • reference (str (default: 'rest')) – If 'rest', compare each group to the union of the rest of the groups. If a group identifier, compare with respect to this specific group. When a specific reference is provided, it will not be tested against itself.

  • n_genes (int | None (default: None)) – The number of genes that appear in the returned tables. Defaults to all genes.

  • rankby_abs (bool (default: False)) – Rank genes by the absolute value of the log fold change, not by the log fold change itself. The returned scores are never the absolute values.

  • pts (bool (default: False)) – Compute the fraction of cells expressing the genes in each group.

  • key_added (str | None (default: None)) – The key in adata.uns under which the results are saved. Defaults to 'rank_genes_groups'.

  • layer (str | None (default: None)) – Key from adata.layers whose value will be used to perform tests on. Cannot be used together with use_raw=True.

  • copy (bool (default: False)) – Whether to copy adata or modify it in place.

  • de_kwargs – Keyword arguments passed to the underlying de() function. Must include pseudobulk-specific parameters such as replicate_key. Common parameters include:

    • replicate_key (str) – Column in adata.obs defining biological replicates (required for pseudobulk).

    • min_cells (int (default: 50)) – Minimum number of cells required per pseudobulk sample.

    • min_fraction (float (default: 0.2)) – Minimum fraction of cells per pseudobulk sample.

    • min_coverage (float (default: 0.75)) – Minimum coverage required per condition.

    • categorical_covariates (Sequence[str], optional) – Categorical covariates to include in the design.

    • continuous_covariates (Sequence[str], optional) – Continuous covariates to include in the design.

    • engine (str (default: 'anova')) – Statistical engine for DE testing ('anova' or 'pydeseq2').

    • fallback_strategy ({'pseudoreplicates', 'single_cell', None} (default: 'pseudoreplicates')) – Strategy when insufficient biological replicates exist.

    • min_samples (int (default: 3)) – Minimum number of pseudobulk samples required per condition for direct testing.
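
Because de_kwargs is forwarded to de(), the pseudobulk options are passed as ordinary keyword arguments alongside the scanpy-style ones. The following is a minimal sketch of such a call, not an official example: the helper names build_de_kwargs and run_example, the obs columns 'donor' and 'cell_type', and the key 'pb_de' are hypothetical, and the default values are copied from the list above.

```python
def build_de_kwargs(replicate_key, **overrides):
    """Collect de() options, starting from the defaults documented above."""
    de_kwargs = {
        "replicate_key": replicate_key,  # required for pseudobulk aggregation
        "min_cells": 50,
        "min_fraction": 0.2,
        "min_coverage": 0.75,
        "engine": "anova",
        "fallback_strategy": "pseudoreplicates",
        "min_samples": 3,
    }
    de_kwargs.update(overrides)
    return de_kwargs


def run_example(adata):
    """Run pseudobulk DE on an AnnData object and return the stored results."""
    # Imported lazily so build_de_kwargs can be used without scbulkde installed.
    import scbulkde

    de_kwargs = build_de_kwargs("donor", engine="pydeseq2")
    scbulkde.tl.rank_genes_groups(
        adata,
        groupby="cell_type",   # hypothetical obs column with group labels
        reference="rest",
        pts=True,
        key_added="pb_de",     # hypothetical key; default is 'rank_genes_groups'
        **de_kwargs,
    )
    return adata.uns["pb_de"]


# The keyword dictionary can be inspected before any data is touched.
example_kwargs = build_de_kwargs("donor", engine="pydeseq2")
```

Keeping the de() options in one dictionary makes it easy to record exactly which pseudobulk settings were used for a given run.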

Return type:

AnnData | None

Returns:

Returns adata if copy=True, otherwise returns None and modifies adata in place. Results are stored in adata.uns[key_added] with the following structure:

  • names (numpy.recarray) – Structured array with top gene names for each group.

  • scores (numpy.recarray) – Structured array with test statistics for each group.

  • logfoldchanges (numpy.recarray) – Structured array with log2 fold changes for each group.

  • pvals (numpy.recarray) – Structured array with p-values for each group.

  • pvals_adj (numpy.recarray) – Structured array with adjusted p-values (FDR) for each group.

  • pts (pd.DataFrame, if pts=True) – Fraction of cells expressing each gene in each group.

  • pts_rest (pd.DataFrame, if pts=True and reference='rest') – Fraction of cells expressing each gene in the rest of the cells.

  • params (dict) – Dictionary containing parameters used for the analysis.
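
The structured arrays can be read one group at a time. The sketch below builds a toy stand-in for adata.uns[key_added] with made-up values and assumes, following scanpy's convention, that each field of a recarray is named after a group and that rows are ordered by rank. The helper significant_genes is hypothetical and not part of scbulkde.

```python
import numpy as np

# Toy stand-in for adata.uns["rank_genes_groups"]; all values are invented.
groups = ["B", "T"]
result = {
    "names": np.rec.fromarrays(
        [["CD79A", "MS4A1"], ["CD3E", "CD3D"]], names=groups
    ),
    "logfoldchanges": np.rec.fromarrays(
        [[3.1, 2.4], [2.8, 2.2]], names=groups
    ),
    "pvals_adj": np.rec.fromarrays(
        [[1e-6, 0.2], [1e-4, 0.03]], names=groups
    ),
}


def significant_genes(result, group, alpha=0.05):
    """Return the gene names of one group whose adjusted p-value is below alpha."""
    names = result["names"][group]      # one field per group
    padj = result["pvals_adj"][group]
    return [str(name) for name, p in zip(names, padj) if p < alpha]


b_hits = significant_genes(result, "B")
t_hits = significant_genes(result, "T")
```

Since the layout mirrors scanpy's, existing code that reads scanpy's rank_genes_groups output in this way should need few or no changes.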

Notes

Unlike scanpy.tl.rank_genes_groups, which uses single-cell statistical tests (t-test, Wilcoxon, etc.), this implementation:

  1. Aggregates cells into pseudobulk samples based on biological replicates

  2. Performs differential expression testing on pseudobulk data

  3. Properly accounts for sample-level variation and biological replicates

Treating cells from the same replicate as independent observations overstates significance. Testing at the replicate level avoids this and reduces false discoveries [Squair2021].

When insufficient biological replicates are available, the function can fall back to pseudoreplicate generation or single-cell testing (controlled by de_kwargs).
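
The aggregation step and the fallback decision can be illustrated with a small, library-free sketch. It is not the scbulkde implementation: summing counts per (group, replicate) pair, and the way min_cells and min_samples are applied here, are assumptions made for illustration, and the function names are hypothetical.

```python
from collections import defaultdict


def pseudobulk_sum(counts, groups, replicates):
    """Sum per-cell gene counts within each (group, replicate) pair."""
    n_genes = len(counts[0])
    sums = defaultdict(lambda: [0] * n_genes)
    n_cells = defaultdict(int)
    for row, group, rep in zip(counts, groups, replicates):
        key = (group, rep)
        n_cells[key] += 1
        for j, value in enumerate(row):
            sums[key][j] += value
    return dict(sums), dict(n_cells)


def choose_strategy(n_cells, group, min_cells=50, min_samples=3,
                    fallback_strategy="pseudoreplicates"):
    """Assumed rule: test directly if enough samples have enough cells."""
    usable = [
        key for key, n in n_cells.items()
        if key[0] == group and n >= min_cells
    ]
    if len(usable) >= min_samples:
        return "direct"
    return fallback_strategy


# Six cells, two genes, two groups, two replicates (toy data).
counts = [[1, 0], [2, 1], [0, 3], [4, 0], [1, 1], [0, 2]]
groups = ["T", "T", "T", "B", "B", "B"]
replicates = ["d1", "d1", "d2", "d1", "d2", "d2"]

sums, n_cells = pseudobulk_sum(counts, groups, replicates)
relaxed = choose_strategy(n_cells, "T", min_cells=1, min_samples=2)
strict = choose_strategy(n_cells, "T", min_cells=2, min_samples=2)
```

With the relaxed thresholds both T samples qualify and the group is tested directly. With min_cells=2 only one sample remains, so the sketch returns the fallback strategy.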

References

[Squair2021]

Squair, J.W., et al. (2021) “Confronting false discoveries in single-cell differential expression.” Nature Communications 12, 5692.

See also

de

Core differential expression function

pp.pseudobulk

Pseudobulk aggregation without DE testing

Examples

n.a.