Calculating Gene Program (GP) Importance Scores

This tutorial demonstrates how to calculate and analyze Tripso Gene Program (GP) importance scores from the GPLearner model. GP importance scores quantify how much each gene program contributes to a cell’s identity, enabling:

What are GP Importance Scores?

GP importance scores are computed using an ablation approach:

  • The model is run with each gene program individually masked/removed

  • We compute the cosine similarity between perturbed cell representation (output of the model with the masked GP) and the original cell representation.

  • The output is (1 - cosine similarity with control cell) such that higher scores = more critical for the cell representation

Prerequisites

  • Completed model training with GP ablation enabled (see training tutorials)

  • Ablation results saved in output_global/ablation/with_gp_ablation/

import scanpy as sc
import anndata as ad
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2. Load GP Importance Scores

The ablation analysis generates AnnData objects where:

  • Cells (observations): Individual cells from your dataset

  • Variables (features): Gene programs (e.g., GP_GATA1, GP_LMO2, etc.)

  • Data matrix (X): GP importance scores for each cell-GP.

Data Loading Options

Option 1 (Recommended for publication): Combine multiple training runs

  • Train model 3 times with different random seeds

  • Average GP importance scores across runs. While patterns across runs are broadly consistentt, this increaases robustness and reduces noise from stochastic training effects

  • (this was the approach for the original Tripso manuscript)

Option 2 (Tutorial): Single run for simplicity

  • Faster to execute

  • Suitable for exploratory analysis

# Load GP importance scores from ablation analysis
# The ablation process masks each GP and measures reconstruction loss change

# # Option 1: Load and combine all data splits
# train = sc.read_h5ad('output_global/ablation/with_gp_ablation/train_set.h5ad')
# val = sc.read_h5ad('output_global/ablation/with_gp_ablation/val_set.h5ad')
# test = sc.read_h5ad('output_global/ablation/with_gp_ablation/test_set.h5ad')
# adata = ad.concat([train, val, test])

# Option 2: Load only test set for faster tutorial execution (commented out)
adata = sc.read_h5ad('output_global/ablation/with_gp_ablation/test_set.h5ad')

adata
/software/cellgen/team292/mm58/venvs/lightning2/lib/python3.10/site-packages/anndata/_core/anndata.py:1754: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")
AnnData object with n_obs × n_vars = 26317 × 25
    obs: 'length', 'AuthorCellType', 'AuthorCellType_Broad', 'cell_type', 'Sorting', 'Study', 'donor', 'sex', 'development_stage', 'age_group', 'n_counts', 'idx', 'batch_key', 'cell_type_id', 'age_group_id', 'batch_key_id'

3. Case Study: Dendritic Cell Subtypes

Dendritic cells (DCs) are immune cells with distinct subtypes that have different functional roles:

  • pDC (plasmacytoid DC): Specialize in antiviral responses, produce type I interferon

  • cDC1 (conventional DC type 1): Cross-present antigens, activate CD8+ T cells

  • cDC2 (conventional DC type 2): Present antigens to CD4+ T cells, Th2 responses

Goal: Identify which gene programs distinguish these functionally distinct DC subtypes. In the next steps, we will look at which genes drive these differences.

3.1 Filter to Dendritic Cell Subtypes

# Subset to three dendritic cell subtypes for differential analysis
dc = adata[
    adata.obs['AuthorCellType_Broad'].isin(['pDC',    # Plasmacytoid dendritic cells
                                 'cDC',   # Conventional dendritic cell
                                 ])
]
dc
View of AnnData object with n_obs × n_vars = 1704 × 25
    obs: 'length', 'AuthorCellType', 'AuthorCellType_Broad', 'cell_type', 'Sorting', 'Study', 'donor', 'sex', 'development_stage', 'age_group', 'n_counts', 'idx', 'batch_key', 'cell_type_id', 'age_group_id', 'batch_key_id'

4. Differential GP Analysis

Now we perform differential gene program analysis - analogous to differential gene expression (DGE), but at the gene program level.

4.1 Rank Gene Programs by Cell Type

We use Scanpy’s rank_genes_groups function, which:

  • Treats GP importance scores like gene expression values

  • For each cell type, identifies significantly enriched/depleted GPs

  • Uses statistical tests (default: t-test with overestimation correction)

  • Returns ranked lists with log fold changes and p-values

Since the data matrix contains GP importance scores (not gene expression), this identifies GPs that are differentially important across cell types.

# Perform differential GP analysis across dendritic cell subtypes
# This identifies which GPs are significantly more/less important in each subtype
sc.tl.rank_genes_groups(
    dc,
    groupby='AuthorCellType_Broad',  # Compare GP importance across cell types
)
/software/cellgen/team292/mm58/venvs/lightning2/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:645: ImplicitModificationWarning: Trying to modify attribute `._uns` of view, initializing view as actual.
  adata.uns[key_added] = {}
/software/cellgen/team292/mm58/venvs/lightning2/lib/python3.10/site-packages/anndata/_core/anndata.py:1754: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")
/software/cellgen/team292/mm58/venvs/lightning2/lib/python3.10/site-packages/anndata/_core/anndata.py:1754: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")

4.2 Extract Results for pDC

Retrieve the ranked list of GPs for plasmacytoid dendritic cells.

Interpretation:

  • Positive logfoldchanges: GPs more important in cDC1 vs. other DC types

  • Negative logfoldchanges: GPs less important in cDC1

  • pvals_adj < 0.05: Statistically significant differences

# Get differential GP results for cDC1
pdc = sc.get.rank_genes_groups_df(dc, group='pDC')

# Show top GPs enriched in cDC1 (positive log fold change, significant)
pdc.head()
names scores logfoldchanges pvals pvals_adj
0 GP_TAL1 10.450618 0.681092 1.619845e-24 1.349871e-23
1 GP_KLF2 9.903563 1.080405 3.133774e-22 1.566887e-21
2 GP_GATA1 9.898601 0.363505 1.696406e-22 1.060254e-21
3 GP_IRF1 8.198027 0.257220 4.826885e-16 1.723888e-15
4 GP_IRF2 7.918591 1.970680 5.059151e-15 1.580985e-14

4.3 Extract Results for cDC

# Get differential GP results for cDC2
cdc = sc.get.rank_genes_groups_df(dc, group='cDC')

# Display top results
cdc.sort_values(by = 'logfoldchanges', ascending = False).head()
names scores logfoldchanges pvals pvals_adj
0 GP_FOXO3 14.252753 3.728169 1.554398e-41 3.885994e-40
3 GP_ATF4 6.225962 1.933915 7.386285e-10 1.678701e-09
4 GP_JUNB 5.841273 1.342999 6.925215e-09 1.331772e-08
9 GP_NFYB 1.489372 0.845979 1.365762e-01 1.484524e-01
6 GP_FLI1 4.539543 0.830437 6.046568e-06 1.007761e-05

5. Visualize GP Importance Distributions

To better understand how GP importance varies across cell types, we create box plots showing the distribution of importance scores.

def make_barplot(pert_data, gp, fig_size=(5, 5), save=None):
    """
    Create box plot showing GP importance score distribution across cell types.
    
    Parameters:
    -----------
    pert_data : AnnData
        Data containing GP importance scores and cell type annotations
    gp : str
        Gene program name (e.g., 'GP_GATA1')
    fig_size : tuple
        Figure dimensions (width, height)
    save : str, optional
        File path to save figure
    """
    # Color palette (may need customization based on your cell types)
    colors = [
        'lightcoral',   # Color for second cell type
        'lightblue',    # Color for third cell type
        'navy',         # Color for fourth cell type (if applicable)
    ]
    
    # Extract GP importance scores and add to observations for plotting
    # This converts the GP column from the data matrix to a metadata column
    pert_data.obs[gp] = pert_data[:, pert_data.var.index == gp].X.toarray().flatten()

    # Create box plot
    plt.figure(figsize=fig_size)
    ax = sns.boxplot(
        data=pert_data.obs,
        y=gp,                                           # GP importance on y-axis
        x="AuthorCellType_Broad",                                  # Cell types on x-axis
        order=pert_data.obs['AuthorCellType_Broad'].cat.categories, # Preserve categorical order
        palette=colors
    )

    # Formatting
    plt.xticks(rotation=90)
    plt.xlabel("Cell type")
    plt.ylabel(f"{gp} importance score")
    plt.title(gp)
    plt.tight_layout(rect=[0, 0, 0.85, 1])  
    
    if save:
        plt.savefig(save)

    plt.show()
# Visualize GP importance distributions for example gene programs
# These GPs were selected based on known roles in hematopoiesis
for gp in ['GP_TAL1', 'GP_FOXO3']:
    make_barplot(dc, gp)
/tmp/ipykernel_1310618/1647168921.py:29: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax = sns.boxplot(
/tmp/ipykernel_1310618/1647168921.py:29: UserWarning: The palette list has more values (3) than needed (2), which may not be intended.
  ax = sns.boxplot(
../../_images/aee8badb415064b0fff1e396bb370b15809cdd9e9cd1b19fb6310c37189239c2.png
/tmp/ipykernel_1310618/1647168921.py:29: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax = sns.boxplot(
/tmp/ipykernel_1310618/1647168921.py:29: UserWarning: The palette list has more values (3) than needed (2), which may not be intended.
  ax = sns.boxplot(
../../_images/2697b0c99fcbbefe4f5eeaa4bc52c828d8eafa8aa42349dbf20da485fe275aa4.png