Gene Program Database Creation
This notebook creates a “gene program database” file (gpdb_tf.csv) for running Tripso, where column names are gene program names and the values are the genes in each gene program. We start from a selection of manually curated gene programs of interest (gp_curated.csv). Here, the gene programs correspond to transcription factor regulons: the input transcription factors were selected based on a literature search, and the target genes were obtained from CollecTRI (paper, github). To ensure that the selected gene programs have sufficient gene coverage across cells, we filter them based on their expression patterns in the bone marrow dataset. Empirically, we find that Tripso does not learn meaningful representations for gene programs with fewer than 5 genes expressed per cell on average.
Inputs:
data/processed/zeng.h5ad: h5ad object with gene expression data
gp_curated.csv: Manually curated set of gene programs
Outputs:
gpdb_tf.csv: Filtered gene program database for Tripso model training
Purpose: Quality control step to ensure gene programs are relevant for the dataset of interest before tokenization and model training.
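As an illustration of the file layout described above, the following minimal sketch builds a tiny gene program table in which each column is a program and each cell holds one gene. The program names (GP_A, GP_B) and gene names are hypothetical placeholders, not entries from CollecTRI, and the NaN padding of shorter columns is an assumption inferred from the `dropna()` call used later in this notebook.

```python
import pandas as pd

# Hypothetical toy programs; names and genes are illustrative only.
toy_programs = {
    "GP_A": ["GENE1", "GENE2", "GENE3"],
    "GP_B": ["GENE2", "GENE4"],
}

# Wrapping each list in a Series lets pandas align the columns on a shared
# row index and pad the shorter program with NaN.
toy_gpdb = pd.DataFrame(
    {name: pd.Series(genes) for name, genes in toy_programs.items()}
)

# One row per gene slot, one column per gene program.
print(toy_gpdb)

# Number of real (non-padding) genes in each program.
program_sizes = toy_gpdb.notna().sum()
print(program_sizes)
```

Because programs differ in size, the number of rows equals the size of the largest program, and every other column carries padding that has to be dropped before use.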
import pandas as pd
import scanpy as sc
import numpy as np
import os
Load data
zeng = sc.read_h5ad('data/processed/zeng.h5ad')
zeng
AnnData object with n_obs × n_vars = 263159 × 27571
obs: 'AuthorCellType', 'AuthorCellType_Broad', 'cell_type', 'Sorting', 'Study', 'donor', 'sex', 'development_stage', 'age_group', 'n_counts'
var: 'HCA_Hay2018', 'Oetjen2018', 'Granja2019', 'Mende2022', 'Setty2019', 'Ainciburu2023', 'HVG_intersect3000', 'nCells_Detected', 'nDatasets_Detected', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type', 'ensembl_id'
For each GP, check the average number of its genes expressed per cell
→ GPs with too few expressed genes can then be filtered out
gpdb = pd.read_csv('gp_curated.csv')
print('Selected gene programs:', *gpdb.columns)
Selected gene programs: GP_USF1 GP_NFE2L2 GP_RUNX1 GP_FOXO3 GP_MYB GP_E2F4 GP_IRF1 GP_GATA1 GP_CTCF GP_MYCN GP_ATF4 GP_ATF3 GP_JUNB GP_JUND GP_GATA2 GP_DDIT3 GP_TAL1 GP_FLI1 GP_ELF1 GP_RUNX3 GP_KLF2 GP_PRDM1 GP_IRF2 GP_NR2C2 GP_ERG GP_IKZF1 GP_SNAI2 GP_NFYB GP_HOXA9 GP_IRF5 GP_ZBTB7A GP_KLF1 GP_LMO2 GP_NFIX GP_ETV6 GP_MEIS1 GP_SOX6 GP_NFE2
gp_to_drop = []
for gp in gpdb.columns:
    print(f"Processing gene group: {gp}")  # Print the name of the current gene group
    genes = gpdb[gp].dropna().values  # Get the list of non-null gene names in the current group
    # Subset adata to only include genes that are in the current list of genes
    bdata = zeng[:, zeng.var.index.isin(genes)]
    # Calculate the number of genes with non-zero expression in each cell
    non_zero_counts = (bdata.X > 0).sum(axis=1)  # Sum of non-zero values for each row (cell)
    # Calculate the average number of genes with non-zero expression per cell
    average_non_zero_genes = non_zero_counts.mean()
    print(f" Average number of genes with non-zero expression per cell: {average_non_zero_genes:.2f}")
    print('')
    if average_non_zero_genes < 5:
        gp_to_drop.append(gp)
Processing gene group: GP_USF1
Average number of genes with non-zero expression per cell: 22.98
Processing gene group: GP_NFE2L2
Average number of genes with non-zero expression per cell: 26.42
Processing gene group: GP_RUNX1
Average number of genes with non-zero expression per cell: 21.91
Processing gene group: GP_FOXO3
Average number of genes with non-zero expression per cell: 21.76
Processing gene group: GP_MYB
Average number of genes with non-zero expression per cell: 24.90
Processing gene group: GP_E2F4
Average number of genes with non-zero expression per cell: 26.47
Processing gene group: GP_IRF1
Average number of genes with non-zero expression per cell: 16.65
Processing gene group: GP_GATA1
Average number of genes with non-zero expression per cell: 16.10
Processing gene group: GP_CTCF
Average number of genes with non-zero expression per cell: 14.38
Processing gene group: GP_MYCN
Average number of genes with non-zero expression per cell: 15.64
Processing gene group: GP_ATF4
Average number of genes with non-zero expression per cell: 11.82
Processing gene group: GP_ATF3
Average number of genes with non-zero expression per cell: 11.97
Processing gene group: GP_JUNB
Average number of genes with non-zero expression per cell: 9.24
Processing gene group: GP_JUND
Average number of genes with non-zero expression per cell: 8.55
Processing gene group: GP_GATA2
Average number of genes with non-zero expression per cell: 9.29
Processing gene group: GP_DDIT3
Average number of genes with non-zero expression per cell: 13.03
Processing gene group: GP_TAL1
Average number of genes with non-zero expression per cell: 9.53
Processing gene group: GP_FLI1
Average number of genes with non-zero expression per cell: 9.36
Processing gene group: GP_ELF1
Average number of genes with non-zero expression per cell: 8.37
Processing gene group: GP_RUNX3
Average number of genes with non-zero expression per cell: 7.55
Processing gene group: GP_KLF2
Average number of genes with non-zero expression per cell: 6.60
Processing gene group: GP_PRDM1
Average number of genes with non-zero expression per cell: 7.58
Processing gene group: GP_IRF2
Average number of genes with non-zero expression per cell: 6.21
Processing gene group: GP_NR2C2
Average number of genes with non-zero expression per cell: 4.27
Processing gene group: GP_ERG
Average number of genes with non-zero expression per cell: 4.92
Processing gene group: GP_IKZF1
Average number of genes with non-zero expression per cell: 4.08
Processing gene group: GP_SNAI2
Average number of genes with non-zero expression per cell: 4.42
Processing gene group: GP_NFYB
Average number of genes with non-zero expression per cell: 6.97
Processing gene group: GP_HOXA9
Average number of genes with non-zero expression per cell: 4.52
Processing gene group: GP_IRF5
Average number of genes with non-zero expression per cell: 1.65
Processing gene group: GP_ZBTB7A
Average number of genes with non-zero expression per cell: 5.23
Processing gene group: GP_KLF1
Average number of genes with non-zero expression per cell: 3.53
Processing gene group: GP_LMO2
Average number of genes with non-zero expression per cell: 3.48
Processing gene group: GP_NFIX
Average number of genes with non-zero expression per cell: 0.71
Processing gene group: GP_ETV6
Average number of genes with non-zero expression per cell: 2.44
Processing gene group: GP_MEIS1
Average number of genes with non-zero expression per cell: 1.40
Processing gene group: GP_SOX6
Average number of genes with non-zero expression per cell: 1.81
Processing gene group: GP_NFE2
Average number of genes with non-zero expression per cell: 2.79
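The coverage check above can also be written as a small helper and tried on a toy matrix, which makes the threshold logic easy to verify by hand. This is a sketch under stated assumptions, not the notebook's own code: the function name `mean_genes_detected`, the toy matrix, the gene names, and the toy threshold of 1.0 are all hypothetical (the notebook itself uses a threshold of 5 on the real data).

```python
import numpy as np
import pandas as pd


def mean_genes_detected(X, var_names, genes):
    """Mean number of program genes with non-zero expression per cell.

    Hypothetical helper: X is a cells x genes matrix, var_names labels its
    columns, and genes is the gene list of one program.
    """
    mask = np.isin(np.asarray(var_names), list(genes))
    # np.asarray(...).ravel() also flattens the matrix-shaped result that
    # SciPy sparse matrices return from .sum(axis=1).
    detected = np.asarray((X[:, mask] > 0).sum(axis=1)).ravel()
    return float(detected.mean())


# Toy expression matrix: 4 cells x 5 genes (illustrative values).
X = np.array([
    [1, 0, 2, 0, 0],
    [0, 0, 3, 1, 0],
    [5, 1, 0, 0, 0],
    [0, 0, 0, 0, 4],
])
var_names = ["G1", "G2", "G3", "G4", "G5"]

toy_gpdb = pd.DataFrame({
    "GP_A": pd.Series(["G1", "G2", "G3"]),
    "GP_B": pd.Series(["G5", "G9"]),  # G9 is absent from the matrix and is ignored
})

coverage = pd.Series({
    gp: mean_genes_detected(X, var_names, toy_gpdb[gp].dropna())
    for gp in toy_gpdb.columns
})

min_genes = 1.0  # toy threshold; the notebook uses 5
kept = coverage[coverage >= min_genes].index.tolist()
print(coverage)
print("Kept:", kept)
```

Collecting the per-program averages in a Series keeps the filtering decision separate from the computation, so the threshold can be changed without recomputing coverage.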
gpdb = gpdb.drop(columns = gp_to_drop)
gpdb.shape
(227, 25)
gpdb.to_csv('gpdb_tf.csv', index = False)
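As a quick check on the saved file, the sketch below writes a toy table with `index=False`, reads it back, and converts it into a mapping from program name to gene list. It uses a temporary directory and a hypothetical file name (toy_gpdb.csv) so that it does not touch gpdb_tf.csv; the toy programs are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy table with programs of unequal size.
toy = pd.DataFrame({
    "GP_A": pd.Series(["G1", "G2", "G3"]),
    "GP_B": pd.Series(["G5"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_gpdb.csv")  # placeholder file name
    toy.to_csv(path, index=False)  # same call pattern as in the notebook
    loaded = pd.read_csv(path)     # empty fields come back as NaN

# Drop the padding to recover one gene list per program.
programs = {gp: loaded[gp].dropna().tolist() for gp in loaded.columns}
print(programs)
```

Writing with `index=False` keeps the row index out of the file, so every column read back is a gene program.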