{ "cells": [ { "cell_type": "markdown", "id": "bb3e897b", "metadata": {}, "source": [ "# Gene Program Database Creation\n", "\n", "This notebook creates a \"gene program database\" file (`gpdb_tf.csv`) for running Tripso, in which each column name is a gene program (GP) name and the column values are the genes in that program. We start from a selection of manually curated gene programs of interest (`gp_curated.csv`). These gene programs correspond to transcription factor regulons: the input transcription factors were selected based on a literature search, and the target genes were obtained from CollecTRI ([paper](https://academic.oup.com/nar/article/51/20/10934/7318114), [github](https://github.com/saezlab/CollecTRI)). To ensure that the selected gene programs have sufficient gene coverage across cells, we filter them based on their expression patterns in the bone marrow dataset. Empirically, we find that Tripso does not learn meaningful representations for gene programs with fewer than 5 genes expressed per cell on average, so those programs are dropped.\n", "\n", "**Inputs:**\n", "- `data/processed/zeng.h5ad`: h5ad object with gene expression data\n", "- `gp_curated.csv`: Manually curated set of gene programs\n", "\n", "**Outputs:**\n", "- `gpdb_tf.csv`: Filtered gene program database for Tripso model training\n", "\n", "**Purpose:** Quality control step to ensure gene programs are relevant for the dataset of interest before tokenization and model training." 
] }, { "cell_type": "code", "execution_count": 1, "id": "138317a4-141d-4675-bfc6-e3f4f976570c", "metadata": { "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "import scanpy as sc\n", "import numpy as np\n", "import os" ] }, { "cell_type": "code", "execution_count": null, "id": "ea426f42-6e2f-449c-8a9e-73b9dbf4a624", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e07e9027-2024-42bb-9294-571d196706dd", "metadata": {}, "source": [ "## Load data" ] }, { "cell_type": "code", "execution_count": 2, "id": "7bd021fb-2e9c-4f2a-b39e-3278c90a35f1", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 263159 × 27571\n", " obs: 'AuthorCellType', 'AuthorCellType_Broad', 'cell_type', 'Sorting', 'Study', 'donor', 'sex', 'development_stage', 'age_group', 'n_counts'\n", " var: 'HCA_Hay2018', 'Oetjen2018', 'Granja2019', 'Mende2022', 'Setty2019', 'Ainciburu2023', 'HVG_intersect3000', 'nCells_Detected', 'nDatasets_Detected', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type', 'ensembl_id'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zeng = sc.read_h5ad('data/processed/zeng.h5ad')\n", "zeng" ] }, { "cell_type": "code", "execution_count": null, "id": "67a4b6e2-561c-42c8-b50a-d4239a64ceb4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d2290fb5-4361-4f2e-a13c-2b85723d456f", "metadata": {}, "source": [ "## For each GP, check the average number of expressed genes per cell\n", "→ Use this average to filter out GPs with too few expressed genes" ] }, { "cell_type": "code", "execution_count": null, "id": "93ac9499-9625-4da2-b5d2-182c43de9529", "metadata": { "tags": [] }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 3, "id": "3bf0c07a-42b8-43a6-baf0-57beb30efbc4", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selected gene 
programs: GP_USF1 GP_NFE2L2 GP_RUNX1 GP_FOXO3 GP_MYB GP_E2F4 GP_IRF1 GP_GATA1 GP_CTCF GP_MYCN GP_ATF4 GP_ATF3 GP_JUNB GP_JUND GP_GATA2 GP_DDIT3 GP_TAL1 GP_FLI1 GP_ELF1 GP_RUNX3 GP_KLF2 GP_PRDM1 GP_IRF2 GP_NR2C2 GP_ERG GP_IKZF1 GP_SNAI2 GP_NFYB GP_HOXA9 GP_IRF5 GP_ZBTB7A GP_KLF1 GP_LMO2 GP_NFIX GP_ETV6 GP_MEIS1 GP_SOX6 GP_NFE2\n" ] } ], "source": [ "gpdb = pd.read_csv('gp_curated.csv')\n", "print('Selected gene programs:', *gpdb.columns)" ] }, { "cell_type": "code", "execution_count": 4, "id": "d03d0e58-73fe-4e71-9099-a673e5e586c1", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processing gene group: GP_USF1\n", " Average number of genes with non-zero expression per cell: 22.98\n", "\n", "Processing gene group: GP_NFE2L2\n", " Average number of genes with non-zero expression per cell: 26.42\n", "\n", "Processing gene group: GP_RUNX1\n", " Average number of genes with non-zero expression per cell: 21.91\n", "\n", "Processing gene group: GP_FOXO3\n", " Average number of genes with non-zero expression per cell: 21.76\n", "\n", "Processing gene group: GP_MYB\n", " Average number of genes with non-zero expression per cell: 24.90\n", "\n", "Processing gene group: GP_E2F4\n", " Average number of genes with non-zero expression per cell: 26.47\n", "\n", "Processing gene group: GP_IRF1\n", " Average number of genes with non-zero expression per cell: 16.65\n", "\n", "Processing gene group: GP_GATA1\n", " Average number of genes with non-zero expression per cell: 16.10\n", "\n", "Processing gene group: GP_CTCF\n", " Average number of genes with non-zero expression per cell: 14.38\n", "\n", "Processing gene group: GP_MYCN\n", " Average number of genes with non-zero expression per cell: 15.64\n", "\n", "Processing gene group: GP_ATF4\n", " Average number of genes with non-zero expression per cell: 11.82\n", "\n", "Processing gene group: GP_ATF3\n", " Average number of genes with non-zero expression per cell: 11.97\n", "\n", 
"Processing gene group: GP_JUNB\n", " Average number of genes with non-zero expression per cell: 9.24\n", "\n", "Processing gene group: GP_JUND\n", " Average number of genes with non-zero expression per cell: 8.55\n", "\n", "Processing gene group: GP_GATA2\n", " Average number of genes with non-zero expression per cell: 9.29\n", "\n", "Processing gene group: GP_DDIT3\n", " Average number of genes with non-zero expression per cell: 13.03\n", "\n", "Processing gene group: GP_TAL1\n", " Average number of genes with non-zero expression per cell: 9.53\n", "\n", "Processing gene group: GP_FLI1\n", " Average number of genes with non-zero expression per cell: 9.36\n", "\n", "Processing gene group: GP_ELF1\n", " Average number of genes with non-zero expression per cell: 8.37\n", "\n", "Processing gene group: GP_RUNX3\n", " Average number of genes with non-zero expression per cell: 7.55\n", "\n", "Processing gene group: GP_KLF2\n", " Average number of genes with non-zero expression per cell: 6.60\n", "\n", "Processing gene group: GP_PRDM1\n", " Average number of genes with non-zero expression per cell: 7.58\n", "\n", "Processing gene group: GP_IRF2\n", " Average number of genes with non-zero expression per cell: 6.21\n", "\n", "Processing gene group: GP_NR2C2\n", " Average number of genes with non-zero expression per cell: 4.27\n", "\n", "Processing gene group: GP_ERG\n", " Average number of genes with non-zero expression per cell: 4.92\n", "\n", "Processing gene group: GP_IKZF1\n", " Average number of genes with non-zero expression per cell: 4.08\n", "\n", "Processing gene group: GP_SNAI2\n", " Average number of genes with non-zero expression per cell: 4.42\n", "\n", "Processing gene group: GP_NFYB\n", " Average number of genes with non-zero expression per cell: 6.97\n", "\n", "Processing gene group: GP_HOXA9\n", " Average number of genes with non-zero expression per cell: 4.52\n", "\n", "Processing gene group: GP_IRF5\n", " Average number of genes with non-zero expression 
per cell: 1.65\n", "\n", "Processing gene group: GP_ZBTB7A\n", " Average number of genes with non-zero expression per cell: 5.23\n", "\n", "Processing gene group: GP_KLF1\n", " Average number of genes with non-zero expression per cell: 3.53\n", "\n", "Processing gene group: GP_LMO2\n", " Average number of genes with non-zero expression per cell: 3.48\n", "\n", "Processing gene group: GP_NFIX\n", " Average number of genes with non-zero expression per cell: 0.71\n", "\n", "Processing gene group: GP_ETV6\n", " Average number of genes with non-zero expression per cell: 2.44\n", "\n", "Processing gene group: GP_MEIS1\n", " Average number of genes with non-zero expression per cell: 1.40\n", "\n", "Processing gene group: GP_SOX6\n", " Average number of genes with non-zero expression per cell: 1.81\n", "\n", "Processing gene group: GP_NFE2\n", " Average number of genes with non-zero expression per cell: 2.79\n", "\n" ] } ], "source": [ "gp_to_drop = []\n", "\n", "for gp in gpdb.columns:\n", " print(f\"Processing gene group: {gp}\") # Print the name of the current gene group\n", " \n", " genes = gpdb[gp].dropna().values # Get the list of non-null gene names in the current group\n", " \n", " # Subset adata to only include genes that are in the current list of genes\n", " bdata = zeng[:, zeng.var.index.isin(genes)] \n", " \n", " # Calculate the number of genes with non-zero expression in each cell\n", " non_zero_counts = (bdata.X > 0).sum(axis=1) # Sum of non-zero values for each row (cell)\n", " \n", " # Calculate the average number of genes with non-zero expression per cell\n", " average_non_zero_genes = non_zero_counts.mean() \n", " \n", " print(f\" Average number of genes with non-zero expression per cell: {average_non_zero_genes:.2f}\")\n", " print('')\n", " \n", " if average_non_zero_genes < 5:\n", " gp_to_drop.append(gp)\n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "bf3e6925-0d94-47c1-85d4-d1001704980a", "metadata": { "tags": [] }, "outputs": [], 
"source": [] }, { "cell_type": "code", "execution_count": 5, "id": "90daa7d2-0054-4abe-9274-57a579a04bd7", "metadata": { "tags": [] }, "outputs": [], "source": [ "gpdb = gpdb.drop(columns = gp_to_drop)" ] }, { "cell_type": "code", "execution_count": 6, "id": "d569ee6c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(227, 25)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpdb.shape" ] }, { "cell_type": "code", "execution_count": 7, "id": "9ee075d7-486b-416d-92c8-edcc91571ab0", "metadata": { "tags": [] }, "outputs": [], "source": [ "gpdb.to_csv('gpdb_tf.csv', index = False)" ] }, { "cell_type": "code", "execution_count": null, "id": "3c337876-624f-4a33-bac0-7f85300670ed", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "982bf448-7cab-4e6f-a2d0-2e2e070fa1c5", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "03ca5534-98f7-4da8-ae42-e89b378fe0f0", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "lightning", "language": "python", "name": "lightning" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }