scRNA-seq analysis

In this tutorial, we’ll utilize the 10X PBMCs scRNA-seq dataset to showcase the steps for running SIMBA on scRNA-seq datasets.

[1]:

import os
import simba as si
si.__version__

[1]:

'1.2'

[2]:

workdir = 'result_simba_rnaseq'
si.settings.set_workdir(workdir)

Saving results in: result_simba_rnaseq

[3]:

si.settings.set_figure_params(dpi=80,
                              style='white',
                              fig_size=[5,5],
                              rc={'image.cmap': 'viridis'})

[4]:

# make plots prettier
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('retina')

[ ]:

load example data

[5]:

adata_CG = si.datasets.rna_10xpmbc3k()

Downloading data ...

rna_10xpmbc3k.h5ad: 21.5MB [00:07, 2.75MB/s]

Downloaded to result_simba_rnaseq/data.

[6]:

adata_CG

[6]:

AnnData object with n_obs × n_vars = 2700 × 32738
    obs: 'celltype'
    var: 'gene_ids'

[ ]:

preprocessing

[7]:

si.pp.filter_genes(adata_CG,min_n_cells=3)

Before filtering:
2700 cells, 32738 genes
Filter genes based on min_n_cells
After filtering out low-expressed genes:
2700 cells, 13714 genes

[8]:

si.pp.cal_qc_rna(adata_CG)

[9]:

si.pl.violin(adata_CG,list_obs=['n_counts','n_genes','pct_mt'])

_images/rna_10xpmbc_all_genes_v1.2_14_0.png

Filter out cells if needed:

si.pp.filter_cells_rna(adata,min_n_genes=100)

[10]:

si.pp.normalize(adata_CG,method='lib_size')
si.pp.log_transform(adata_CG)

[ ]:

Optionally, variable gene selection step can be also performed.

si.pp.select_variable_genes(adata_CG, n_top_genes=2000)
si.pl.variable_genes(adata_CG,show_texts=True)

By encoding only variable genes into the graph, the training procedure can be accelerated. However, the embeddings of non-variable genes will not be acquired.

[ ]:

discretize RNA expression

This step groups non-zero gene expression values into distinct levels while preserving the original distribution, enabling recognition of different levels of gene expression during the subsequent training.

[11]:

si.tl.discretize(adata_CG,n_bins=5)
si.pl.discretize(adata_CG,kde=False)

_images/rna_10xpmbc_all_genes_v1.2_22_0.png

[ ]:

generate a graph for training

Observations and variables within each Anndata object are both represented as nodes (entities).
Anndata objects specified in list_CG (or other list parameters like list_adata) will have their nodes automatically matched by comparing their .obs_names and .var_names.
The non-zero values in .layers['simba'] (by default) or .X (if .layers['simba'] does not exist) denote edges between nodes.

For gene expression data, by default, different levels of gene expression are encoded by distinct relations with different weights.

[12]:

si.tl.gen_graph(list_CG=[adata_CG],
                layer='simba',
                use_highly_variable=False,
                dirname='graph0')

relation0: source: C, destination: G
#edges: 590134
relation1: source: C, destination: G
#edges: 1034817
relation2: source: C, destination: G
#edges: 384939
relation3: source: C, destination: G
#edges: 185485
relation4: source: C, destination: G
#edges: 87601
Total number of edges: 2282976
Writing graph file "pbg_graph.txt" to "result_simba_rnaseq/pbg/graph0" ...
Finished.

[ ]:

PBG training

Before training, let’s take a look at the current parameters:

[13]:

si.settings.pbg_params

[13]:

{'entity_path': 'result_simba_rnaseq/pbg/graph0/input/entity',
 'edge_paths': ['result_simba_rnaseq/pbg/graph0/input/edge'],
 'checkpoint_path': '',
 'entities': {'C': {'num_partitions': 1}, 'G': {'num_partitions': 1}},
 'relations': [{'name': 'r0',
   'lhs': 'C',
   'rhs': 'G',
   'operator': 'none',
   'weight': 1.0},
  {'name': 'r1', 'lhs': 'C', 'rhs': 'G', 'operator': 'none', 'weight': 2.0},
  {'name': 'r2', 'lhs': 'C', 'rhs': 'G', 'operator': 'none', 'weight': 3.0},
  {'name': 'r3', 'lhs': 'C', 'rhs': 'G', 'operator': 'none', 'weight': 4.0},
  {'name': 'r4', 'lhs': 'C', 'rhs': 'G', 'operator': 'none', 'weight': 5.0}],
 'dynamic_relations': False,
 'dimension': 50,
 'global_emb': False,
 'comparator': 'dot',
 'num_epochs': 10,
 'workers': 4,
 'num_batch_negs': 50,
 'num_uniform_negs': 50,
 'loss_fn': 'softmax',
 'lr': 0.1,
 'early_stopping': False,
 'regularization_coef': 0.0,
 'wd': 0.0,
 'wd_interval': 50,
 'eval_fraction': 0.05,
 'eval_num_batch_negs': 50,
 'eval_num_uniform_negs': 50,
 'checkpoint_preservation_interval': None}

[ ]:

Typically, the only parameter that may require adjustment is weight decay (wd), as indicated by the pbg metric plots presented below.

In the majority of cases, the automatically determined wd value (achieved by enabling auto_wd=True) performs adequately.

If no adjustments to the parameters are necessary, training can be performed simply by using the following:

si.tl.pbg_train(auto_wd=True, save_wd=True, output='model')

[ ]:

Here we demonstrate how to modify training-related parameters if necessary.

E.g. if we desire to change wd_interval and workers:

[14]:

# modify parameters
dict_config = si.settings.pbg_params.copy()
# dict_config['wd'] = 0.015521
dict_config['wd_interval'] = 10 # we usually set `wd_interval` to 10 for scRNA-seq datasets for a slower but finer training
dict_config['workers'] = 12 #The number of CPUs.

## start training
si.tl.pbg_train(pbg_params = dict_config, auto_wd=True, save_wd=True, output='model')

Auto-estimating weight decay ...
`.settings.pbg_params['wd']` has been updated to 0.015521
Weight decay being used for training is 0.015521
Converting input data ...
[2023-03-31 15:11:33.795264] Using the 5 relation types given in the config
[2023-03-31 15:11:33.795869] Searching for the entities in the edge files...
[2023-03-31 15:11:36.423366] Entity type C:
[2023-03-31 15:11:36.423785] - Found 2700 entities
[2023-03-31 15:11:36.424074] - Removing the ones with fewer than 1 occurrences...
[2023-03-31 15:11:36.424694] - Left with 2700 entities
[2023-03-31 15:11:36.424994] - Shuffling them...
[2023-03-31 15:11:36.426491] Entity type G:
[2023-03-31 15:11:36.426764] - Found 13714 entities
[2023-03-31 15:11:36.427025] - Removing the ones with fewer than 1 occurrences...
[2023-03-31 15:11:36.429263] - Left with 13714 entities
[2023-03-31 15:11:36.429539] - Shuffling them...
[2023-03-31 15:11:36.436280] Preparing counts and dictionaries for entities and relation types:
[2023-03-31 15:11:36.437811] - Writing count of entity type C and partition 0
[2023-03-31 15:11:36.439228] - Writing count of entity type G and partition 0
[2023-03-31 15:11:36.443691] Preparing edge path result_simba_rnaseq/pbg/graph0/input/edge, out of the edges found in result_simba_rnaseq/pbg/graph0/pbg_graph.txt
using fast version
[2023-03-31 15:11:36.444134] Taking the fast train!
[2023-03-31 15:11:36.846039] - Processed 100000 edges so far...
[2023-03-31 15:11:37.247051] - Processed 200000 edges so far...
[2023-03-31 15:11:37.648623] - Processed 300000 edges so far...
[2023-03-31 15:11:38.052075] - Processed 400000 edges so far...
[2023-03-31 15:11:38.454178] - Processed 500000 edges so far...
[2023-03-31 15:11:38.855512] - Processed 600000 edges so far...
[2023-03-31 15:11:39.255567] - Processed 700000 edges so far...
[2023-03-31 15:11:39.655808] - Processed 800000 edges so far...
[2023-03-31 15:11:40.058567] - Processed 900000 edges so far...
[2023-03-31 15:11:40.459357] - Processed 1000000 edges so far...
[2023-03-31 15:11:40.860200] - Processed 1100000 edges so far...
[2023-03-31 15:11:41.262523] - Processed 1200000 edges so far...
[2023-03-31 15:11:41.663912] - Processed 1300000 edges so far...
[2023-03-31 15:11:42.065256] - Processed 1400000 edges so far...
[2023-03-31 15:11:42.465419] - Processed 1500000 edges so far...
[2023-03-31 15:11:42.867537] - Processed 1600000 edges so far...
[2023-03-31 15:11:43.266929] - Processed 1700000 edges so far...
[2023-03-31 15:11:43.667747] - Processed 1800000 edges so far...
[2023-03-31 15:11:44.066249] - Processed 1900000 edges so far...
[2023-03-31 15:11:44.464583] - Processed 2000000 edges so far...
[2023-03-31 15:11:44.860859] - Processed 2100000 edges so far...
[2023-03-31 15:11:45.268806] - Processed 2200000 edges so far...
[2023-03-31 15:11:48.484348] - Processed 2282976 edges in total
Starting training ...
Finished

[ ]:

Plotting training metrics to monitor overfitting and evaluate the final model. Ideally, after a certain number of epochs, the metric curve should stabilize and remain steady.

By default, training loss (lower is better) along with validation loss (lower is better) and MRR (higher is better) are shown.

[15]:

si.pl.pbg_metrics(fig_ncol=1)

_images/rna_10xpmbc_all_genes_v1.2_38_0.png

[ ]:

If further parameter tuning is required, generating new embeddings using updated parameters while training on the same graph can be achieved by updating the output variable, e.g.,

si.tl.pbg_train(pbg_params = dict_config, auto_wd=True, save_wd=True, output="model2")

The new result will be stored in ./result_simba_rnaseq/pbg/graph0/model2.

[ ]:

If needed, the PBG training parameters can be updated using the following function, once the training parameters have been finalized:

si.settings.set_pbg_params(dict_config)

If no parameters are specified, si.settings.set_pbg_params() will reset PBG training parameters to the step before graph generation.

[ ]:

To load the trained result, the following steps can be taken:

By default, it uses the current training result stored in .setting.pbg_params

# load back 'graph0'
si.load_graph_stats()
# load back 'model' that was trained on 'graph0'
si.load_pbg_config()

Users can also specify different paths, e.g.,

# load back 'graph1'
si.load_graph_stats(path='./result_simba_rnaseq/pbg/graph1/')
# load back 'model2' that was trained on 'graph1'
si.load_pbg_config(path='./result_simba_rnaseq/pbg/graph1/model2/')

[ ]:

Post-training analysis

[20]:

# read in entity embeddings obtained from pbg training.
dict_adata = si.read_embedding()
dict_adata

[20]:

{'C': AnnData object with n_obs × n_vars = 2700 × 50,
 'G': AnnData object with n_obs × n_vars = 13714 × 50}

[21]:

adata_C = dict_adata['C']  # embeddings of cells
adata_G = dict_adata['G']  # embeddings of genes

[ ]:

visualize embeddings of cells

[22]:

## Add annotation of celltypes (optional)
adata_C.obs['celltype'] = adata_CG[adata_C.obs_names,:].obs['celltype'].copy()
adata_C

[22]:

AnnData object with n_obs × n_vars = 2700 × 50
    obs: 'celltype'

[23]:

palette_celltype={'B':'#1f77b4',
                  'CD4 T':'#ff7f0e',
                  'CD8 T':'#279e68',
                  'Dendritic':"#aa40fc",
                  'CD14 Monocytes':'#d62728',
                  'FCGR3A Monocytes':'#b5bd61',
                  'Megakaryocytes':'#e377c2',
                  'NK':'#8c564b'}

[24]:

si.tl.umap(adata_C,n_neighbors=15,n_components=2)
si.pl.umap(adata_C,color=['celltype'],
           dict_palette={'celltype': palette_celltype},
           fig_size=(6,4),
           drawing_order='random')

_images/rna_10xpmbc_all_genes_v1.2_53_0.png

[ ]:

visualize embeddings of cells and all genes

[25]:

# embed cells and genes into the same space
adata_all = si.tl.embed(adata_ref=adata_C,list_adata_query=[adata_G])
adata_all.obs.head()

Performing softmax transformation for query data 0;

[25]:

	celltype	id_dataset
GACTCCTGTTGGTG-1	CD14 Monocytes	ref
TCTAACACCAGTTG-1	NK	ref
GAAACCTGTGCTAG-1	CD4 T	ref
CATTACACCAACTG-1	NK	ref
ACTCAGGATTCGTT-1	CD14 Monocytes	ref

[26]:

## add annotations of cells and genes
adata_all.obs['entity_anno'] = ""
adata_all.obs.loc[adata_C.obs_names, 'entity_anno'] = adata_all.obs.loc[adata_C.obs_names, 'celltype']
adata_all.obs.loc[adata_G.obs_names, 'entity_anno'] = 'gene'
adata_all.obs.head()

[26]:

	celltype	id_dataset	entity_anno
GACTCCTGTTGGTG-1	CD14 Monocytes	ref	CD14 Monocytes
TCTAACACCAGTTG-1	NK	ref	NK
GAAACCTGTGCTAG-1	CD4 T	ref	CD4 T
CATTACACCAACTG-1	NK	ref	NK
ACTCAGGATTCGTT-1	CD14 Monocytes	ref	CD14 Monocytes

[27]:

palette_entity_anno = palette_celltype.copy()
palette_entity_anno['gene'] = "#607e95"

[28]:

si.tl.umap(adata_all,n_neighbors=15,n_components=2)
si.pl.umap(adata_all,color=['id_dataset','entity_anno'],
           dict_palette={'entity_anno': palette_entity_anno},
           drawing_order='original',
           fig_size=(6,5))

_images/rna_10xpmbc_all_genes_v1.2_59_0.png

[29]:

# highlight some marker genes in the co-embedding space
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
si.pl.umap(adata_all[::-1,],color=['entity_anno'],dict_palette={'entity_anno': palette_entity_anno},
           drawing_order='original',
           texts=marker_genes + ['GAPDH', 'B2M'],
           show_texts=True,
           fig_size=(8,6))

_images/rna_10xpmbc_all_genes_v1.2_60_0.png

[ ]:

We can also visualize only partial genes (e.g., variable genes) instead of all genes by following the steps below:

# obtain variable genes
si.pp.select_variable_genes(adata_CG, n_top_genes=3000)
var_genes = adata_CG.var_names[adata_CG.var['highly_variable']].tolist()
# obtain SIMBA embeddings of cells and variable genes
adata_all2 = adata_all[list(adata_C.obs_names) + var_genes,].copy()
# visualize them using UMAP
si.tl.umap(adata_all2,n_neighbors=15,n_components=2)
si.pl.umap(adata_all2,color=['entity_anno'],
           dict_palette={'entity_anno': palette_entity_anno},
           drawing_order='original',
           fig_size=(6,5))

[ ]:

clustering-free marker discovery

compute SIMBA metrics including max value, Gini index, standard deviation, and entropy, to assess cell-type specificity of genes:

max: a higher value indicates higher cell-type specificity
gini: a higher value indicates higher cell-type specificity
std: a higher value indicates higher cell-type specificity
entropy: a lower value indicates higher cell-type specificity

[30]:

adata_cmp = si.tl.compare_entities(adata_ref=adata_C, adata_query=adata_G)
adata_cmp

[30]:

AnnData object with n_obs × n_vars = 2700 × 13714
    obs: 'celltype'
    var: 'max', 'std', 'gini', 'entropy'
    layers: 'norm', 'softmax'

[31]:

# genes can be ranked based on either one or multiple metrics
adata_cmp.var.head()

[31]:

	max	std	gini	entropy
YIPF5	0.540598	0.229048	0.129493	7.874485
ZDHHC7	0.652473	0.322003	0.184971	7.846139
PSAP	1.165987	0.656630	0.384751	7.636724
ARL4C	1.068365	0.625148	0.327085	7.730876
MYO19	0.325210	0.109136	0.060507	7.894761

[32]:

# SIMBA metrics can be visualized using the following function:
si.pl.entity_metrics(adata_cmp,
                     x='max',
                     y='gini',
                     show_contour=False,
                     texts=marker_genes + ['GAPDH', 'B2M'],
                     show_texts=True,
                     show_cutoff=True,
                     size=5,
                     text_expand=(1.3,1.5),
                     cutoff_x=1.,
                     cutoff_y=0.3,
                     save_fig=False)

_images/rna_10xpmbc_all_genes_v1.2_68_0.png

[ ]:

SIMBA barcode plot

SIMBA barcode plots visualize feature-cell probability based on edge confidence: - Imbalance indicates cell-type specificity - uniform distribution suggests non-specificity.

SIMBA barcode plots

[33]:

# add annoations of cells
adata_cmp.obs['celltype'] = adata_CG.obs.loc[adata_cmp.obs_names,'celltype']
list_genes = ['CST3', 'NKG7', 'MS4A1', 'GAPDH']
si.pl.entity_barcode(adata_cmp,
                     layer='softmax',
                     entities=list_genes,
                     anno_ref='celltype',
                     show_cutoff=True,
                     cutoff=0.001,
                     palette=palette_celltype,
                     fig_size=(6, 2.5),
                     save_fig=False)

_images/rna_10xpmbc_all_genes_v1.2_73_0.png

[ ]:

[35]:

# The same list of genes can also be visualized on UMAP to confirm their cell type specificity
adata_CG.obsm['X_umap'] = adata_C[adata_CG.obs_names,].obsm['X_umap'].copy()
si.pl.umap(adata_CG,
           color=['CST3', 'NKG7', 'MS4A1', 'GAPDH'],
           drawing_order='sorted',
           size=5,
           alpha=0.9,
           fig_ncol=4,
           fig_size=(4,4),
           save_fig=False)

_images/rna_10xpmbc_all_genes_v1.2_75_0.png

[ ]:

Queries of entities in SIMBA space

The SIMBA embedding space, with cells and features, serves as an informative entity database. Queries can be performed on the “SIMBA database” by examining neighboring entities at the individual-cell or individual-feature level.

For example, - the query for a cell’s neighboring features can be used to identify marker features (e.g., marker genes) at the individual-cell level. - the query for features’ neighboring cells can be used to annotate cells.

[36]:

# First, we visualize the SIMBA co-embedding space, which serves as a database of entities including both cells and features (e.g., genes) and also where queries of entities will be performed
si.pl.umap(adata_all,
           color=['entity_anno'],dict_palette={'entity_anno': palette_entity_anno},
           drawing_order='random',
           show_texts=False,
           fig_size=(7,5))

_images/rna_10xpmbc_all_genes_v1.2_79_0.png

[37]:

# find neighbor genes around the location [6, 16] on UMAP
query_result = si.tl.query(adata_all,
                           pin=[6,16],
                           obsm='X_umap',
                           use_radius=True,r=2,
                           anno_filter='entity_anno',
                           filters=['gene'])
print(query_result.shape)
query_result.head()

(247, 5)

[37]:

	celltype	id_dataset	entity_anno	distance
AC013264.2	nan	query_0	gene	0.001936
PCSK1N	nan	query_0	gene	0.069059
TCEA3	nan	query_0	gene	0.082872
CD27	nan	query_0	gene	0.091110
RP11-18H21.1	nan	query_0	gene	0.105336

[38]:

# show locations of pin point and its neighbor genes
si.pl.query(adata_all,
            show_texts=False,
            color=['entity_anno'], dict_palette={'entity_anno': palette_entity_anno},
            alpha=0.9,
            alpha_bg=0.1,
            fig_size=(7,5))

_images/rna_10xpmbc_all_genes_v1.2_81_0.png

[ ]:

[39]:

# find top 50 neighbor genes around cell "ACTCAGGATTCGTT-1" (CD14 Monocytes) in SIMBA space
query_result = si.tl.query(adata_all,
                           entity=['ACTCAGGATTCGTT-1'],
                           obsm=None,
                           use_radius=False,
                           k=50,
                           anno_filter='entity_anno',
                           filters=['gene'])
print(query_result.shape)
query_result.head()

(50, 5)

[39]:

	celltype	id_dataset	entity_anno	distance	query
JUND	nan	query_0	gene	1.719076	ACTCAGGATTCGTT-1
CLEC4E	nan	query_0	gene	1.728315	ACTCAGGATTCGTT-1
PTPRE	nan	query_0	gene	1.729887	ACTCAGGATTCGTT-1
QPCT	nan	query_0	gene	1.731332	ACTCAGGATTCGTT-1
NUP214	nan	query_0	gene	1.733139	ACTCAGGATTCGTT-1

[40]:

# show locations of entity and its neighbor genes
si.pl.query(adata_all,
            obsm='X_umap',
            color=['entity_anno'], dict_palette={'entity_anno': palette_entity_anno},
            alpha=0.9,
            alpha_bg=0.1,
            fig_size=(7,5))

_images/rna_10xpmbc_all_genes_v1.2_84_0.png

[ ]:

[41]:

# find top 50 neighbor genes for multiples cells in SIMBA space
query_result = si.tl.query(adata_all,entity=['GATGCCCTCTCATT-1', 'CTGAAGTGGCTATG-1'],
                           obsm=None,
                           use_radius=False,
                           k=50,
                           anno_filter='entity_anno',
                           filters=['gene'],
                           )
print(query_result.shape)
query_result.head()

(100, 5)

[41]:

	celltype	id_dataset	entity_anno	distance	query
LINC00176	nan	query_0	gene	1.020504	CTGAAGTGGCTATG-1
CCR7	nan	query_0	gene	1.021033	CTGAAGTGGCTATG-1
FHIT	nan	query_0	gene	1.021289	CTGAAGTGGCTATG-1
LDLRAP1	nan	query_0	gene	1.040456	GATGCCCTCTCATT-1
TSHZ2	nan	query_0	gene	1.047881	GATGCCCTCTCATT-1

[42]:

# show locations of entities and their neighbor genes
si.pl.query(adata_all,
            obsm='X_umap',
            alpha=0.9,
            alpha_bg=0.1,
            fig_size=(5,5))

_images/rna_10xpmbc_all_genes_v1.2_87_0.png

[ ]:

[43]:

# find neighbor entities (both cells and genes) of a given gene on UMAP
query_result = si.tl.query(adata_all,
                           entity=['CD79A'],
                           obsm='X_umap',
                           use_radius=False,
                           k=50
                           )
print(query_result.shape)
query_result.iloc[:10,]

(50, 5)

[43]:

	celltype	id_dataset	entity_anno	distance	query
CD79A	nan	query_0	gene	0.000000	CD79A
CD19	nan	query_0	gene	0.013946	CD79A
TGGATGTGACCTAG-1	B	ref	B	0.014974	CD79A
ACAAGAGAGTTGAC-1	B	ref	B	0.016267	CD79A
TACATCACGCTAAC-1	B	ref	B	0.022173	CD79A
TTATCCGACTAGTG-1	B	ref	B	0.024351	CD79A
GGAACTTGGGTAGG-1	B	ref	B	0.025603	CD79A
TGCGTAGACGGGAA-1	B	ref	B	0.027751	CD79A
TAACTCACGTACAC-1	B	ref	B	0.028147	CD79A
GATTGGACTTTCGT-1	B	ref	B	0.029059	CD79A

[44]:

si.pl.query(adata_all,
            obsm='X_umap',
            color=['entity_anno'], dict_palette={'entity_anno': palette_entity_anno},
            show_texts=False,
            alpha=0.9,
            alpha_bg=0.1,
            fig_size=(7,5))

_images/rna_10xpmbc_all_genes_v1.2_90_0.png

[ ]:

save results

[45]:

adata_CG.write(os.path.join(workdir, 'adata_CG.h5ad'))
adata_C.write(os.path.join(workdir, 'adata_C.h5ad'))
adata_G.write(os.path.join(workdir, 'adata_G.h5ad'))
adata_all.write(os.path.join(workdir, 'adata_all.h5ad'))
adata_cmp.write(os.path.join(workdir, 'adata_cmp.h5ad'))

[ ]:

Read back anndata objects

adata_CG = si.read_h5ad(os.path.join(workdir, 'adata_CG.h5ad'))
adata_C = si.read_h5ad(os.path.join(workdir, 'adata_C.h5ad'))
adata_G = si.read_h5ad(os.path.join(workdir, 'adata_G.h5ad'))
adata_all = si.read_h5ad(os.path.join(workdir, 'adata_all.h5ad'))
adata_cmp = si.read_h5ad(os.path.join(workdir, 'adata_cmp.h5ad'))