Use edge weigthts

In this tutorial, we will show how to incorporate weights into the graph (add_edge_weights=True) and enable these weights for training (use_edge_weights=True)

[1]:

import os
import simba as si
si.__version__

[1]:

'1.2'

[2]:

workdir = 'result_simba_edge_weights'
si.settings.set_workdir(workdir)

Saving results in: result_simba_edge_weights

[3]:

si.settings.set_figure_params(dpi=80,
                              style='white',
                              fig_size=[5,5],
                              rc={'image.cmap': 'viridis'})

[4]:

# make plots prettier
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('retina')

[5]:

adata_CG = si.datasets.rna_10xpmbc3k()

Downloading data ...

rna_10xpmbc3k.h5ad: 21.5MB [00:03, 5.60MB/s]

Downloaded to result_simba_edge_weights/data.

[6]:

adata_CG

[6]:

AnnData object with n_obs × n_vars = 2700 × 32738
    obs: 'celltype'
    var: 'gene_ids'

[7]:

si.pp.filter_genes(adata_CG,min_n_cells=3)

Before filtering:
2700 cells, 32738 genes
Filter genes based on min_n_cells
After filtering out low-expressed genes:
2700 cells, 13714 genes

[8]:

si.pp.cal_qc_rna(adata_CG)
si.pl.violin(adata_CG,list_obs=['n_counts','n_genes','pct_mt'])

[9]:

si.pp.normalize(adata_CG,method='lib_size')
si.pp.log_transform(adata_CG)

[10]:

si.pp.select_variable_genes(adata_CG, n_top_genes=2000)
si.pl.variable_genes(adata_CG,show_texts=True)

2000 variable genes are selected.

[11]:

si.tl.discretize(adata_CG,n_bins=5)
si.pl.discretize(adata_CG,kde=False)

[ ]:

Using discretized gene expression - all genes

[12]:

si.tl.gen_graph(list_CG=[adata_CG],
                copy=False,
                use_highly_variable=False,
                layer='simba',
                add_edge_weights=False,
                dirname='graph0')

relation0: source: C, destination: G
#edges: 590134
relation1: source: C, destination: G
#edges: 1034817
relation2: source: C, destination: G
#edges: 384939
relation3: source: C, destination: G
#edges: 185485
relation4: source: C, destination: G
#edges: 87601
Total number of edges: 2282976
Writing graph file "pbg_graph.txt" to "result_simba_edge_weights/pbg/graph0" ...
Finished.

[13]:

si.settings.pbg_params

[13]:

{'entity_path': 'result_simba_edge_weights/pbg/graph0/input/entity',
 'edge_paths': ['result_simba_edge_weights/pbg/graph0/input/edge'],
 'checkpoint_path': '',
 'entities': {'C': {'num_partitions': 1}, 'G': {'num_partitions': 1}},
 'relations': [{'name': 'r0',
   'lhs': 'C',
   'rhs': 'G',
   'operator': 'none',
   'weight': 1.0},
  {'name': 'r1', 'lhs': 'C', 'rhs': 'G', 'operator': 'none', 'weight': 2.0},
  {'name': 'r2', 'lhs': 'C', 'rhs': 'G', 'operator': 'none', 'weight': 3.0},
  {'name': 'r3', 'lhs': 'C', 'rhs': 'G', 'operator': 'none', 'weight': 4.0},
  {'name': 'r4', 'lhs': 'C', 'rhs': 'G', 'operator': 'none', 'weight': 5.0}],
 'dynamic_relations': False,
 'dimension': 50,
 'global_emb': False,
 'comparator': 'dot',
 'num_epochs': 10,
 'workers': 4,
 'num_batch_negs': 50,
 'num_uniform_negs': 50,
 'loss_fn': 'softmax',
 'lr': 0.1,
 'early_stopping': False,
 'regularization_coef': 0.0,
 'wd': 0.0,
 'wd_interval': 50,
 'eval_fraction': 0.05,
 'eval_num_batch_negs': 50,
 'eval_num_uniform_negs': 50,
 'checkpoint_preservation_interval': None}

[14]:

# modify parameters
dict_config = si.settings.pbg_params.copy()
# dict_config['wd'] = 0.015521
dict_config['wd_interval'] = 10 # we usually set `wd_interval` to 10 for scRNA-seq datasets for a slower but finer training
dict_config['workers'] = 4 #The number of CPUs.

## start training
si.tl.pbg_train(pbg_params = dict_config, auto_wd=True, save_wd=True, output='model')

Auto-estimated weight decay is 0.015521
`.settings.pbg_params['wd']` has been updated to 0.015521
Converting input data ...
[2022-10-10 13:43:03.453400] Using the 5 relation types given in the config
[2022-10-10 13:43:03.453905] Searching for the entities in the edge files...
[2022-10-10 13:43:06.243132] Entity type C:
[2022-10-10 13:43:06.243681] - Found 2700 entities
[2022-10-10 13:43:06.244079] - Removing the ones with fewer than 1 occurrences...
[2022-10-10 13:43:06.244885] - Left with 2700 entities
[2022-10-10 13:43:06.245435] - Shuffling them...
[2022-10-10 13:43:06.247549] Entity type G:
[2022-10-10 13:43:06.248129] - Found 13714 entities
[2022-10-10 13:43:06.248461] - Removing the ones with fewer than 1 occurrences...
[2022-10-10 13:43:06.250226] - Left with 13714 entities
[2022-10-10 13:43:06.250876] - Shuffling them...
[2022-10-10 13:43:06.259329] Preparing counts and dictionaries for entities and relation types:
[2022-10-10 13:43:06.260700] - Writing count of entity type C and partition 0
[2022-10-10 13:43:06.264633] - Writing count of entity type G and partition 0
[2022-10-10 13:43:06.272220] Preparing edge path result_simba_edge_weights/pbg/graph0/input/edge, out of the edges found in result_simba_edge_weights/pbg/graph0/pbg_graph.txt
using fast version
[2022-10-10 13:43:06.272903] Taking the fast train!
[2022-10-10 13:43:06.731291] - Processed 100000 edges so far...
[2022-10-10 13:43:07.196747] - Processed 200000 edges so far...
[2022-10-10 13:43:07.699447] - Processed 300000 edges so far...
[2022-10-10 13:43:08.199365] - Processed 400000 edges so far...
[2022-10-10 13:43:08.673068] - Processed 500000 edges so far...
[2022-10-10 13:43:09.150632] - Processed 600000 edges so far...
[2022-10-10 13:43:09.614157] - Processed 700000 edges so far...
[2022-10-10 13:43:10.094486] - Processed 800000 edges so far...
[2022-10-10 13:43:10.555303] - Processed 900000 edges so far...
[2022-10-10 13:43:11.048349] - Processed 1000000 edges so far...
[2022-10-10 13:43:11.544103] - Processed 1100000 edges so far...
[2022-10-10 13:43:12.053794] - Processed 1200000 edges so far...
[2022-10-10 13:43:12.541551] - Processed 1300000 edges so far...
[2022-10-10 13:43:13.064220] - Processed 1400000 edges so far...
[2022-10-10 13:43:13.527216] - Processed 1500000 edges so far...
[2022-10-10 13:43:13.981077] - Processed 1600000 edges so far...
[2022-10-10 13:43:14.459946] - Processed 1700000 edges so far...
[2022-10-10 13:43:14.970111] - Processed 1800000 edges so far...
[2022-10-10 13:43:15.444465] - Processed 1900000 edges so far...
[2022-10-10 13:43:15.988276] - Processed 2000000 edges so far...
[2022-10-10 13:43:16.434461] - Processed 2100000 edges so far...
[2022-10-10 13:43:16.889269] - Processed 2200000 edges so far...
[2022-10-10 13:43:19.983264] - Processed 2282976 edges in total
Starting training ...
Finished

[15]:

si.pl.pbg_metrics(fig_ncol=1)

[16]:

palette_celltype={'B':'#1f77b4',
                  'CD4 T':'#ff7f0e',
                  'CD8 T':'#279e68',
                  'Dendritic':"#aa40fc",
                  'CD14 Monocytes':'#d62728',
                  'FCGR3A Monocytes':'#b5bd61',
                  'Megakaryocytes':'#e377c2',
                  'NK':'#8c564b'}

dict_adata = si.read_embedding()

adata_C = dict_adata['C']  # embeddings for cells
adata_G = dict_adata['G']  # embeddings for genes

## Add annotation of celltypes (optional)
adata_C.obs['celltype'] = adata_CG[adata_C.obs_names,:].obs['celltype'].copy()
adata_C

[16]:

AnnData object with n_obs × n_vars = 2700 × 50
    obs: 'celltype'

[17]:

si.tl.umap(adata_C,n_neighbors=15,n_components=2)
si.pl.umap(adata_C,color=['celltype'],
           dict_palette={'celltype': palette_celltype},
           fig_size=(6,4),
           drawing_order='random')

[ ]:

Using discretized gene expression - only variable genes

[18]:

si.tl.gen_graph(list_CG=[adata_CG],
                copy=False,
                use_highly_variable=True,
                layer='simba',
                add_edge_weights=False,
                dirname='graph1')

relation0: source: C, destination: G
#edges: 138727
relation1: source: C, destination: G
#edges: 243405
relation2: source: C, destination: G
#edges: 82172
relation3: source: C, destination: G
#edges: 27876
relation4: source: C, destination: G
#edges: 17250
Total number of edges: 509430
Writing graph file "pbg_graph.txt" to "result_simba_edge_weights/pbg/graph1" ...
Finished.

[19]:

si.settings.pbg_params

[19]:

{'entity_path': 'result_simba_edge_weights/pbg/graph1/input/entity',
 'edge_paths': ['result_simba_edge_weights/pbg/graph1/input/edge'],
 'checkpoint_path': 'result_simba_edge_weights/pbg/graph0/model',
 'entities': {'C': {'num_partitions': 1}, 'G': {'num_partitions': 1}},
 'relations': [{'name': 'r0',
   'lhs': 'C',
   'rhs': 'G',
   'operator': 'none',
   'weight': 1.0},
  {'name': 'r1', 'lhs': 'C', 'rhs': 'G', 'operator': 'none', 'weight': 2.0},
  {'name': 'r2', 'lhs': 'C', 'rhs': 'G', 'operator': 'none', 'weight': 3.0},
  {'name': 'r3', 'lhs': 'C', 'rhs': 'G', 'operator': 'none', 'weight': 4.0},
  {'name': 'r4', 'lhs': 'C', 'rhs': 'G', 'operator': 'none', 'weight': 5.0}],
 'dynamic_relations': False,
 'dimension': 50,
 'global_emb': False,
 'comparator': 'dot',
 'num_epochs': 10,
 'workers': 4,
 'num_batch_negs': 50,
 'num_uniform_negs': 50,
 'loss_fn': 'softmax',
 'lr': 0.1,
 'early_stopping': False,
 'regularization_coef': 0.0,
 'wd': 0.015521,
 'wd_interval': 50,
 'eval_fraction': 0.05,
 'eval_num_batch_negs': 50,
 'eval_num_uniform_negs': 50,
 'checkpoint_preservation_interval': None}

[20]:

# modify parameters
dict_config = si.settings.pbg_params.copy()
dict_config['wd_interval'] = 10 # we usually set `wd_interval` to 10 for scRNA-seq datasets for a slower but finer training
dict_config['workers'] = 4 #The number of CPUs.

## start training
si.tl.pbg_train(pbg_params = dict_config, auto_wd=True, save_wd=True, output='model')

Auto-estimated weight decay is 0.069558
`.settings.pbg_params['wd']` has been updated to 0.069558
Converting input data ...
[2022-10-10 13:44:46.615190] Using the 5 relation types given in the config
[2022-10-10 13:44:46.615748] Searching for the entities in the edge files...
[2022-10-10 13:44:47.245409] Entity type C:
[2022-10-10 13:44:47.246209] - Found 2700 entities
[2022-10-10 13:44:47.246714] - Removing the ones with fewer than 1 occurrences...
[2022-10-10 13:44:47.247570] - Left with 2700 entities
[2022-10-10 13:44:47.247964] - Shuffling them...
[2022-10-10 13:44:47.249921] Entity type G:
[2022-10-10 13:44:47.250608] - Found 2000 entities
[2022-10-10 13:44:47.250971] - Removing the ones with fewer than 1 occurrences...
[2022-10-10 13:44:47.251637] - Left with 2000 entities
[2022-10-10 13:44:47.252242] - Shuffling them...
[2022-10-10 13:44:47.253708] Preparing counts and dictionaries for entities and relation types:
[2022-10-10 13:44:47.254991] - Writing count of entity type C and partition 0
[2022-10-10 13:44:47.258398] - Writing count of entity type G and partition 0
[2022-10-10 13:44:47.260535] Preparing edge path result_simba_edge_weights/pbg/graph1/input/edge, out of the edges found in result_simba_edge_weights/pbg/graph1/pbg_graph.txt
using fast version
[2022-10-10 13:44:47.261206] Taking the fast train!
[2022-10-10 13:44:47.724024] - Processed 100000 edges so far...
[2022-10-10 13:44:48.175073] - Processed 200000 edges so far...
[2022-10-10 13:44:48.618815] - Processed 300000 edges so far...
[2022-10-10 13:44:49.075198] - Processed 400000 edges so far...
[2022-10-10 13:44:49.545616] - Processed 500000 edges so far...
[2022-10-10 13:44:50.091471] - Processed 509430 edges in total
Starting training ...
Finished

[21]:

si.pl.pbg_metrics(fig_ncol=1)

[22]:

palette_celltype={'B':'#1f77b4',
                  'CD4 T':'#ff7f0e',
                  'CD8 T':'#279e68',
                  'Dendritic':"#aa40fc",
                  'CD14 Monocytes':'#d62728',
                  'FCGR3A Monocytes':'#b5bd61',
                  'Megakaryocytes':'#e377c2',
                  'NK':'#8c564b'}

dict_adata = si.read_embedding()

adata_C = dict_adata['C']  # embeddings for cells
adata_G = dict_adata['G']  # embeddings for genes

## Add annotation of celltypes (optional)
adata_C.obs['celltype'] = adata_CG[adata_C.obs_names,:].obs['celltype'].copy()
adata_C

[22]:

AnnData object with n_obs × n_vars = 2700 × 50
    obs: 'celltype'

[23]:

si.tl.umap(adata_C,n_neighbors=15,n_components=2)
si.pl.umap(adata_C,color=['celltype'],
           dict_palette={'celltype': palette_celltype},
           fig_size=(6,4),
           drawing_order='random')

[ ]:

Using edge weights (raw gene expression) - all genes

[24]:

si.tl.gen_graph(list_CG=[adata_CG],
                copy=False,
                use_highly_variable=False,
                layer=None,
                add_edge_weights=True,
                dirname='graph2')

relation0: source: C, destination: G
#edges: 2282976
Total number of edges: 2282976
Writing graph file "pbg_graph.txt" to "result_simba_edge_weights/pbg/graph2" ...
Finished.

[25]:

si.settings.pbg_params

[25]:

{'entity_path': 'result_simba_edge_weights/pbg/graph2/input/entity',
 'edge_paths': ['result_simba_edge_weights/pbg/graph2/input/edge'],
 'checkpoint_path': 'result_simba_edge_weights/pbg/graph1/model',
 'entities': {'C': {'num_partitions': 1}, 'G': {'num_partitions': 1}},
 'relations': [{'name': 'r0',
   'lhs': 'C',
   'rhs': 'G',
   'operator': 'none',
   'weight': 1.0}],
 'dynamic_relations': False,
 'dimension': 50,
 'global_emb': False,
 'comparator': 'dot',
 'num_epochs': 10,
 'workers': 4,
 'num_batch_negs': 50,
 'num_uniform_negs': 50,
 'loss_fn': 'softmax',
 'lr': 0.1,
 'early_stopping': False,
 'regularization_coef': 0.0,
 'wd': 0.069558,
 'wd_interval': 50,
 'eval_fraction': 0.05,
 'eval_num_batch_negs': 50,
 'eval_num_uniform_negs': 50,
 'checkpoint_preservation_interval': None}

[26]:

# modify parameters
dict_config = si.settings.pbg_params.copy()
dict_config['wd_interval'] = 10 # we usually set `wd_interval` to 10 for scRNA-seq datasets for a slower but finer training
dict_config['workers'] = 4 #The number of CPUs.

## start training
si.tl.pbg_train(pbg_params = dict_config, auto_wd=True, save_wd=True, output='model', use_edge_weights=True)

Auto-estimated weight decay is 0.015521
`.settings.pbg_params['wd']` has been updated to 0.015521
Converting input data ...
Edge weights are being used ...
[2022-10-10 13:47:20.130429] Using the 1 relation types given in the config
[2022-10-10 13:47:20.131066] Searching for the entities in the edge files...
[2022-10-10 13:47:23.633055] Entity type C:
[2022-10-10 13:47:23.633609] - Found 2700 entities
[2022-10-10 13:47:23.633978] - Removing the ones with fewer than 1 occurrences...
[2022-10-10 13:47:23.634907] - Left with 2700 entities
[2022-10-10 13:47:23.635662] - Shuffling them...
[2022-10-10 13:47:23.637645] Entity type G:
[2022-10-10 13:47:23.638327] - Found 13714 entities
[2022-10-10 13:47:23.638688] - Removing the ones with fewer than 1 occurrences...
[2022-10-10 13:47:23.641269] - Left with 13714 entities
[2022-10-10 13:47:23.641881] - Shuffling them...
[2022-10-10 13:47:23.654717] Preparing counts and dictionaries for entities and relation types:
[2022-10-10 13:47:23.656340] - Writing count of entity type C and partition 0
[2022-10-10 13:47:23.659567] - Writing count of entity type G and partition 0
[2022-10-10 13:47:23.672482] Preparing edge path result_simba_edge_weights/pbg/graph2/input/edge, out of the edges found in result_simba_edge_weights/pbg/graph2/pbg_graph.txt
using fast version
[2022-10-10 13:47:23.673757] Taking the fast train!
[2022-10-10 13:47:24.203203] - Processed 100000 edges so far...
[2022-10-10 13:47:24.687041] - Processed 200000 edges so far...
[2022-10-10 13:47:25.188701] - Processed 300000 edges so far...
[2022-10-10 13:47:25.682914] - Processed 400000 edges so far...
[2022-10-10 13:47:26.187692] - Processed 500000 edges so far...
[2022-10-10 13:47:26.690817] - Processed 600000 edges so far...
[2022-10-10 13:47:27.297040] - Processed 700000 edges so far...
[2022-10-10 13:47:27.926251] - Processed 800000 edges so far...
[2022-10-10 13:47:28.443981] - Processed 900000 edges so far...
[2022-10-10 13:47:28.980157] - Processed 1000000 edges so far...
[2022-10-10 13:47:29.501237] - Processed 1100000 edges so far...
[2022-10-10 13:47:30.028912] - Processed 1200000 edges so far...
[2022-10-10 13:47:30.522380] - Processed 1300000 edges so far...
[2022-10-10 13:47:31.095499] - Processed 1400000 edges so far...
[2022-10-10 13:47:31.593399] - Processed 1500000 edges so far...
[2022-10-10 13:47:32.110170] - Processed 1600000 edges so far...
[2022-10-10 13:47:32.622942] - Processed 1700000 edges so far...
[2022-10-10 13:47:33.144220] - Processed 1800000 edges so far...
[2022-10-10 13:47:33.635787] - Processed 1900000 edges so far...
[2022-10-10 13:47:34.151326] - Processed 2000000 edges so far...
[2022-10-10 13:47:34.648824] - Processed 2100000 edges so far...
[2022-10-10 13:47:35.203821] - Processed 2200000 edges so far...
[2022-10-10 13:47:38.466443] - Processed 2282976 edges in total
Starting training ...
Finished

[27]:

si.pl.pbg_metrics(fig_ncol=1)

[28]:

palette_celltype={'B':'#1f77b4',
                  'CD4 T':'#ff7f0e',
                  'CD8 T':'#279e68',
                  'Dendritic':"#aa40fc",
                  'CD14 Monocytes':'#d62728',
                  'FCGR3A Monocytes':'#b5bd61',
                  'Megakaryocytes':'#e377c2',
                  'NK':'#8c564b'}

dict_adata = si.read_embedding()

adata_C = dict_adata['C']  # embeddings for cells
adata_G = dict_adata['G']  # embeddings for genes

## Add annotation of celltypes (optional)
adata_C.obs['celltype'] = adata_CG[adata_C.obs_names,:].obs['celltype'].copy()
adata_C

[28]:

AnnData object with n_obs × n_vars = 2700 × 50
    obs: 'celltype'

[29]:

si.tl.umap(adata_C,n_neighbors=15,n_components=2)
si.pl.umap(adata_C,color=['celltype'],
           dict_palette={'celltype': palette_celltype},
           fig_size=(6,4),
           drawing_order='random')

[ ]:

Using edge weights (raw gene expression) - only variable genes

[30]:

si.tl.gen_graph(list_CG=[adata_CG],
                copy=False,
                use_highly_variable=True,
                layer=None,
                add_edge_weights=True,
                dirname='graph3')

relation0: source: C, destination: G
#edges: 509430
Total number of edges: 509430
Writing graph file "pbg_graph.txt" to "result_simba_edge_weights/pbg/graph3" ...
Finished.

[31]:

si.settings.pbg_params

[31]:

{'entity_path': 'result_simba_edge_weights/pbg/graph3/input/entity',
 'edge_paths': ['result_simba_edge_weights/pbg/graph3/input/edge'],
 'checkpoint_path': 'result_simba_edge_weights/pbg/graph2/model',
 'entities': {'C': {'num_partitions': 1}, 'G': {'num_partitions': 1}},
 'relations': [{'name': 'r0',
   'lhs': 'C',
   'rhs': 'G',
   'operator': 'none',
   'weight': 1.0}],
 'dynamic_relations': False,
 'dimension': 50,
 'global_emb': False,
 'comparator': 'dot',
 'num_epochs': 10,
 'workers': 4,
 'num_batch_negs': 50,
 'num_uniform_negs': 50,
 'loss_fn': 'softmax',
 'lr': 0.1,
 'early_stopping': False,
 'regularization_coef': 0.0,
 'wd': 0.015521,
 'wd_interval': 50,
 'eval_fraction': 0.05,
 'eval_num_batch_negs': 50,
 'eval_num_uniform_negs': 50,
 'checkpoint_preservation_interval': None}

[32]:

# modify parameters
dict_config = si.settings.pbg_params.copy()
dict_config['wd_interval'] = 10 # we usually set `wd_interval` to 10 for scRNA-seq datasets for a slower but finer training
dict_config['workers'] = 4 #The number of CPUs.

## start training
si.tl.pbg_train(pbg_params = dict_config, auto_wd=True, save_wd=True, output='model', use_edge_weights=True)

Auto-estimated weight decay is 0.069558
`.settings.pbg_params['wd']` has been updated to 0.069558
Converting input data ...
Edge weights are being used ...
[2022-10-10 13:50:24.363063] Using the 1 relation types given in the config
[2022-10-10 13:50:24.363674] Searching for the entities in the edge files...
[2022-10-10 13:50:25.083892] Entity type C:
[2022-10-10 13:50:25.084614] - Found 2700 entities
[2022-10-10 13:50:25.085133] - Removing the ones with fewer than 1 occurrences...
[2022-10-10 13:50:25.086340] - Left with 2700 entities
[2022-10-10 13:50:25.087150] - Shuffling them...
[2022-10-10 13:50:25.089147] Entity type G:
[2022-10-10 13:50:25.089771] - Found 2000 entities
[2022-10-10 13:50:25.090276] - Removing the ones with fewer than 1 occurrences...
[2022-10-10 13:50:25.091244] - Left with 2000 entities
[2022-10-10 13:50:25.092002] - Shuffling them...
[2022-10-10 13:50:25.093553] Preparing counts and dictionaries for entities and relation types:
[2022-10-10 13:50:25.094980] - Writing count of entity type C and partition 0
[2022-10-10 13:50:25.097761] - Writing count of entity type G and partition 0
[2022-10-10 13:50:25.100081] Preparing edge path result_simba_edge_weights/pbg/graph3/input/edge, out of the edges found in result_simba_edge_weights/pbg/graph3/pbg_graph.txt
using fast version
[2022-10-10 13:50:25.100748] Taking the fast train!
[2022-10-10 13:50:25.579906] - Processed 100000 edges so far...
[2022-10-10 13:50:26.069180] - Processed 200000 edges so far...
[2022-10-10 13:50:26.567087] - Processed 300000 edges so far...
[2022-10-10 13:50:27.100423] - Processed 400000 edges so far...
[2022-10-10 13:50:27.582349] - Processed 500000 edges so far...
[2022-10-10 13:50:28.612344] - Processed 509430 edges in total
Starting training ...
Finished

[33]:

si.pl.pbg_metrics(fig_ncol=1)

[34]:

palette_celltype={'B':'#1f77b4',
                  'CD4 T':'#ff7f0e',
                  'CD8 T':'#279e68',
                  'Dendritic':"#aa40fc",
                  'CD14 Monocytes':'#d62728',
                  'FCGR3A Monocytes':'#b5bd61',
                  'Megakaryocytes':'#e377c2',
                  'NK':'#8c564b'}

dict_adata = si.read_embedding()

adata_C = dict_adata['C']  # embeddings for cells
adata_G = dict_adata['G']  # embeddings for genes

## Add annotation of celltypes (optional)
adata_C.obs['celltype'] = adata_CG[adata_C.obs_names,:].obs['celltype'].copy()
adata_C

[34]:

AnnData object with n_obs × n_vars = 2700 × 50
    obs: 'celltype'

[35]:

si.tl.umap(adata_C,n_neighbors=15,n_components=2)
si.pl.umap(adata_C,color=['celltype'],
           dict_palette={'celltype': palette_celltype},
           fig_size=(6,4),
           drawing_order='random')

[ ]:

[ ]:

Using edge weights (discretized gene expression) - all genes

[36]:

si.tl.gen_graph(list_CG=[adata_CG],
                copy=False,
                use_highly_variable=False,
                layer='simba',
                add_edge_weights=True,
                dirname='graph4')

relation0: source: C, destination: G
#edges: 2282976
Total number of edges: 2282976
Writing graph file "pbg_graph.txt" to "result_simba_edge_weights/pbg/graph4" ...
Finished.

[37]:

si.settings.pbg_params

[37]:

{'entity_path': 'result_simba_edge_weights/pbg/graph4/input/entity',
 'edge_paths': ['result_simba_edge_weights/pbg/graph4/input/edge'],
 'checkpoint_path': 'result_simba_edge_weights/pbg/graph3/model',
 'entities': {'C': {'num_partitions': 1}, 'G': {'num_partitions': 1}},
 'relations': [{'name': 'r0',
   'lhs': 'C',
   'rhs': 'G',
   'operator': 'none',
   'weight': 1.0}],
 'dynamic_relations': False,
 'dimension': 50,
 'global_emb': False,
 'comparator': 'dot',
 'num_epochs': 10,
 'workers': 4,
 'num_batch_negs': 50,
 'num_uniform_negs': 50,
 'loss_fn': 'softmax',
 'lr': 0.1,
 'early_stopping': False,
 'regularization_coef': 0.0,
 'wd': 0.069558,
 'wd_interval': 50,
 'eval_fraction': 0.05,
 'eval_num_batch_negs': 50,
 'eval_num_uniform_negs': 50,
 'checkpoint_preservation_interval': None}

[38]:

# modify parameters
dict_config = si.settings.pbg_params.copy()
dict_config['wd_interval'] = 10 # we usually set `wd_interval` to 10 for scRNA-seq datasets for a slower but finer training
dict_config['workers'] = 4 #The number of CPUs.

## start training
si.tl.pbg_train(pbg_params = dict_config, auto_wd=True, save_wd=True, output='model', use_edge_weights=True)

Auto-estimated weight decay is 0.015521
`.settings.pbg_params['wd']` has been updated to 0.015521
Converting input data ...
Edge weights are being used ...
[2022-10-10 13:53:33.429708] Using the 1 relation types given in the config
[2022-10-10 13:53:33.430294] Searching for the entities in the edge files...
[2022-10-10 13:53:36.640605] Entity type C:
[2022-10-10 13:53:36.641172] - Found 2700 entities
[2022-10-10 13:53:36.641698] - Removing the ones with fewer than 1 occurrences...
[2022-10-10 13:53:36.642450] - Left with 2700 entities
[2022-10-10 13:53:36.643278] - Shuffling them...
[2022-10-10 13:53:36.645758] Entity type G:
[2022-10-10 13:53:36.646605] - Found 13714 entities
[2022-10-10 13:53:36.647252] - Removing the ones with fewer than 1 occurrences...
[2022-10-10 13:53:36.651858] - Left with 13714 entities
[2022-10-10 13:53:36.652631] - Shuffling them...
[2022-10-10 13:53:36.665425] Preparing counts and dictionaries for entities and relation types:
[2022-10-10 13:53:36.666976] - Writing count of entity type C and partition 0
[2022-10-10 13:53:36.670722] - Writing count of entity type G and partition 0
[2022-10-10 13:53:36.679475] Preparing edge path result_simba_edge_weights/pbg/graph4/input/edge, out of the edges found in result_simba_edge_weights/pbg/graph4/pbg_graph.txt
using fast version
[2022-10-10 13:53:36.680357] Taking the fast train!
[2022-10-10 13:53:37.234925] - Processed 100000 edges so far...
[2022-10-10 13:53:37.716576] - Processed 200000 edges so far...
[2022-10-10 13:53:38.195279] - Processed 300000 edges so far...
[2022-10-10 13:53:38.651373] - Processed 400000 edges so far...
[2022-10-10 13:53:39.128663] - Processed 500000 edges so far...
[2022-10-10 13:53:39.627089] - Processed 600000 edges so far...
[2022-10-10 13:53:40.143111] - Processed 700000 edges so far...
[2022-10-10 13:53:40.673668] - Processed 800000 edges so far...
[2022-10-10 13:53:41.186034] - Processed 900000 edges so far...
[2022-10-10 13:53:41.690377] - Processed 1000000 edges so far...
[2022-10-10 13:53:42.202142] - Processed 1100000 edges so far...
[2022-10-10 13:53:42.688485] - Processed 1200000 edges so far...
[2022-10-10 13:53:43.175392] - Processed 1300000 edges so far...
[2022-10-10 13:53:43.648967] - Processed 1400000 edges so far...
[2022-10-10 13:53:44.185160] - Processed 1500000 edges so far...
[2022-10-10 13:53:44.670105] - Processed 1600000 edges so far...
[2022-10-10 13:53:45.185420] - Processed 1700000 edges so far...
[2022-10-10 13:53:45.696358] - Processed 1800000 edges so far...
[2022-10-10 13:53:46.206768] - Processed 1900000 edges so far...
[2022-10-10 13:53:46.689835] - Processed 2000000 edges so far...
[2022-10-10 13:53:47.217289] - Processed 2100000 edges so far...
[2022-10-10 13:53:47.732357] - Processed 2200000 edges so far...
[2022-10-10 13:53:51.257237] - Processed 2282976 edges in total
Starting training ...
Finished

[39]:

si.pl.pbg_metrics(fig_ncol=1)

[40]:

palette_celltype={'B':'#1f77b4',
                  'CD4 T':'#ff7f0e',
                  'CD8 T':'#279e68',
                  'Dendritic':"#aa40fc",
                  'CD14 Monocytes':'#d62728',
                  'FCGR3A Monocytes':'#b5bd61',
                  'Megakaryocytes':'#e377c2',
                  'NK':'#8c564b'}

dict_adata = si.read_embedding()

adata_C = dict_adata['C']  # embeddings for cells
adata_G = dict_adata['G']  # embeddings for genes

## Add annotation of celltypes (optional)
adata_C.obs['celltype'] = adata_CG[adata_C.obs_names,:].obs['celltype'].copy()
adata_C

[40]:

AnnData object with n_obs × n_vars = 2700 × 50
    obs: 'celltype'

[41]:

si.tl.umap(adata_C,n_neighbors=15,n_components=2)
si.pl.umap(adata_C,color=['celltype'],
           dict_palette={'celltype': palette_celltype},
           fig_size=(6,4),
           drawing_order='random')

[ ]:

Using edge weights (discretized gene expression) - only variable genes

[43]:

si.tl.gen_graph(list_CG=[adata_CG],
                copy=False,
                use_highly_variable=True,
                layer='simba',
                add_edge_weights=True,
                dirname='graph5')

relation0: source: C, destination: G
#edges: 509430
Total number of edges: 509430
Writing graph file "pbg_graph.txt" to "result_simba_edge_weights/pbg/graph5" ...
Finished.

[44]:

si.settings.pbg_params

[44]:

{'entity_path': 'result_simba_edge_weights/pbg/graph5/input/entity',
 'edge_paths': ['result_simba_edge_weights/pbg/graph5/input/edge'],
 'checkpoint_path': 'result_simba_edge_weights/pbg/graph4/model',
 'entities': {'C': {'num_partitions': 1}, 'G': {'num_partitions': 1}},
 'relations': [{'name': 'r0',
   'lhs': 'C',
   'rhs': 'G',
   'operator': 'none',
   'weight': 1.0}],
 'dynamic_relations': False,
 'dimension': 50,
 'global_emb': False,
 'comparator': 'dot',
 'num_epochs': 10,
 'workers': 4,
 'num_batch_negs': 50,
 'num_uniform_negs': 50,
 'loss_fn': 'softmax',
 'lr': 0.1,
 'early_stopping': False,
 'regularization_coef': 0.0,
 'wd': 0.015521,
 'wd_interval': 50,
 'eval_fraction': 0.05,
 'eval_num_batch_negs': 50,
 'eval_num_uniform_negs': 50,
 'checkpoint_preservation_interval': None}

[45]:

# modify parameters
dict_config = si.settings.pbg_params.copy()
dict_config['wd_interval'] = 10 # we usually set `wd_interval` to 10 for scRNA-seq datasets for a slower but finer training
dict_config['workers'] = 4 #The number of CPUs.

## start training
si.tl.pbg_train(pbg_params = dict_config, auto_wd=True, save_wd=True, output='model', use_edge_weights=True)

Auto-estimated weight decay is 0.069558
`.settings.pbg_params['wd']` has been updated to 0.069558
Converting input data ...
Edge weights are being used ...
[2022-10-10 13:57:37.849447] Using the 1 relation types given in the config
[2022-10-10 13:57:37.849988] Searching for the entities in the edge files...
[2022-10-10 13:57:38.531957] Entity type C:
[2022-10-10 13:57:38.532530] - Found 2700 entities
[2022-10-10 13:57:38.532959] - Removing the ones with fewer than 1 occurrences...
[2022-10-10 13:57:38.533803] - Left with 2700 entities
[2022-10-10 13:57:38.534463] - Shuffling them...
[2022-10-10 13:57:38.536572] Entity type G:
[2022-10-10 13:57:38.537305] - Found 2000 entities
[2022-10-10 13:57:38.537740] - Removing the ones with fewer than 1 occurrences...
[2022-10-10 13:57:38.538544] - Left with 2000 entities
[2022-10-10 13:57:38.539200] - Shuffling them...
[2022-10-10 13:57:38.540654] Preparing counts and dictionaries for entities and relation types:
[2022-10-10 13:57:38.542192] - Writing count of entity type C and partition 0
[2022-10-10 13:57:38.545274] - Writing count of entity type G and partition 0
[2022-10-10 13:57:38.547926] Preparing edge path result_simba_edge_weights/pbg/graph5/input/edge, out of the edges found in result_simba_edge_weights/pbg/graph5/pbg_graph.txt
using fast version
[2022-10-10 13:57:38.548590] Taking the fast train!
[2022-10-10 13:57:39.004769] - Processed 100000 edges so far...
[2022-10-10 13:57:39.472928] - Processed 200000 edges so far...
[2022-10-10 13:57:39.980631] - Processed 300000 edges so far...
[2022-10-10 13:57:40.442172] - Processed 400000 edges so far...
[2022-10-10 13:57:40.914835] - Processed 500000 edges so far...
[2022-10-10 13:57:41.500177] - Processed 509430 edges in total
Starting training ...
Finished

[46]:

si.pl.pbg_metrics(fig_ncol=1)

[47]:

palette_celltype={'B':'#1f77b4',
                  'CD4 T':'#ff7f0e',
                  'CD8 T':'#279e68',
                  'Dendritic':"#aa40fc",
                  'CD14 Monocytes':'#d62728',
                  'FCGR3A Monocytes':'#b5bd61',
                  'Megakaryocytes':'#e377c2',
                  'NK':'#8c564b'}

dict_adata = si.read_embedding()

adata_C = dict_adata['C']  # embeddings for cells
adata_G = dict_adata['G']  # embeddings for genes

## Add annotation of celltypes (optional)
adata_C.obs['celltype'] = adata_CG[adata_C.obs_names,:].obs['celltype'].copy()
adata_C

[47]:

AnnData object with n_obs × n_vars = 2700 × 50
    obs: 'celltype'

[48]:

si.tl.umap(adata_C,n_neighbors=15,n_components=2)
si.pl.umap(adata_C,color=['celltype'],
           dict_palette={'celltype': palette_celltype},
           fig_size=(6,4),
           drawing_order='random')