simba.tl.gen_graph

simba.tl.gen_graph(list_CP=None, list_PM=None, list_PK=None, list_CG=None, list_CC=None, list_adata=None, prefix_C='C', prefix_P='P', prefix_M='M', prefix_K='K', prefix_G='G', prefix='E', layer='simba', copy=False, dirname='graph0', add_edge_weights=None, use_highly_variable=True, use_top_pcs=True, use_top_pcs_CP=None, use_top_pcs_PM=None, use_top_pcs_PK=None)

Generate graph for PBG training.

Observations and variables of each AnnData object are encoded as nodes (entities). The non-zero values in .layers[‘simba’] (by default) or .X (if .layers[‘simba’] does not exist) indicate the edges between nodes. The values of .layers[‘simba’] or .X are used as the edge weights if add_edge_weights is True.
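As a rough illustration of this encoding (not SIMBA's actual implementation), the sketch below turns the non-zero entries of a small cells-by-genes matrix into edges. The function name `matrix_to_edges`, the relation label `r0` and the toy cell and gene names are hypothetical.

```python
def matrix_to_edges(matrix, obs_names, var_names, relation="r0",
                    add_edge_weights=False):
    """Return one edge per non-zero entry of a dense, row-major matrix."""
    edges = []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value == 0:
                continue  # zero entries produce no edge
            edge = (obs_names[i], relation, var_names[j])
            if add_edge_weights:
                # keep the matrix value as the weight of this edge
                edge = edge + (float(value),)
            edges.append(edge)
    return edges


toy_matrix = [
    [0, 2, 0],
    [1, 0, 3],
]
cells = ["cell_a", "cell_b"]
genes = ["gene_x", "gene_y", "gene_z"]

unweighted = matrix_to_edges(toy_matrix, cells, genes)
weighted = matrix_to_edges(toy_matrix, cells, genes, add_edge_weights=True)
```

Here `unweighted` holds three (source, relation, target) triples, and `weighted` holds the same triples with the matrix value appended as a fourth element.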

When list_adata is specified, nodes shared between the AnnData objects in the list are matched automatically based on .obs_names and .var_names. It is a generalized parameter that encompasses the data-specific parameters such as list_CG, list_CP, list_PK, etc. Each AnnData object indicates one or more relation types.
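The matching by name can be pictured with a minimal sketch. It assumes that two entries refer to the same node exactly when their names are equal strings; the helper `collect_entities` and the toy names are hypothetical and do not reproduce SIMBA's internals.

```python
def collect_entities(named_matrices):
    """Assign one shared index per distinct name across all inputs.

    `named_matrices` is a list of (obs_names, var_names) pairs, one pair
    per input object.
    """
    index_of = {}
    for obs_names, var_names in named_matrices:
        for name in list(obs_names) + list(var_names):
            # a name that was seen before reuses its existing node index
            index_of.setdefault(name, len(index_of))
    return index_of


rna = (["cell_1", "cell_2"], ["gene_a", "gene_b"])
atac = (["cell_2", "cell_3"], ["peak_1"])

nodes = collect_entities([rna, atac])
```

In this toy case `cell_2` appears in both inputs but maps to a single node, so the two inputs become connected through it.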

It also generates an accompanying file ‘entity_alias.tsv’ to map the indices to the aliases used in the graph.
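To make the idea of an alias table concrete, the following sketch writes a two-column, tab-separated mapping from original names to aliases. The header names (`name`, `alias`) and the alias pattern `<prefix>.<index>` are assumptions made for illustration; inspect the generated ‘entity_alias.tsv’ for the real layout.

```python
import csv
import io


def write_alias_table(names, prefix="E"):
    """Return TSV text mapping each original name to a prefixed alias."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "alias"])  # assumed header, for illustration
    for i, name in enumerate(names):
        writer.writerow([name, f"{prefix}.{i}"])
    return buffer.getvalue()


alias_tsv = write_alias_table(["AAAC-1", "AAAG-1"], prefix="C")

# Read the table back into a dict for alias -> original-name lookups.
rows = list(csv.DictReader(io.StringIO(alias_tsv), delimiter="\t"))
name_of = {row["alias"]: row["name"] for row in rows}
```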

Note that when add_edge_weights is True, list_CG will only generate one relation of cells and genes, as opposed to multiple relations based on discretized levels.
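The difference between the two modes can be sketched as follows. The helper `cell_gene_relations` and the relation names `CG` and `CG_level_<n>` are hypothetical; the sketch only assumes that each non-zero cell-gene entry carries a positive integer level.

```python
def cell_gene_relations(entries, add_edge_weights):
    """Group (cell, gene, level) entries into relations.

    With edge weights: a single relation whose edges carry the level as
    their weight. Without: one relation per discretized level.
    """
    relations = {}
    for cell, gene, level in entries:
        if add_edge_weights:
            relations.setdefault("CG", []).append((cell, gene, float(level)))
        else:
            key = f"CG_level_{level}"
            relations.setdefault(key, []).append((cell, gene))
    return relations


entries = [("c1", "g1", 1), ("c1", "g2", 3), ("c2", "g1", 3)]

by_level = cell_gene_relations(entries, add_edge_weights=False)
single = cell_gene_relations(entries, add_edge_weights=True)
```

With these toy entries, `by_level` contains two relations (levels 1 and 3), while `single` contains one relation with three weighted edges.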

Parameters:

list_CP (list, optional (default: None)) – A list of AnnData objects that store ATAC-seq data (Cells by Peaks). The default weight of the cell-peak relation type is 1.0. Ignored when list_adata is specified.
list_PM (list, optional (default: None)) – A list of anndata objects that store relation between Peaks and Motifs. Ignored when list_adata is specified.
list_PK (list, optional (default: None)) – A list of AnnData objects that store the relation between Peaks and Kmers. Ignored when list_adata is specified.
list_CG (list, optional (default: None)) – A list of anndata objects that store RNA-seq data (Cells by Genes). Ignored when list_adata is specified.
list_CC (list, optional (default: None)) – A list of AnnData objects that store the relation between Cells from two conditions. Ignored when list_adata is specified.
list_adata (list, optional (default: None)) – A list of AnnData objects. .obs_names and .var_names between AnnData objects will be automatically matched. If list_adata is specified, the other lists, including list_CP, list_PM, list_PK, list_CG and list_CC, will be ignored.
prefix_C (str, optional (default: ‘C’)) – Prefix to indicate the entity type of cells. Ignored when list_adata is specified.
prefix_G (str, optional (default: ‘G’)) – Prefix to indicate the entity type of genes. Ignored when list_adata is specified.
prefix (str, optional (default: ‘E’)) – Prefix to indicate general entities in list_adata.
layer (str, optional (default: ‘simba’)) – The layer in AnnData to use for constructing the graph. If layer is None or the specified layer does not exist, .X in AnnData will be used instead.
dirname (str, optional (default: ‘graph0’)) – The name of the directory in which each graph will be stored.
add_edge_weights (bool, optional (default: None)) – If True, the column of edge weights will be added. If left as None, it is set to True when list_adata is specified and to False otherwise.
use_highly_variable (bool, optional (default: True)) – Use highly variable genes. Only valid for list_CG. Ignored when list_adata is specified.
use_top_pcs (bool, optional (default: True)) – Use top-PCs-associated features for CP, PM and PK. Only valid for list_PM, list_PK and list_CP. Ignored when list_adata is specified.
use_top_pcs_CP (bool, optional (default: None)) – Use top-PCs-associated features for CP. Only valid for list_CP. Once specified, it will override use_top_pcs. Ignored when list_adata is specified.
use_top_pcs_PM (bool, optional (default: None)) – Use top-PCs-associated features for PM. Only valid for list_PM. Once specified, it will override use_top_pcs. Ignored when list_adata is specified.
use_top_pcs_PK (bool, optional (default: None)) – Use top-PCs-associated features for PK. Only valid for list_PK. Once specified, it will override use_top_pcs. Ignored when list_adata is specified.
copy (bool, optional (default: False)) – If True, it returns the graph file as a data frame.
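The way the None defaults described above resolve can be summarized in a small sketch. Both helper functions are hypothetical and simply restate the documented rules; they are not taken from SIMBA's source.

```python
def resolve_add_edge_weights(add_edge_weights, list_adata):
    """None means: True if list_adata is given, otherwise False."""
    if add_edge_weights is None:
        return list_adata is not None
    return add_edge_weights


def resolve_use_top_pcs(use_top_pcs, specific=None):
    """A relation-specific flag (e.g. use_top_pcs_CP) wins when it is set."""
    return use_top_pcs if specific is None else specific


# add_edge_weights left unset
weights_generic = resolve_add_edge_weights(None, list_adata=["adata_placeholder"])
weights_specific = resolve_add_edge_weights(None, list_adata=None)

# use_top_pcs=True globally, but switched off for CP only
flag_cp = resolve_use_top_pcs(True, specific=False)
flag_pm = resolve_use_top_pcs(True, specific=None)
```

An explicit True or False for add_edge_weights is returned unchanged in this sketch, regardless of which input lists are used.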

Returns:

If copy is True,
edges (pd.DataFrame) – The edges of the graph used for PBG training. Each line describes one edge and contains, separated by tabs, the identifier of the source entity, the relation type and the identifier of the target entity.
Updates .settings.pbg_params with the following parameters:
entity_path (str) – The path of the directory containing entity count files.
edge_paths (list) – A list of paths to directories containing (partitioned) edgelists. Typically a single path is provided.
entities (dict) – The entity types.
relations (list) – The relation types.
Updates .settings.graph_stats with the following parameters:
dirname (dict) – Statistics of the input graph.
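As a sketch of the edge format described above, the snippet below splits one tab-separated edge line into source, relation type and target. Treating an optional fourth field as the edge weight is an assumption about where the weight column sits; the function name and the sample identifiers are hypothetical.

```python
def parse_edge_line(line):
    """Split one tab-separated edge line into its fields.

    Returns (source, relation, target, weight); weight is None when the
    line has only three fields.
    """
    fields = line.rstrip("\n").split("\t")
    source, relation, target = fields[:3]
    # assumed: an optional fourth column holds the edge weight
    weight = float(fields[3]) if len(fields) > 3 else None
    return source, relation, target, weight


plain_edge = parse_edge_line("C.0\tr0\tG.5\n")
weighted_edge = parse_edge_line("C.0\tr0\tG.5\t2.0\n")
```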