About SIMBA

SIMBA (SIngle-cell eMBedding Along with features) is a graph embedding method that jointly embeds single cells and their defining features, such as genes, chromatin-accessible regions, and DNA sequences, into a common latent space. SIMBA explicitly learns low-dimensional representations of both cells and features, and thereby enables clustering-free marker discovery, batch effect removal, and multi-omics integration. Importantly, SIMBA introduces several crucial procedures, including a Softmax transformation, weight decay for controlling overfitting, and entity-type constraints, to generate comparable embeddings (co-embeddings) of cells and features and to address challenges unique to single-cell data.

SIMBA first encodes different types of entities such as cells, genes, open chromatin regions (peaks or bins), transcription factor (TF) motifs, and k-mers (short sequences of a specific length, k), into a single graph, where each node represents an individual entity and edges indicate relations between entities. Unlike existing methods that primarily focus on learning cell states, SIMBA treats both cells and features as equal nodes in the same graph.

In SIMBA, edges may be added in two ways: 1) measured experimentally; 2) inferred computationally. For experimentally measured edges, each cell-feature edge corresponds to a single-cell measurement (e.g., the expression value of a gene or a chromatin-accessible peak observed in a cell). For example, if a gene is expressed in a cell, an edge is created between the gene and the cell, and the weight of this edge is determined by the gene expression level. Similarly, an edge is added between a cell and a chromatin region if the region is open in that cell. Edges are also allowed between different features to model the underlying regulatory mechanisms. For example, an edge between a chromatin region and a TF motif (or k-mer) captures the notion that a TF may bind to a regulatory region containing a specific DNA sequence. Edges that cannot be measured directly are inferred computationally by summarizing features of the same or different types. Each such edge between cells of different batches or modalities indicates functional or structural similarity between those cells.
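The experimentally measured edges described above can be illustrated with a minimal sketch. This is not SIMBA's own API: the function `build_edges`, the relation labels, and the toy matrices are hypothetical names chosen for illustration, and using the raw expression value as the edge weight is a simplifying assumption.

```python
from typing import List, Sequence, Tuple

# (source node, relation label, destination node, weight)
Edge = Tuple[str, str, str, float]


def build_edges(
    matrix: Sequence[Sequence[float]],
    row_ids: Sequence[str],
    col_ids: Sequence[str],
    relation: str,
    binary: bool = False,
) -> List[Edge]:
    """Create one edge per non-zero entry of a rows-by-columns matrix.

    Hypothetical helper: rows are source entities (e.g. cells), columns are
    destination entities (e.g. genes or peaks). If `binary` is True, every
    edge gets weight 1.0; otherwise the matrix value is used as the weight.
    """
    edges: List[Edge] = []
    for i, src in enumerate(row_ids):
        for j, dst in enumerate(col_ids):
            value = matrix[i][j]
            if value > 0:
                weight = 1.0 if binary else float(value)
                edges.append((src, relation, dst, weight))
    return edges


# Toy data (made up for illustration only).
cells = ["cell_0", "cell_1"]
genes = ["gene_A", "gene_B"]
peaks = ["peak_0", "peak_1"]
motifs = ["motif_X"]

expression = [[2.0, 0.0], [0.0, 5.0]]  # cells x genes
accessibility = [[1, 1], [0, 1]]       # cells x peaks (open = 1)
motif_hits = [[1], [0]]                # peaks x motifs (motif present = 1)

# Cell-gene edges weighted by expression level.
cell_gene_edges = build_edges(expression, cells, genes, "expresses")
# Cell-peak edges: present only if the region is open in the cell.
cell_peak_edges = build_edges(accessibility, cells, peaks, "accessible", binary=True)
# Feature-feature edges: a peak containing a TF motif.
peak_motif_edges = build_edges(motif_hits, peaks, motifs, "contains", binary=True)

# All entity types end up in one edge list, i.e. a single graph.
graph_edges = cell_gene_edges + cell_peak_edges + peak_motif_edges
```

Keeping the relation label on each edge makes it possible to treat different entity-type pairs differently during training, which is one way to think about the entity-type constraints mentioned earlier.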

Once the input graph is constructed, SIMBA applies a multi-entity graph embedding algorithm as well as a Softmax-based transformation to embed the nodes/entities into a common low-dimensional space wherein cells and features are comparable and can be analyzed based on their distance. Graph construction is inherently flexible, enabling SIMBA to be applied to a wide variety of single-cell tasks.
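The text does not give the formula for the Softmax-based transformation, so the following is only a sketch of one plausible reading: a feature is re-expressed as a weighted average of cell embeddings, with weights given by a softmax over feature-cell dot products. The function names, the `temperature` parameter, and the toy vectors are assumptions made for illustration and are not taken from SIMBA.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with a temperature (assumed parameter)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def project_feature(
    feature_vec: Sequence[float],
    cell_vecs: Sequence[Sequence[float]],
    temperature: float = 1.0,
) -> List[float]:
    """Place a feature in the cells' space as a softmax-weighted cell average."""
    weights = softmax([dot(feature_vec, c) for c in cell_vecs], temperature)
    dim = len(cell_vecs[0])
    return [sum(w * c[k] for w, c in zip(weights, cell_vecs)) for k in range(dim)]


# Toy 2-D embeddings (made up for illustration only).
cell_embeddings = [[1.0, 0.0], [0.0, 1.0]]

# A feature equally similar to both cells lands midway between them.
shared_feature = project_feature([1.0, 1.0], cell_embeddings)

# A feature strongly associated with the first cell lands next to it.
marker_feature = project_feature([10.0, 0.0], cell_embeddings)

# Once cells and features are comparable, distances can be used directly,
# e.g. to find the cell closest to a feature.
distances = [math.dist(marker_feature, c) for c in cell_embeddings]
nearest_cell_index = distances.index(min(distances))
```

Under this reading, projected features always lie within the convex hull of the cell embeddings, which is what makes distances between cells and features directly interpretable.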

Overall, SIMBA is versatile and can accommodate features from various domains as long as they can be encoded into a connected graph. It provides a single generalizable framework in which diverse single-cell problems can be formulated in a unified way, which simplifies the development of new analyses and the extension to new single-cell modalities and tasks.