Introduction

In this tutorial, we demonstrate how to use Portal to integrate Smart-seq2 (SS2) and 10X scRNA-seq datasets from the Tabula Muris project, taking the integration of two bladder datasets as an example.

Download data

The datasets can be downloaded from https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733 and should be saved in the folder "./Raw_data_TabulaMuris". Two datasets are needed: the 10X scRNA-seq dataset ("Single-cell RNA-seq data from microfluidic emulsion (v2)") and the SS2 scRNA-seq dataset ("Single-cell RNA-seq data from Smart-seq2 sequencing of FACS sorted cells (v2)"). The files need to be unzipped after downloading.
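Before running the preprocessing below, it can help to verify that the unzipped files ended up where the code expects them. A minimal sketch, using the paths that appear in the preprocessing code (the helper `missing_inputs` is ours, not part of Portal):

```python
import os

# Expected layout of "./Raw_data_TabulaMuris" after unzipping; these paths
# are taken from the preprocessing code in this tutorial
REQUIRED = [
    "Raw_data_TabulaMuris/FACS",
    "Raw_data_TabulaMuris/droplet",
    "Raw_data_TabulaMuris/annotations_facs.csv",
    "Raw_data_TabulaMuris/annotations_droplet.csv",
]

def missing_inputs(root=".", required=REQUIRED, exists=os.path.exists):
    """Return the required paths that are not present under `root`."""
    return [p for p in required if not exists(os.path.join(root, p))]
```

If `missing_inputs()` returns a non-empty list, re-check the download and extraction before proceeding.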

Preprocess

In [1]:
import os
import numpy as np
import pandas as pd
import scanpy as sc

sc.settings.verbosity = 3
sc.logging.print_header()

# Create a folder for saving preprocessed data
tissue = "Bladder"
data_path = "data/" + tissue
if not os.path.exists(data_path):
    os.makedirs(data_path)

path_SS2 = "./Raw_data_TabulaMuris/FACS"
path_10X = "./Raw_data_TabulaMuris/droplet"

# Read annotation
meta_SS2 = pd.read_csv("./Raw_data_TabulaMuris/annotations_facs.csv", 
                       keep_default_na=False)
meta_10X = pd.read_csv("./Raw_data_TabulaMuris/annotations_droplet.csv", 
                       keep_default_na=False)

# Read SS2 cell-by-gene counts
adata_SS2 = sc.read_csv(os.path.join(path_SS2, "%s-counts.csv" % tissue)).transpose()
# Drop ERCC spike-ins and keep only cells with a cell ontology annotation
ERCC_idx = adata_SS2.var.index.str.startswith('ERCC')
cell_idx = adata_SS2.obs.index.isin(meta_SS2[(meta_SS2.cell_ontology_class != 0) & 
                                             (meta_SS2.cell_ontology_class != "")].cell)
adata_SS2 = adata_SS2[cell_idx, ~ERCC_idx]

# Read 10X cell-by-gene counts, one channel at a time
channels = sorted(set(meta_10X[meta_10X.tissue == tissue].channel))
adata_list = []
for channel in channels:
    tmp = sc.read_10x_mtx(path_10X + '/%s-%s/' % (tissue, channel), 
                          var_names='gene_symbols', cache=False)
    # Prefix barcodes with the channel name and drop the trailing "-1" suffix
    tmp.obs.index = channel + "_" + tmp.obs.index
    tmp.obs.index = tmp.obs.index.map(lambda x: x[:-2])
    # Keep only cells with a cell ontology annotation
    cell_idx = tmp.obs.index.isin(meta_10X[(meta_10X.cell_ontology_class != 0) &
                                           (meta_10X.cell_ontology_class != "")].cell)
    adata_list.append(tmp[cell_idx, :])
adata_10X = adata_list[0]
for tmp in adata_list[1:]:
    adata_10X = adata_10X.concatenate(tmp, index_unique=None)

celltype_SS2 = meta_SS2[meta_SS2.cell.isin(adata_SS2.obs.index)][["cell", "cell_ontology_class"]].set_index("cell")
celltype_SS2["method"] = "SS2"
celltype_10X = meta_10X[meta_10X.cell.isin(adata_10X.obs.index)][["cell", "cell_ontology_class"]].set_index("cell")
celltype_10X["method"] = "10X"
meta = pd.concat([celltype_SS2, celltype_10X]).rename(columns={"cell_ontology_class": "celltype"})
meta.to_pickle(os.path.join(data_path, "meta_raw.pkl"))

sc.pp.filter_cells(adata_SS2, min_genes=200)
sc.pp.filter_genes(adata_SS2, min_cells=3)
sc.pp.filter_cells(adata_10X, min_genes=200)
sc.pp.filter_genes(adata_10X, min_cells=3)

meta = pd.read_pickle(os.path.join(data_path, "meta_raw.pkl"))
meta = meta.loc[list(adata_SS2.obs.index) + list(adata_10X.obs.index), ]
meta.to_pickle(os.path.join(data_path, "meta.pkl"))
scanpy==1.7.2 anndata==0.7.6 umap==0.4.6 numpy==1.19.2 scipy==1.6.2 pandas==1.1.5 scikit-learn==0.24.1 statsmodels==0.12.2 python-igraph==0.9.8 louvain==0.7.0
--> This might be very slow. Consider passing `cache=True`, which enables much faster reading from a cache file.
--> This might be very slow. Consider passing `cache=True`, which enables much faster reading from a cache file.
--> This might be very slow. Consider passing `cache=True`, which enables much faster reading from a cache file.
Trying to set attribute `.obs` of view, copying.
filtered out 6439 genes that are detected in less than 3 cells
filtered out 8048 genes that are detected in less than 3 cells
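The barcode handling in the 10X reading loop above (prefix with the channel, then strip the trailing "-1" that `read_10x_mtx` appends) can be isolated as a small helper to see what it does. A minimal sketch; the channel and barcode in the test are illustrative examples, not taken from this run:

```python
def rename_barcode(channel, barcode):
    """Prefix the channel name and drop the trailing '-1' suffix,
    mirroring the index manipulation in the 10X reading loop."""
    return (channel + "_" + barcode)[:-2]
```

This makes the renamed barcodes match the `cell` column of the droplet annotation file, which is what the subsequent `isin` filtering relies on.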

Integration using Portal (without tuning lambda_cos)

In [2]:
import portal

# Specify the GPU device
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Create a folder for saving results
result_path = "result"
if not os.path.exists(result_path):
    os.makedirs(result_path)
In [3]:
model = portal.model.Model(training_steps=1000)
model.preprocess(adata_SS2, adata_10X) # perform preprocessing and PCA
model.train() # train the model
model.eval() # get integrated latent representation of cells
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
Finding highly variable genes...
--> added
    'highly_variable', boolean vector (adata.var)
    'highly_variable_rank', float vector (adata.var)
    'means', float vector (adata.var)
    'variances', float vector (adata.var)
    'variances_norm', float vector (adata.var)
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
--> added
    'highly_variable', boolean vector (adata.var)
    'highly_variable_rank', float vector (adata.var)
    'means', float vector (adata.var)
    'variances', float vector (adata.var)
    'variances_norm', float vector (adata.var)
normalizing counts per cell
    finished (0:00:00)
Normalizing and scaling...
/home/bnmath/.local/lib/python3.7/site-packages/scanpy/preprocessing/_simple.py:845: UserWarning: Revieved a view of an AnnData. Making a copy.
  view_to_actual(adata)
normalizing counts per cell
    finished (0:00:00)
... as `zero_center=True`, sparse input is densified and may lead to large memory consumption
Dimensionality reduction via PCA...
Begining time:  Tue Nov 16 16:09:56 2021
step 0, loss_D=15.624022, loss_GAN=3.339816, loss_AE=396.819763, loss_cos=37.025600, loss_LA=273.675903
step 200, loss_D=8.727321, loss_GAN=8.370888, loss_AE=10.740265, loss_cos=4.466184, loss_LA=2.677769
step 400, loss_D=4.365205, loss_GAN=3.952873, loss_AE=5.610289, loss_cos=2.967794, loss_LA=1.313320
step 600, loss_D=4.292342, loss_GAN=4.257951, loss_AE=4.406325, loss_cos=2.386177, loss_LA=1.003102
step 800, loss_D=4.170739, loss_GAN=4.583523, loss_AE=3.566488, loss_cos=1.938652, loss_LA=0.706948
Ending time:  Tue Nov 16 16:10:24 2021
Training takes 27.85 seconds
Begining time:  Tue Nov 16 16:10:24 2021
Ending time:  Tue Nov 16 16:10:24 2021
Evaluating takes 0.05 seconds
In [4]:
# The integration result can be visualized by portal.utils.plot_UMAP
portal.utils.plot_UMAP(model.latent, meta, colors=["method", "celltype"], save=True, result_path=result_path)
UMAP(angular_rp_forest=True, local_connectivity=1, metric='correlation',
     min_dist=0.3, n_neighbors=30, random_state=1234, repulsion_strength=1,
     verbose=True)
Construct fuzzy simplicial set
Tue Nov 16 16:10:31 2021 Finding Nearest Neighbors
Tue Nov 16 16:10:33 2021 Finished Nearest Neighbor Search
Tue Nov 16 16:10:34 2021 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
... storing 'celltype' as categorical
... storing 'method' as categorical
Tue Nov 16 16:10:47 2021 Finished embedding

Integration using Portal (with tuning lambda_cos)

To accommodate different scenarios, our package supports tuning the parameter lambda_cos to achieve better mixing performance:
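Conceptually, the search performed with `search_cos=True` trains a model for each candidate lambda_cos value and keeps the setting with the lowest mixing metric. A minimal sketch of that selection loop; `train_and_score` is a hypothetical stand-in for training a Portal model and computing its mixing metric, and the grid matches the values tried in this run:

```python
def search_lambda_cos(train_and_score, grid=(15, 20, 25, 30, 35, 40, 45, 50)):
    """Try each lambda_cos in `grid` and return the value with the
    lowest mixing metric (lower means better mixing)."""
    best_lam, best_metric = None, float("inf")
    for lam in grid:
        metric = train_and_score(lam)
        if metric < best_metric:
            best_lam, best_metric = lam, metric
    return best_lam, best_metric
```

This is only an illustration of the selection logic; the actual search is handled internally by `portal.utils.integrate_datasets`.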

In [5]:
lowdim_list = portal.utils.preprocess_datasets([adata_SS2, adata_10X], hvg_num=4000, data_path=data_path)
integrated_data = portal.utils.integrate_datasets(lowdim_list, 
                                                  search_cos=True, 
                                                  training_steps=1000,
                                                  data_path=data_path)

portal.utils.plot_UMAP(integrated_data, meta, colors=["method", "celltype"], save=True, result_path=result_path)
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
Finding highly variable genes...
--> added
    'highly_variable', boolean vector (adata.var)
    'highly_variable_rank', float vector (adata.var)
    'means', float vector (adata.var)
    'variances', float vector (adata.var)
    'variances_norm', float vector (adata.var)
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
--> added
    'highly_variable', boolean vector (adata.var)
    'highly_variable_rank', float vector (adata.var)
    'means', float vector (adata.var)
    'variances', float vector (adata.var)
    'variances_norm', float vector (adata.var)
normalizing counts per cell
    finished (0:00:00)
Normalizing and scaling...
/home/bnmath/.local/lib/python3.7/site-packages/scanpy/preprocessing/_simple.py:845: UserWarning: Revieved a view of an AnnData. Making a copy.
  view_to_actual(adata)
normalizing counts per cell
    finished (0:00:00)
... as `zero_center=True`, sparse input is densified and may lead to large memory consumption
Dimensionality reduction via PCA...
Incrementally integrating 2 datasets...
Integrating the 2-th dataset to the 1-st dataset...
Begining time:  Tue Nov 16 16:10:50 2021
step 0, loss_D=15.624022, loss_GAN=3.339849, loss_AE=396.819611, loss_cos=27.769196, loss_LA=273.675873
step 200, loss_D=8.801355, loss_GAN=6.034001, loss_AE=10.459867, loss_cos=3.955286, loss_LA=2.455486
step 400, loss_D=7.501076, loss_GAN=5.960483, loss_AE=5.556993, loss_cos=2.508996, loss_LA=1.272639
step 600, loss_D=7.781531, loss_GAN=6.539917, loss_AE=4.266128, loss_cos=1.916743, loss_LA=0.897105
step 800, loss_D=7.512965, loss_GAN=6.151743, loss_AE=3.608524, loss_cos=1.953688, loss_LA=0.809379
Ending time:  Tue Nov 16 16:11:12 2021
Training takes 22.39 seconds
Begining time:  Tue Nov 16 16:11:12 2021
Ending time:  Tue Nov 16 16:11:12 2021
Evaluating takes 0.05 seconds
lambda_cos: 15.000000, mixing metric: 16.550284 

Begining time:  Tue Nov 16 16:11:17 2021
step 0, loss_D=15.624022, loss_GAN=3.339849, loss_AE=396.819611, loss_cos=37.025597, loss_LA=273.675873
step 200, loss_D=9.259478, loss_GAN=6.824213, loss_AE=10.820528, loss_cos=4.772138, loss_LA=2.750310
step 400, loss_D=4.437115, loss_GAN=4.083629, loss_AE=5.634187, loss_cos=2.972851, loss_LA=1.316755
step 600, loss_D=4.237143, loss_GAN=4.266238, loss_AE=4.401846, loss_cos=2.519335, loss_LA=1.000605
step 800, loss_D=4.190681, loss_GAN=4.427482, loss_AE=3.582652, loss_cos=1.904705, loss_LA=0.771611
Ending time:  Tue Nov 16 16:11:40 2021
Training takes 22.45 seconds
Begining time:  Tue Nov 16 16:11:40 2021
Ending time:  Tue Nov 16 16:11:40 2021
Evaluating takes 0.05 seconds
lambda_cos: 20.000000, mixing metric: 14.286359 

Begining time:  Tue Nov 16 16:11:45 2021
step 0, loss_D=15.624022, loss_GAN=3.339849, loss_AE=396.819611, loss_cos=46.281994, loss_LA=273.675873
step 200, loss_D=7.222836, loss_GAN=7.154135, loss_AE=11.274676, loss_cos=6.659029, loss_LA=3.633636
step 400, loss_D=3.145437, loss_GAN=2.247764, loss_AE=5.928292, loss_cos=3.422266, loss_LA=1.647744
step 600, loss_D=2.285192, loss_GAN=2.448932, loss_AE=4.404011, loss_cos=2.628450, loss_LA=0.960853
step 800, loss_D=2.155079, loss_GAN=2.500972, loss_AE=3.638551, loss_cos=2.240901, loss_LA=0.732962
Ending time:  Tue Nov 16 16:12:08 2021
Training takes 22.45 seconds
Begining time:  Tue Nov 16 16:12:08 2021
Ending time:  Tue Nov 16 16:12:08 2021
Evaluating takes 0.05 seconds
lambda_cos: 25.000000, mixing metric: 17.289582 

Begining time:  Tue Nov 16 16:12:13 2021
step 0, loss_D=15.624022, loss_GAN=3.339849, loss_AE=396.819611, loss_cos=55.538391, loss_LA=273.675873
step 200, loss_D=7.109078, loss_GAN=5.707469, loss_AE=11.326015, loss_cos=6.665643, loss_LA=3.872253
step 400, loss_D=3.617905, loss_GAN=4.567375, loss_AE=5.627323, loss_cos=3.276583, loss_LA=1.388671
step 600, loss_D=3.041740, loss_GAN=2.737360, loss_AE=4.567602, loss_cos=3.182228, loss_LA=1.243089
step 800, loss_D=2.137332, loss_GAN=2.539555, loss_AE=3.665162, loss_cos=2.545967, loss_LA=0.770632
Ending time:  Tue Nov 16 16:12:37 2021
Training takes 24.22 seconds
Begining time:  Tue Nov 16 16:12:37 2021
Ending time:  Tue Nov 16 16:12:37 2021
Evaluating takes 0.05 seconds
lambda_cos: 30.000000, mixing metric: 14.001289 

Begining time:  Tue Nov 16 16:12:43 2021
step 0, loss_D=15.624022, loss_GAN=3.339849, loss_AE=396.819611, loss_cos=64.794792, loss_LA=273.675873
step 200, loss_D=8.041717, loss_GAN=5.589967, loss_AE=11.428196, loss_cos=7.264814, loss_LA=3.728649
step 400, loss_D=2.230929, loss_GAN=2.534016, loss_AE=5.687901, loss_cos=3.515642, loss_LA=1.420084
step 600, loss_D=2.209575, loss_GAN=2.661130, loss_AE=4.439455, loss_cos=3.163093, loss_LA=1.028828
step 800, loss_D=2.027751, loss_GAN=3.134364, loss_AE=3.736992, loss_cos=2.653662, loss_LA=0.881943
Ending time:  Tue Nov 16 16:13:05 2021
Training takes 22.50 seconds
Begining time:  Tue Nov 16 16:13:05 2021
Ending time:  Tue Nov 16 16:13:05 2021
Evaluating takes 0.05 seconds
lambda_cos: 35.000000, mixing metric: 14.312274 

Begining time:  Tue Nov 16 16:13:11 2021
step 0, loss_D=15.624022, loss_GAN=3.339849, loss_AE=396.819611, loss_cos=74.051193, loss_LA=273.675873
step 200, loss_D=5.145994, loss_GAN=5.475229, loss_AE=11.031952, loss_cos=7.149979, loss_LA=3.092530
step 400, loss_D=2.203688, loss_GAN=2.662040, loss_AE=5.656063, loss_cos=3.500462, loss_LA=1.390441
step 600, loss_D=1.943552, loss_GAN=2.811253, loss_AE=4.377134, loss_cos=3.111832, loss_LA=0.968454
step 800, loss_D=1.937862, loss_GAN=3.054844, loss_AE=3.728965, loss_cos=2.859192, loss_LA=0.761281
Ending time:  Tue Nov 16 16:13:33 2021
Training takes 22.45 seconds
Begining time:  Tue Nov 16 16:13:33 2021
Ending time:  Tue Nov 16 16:13:33 2021
Evaluating takes 0.06 seconds
lambda_cos: 40.000000, mixing metric: 14.620810 

Begining time:  Tue Nov 16 16:13:38 2021
step 0, loss_D=15.624022, loss_GAN=3.339849, loss_AE=396.819611, loss_cos=83.307587, loss_LA=273.675873
step 200, loss_D=6.044596, loss_GAN=6.885673, loss_AE=10.483496, loss_cos=6.144061, loss_LA=2.656277
step 400, loss_D=6.149721, loss_GAN=6.866948, loss_AE=5.733706, loss_cos=3.753596, loss_LA=1.516103
step 600, loss_D=5.795425, loss_GAN=6.995213, loss_AE=4.341867, loss_cos=2.713069, loss_LA=0.872775
step 800, loss_D=5.904902, loss_GAN=6.883029, loss_AE=3.698261, loss_cos=2.498075, loss_LA=0.799929
Ending time:  Tue Nov 16 16:14:01 2021
Training takes 22.75 seconds
Begining time:  Tue Nov 16 16:14:01 2021
Ending time:  Tue Nov 16 16:14:01 2021
Evaluating takes 0.04 seconds
lambda_cos: 45.000000, mixing metric: 19.173672 

Begining time:  Tue Nov 16 16:14:07 2021
step 0, loss_D=15.624022, loss_GAN=3.339849, loss_AE=396.819611, loss_cos=92.563988, loss_LA=273.675873
step 200, loss_D=6.041385, loss_GAN=6.827381, loss_AE=10.509405, loss_cos=6.129251, loss_LA=2.696722
step 400, loss_D=6.318241, loss_GAN=6.757929, loss_AE=5.762877, loss_cos=3.816188, loss_LA=1.561063
step 600, loss_D=5.813315, loss_GAN=7.001161, loss_AE=4.357763, loss_cos=2.958264, loss_LA=0.898465
step 800, loss_D=5.631570, loss_GAN=7.064974, loss_AE=3.649507, loss_cos=2.415012, loss_LA=0.722382
Ending time:  Tue Nov 16 16:14:29 2021
Training takes 22.53 seconds
Begining time:  Tue Nov 16 16:14:29 2021
Ending time:  Tue Nov 16 16:14:29 2021
Evaluating takes 0.05 seconds
lambda_cos: 50.000000, mixing metric: 17.809954 

UMAP(angular_rp_forest=True, local_connectivity=1, metric='correlation',
     min_dist=0.3, n_neighbors=30, random_state=1234, repulsion_strength=1,
     verbose=True)
Construct fuzzy simplicial set
Tue Nov 16 16:14:42 2021 Finding Nearest Neighbors
Tue Nov 16 16:14:42 2021 Finished Nearest Neighbor Search
Tue Nov 16 16:14:42 2021 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
... storing 'celltype' as categorical
... storing 'method' as categorical
Tue Nov 16 16:14:54 2021 Finished embedding
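As a quick sanity check on the search log above, the reported mixing metrics can be collected and the winning lambda_cos read off directly (lowest metric wins):

```python
# Mixing metrics reported in the lambda_cos search log above
metrics = {15: 16.550284, 20: 14.286359, 25: 17.289582, 30: 14.001289,
           35: 14.312274, 40: 14.620810, 45: 19.173672, 50: 17.809954}

# The candidate with the smallest mixing metric is selected
best_lambda = min(metrics, key=metrics.get)
```

In this run, lambda_cos = 30 gives the lowest mixing metric (14.001289), so its embedding is the one returned and plotted.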