SCE dataset
Single-cell explorer (SCE) requires several files to be present for each dataset:
dataset.json (required) contains a description of the dataset.
plot_data.json (required) contains calculated annotations for every cell (like clustering and tSNE coordinates) as well as precalculated annotations (like cluster borders).
exp_data.json contains gene names and cell barcodes in the same order as they appear in the expression matrix, as well as the total number of UMIs in each cell.
data.h5 is an expression (count) matrix. HDF5 stores counts efficiently: since the explorer mostly needs to look up the expression of a single gene across the dataset, and HDF5 can compress columns of integers effectively.
markers.json is a JSON file describing gene expression markers (optional).
files is a directory where you can put any additional files of your choice (optional).
dataset.json
This file is the main file of your dataset in the file system.
SCE goes through the provided folder, looks for all folders that contain this file, and treats every valid dataset.json as a descriptor of an SCE dataset.
This file contains several fields:
token - ID of your dataset in SCE. All datasets in SCE must have different IDs. (required)
name - Name of your dataset, shown if the dataset is displayed on the front page. (optional, default value is token)
description - Description of your dataset. (optional, default value is "")
link - A link to your dataset (GEO, SRA or publication), if you can provide one. (optional, default value is "")
organism - Abbreviation of the species (without genome version), like hs or mm. (optional, default value is "")
cells - Number of cells in the dataset. (optional, default is 0)
public - If the dataset is listed as public, it will be shown on the main page. (optional, default is false)
curated - If the dataset is listed as public and as curated, it will be shown on the main page as curated. (optional, default is false)
Example of a valid dataset.json:
{
"token": "HCA_hematopoiesis",
"name": "HCA: Profiling of CD34+ cells from human bone marrow to understand hematopoiesis",
"description": "Differentiation is among the most fundamental processes in cell biology. Single cell RNA-seq studies have demonstrated that differentiation is a continuous process and in particular cell states are observed to reside on largely continuous spaces. We have developed Palantir, a graph based algorithm to model continuities in cell state transitions and cell fate choices. Modeling differentiation as a Markov chain, Palantir determines probabilities of reaching terminal states from cells in each intermediate state. The entropy of these probabilities represent the differentiation potential of the cell in the corresponding state. Applied to single cell RNA-seq dataset of CD34+ hematopoietic cells from human bone marrows, Palantir accurately identified key events leading up to cell fate commitment. Integration with ATAC-seq data from bulk sorted populations helped identify key regulators that correlate with cell fate specification and commitment.",
"link": "https://data.humancellatlas.org/explore/projects/091cf39b-01bc-42e5-9437-f419a66c8a45",
"organism": "hs",
"cells": 33829,
"public": true,
"curated": true
}
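The defaults listed for the optional fields can be made explicit when a descriptor is read. The following is a minimal Python sketch and not SCE's actual loader: the function name load_descriptor is hypothetical, and rejecting a descriptor without a token is an assumption based on that field being marked as required.

```python
import json

# Defaults taken from the field list above; `name` falls back to `token`.
DEFAULTS = {
    "description": "",
    "link": "",
    "organism": "",
    "cells": 0,
    "public": False,
    "curated": False,
}


def load_descriptor(text):
    """Parse the text of a dataset.json file and fill in optional fields."""
    raw = json.loads(text)
    if not raw.get("token"):
        # Assumption: a descriptor without a token is not a valid dataset.
        raise ValueError("dataset.json must contain a non-empty 'token'")
    descriptor = dict(DEFAULTS)
    descriptor.update(raw)
    descriptor.setdefault("name", descriptor["token"])
    return descriptor


example = load_descriptor('{"token": "HCA_hematopoiesis", "organism": "hs"}')
print(example["name"], example["cells"], example["public"])
```

With only token and organism given, name takes the value of token and the remaining fields take their documented defaults.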
plot_data.json
Plot data consists of three parts: fields, data and annotations.
fields
Fields describe which fields are present for each cell, what type they are (numeric or factor) and their value range or factor levels.
Currently we only support numeric and factor variables. An example of a valid fields structure might look like:
{
"tSNE_1": {
"type": "numeric",
"range": [
-68.8883,
66.9387
]
},
"tSNE_2": {
"type": "numeric",
"range": [
-54.9217,
70.4228
]
},
"UMAP_1": {
"type": "numeric",
"range": [
-12.9358,
13.655
]
},
"UMAP_2": {
"type": "numeric",
"range": [
-15.3023,
16.4768
]
},
"Cluster": {
"type": "factor",
"levels": [
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
"10",
"11",
"12",
"13",
"14",
"15",
"16",
"17",
"18",
"19"
]
},
"nUmi": {
"type": "numeric",
"range": [
669818,
1158590
]
},
"nGene": {
"type": "numeric",
"range": [
353,
11159
]
},
"nUmiLog2": {
"type": "numeric",
"range": [
19.3534,
20.1439
]
},
"nGeneLog2": {
"type": "numeric",
"range": [
8.4635,
13.4459
]
}
}
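A fields structure of this shape can be derived from the per-cell records themselves. The sketch below is one way to do it in Python, under stated assumptions: the helper name describe_fields is hypothetical, any non-numeric value is treated as a factor, and factor levels are kept in order of first appearance (the example above lists cluster levels in numeric order instead).

```python
def describe_fields(records, names):
    """Build a `fields` structure for the chosen keys of per-cell records."""
    fields = {}
    for name in names:
        values = [record[name] for record in records]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        if is_numeric:
            fields[name] = {"type": "numeric", "range": [min(values), max(values)]}
        else:
            # dict.fromkeys removes duplicates while preserving order.
            levels = list(dict.fromkeys(str(v) for v in values))
            fields[name] = {"type": "factor", "levels": levels}
    return fields


cells = [
    {"tSNE_1": 64.3251, "Cluster": "6", "nGene": 6899},
    {"tSNE_1": -10.7318, "Cluster": "10", "nGene": 3142},
    {"tSNE_1": -41.7568, "Cluster": "0", "nGene": 3802},
]
fields = describe_fields(cells, ["tSNE_1", "Cluster", "nGene"])
print(fields)
```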
data
Data simply contains information about every cell in the dataset. Data may contain extra fields (that are not present in fields); however, these fields won't show up in the explorer.
An example of a valid data field is shown below:
[
{
"tSNE_1": 64.3251,
"tSNE_2": 11.3943,
"UMAP_1": 1.3749,
"UMAP_2": 16.074,
"Cluster": "6",
"nUmi": 890612,
"nGene": 6899,
"nUmiLog2": 19.7644,
"nGeneLog2": 12.7522,
"_row": "00ca0d37-b787-41a4-be59-2aff5b13b0bd"
},
{
"tSNE_1": -10.7318,
"tSNE_2": 8.2139,
"UMAP_1": 1.8595,
"UMAP_2": -2.9429,
"Cluster": "10",
"nUmi": 939514,
"nGene": 3142,
"nUmiLog2": 19.8416,
"nGeneLog2": 11.6175,
"_row": "0103aed0-29c2-4b29-a02a-2b58036fe875"
},
{
"tSNE_1": -41.7568,
"tSNE_2": -31.4512,
"UMAP_1": -6.6463,
"UMAP_2": 4.0747,
"Cluster": "0",
"nUmi": 918941,
"nGene": 3802,
"nUmiLog2": 19.8096,
"nGeneLog2": 11.8925,
"_row": "01a5dd09-db87-47ac-be78-506c690c4efc"
},
...
]
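To see which keys of a record would be visible in the explorer, the record can be compared with the declared fields. This is a small Python sketch of that comparison only; the helper name split_record is hypothetical and the fields dictionary is shortened for illustration.

```python
def split_record(record, fields):
    """Separate one cell record into declared values and extra keys."""
    shown = {key: value for key, value in record.items() if key in fields}
    extra = sorted(key for key in record if key not in fields)
    return shown, extra


declared = {
    "tSNE_1": {"type": "numeric", "range": [-68.8883, 66.9387]},
    "Cluster": {"type": "factor", "levels": ["0", "6", "10"]},
}
record = {
    "tSNE_1": 64.3251,
    "Cluster": "6",
    "_row": "00ca0d37-b787-41a4-be59-2aff5b13b0bd",
}
shown, extra = split_record(record, declared)
print(shown, extra)
```

Here _row is not declared in fields, so it ends up in the list of extra keys.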
annotations
Annotations are usually shown on top of the plot. For that you need to tell SCE the type of the annotation (text, polygon or arrows), which fields to use as coordinates (coords) and the coordinates of the annotation. The value field is used to take the actual value from the data.
Below is an example of valid annotations:
{
"tsne_Cluster_centers": {
"type": "text",
"value": "Cluster",
"coords": [
"tSNE_1",
"tSNE_2"
],
"data": [
{
"Cluster": "0",
"tSNE_1": -35.5756,
"tSNE_2": -32.0876,
"Text": "0"
},
{
"Cluster": "1",
"tSNE_1": -0.7239,
"tSNE_2": -17.0211,
"Text": "1"
},
...
]
},
"tsne_Cluster_borders": {
"type": "polygon",
"value": "group",
"coords": [
"tSNE_1",
"tSNE_2"
],
"data": [
{
"tSNE_1": -1.3811,
"tSNE_2": -56.3444,
"Cluster": "13",
"group": "gr1_1"
},
{
"tSNE_1": -2.8141,
"tSNE_2": -55.6834,
"Cluster": "13",
"group": "gr1_1"
},
{
"tSNE_1": -4.2503,
"tSNE_2": -55.0224,
"Cluster": "13",
"group": "gr1_1"
},
...
]
}
}
Text will just appear on top of the plot. Polygon vertices are connected one by one in the order they are listed in the data field.
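The ordering rule for polygons can be illustrated with a short Python sketch. It assumes that the key named by value (group in the example above) identifies the polygon a vertex belongs to; that reading comes from the example and is not confirmed elsewhere in this document, and the helper name polygons_from_annotation is hypothetical.

```python
def polygons_from_annotation(annotation):
    """Group polygon vertices by the `value` key, keeping the listed order."""
    x_key, y_key = annotation["coords"]
    group_key = annotation["value"]
    polygons = {}
    for point in annotation["data"]:
        vertex = (point[x_key], point[y_key])
        polygons.setdefault(point[group_key], []).append(vertex)
    return polygons


borders = {
    "type": "polygon",
    "value": "group",
    "coords": ["tSNE_1", "tSNE_2"],
    "data": [
        {"tSNE_1": -1.3811, "tSNE_2": -56.3444, "Cluster": "13", "group": "gr1_1"},
        {"tSNE_1": -2.8141, "tSNE_2": -55.6834, "Cluster": "13", "group": "gr1_1"},
        {"tSNE_1": -4.2503, "tSNE_2": -55.0224, "Cluster": "13", "group": "gr1_1"},
    ],
}
polygons = polygons_from_annotation(borders)
print(polygons)
```

Each list of vertices can then be drawn by connecting consecutive points in order.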
The arrows type is somewhat special. It has data_start and data_end fields that specify arrow coordinates. Arrows were mostly designed to show RNA velocity annotation on top of the plot. A valid JSON object for the arrows type will look something like:
{
"type": "arrows",
"coords": [
"UMAP_1",
"UMAP_2"
],
"data_start": [
{
"UMAP_1": -8.6807,
"UMAP_2": -6.522
},
{
"UMAP_1": -8.6807,
"UMAP_2": -6.0223
},
...
],
"data_end": [
{
"UMAP_1": -8.4769,
"UMAP_2": -7.129
},
{
"UMAP_1": -8.426,
"UMAP_2": -6.8369
},
...
]
}
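One way to read such an object is to pair the entries of data_start and data_end by position. The Python sketch below does that; pairing by index is an assumption (the format above does not state it explicitly), and the helper name arrow_segments is hypothetical.

```python
def arrow_segments(annotation):
    """Pair start and end points by position into (start, end) segments."""
    x_key, y_key = annotation["coords"]
    starts = annotation["data_start"]
    ends = annotation["data_end"]
    if len(starts) != len(ends):
        raise ValueError("data_start and data_end must have the same length")
    return [
        ((start[x_key], start[y_key]), (end[x_key], end[y_key]))
        for start, end in zip(starts, ends)
    ]


velocity = {
    "type": "arrows",
    "coords": ["UMAP_1", "UMAP_2"],
    "data_start": [
        {"UMAP_1": -8.6807, "UMAP_2": -6.522},
        {"UMAP_1": -8.6807, "UMAP_2": -6.0223},
    ],
    "data_end": [
        {"UMAP_1": -8.4769, "UMAP_2": -7.129},
        {"UMAP_1": -8.426, "UMAP_2": -6.8369},
    ],
}
segments = arrow_segments(velocity)
print(segments)
```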
Whole file
A valid plot_data.json file will look something like:
{
"fields": {
"tSNE_1": {
"type": "numeric",
"range": [
-68.8883,
66.9387
]
},
"tSNE_2": {
"type": "numeric",
"range": [
-54.9217,
70.4228
]
},
"UMAP_1": {
"type": "numeric",
"range": [
-12.9358,
13.655
]
},
"UMAP_2": {
"type": "numeric",
"range": [
-15.3023,
16.4768
]
},
"Cluster": {
"type": "factor",
"levels": [
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
"10",
"11",
"12",
"13",
"14",
"15",
"16",
"17",
"18",
"19"
]
},
"nUmi": {
"type": "numeric",
"range": [
669818,
1158590
]
},
"nGene": {
"type": "numeric",
"range": [
353,
11159
]
},
"nUmiLog2": {
"type": "numeric",
"range": [
19.3534,
20.1439
]
},
"nGeneLog2": {
"type": "numeric",
"range": [
8.4635,
13.4459
]
}
},
"data": [
{
"tSNE_1": 64.3251,
"tSNE_2": 11.3943,
"UMAP_1": 1.3749,
"UMAP_2": 16.074,
"Cluster": "6",
"nUmi": 890612,
"nGene": 6899,
"nUmiLog2": 19.7644,
"nGeneLog2": 12.7522,
"_row": "00ca0d37-b787-41a4-be59-2aff5b13b0bd"
},
{
"tSNE_1": -10.7318,
"tSNE_2": 8.2139,
"UMAP_1": 1.8595,
"UMAP_2": -2.9429,
"Cluster": "10",
"nUmi": 939514,
"nGene": 3142,
"nUmiLog2": 19.8416,
"nGeneLog2": 11.6175,
"_row": "0103aed0-29c2-4b29-a02a-2b58036fe875"
},
{
"tSNE_1": -41.7568,
"tSNE_2": -31.4512,
"UMAP_1": -6.6463,
"UMAP_2": 4.0747,
"Cluster": "0",
"nUmi": 918941,
"nGene": 3802,
"nUmiLog2": 19.8096,
"nGeneLog2": 11.8925,
"_row": "01a5dd09-db87-47ac-be78-506c690c4efc"
},
...
],
"annotations": {
"tsne_Cluster_centers": {
"type": "text",
"value": "Cluster",
"coords": [
"tSNE_1",
"tSNE_2"
],
"data": [
{
"Cluster": "0",
"tSNE_1": -35.5756,
"tSNE_2": -32.0876,
"Text": "0"
},
{
"Cluster": "1",
"tSNE_1": -0.7239,
"tSNE_2": -17.0211,
"Text": "1"
},
...
]
},
"tsne_Cluster_borders": {
"type": "polygon",
"value": "group",
"coords": [
"tSNE_1",
"tSNE_2"
],
"data": [
{
"tSNE_1": -1.3811,
"tSNE_2": -56.3444,
"Cluster": "13",
"group": "gr1_1"
},
{
"tSNE_1": -2.8141,
"tSNE_2": -55.6834,
"Cluster": "13",
"group": "gr1_1"
},
{
"tSNE_1": -4.2503,
"tSNE_2": -55.0224,
"Cluster": "13",
"group": "gr1_1"
},
...
]
}
}
}
exp_data.json
This file simply contains gene names, cell barcodes/names and the total UMI count per cell. It must reflect the row and column names of the matrix in data.h5, which contains the expression data.
{
"genes": ["TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2", ...],
"barcodes": ["00ca0d37-b787-41a4-be59-2aff5b13b0bd","0103aed0-29c2-4b29-a02a-2b58036fe875", ... ],
"totalCounts": [890612, 939514, ...]
}
When a user queries the expression of gene CD14 in the dataset, we first find the index of this gene in the genes array, and then ask the server for the expression of the gene with this ID.
I.e. on the client side we would do something like
let geneId = expData.genes.indexOf("CD14"); // -1 if the gene is not in the dataset
// request the expression of geneId from the server and store it in geneExpression
It is very important that exp_data.json is consistent with data.h5.
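The lookup and the consistency requirement can be sketched in Python as follows. Both helper names (gene_index and matches_matrix) are hypothetical, and the matrix dimensions are passed in as plain numbers rather than read from data.h5. Note that the lookup is case-sensitive, so the queried name must match the spelling in genes exactly.

```python
def gene_index(exp_data, gene):
    """Return the position of `gene` in the genes array, or -1 if absent."""
    try:
        return exp_data["genes"].index(gene)
    except ValueError:
        return -1


def matches_matrix(exp_data, n_rows, n_cols):
    """Check exp_data against a matrix with n_rows genes and n_cols cells."""
    return (
        len(exp_data["genes"]) == n_rows
        and len(exp_data["barcodes"]) == n_cols
        and len(exp_data["totalCounts"]) == n_cols
    )


exp_data = {
    "genes": ["TSPAN6", "DPM1", "SCYL3"],
    "barcodes": ["cell_a", "cell_b"],  # made-up barcodes for illustration
    "totalCounts": [890612, 939514],
}
print(gene_index(exp_data, "DPM1"), gene_index(exp_data, "Dpm1"))
print(matches_matrix(exp_data, 3, 2))
```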
data.h5
This is an HDF5 file that stores gene expression values. We store count data since integers can be compressed efficiently even for large datasets. HDF5 files also support chunked compression. We set the chunk size to hold exactly the expression of one gene, which allows expression values to be retrieved quickly.
Since we keep only counts on the server, we require totalCounts to be provided in the exp_data.json file so that expression values can be normalized on the client side.
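As an illustration of why totalCounts is needed, the Python sketch below normalizes the raw counts of one gene by the total counts of each cell. The exact formula used by SCE is not given in this document; log2(1 + count * scale / total) with a scale of 10000 is a common choice and is only an assumption here, as is the function name normalize.

```python
import math


def normalize(counts, total_counts, scale=10_000):
    """Scale raw counts by per-cell totals, then apply log2(1 + x)."""
    normalized = []
    for count, total in zip(counts, total_counts):
        if total > 0:
            normalized.append(math.log2(1 + count * scale / total))
        else:
            # A cell without any counts has no expression to normalize.
            normalized.append(0.0)
    return normalized


gene_counts = [0, 1, 3]
total_counts = [10_000, 10_000, 10_000]
values = normalize(gene_counts, total_counts)
print(values)
```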
This HDF5 file contains only one dataset, called expression/mat, which holds the expression values (rows are genes and columns are samples).
Usually we create such files in R; below is a snippet showing how we do it:
## assumed packages: rhdf5 for the h5* functions, jsonlite for toJSON
## and Matrix for colSums on sparse matrices
library(rhdf5)
library(jsonlite)
library(Matrix)

## let's assume `counts` is a sparse count matrix of our scRNA-seq experiment
newH5File <- file.path("data.h5")
h5createFile(newH5File)
h5createGroup(newH5File, "expression")

## here we set the chunk size (one gene per chunk) and the compression level
h5createDataset(newH5File, "expression/mat", c(nrow(counts), ncol(counts)),
                storage.mode = "integer", chunk = c(1, ncol(counts)),
                level = 9)

## unname() keeps totalCounts a plain array whichever JSON package is used
write(toJSON(list(
  "genes" = rownames(counts),
  "barcodes" = colnames(counts),
  "totalCounts" = unname(colSums(counts))
)), file.path("exp_data.json"))

## here we write the table in batches of 500 genes:
## h5write doesn't support sparse matrices, and writing the dataset
## without batches might take a lot of memory for large datasets
chunk_size <- 500
start <- 1
totalGenes <- nrow(counts)
while (start <= totalGenes) {
  end <- min(start + chunk_size - 1, totalGenes)
  ## drop = FALSE keeps a matrix even when the batch holds a single gene
  h5write(as.matrix(counts[start:end, , drop = FALSE]),
          newH5File, "expression/mat",
          index = list(start:end, 1:ncol(counts)))
  start <- start + chunk_size
}
h5closeAll()
markers.json
This is a list of data.frames produced by Seurat FindAllMarkers, converted to JSON. Sometimes we would like to show more than one marker table. For example, when we have performed clustering at different resolutions and identified markers for each resolution, we would like to have all of the tables in the same place.
Below is an example of a valid markers.json file:
{
"Cluster_0.6": [
{
"X": "CEL",
"p_val": 0,
"avg_logFC": 1.4697,
"pct.1": 0.961,
"pct.2": 0.071,
"p_val_adj": 0,
"cluster": 0,
"gene": "CEL"
},
{
"X": "PRSS3P1",
"p_val": 7.5714e-304,
"avg_logFC": 2.1996,
"pct.1": 0.968,
"pct.2": 0.082,
"p_val_adj": 2.4281e-299,
"cluster": 0,
"gene": "PRSS3P1"
},
{
"X": "AQP8",
"p_val": 1.6988e-303,
"avg_logFC": 3.3761,
"pct.1": 0.744,
"pct.2": 0.026,
"p_val_adj": 5.4479e-299,
"cluster": 0,
"gene": "AQP8"
},
{
"X": "AZGP1",
"p_val": 1.4416e-291,
"avg_logFC": 2.8036,
"pct.1": 0.872,
"pct.2": 0.059,
"p_val_adj": 4.6231e-287,
"cluster": 0,
"gene": "AZGP1"
},
{
"X": "ANPEP",
"p_val": 4.769e-291,
"avg_logFC": 2.8413,
"pct.1": 0.94,
"pct.2": 0.082,
"p_val_adj": 1.5294e-286,
"cluster": 0,
"gene": "ANPEP"
},
...
],
"Cluster_0.8": [
{
"X": "RARRES2",
"p_val": 2.255e-278,
"avg_logFC": 2.6058,
"pct.1": 0.918,
"pct.2": 0.081,
"p_val_adj": 7.2317e-274,
"cluster": 0,
"gene": "RARRES2"
},
{
"X": "CPA2",
"p_val": 3.0854e-274,
"avg_logFC": 2.6125,
"pct.1": 1,
"pct.2": 0.117,
"p_val_adj": 9.8946e-270,
"cluster": 0,
"gene": "CPA2"
},
{
"X": "AMY2A",
"p_val": 5.5019e-272,
"avg_logFC": 4.3333,
"pct.1": 0.918,
"pct.2": 0.103,
"p_val_adj": 1.7644e-267,
"cluster": 0,
"gene": "AMY2A"
},
{
"X": "AMY1C",
"p_val": 3.521e-265,
"avg_logFC": 2.4203,
"pct.1": 0.566,
"pct.2": 0.008,
"p_val_adj": 1.1291e-260,
"cluster": 0,
"gene": "AMY1C"
},
{
"X": "AQP12A",
"p_val": 6.0698e-264,
"avg_logFC": 2.4958,
"pct.1": 0.683,
"pct.2": 0.027,
"p_val_adj": 1.9465e-259,
"cluster": 0,
"gene": "AQP12A"
},
{
"X": "PNLIPRP1",
"p_val": 2.9289e-262,
"avg_logFC": 2.466,
"pct.1": 0.979,
"pct.2": 0.116,
"p_val_adj": 9.3926e-258,
"cluster": 0,
"gene": "PNLIPRP1"
},
{
"X": "CTRL",
"p_val": 2.0913e-259,
"avg_logFC": 3.6326,
"pct.1": 0.932,
"pct.2": 0.122,
"p_val_adj": 6.7064e-255,
"cluster": 0,
"gene": "CTRL"
},
{
"X": "AMYP1",
"p_val": 1.8827e-258,
"avg_logFC": 3.0714,
"pct.1": 0.68,
"pct.2": 0.032,
"p_val_adj": 6.0375e-254,
"cluster": 0,
"gene": "AMYP1"
},
{
"X": "CUZD1",
"p_val": 2.3942e-257,
"avg_logFC": 0.4263,
"pct.1": 0.947,
"pct.2": 0.09,
"p_val_adj": 7.6779e-253,
"cluster": 0,
"gene": "CUZD1"
},
{
"X": "AQP12B",
"p_val": 5.9308e-255,
"avg_logFC": 2.3322,
"pct.1": 0.719,
"pct.2": 0.036,
"p_val_adj": 1.902e-250,
"cluster": 0,
"gene": "AQP12B"
},
{
"X": "REG1B",
"p_val": 3.0264e-253,
"avg_logFC": 3.5532,
"pct.1": 0.986,
"pct.2": 0.153,
"p_val_adj": 9.7053e-249,
"cluster": 0,
"gene": "REG1B"
},
...
]
}
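Because every table in markers.json is a list of rows, picking the strongest markers of a cluster is a filter followed by a sort. The Python sketch below shows this on a shortened copy of the Cluster_0.6 table above; the helper name top_markers and the choice of avg_logFC as the ranking column are assumptions made for illustration.

```python
def top_markers(markers, table, cluster, n=3):
    """Return the names of the n genes with the highest avg_logFC."""
    rows = [row for row in markers[table] if row["cluster"] == cluster]
    rows.sort(key=lambda row: row["avg_logFC"], reverse=True)
    return [row["gene"] for row in rows[:n]]


markers = {
    "Cluster_0.6": [
        {"gene": "CEL", "avg_logFC": 1.4697, "p_val_adj": 0, "cluster": 0},
        {"gene": "PRSS3P1", "avg_logFC": 2.1996, "p_val_adj": 2.4281e-299, "cluster": 0},
        {"gene": "AQP8", "avg_logFC": 3.3761, "p_val_adj": 5.4479e-299, "cluster": 0},
    ],
}
best = top_markers(markers, "Cluster_0.6", cluster=0, n=2)
print(best)
```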
Files
files is just a folder inside the folder with your dataset. Files in this folder will be available in the browser in the "Files" tab. That's it.