SCE dataset · SCE beta

Single-cell explorer requires several files to be present for each dataset:

dataset.json (required) contains a description of the dataset.
plot_data.json (required) contains calculated annotations for every cell (such as clustering and tSNE coordinates) as well as precalculated plot annotations (such as cluster borders).
exp_data.json contains gene names and cell barcodes in the same order as they appear in the expression matrix, as well as the total number of UMIs in each cell.
data.h5 is an expression (count) matrix. HDF5 stores counts efficiently: the explorer mostly needs to look up the expression of a single gene across the dataset, and HDF5 compresses such vectors of integers well.
markers.json (optional) is a JSON file describing gene expression markers.
files (optional) is a directory where you can put any additional files of your choice.
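
The file list above can be turned into a quick pre-flight check. The sketch below is a hypothetical helper (it is not part of SCE): it takes the file names found in one dataset folder and reports what is missing. Only the two files explicitly marked as required are treated as hard requirements; exp_data.json and data.h5 carry no label in the list, so reporting them separately is an assumption.

```javascript
// Hypothetical helper, not part of SCE: checks the file names of one
// dataset folder against the file list described above.
const REQUIRED = ["dataset.json", "plot_data.json"];
// Assumption: these two files are not labelled in the list, so they are
// reported separately instead of being treated as hard requirements.
const UNLABELLED = ["exp_data.json", "data.h5"];

function checkDatasetFiles(fileNames) {
  const present = new Set(fileNames);
  return {
    missingRequired: REQUIRED.filter((name) => !present.has(name)),
    missingUnlabelled: UNLABELLED.filter((name) => !present.has(name)),
  };
}

// Example: a folder that lacks plot_data.json and data.h5
const report = checkDatasetFiles(["dataset.json", "exp_data.json", "markers.json"]);
console.log(report);
```

The function works on plain file names so that it can be fed from any directory listing.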

`dataset.json`

This is the main file of your dataset in the file system. SCE goes through the provided folder, looks for all subfolders that contain this file, and treats every valid dataset.json as the descriptor of an SCE dataset.

This file contains several fields:

token - ID of your dataset in SCE. All datasets in SCE must have different IDs. (required)
name - Name of your dataset, shown if the dataset is displayed on the front page. (optional, default value is token)
description - Description of your dataset. (optional, default value is "")
link - A link to your dataset (GEO, SRA or publication), if you can provide one. (optional, default value is "")
organism - Abbreviation of the species (without genome version), such as hs or mm. (optional, default value is "")
cells - Number of cells in the dataset. (optional, default value is 0)
public - If the dataset is listed as public, it will be shown on the main page. (optional, default value is false)
curated - If the dataset is listed as both public and curated, it will be shown on the main page as curated. (optional, default value is false)

Example of a valid dataset.json:

{
    "token": "HCA_hematopoiesis",
    "name": "HCA: Profiling of CD34+ cells from human bone marrow to understand hematopoiesis",
    "description": "Differentiation is among the most fundamental processes in cell biology. Single cell RNA-seq studies have demonstrated that differentiation is a continuous process and in particular cell states are observed to reside on largely continuous spaces. We have developed Palantir, a graph based algorithm to model continuities in cell state transitions and cell fate choices. Modeling differentiation as a Markov chain, Palantir determines probabilities of reaching terminal states from cells in each intermediate state. The entropy of these probabilities represent the differentiation potential of the cell in the corresponding state. Applied to single cell RNA-seq dataset of CD34+ hematopoietic cells from human bone marrows, Palantir accurately identified key events leading up to cell fate commitment. Integration with ATAC-seq data from bulk sorted populations helped identify key regulators that correlate with cell fate specification and commitment.",
    "link": "https://data.humancellatlas.org/explore/projects/091cf39b-01bc-42e5-9437-f419a66c8a45",
    "organism": "hs",
    "cells": 33829,
    "public": true,
    "curated": true
}
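
To make the documented defaults concrete, here is a minimal sketch of how a parsed dataset.json could be completed. The function name `withDefaults` is hypothetical and this is not SCE's actual loader; it only encodes the defaults listed above and the rule that token is required.

```javascript
// Hypothetical helper, not SCE's actual loader: fills in the documented
// defaults for an already parsed dataset.json object.
function withDefaults(descriptor) {
  if (typeof descriptor.token !== "string" || descriptor.token === "") {
    throw new Error("dataset.json must contain a non-empty token");
  }
  return {
    name: descriptor.token, // name defaults to the token
    description: "",
    link: "",
    organism: "",
    cells: 0,
    public: false,
    curated: false,
    ...descriptor, // values present in the file override the defaults
  };
}

const dataset = withDefaults({ token: "HCA_hematopoiesis", organism: "hs" });
console.log(dataset.name, dataset.cells, dataset.public);
```

Spreading the descriptor last means that any field present in the file wins over its default.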

`plot_data.json`

Plot data consists of three parts: fields, data and annotations.

`fields`

Fields describe which fields are present for each cell, what type they are (numeric or factor), and their value range or factor levels. Currently only numeric and factor variables are supported. A valid fields structure might look like:

{
  "tSNE_1": {
    "type": "numeric",
    "range": [
      -68.8883,
      66.9387
    ]
  },
  "tSNE_2": {
    "type": "numeric",
    "range": [
      -54.9217,
      70.4228
    ]
  },
  "UMAP_1": {
    "type": "numeric",
    "range": [
      -12.9358,
      13.655
    ]
  },
  "UMAP_2": {
    "type": "numeric",
    "range": [
      -15.3023,
      16.4768
    ]
  },
  "Cluster": {
    "type": "factor",
    "levels": [
      "0",
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "7",
      "8",
      "9",
      "10",
      "11",
      "12",
      "13",
      "14",
      "15",
      "16",
      "17",
      "18",
      "19"
    ]
  },
  "nUmi": {
    "type": "numeric",
    "range": [
      669818,
      1158590
    ]
  },
  "nGene": {
    "type": "numeric",
    "range": [
      353,
      11159
    ]
  },
  "nUmiLog2": {
    "type": "numeric",
    "range": [
      19.3534,
      20.1439
    ]
  },
  "nGeneLog2": {
    "type": "numeric",
    "range": [
      8.4635,
      13.4459
    ]
  }
}

`data`

Data simply contains information about every cell in the dataset. Data may contain extra fields that are not present in fields; however, these fields won't show up in the explorer. An example of a valid data field is shown below:

[
  {
    "tSNE_1": 64.3251,
    "tSNE_2": 11.3943,
    "UMAP_1": 1.3749,
    "UMAP_2": 16.074,
    "Cluster": "6",
    "nUmi": 890612,
    "nGene": 6899,
    "nUmiLog2": 19.7644,
    "nGeneLog2": 12.7522,
    "_row": "00ca0d37-b787-41a4-be59-2aff5b13b0bd"
  },
  {
    "tSNE_1": -10.7318,
    "tSNE_2": 8.2139,
    "UMAP_1": 1.8595,
    "UMAP_2": -2.9429,
    "Cluster": "10",
    "nUmi": 939514,
    "nGene": 3142,
    "nUmiLog2": 19.8416,
    "nGeneLog2": 11.6175,
    "_row": "0103aed0-29c2-4b29-a02a-2b58036fe875"
  },
  {
    "tSNE_1": -41.7568,
    "tSNE_2": -31.4512,
    "UMAP_1": -6.6463,
    "UMAP_2": 4.0747,
    "Cluster": "0",
    "nUmi": 918941,
    "nGene": 3802,
    "nUmiLog2": 19.8096,
    "nGeneLog2": 11.8925,
    "_row": "01a5dd09-db87-47ac-be78-506c690c4efc"
  },
  ...
]
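
Since fields only summarizes what is in data (ranges for numeric values, levels for factors), it can be derived from the rows. The sketch below shows one way to do that. Several choices in it are assumptions rather than SCE rules: numbers are treated as numeric and everything else as a factor, keys starting with "_" (such as "_row") are skipped, each key is expected to have the same type in every row, and factor levels are listed in order of first appearance.

```javascript
// Hypothetical helper: builds a `fields` object from `data` rows.
function deriveFields(rows) {
  const fields = {};
  for (const row of rows) {
    for (const [key, value] of Object.entries(row)) {
      if (key.startsWith("_")) continue; // assumption: skip keys like "_row"
      if (typeof value === "number") {
        if (!fields[key]) fields[key] = { type: "numeric", range: [value, value] };
        const range = fields[key].range;
        range[0] = Math.min(range[0], value); // running minimum
        range[1] = Math.max(range[1], value); // running maximum
      } else {
        if (!fields[key]) fields[key] = { type: "factor", levels: [] };
        const level = String(value);
        if (!fields[key].levels.includes(level)) fields[key].levels.push(level);
      }
    }
  }
  return fields;
}

const exampleFields = deriveFields([
  { tSNE_1: 64.3251, Cluster: "6", _row: "00ca0d37-b787-41a4-be59-2aff5b13b0bd" },
  { tSNE_1: -10.7318, Cluster: "10", _row: "0103aed0-29c2-4b29-a02a-2b58036fe875" },
]);
console.log(JSON.stringify(exampleFields));
```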

`annotations`

Annotations are usually shown on top of the plot. For each annotation you need to tell SCE the type of annotation (text, polygon or arrows), which fields to use as coordinates (coords) and the coordinates of the annotation (data). The value field names the field of data from which the actual value is taken.

Below is an example of valid annotations:


{
  "tsne_Cluster_centers": {
    "type": "text",
    "value": "Cluster",
    "coords": [
      "tSNE_1",
      "tSNE_2"
    ],
    "data": [
      {
        "Cluster": "0",
        "tSNE_1": -35.5756,
        "tSNE_2": -32.0876,
        "Text": "0"
      },
      {
        "Cluster": "1",
        "tSNE_1": -0.7239,
        "tSNE_2": -17.0211,
        "Text": "1"
      },
      ...
    ]
  },
  "tsne_Cluster_borders": {
    "type": "polygon",
    "value": "group",
    "coords": [
      "tSNE_1",
      "tSNE_2"
    ],
    "data": [
      {
        "tSNE_1": -1.3811,
        "tSNE_2": -56.3444,
        "Cluster": "13",
        "group": "gr1_1"
      },
      {
        "tSNE_1": -2.8141,
        "tSNE_2": -55.6834,
        "Cluster": "13",
        "group": "gr1_1"
      },
      {
        "tSNE_1": -4.2503,
        "tSNE_2": -55.0224,
        "Cluster": "13",
        "group": "gr1_1"
      },
      ...
    ]
  }
}

Text will just appear on top of the plot. Polygon vertices are connected one by one in the order they are listed in the data field.
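
As an illustration of how a polygon annotation could be consumed, the sketch below turns its data into ordered vertex lists. It assumes that rows sharing the same value field (here "group") belong to one outline; the text above only guarantees that vertices are connected in the order they are listed, so the grouping step is an assumption, and `polygonOutlines` is a hypothetical name.

```javascript
// Hypothetical helper: splits a polygon annotation into separate outlines.
// Assumption: rows with the same `value` field form one polygon.
// The order of rows within each outline is preserved.
function polygonOutlines(annotation) {
  const [xField, yField] = annotation.coords;
  const outlines = new Map();
  for (const row of annotation.data) {
    const key = row[annotation.value];
    if (!outlines.has(key)) outlines.set(key, []);
    outlines.get(key).push([row[xField], row[yField]]);
  }
  return outlines;
}

const borders = {
  type: "polygon",
  value: "group",
  coords: ["tSNE_1", "tSNE_2"],
  data: [
    { tSNE_1: -1.3811, tSNE_2: -56.3444, Cluster: "13", group: "gr1_1" },
    { tSNE_1: -2.8141, tSNE_2: -55.6834, Cluster: "13", group: "gr1_1" },
    { tSNE_1: -4.2503, tSNE_2: -55.0224, Cluster: "13", group: "gr1_1" },
  ],
};
console.log(polygonOutlines(borders).get("gr1_1"));
```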

The arrows type is somewhat special. It has data_start and data_end fields that specify arrow coordinates. It was mostly designed to show RNA velocity annotations on top of the plot. A valid JSON object for the arrows type will look something like:


{
  "type": "arrows",
  "coords": [
    "UMAP_1",
    "UMAP_2"
  ],
  "data_start": [
    {
      "UMAP_1": -8.6807,
      "UMAP_2": -6.522
    },
    {
      "UMAP_1": -8.6807,
      "UMAP_2": -6.0223
    },
    ...
  ], 
  "data_end": [
    {
      "UMAP_1": -8.4769,
      "UMAP_2": -7.129
    },
    {
      "UMAP_1": -8.426,
      "UMAP_2": -6.8369
    },
    ...
  ]
}
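
A renderer would need to combine the two arrays into individual arrows. The sketch below assumes that the i-th entry of data_start pairs with the i-th entry of data_end; the format description does not state this explicitly, so treat it as an assumption. The function name `arrowSegments` is hypothetical.

```javascript
// Hypothetical helper: pairs data_start[i] with data_end[i] to get
// one segment per arrow. Pairing by index is an assumption.
function arrowSegments(annotation) {
  const [xField, yField] = annotation.coords;
  if (annotation.data_start.length !== annotation.data_end.length) {
    throw new Error("data_start and data_end must have the same length");
  }
  return annotation.data_start.map((start, i) => {
    const end = annotation.data_end[i];
    return {
      from: [start[xField], start[yField]],
      to: [end[xField], end[yField]],
    };
  });
}

const velocity = {
  type: "arrows",
  coords: ["UMAP_1", "UMAP_2"],
  data_start: [{ UMAP_1: -8.6807, UMAP_2: -6.522 }],
  data_end: [{ UMAP_1: -8.4769, UMAP_2: -7.129 }],
};
console.log(arrowSegments(velocity));
```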

Whole file

A valid plot_data.json file will look something like:


{
  "fields": {
      "tSNE_1": {
        "type": "numeric",
        "range": [
          -68.8883,
          66.9387
        ]
      },
      "tSNE_2": {
        "type": "numeric",
        "range": [
          -54.9217,
          70.4228
        ]
      },
      "UMAP_1": {
        "type": "numeric",
        "range": [
          -12.9358,
          13.655
        ]
      },
      "UMAP_2": {
        "type": "numeric",
        "range": [
          -15.3023,
          16.4768
        ]
      },
      "Cluster": {
        "type": "factor",
        "levels": [
          "0",
          "1",
          "2",
          "3",
          "4",
          "5",
          "6",
          "7",
          "8",
          "9",
          "10",
          "11",
          "12",
          "13",
          "14",
          "15",
          "16",
          "17",
          "18",
          "19"
        ]
      },
      "nUmi": {
        "type": "numeric",
        "range": [
          669818,
          1158590
        ]
      },
      "nGene": {
        "type": "numeric",
        "range": [
          353,
          11159
        ]
      },
      "nUmiLog2": {
        "type": "numeric",
        "range": [
          19.3534,
          20.1439
        ]
      },
      "nGeneLog2": {
        "type": "numeric",
        "range": [
          8.4635,
          13.4459
        ]
      }
    },
  "data": [
    {
      "tSNE_1": 64.3251,
      "tSNE_2": 11.3943,
      "UMAP_1": 1.3749,
      "UMAP_2": 16.074,
      "Cluster": "6",
      "nUmi": 890612,
      "nGene": 6899,
      "nUmiLog2": 19.7644,
      "nGeneLog2": 12.7522,
      "_row": "00ca0d37-b787-41a4-be59-2aff5b13b0bd"
    },
    {
      "tSNE_1": -10.7318,
      "tSNE_2": 8.2139,
      "UMAP_1": 1.8595,
      "UMAP_2": -2.9429,
      "Cluster": "10",
      "nUmi": 939514,
      "nGene": 3142,
      "nUmiLog2": 19.8416,
      "nGeneLog2": 11.6175,
      "_row": "0103aed0-29c2-4b29-a02a-2b58036fe875"
    },
    {
      "tSNE_1": -41.7568,
      "tSNE_2": -31.4512,
      "UMAP_1": -6.6463,
      "UMAP_2": 4.0747,
      "Cluster": "0",
      "nUmi": 918941,
      "nGene": 3802,
      "nUmiLog2": 19.8096,
      "nGeneLog2": 11.8925,
      "_row": "01a5dd09-db87-47ac-be78-506c690c4efc"
    },
    ...
  ],
  "annotations": {
   "tsne_Cluster_centers": {
     "type": "text",
     "value": "Cluster",
     "coords": [
       "tSNE_1",
       "tSNE_2"
     ],
     "data": [
       {
         "Cluster": "0",
         "tSNE_1": -35.5756,
         "tSNE_2": -32.0876,
         "Text": "0"
       },
       {
         "Cluster": "1",
         "tSNE_1": -0.7239,
         "tSNE_2": -17.0211,
         "Text": "1"
       },
       ...
     ]
   },
   "tsne_Cluster_borders": {
     "type": "polygon",
     "value": "group",
     "coords": [
       "tSNE_1",
       "tSNE_2"
     ],
     "data": [
       {
         "tSNE_1": -1.3811,
         "tSNE_2": -56.3444,
         "Cluster": "13",
         "group": "gr1_1"
       },
       {
         "tSNE_1": -2.8141,
         "tSNE_2": -55.6834,
         "Cluster": "13",
         "group": "gr1_1"
       },
       {
         "tSNE_1": -4.2503,
         "tSNE_2": -55.0224,
         "Cluster": "13",
         "group": "gr1_1"
       },
       ...
     ]
   }
 }
}
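
Before uploading, it can help to check that the three parts agree with each other. The sketch below is a hypothetical validator, not something SCE ships. It checks that every value in data fits the range or levels declared in fields, and that annotation coords refer to declared fields; the second check is an assumption drawn from the examples above.

```javascript
// Hypothetical validator for a parsed plot_data.json object.
// Returns a list of human-readable problems (empty when none are found).
function validatePlotData(plotData) {
  const problems = [];
  const { fields, data, annotations = {} } = plotData;
  data.forEach((row, i) => {
    for (const [name, field] of Object.entries(fields)) {
      const value = row[name];
      if (field.type === "numeric") {
        const [min, max] = field.range;
        if (typeof value !== "number" || value < min || value > max) {
          problems.push(`data[${i}].${name} is missing or outside the declared range`);
        }
      } else if (field.type === "factor") {
        if (!field.levels.includes(String(value))) {
          problems.push(`data[${i}].${name} is not one of the declared levels`);
        }
      }
    }
  });
  // Assumption: annotation coordinates should name declared fields.
  for (const [name, annotation] of Object.entries(annotations)) {
    for (const coord of annotation.coords) {
      if (!(coord in fields)) {
        problems.push(`annotation ${name} uses undeclared coordinate ${coord}`);
      }
    }
  }
  return problems;
}

const plotData = {
  fields: {
    tSNE_1: { type: "numeric", range: [-68.8883, 66.9387] },
    Cluster: { type: "factor", levels: ["0", "6", "10"] },
  },
  data: [{ tSNE_1: 64.3251, Cluster: "6" }],
  annotations: {
    tsne_Cluster_centers: { type: "text", value: "Cluster", coords: ["tSNE_1"], data: [] },
  },
};
console.log(validatePlotData(plotData));
```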

`exp_data.json`

This file simply contains gene names, cell barcodes/names and total UMIs per cell. It must reflect the row and column names of the matrix in data.h5, which contains the expression data.

{
  "genes": ["TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2", ...],
  "barcodes": ["00ca0d37-b787-41a4-be59-2aff5b13b0bd","0103aed0-29c2-4b29-a02a-2b58036fe875", ... ],
  "totalCounts": [890612, 939514, ...]
}

When a user queries the expression of gene CD14 in the dataset, we first find the index of this gene in the genes array, and then ask the server for the expression of the gene with this index.

I.e. on the client side we would do something like


// gene names are case-sensitive; indexOf returns -1 if the gene is absent
const geneId = expData.genes.indexOf("CD14");
// geneExpression: request the expression of geneId from the server

It is very important that exp_data.json is consistent with data.h5.
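
A minimal consistency check could look like the sketch below. It is a hypothetical helper: `matrixShape` stands for the [rows, columns] dimensions of the expression matrix in data.h5, obtained with whatever HDF5 tool you use, with rows being genes and columns being cells. The duplicate-name check is an extra suggestion, motivated by the fact that indexOf only ever finds the first match.

```javascript
// Hypothetical helper: compares a parsed exp_data.json with the
// dimensions of the expression matrix ([genes, cells]).
function checkExpData(expData, matrixShape) {
  const [nGenes, nCells] = matrixShape;
  const problems = [];
  if (expData.genes.length !== nGenes) {
    problems.push(`genes has ${expData.genes.length} entries, matrix has ${nGenes} rows`);
  }
  if (expData.barcodes.length !== nCells) {
    problems.push(`barcodes has ${expData.barcodes.length} entries, matrix has ${nCells} columns`);
  }
  if (expData.totalCounts.length !== nCells) {
    problems.push(`totalCounts has ${expData.totalCounts.length} entries, matrix has ${nCells} columns`);
  }
  // indexOf returns only the first match, so duplicated names are ambiguous
  if (new Set(expData.genes).size !== expData.genes.length) {
    problems.push("genes contains duplicated names");
  }
  return problems;
}

const expDataExample = {
  genes: ["TSPAN6", "DPM1", "SCYL3"],
  barcodes: ["00ca0d37-b787-41a4-be59-2aff5b13b0bd", "0103aed0-29c2-4b29-a02a-2b58036fe875"],
  totalCounts: [890612, 939514],
};
// The example is consistent with a 3 x 2 matrix, so no problems are reported
console.log(checkExpData(expDataExample, [3, 2]));
```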

`data.h5`

This is an HDF5 file that stores gene expression values. We store count data since integers compress efficiently even for large datasets. HDF5 files also support chunked compression. We set the chunk size to fit exactly the expression of one gene, which allows the expression values of a gene to be retrieved quickly.

Since we keep only counts on the server, we require totalCounts in the exp_data.json file so that expression values can be normalized on the client side.
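
As an illustration of why totalCounts is needed, the sketch below normalizes the raw counts of one gene on the client. The exact normalization used by SCE is not stated here, so the formula (scale each cell to a fixed total, then take log2(x + 1)) and the default scale of 10000 are assumptions chosen because they are a common convention.

```javascript
// One possible client-side normalization, shown as an assumption:
// scale each cell to `scale` total counts and take log2(x + 1).
// geneCounts[i] and totalCounts[i] must refer to the same cell.
function normalizeGene(geneCounts, totalCounts, scale = 10000) {
  return geneCounts.map((count, cell) =>
    totalCounts[cell] > 0
      ? Math.log2((count / totalCounts[cell]) * scale + 1)
      : 0 // cells without any counts get 0 instead of NaN
  );
}

// Raw counts of one gene in two cells, with totalCounts from exp_data.json
console.log(normalizeGene([12, 0], [890612, 939514]));
```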

This HDF5 file contains only one dataset, called expression/mat, which holds the expression values (rows are genes and columns are samples).

We usually create such files in R; below is a snippet showing how we do it:


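## rhdf5 provides the h5* functions, jsonlite provides toJSON,
## Matrix provides colSums for sparse matrices
library(rhdf5)
library(jsonlite)
library(Matrix)

## let's assume `counts` is a sparse count matrix of our scRNA-seq experiment
newH5File <- file.path("data.h5")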

h5createFile(newH5File)
h5createGroup(newH5File, "expression")

## here we set chunk size and compression levels
h5createDataset(newH5File, "expression/mat", c(nrow(counts), ncol(counts)),
              storage.mode = "integer", chunk=c(1, ncol(counts)),
              level=9)

write(toJSON(list(
    "genes"=rownames(counts),
    "barcodes"=colnames(counts),
    "totalCounts"=colSums(counts)
)), file.path("exp_data.json"))

# here we write table in batches of 500 genes: 
# h5write doesn't support sparse matrices
# writing dataset without batches might take a lot of memory
# for large datasets
chunk_size = 500
start <- 1
totalGenes <- nrow(counts)
while (start <= totalGenes) {
    end <- min(start + chunk_size - 1, totalGenes)
    h5write(as.matrix(counts[start:end, 1:ncol(counts)]),
            newH5File, "expression/mat",
            index=list(start:end,1:ncol(counts)))
    start <- start + chunk_size
}

h5closeAll()

`markers.json`

This is a list of data.frames produced by Seurat's FindAllMarkers, converted to JSON. Sometimes we would like to show more than one marker table. For example, if we performed clustering at different resolutions and identified markers for each resolution, we would like to have all of these tables in the same place.

Below is an example of a valid markers.json file:


{
  "Cluster_0.6": [
    {
      "X": "CEL",
      "p_val": 0,
      "avg_logFC": 1.4697,
      "pct.1": 0.961,
      "pct.2": 0.071,
      "p_val_adj": 0,
      "cluster": 0,
      "gene": "CEL"
    },
    {
      "X": "PRSS3P1",
      "p_val": 7.5714e-304,
      "avg_logFC": 2.1996,
      "pct.1": 0.968,
      "pct.2": 0.082,
      "p_val_adj": 2.4281e-299,
      "cluster": 0,
      "gene": "PRSS3P1"
    },
    {
      "X": "AQP8",
      "p_val": 1.6988e-303,
      "avg_logFC": 3.3761,
      "pct.1": 0.744,
      "pct.2": 0.026,
      "p_val_adj": 5.4479e-299,
      "cluster": 0,
      "gene": "AQP8"
    },
    {
      "X": "AZGP1",
      "p_val": 1.4416e-291,
      "avg_logFC": 2.8036,
      "pct.1": 0.872,
      "pct.2": 0.059,
      "p_val_adj": 4.6231e-287,
      "cluster": 0,
      "gene": "AZGP1"
    },
    {
      "X": "ANPEP",
      "p_val": 4.769e-291,
      "avg_logFC": 2.8413,
      "pct.1": 0.94,
      "pct.2": 0.082,
      "p_val_adj": 1.5294e-286,
      "cluster": 0,
      "gene": "ANPEP"
    },
    ...
  ],
  "Cluster_0.8": [
    {
     "X": "RARRES2",
     "p_val": 2.255e-278,
     "avg_logFC": 2.6058,
     "pct.1": 0.918,
     "pct.2": 0.081,
     "p_val_adj": 7.2317e-274,
     "cluster": 0,
     "gene": "RARRES2"
    },
    {
     "X": "CPA2",
     "p_val": 3.0854e-274,
     "avg_logFC": 2.6125,
     "pct.1": 1,
     "pct.2": 0.117,
     "p_val_adj": 9.8946e-270,
     "cluster": 0,
     "gene": "CPA2"
    },
    {
     "X": "AMY2A",
     "p_val": 5.5019e-272,
     "avg_logFC": 4.3333,
     "pct.1": 0.918,
     "pct.2": 0.103,
     "p_val_adj": 1.7644e-267,
     "cluster": 0,
     "gene": "AMY2A"
    },
    {
     "X": "AMY1C",
     "p_val": 3.521e-265,
     "avg_logFC": 2.4203,
     "pct.1": 0.566,
     "pct.2": 0.008,
     "p_val_adj": 1.1291e-260,
     "cluster": 0,
     "gene": "AMY1C"
    },
    {
     "X": "AQP12A",
     "p_val": 6.0698e-264,
     "avg_logFC": 2.4958,
     "pct.1": 0.683,
     "pct.2": 0.027,
     "p_val_adj": 1.9465e-259,
     "cluster": 0,
     "gene": "AQP12A"
    },
    {
     "X": "PNLIPRP1",
     "p_val": 2.9289e-262,
     "avg_logFC": 2.466,
     "pct.1": 0.979,
     "pct.2": 0.116,
     "p_val_adj": 9.3926e-258,
     "cluster": 0,
     "gene": "PNLIPRP1"
    },
    {
     "X": "CTRL",
     "p_val": 2.0913e-259,
     "avg_logFC": 3.6326,
     "pct.1": 0.932,
     "pct.2": 0.122,
     "p_val_adj": 6.7064e-255,
     "cluster": 0,
     "gene": "CTRL"
    },
    {
     "X": "AMYP1",
     "p_val": 1.8827e-258,
     "avg_logFC": 3.0714,
     "pct.1": 0.68,
     "pct.2": 0.032,
     "p_val_adj": 6.0375e-254,
     "cluster": 0,
     "gene": "AMYP1"
    },
    {
     "X": "CUZD1",
     "p_val": 2.3942e-257,
     "avg_logFC": 0.4263,
     "pct.1": 0.947,
     "pct.2": 0.09,
     "p_val_adj": 7.6779e-253,
     "cluster": 0,
     "gene": "CUZD1"
    },
    {
     "X": "AQP12B",
     "p_val": 5.9308e-255,
     "avg_logFC": 2.3322,
     "pct.1": 0.719,
     "pct.2": 0.036,
     "p_val_adj": 1.902e-250,
     "cluster": 0,
     "gene": "AQP12B"
    },
    {
     "X": "REG1B",
     "p_val": 3.0264e-253,
     "avg_logFC": 3.5532,
     "pct.1": 0.986,
     "pct.2": 0.153,
     "p_val_adj": 9.7053e-249,
     "cluster": 0,
     "gene": "REG1B"
    },
    ...
  ]
}
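
To show how this structure can be queried, the sketch below picks the strongest markers of one cluster from one table. The helper name `topMarkers` and the choice to rank by avg_logFC are illustrative assumptions, not a description of what the explorer does.

```javascript
// Hypothetical helper: returns the gene names of the top `n` markers of
// one cluster from one table, ranked by avg_logFC in descending order.
function topMarkers(markers, tableName, cluster, n = 3) {
  const table = markers[tableName] || [];
  return table
    .filter((row) => row.cluster === cluster) // filter() copies, so sort() is safe
    .sort((a, b) => b.avg_logFC - a.avg_logFC)
    .slice(0, n)
    .map((row) => row.gene);
}

// A few rows taken from the example above (only the columns used here)
const markersExample = {
  "Cluster_0.6": [
    { gene: "CEL", avg_logFC: 1.4697, cluster: 0 },
    { gene: "PRSS3P1", avg_logFC: 2.1996, cluster: 0 },
    { gene: "AQP8", avg_logFC: 3.3761, cluster: 0 },
    { gene: "AZGP1", avg_logFC: 2.8036, cluster: 0 },
    { gene: "ANPEP", avg_logFC: 2.8413, cluster: 0 },
  ],
};
console.log(topMarkers(markersExample, "Cluster_0.6", 0));
```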

Files

files is simply a folder inside the folder with your dataset. Files in this folder will be available in the browser in the "Files" tab.