CeNGEN project: splicing

Documentation

This module allows direct visualization of the sequencing data for any genomic locus. For each sample, a density plot indicates the number of reads aligned at a particular genomic position (normalized by the total number of reads in that sample and multiplied by one million, unit: CPM). In addition, a splice junction track indicates the number of junction-spanning reads supporting that junction. For each neuron type, the average of all the samples is performed for each genomic position, to give a mean coverage for that neuron. In addition, the junction-spanning reads are summed for each junction, to give a total junction usage track for that neuron. Finally, to allow rapid examination of a genomic locus across neurons, additional "global" tracks are available:

"mean_cov" is the mean coverage (for each genomic position) across all neurons (as such, it is the mean of the means of sample CPM)
"min_cov" corresponds to the smallest value across all neuron means, for a given genomic position. If a given exon or gene is expressed in every neuron except one, this track will show the absence of expression; whereas the mean will show the average high expression in most neurons. Note that this track becomes of little use if a gene is not expressed at all in one neuron class: this single neuron class will force the min at 0 along the length of the gene.
"max_cov" corresponds to the highest value across all neuron means, for a given genomic position. If a given exon or gene is expressed in a single neuron type, this track will show that expression, while the mean will show the average absence of expression in all other neuron types. Conceptually, these two tracks are similar to maximum or minimum intensity projections (per pixel) as typically performed on microscopy images.
"sum_sj" is the total number of junction-spanning reads for each splice junction. This track only underwent a simple filtering (removing all junctions with fewer than 20 reads), thus may still contain a large amount of noise.
"min_sj", "max_sj" are the minimum and maximum across all neurons, for each splice junction. As with coverage, they facilitate interrogation of a given position across all neurons. Junctions with fewer than 20 reads were filtered out of the "max_sj" track, junctions with fewer than 1 read filtered out of the "min_sj" track.

Genes that are expressed in a given sample typically have 0.2-1.5 CPM along the gene body. Expressions higher than 1 CPM suggest highly expressed genes. A useful approach to examine a given genomic locus is to first visualize the "global" statistics, and look for variable regions. Then, examine these regions using the means per neuron. Finally, if a region of interest appears differentially used between neuron types, displaying the individual samples gives an idea of the robustness. Junctions and coverage often provide complementary information, it is recommended to look at both.

There are two annotation tracks available: a copy of the Wormbase annotation this analysis is based on ("canonical_geneset"), and a "novel" extended annotation. The color indicates the strand a gene is annotated on. The "novel" annotation contains potential novel transcripts annotated by StringTie, which will appear in grey. Note that, while the Wormbase-annotated transcripts have exons and CDS (where the CDS contain only coding sequence, no UTR), the novel transcripts only have exons annotated, as their actual coding sequence is unknown.

The "sj" tracks are displayed by default as arcs, whose thickness is proportional to the log-number of reads. Clicking on the "..." next to the track name, you can select a LinearBasicDisplay where each splice junction is represented by a rectangle, with the number of reads printed below.

Applications and limitations

This visualization makes no assumption on the genomic structure, and allows direct observation of constitutive and alternative splicing (whether differentially used or not), non-coding RNAs, or unannotated genes or exons. The coverage values are only normalized by sequencing depth, the junction values not normalized at all; this can be considered raw data.

Neither the coverage nor the junction values are normalized by gene; thus, when averaging between neurons with different expression levels, the highest expressing neuron can dominate the result. Further, the criteria of what represents "high" and "low" expression can differ between genes. Finally, this is not a proper quantification: the other tools on this website will usually give more accurate answers to splicing questions. Visualizing the read coverage is a good way to verify the predictions of the splicing quantification.

Technical considerations

The tracks were processed using R/Bioconductor, and loaded in JBrowse2. The individual tracks can be downloaded from here and loaded into a different genome browser (e.g. Wormbase's JBrowse, or UCSC).

StringTie was used in quantification mode (Pertea et al., 2016) to estimate the expression level of each transcript. This view is often easier to interpret (as transcripts directly reflect the biology we're interested in), but is typically less reliable than local quantifications. The transcriptome used was Wormbase WS289.

The quantification is accessible by two modules.

Single-gene mode

In single gene mode, you can input a single gene name and a combination of neurons. Use ALL for all neurons, individual neuron names (e.g. "AWA", "ASEL", or "OLL"), or keywords such as "ACh", "motor", "sensory", ... You can combine keywords and neuron names as needed. The list of genes can be longer than one, only the first one will be taken into consideration.

Three plots will be produced. The first one displays the average TPM value for that gene's transcripts, across all sequenced neurons. This is a convenient way to get a glimpse of the general usage of this gene in the nervous system. Then, for that gene, the proportion of its expression that can be attributed to each of its transcripts. This makes identifying an isoform switch easy. However, when a gene is lowly or not expressed, this visualization can be misleading. The third plot allows direct visualization of the transcript expression levels, and enables distinguishing between neurons where no transcript of that gene is expressed, and neurons where some transcripts are highly expressed.

Selecting the checkbox "Plot individual samples" will represent each sample rather than neuron-level aggregates. The choice of color scale is customizable:

viridis and npg are well-suited for genes that have a few transcripts.
d3_cat20 makes it possible to distinguish up to 20 transcripts.
iwanthue generates random colors, to effectively color any number of transcripts. It is probably not colorblind-safe.

Heatmap

The choice of gene and neuron is as described above, except that several genes can be input at once. In addition, there are several options for normalization:

None displays raw TPM values
Log2 displays the log2 (shifted by 1, so that when TPM=0, log2(TPM+1)=0)
Z-score subtracts the mean and divides by the standard deviation. If "Scale on" is set to Transcript, each transcript will be scaled that way, such that a high Z-score indicates that this transcript is higher in this neuron than in the average neuron, and a negative Z-score indicates the transcript is lower in this neuron than in the average neuron. When "Scale on" is set to Neuron, a high value indicates that the transcript is higher in this neuron than the average transcript (of the selected genes), and conversely.
Min-Max scales all values between 0 and 1, such that, if "Scale on" is set to Transcript there will be one neuron with an expression value of 0, and one enuron with an expression value of 1; if "Scale on" is set to Neuron, there will be one transcript at 0, and one transcript at 1. The other Neurons/Transcripts will have values scaled between 0 and 1.

The choice of color scale is customizable.

MetBrewer OKeefe2 and viridis magma are well-suited for positive values (dark at 0, lighter for higher values).
Red-White-Blue has red colors for negative values, white around 0, blue for positive values. It can be used as a white-blue scale for positive values, and is also well-suited to visualize negative Z-scores.

Finally, you have the possibility to download the underlying data as a table (tab-separated values, can be opened with Excel). The downloaded dataset will contain the mean TPM value for selected neurons and transcripts (unnormalized). The heatmap itself can be downloaded as an SVG (compatible with Inkscape, Adobe Illustrator, Affinity Designer, ...) for further editing. The SVG reflects the displayed heatmap, using the same neurons and transcripts, normalizations, and color scale. The downloaded file name has the date and time of download (time in UTC, may differ from your local time), the name of the selected genes (truncated to 20 characters), and the name of the selected neurons (truncated to 20 characters).

Applications and limitations

Because the data was produced with short reads, this quantification is less accurate than a local quantification.

The different visualizations available can make some aspect more clear or obscure them. It is crucial to combine several visualizations before drawing conclusions.

As the transcript level typically reflects the biology, this quantification can be easier to interpret. It can be a great way to explore one or several genes before using more detailed tools.

Technical considerations

The transcriptome reconstruction was performed with StringTie v2.2.1 using a random subsample of the reads used here, along with (unpublished) PacBio long reads. A custom script was used to select potential novel transcripts for existing genes, discarding transcripts spanning several genes or in intergenic regions. This new transcripts have identifiers of the form "STRG.1.1" and can be visualized in the genome browser on this website. WS281 was used for reference-guided reconstruction.

The quantification was performed with StringTie using the "eB" option. TPM values were extracted for each gene in each sample.

References

Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M Transcriptome assembly from long-read RNA-seq alignments with StringTie2, Genome Biology 20, 278 (2019), DOI:10.1186/s13059-019-1910-1

Pertea M, Kim D, Pertea GM, Leek JT, Salzberg SL Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown, Nature Protocols 11, 1650-1667 (2016), DOI:10.1038/nprot.2016.095

Shumate A, Wong B, Pertea G, Pertea M, Improved Transcriptome Assembly Using a Hybrid of Long and Short Reads with StringTie, bioRxiv (2021) 2021.12.08.471868

Splicing was quantified using MAJIQ (see Vaquero-Garcia et al., 2016 for detailed explanations). Briefly, splice junctions (SJ) starting from the same exon (or splice junctions arriving into the same exon) are grouped in a Local Splicing Variation (LSV). For each LSV, the relative usage of each possible SJ is quantified, and a Percent Selected Index (PSI) is estimated with a Bayesian model. This way, a canonical exon skipping is represented as two LSVs, one LSV upstream, containing two SJ (one that links the upstream exon to the cassette exon, and one that skips the cassette to link to the downstream exon), and a second LSV with two SJ (one from the upstream exon into the downstream exon, the other from the cassette exon into the downstream exon).

The portal generates one page for each gene. On top, a representation of the splice graph displaying all exons and SJs. One can display the splice graph for individual neuron types or samples to see the number or reads spanning each SJ. Below, each LSV is represented schematically, and violin plots display the PSI of each SJ in each neuron type. In many cases, there were insufficient reads to quantify an SJ in a given neuron (e.g. when the gene is not expressed in that neuron), and the violin plot is left blank.

Applications and limitations

This visualization is most appropriate to establish differential splicing across neuron types.

This quantification only considers an LSV when there are differences in SJ. Thus, two isoforms distinguished only by the number of exons but with no branch in the splice graph will not appear differentially spliced.

Technical considerations

The MAJIQ quantifications and the VOILA server used for displaying are described in Vaquero-Garcia et al. (2016, 2020).

References

Jorge Vaquero-Garcia, Alejandro Barrera, Matthew R Gazzara, Juan González-Vallinas, Nicholas F Lahens, John B Hogenesch, Kristen W Lynch, Yoseph Barash, A new view of transcriptome complexity and regulation through the lens of local splicing variations, eLife 2016;5:e11752 DOI:10.7554/eLife.11752

Jorge Vaquero-Garcia, Joseph K. Aicher, Paul Jewell, Matthew R. Gazzara, Caleb M. Radens, Anupama Jha, Christopher J. Green, Scott S. Norton, Nicholas F. Lahens, Gregory R. Grant, Yoseph Barash, RNA splicing analysis using heterogeneous and large RNA-seq datasets, bioRxiv 2021.11.03.467086 DOI: 10.1101/2021.11.03.467086

News

2024-03-22: transcript-level, update data to WS289 (no more novel transcript discovery), new samples added (CEP, DVB, HSN, PVP, PVQ, RME, SIA).

2024-03-22: Genome browser, cosmetic updates to SJ display.

2023-12-12: Genome browser, WS289 annotation (previously WS281), new samples added (for neurons CEP, DVB, HSN, PVP, PVQ, RME, SIA).

2022-12-01: update JBrowse to 2.2.1, add strand color in gff track, change sj tracks default display to Arcs, use "novel" gff3 version 221130 with exons.

2022-10-20: Google Analytics.

2022-07-21: transcript-level: multiple improvements (faster, buttons, download heatmap, heatmap color scales, Wormbase browser link, average across neurons, ...).

2022-07-19: transcript-level quantification backend improved (starts faster).

2022-04-08: transcript-level quantification app updated with a new interactive heatmap. Documentation updated.

2022-03-23: Data processing update; any multimapping read was removed, new transcripts added by comparing with (unpublished) long reads data (see e.g. Y37E3.30).

2022-02-09: Adding minimal app for isoform quantification.

2022-01-31: button order changed.

2022-01-25: Updated help. Aesthetic changes in landing page.

2022-01-18: splice junctions available in the browser.

2021-12-06: VOILA server available. The individual samples are again available in the browser.

2021-11-30: removing two samples that failed QC. The individual samples are no longer available.

2021-11-23: "lower" and "higher" now correspong to the 3rd lowest/highest sample.

2021-11-18: "lower" and "higher" now correspond to the 3nd lowest and highest neurons.

2021-11-12: "lower" and "higher" now correspond to the 3rd lowest and 4th highest neurons.

2021-11-12: annotation updated to WS281 to match bw tracks.

2021-11-10: alignments on WS281 (vs 277 previously). In addition, min and max tracks replaced by median, 10th percentile, and 90th percentile.

CeNGEN splicing data portal

Documentation

Applications and limitations

Technical considerations

Single-gene mode

Heatmap

Applications and limitations

Technical considerations

References

Applications and limitations

Technical considerations

References

News