MGcount user guide

Getting started

What is MGcount

MGcount is an RNA-seq quantification tool conceived to address ambiguous read alignments in a flexible way that is compatible with any biotype. This makes it possible to extract more information from total RNA-seq datasets by simultaneously quantifying coding and non-coding transcripts, both small and long, where dealing with heterogeneous read-to-feature alignment ambiguities is key. When aligning reads to a reference genome, we distinguish between two types of ambiguous alignments:

  • Multi-mappers: reads aligning to multiple genomic locations.
  • Multi-overlaps: reads aligning to a genomic location with multiple annotated features.

To deal with the most frequent multi-overlaps, MGcount hierarchically assigns reads to small RNA, long RNA exons and long RNA introns, accounting for their length disparity. Subsequently, MGcount models read-to-feature alignments in a graph. This is exploited to detect and report expression of sequence-similar annotated loci, where reads systematically multi-map, as integrated features called communities.

Citation

Hita, A., Brocart, G., Fernandez, A., Rehmsmeier, M., Alemany, A. and Schvartzman, S. MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts. BMC Bioinformatics 23, 39 (2022). https://doi.org/10.1186/s12859-021-04544-3

System requirements

MGcount is built on top of featureCounts, a well-known and computationally efficient program (Liao et al., 2014). Please download it from the following link: http://subread.sourceforge.net/

Installation

MGcount is written in Python and is executed from the command line. You can either download the executable version (single binary file) or install it as a Python3 module.

Install and run as a Python3 module

You can install the package as a Python module:

pip3 install git+https://github.com/hitaandrea/MGcount.git

Once the package is installed, run the tool as a Python installed module:

python3 -m mgcount [args]

Download and run the executable program

Alternatively, you can download the latest release as a single executable file here.

Save the program file to your Linux system and set the permissions to allow executing the file as a program:

chmod +x MGcount

Once the file is executable, run the tool by calling the file from the command line with the desired arguments.

MGcount [args]

Workflow

MGcount workflow schema

During its execution, MGcount performs the following 3 steps:

1. Hierarchical assignation

To address the most frequent multi-overlapping ambiguous situations, reads are assigned to annotated genomic features in three pre-defined sequential rounds based on transcript body length: small RNA, long RNA exon and long RNA intron. In the first round, MGcount prioritizes small RNA when a read aligns to a small RNA locus that is embedded within a long RNA. In the second round, alignments are assigned to long RNA exon features. In the final round, reads that have not been assigned to any small RNA or long RNA exon are assigned to introns if they align within the full gene body of a long RNA. Hence, all reads with at least one alignment overlapping an annotated feature in the current round are assigned to that feature and skipped in subsequent rounds.
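The round-by-round priority described above can be sketched in a few lines of Python. This is only an illustration of the assignment logic, not MGcount's implementation (the actual overlap computation is delegated to featureCounts). The function name, the data layout and the feature names are hypothetical.

```python
# Illustrative sketch of the three-round priority. Names and data layout are
# hypothetical; MGcount itself relies on featureCounts for the overlaps.

ROUNDS = ("small", "long_exon", "long_intron")  # fixed priority order


def assign_hierarchically(read_overlaps):
    """Assign each read in the first round where it overlaps a feature.

    read_overlaps maps a read ID to {round_name: [feature, ...]}.
    Returns {read_id: (round_name, [features])}; reads that overlap
    no feature in any round are left out of the result.
    """
    assigned = {}
    for read_id, overlaps_by_round in read_overlaps.items():
        for round_name in ROUNDS:
            features = overlaps_by_round.get(round_name, [])
            if features:
                # The read is consumed here and skipped in later rounds.
                assigned[read_id] = (round_name, list(features))
                break
    return assigned


example = {
    "read1": {"small": ["snoRNA_X"], "long_intron": ["GENE_A"]},
    "read2": {"long_exon": ["GENE_B"], "long_intron": ["GENE_B"]},
    "read3": {"long_intron": ["GENE_C"]},
    "read4": {},
}
result = assign_hierarchically(example)
for read_id, (round_name, features) in result.items():
    print(read_id, round_name, features)
```

In this toy input, read1 overlaps both a small RNA and the intron of a long gene, and is assigned to the small RNA only. read4 overlaps nothing and stays unassigned.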

2. Multi-loci communities recognition

To quantify multi-mapping reads, MGcount builds a directed weighted graph G = (V, E), where each vertex in V is an annotated feature and a pair of directed edges in E connects two features that share multi-mapping reads. Edge weights are defined as the number of multi-mapping reads shared by the two vertices, normalized by the total number of reads assigned to the source vertex. Vertex weights are defined as the log-transformed number of assigned alignments (Fig c). The resulting graph structures capture the multi-loci topologies of different RNA biotypes. Two graphs are built separately for small-RNA and long-RNA features, using the full pool of input alignments. Subsequently, highly related features are grouped together by minimizing the map equation with the community detection approach described by Rosvall et al., 2008.
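As a rough illustration of the weight definitions above, the following Python sketch derives directed edge weights and vertex weights from a toy read-to-feature mapping. The function name and input layout are hypothetical, and the natural logarithm is an assumption, since the text does not state the base of the log transform. Community detection itself is not shown.

```python
import math
from collections import defaultdict
from itertools import permutations


def build_graph(read_to_features):
    """Return (edge_weights, vertex_weights) for a toy multi-mapping graph.

    read_to_features maps a read ID to the set of features it is assigned to.
    """
    assigned = defaultdict(int)  # number of reads assigned to each feature
    shared = defaultdict(int)    # reads shared by each ordered feature pair
    for features in read_to_features.values():
        for feature in features:
            assigned[feature] += 1
        for source, target in permutations(features, 2):
            shared[(source, target)] += 1
    # Edge weight: shared reads normalized by the reads of the source vertex.
    edge_weights = {
        (source, target): n / assigned[source]
        for (source, target), n in shared.items()
    }
    # Vertex weight: log-transformed read count (natural log is an assumption).
    vertex_weights = {f: math.log(n) for f, n in assigned.items()}
    return edge_weights, vertex_weights


reads = {
    "r1": {"A", "B"},
    "r2": {"A", "B"},
    "r3": {"A"},
    "r4": {"B", "C"},
}
edge_w, vertex_w = build_graph(reads)
print(edge_w[("A", "B")])  # 2 shared reads / 3 reads assigned to A
print(edge_w[("C", "B")])  # 1 shared read / 1 read assigned to C
```

Because the normalization uses the source vertex, the two edges of a pair generally carry different weights: here C shares all of its reads with B, while B shares only a third of its reads with C.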

3. Expression matrix generation

MGcount generates an output expression matrix for each hierarchical assignation round. These are appended together in a single output matrix. For each read, each alignment first gets a 1/N count, where N is the number of multi-mappers or residual multi-overlaps that survive the hierarchical assignment. Next, counts for annotated features that have been aggregated together in a community by the map equation are summed up. In this way, the systematic ambiguity in multi-mapping reads gets collapsed into a single MG community while the remaining signal is reported as fractional counts over distinct features.
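The counting rule can be illustrated with a minimal Python sketch for a single sample, assuming the surviving alignments per read and the community membership are already known. The function name, the feature names and the input layout are hypothetical.

```python
from collections import defaultdict


def count_column(read_alignments, community_of):
    """Compute one sample column of a toy count matrix.

    read_alignments maps a read ID to a list of features, one entry per
    alignment that survives the hierarchical assignment.
    community_of maps a feature to its community name; features missing
    from the mapping are reported individually.
    """
    counts = defaultdict(float)
    for features in read_alignments.values():
        n = len(features)
        for feature in features:
            # Each alignment contributes 1/N, summed at community level.
            counts[community_of.get(feature, feature)] += 1.0 / n
    return dict(counts)


alignments = {
    "r1": ["miR_a_copy1", "miR_a_copy2"],  # multi-mapper inside a community
    "r2": ["miR_a_copy1"],                 # uniquely assigned read
    "r3": ["snRNA_y", "gene_z"],           # ambiguity across unrelated features
}
communities = {"miR_a_copy1": "miR_a", "miR_a_copy2": "miR_a"}
column = count_column(alignments, communities)
print(column)
```

Read r1 contributes 0.5 to each copy, but both copies belong to the same community, so the community receives the full count. Read r3 remains split as fractional counts over two distinct features.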

Usage description

Inputs

Three inputs are required to run the program:

  • Input alignment file: a .txt file listing the paths to the input .bam alignment files, one per line.
  • Annotations file: a .gtf file containing a set of RNA feature annotations.
  • Output directory path: a string specifying the path to the directory where MGcount outputs will be stored.

Please use full paths if you experience any problems.
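For readers who prefer to prepare the input list programmatically, the sketch below writes the absolute path of every .bam file in a directory to a text file, one path per line. The function name is hypothetical; only the Python standard library is used.

```python
from pathlib import Path


def write_bam_list(bam_dir, list_path):
    """Write the absolute paths of all .bam files in bam_dir to list_path.

    Returns the sorted list of paths that were written.
    """
    bam_paths = sorted(p.resolve() for p in Path(bam_dir).glob("*.bam"))
    Path(list_path).write_text("".join(f"{p}\n" for p in bam_paths))
    return bam_paths


# Example call (directory and file names are placeholders):
# write_bam_list("input_bamfiles", "input_bamfilenames.txt")
```

Using resolved absolute paths follows the recommendation above and avoids problems when MGcount is launched from a different working directory.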

Configurable arguments

Configurable arguments list

Optional arguments to configure a MGcount run include:

Argument | Description | Default value
--paired_flag (-p) | Paired-end flag. If not set, the assignation occurs in single-end mode. | False
--strand_option (-s) | Library strandedness. Options available are 0: unstranded, 1: forward-stranded and 2: reverse-stranded | 1
--featureCounts_path | Path to the featureCounts executable file | /usr/bin/featureCounts
--btyperounds_filename | Optional .csv file with biotype to assignation round associations. It should be a two-column table where column names are, in order, “biotype” and “counting_round”. |
--feature_small | GTF feature type entry for small RNA read assignation | transcript
--feature_output_small | GTF field name for which to summarize counts of small RNA assigned reads | transcript_name
--feature_biotype_small | GTF field name defining biotype for small RNA features | transcript_biotype
--ml_flag_small | Multi-loci graph detection based groups flag for small RNA features | 1
--min_overlap_small | Minimal feature-alignment overlapping fraction for assigning a read to a small RNA feature | 1
--feature_output_long | GTF field name for which to summarize counts of long RNA assigned reads | gene_name
--feature_biotype_long | GTF field name defining biotype for long RNA features | gene_biotype
--min_overlap_long | Minimal feature-alignment overlapping fraction for assigning a read to a long RNA feature | 1
--ml_flag_long | Multi-loci graph detection based groups flag for long RNA features | 1
--th_low | Low minimal threshold of feature-to-feature multi-mapping fraction. | 0.01
--th_high | High minimal threshold of feature-to-feature multi-mapping fraction. | 0.75
--subs | Optional sub-sampling number of alignments to build the multi-mapping graph. If 0, include all. | 0
--n_cores (-T) | Number of cores for parallelization by sample | 1
--sample_id | SampleID input file names | None
--seed | Optional fixed seed for random number generation during communities detection |

Configurable arguments details

Sequencing data type

Two arguments need to be set according to the input data type. For a correct interpretation of RNA-seq data during assignation, the integer argument --strand_option needs to be set according to the strandedness of the library preparation method used (0: unstranded, 1: forward-stranded, 2: reverse-stranded). If dealing with paired-end reads, --paired_flag should be added to the command line call.
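As an illustration, the following Python sketch assembles an MGcount command line for a paired-end, reverse-stranded library using only options documented in this guide. The helper function and the file names are hypothetical, and the command is only printed, not executed.

```python
import shlex


def mgcount_command(gtf, bam_list, outdir, paired=False, strand=1, n_cores=1):
    """Build the argument list for an MGcount call (hypothetical helper)."""
    if strand not in (0, 1, 2):
        raise ValueError("strand must be 0 (unstranded), 1 (forward) or 2 (reverse)")
    command = [
        "MGcount",
        "--gtf", gtf,
        "--bam_infiles", bam_list,
        "--outdir", outdir,
        "--strand_option", str(strand),
        "--n_cores", str(n_cores),
    ]
    if paired:
        # The paired-end option is a flag: it is added without a value.
        command.append("--paired_flag")
    return command


cmd = mgcount_command("annotations.gtf", "bam_list.txt", "outputs",
                      paired=True, strand=2)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the MGcount executable is available on the system.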

Multi-core mode

MGcount may process samples in parallel in all three steps of the workflow (hierarchical assignation, multi-mapping graph generation and count matrix building). The number of CPUs to be used by MGcount can be defined with the --n_cores option.

Assignation rounds configuration

Round | feature | feature_output | feature_biotype | min_overlap | ml_flag
small | transcript | transcript_name | transcript_biotype | 1 | True
long_exon | exon | gene_name | gene_biotype | 1 | True
long_intron | gene | gene_name | gene_biotype | 1 | True

At each round of the hierarchical assignation, MGcount extracts the set of annotations in the .gtf with entry type --feature whose --feature_biotype attribute is included in the list of biotypes associated with the round (as defined by the .csv file --btyperounds_filename). Subsequently, alignments are assigned to this restricted set of annotations whenever a minimum read fraction (as defined by --min_overlap) overlaps with an annotated feature of the extracted subset.

If the featureCounts executable is not available at the default location (/usr/bin/featureCounts), the full path to the software should be set through --featureCounts_path.

The association of different biotypes with either the “long” or the “small” assignation round can be customized in a .csv file and passed to MGcount with the --btyperounds_filename argument. The .csv file must be a two-column table with the column names biotype and assignation_round.

By default, MGcount uses a .csv file embedded with the program (or, alternatively, installed with the Python module), which is located in the /mgcount/data sub-folder of the GitHub repository. This table links the set of biotypes encountered in the 4 integrated .gtf files provided (Arabidopsis, Human, Mouse and Nematode) to the corresponding pre-defined small and long rounds.

To run MGcount on other species or with a different annotation set, please make sure the biotypes you want to include in the quantification are correctly listed in this table so that MGcount recognizes them.
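One way to check this before a run is to compare the biotypes present in a .gtf file with those listed in the table. The Python sketch below does this for in-memory toy data; the function names are hypothetical, the .csv header follows the column names given in this section, and the first .csv column is assumed to hold the biotype.

```python
import csv
import io
import re


def gtf_biotypes(gtf_lines, feature_type, biotype_field):
    """Collect the biotype values of all GTF entries of one feature type."""
    pattern = re.compile(r"\b" + re.escape(biotype_field) + r' "([^"]*)"')
    found = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != feature_type:
            continue
        match = pattern.search(columns[8])  # column 9 holds the attributes
        if match:
            found.add(match.group(1))
    return found


def listed_biotypes(csv_file):
    """Read the biotypes from the first column of the rounds table."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    return {row[0] for row in reader if row}


gtf = [
    "#!toy annotation",
    'chr1\tsrc\ttranscript\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_biotype "miRNA";',
    'chr1\tsrc\ttranscript\t200\t300\t.\t+\t.\tgene_id "g2"; transcript_biotype "vault_RNA";',
    'chr1\tsrc\texon\t200\t300\t.\t+\t.\tgene_id "g2"; gene_biotype "protein_coding";',
]
table = io.StringIO("biotype,assignation_round\nmiRNA,small\nprotein_coding,long\n")
missing = gtf_biotypes(gtf, "transcript", "transcript_biotype") - listed_biotypes(table)
print(sorted(missing))
```

Any biotype reported as missing would need a row in the table to be included in the quantification. With real files, the toy list and the `io.StringIO` object would be replaced by open file handles.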

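At each round of the hierarchical assignation, alignment-feature assignation pairs are determined with featureCounts restricted to the designated subset of the .gtf annotated features. Each round can be configured by the user through five arguments (feature, feature_output, feature_biotype, min_overlap and ml_flag), whose default values per round are listed in the table at the beginning of this section.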

Communities detection

To reduce computation time, a fixed number of randomly sub-sampled alignments per sample can be used to build the graph, set through --subs.

The --seed argument may be set to obtain identical results across runs with the same input arguments. The seed initializes the random number generation used during community detection. MGcount ignores weak edges during the map equation optimization, based on a high threshold --th_high (0.75 by default) and a low threshold --th_low (0.01 by default), to prevent over-fitting (splitting of large densely connected communities and merging of small loosely connected communities). Thresholds are applied as follows according to the type of graph:

  • Long-RNA graph: All edges whose weight is below the high threshold are ignored for the long-RNA graph. This avoids collapsing together features that share only partial similarity. Given the long body length, multi-mappers may occur in only a specific part of the locus. In these situations, the threshold determines how large the proportion of shared reads between two features must be for them to be considered for a community. Lower threshold values will tend to aggregate less related features into communities, while higher thresholds will force features to remain single, splitting the multi-mapping reads as fractional counts.

  • Small-RNA graph: Whether the high or the low threshold is used to filter weak edges depends on the edge weight distribution in the small-RNA graph. Here, for each biotype (microRNA, piRNA, snRNA, …), the threshold that is closer to the first quantile of the graph weights is employed. In this way, for biotypes where repeated loci are identical or nearly identical (e.g. microRNA), only weights above the high threshold may be considered for communities, while for biotypes with large groups of similar loci (snRNA, YRNA, pseudogenes, …), all weights may be considered.
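The threshold rules above can be sketched as follows. This is an illustration under stated assumptions, not MGcount's code: "first quantile" is interpreted here as the first quartile (25th percentile), ties between the two distances go to the low threshold, and the function names are hypothetical.

```python
import statistics

TH_LOW, TH_HIGH = 0.01, 0.75  # default values of --th_low and --th_high


def small_rna_threshold(weights, th_low=TH_LOW, th_high=TH_HIGH):
    """Pick the threshold closer to the first quartile of the edge weights.

    Assumption: "first quantile" is read as the 25th percentile.
    Requires at least two weights.
    """
    first_quartile = statistics.quantiles(weights, n=4)[0]
    if abs(first_quartile - th_low) <= abs(first_quartile - th_high):
        return th_low
    return th_high


def filter_edges(edge_weights, threshold):
    """Drop edges whose weight is below the threshold."""
    return {edge: w for edge, w in edge_weights.items() if w >= threshold}


# Biotype with near-identical repeated loci: weights concentrate near 1.
print(small_rna_threshold([0.9, 0.95, 1.0, 1.0]))    # selects th_high
# Biotype with large groups of loosely similar loci: many small weights.
print(small_rna_threshold([0.02, 0.05, 0.1, 0.8]))   # selects th_low

# Long-RNA graph: the high threshold is always applied.
long_edges = {("geneA", "geneB"): 0.8, ("geneB", "geneA"): 0.5}
print(filter_edges(long_edges, TH_HIGH))
```

In the long-RNA example only the edge from geneA to geneB survives, since only 50% of the reads of geneB are shared with geneA.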

Outputs

At the end of its execution, MGcount provides the following outputs:

  • Count matrix: A matrix where each row corresponds to a feature as defined by feature_output (either single features or MG communities aggregating several features) and each column corresponds to one input BAM file.
  • Features metadata: A table reporting feature names matching row names in the count matrix, the counting round of hierarchical assignation and its configuration parameters, a flag indicating whether a feature belongs to an MG community, and the feature biotype.
  • Multi-mapping graph adjacency matrix: A sparse adjacency matrix for each multi-mapping graph generated (small RNA and/or long RNA), stored as a symmetric, integer, square matrix. Each matrix element stores the number of alignments that multi-map to a pair of features (defined by row and column), and the diagonal contains the total number of alignments per feature.
  • Multi-mapping graph communities (MGcommunities): A table of MG communities linking each original feature in the GTF file with the resultant count matrix and metadata feature identifiers. It includes both unique features (which remain unmodified) and aggregated features (which are collapsed following MG communities). Also, the table stores the total number of alignments per feature.

Tutorials

T1 - Prepare inputs and execute

In the following tutorials, we use two sub-sampled human brain RNA-seq libraries as an example to walk through MGcount execution. First of all, let’s create a folder and download the alignment .bam files of the two samples. (The download might take a few minutes.)

mkdir mgcount_tutorial
cd mgcount_tutorial
wget https://filedn.com/lTnUWxFTA93JTyX3Hvbdn2h/mgcount/tutorial_bamfiles.zip
unzip tutorial_bamfiles.zip -d input_bamfiles

To run MGcount, we need to provide the software with a .txt file specifying the paths to the input alignment files. Here, these are the two .bam alignment files we just downloaded. We can generate this file from the command line:

printf "$PWD/%s\n" input_bamfiles/* > input_bamfilenames.txt

The other required input is a .gtf file with transcript feature annotations. The MGcount repository provides four ready-to-use .gtf files integrating annotations from several databases (see Hita et al., 2022, Appendices, Methods, Database Integration). Next, we will download them and use the human annotations file for our execution example.

wget https://filedn.com/lTnUWxFTA93JTyX3Hvbdn2h/mgcount/integrated_annotations_gtf.zip
unzip integrated_annotations_gtf.zip -d annotations_gtf

Once we have both the .gtf and the input .bam files, we can simply run MGcount as an executable command-line program or, if installed via pip, as a Python 3 module. A Python module is run by calling python3 with the parameter -m and the name of the module, in this case “mgcount”. After this, we need to specify the arguments required by MGcount.

For this example, we pass the ready-integrated .gtf for the human genome, the .txt file we just created containing the paths to the input alignment .bam files and a string designating the directory where MGcount outputs will be generated.

To reduce the computational time, we set the multicore parameter (-T) to “2” in order to parallelize the different steps of the algorithm by sample.

Run as an executable program:

MGcount -T 2 --gtf annotations_gtf/Homo_sapiens.GRCh38.gtf --outdir outputs --bam_infiles input_bamfilenames.txt

Run as a python3 module:

python3 -m mgcount -T 2 --gtf annotations_gtf/Homo_sapiens.GRCh38.gtf --outdir outputs --bam_infiles input_bamfilenames.txt

After the MGcount run finishes successfully, your output directory should contain 6 new output files generated by MGcount, including the RNA count matrix, the feature metadata table and two files containing the graph structure and the communities detected for long RNA and small RNA respectively.

Alternatively, MGcount can be invoked from an R console with the function “system”:

root_dir <- '~/mgcount_tutorial/'
system(paste0('MGcount -T 2',
              ' --gtf ', root_dir, 'annotations_gtf/Homo_sapiens.GRCh38.gtf',
              ' --outdir ', root_dir, 'outputs',
              ' --bam_infiles ', root_dir, 'input_bamfilenames.txt'))

We are done!

T2 - Explore quantification outputs

In this tutorial, we will load the MGcount outputs obtained in tutorial 1 and explore them from R. The MGcount repository contains a few supporting R scripts that provide auxiliary functionality (managing annotations, visualizing MGcount outputs…). The folder containing these scripts can be downloaded on its own with the following shell command:

cd mgcount_tutorial
svn export https://github.com/hitaandrea/MGcount/trunk/R

Once the scripts are downloaded, we are ready to launch R. Please, start an R session and define the tutorial root directory as a variable.

root_dir <- '~/mgcount_tutorial/'

Further, this tutorial uses the following R packages. Please, install them with install.packages() if you wish to run this tutorial on your system.

library(dplyr)
library(Hmisc)
library(ggplot2)
library(ggpubr)
library(summarytools)
library(knitr)       # kable(), used below to display tables
library(kableExtra)  # scroll_box(), used below to display tables
source(paste0(root_dir,'R/integrate_gtf_annotations.R'))

The main output of MGcount is the count_matrix.csv file containing the feature by sample expression matrix. We can import it into R as any regular .csv.

counts <- read.csv(paste0(root_dir,'outputs/count_matrix.csv'), row.names = 1)
colnames(counts) <- sub('_Aligned.genome.dedup','',colnames(counts))

Each row in the count table is a feature for which expression has been quantified and each column is associated with a sample. By inspecting the matrix dimensions, we see that the matrix contains two columns corresponding to the two human brain libraries. The matrix can be used as input for any RNA-seq downstream analysis.

dim(counts)
## [1] 46029     2

We can look at the total number of reads assigned from each library by summing each of the two columns, and then compute the mean over the two libraries.

colSums(counts)
## Human_Brain_total_100ng_1_subsample Human_Brain_total_100ng_2_subsample 
##                             3260426                             3889868
print(paste('Mean counts:',mean(colSums(counts))))
## [1] "Mean counts: 3575147.255"

Let’s now import the feature_metadata output. This table reports feature-related attributes such as the assignation round to which the annotation belongs, a flag stating whether the feature is an aggregated community of annotations or an individual feature, and the biotype associated with the feature. The feature identifiers in the count_matrix and the feature_metadata match; therefore, this table can facilitate the extraction of particular features from the count matrix (e.g. a certain biotype, the exonic counts, the subset of features aggregated in communities, etc.).

feat_metadata <- read.csv(paste0(root_dir,'outputs/feature_metadata.csv'))

Here, for example, we use the feature_metadata table to look at the distribution of counts by assignation round.

## Total counts per feature, combined with the feature metadata
df <- cbind('counts' = rowSums(counts), feat_metadata)
## Violin plot of counts by assignation round (log-scaled y axis)
ggplot(df, aes(x = assignation_round, y = counts)) + geom_violin(fill = 'grey') + 
   scale_y_continuous(trans = 'log', limits = c(10,100000), breaks = c(10,100,1000,10000,100000)) + theme_pubclean() +
  geom_point(position = position_jitter(seed = 1, width = 0.4), size = 0.005, alpha = 0.2)

Below, we display a few random rows from the table for illustration.

feat_subset <- with(feat_metadata, feat_metadata[c(
   sample(which(assignation_round == 'small' & community_flag == "True"), 2),
   sample(which(assignation_round == 'small' & community_flag == "False"), 2),
   sample(which(assignation_round == 'long_exon' & community_flag == "True"), 1),
   sample(which(assignation_round == 'long_exon' & community_flag == "False"), 1),
   sample(which(assignation_round == 'long_intron' & community_flag == "True"), 1),
   sample(which(assignation_round == 'long_intron' & community_flag == "False"), 1)),])
kable(feat_subset) %>% scroll_box(height = "300px")
(row) | feature | assignation_round | annotations_subset | feature_type | feature_output | feature_biotype | community_flag
1657 | SNORA8_AC007448.1_Z77249.1 | small | small | transcript | transcript_name | snoRNA | True
1276 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | small | small | transcript | transcript_name | snRNA | True
2130 | hsa-miR-323b-3p | small | small | transcript | transcript_name | miRNA | False
1545 | SCARNA10 | small | small | transcript | transcript_name | snoRNA | False
6308 | ACTR2-exon | long_exon | long | exon | gene_name | protein_coding | True
10226 | CBR1-exon | long_exon | long | exon | gene_name | protein_coding | False
33009 | CNIH4-intron | long_intron | long | gene | gene_name | protein_coding | True
42590 | SLC15A2-intron | long_intron | long | gene | gene_name | protein_coding | False

Next, we show how to generate a barplot showing the read distribution by biotype group. First, let’s load a table to group biotypes by category. We will use this to group less abundant biotypes into larger groups for visualization purposes.

bcats <- define_bcats()

feat_df <- merge(feat_metadata, bcats, by.x = 'feature_biotype', by.y = 'biotype', all.x = TRUE, all.y = FALSE) 

## Add exon/intron distinction to biogroup based on counting round
feat_df$biogroup <- as.character(feat_df$biogroup)
feat_df$biogroup[feat_df$biogroup == 'Protein_coding'] <- 'Protein_coding_exon'
feat_df$biogroup[feat_df$biogroup == 'Long_non_coding'] <- 'Long_non_coding_exon'  
feat_df$biogroup[feat_df$assignation_round == 'long_intron'] <-
  sub('exon','intron',feat_df$biogroup[feat_df$assignation_round == 'long_intron'])

feat_df$biogroup <- as.factor(feat_df$biogroup)
feat_df <- feat_df[match(rownames(counts), feat_df$feature),]

We then combine the feature_metadata table with the count_matrix again and sum up the counts to get the total number of reads by biotype. We do this separately for biotype groups and for individual small non-coding biotypes, to further represent the small non-coding spectrum.

## Generate biotype matrix
df <- data.frame(counts %>% group_by(feat_df$biogroup) %>% summarise_all(sum), 
                 check.names = FALSE); names(df)[1] <- 'biotype'
biotype <- reshape(df, idvar = 'biotype', varying = list(names(df)[-1]),
                      v.names = 'counts', times = names(df)[-1],
                      timevar = 'sn', direction = 'long')  
biotype$biotype <- as.character(biotype$biotype)
bgroups <- c("Hybrid","Long_pseudogenes","Protein_coding_intron","Long_non_coding_intron",
             "Protein_coding_exon","Long_non_coding_exon" ,"Short_non_coding","tRNA","rRNA")
biotype$biotype[biotype$biotype %nin% bgroups] <- 'Hybrid'
biotype$biotype <- factor(biotype$biotype, levels = bgroups)

## Generate non-coding biotype table
df <- data.frame(counts[feat_df$biocat == 'sNC',] %>%
                 group_by(feat_df$feature_biotype[feat_df$biocat == 'sNC']) %>% summarise_all(sum), 
                 check.names = FALSE); names(df)[1] <- 'biotype'
biotype_snc <- reshape(df, idvar = 'biotype', varying = list(names(df)[-1]),
               v.names = 'counts', times = names(df)[-1],
               timevar = 'sn', direction = 'long')  

Once we have extracted the counts by biotype group, let’s employ ggplot to visualize the read distribution profiles as barplots.

Reads distribution by biotype:

## ---- Abundance plot by biotype group
colP <- c('violetred1','slateblue1','darkgrey','lightgrey',
          'springgreen4','springgreen3','violetred4','tan3','tan4')
names(colP) <- bgroups
p1 <- ggplot(biotype, aes(x = sn, y = counts, group = biotype, fill = biotype)) +    
  geom_bar(stat = 'identity', colour = 'black', width = 0.8) + 
  coord_flip() + xlab('') + ylab('Number of assigned reads') + theme_pubclean() + 
  guides(fill=guide_legend(nrow=5)) + scale_fill_manual(values=colP) +  
  theme(legend.position = 'top', legend.title = element_blank())

## ---- Relative abundance plot by small non-coding biotype
colP <- c('#e6be97','#848CFF','#4c2382', '#50eb76','tomato','#FDFF87','#FFAE51','darkorchid1','gold1','sienna3')
p2 <- ggplot(biotype_snc, aes(x = sn, y = counts, group = biotype, fill = biotype)) +
  geom_bar(stat = 'identity', position = 'fill', colour = 'black', width = 0.8) + theme_pubclean() +
  coord_flip() + xlab('') + ylab('Proportion of small non-coding assigned reads') + 
  guides(fill=guide_legend(nrow=5)) +   scale_fill_manual(values = colP) +
  theme(axis.text.y = element_blank(), legend.position = 'top', legend.title = element_blank())

## Display plots together
ggarrange(p1, p2, ncol = 2, widths = c(2,1))

Finally, we import the multigraph_communities tables. These tables link each original feature in the .gtf with the resultant feature matching the count matrix and feature metadata identifiers. It includes both unique features (that remain unmodified) and aggregated features (that are collapsed following MG communities). Thus, we can track back the features grouped in each aggregated feature.

Here we will use the feature_metadata subset from before to explore the original features in the .gtf forming each new MG community feature.

csmall <- read.csv(paste0(root_dir,'outputs/multigraph_communities_small.csv'))
clong <- read.csv(paste0(root_dir,'outputs/multigraph_communities_long_exon.csv'))

kable(subset(csmall, feature %in% feat_subset$feature)) %>% scroll_box(height = "300px")
transcript_name transcript_biotype naln naln_community community_flag community_id community_name community_biotype feature
38 RNU4-28P snRNA 4 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
79 RNU4-27P snRNA 47 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
122 RNU4-88P snRNA 10 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
150 RNU4-59P snRNA 179 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
161 RNU4-75P snRNA 16 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
216 U4.4 snRNA 6 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
232 U4.7 snRNA 321 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
239 RNU4-42P snRNA 7 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
334 RNU4-21P snRNA 40 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
340 RNU4-77P snRNA 2 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
366 RNU4-73P snRNA 236 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
404 RNU4-63P snRNA 8 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
411 RNU4-49P snRNA 110 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
426 RNU4-51P snRNA 7 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
472 RNU4-8P snRNA 100 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
473 RNU4-84P snRNA 285 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
492 RNU4-48P snRNA 20 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
596 U4.5 snRNA 262 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
623 RNU4-85P snRNA 224 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
644 RNU4-56P snRNA 34 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
649 RNU4-78P snRNA 12 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
729 RNU4-62P snRNA 92 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
779 RNU4-38P snRNA 11 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
787 RNU4-4P snRNA 117 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
789 RNU4-91P snRNA 240 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
832 RNU4-89P snRNA 14 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
939 RNU4-33P snRNA 22 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
975 RNU4-79P snRNA 25 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
979 RNU4-87P snRNA 22 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
995 RNU4-64P snRNA 19 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1078 RNU4-11P snRNA 227 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1133 RNU4-14P snRNA 216 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1169 U4.3 snRNA 68 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1254 RNU4-66P snRNA 17 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1264 RNU4-12P snRNA 318 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1277 RNU4-70P snRNA 55 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1304 RNU4-35P snRNA 6 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1305 RNU4-76P snRNA 78 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1313 RNU4-18P snRNA 14 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1332 RNU4-7P snRNA 427 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1496 RNU4-74P snRNA 15 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1519 RNU4-31P snRNA 27 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1533 RNU4-6P snRNA 237 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1567 RNU4-52P snRNA 281 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1584 RNU4-81P snRNA 7 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1644 Z77249.1 snoRNA 15 944 True snoRNA-cl-129 SNORA8_AC007448.1_Z77249.1 snoRNA SNORA8_AC007448.1_Z77249.1
1647 RNU4-44P snRNA 227 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1685 RNU4-71P snRNA 7 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1723 RNU4-50P snRNA 4 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1766 RNU4-83P snRNA 3 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1785 U4.6 snRNA 69 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1790 RNU4-25P snRNA 186 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1814 RNU4-26P snRNA 4 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1833 RNU4-53P snRNA 195 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1869 RNU4-15P snRNA 5 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1907 RNU4-82P snRNA 1405 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2013 RNU4-39P snRNA 160 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2050 SNORA8 snoRNA 912 944 True snoRNA-cl-129 SNORA8_AC007448.1_Z77249.1 snoRNA SNORA8_AC007448.1_Z77249.1
2060 RNU4-55P snRNA 4 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2086 RNU4-23P snRNA 126 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2095 RNU4-86P snRNA 2 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2100 U4.2 snRNA 24 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2213 RNU4-5P snRNA 8 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2230 SCARNA10 snoRNA 625 625 False SCARNA10
2251 RNU4-67P snRNA 84 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2255 RNU4-54P snRNA 104 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2297 RNU4-20P snRNA 5 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2306 RNU4-65P snRNA 17 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2329 RNU4-24P snRNA 2 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2338 RNU4-41P snRNA 4 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2354 RNU4-32P snRNA 15 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2372 RNU4-2 snRNA 16486 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2373 RNU4-1 snRNA 3245 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2393 RNU4-9P snRNA 37 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2453 RNU4-10P snRNA 344 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2563 RNU4-92P snRNA 26 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2627 RNU4-68P snRNA 112 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2776 RNU4-80P snRNA 34 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2862 RNU4-46P snRNA 147 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2897 RNU4-58P snRNA 338 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2909 RNU4-30P snRNA 157 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2911 RNU4-36P snRNA 245 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
3068 RNU4-13P snRNA 424 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
3088 AC007448.1 snoRNA 17 944 True snoRNA-cl-129 SNORA8_AC007448.1_Z77249.1 snoRNA SNORA8_AC007448.1_Z77249.1
3176 RNU4-17P snRNA 71 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
3243 RNU4-40P snRNA 7 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
3332 RNU4-60P snRNA 25 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
3458 RNU4-45P snRNA 20 28863 True snRNA-cl-7 RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc snRNA RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
5420 hsa-miR-323b-3p miRNA 6 6 False hsa-miR-323b-3p
kable(subset(clong, feature %in% sub('-exon','',feat_subset$feature))) %>% scroll_box(height = "300px")
gene_name gene_biotype naln naln_community community_flag community_id community_name community_biotype feature
3877 ACTR2 protein_coding 708 748 True long-cl-3296 ACTR2 protein_coding ACTR2
33247 AP000357.1 processed_pseudogene 40 748 True long-cl-3296 ACTR2 protein_coding ACTR2
33974 CBR1 protein_coding 325 325 False CBR1
kable(subset(clong, feature %in% sub('-intron','',feat_subset$feature))) %>% scroll_box(height = "300px")
gene_name gene_biotype naln naln_community community_flag community_id community_name community_biotype feature
2919 CNIH4 protein_coding 98 99 True long-cl-2498 CNIH4 protein_coding CNIH4
6688 SLC15A2 protein_coding 39 39 False SLC15A2
12085 AL590002.1 processed_pseudogene 1 99 True long-cl-2498 CNIH4 protein_coding CNIH4

We can also use the multigraph_communities tables to extract summary statistics from the MGcount run. Here we look at the proportion of features aggregated into communities (community_flag variable) by biotype (transcript_biotype for small RNA, gene_biotype for long RNA). The output shows that small RNA features are aggregated into communities more often than long RNA features (about 62% versus about 34% overall), because duplicated loci are more frequent among small RNAs.

kable(with(csmall, ctable(transcript_biotype, community_flag)))
False True Total
miRNA 484 69 553
misc_RNA 251 1288 1539
Mt_rRNA 2 0 2
Mt_tRNA 22 0 22
piRNA 822 706 1528
ribozyme 2 0 2
rRNA 6 30 36
rRNA_pseudogene 104 189 293
scaRNA 15 13 28
snoRNA 180 436 616
snRNA 260 700 960
tRNA 43 179 222
Total 2191 3610 5801
False True Total
miRNA 0.8752260 0.1247740 1
misc_RNA 0.1630929 0.8369071 1
Mt_rRNA 1.0000000 0.0000000 1
Mt_tRNA 1.0000000 0.0000000 1
piRNA 0.5379581 0.4620419 1
ribozyme 1.0000000 0.0000000 1
rRNA 0.1666667 0.8333333 1
rRNA_pseudogene 0.3549488 0.6450512 1
scaRNA 0.5357143 0.4642857 1
snoRNA 0.2922078 0.7077922 1
snRNA 0.2708333 0.7291667 1
tRNA 0.1936937 0.8063063 1
Total 0.3776935 0.6223065 1
kable(with(clong, ctable(gene_biotype, community_flag)))
False True Total
Hybrid 9 9 18
IG_C_gene 4 2 6
IG_C_pseudogene 0 1 1
IG_J_pseudogene 0 1 1
IG_V_gene 2 6 8
IG_V_pseudogene 4 4 8
lncRNA 6256 1509 7765
polymorphic_pseudogene 11 7 18
processed_pseudogene 2354 3939 6293
protein_coding 12945 4903 17848
TR_C_gene 5 0 5
TR_V_gene 8 6 14
TR_V_pseudogene 3 0 3
transcribed_processed_pseudogene 173 208 381
transcribed_unitary_pseudogene 75 18 93
transcribed_unprocessed_pseudogene 350 321 671
translated_processed_pseudogene 0 2 2
translated_unprocessed_pseudogene 0 1 1
unitary_pseudogene 21 8 29
unprocessed_pseudogene 438 560 998
Total 22658 11505 34163
False True Total
Hybrid 0.5000000 0.5000000 1
IG_C_gene 0.6666667 0.3333333 1
IG_C_pseudogene 0.0000000 1.0000000 1
IG_J_pseudogene 0.0000000 1.0000000 1
IG_V_gene 0.2500000 0.7500000 1
IG_V_pseudogene 0.5000000 0.5000000 1
lncRNA 0.8056665 0.1943335 1
polymorphic_pseudogene 0.6111111 0.3888889 1
processed_pseudogene 0.3740664 0.6259336 1
protein_coding 0.7252913 0.2747087 1
TR_C_gene 1.0000000 0.0000000 1
TR_V_gene 0.5714286 0.4285714 1
TR_V_pseudogene 1.0000000 0.0000000 1
transcribed_processed_pseudogene 0.4540682 0.5459318 1
transcribed_unitary_pseudogene 0.8064516 0.1935484 1
transcribed_unprocessed_pseudogene 0.5216095 0.4783905 1
translated_processed_pseudogene 0.0000000 1.0000000 1
translated_unprocessed_pseudogene 0.0000000 1.0000000 1
unitary_pseudogene 0.7241379 0.2758621 1
unprocessed_pseudogene 0.4388778 0.5611222 1
Total 0.6632322 0.3367678 1
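
For readers who prefer base R, the same kind of summary can be sketched with table and prop.table. This is a minimal illustration on a made-up toy table, independent of the ctable helper used above; only the column names transcript_biotype and community_flag are taken from the MGcount output.

```r
## Toy communities table (made-up values, for illustration only)
toy <- data.frame(
  transcript_biotype = c("miRNA", "miRNA", "snRNA", "snRNA", "snRNA", "tRNA"),
  community_flag     = c("False", "True", "True", "True", "False", "True")
)

## Counts of features per biotype, split by community membership
counts <- with(toy, table(transcript_biotype, community_flag))

## Row-wise proportions: each biotype row sums to 1
props <- prop.table(counts, margin = 1)

print(addmargins(counts))
print(round(props, 3))
```

Passing margin = 1 normalises within each biotype, which matches the row-wise proportions shown in the tables above.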

We are done!

T3- Explore output multi-mapping graph

Below, we use a few functions defined in “mg_visualize.R” to graphically explore the multi-mapping graph topologies.

The function mg_build takes the multi-mapping graph adjacency matrix and the MG communities table and creates an igraph object that can be explored with the igraph library. The other two functions, mg_plotset and mg_interactive, generate a few default plots from a multi-mapping graph igraph object.

library(Matrix)
library(igraph)
library(plotly)
source(paste0(root_dir,'R/mg_visualize.R'))

As an example, we load the small-RNA graph adjacency matrix and subset it to explore the sub-graph associated with microRNA features.

The adjacency matrix is stored as a sparse symmetric matrix in MatrixMarket format, which can be imported into R with the readMM function from the Matrix R package. We also import the table containing all annotated features and their MG community assignments.

dir.create(paste0(root_dir,'mgplots'))

## Import communities table and adjacency matrix for the small RNA assignment round
inputM <- readMM(paste0(root_dir,'outputs/multigraph_matrix_small.mtx'))
multiloci_table <- read.csv(paste0(root_dir,'outputs/multigraph_communities_small.csv'))
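
To get a feel for the MatrixMarket format without the MGcount outputs at hand, the round trip below writes a small sparse matrix to a temporary .mtx file and reads it back. It is only a sketch: the matrix values are made up, and the file is a temporary one rather than an actual multigraph_matrix_small.mtx.

```r
library(Matrix)

## Toy symmetric adjacency matrix for three features (made-up weights)
adj <- sparseMatrix(i = c(1, 2, 2, 3),
                    j = c(2, 1, 3, 2),
                    x = c(5, 5, 2, 2),
                    dims = c(3, 3))

## Write it in MatrixMarket format to a temporary file and read it back
tmp <- tempfile(fileext = ".mtx")
writeMM(adj, tmp)
adj_back <- readMM(tmp)

## readMM returns a sparse matrix; the stored values are preserved
print(adj_back)
print(isSymmetric(as.matrix(adj_back)))
```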

Next we subset the matrix by selecting the features with the miRNA biotype. Each row in the communities table corresponds to a feature annotation with non-zero assignments in the small round. The row index of each feature in the table defines its row and column position in the multi-mapping graph adjacency matrix.

## Extract microRNA matrix subset
btype <- 'miRNA'
idx <- which(multiloci_table$transcript_biotype == btype)
inM <- inputM[idx,idx]; ml <- multiloci_table[idx,]
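
The index alignment between the communities table and the adjacency matrix can be illustrated on a toy example. The feature names and weights below are hypothetical; the point is only that the same index vector is applied to the table rows and to both matrix dimensions.

```r
## Toy communities table: row k describes the feature at row/column k of the matrix
toy_table <- data.frame(
  transcript_name    = c("featA", "featB", "featC", "featD"),
  transcript_biotype = c("miRNA", "snRNA", "miRNA", "tRNA")
)

## Toy symmetric adjacency matrix (made-up weights), in the same feature order
toy_adj <- matrix(c(0, 1, 4, 0,
                    1, 0, 0, 2,
                    4, 0, 0, 0,
                    0, 2, 0, 0), nrow = 4, byrow = TRUE)

## Select one biotype and apply the same index to rows, columns and table
keep    <- which(toy_table$transcript_biotype == "miRNA")
sub_adj <- toy_adj[keep, keep, drop = FALSE]
sub_tab <- toy_table[keep, ]

## Label the sub-matrix with the matching feature names
dimnames(sub_adj) <- list(sub_tab$transcript_name, sub_tab$transcript_name)
print(sub_adj)
```

Using drop = FALSE keeps the result a matrix even when a single feature is selected.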

The steps needed to convert the adjacency matrix and the communities table into an igraph object are wrapped in the mg_build function.

## Generate microRNA graph object
attr <- 'transcript_name'
g <- mg_build(inM, ml, attr)
g <- delete_vertices(g, V(g)[V(g)$weight < 1])
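
As a rough idea of what such a conversion involves, the sketch below builds a weighted undirected graph from a toy adjacency matrix with plain igraph calls. It is not the mg_build implementation: in particular, taking the vertex weight from the matrix diagonal is an assumption made here for illustration, and all values are made up.

```r
library(igraph)

## Toy adjacency matrix (made-up values); off-diagonal entries act as edge weights
adj <- matrix(c(10, 3, 0,
                 3, 6, 0,
                 0, 0, 0.5), nrow = 3, byrow = TRUE,
              dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

## Build an undirected weighted graph, ignoring the diagonal for edges
g_toy <- graph_from_adjacency_matrix(adj, mode = "undirected",
                                     weighted = TRUE, diag = FALSE)

## Assumption for this sketch only: vertex weight is taken from the diagonal
V(g_toy)$weight <- diag(adj)

## Same filtering step as above: drop vertices with weight below 1
g_toy <- delete_vertices(g_toy, V(g_toy)[V(g_toy)$weight < 1])

print(vcount(g_toy))
print(E(g_toy)$weight)
```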

We can explore the patterns in the feature gene symbols established by the HGNC and link them to specific colors during visualization. Here we force annotations associated with a mature 3-prime microRNA to be colored in orange and annotations associated with a mature 5-prime microRNA to be colored in blue. For this, we look for the “-3p” and “-5p” patterns in the transcript_name annotation symbol, which MGcount uses as the default feature_output for small RNA.

## Extract microRNA mature arm information (3p/5p) from the HUGO symbol
V(g)$color_HUGO1 <- 'grey'
V(g)$color_HUGO1[grep('-3p',V(g)$feat)] <- 'orange'
V(g)$color_HUGO1[grep('-5p',V(g)$feat)] <- 'dodgerblue'

## Extract microRNA class from HUGO symbol
V(g)$color_HUGO2 <- 'grey'
idx <- grep('hsa-miR',as.character(V(g)$feat))
hugo <- as.factor(gsub('.*?([0-9]+).*', '\\1', V(g)$feat[idx]))
V(g)$color_HUGO2[idx] <- sample(grDevices::colors()[grep('gr(a|e)y', grDevices::colors(), invert = TRUE)], 
                               length(unique(hugo)))[hugo]
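
The regular expression used for the microRNA class keeps the first run of digits in each symbol. The sketch below shows this on three example symbols and then maps each class to a color by factor level. The argument perl = TRUE is added here as an assumption, to make the lazy quantifier explicit; the two-color palette is hypothetical and fixed rather than sampled.

```r
## Example symbols; the let-7 entry does not match 'hsa-miR' and would stay grey
feat   <- c("hsa-miR-323b-3p", "hsa-miR-19a-5p", "hsa-let-7a-5p")
is_mir <- grepl("hsa-miR", feat)

## Keep the first run of digits in each symbol (perl = TRUE is an assumption here)
fam <- gsub(".*?([0-9]+).*", "\\1", feat[is_mir], perl = TRUE)
fam <- factor(fam)

## One color per class; indexing by a factor uses its integer level codes
palette <- c("tomato", "steelblue")
cols    <- palette[fam]

print(data.frame(feature = feat[is_mir], class = fam, color = cols))
```

Because the colors in the tutorial code are drawn with sample, calling set.seed beforehand would make the color assignment reproducible across runs.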

The function mg_plotset generates a set of different plots of the multi-mapping graph and stores them as .png files with a user-given file path and prefix (plotfile argument).

We next use the magick R package to load a few of the generated .png files into R. Each vertex is an annotated feature with size proportional to its number of aligned reads. Each edge connects two features that share multi-mappers, with thickness proportional to the fraction of shared multi-mappers over the total alignments. Shaded grey areas delineate MG communities. Vertices are colored according to the attribute “customColor”, which we set to orange for “-3p”, blue for “-5p” and grey when neither pattern is present. This can be modified as desired.

## Define plot color
V(g)$customColor <- V(g)$color_HUGO1

## Standard plots
mg_plotset(g, plotfile= paste0(root_dir,'mgplots/mg_miRNA'))

Raw visualization of the graph

## Display plot
par(mar=c(0.01,0.01,0.01,0.01))
img <- magick::image_read(paste0(root_dir,'mgplots/mg_miRNA.png'))
plot(img)

Visualization of the graph colored by -3p/-5p patterns in the microRNA transcript symbol

## Display plot
par(mar=c(0.01,0.01,0.01,0.01))
img <- magick::image_read(paste0(root_dir,'mgplots/mg_miRNA_color.png'))
plot(img)

Visualization of the graph colored by -3p/-5p patterns in the microRNA transcript symbol and detected communities

## Display plot
par(mar=c(0.01,0.01,0.01,0.01))
img <- magick::image_read(paste0(root_dir,'mgplots/mg_miRNA_cl_color.png'))
plot(img)

Interactive visualization of the graph

To explore large graphs with interactive tools such as zoom, we can use mg_interactive, which creates a plotly object from the graph. Here the different communities are represented by colors.

mg_interactive(g, paste0(root_dir,'mgplots/mg_miRNA'))

We are done!

Integrating annotation sources

MGcount is conceived to analyse heterogeneous datasets capturing diverse non-coding transcript types, but the scope of the features it quantifies is bounded by the features annotated in the reference .gtf file used as input. Although MGcount can be executed with any .gtf annotations file (Ensembl, Gencode, etc.), we provide the option to integrate annotations from several databases so that the quantification takes into account a more complete and/or up-to-date annotation set, especially for small regulatory RNAs (piRNA, tRF, microRNA, siRNA). These transcripts are not necessarily annotated in general databases.

In tutorial 1, we used a ready-made integrated .gtf for the human genome. In addition, the directory integrated_annotations_gtf that we downloaded in tutorial 1 provides integrated annotation .gtf files for Arabidopsis, Mouse and Nematode that can be used to run MGcount on the corresponding species. The following databases have been integrated for each species:

  • Arabidopsis thaliana: Ensembl, miRBase (microRNA) and RNAcentral (siRNA);
  • Homo sapiens: Ensembl, DASHR (piRNA and tRNA fragments [tRF]) and miRBase (microRNA);
  • Mus musculus: Ensembl, miRBase (microRNA) and RNAcentral (piRNA);
  • Caenorhabditis elegans: Ensembl and miRBase (microRNA).

To run MGcount on any other genome, we encourage you to follow the same procedure we used to generate these four .gtf files. The script we used to generate them is provided in the R folder of the MGcount GitHub repository (which can be downloaded on its own with the command given in tutorial 2). We hope the script can serve as a template for custom .gtf integration in other species.
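
To give a general idea of what integrating two annotation sources can look like, the sketch below merges two tiny GTF tables in base R. It is not the repository script: the records are made up, and the policy of dropping miRNA entries from the general source in favour of the specialised one is a hypothetical choice for illustration only.

```r
## Standard nine GTF columns
gtf_cols <- c("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attribute")

## Helper for this sketch: parse GTF text into a data frame
read_gtf_text <- function(txt) {
  read.delim(text = txt, header = FALSE, comment.char = "#", quote = "",
             col.names = gtf_cols, stringsAsFactors = FALSE)
}

## Made-up records standing in for a general and a specialised database
general_txt <- paste(
  "1\tensembl\texon\t1000\t1200\t.\t+\t.\tgene_id \"G1\"; gene_biotype \"protein_coding\";",
  "1\tensembl\texon\t5000\t5070\t.\t-\t.\tgene_id \"G2\"; gene_biotype \"miRNA\";",
  sep = "\n")
special_txt <- "1\tmiRBase\texon\t3000\t3021\t.\t+\t.\tgene_id \"M1\"; gene_biotype \"miRNA\";"

general <- read_gtf_text(general_txt)
special <- read_gtf_text(special_txt)

## Hypothetical policy: drop the biotype covered by the specialised database
general_kept <- general[!grepl('gene_biotype "miRNA"', general$attribute, fixed = TRUE), ]

## Concatenate and sort by chromosome and start position
merged <- rbind(general_kept, special)
merged <- merged[order(merged$seqname, merged$start), ]

## Write the integrated annotation as a tab-separated file without header
out_file <- tempfile(fileext = ".gtf")
write.table(merged, file = out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
print(merged[, c("source", "start", "end")])
```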