MGcount user guide
Getting started
What is MGcount
MGcount is an RNA-seq quantification tool designed to address ambiguous read alignments in a flexible way that is compatible with any biotype. This makes it possible to extract more information from total RNA-seq datasets by simultaneously quantifying coding and non-coding transcripts, both small and long, where dealing with heterogeneous read-to-feature alignment ambiguities is key. When aligning reads to a reference genome, we distinguish between two types of ambiguous alignments:
- Multi-mappers: reads aligning to multiple genomic locations.
- Multi-overlaps: reads aligning to a genomic location with multiple annotated features.
To deal with the most frequent multi-overlaps, MGcount hierarchically assigns reads to small RNA, long RNA exons and long RNA introns, accounting for their length disparity. Subsequently, MGcount models read-to-feature alignments in a graph. This is exploited to detect and report expression of sequence-similar annotated loci, where reads systematically multi-map, as integrated features called communities.
Citation
Hita, A., Brocart, G., Fernandez, A., Rehmsmeier, M., Alemany, A. and Schvartzman, S., 2022. MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts. BMC Bioinformatics 23, 39 (2022). https://doi.org/10.1186/s12859-021-04544-3
System requirements
MGcount is built on top of featureCounts, a well-known and computationally efficient program (Liao et al., 2014). Please download it from the following link: http://subread.sourceforge.net/
Installation
MGcount is written in Python and is executed from the command line. You can either download the executable version (single binary file) or install it as a Python3 module.
Install and run as a Python3 module
You can install the package as a Python module:
pip3 install git+https://github.com/hitaandrea/MGcount.git
Once the package is installed, run the tool as a Python installed module:
python3 -m mgcount [args]
Download and run the executable program
Alternatively you can download the latest release as a single executable file here.
Save the program file to your Linux system and set the permissions to allow executing the file as a program:
chmod +x MGcount
Once the file is executable, run the tool by calling the file from the command line with the desired arguments.
MGcount [args]
Workflow
During its execution, MGcount performs the following 3 steps:
1. Hierarchical assignation
To address the most frequent multi-overlapping ambiguous situations, reads are assigned to annotated genomic features in three pre-defined sequential rounds based on transcript body length: small RNA, long RNA exon and long RNA intron. MGcount prioritizes small RNA when a read aligns to a small RNA locus that is embedded within a long RNA. In the second round, alignments are assigned to long RNA exon features. In the final round, reads that haven’t been assigned to any small RNA or long RNA exon are assigned to introns if they align within the full gene body of a long RNA. Hence, all reads with at least one alignment overlapping an annotated feature in the current round are assigned to that feature and skipped in subsequent rounds.
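The round priority described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not MGcount's implementation: the input layout (a dictionary of per-round overlaps for each read), the function name `assign_hierarchically` and the feature names are all hypothetical.

```python
# Toy model of the three assignment rounds; an illustration, not MGcount's code.
ROUNDS = ("small", "long_exon", "long_intron")


def assign_hierarchically(read_overlaps):
    """Return {read_id: (round_name, features)} for the first round with a hit.

    read_overlaps maps each read ID to {round_name: [feature, ...]}, listing
    the features overlapped by any alignment of that read (hypothetical layout).
    """
    assigned = {}
    for read_id, by_round in read_overlaps.items():
        for round_name in ROUNDS:
            features = sorted(set(by_round.get(round_name, [])))
            if features:
                # The read is consumed here and skipped in later rounds.
                assigned[read_id] = (round_name, features)
                break
    return assigned


reads = {
    "read1": {"small": ["snoRNA_X"], "long_intron": ["GENE_A"]},
    "read2": {"long_exon": ["GENE_B"], "long_intron": ["GENE_B"]},
    "read3": {"long_intron": ["GENE_C"]},
    "read4": {},  # no overlap with any annotated feature
}
assignments = assign_hierarchically(reads)
for read_id, (round_name, features) in assignments.items():
    print(read_id, round_name, features)
```

In this toy input, the first read overlaps both a small RNA and the intron of a long RNA, and the small RNA wins because its round comes first.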
2. Multi-loci communities recognition
To quantify multi-mapping reads, MGcount builds a directed weighted graph G = (V, E) where each vertex (V) is an annotated feature and a pair of directional edges (E) connects two features for which common multi-mapping reads exist. Edge weights are defined as the number of multi-mapping reads shared between the two vertices, normalized by the total number of reads assigned to the source vertex. Vertex weights are defined as the log-transformed number of assigned alignments (Fig c). The resulting graph structures capture the multiple-loci topologies of different RNA biotypes. Two graphs are built separately for small-RNA and long-RNA features, using the full pool of input alignments. Subsequently, highly related features are grouped together by minimizing the map equation with the communities detection approach described by Rosvall et al., 2008.
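The edge and vertex weight definitions can be made concrete with a short sketch. It assumes a hypothetical input (a dictionary mapping each read to the set of features it was assigned to) and uses the natural logarithm for the vertex weights, since the text does not state the base; the community detection step itself is not shown.

```python
import math
from collections import Counter
from itertools import permutations


def build_graph(read_to_features):
    """Compute vertex and directed edge weights from read-to-feature assignments."""
    reads_per_feature = Counter()
    shared = Counter()
    for features in read_to_features.values():
        for feature in features:
            reads_per_feature[feature] += 1
        # Every ordered pair of features hit by the same read shares that read.
        for source, target in permutations(sorted(features), 2):
            shared[(source, target)] += 1
    # Edge weight: shared reads normalized by the reads of the source vertex.
    edges = {(s, t): n / reads_per_feature[s] for (s, t), n in shared.items()}
    # Vertex weight: log-transformed read count (natural log is an assumption).
    vertices = {f: math.log(n) for f, n in reads_per_feature.items()}
    return vertices, edges


# Feature A receives 4 reads, 2 of which also map to feature B.
toy_reads = {"r1": {"A"}, "r2": {"A"}, "r3": {"A", "B"}, "r4": {"A", "B"}}
vertices, edges = build_graph(toy_reads)
print(edges)
```

The two edges of a pair generally differ: here half of the reads of A are shared with B, while all reads of B are shared with A.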
3. Expression matrix generation
MGcount generates an output expression matrix for each hierarchical assignation round. These are appended together in a single output matrix. For each read, each alignment first gets a 1/N count, where N is the number of multi-mappers or residual multi-overlaps that survive the hierarchical assignment. Next, counts for annotated features that have been aggregated together in a community by the map equation are summed up. In this way, the systematic ambiguity in multi-mapping reads is collapsed into a single MG community, while the remaining signal is reported as fractional counts over distinct features.
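The 1/N counting and the community aggregation can be sketched as follows. This is an illustration under assumptions rather than MGcount's code: each read is represented by the list of features hit by its surviving alignments, and `feature_to_community` is a hypothetical lookup from a feature to its community name.

```python
from collections import defaultdict


def count_reads(read_to_features, feature_to_community):
    """Return fractional counts per output feature for one sample."""
    counts = defaultdict(float)
    for features in read_to_features.values():
        if not features:
            continue
        weight = 1.0 / len(features)  # each of the N alignments gets 1/N
        for feature in features:
            # Features in a community are summed under the community name.
            counts[feature_to_community.get(feature, feature)] += weight
    return dict(counts)


toy_reads = {
    "r1": ["U4_copy1", "U4_copy2"],  # ambiguity fully inside one community
    "r2": ["U4_copy1", "GENE_X"],    # ambiguity between unrelated features
    "r3": ["GENE_X"],                # uniquely assigned read
}
communities = {"U4_copy1": "U4_community", "U4_copy2": "U4_community"}
counts = count_reads(toy_reads, communities)
print(counts)
```

The first read contributes a full count to the community because both of its alignments collapse into it, whereas the second read stays split between two output features.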
Usage description
Inputs
Three inputs are required to run the program:
- Input alignment file: a .txt file listing the paths to the input .bam alignment files, one per line.
- Annotations file: a .gtf file containing a set of RNA feature annotations
- Output directory path: a string specifying the path to the directory where MGcount outputs will be stored
(Please use full paths if you experience any problems.)
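The alignment list file can also be generated programmatically. The sketch below is one way to do it with the Python standard library; the function name `write_bam_list` and the file names in the demo are hypothetical, and absolute paths are written because full paths are the safer choice.

```python
import tempfile
from pathlib import Path


def write_bam_list(bam_dir, list_path):
    """Write the absolute path of every .bam file in bam_dir, one per line."""
    bam_paths = sorted(p.resolve() for p in Path(bam_dir).glob("*.bam"))
    Path(list_path).write_text("".join(f"{p}\n" for p in bam_paths))
    return bam_paths


# Demo on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_1.bam", "sample_2.bam", "notes.txt"):
        (Path(tmp) / name).touch()
    list_file = Path(tmp) / "input_bamfilenames.txt"
    write_bam_list(tmp, list_file)
    listed = list_file.read_text().splitlines()

print(listed)
```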
Configurable arguments
Configurable arguments list
Optional arguments to configure a MGcount run include:
Argument | Description | Default value |
---|---|---|
--paired_flag (-p) | Paired-end flag. If not set, the assignation occurs in single-end mode. | False |
--strand_option (-s) | Library strandedness. Options available are 0: unstranded, 1: forward-stranded and 2: reverse-stranded | 1 |
--featureCounts_path | Path to the featureCounts executable file | /usr/bin/featureCounts |
--btyperounds_filename | Optional .csv file with biotype to assignation round associations. It should be a two-column table where column names are, in order, “biotype” and “counting_round”. | |
--feature_small | GTF feature type entry for small RNA reads assignation | transcript |
--feature_output_small | GTF field name for which to summarize counts of small RNA assigned reads | transcript_name |
--feature_biotype_small | GTF field name defining biotype for small RNA features | transcript_biotype |
--ml_flag_small | Flag enabling multi-loci graph-based community detection for small RNA features | 1 |
--min_overlap_small | Minimal feature-alignment overlapping fraction for assigning a read to a small RNA feature | 1 |
--feature_output_long | GTF field name for which to summarize counts of long RNA assigned reads | gene_name |
--feature_biotype_long | GTF field name defining biotype for long RNA features | gene_biotype |
--min_overlap_long | Minimal feature-alignment overlapping fraction for assigning a read to a long RNA feature | 1 |
--ml_flag_long | Flag enabling multi-loci graph-based community detection for long RNA features | 1 |
--th_low | Low minimal threshold of feature-to-feature multi-mapping fraction. | 0.01 |
--th_high | High minimal threshold of feature-to-feature multi-mapping fraction. | 0.75 |
--subs | Optional sub-sampling number of alignments to build the multi-mapping graph. If 0, include all. | 0 |
--n_cores (-T) | Number of cores for parallelization by sample | 1 |
--sample_id | SampleID input file names | None |
--seed | Optional fixed seed for random number generation during communities detection | |
Configurable arguments details
Sequencing data type
Two arguments need to be set according to the input data type. For RNA-seq data to be interpreted correctly during assignation, the integer argument --strand_option needs to be set according to the strandedness of the library preparation method used (0: unstranded, 1: forward-stranded, 2: reverse-stranded). If dealing with paired-end reads, --paired_flag should be added to the command-line call.
Multi-core mode
MGcount may process samples in parallel in all three steps of the workflow (hierarchical assignation, multi-mapping graph generation and count matrix building). The number of CPUs to be used by MGcount can be defined with the --n_cores option.
Assignation rounds configuration
Round | feature | feature_output | feature_biotype | min_overlap | ml_flag |
---|---|---|---|---|---|
small | transcript | transcript_name | transcript_biotype | 1 | True |
long_exon | exon | gene_name | gene_biotype | 1 | True |
long_intron | gene | gene_name | gene_biotype | 1 | True |
At each round of the hierarchical assignation, MGcount extracts the set of annotations in the .gtf with entry type --feature whose --feature_biotype attribute is included in the list of biotypes associated with the round (as defined by the .csv file --btyperounds_filename). Subsequently, alignments are assigned to this restricted set of annotations whenever a minimum read fraction (as defined by --min_overlap) overlaps with one of its annotated features.
If the featureCounts software is not accessible on the system path (/usr/bin/), the full path to the software should be set through --featureCounts_path.
The association of different biotypes with either the “long” or the “small” assignation rounds can be customized in a .csv file and passed to MGcount with the --btyperounds_filename argument. The .csv file must be a two-column table with names biotype and assignation_round.
By default, MGcount uses a .csv file embedded in the program (or, alternatively, installed with the Python module), which is located in the /mgcount/data sub-folder of the GitHub repository. This table links the set of biotypes encountered in the 4 integrated .gtf files provided (Arabidopsis, Human, Mouse and Nematode) to the corresponding pre-defined small and long rounds.
To run MGcount on other species or with a different annotation set, please make sure the biotypes you want to include in the quantification are correctly listed in this table so that MGcount can recognize them.
At each round of the hierarchical assignation, alignment-feature assignation pairs are determined with featureCounts restricted to the designated subset of the .gtf annotated features. Each round can be configured by the user through the five arguments listed in the table at the start of this section (feature, feature_output, feature_biotype, min_overlap and ml_flag).
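Before running MGcount on a custom annotation, it can help to check that every biotype in the .gtf appears in the biotype-to-round table. The sketch below is a hypothetical helper, not part of MGcount: it reads the attribute given by `attribute` from the ninth GTF column and compares the values against the “biotype” column of the table. The two annotation lines and the table content are made up for illustration.

```python
import csv
import io
import re


def gtf_biotypes(gtf_lines, attribute="gene_biotype"):
    """Collect the values of one attribute from the 9th column of GTF lines."""
    pattern = re.compile(rf'\b{re.escape(attribute)} "([^"]+)"')
    found = set()
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9:
            continue  # skip comments and malformed lines
        match = pattern.search(fields[8])
        if match:
            found.add(match.group(1))
    return found


def table_biotypes(csv_file):
    """Read the 'biotype' column of a biotype-to-round table."""
    return {row["biotype"] for row in csv.DictReader(csv_file)}


# Made-up annotation lines and table, for illustration only.
gtf = [
    'chr1\tsrc\tgene\t1\t900\t.\t+\t.\tgene_id "g1"; gene_biotype "protein_coding";\n',
    'chr1\tsrc\tgene\t950\t999\t.\t+\t.\tgene_id "g2"; gene_biotype "my_new_biotype";\n',
]
table = io.StringIO("biotype,counting_round\nprotein_coding,long\nmiRNA,small\n")

missing = gtf_biotypes(gtf) - table_biotypes(table)
print("Biotypes missing from the table:", sorted(missing))
```

With real files, the list and the StringIO object would be replaced by open file handles for the .gtf and the .csv.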
Communities detection
To speed up computation, the graph can be built from a fixed number of randomly sub-sampled alignments per sample, set through --subs.
The --seed argument may be set to guarantee reproducible results across runs with the same input arguments. The seed is used to initialize the generation of random numbers during the communities detection approach. MGcount ignores weak edges during the map equation optimization based on a high threshold --th_high (by default 0.75) and a low threshold --th_low (by default 0.01) to prevent over-fitting (splitting of large densely connected communities and merging of small loosely connected communities). Thresholds are employed as follows according to the type of graph:
Long-RNA graph: All edges whose weights are below the high threshold are ignored for the long-RNA graph. This avoids collapsing together certain features sharing only partial similarity. Given the long body length, multi-mappers may occur in only a specific part of the locus. In these situations, the threshold determines how large the proportion of shared reads between two features must be for them to be considered for a community. Lower threshold values will tend to aggregate less related features into communities, while higher thresholds will force features to remain separate, splitting the multi-mapping reads into fractional counts.
Small-RNA graph: The use of the high threshold or the low threshold for weak edge filtering depends on the edge weight distribution in the small-RNA graph. Here, for each biotype (microRNA, piRNA, snRNA, …), the threshold that is closer to the first quantile of the graph weights is employed. In this way, for biotypes where repeated loci are identical or nearly identical (e.g. microRNA), only weights above the high threshold may be considered for communities, while for biotypes with large groups of similar loci (snRNA, YRNA, pseudogenes, …), all weights may be considered.
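The threshold logic can be sketched in a few lines. This is a rough illustration under assumptions, not MGcount's implementation: the “first quantile” is interpreted here as the first quartile (25th percentile), computed with the default method of `statistics.quantiles`, and the function names and edge weights are hypothetical.

```python
from statistics import quantiles


def pick_threshold(weights, th_low=0.01, th_high=0.75):
    """Choose the threshold closer to the first quartile of the edge weights."""
    first_quartile = quantiles(weights, n=4)[0]  # assumption: 25th percentile
    if abs(first_quartile - th_high) <= abs(first_quartile - th_low):
        return th_high
    return th_low


def filter_edges(edges, threshold):
    """Drop edges whose weight is below the threshold."""
    return {pair: w for pair, w in edges.items() if w >= threshold}


# Nearly identical loci give high weights; loosely similar loci give low ones.
mirna_like_weights = [0.8, 0.85, 0.9, 0.95, 1.0]
snrna_like_weights = [0.02, 0.05, 0.1, 0.3, 0.6]
print(pick_threshold(mirna_like_weights), pick_threshold(snrna_like_weights))

# Long-RNA graph: the high threshold is always applied.
long_edges = {("GENE_A", "GENE_B"): 0.9, ("GENE_B", "GENE_A"): 0.4}
print(filter_edges(long_edges, 0.75))
```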
Outputs
At the end of its execution, MGcount provides the following outputs:
- Count matrix: A matrix where each row corresponds to a feature as defined by feature_output (either single features or MG communities aggregating several features) and each column corresponds to one input BAM file.
- Features metadata: A table reporting feature names matching row names in the count matrix, the counting round of hierarchical assignation and its configuration parameters, a flag indicating whether a feature belongs to an MG community, and the feature biotype.
- Multi-mapping graph adjacency matrix: A sparse adjacency matrix for each multi-mapping graph generated (small RNA and/or long RNA), stored as a symmetric, integer, square matrix. Each matrix element stores the number of alignments that multi-map to a pair of features (defined by row and column), and the diagonal contains the total number of alignments per feature.
- Multi-mapping graph communities (MGcommunities): A table of MG communities linking each original feature in the GTF file with the resultant count matrix and metadata feature identifiers. It includes both unique features (which remain unmodified) and aggregated features (which are collapsed following MG communities). Also, the table stores the total number of alignments per feature.
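The communities table can be used to trace an aggregated feature back to its members. The sketch below assumes the column names printed for multigraph_communities_small.csv in tutorial T2 (transcript_name, naln, feature); the helper name `community_members` is hypothetical, and the three demo rows are copied from that tutorial output with only a subset of its columns.

```python
import csv
import io
from collections import defaultdict


def community_members(csv_file):
    """Group original features and sum their alignments per output feature."""
    members = defaultdict(list)
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        members[row["feature"]].append(row["transcript_name"])
        totals[row["feature"]] += float(row["naln"])
    return dict(members), dict(totals)


# Three rows from the tutorial output, restricted to four columns.
community = "RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc"
demo = io.StringIO(
    "transcript_name,naln,community_flag,feature\n"
    f"RNU4-28P,4,True,{community}\n"
    f"RNU4-27P,47,True,{community}\n"
    f"U4.7,321,True,{community}\n"
)
members, totals = community_members(demo)
print(members[community], totals[community])
```

For a real run, the StringIO object would be replaced by an open handle to the communities file in the output directory.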
Tutorials
T1 - Prepare inputs and execute
In the following tutorials, we use two sub-sampled human brain RNA-seq libraries as an example to walk through MGcount execution. First of all, let’s create a folder and download the alignment .bam files of the two samples. (The download might take a few minutes.)
mkdir mgcount_tutorial
cd mgcount_tutorial
wget https://filedn.com/lTnUWxFTA93JTyX3Hvbdn2h/mgcount/tutorial_bamfiles.zip
unzip tutorial_bamfiles.zip -d input_bamfiles
To run MGcount, we need to provide the software with a .txt file specifying the paths to the input alignment files. Here, these are the two .bam alignment files we just downloaded. We can generate this file from the command line:
printf "$PWD/%s\n" input_bamfiles/* > input_bamfilenames.txt
The other required input is a .gtf file with transcript feature annotations. The MGcount repository provides four ready-to-use .gtf files integrating annotations from several databases (see Hita et al., 2022, Appendices, Methods, Database Integration). Next, we will download them and use the human annotations file for our execution example.
wget https://filedn.com/lTnUWxFTA93JTyX3Hvbdn2h/mgcount/integrated_annotations_gtf.zip
unzip integrated_annotations_gtf.zip -d annotations_gtf
Once we have both the .gtf and the input .bam files, we can simply run MGcount as an executable command-line program or as a Python 3 module if installed via pip. A Python module is run by calling python3 with the parameter -m and the name of the module, in this case “mgcount”. After this, we need to specify the required MGcount arguments.
For this example, we pass the ready-integrated .gtf for the human genome, the .txt file we just created containing the paths to the input alignment .bam files and a string designating the directory where MGcount outputs will be generated.
To reduce the computational time, we set the multicore parameter (-T) to “2” in order to parallelize the different steps of the algorithm by sample.
Run as an executable program:
MGcount -T 2 --gtf annotations_gtf/Homo_sapiens.GRCh38.gtf --outdir outputs --bam_infiles input_bamfilenames.txt
Run as a python3 module:
python3 -m mgcount -T 2 --gtf annotations_gtf/Homo_sapiens.GRCh38.gtf --outdir outputs --bam_infiles input_bamfilenames.txt
After MGcount finishes successfully, your output directory should contain 6 new output files generated by MGcount, including the RNA count matrix, the feature metadata table and two files containing the graph structure and the communities detected for long RNA and small RNA respectively.
Alternatively, MGcount can be invoked from an R console with the function “system”:
root_dir <- '~/mgcount_tutorial/'
system(paste0('MGcount -T 2',
              ' --gtf ', root_dir, 'annotations_gtf/Homo_sapiens.GRCh38.gtf',
              ' --outdir ', root_dir, 'outputs ',
              ' --bam_infiles ', root_dir, 'input_bamfilenames.txt'))
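The same call can be launched from a Python script. The sketch below builds the argument list from the flags used in this tutorial (-T, --gtf, --outdir, --bam_infiles); the helper names `mgcount_command` and `run_mgcount` are hypothetical, and the executable is assumed to be reachable as MGcount on the PATH.

```python
import shutil
import subprocess


def mgcount_command(gtf, outdir, bam_infiles, n_cores=1, executable="MGcount"):
    """Build the argument list for the MGcount call shown in this tutorial."""
    return [
        executable,
        "-T", str(n_cores),
        "--gtf", str(gtf),
        "--outdir", str(outdir),
        "--bam_infiles", str(bam_infiles),
    ]


def run_mgcount(cmd):
    """Run the command, failing early if the executable cannot be found."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} was not found on the PATH")
    subprocess.run(cmd, check=True)


cmd = mgcount_command(
    gtf="annotations_gtf/Homo_sapiens.GRCh38.gtf",
    outdir="outputs",
    bam_infiles="input_bamfilenames.txt",
    n_cores=2,
)
print(" ".join(cmd))
```

Calling run_mgcount(cmd) from the tutorial folder then starts the quantification. Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.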
We are done!
T2 - Explore quantification outputs
In this tutorial, we will load the MGcount outputs we obtained in tutorial 1 and explore them from R. The MGcount repository contains a few supporting R scripts that provide side functionalities (manage annotations, visualize MGcount outputs…). It is possible to download only the folder containing these scripts with the following shell command:
cd mgcount_tutorial
svn export https://github.com/hitaandrea/MGcount/trunk/R
Once the scripts are downloaded, we are ready to launch R. Please, start an R session and define the tutorial root directory as a variable.
root_dir <- '~/mgcount_tutorial/'
Further, this tutorial uses the following R packages. Please, install them with install.packages() if you wish to run this tutorial on your system.
library(dplyr)
library(Hmisc)
library(ggplot2)
library(ggpubr)
library(summarytools)
library(knitr)       # provides kable(), used below
library(kableExtra)  # provides scroll_box(), used below
source(paste0(root_dir,'R/integrate_gtf_annotations.R'))
The main output of MGcount is the count_matrix.csv file containing the feature by sample expression matrix. We can import it into R as any regular .csv.
counts <- read.csv(paste0(root_dir,'outputs/count_matrix.csv'), row.names = 1)
colnames(counts) <- sub('_Aligned.genome.dedup','',colnames(counts))
Each row in the count table is a feature for which expression has been quantified and each column is associated with a sample. By querying the matrix dimensions, we see that the matrix contains two columns corresponding to the two human brain libraries. The matrix can be used as input for any RNA-seq downstream analysis.
dim(counts)
## [1] 46029 2
We can look at the total number of reads assigned from each library by summing up each of the two columns, and then compute the mean over the two libraries.
colSums(counts)
## Human_Brain_total_100ng_1_subsample Human_Brain_total_100ng_2_subsample
## 3260426 3889868
print(paste('Mean counts:',mean(colSums(counts))))
## [1] "Mean counts: 3575147.255"
Let’s now import the feature_metadata output. This table reports feature-related attributes such as the assignation round to which the annotation belongs, a flag stating whether the feature is an aggregated community of annotations or an individual feature, and the biotype associated with the feature. The feature identifiers in the count_matrix and the feature_metadata match and, therefore, this table can facilitate the extraction of particular features from the count matrix (e.g. a certain biotype, the exonic counts, the subset of features aggregated in communities, etc.).
feat_metadata <- read.csv(paste0(root_dir,'outputs/feature_metadata.csv'))
Here, for example, we use the feature_metadata table to look at the distribution of counts by assignation round.
df <- cbind('counts' = rowSums(counts), feat_metadata)
ggplot(df, aes(x = assignation_round, y = counts)) + geom_violin(fill = 'grey') +
  scale_y_continuous(trans = 'log', limits = c(10,100000), breaks = c(10,100,1000,10000,100000)) + theme_pubclean() +
  geom_point(position = position_jitter(seed = 1, width = 0.4), size = 0.005, alpha = 0.2)
Below, we display a few random rows from the table for illustration.
feat_subset <- with(feat_metadata, feat_metadata[c(
  sample(which(assignation_round == 'small' & community_flag == "True"), 2),
  sample(which(assignation_round == 'small' & community_flag == "False"), 2),
  sample(which(assignation_round == 'long_exon' & community_flag == "True"), 1),
  sample(which(assignation_round == 'long_exon' & community_flag == "False"), 1),
  sample(which(assignation_round == 'long_intron' & community_flag == "True"), 1),
  sample(which(assignation_round == 'long_intron' & community_flag == "False"), 1)),])
kable(feat_subset) %>% scroll_box(height = "300px")
row | feature | assignation_round | annotations_subset | feature_type | feature_output | feature_biotype | community_flag |
---|---|---|---|---|---|---|---|
1657 | SNORA8_AC007448.1_Z77249.1 | small | small | transcript | transcript_name | snoRNA | True |
1276 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | small | small | transcript | transcript_name | snRNA | True |
2130 | hsa-miR-323b-3p | small | small | transcript | transcript_name | miRNA | False |
1545 | SCARNA10 | small | small | transcript | transcript_name | snoRNA | False |
6308 | ACTR2-exon | long_exon | long | exon | gene_name | protein_coding | True |
10226 | CBR1-exon | long_exon | long | exon | gene_name | protein_coding | False |
33009 | CNIH4-intron | long_intron | long | gene | gene_name | protein_coding | True |
42590 | SLC15A2-intron | long_intron | long | gene | gene_name | protein_coding | False |
Next, we show how to generate a barplot showing the read distribution by biotype group. First, let’s load a table to group biotypes by category. We will use this to group less abundant biotypes into larger groups for visualization purposes.
bcats <- define_bcats()
feat_df <- merge(feat_metadata, bcats, by.x = 'feature_biotype', by.y = 'biotype', all.x = TRUE, all.y = FALSE)
## Add exon/intron distinction to biogroup based on counting round
feat_df$biogroup <- as.character(feat_df$biogroup)
feat_df$biogroup[feat_df$biogroup == 'Protein_coding'] <- 'Protein_coding_exon'
feat_df$biogroup[feat_df$biogroup == 'Long_non_coding'] <- 'Long_non_coding_exon'
feat_df$biogroup[feat_df$assignation_round == 'long_intron'] <-
  sub('exon','intron',feat_df$biogroup[feat_df$assignation_round == 'long_intron'])
feat_df$biogroup <- as.factor(feat_df$biogroup)
## Reorder the metadata rows to match the rows of the count matrix
feat_df <- feat_df[match(rownames(counts), feat_df$feature),]
We then combine feature_metadata table with the count_matrix again and sum up the counts to get the total number of reads by biotype. We do this separately by biotype groups and small-non-coding individual biotypes to further represent the small non-coding spectrum.
## Generate biotype matrix
df <- data.frame(counts %>% group_by(feat_df$biogroup) %>% summarise_all(sum),
                 check.names = FALSE); names(df)[1] <- 'biotype'
biotype <- reshape(df, idvar = 'biotype', varying = list(names(df)[-1]),
                   v.names = 'counts', times = names(df)[-1],
                   timevar = 'sn', direction = 'long')
biotype$biotype <- as.character(biotype$biotype)
bgroups <- c("Hybrid","Long_pseudogenes","Protein_coding_intron","Long_non_coding_intron",
             "Protein_coding_exon","Long_non_coding_exon" ,"Short_non_coding","tRNA","rRNA")
biotype$biotype[biotype$biotype %nin% bgroups] <- 'Hybrid'
biotype$biotype <- factor(biotype$biotype, levels = bgroups)
## Generate non-coding biotype table
df <- data.frame(counts[feat_df$biocat == 'sNC',] %>%
                 group_by(feat_df$feature_biotype[feat_df$biocat == 'sNC']) %>% summarise_all(sum),
                 check.names = FALSE); names(df)[1] <- 'biotype'
biotype_snc <- reshape(df, idvar = 'biotype', varying = list(names(df)[-1]),
                       v.names = 'counts', times = names(df)[-1],
                       timevar = 'sn', direction = 'long')
Once we have extracted the counts by biotype group, let’s employ ggplot to visualize the read distribution profiles as barplots.
Reads distribution by biotype:
## ---- Abundance plot by biotype group
colP <- c('violetred1','slateblue1','darkgrey','lightgrey',
          'springgreen4','springgreen3','violetred4','tan3','tan4')
names(colP) <- bgroups
p1 <- ggplot(biotype, aes(x = sn, y = counts, group = biotype, fill = biotype)) +
  geom_bar(stat = 'identity', colour = 'black', width = 0.8) +
  coord_flip() + xlab('') + ylab('Number of assigned reads') + theme_pubclean() +
  guides(fill=guide_legend(nrow=5)) + scale_fill_manual(values=colP) +
  theme(legend.position = 'top', legend.title = element_blank())
## ---- Relative abundance plot by small non-coding biotype
colP <- c('#e6be97','#848CFF','#4c2382', '#50eb76','tomato','#FDFF87','#FFAE51','darkorchid1','gold1','sienna3')
p2 <- ggplot(biotype_snc, aes(x = sn, y = counts, group = biotype, fill = biotype)) +
  geom_bar(stat = 'identity', position = 'fill', colour = 'black', width = 0.8) + theme_pubclean() +
  coord_flip() + xlab('') + ylab('Proportion of small non-coding assigned reads') +
  guides(fill=guide_legend(nrow=5)) + scale_fill_manual(values = colP) +
  theme(axis.text.y = element_blank(), legend.position = 'top', legend.title = element_blank())
## Display plots together
ggarrange(p1, p2, ncol = 2, widths = c(2,1))
Finally, we import the multigraph_communities tables. These tables link each original feature in the .gtf with the resultant feature matching the count matrix and feature metadata identifiers. It includes both unique features (that remain unmodified) and aggregated features (that are collapsed following MG communities). Thus, we can track back the features grouped in each aggregated feature.
Here we will use the feature_metadata subset from before to explore the original features in the .gtf forming each new MG community feature.
csmall <- read.csv(paste0(root_dir,'outputs/multigraph_communities_small.csv'))
clong <- read.csv(paste0(root_dir,'outputs/multigraph_communities_long_exon.csv'))
kable(subset(csmall, feature %in% feat_subset$feature)) %>% scroll_box(height = "300px")
row | transcript_name | transcript_biotype | naln | naln_community | community_flag | community_id | community_name | community_biotype | feature |
---|---|---|---|---|---|---|---|---|---|
38 | RNU4-28P | snRNA | 4 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
79 | RNU4-27P | snRNA | 47 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
122 | RNU4-88P | snRNA | 10 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
150 | RNU4-59P | snRNA | 179 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
161 | RNU4-75P | snRNA | 16 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
216 | U4.4 | snRNA | 6 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
232 | U4.7 | snRNA | 321 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
239 | RNU4-42P | snRNA | 7 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
334 | RNU4-21P | snRNA | 40 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
340 | RNU4-77P | snRNA | 2 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
366 | RNU4-73P | snRNA | 236 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
404 | RNU4-63P | snRNA | 8 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
411 | RNU4-49P | snRNA | 110 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
426 | RNU4-51P | snRNA | 7 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
472 | RNU4-8P | snRNA | 100 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
473 | RNU4-84P | snRNA | 285 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
492 | RNU4-48P | snRNA | 20 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
596 | U4.5 | snRNA | 262 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
623 | RNU4-85P | snRNA | 224 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
644 | RNU4-56P | snRNA | 34 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
649 | RNU4-78P | snRNA | 12 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
729 | RNU4-62P | snRNA | 92 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
779 | RNU4-38P | snRNA | 11 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
787 | RNU4-4P | snRNA | 117 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
789 | RNU4-91P | snRNA | 240 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
832 | RNU4-89P | snRNA | 14 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
939 | RNU4-33P | snRNA | 22 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
975 | RNU4-79P | snRNA | 25 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
979 | RNU4-87P | snRNA | 22 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
995 | RNU4-64P | snRNA | 19 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1078 | RNU4-11P | snRNA | 227 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1133 | RNU4-14P | snRNA | 216 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1169 | U4.3 | snRNA | 68 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1254 | RNU4-66P | snRNA | 17 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1264 | RNU4-12P | snRNA | 318 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1277 | RNU4-70P | snRNA | 55 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1304 | RNU4-35P | snRNA | 6 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1305 | RNU4-76P | snRNA | 78 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1313 | RNU4-18P | snRNA | 14 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1332 | RNU4-7P | snRNA | 427 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1496 | RNU4-74P | snRNA | 15 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1519 | RNU4-31P | snRNA | 27 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1533 | RNU4-6P | snRNA | 237 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1567 | RNU4-52P | snRNA | 281 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1584 | RNU4-81P | snRNA | 7 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1644 | Z77249.1 | snoRNA | 15 | 944 | True | snoRNA-cl-129 | SNORA8_AC007448.1_Z77249.1 | snoRNA | SNORA8_AC007448.1_Z77249.1 |
1647 | RNU4-44P | snRNA | 227 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1685 | RNU4-71P | snRNA | 7 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1723 | RNU4-50P | snRNA | 4 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1766 | RNU4-83P | snRNA | 3 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1785 | U4.6 | snRNA | 69 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1790 | RNU4-25P | snRNA | 186 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1814 | RNU4-26P | snRNA | 4 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1833 | RNU4-53P | snRNA | 195 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1869 | RNU4-15P | snRNA | 5 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
1907 | RNU4-82P | snRNA | 1405 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2013 | RNU4-39P | snRNA | 160 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2050 | SNORA8 | snoRNA | 912 | 944 | True | snoRNA-cl-129 | SNORA8_AC007448.1_Z77249.1 | snoRNA | SNORA8_AC007448.1_Z77249.1 |
2060 | RNU4-55P | snRNA | 4 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2086 | RNU4-23P | snRNA | 126 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2095 | RNU4-86P | snRNA | 2 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2100 | U4.2 | snRNA | 24 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2213 | RNU4-5P | snRNA | 8 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2230 | SCARNA10 | snoRNA | 625 | 625 | False | | | | SCARNA10 |
2251 | RNU4-67P | snRNA | 84 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2255 | RNU4-54P | snRNA | 104 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2297 | RNU4-20P | snRNA | 5 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2306 | RNU4-65P | snRNA | 17 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2329 | RNU4-24P | snRNA | 2 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2338 | RNU4-41P | snRNA | 4 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2354 | RNU4-32P | snRNA | 15 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2372 | RNU4-2 | snRNA | 16486 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2373 | RNU4-1 | snRNA | 3245 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2393 | RNU4-9P | snRNA | 37 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2453 | RNU4-10P | snRNA | 344 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2563 | RNU4-92P | snRNA | 26 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2627 | RNU4-68P | snRNA | 112 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2776 | RNU4-80P | snRNA | 34 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2862 | RNU4-46P | snRNA | 147 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2897 | RNU4-58P | snRNA | 338 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2909 | RNU4-30P | snRNA | 157 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
2911 | RNU4-36P | snRNA | 245 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
3068 | RNU4-13P | snRNA | 424 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
3088 | AC007448.1 | snoRNA | 17 | 944 | True | snoRNA-cl-129 | SNORA8_AC007448.1_Z77249.1 | snoRNA | SNORA8_AC007448.1_Z77249.1 |
3176 | RNU4-17P | snRNA | 71 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
3243 | RNU4-40P | snRNA | 7 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
3332 | RNU4-60P | snRNA | 25 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
3458 | RNU4-45P | snRNA | 20 | 28863 | True | snRNA-cl-7 | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc | snRNA | RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc |
5420 | hsa-miR-323b-3p | miRNA | 6 | 6 | False | | | | hsa-miR-323b-3p |
kable(subset(clong, feature %in% sub('-exon','',feat_subset$feature))) %>% scroll_box(height = "300px")
| gene_name | gene_biotype | naln | naln_community | community_flag | community_id | community_name | community_biotype | feature |
---|---|---|---|---|---|---|---|---|---|
3877 | ACTR2 | protein_coding | 708 | 748 | True | long-cl-3296 | ACTR2 | protein_coding | ACTR2 |
33247 | AP000357.1 | processed_pseudogene | 40 | 748 | True | long-cl-3296 | ACTR2 | protein_coding | ACTR2 |
33974 | CBR1 | protein_coding | 325 | 325 | False | | | | CBR1 |
kable(subset(clong, feature %in% sub('-intron','',feat_subset$feature))) %>% scroll_box(height = "300px")
| gene_name | gene_biotype | naln | naln_community | community_flag | community_id | community_name | community_biotype | feature |
---|---|---|---|---|---|---|---|---|---|
2919 | CNIH4 | protein_coding | 98 | 99 | True | long-cl-2498 | CNIH4 | protein_coding | CNIH4 |
6688 | SLC15A2 | protein_coding | 39 | 39 | False | | | | SLC15A2 |
12085 | AL590002.1 | processed_pseudogene | 1 | 99 | True | long-cl-2498 | CNIH4 | protein_coding | CNIH4 |
We can also use the multigraph_communities tables to extract statistics from the MGcount run. Here we look at the proportion of aggregated features (community_flag variable) by biotype (transcript_biotype for small RNA, gene_biotype for long RNA). The output shows that small RNA biotypes tend to be aggregated into communities more often, because duplicated loci are more frequent among them.
kable(with(csmall, ctable(transcript_biotype, community_flag)))
kable(with(clong, ctable(gene_biotype, community_flag)))
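As a minimal sketch of the same kind of summary in base R, the proportion of features aggregated into communities per biotype can be computed with table and prop.table. The toy data frame below is invented for illustration and only mimics the transcript_biotype and community_flag columns of the communities table; it is not MGcount output.

```r
# Toy stand-in for a multigraph_communities table (values are invented).
toy <- data.frame(
  transcript_biotype = c("miRNA", "miRNA", "snRNA", "snRNA", "snRNA", "snoRNA"),
  community_flag     = c("False", "True", "True", "True", "True", "False")
)

# Cross-tabulate biotype against the community flag.
counts <- table(toy$transcript_biotype, toy$community_flag)

# Row-wise proportions: the fraction of features of each biotype
# that were (True) or were not (False) aggregated into a community.
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

Using margin = 1 normalises each biotype separately, so biotypes with very different numbers of features remain comparable.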
We are done!
T3- Explore output multi-mapping graph
Below, we use a few functions defined in “mg_visualize.R” to graphically explore the multi-mapping graph topologies.
The function ‘mg_build’ takes the multi-mapping graph adjacency matrix and the MG communities and creates an igraph object that can be explored with the igraph library. The other two functions (mg_plotset and mg_interactive) generate a few default plots from a multi-mapping graph igraph object.
library(Matrix)
library(igraph)
library(plotly)
source(paste0(root_dir,'R/mg_visualize.R'))
As an example, we will load the small-RNA graph adjacency matrix and we will subset it to explore the sub-graph associated to microRNA features.
The adjacency matrix is stored as a sparse symmetric matrix in MatrixMarket format, which can be imported into R with the readMM function from the ‘Matrix’ R package. We also import the table containing all annotated features and the MG communities output.
dir.create(paste0(root_dir,'mgplots'))
## Import ml data and adjacency matrix for small assignation round
inputM <- readMM(paste0(root_dir,'outputs/multigraph_matrix_small.mtx'))
multiloci_table <- read.csv(paste0(root_dir,'outputs/multigraph_communities_small.csv'))
Next, we subset the matrix by selecting the features in the ‘miRNA’ category. Each row in the communities table corresponds to a feature annotation with non-zero assignments in the small round. The row index of each feature defines its row and column position in the multi-mapping graph adjacency matrix.
## Extract microRNA matrix subset
btype <- 'miRNA'
idx <- which(multiloci_table$transcript_biotype == btype)
inM <- inputM[idx,idx]; ml <- multiloci_table[idx,]
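The relationship between table rows and matrix positions can be illustrated with a small sketch. Everything below is a toy: the feature names and weights are invented, and the matrix is written to a temporary file only to mimic reading a MatrixMarket file with readMM. The point is that one index vector selects both the table rows and the matching rows and columns of the adjacency matrix.

```r
library(Matrix)

# Toy communities table: row i describes the feature at row/column i of the matrix.
toy_table <- data.frame(
  transcript_name    = c("miR-a", "snoR-b", "miR-c", "miR-d"),
  transcript_biotype = c("miRNA", "snoRNA", "miRNA", "miRNA")
)

# Toy symmetric adjacency matrix (invented weights); the off-diagonal
# entries link feature 1 and feature 3.
toy_adj <- sparseMatrix(
  i = c(1, 3, 1, 2),
  j = c(3, 1, 1, 2),
  x = c(5, 5, 10, 7),
  dims = c(4, 4)
)

# Round trip through MatrixMarket format.
mtx_file <- tempfile(fileext = ".mtx")
writeMM(toy_adj, mtx_file)
toy_in <- readMM(mtx_file)

# The same index vector subsets the table rows and both matrix dimensions.
idx <- which(toy_table$transcript_biotype == "miRNA")
sub_adj   <- toy_in[idx, idx]
sub_table <- toy_table[idx, ]
```

After subsetting, row i of sub_table again describes row and column i of sub_adj, which is the alignment that the graph-building step relies on.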
The steps needed to convert the adjacency matrix and the communities table into a graph are wrapped in the mg_build function, which returns an igraph object.
## Generate microRNA graph object
attr <- 'transcript_name'
g <- mg_build(inM, ml, attr)
g <- delete_vertices(g, V(g)[V(g)$weight < 1])
We can explore the patterns in the feature gene symbols established by the HGNC and link them to specific colors during visualization. Here we force annotations associated with a mature 3-prime microRNA to be colored in orange and annotations associated with a mature 5-prime microRNA to be colored in blue. For this, we look for “-3p” and “-5p” patterns in the transcript_name annotation symbol, which MGcount uses as the default feature_output for small RNA.
## Extract microRNA mature extreme information from HUGO symbol
V(g)$color_HUGO1 <- 'grey'
V(g)$color_HUGO1[grep('-3p',V(g)$feat)] <- 'orange'
V(g)$color_HUGO1[grep('-5p',V(g)$feat)] <- 'dodgerblue'
## Extract microRNA class from HUGO symbol
V(g)$color_HUGO2 <- 'grey'
idx <- grep('hsa-miR',as.character(V(g)$feat))
hugo <- as.factor(gsub('.*?([0-9]+).*', '\\1', V(g)$feat[idx]))
V(g)$color_HUGO2[idx] <- sample(grDevices::colors()[grep('gr(a|e)y', grDevices::colors(), invert = TRUE)],
                                length(unique(hugo)))[hugo]
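To make the symbol parsing above easier to follow, here is a small standalone sketch of the same idea on a few hypothetical feature names. It uses a fixed two-color palette instead of random sampling and passes perl = TRUE so that the lazy quantifier is interpreted as a PCRE pattern; both choices are assumptions made for this illustration.

```r
# Hypothetical feature names following the miRBase naming pattern.
feat <- c("hsa-miR-323b-3p", "hsa-miR-21-5p", "hsa-miR-21-3p", "hsa-let-7a-5p")

# Keep only names containing 'hsa-miR', as in the tutorial code.
idx <- grep("hsa-miR", feat)

# Capture the first run of digits in each name (the microRNA number).
family <- gsub(".*?([0-9]+).*", "\\1", feat[idx], perl = TRUE)
hugo <- as.factor(family)

# One palette entry per factor level; indexing by the factor uses its
# integer codes, so features sharing a number receive the same color.
palette <- c("tomato", "steelblue")
vertex_color <- palette[hugo]
print(data.frame(feature = feat[idx], family = family, color = vertex_color))
```

Features that differ only in the -3p/-5p suffix end up with the same color, which is what makes this coloring complementary to the orange/blue scheme.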
The function mg_plotset generates a set of different plots of the multi-mapping graph and stores them as .png files with a user-given file path and prefix (plotfile argument).
We next use the magick R package to load a few of the generated .png files into R. Each vertex is an annotated feature with size proportional to its number of aligned reads. Each edge connects two features with shared multi-mappers, with thickness proportional to the fraction of shared multi-mappers over the total alignments. Shaded grey areas delineate MG communities. Vertices are colored according to the attribute “customColor”, which we just defined to be orange, blue and grey for “-3p”, “-5p” and the absence of “-3p”/“-5p” patterns, respectively. We may modify this as desired.
## Define plot color
V(g)$customColor <- V(g)$color_HUGO1
## Standard plots
mg_plotset(g, plotfile= paste0(root_dir,'mgplots/mg_miRNA'))
Raw visualization of the graph
## Display plot
par(mar=c(0.01,0.01,0.01,0.01))
img <- magick::image_read(paste0(root_dir,'mgplots/mg_miRNA.png'))
plot(img)
Visualization of the graph colored by -3p/-5p patterns in microRNA transcript symbol
## Display plot
par(mar=c(0.01,0.01,0.01,0.01))
img <- magick::image_read(paste0(root_dir,'mgplots/mg_miRNA_color.png'))
plot(img)
Visualization of the graph colored by -3p/-5p patterns in microRNA transcript symbol and detected communities
## Display plot
par(mar=c(0.01,0.01,0.01,0.01))
img <- magick::image_read(paste0(root_dir,'mgplots/mg_miRNA_cl_color.png'))
plot(img)
Interactive visualization of the graph
To explore large graphs with interactive visualization tools such as zoom, we may use mg_interactive which creates a plotly object from the graph. Here the different communities are represented by colors.
mg_interactive(g, paste0(root_dir,'mgplots/mg_miRNA'))
We are done!
Integrating annotation sources
MGcount was conceived to analyse heterogeneous datasets capturing diverse non-coding transcript types, but the scope of features it quantifies is bounded by the features annotated in the reference .gtf file used as input. Although MGcount can be executed with any .gtf annotation file (Ensembl, Gencode, etc.), we provide the option to integrate annotations from several databases so that quantification draws on a more complete and/or up-to-date annotation set, especially for small regulatory RNAs (piRNA, tRF, microRNA, siRNA), which are not necessarily annotated in general-purpose databases.
In tutorial 1, we used a ready-integrated .gtf for the human genome. In addition, the integrated_annotations_gtf directory downloaded in tutorial 1 provides integrated annotation .gtf files for Arabidopsis, mouse and nematode that can be used to run MGcount on the corresponding species. The following databases have been integrated for each species:
- Arabidopsis thaliana: Ensembl, miRBase (microRNA) and RNAcentral (siRNA);
- Homo sapiens: Ensembl, DASHR (piRNA and tRNA fragments [tRF]) and miRBase (microRNA);
- Mus musculus: Ensembl, miRBase (microRNA) and RNAcentral (piRNA);
- Caenorhabditis elegans: Ensembl and miRBase (microRNA).
To run MGcount on any other genome, we encourage following the same procedure we used to generate these four .gtf files. The script we used to generate them is provided in the R folder of the MGcount GitHub repository (it can be downloaded individually with the command in tutorial 2). We hope the script can serve as a template for custom .gtf integration in other species.
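For orientation only, the core of such an integration can be sketched in base R as stacking annotation tables that share the nine standard GTF columns and writing them back as a tab-separated file. This is not the script from the repository: the source labels, identifiers and coordinates below are invented, and a real integration would also need to reconcile chromosome naming, attribute fields and duplicated entries across databases.

```r
# The nine standard tab-separated GTF fields.
gtf_cols <- c("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attribute")

# Two toy annotation sets (invented sources, coordinates and identifiers).
general <- data.frame(
  seqname = "1", source = "general_db", feature = "exon",
  start = 1000L, end = 2000L, score = ".", strand = "+", frame = ".",
  attribute = 'gene_id "G1"; gene_biotype "protein_coding";'
)
small <- data.frame(
  seqname = "1", source = "small_rna_db", feature = "exon",
  start = 500L, end = 522L, score = ".", strand = "+", frame = ".",
  attribute = 'gene_id "S1"; gene_biotype "miRNA";'
)

# Stack both sets and sort by chromosome and start coordinate.
merged <- rbind(general, small)[, gtf_cols]
merged <- merged[order(merged$seqname, merged$start), ]

# Write as GTF: tab-separated, no header, no quoting of fields.
out_file <- tempfile(fileext = ".gtf")
write.table(merged, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```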