MGcount user guide

Getting started

What is MGcount

MGcount is an RNA-seq quantification tool conceived to address ambiguous read alignments in a flexible way that is compatible with any biotype. This makes it possible to extract more information from total RNA-seq datasets by simultaneously quantifying coding and non-coding transcripts, both small and long, where dealing with heterogeneous read-to-feature alignment ambiguities is key. When aligning reads to a reference genome, we distinguish between two types of ambiguous alignments:

Multi-mappers: reads aligning to multiple genomic locations.
Multi-overlaps: reads aligning to a genomic location with multiple annotated features.
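The two definitions above can be illustrated with a minimal Python sketch. The read identifiers, locations and feature names are made up for illustration, and the function is not part of MGcount; it only labels toy alignments according to the two ambiguity types.

```python
# Toy alignments: (read_id, genomic_location, features annotated at that location).
# All identifiers are hypothetical.
alignments = [
    ("read1", "chr1:100", ["miR-A"]),
    ("read2", "chr1:500", ["snoRNA-X", "geneY"]),  # one location, two features
    ("read3", "chr2:300", ["U4-1"]),
    ("read3", "chr7:900", ["U4-2"]),               # same read, second location
]

def classify_reads(alignments):
    """Return {read_id: set of ambiguity labels} for a list of alignments."""
    locations = {}
    labels = {}
    for read_id, location, features in alignments:
        locations.setdefault(read_id, set()).add(location)
        labels.setdefault(read_id, set())
        if len(features) > 1:
            # The aligned location carries more than one annotated feature.
            labels[read_id].add("multi-overlap")
    for read_id, locs in locations.items():
        if len(locs) > 1:
            # The read aligns to more than one genomic location.
            labels[read_id].add("multi-mapper")
    return labels

labels = classify_reads(alignments)
```

A read can carry both labels at once, which is why the function returns a set per read.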

To deal with the most frequent multi-overlaps, MGcount hierarchically assigns reads to small RNA, long RNA exons and long RNA introns, accounting for their length disparity. Subsequently, MGcount models read-to-feature alignments in a graph. This is exploited to detect and report expression of sequence-similar annotated loci, where reads systematically multi-map, as integrated features called communities.

Citation

Hita, A., Brocart, G., Fernandez, A., Rehmsmeier, M., Alemany, A. and Schvartzman, S. (2022). MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts. BMC Bioinformatics 23, 39. https://doi.org/10.1186/s12859-021-04544-3

System requirements

MGcount is built on top of featureCounts, a well-known and computationally efficient program (Liao et al., 2014). Please download it from the following link: http://subread.sourceforge.net/

Installation

MGcount is written in Python and is executed from the command line. You can either download the executable version (single binary file) or install it as a Python3 module.

Install and run as a Python3 module

You can install the package as a Python module:

pip3 install git+https://github.com/hitaandrea/MGcount.git

Once the package is installed, run the tool as a Python installed module:

python3 -m mgcount [args]

Download and run the executable program

Alternatively you can download the latest release as a single executable file here.

Save the program file to your Linux system and set the permissions to allow executing the file as a program:

chmod +x MGcount

Once the file is executable, run the tool by calling the file from the command line with the desired arguments.

MGcount [args]

Workflow

MGcount workflow schema

During its execution, MGcount performs the following 3 steps:

1. Hierarchical assignation

To address the most frequent multi-overlapping ambiguous situations, reads are assigned to annotated genomic features in three pre-defined sequential rounds based on transcript body length: small RNA, long RNA exon and long RNA intron. MGcount prioritizes small RNA when a read aligns to a small RNA locus that is embedded within a long RNA. In the second round, alignments get assigned to long RNA exon features. In the final round, reads that haven’t been assigned to any small RNA or long RNA exon are assigned to introns if aligned within the full gene body length of a long RNA. Hence, all reads with at least one alignment overlapping an annotated feature in the current round are assigned to that feature and skipped in subsequent rounds.
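The round logic can be sketched as follows. This is a minimal illustration under the assumption that the overlaps of each read with each round's feature set are already known; MGcount obtains them with featureCounts, and the read and feature names here are hypothetical.

```python
# Rounds in priority order, as described above.
ROUNDS = ("small", "long_exon", "long_intron")

def assign_hierarchically(read_hits):
    """read_hits maps read_id -> {round_name: [feature, ...]}.

    Returns read_id -> (round_name, features) for the first round in which
    the read overlaps at least one feature. Reads with no hit are left out.
    """
    assigned = {}
    for read_id, hits in read_hits.items():
        for round_name in ROUNDS:
            features = hits.get(round_name, [])
            if features:
                assigned[read_id] = (round_name, list(features))
                break  # the read is skipped in subsequent rounds
    return assigned

# Hypothetical overlaps for four reads.
read_hits = {
    "read1": {"small": ["miR-A"], "long_intron": ["geneY"]},  # small RNA wins
    "read2": {"long_exon": ["geneZ"], "long_intron": ["geneZ"]},
    "read3": {"long_intron": ["geneY"]},
    "read4": {},                                               # unassigned
}
assigned = assign_hierarchically(read_hits)
```

In this toy input, read1 lies in a small RNA embedded in the intron of geneY and is therefore counted for the small RNA only.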

2. Multi-loci communities recognition

To quantify multi-mapping reads, MGcount builds a directed weighted graph G = (V, E) where each vertex (V) is an annotated feature and a pair of directional edges (E) connects two features for which common multi-mapping reads exist. Edge weights are defined as the number of multi-mapping reads shared between the two vertices, normalized by the total number of reads assigned to the source vertex. Vertex weights are defined as the log-transformed number of assigned alignments (Fig c). The resulting graph structures capture the multiple-loci topologies of different RNA biotypes. Two graphs are built separately for small-RNA and long-RNA features, using the full pool of input alignments. Subsequently, highly related features are grouped together by minimizing the map equation with the community detection approach described by Rosvall et al., 2008.
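The weight definitions can be made concrete with a small sketch. It assumes one alignment per read and feature, uses the natural logarithm for vertex weights (the base is an assumption), and does not perform the map equation optimization itself; feature names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import permutations

def build_graph(read_to_features):
    """read_to_features maps read_id -> set of features the read aligns to.

    Returns (vertices, edges): vertices maps feature -> log(total alignments),
    edges maps (source, target) -> shared reads / total reads of the source.
    """
    totals = defaultdict(int)  # alignments assigned to each feature
    shared = defaultdict(int)  # reads shared by an ordered feature pair
    for features in read_to_features.values():
        for feature in features:
            totals[feature] += 1
        for source, target in permutations(sorted(features), 2):
            shared[(source, target)] += 1
    edges = {pair: n / totals[pair[0]] for pair, n in shared.items()}
    vertices = {feature: math.log(n) for feature, n in totals.items()}
    return vertices, edges

# Two reads multi-map to A and B; the others map uniquely.
toy_reads = {
    "r1": {"A", "B"}, "r2": {"A", "B"},
    "r3": {"A"}, "r4": {"A"}, "r5": {"B"},
}
vertices, edges = build_graph(toy_reads)
```

Because the normalization uses the source vertex, the two edges of a pair generally differ: here A to B weighs 2/4 while B to A weighs 2/3.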

3. Expression matrix generation

MGcount generates an output expression matrix for each hierarchical assignation round. These are appended together in a single output matrix. For each read, each alignment first gets a 1/N count, where N is the number of multi-mappers or residual multi-overlaps that survive the hierarchical assignment. Next, counts for annotated features that have been aggregated together in a community by the map equation are summed up. In this way, the systematic ambiguity in multi-mapping reads gets collapsed into a single MG community while the remaining signal is reported as fractional counts over distinct features.
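A minimal sketch of the counting rule, assuming the surviving alignments of each read and the community membership of each feature are already known. The feature and community names are hypothetical, and N is taken simply as the number of surviving alignments of the read.

```python
from collections import defaultdict

def fractional_counts(read_to_features, community_of):
    """Give each alignment 1/N, then sum counts within communities.

    read_to_features maps read_id -> list of features the read is assigned to.
    community_of maps feature -> community name; features absent from the
    mapping are reported individually.
    """
    counts = defaultdict(float)
    for features in read_to_features.values():
        n = len(features)
        for feature in features:
            counts[community_of.get(feature, feature)] += 1.0 / n
    return dict(counts)

toy_reads = {
    "r1": ["U4-1", "U4-2"],   # both loci belong to the same community
    "r2": ["geneA"],
    "r3": ["geneA", "geneB"],  # not in a community: stays fractional
}
community_of = {"U4-1": "U4-1_U4-2", "U4-2": "U4-1_U4-2"}
counts = fractional_counts(toy_reads, community_of)
```

Read r1 contributes 0.5 to each U4 locus, which sums back to one full count for the community, whereas r3 remains split between geneA and geneB.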

Usage description

Inputs

Three inputs are required to run the program:

Input alignment file: a .txt file listing the paths to the .bam alignment input files, one per line.
Annotations file: a .gtf file containing a set of RNA feature annotations
Output directory path: a string specifying the path to the directory where MGcount outputs will be stored

(Please use full paths if you experience any problems.)
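The input alignment list can also be generated programmatically. The sketch below is not part of MGcount; the helper name and the file names are hypothetical, and it simply writes the absolute path of every .bam file in a directory, one per line.

```python
import tempfile
from pathlib import Path

def write_bam_list(bam_dir, list_path):
    """Write the absolute paths of all .bam files in bam_dir, one per line."""
    bam_paths = sorted(Path(bam_dir).resolve().glob("*.bam"))
    Path(list_path).write_text("".join(f"{path}\n" for path in bam_paths))
    return bam_paths

# Demonstration on a temporary directory holding empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.bam", "b.bam", "a.bam.bai"):
        (Path(tmp) / name).touch()
    list_file = Path(tmp) / "input_bamfilenames.txt"
    write_bam_list(tmp, list_file)
    listed = list_file.read_text().splitlines()
```

Index files such as a.bam.bai are not matched by the "*.bam" pattern, so only the alignment files end up in the list.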

Configurable arguments

Configurable arguments list

Optional arguments to configure a MGcount run include:

Argument	Description	Default value
--paired_flag (-p)	Paired-end flag. If not set, the assignation occurs in single-end mode.	False
--strand_option (-s)	Library strandedness. Options available are 0: unstranded, 1: forward-stranded and 2: reverse-stranded	1
--featureCounts_path	Path to the featureCounts executable file	/usr/bin/featureCounts
--btyperounds_filename	Optional .csv file with biotype to assignation round associations. It should be a two-column table where column names are, in order, “biotype” and “counting_round”.
--feature_small	GTF feature type entry for small RNA reads assignation	transcript
--feature_output_small	GTF field name for which to summarize counts of small RNA assigned reads	transcript_name
--feature_biotype_small	GTF field name defining biotype for small RNA features	transcript_biotype
--ml_flag_small	Multi-loci graph detection based groups flag for small RNA features	1
--min_overlap_small	Minimal feature-alignment overlapping fraction for assigning a read to a small RNA feature	1
--feature_output_long	GTF field name for which to summarize counts of long RNA assigned reads	gene_name
--feature_biotype_long	GTF field name defining biotype for long RNA features	gene_biotype
--min_overlap_long	Minimal feature-alignment overlapping fraction for assigning a read to a long RNA feature	1
--ml_flag_long	Multi-loci graph detection based groups flag for long RNA features	1
--th_low	Low minimal threshold of feature-to-feature multi-mapping fraction.	0.01
--th_high	High minimal threshold of feature-to-feature multi-mapping fraction.	0.75
--subs	Optional sub-sampling number of alignments to build the multi-mapping graph. If 0, include all.	0
--n_cores (-T)	Number of cores for parallelization by sample	1
--sample_id	Sample IDs for the input files	None
--seed	Optional fixed seed for random number generation during communities detection

Configurable arguments details

Sequencing data type

Two arguments need to be set according to the input data type. For a correct interpretation of RNA-seq data during assignation, the integer argument --strand_option needs to be set according to the strandedness of the library preparation method used (0: unstranded, 1: forward-stranded, 2: reverse-stranded). If dealing with paired-end reads, --paired_flag should be added to the command line call.

Multi-core mode

MGcount may process samples in parallel in all three steps of the workflow (hierarchical assignation, multi-mapping graph generation and count matrix building). The number of CPUs to be used by MGcount can be defined with the --n_cores option.

Assignation rounds configuration

Round	feature	feature_output	feature_biotype	min_overlap	ml_flag
small	transcript	transcript_name	transcript_biotype	1	True
long_exon	exon	gene_name	gene_biotype	1	True
long_intron	gene	gene_name	gene_biotype	1	True

At each round of the hierarchical assignation, MGcount extracts the set of annotations in the .gtf with entry type --feature whose --feature_biotype attribute is included in the list of biotypes associated with the round (as defined by the .csv file --btyperounds_filename). Subsequently, alignments are assigned to the restricted set of annotations whenever a minimum read fraction (as defined by --min_overlap) overlaps with an annotated feature of the extracted annotations subset.
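To make the minimum read fraction concrete, here is a small sketch of the overlap computation. It assumes half-open [start, end) coordinates and a non-empty read, and it only illustrates the idea; the actual overlap test is carried out by featureCounts.

```python
def overlap_fraction(read_start, read_end, feat_start, feat_end):
    """Fraction of the read [read_start, read_end) covered by the feature."""
    overlap = max(0, min(read_end, feat_end) - max(read_start, feat_start))
    return overlap / (read_end - read_start)

def passes_min_overlap(read, feature, min_overlap=1.0):
    """True if the read-feature overlap reaches the required fraction."""
    return overlap_fraction(*read, *feature) >= min_overlap

inside = overlap_fraction(100, 150, 90, 200)    # read fully within the feature
partial = overlap_fraction(100, 150, 125, 200)  # half of the read overlaps
```

With the default value of 1, only reads lying entirely within the feature are assigned; lowering the value admits partially overlapping reads.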

If the featureCounts executable is not located in the default path (/usr/bin/), the full path to the software should be set through --featureCounts_path.

The association of different biotypes to either “long” or “small” assignation rounds can be customized in a .csv file and passed to MGcount with the --btyperounds_filename argument. The .csv file must be a two-column table with the column names biotype and assignation_round.

By default, MGcount uses a .csv file embedded in the program (or, alternatively, installed with the Python module) that is located in the /mgcount/data sub-folder of the GitHub repository. This table links the set of biotypes encountered in the 4 integrated .gtf files provided (Arabidopsis, Human, Mouse and Nematode) to the corresponding pre-defined small and long rounds.

To run MGcount on other species or with a different annotation set, please make sure the biotypes you want to include in the quantification are correctly listed in this table so that MGcount recognizes them.
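One way to check this before a run is to list the biotypes present in the annotation file and compare them with the table. The sketch below is a hypothetical helper, not an MGcount function; it assumes standard 9-column GTF lines with key "value"; attribute pairs and uses toy lines instead of a real file.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def biotypes_in_gtf(lines, feature_type, biotype_field):
    """Collect the values of biotype_field over entries of feature_type."""
    found = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        if biotype_field in attrs:
            found.add(attrs[biotype_field])
    return found

def gtf_line(feature_type, attributes):
    """Build a toy 9-column GTF line (coordinates are made up)."""
    return "\t".join(["chr1", "toy", feature_type, "1", "100", ".", "+", ".", attributes])

toy_gtf = [
    "# toy annotation",
    gtf_line("transcript", 'transcript_name "miR-A"; transcript_biotype "miRNA";'),
    gtf_line("transcript", 'transcript_name "SNORD-X"; transcript_biotype "snoRNA";'),
    gtf_line("gene", 'gene_name "geneY"; gene_biotype "protein_coding";'),
]

known_biotypes = {"miRNA", "protein_coding"}  # e.g. read from the .csv table
found = biotypes_in_gtf(toy_gtf, "transcript", "transcript_biotype")
missing = found - known_biotypes
```

Any biotype reported in missing would need a row in the biotype table before MGcount can assign it to a round.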

At each round of the hierarchical assignation, alignment-feature assignation pairs are determined with featureCounts restricted to the designated subset of the .gtf annotated features. Each round can be configured by the user through the five arguments shown as columns in the table at the start of this section: feature, feature_output, feature_biotype, min_overlap and ml_flag.

Communities detection

To reduce computation time, a fixed number of randomly sub-sampled alignments per sample can be set to build the graph through --subs.

The --seed argument may be set to guarantee identical results across runs with the same input arguments. The seed is used to initialize the generation of random numbers during communities detection. MGcount ignores weak edges during the map equation optimization based on a high threshold --th_high (by default 0.75) and a low threshold --th_low (by default 0.01) to prevent over-fitting (splitting of large densely connected communities and merging of small loosely connected communities). Thresholds are employed as follows according to the type of graph:

Long-RNA graph: All edges whose weights are below the high threshold are ignored for the long-RNA graph. This avoids collapsing together features that share only partial similarity. Given the long body length, multi-mappers may occur in only a specific part of the locus. In these situations, the threshold determines how large the proportion of shared reads between two features must be for them to be considered for a community. Lower threshold values tend to aggregate less closely related features into communities, while higher thresholds keep features separate, with multi-mapping reads split among them as fractional counts.
Small-RNA graph: The use of the high threshold or the low threshold for weak edge filtering depends on the edge weight distribution in the small-RNA graph. Here, for each biotype (microRNA, piRNA, snRNA, …), the threshold that is closer to the graph weights’ first quantile is employed. In this way, for biotypes where repeated loci are identical or nearly identical (e.g. microRNA), only weights above the high threshold may be considered for communities, while for biotypes with large groups of similar loci (snRNA, YRNA, pseudogenes, …), all weights may be considered.
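The threshold choice for the small-RNA graph can be sketched as follows. This is an interpretation rather than MGcount's code: the "first quantile" is read here as the first quartile, the quartile is computed with Python's default (exclusive) method, and the edge weights are invented.

```python
import statistics

def pick_threshold(weights, th_low=0.01, th_high=0.75):
    """Return whichever threshold lies closer to the first quartile of weights."""
    first_quartile = statistics.quantiles(weights, n=4)[0]
    if abs(first_quartile - th_low) <= abs(first_quartile - th_high):
        return th_low
    return th_high

def filter_edges(edges, threshold):
    """Keep only edges whose weight reaches the threshold."""
    return {pair: w for pair, w in edges.items() if w >= threshold}

# Nearly identical loci: weights concentrate near 1, so the high threshold applies.
identical_like = [0.90, 0.92, 0.95, 0.98, 1.00]
# Large family of loosely similar loci: weights are spread out and mostly low.
family_like = [0.02, 0.05, 0.10, 0.30, 0.60]

th_identical = pick_threshold(identical_like)
th_family = pick_threshold(family_like)
kept = filter_edges({("a", "b"): 0.8, ("b", "a"): 0.3}, th_identical)
```

For the long-RNA graph no selection is needed, since the high threshold is always the one applied.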

Outputs

At the end of its execution, MGcount provides the following outputs:

Count matrix: A matrix where each row corresponds to a feature as defined by feature_output (either single features or MG communities aggregating several features) and each column corresponds to one input BAM file.
Features metadata: A table reporting: feature names matching row names in the count matrix, the counting round of hierarchical assignation and its configuration parameters, a flag indicating whether a feature belongs to an MG community, and the feature biotype.
Multi-mapping graph adjacency matrix: A sparse adjacency matrix for each multi-mapping graph generated (small RNA and/or long RNA), stored as a symmetric, integer, square matrix. Each matrix element stores the number of alignments that multi-map to a pair of features (defined by row and column), and the diagonal contains the total number of alignments per feature.
Multi-mapping graph communities (MGcommunities): A table of MG communities linking each original feature in the GTF file with the resultant count matrix and metadata feature identifiers. It includes both unique features (which remain unmodified) and aggregated features (which are collapsed following MG communities). Also, the table stores the total number of alignments per feature.
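As an illustration of how the communities table can be used, the sketch below groups original features by community and sums their alignments. The column names (transcript_name, naln, community_id) follow the tutorial output shown later in this guide; the three rows are copied from that output, while the exact layout of the file on disk is an assumption.

```python
import csv
import io
from collections import defaultdict

# A few rows with the column names used in the small-RNA communities table.
toy_table = """transcript_name,naln,community_id
RNU4-28P,4,snRNA-cl-7
RNU4-27P,47,snRNA-cl-7
Z77249.1,15,snoRNA-cl-129
"""

def summarize_communities(handle):
    """Return (members, alignments) keyed by community identifier."""
    members = defaultdict(list)
    alignments = defaultdict(int)
    for row in csv.DictReader(handle):
        members[row["community_id"]].append(row["transcript_name"])
        alignments[row["community_id"]] += int(row["naln"])
    return dict(members), dict(alignments)

members, alignments = summarize_communities(io.StringIO(toy_table))
```

To read a real output file, the in-memory handle would be replaced by an opened file, for example open(path, newline="").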

Tutorials

T1 - Prepare inputs and execute

In the following tutorials, we use two sub-sampled human brain RNA-seq libraries as an example to walk through MGcount execution. First of all, let’s create a folder and download the alignment .bam files of the two samples. (The download might take a few minutes.)

mkdir mgcount_tutorial
cd mgcount_tutorial

wget https://filedn.com/lTnUWxFTA93JTyX3Hvbdn2h/mgcount/tutorial_bamfiles.zip
unzip tutorial_bamfiles.zip -d input_bamfiles

To run MGcount, we need to provide the software with a .txt file specifying the paths to the input alignment files. Here, these are the two .bam alignment files we just downloaded. We can generate this file from the command line:

printf "$PWD/%s\n" input_bamfiles/* > input_bamfilenames.txt

The other required input is a .gtf file with transcript feature annotations. The MGcount repository provides four ready-to-use .gtf files integrating annotations from several databases (see Hita et al., 2022, Appendices, Methods, Database Integration). Next, we will download them and use the human annotations file for our execution example.

wget https://filedn.com/lTnUWxFTA93JTyX3Hvbdn2h/mgcount/integrated_annotations_gtf.zip
unzip integrated_annotations_gtf.zip -d annotations_gtf

Once we have both the .gtf and the input .bam files, we can simply run MGcount as an executable command-line program or as a python3 module if installed via pip. A python module is run by calling python3 with the parameter -m and the name of the module, in this case “mgcount”. After this, we need to specify the arguments MGcount requires.

For this example, we pass the ready-integrated .gtf for the human genome, the .txt file we just created containing the paths to the input alignment .bam files, and a string designating the directory where MGcount outputs will be generated.

To reduce the computational time, we set the multicore parameter (-T) to “2” in order to parallelize the different steps of the algorithm by sample.

Run as an executable program:

MGcount -T 2 --gtf annotations_gtf/Homo_sapiens.GRCh38.gtf --outdir outputs --bam_infiles input_bamfilenames.txt

Run as a python3 module:

python3 -m mgcount -T 2 --gtf annotations_gtf/Homo_sapiens.GRCh38.gtf --outdir outputs --bam_infiles input_bamfilenames.txt

After the MGcount run finishes successfully, your output directory should contain 6 new output files generated by MGcount, including the RNA count matrix, the feature metadata table and two files containing the graph structure and the communities detected for long RNA and small RNA respectively.

Alternatively, MGcount can be invoked from an R console with the function “system”:

root_dir <- '~/mgcount_tutorial/'
system(paste0('MGcount -T 2',
              ' --gtf ', root_dir, 'annotations_gtf/Homo_sapiens.GRCh38.gtf',
              ' --outdir ', root_dir, 'outputs',
              ' --bam_infiles ', root_dir, 'input_bamfilenames.txt'))
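The same call can be assembled from a Python script. The sketch below only builds and prints the command using the options shown in this tutorial; the helper name and the root directory are hypothetical, and the final step of running it is left as a comment.

```python
import shlex
from pathlib import Path

def mgcount_command(root_dir, n_cores=2, executable="MGcount"):
    """Build the MGcount call used in this tutorial as an argument list."""
    root = Path(root_dir)
    return [
        executable, "-T", str(n_cores),
        "--gtf", str(root / "annotations_gtf" / "Homo_sapiens.GRCh38.gtf"),
        "--outdir", str(root / "outputs"),
        "--bam_infiles", str(root / "input_bamfilenames.txt"),
    ]

# Hypothetical tutorial location; adjust to your own directory.
cmd = mgcount_command("/home/user/mgcount_tutorial")
print(shlex.join(cmd))

# To execute the command instead of printing it:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.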

We are done!

T2 - Explore quantification outputs

In this tutorial, we will load the MGcount outputs we obtained in tutorial 1 and explore them from R. The MGcount repository contains a few supporting R scripts providing side functionalities (managing annotations, visualizing MGcount outputs, …). It is possible to download only the folder containing these scripts with the following shell command:

cd mgcount_tutorial
svn export https://github.com/hitaandrea/MGcount/trunk/R

Once the scripts are downloaded, we are ready to launch R. Please, start an R session and define the tutorial root directory as a variable.

root_dir <- '~/mgcount_tutorial/'

Further, this tutorial uses the following R packages. Please, install them with install.packages() if you wish to run this tutorial on your system.

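library(dplyr)
library(Hmisc)
library(ggplot2)
library(ggpubr)
library(summarytools)
library(knitr)       # provides kable(), used below to display tables
library(kableExtra)  # provides scroll_box(), used below to display tables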

source(paste0(root_dir,'R/integrate_gtf_annotations.R'))

The main output of MGcount is the count_matrix.csv file containing the feature by sample expression matrix. We can import it into R as any regular .csv.

counts <- read.csv(paste0(root_dir,'outputs/count_matrix.csv'), row.names = 1)
colnames(counts) <- sub('_Aligned.genome.dedup','',colnames(counts))

Each row in the count table is a feature for which expression has been quantified and each column is associated with a sample. Querying the matrix dimensions, we see that the matrix contains two columns corresponding to the two human brain libraries. The matrix can be used as input for any RNA-seq downstream analysis.

dim(counts)

## [1] 46029     2

We can look at the total number of reads assigned from each library by summing up each of the two columns, and then compute the mean over the two libraries.

colSums(counts)

## Human_Brain_total_100ng_1_subsample Human_Brain_total_100ng_2_subsample 
##                             3260426                             3889868

print(paste('Mean counts:',mean(colSums(counts))))

## [1] "Mean counts: 3575147.255"
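For readers working outside R, the same column totals can be computed with a few lines of Python. The sketch uses a small in-memory table with made-up values and sample names; it assumes, as in the R call above, that the first column holds the feature names.

```python
import csv
import io

# Toy count matrix: first column is the feature name, then one column per sample.
toy_csv = """feature,sample_1,sample_2
miR-A,10.5,12.0
geneB-exon,100.0,80.5
geneB-intron,4.0,6.0
"""

def column_sums(handle):
    """Sum the counts of every sample column of a count matrix."""
    reader = csv.reader(handle)
    header = next(reader)
    sums = [0.0] * (len(header) - 1)
    for row in reader:
        for i, value in enumerate(row[1:]):
            sums[i] += float(value)
    return dict(zip(header[1:], sums))

totals = column_sums(io.StringIO(toy_csv))
mean_total = sum(totals.values()) / len(totals)
```

Counts are parsed as floating-point numbers because multi-mapping reads contribute fractional values.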

Let’s now import the feature_metadata output. This table reports feature-related attributes such as the assignation round to which the annotation belongs, a flag stating whether the feature is an aggregated community of annotations or an individual feature, and the biotype associated with the feature. The feature identifiers in the count_matrix and the feature_metadata match; therefore, this table can facilitate the extraction of particular features from the count matrix (e.g. a certain biotype, the exonic counts, the subset of features aggregated in communities, etc.).

feat_metadata <- read.csv(paste0(root_dir,'outputs/feature_metadata.csv'))

Here, for example, we use the feature_metadata table to look at the distribution of counts by assignation round.

df <- cbind('counts' = rowSums(counts), feat_metadata)
ggplot(df, aes(x = assignation_round, y = counts)) + geom_violin(fill = 'grey') + 
   scale_y_continuous(trans = 'log', lim = c(10,100000), breaks = c(10,100,1000,10000,100000)) + theme_pubclean() +
  geom_point(position = position_jitter(seed = 1, width = 0.4), size = 0.005, alpha = 0.2)

Below, we display a few random rows from the table for illustration.

feat_subset <- with(feat_metadata, feat_metadata[c(
   sample(which(assignation_round == 'small' & community_flag == "True"), 2),
   sample(which(assignation_round == 'small' & community_flag == "False"), 2),
   sample(which(assignation_round == 'long_exon' & community_flag == "True"), 1),
   sample(which(assignation_round == 'long_exon' & community_flag == "False"), 1),
   sample(which(assignation_round == 'long_intron' & community_flag == "True"), 1),
   sample(which(assignation_round == 'long_intron' & community_flag == "False"), 1)),])
kable(feat_subset) %>% scroll_box(height = "300px")

	feature	assignation_round	annotations_subset	feature_type	feature_output	feature_biotype	community_flag
1657	SNORA8_AC007448.1_Z77249.1	small	small	transcript	transcript_name	snoRNA	True
1276	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	small	small	transcript	transcript_name	snRNA	True
2130	hsa-miR-323b-3p	small	small	transcript	transcript_name	miRNA	False
1545	SCARNA10	small	small	transcript	transcript_name	snoRNA	False
6308	ACTR2-exon	long_exon	long	exon	gene_name	protein_coding	True
10226	CBR1-exon	long_exon	long	exon	gene_name	protein_coding	False
33009	CNIH4-intron	long_intron	long	gene	gene_name	protein_coding	True
42590	SLC15A2-intron	long_intron	long	gene	gene_name	protein_coding	False

Next, we show how to generate a barplot showing the read distribution by biotype group. First, let’s load a table to group biotypes by category. We will use this to merge less abundant biotypes into larger groups for visualization purposes.

bcats <- define_bcats()

feat_df <- merge(feat_metadata, bcats, by.x = 'feature_biotype', by.y = 'biotype', all.x = TRUE, all.y = FALSE) 

## Add exon/intron distinction to biogroup based on counting round
feat_df$biogroup <- as.character(feat_df$biogroup)
feat_df$biogroup[feat_df$biogroup == 'Protein_coding'] <- 'Protein_coding_exon'
feat_df$biogroup[feat_df$biogroup == 'Long_non_coding'] <- 'Long_non_coding_exon'  
feat_df$biogroup[feat_df$assignation_round == 'long_intron'] <-
  sub('exon','intron',feat_df$biogroup[feat_df$assignation_round == 'long_intron'])

feat_df$biogroup <- as.factor(feat_df$biogroup)
feat_df <- feat_df[match(rownames(counts), feat_df$feature),]

We then combine the feature_metadata table with the count_matrix again and sum up the counts to get the total number of reads by biotype. We do this separately for biotype groups and for individual small non-coding biotypes to further represent the small non-coding spectrum.

## Generate biotype matrix
df <- data.frame(counts %>% group_by(feat_df$biogroup) %>% summarise_all(sum), 
                 check.names = FALSE); names(df)[1] <- 'biotype'
biotype <- reshape(df, idvar = 'biotype', varying = list(names(df)[-1]),
                      v.names = 'counts', times = names(df)[-1],
                      timevar = 'sn', direction = 'long')  
biotype$biotype <- as.character(biotype$biotype)
bgroups <- c("Hybrid","Long_pseudogenes","Protein_coding_intron","Long_non_coding_intron",
             "Protein_coding_exon","Long_non_coding_exon" ,"Short_non_coding","tRNA","rRNA")
biotype$biotype[biotype$biotype %nin% bgroups] <- 'Hybrid'
biotype$biotype <- factor(biotype$biotype, levels = bgroups)

## Generate non-coding biotype table
df <- data.frame(counts[feat_df$biocat == 'sNC',] %>%
                 group_by(feat_df$feature_biotype[feat_df$biocat == 'sNC']) %>% summarise_all(sum), 
                 check.names = FALSE); names(df)[1] <- 'biotype'
biotype_snc <- reshape(df, idvar = 'biotype', varying = list(names(df)[-1]),
               v.names = 'counts', times = names(df)[-1],
               timevar = 'sn', direction = 'long')  

Once we have extracted the counts by biotype group, let’s employ ggplot to visualize the read distribution profiles as barplots.

Reads distribution by biotype:

## ---- Abundance plot by biotype group
colP <- c('violetred1','slateblue1','darkgrey','lightgrey',
          'springgreen4','springgreen3','violetred4','tan3','tan4')
names(colP) <- bgroups
p1 <- ggplot(biotype, aes(x = sn, y = counts, group = biotype, fill = biotype)) +    
  geom_bar(stat = 'identity', colour = 'black', width = 0.8) + 
  coord_flip() + xlab('') + ylab('Number of assigned reads') + theme_pubclean() + 
  guides(fill=guide_legend(nrow=5)) + scale_fill_manual(values=colP) +  
  theme(legend.position = 'top', legend.title = element_blank())

## ---- Relative abundance plot by small non-coding biotype
colP <- c('#e6be97','#848CFF','#4c2382', '#50eb76','tomato','#FDFF87','#FFAE51','darkorchid1','gold1','sienna3')
p2 <- ggplot(biotype_snc, aes(x = sn, y = counts, group = biotype, fill = biotype)) +
  geom_bar(stat = 'identity', position = 'fill', colour = 'black', width = 0.8) + theme_pubclean() +
  coord_flip() + xlab('') + ylab('Proportion of small non-coding assigned reads') + 
  guides(fill=guide_legend(nrow=5)) +   scale_fill_manual(values = colP) +
  theme(axis.text.y = element_blank(), legend.position = 'top', legend.title = element_blank())

## Display plots together
ggarrange(p1, p2, ncol = 2, widths = c(2,1))

Finally, we import the multigraph_communities tables. These tables link each original feature in the .gtf with the resultant feature matching the count matrix and feature metadata identifiers. They include both unique features (which remain unmodified) and aggregated features (which are collapsed following MG communities). Thus, we can trace back the features grouped in each aggregated feature.

Here we will use the feature_metadata subset from before to explore the original features in the .gtf forming each new MG community feature.

csmall <- read.csv(paste0(root_dir,'outputs/multigraph_communities_small.csv'))
clong <- read.csv(paste0(root_dir,'outputs/multigraph_communities_long_exon.csv'))

kable(subset(csmall, feature %in% feat_subset$feature)) %>% scroll_box(height = "300px")

	transcript_name	transcript_biotype	naln	naln_community	community_flag	community_id	community_name	community_biotype	feature
38	RNU4-28P	snRNA	4	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
79	RNU4-27P	snRNA	47	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
122	RNU4-88P	snRNA	10	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
150	RNU4-59P	snRNA	179	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
161	RNU4-75P	snRNA	16	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
216	U4.4	snRNA	6	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
232	U4.7	snRNA	321	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
239	RNU4-42P	snRNA	7	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
334	RNU4-21P	snRNA	40	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
340	RNU4-77P	snRNA	2	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
366	RNU4-73P	snRNA	236	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
404	RNU4-63P	snRNA	8	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
411	RNU4-49P	snRNA	110	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
426	RNU4-51P	snRNA	7	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
472	RNU4-8P	snRNA	100	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
473	RNU4-84P	snRNA	285	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
492	RNU4-48P	snRNA	20	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
596	U4.5	snRNA	262	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
623	RNU4-85P	snRNA	224	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
644	RNU4-56P	snRNA	34	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
649	RNU4-78P	snRNA	12	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
729	RNU4-62P	snRNA	92	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
779	RNU4-38P	snRNA	11	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
787	RNU4-4P	snRNA	117	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
789	RNU4-91P	snRNA	240	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
832	RNU4-89P	snRNA	14	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
939	RNU4-33P	snRNA	22	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
975	RNU4-79P	snRNA	25	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
979	RNU4-87P	snRNA	22	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
995	RNU4-64P	snRNA	19	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1078	RNU4-11P	snRNA	227	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1133	RNU4-14P	snRNA	216	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1169	U4.3	snRNA	68	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1254	RNU4-66P	snRNA	17	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1264	RNU4-12P	snRNA	318	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1277	RNU4-70P	snRNA	55	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1304	RNU4-35P	snRNA	6	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1305	RNU4-76P	snRNA	78	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1313	RNU4-18P	snRNA	14	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1332	RNU4-7P	snRNA	427	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1496	RNU4-74P	snRNA	15	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1519	RNU4-31P	snRNA	27	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1533	RNU4-6P	snRNA	237	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1567	RNU4-52P	snRNA	281	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1584	RNU4-81P	snRNA	7	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1644	Z77249.1	snoRNA	15	944	True	snoRNA-cl-129	SNORA8_AC007448.1_Z77249.1	snoRNA	SNORA8_AC007448.1_Z77249.1
1647	RNU4-44P	snRNA	227	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1685	RNU4-71P	snRNA	7	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1723	RNU4-50P	snRNA	4	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1766	RNU4-83P	snRNA	3	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1785	U4.6	snRNA	69	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1790	RNU4-25P	snRNA	186	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1814	RNU4-26P	snRNA	4	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1833	RNU4-53P	snRNA	195	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1869	RNU4-15P	snRNA	5	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
1907	RNU4-82P	snRNA	1405	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2013	RNU4-39P	snRNA	160	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2050	SNORA8	snoRNA	912	944	True	snoRNA-cl-129	SNORA8_AC007448.1_Z77249.1	snoRNA	SNORA8_AC007448.1_Z77249.1
2060	RNU4-55P	snRNA	4	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2086	RNU4-23P	snRNA	126	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2095	RNU4-86P	snRNA	2	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2100	U4.2	snRNA	24	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2213	RNU4-5P	snRNA	8	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2230	SCARNA10	snoRNA	625	625	False				SCARNA10
2251	RNU4-67P	snRNA	84	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2255	RNU4-54P	snRNA	104	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2297	RNU4-20P	snRNA	5	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2306	RNU4-65P	snRNA	17	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2329	RNU4-24P	snRNA	2	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2338	RNU4-41P	snRNA	4	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2354	RNU4-32P	snRNA	15	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2372	RNU4-2	snRNA	16486	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2373	RNU4-1	snRNA	3245	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2393	RNU4-9P	snRNA	37	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2453	RNU4-10P	snRNA	344	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2563	RNU4-92P	snRNA	26	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2627	RNU4-68P	snRNA	112	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2776	RNU4-80P	snRNA	34	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2862	RNU4-46P	snRNA	147	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2897	RNU4-58P	snRNA	338	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2909	RNU4-30P	snRNA	157	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
2911	RNU4-36P	snRNA	245	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
3068	RNU4-13P	snRNA	424	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
3088	AC007448.1	snoRNA	17	944	True	snoRNA-cl-129	SNORA8_AC007448.1_Z77249.1	snoRNA	SNORA8_AC007448.1_Z77249.1
3176	RNU4-17P	snRNA	71	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
3243	RNU4-40P	snRNA	7	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
3332	RNU4-60P	snRNA	25	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
3458	RNU4-45P	snRNA	20	28863	True	snRNA-cl-7	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc	snRNA	RNU4-2_RNU4-1_RNU4-82P_RNU4-7P_RNU4-13P_RNU4-10P_RNU4-58P_U4.7_etc
5420	hsa-miR-323b-3p	miRNA	6	6	False				hsa-miR-323b-3p

kable(subset(clong, feature %in% sub('-exon','',feat_subset$feature))) %>% scroll_box(height = "300px")

	gene_name	gene_biotype	naln	naln_community	community_flag	community_id	community_name	community_biotype	feature
3877	ACTR2	protein_coding	708	748	True	long-cl-3296	ACTR2	protein_coding	ACTR2
33247	AP000357.1	processed_pseudogene	40	748	True	long-cl-3296	ACTR2	protein_coding	ACTR2
33974	CBR1	protein_coding	325	325	False				CBR1

kable(subset(clong, feature %in% sub('-intron','',feat_subset$feature))) %>% scroll_box(height = "300px")

	gene_name	gene_biotype	naln	naln_community	community_flag	community_id	community_name	community_biotype	feature
2919	CNIH4	protein_coding	98	99	True	long-cl-2498	CNIH4	protein_coding	CNIH4
6688	SLC15A2	protein_coding	39	39	False				SLC15A2
12085	AL590002.1	processed_pseudogene	1	99	True	long-cl-2498	CNIH4	protein_coding	CNIH4

We can also use the multigraph_communities tables to extract summary statistics from the MGcount run. Here we look at the proportion of aggregated features (community_flag variable) by biotype (transcript_biotype for small RNA, gene_biotype for long RNA). The output shows that small RNA biotypes tend to be aggregated into communities more often, because duplicated loci are more frequent among them.

kable(with(csmall, ctable(transcript_biotype, community_flag)))

	False	True	Total
miRNA	484	69	553
misc_RNA	251	1288	1539
Mt_rRNA	2	0	2
Mt_tRNA	22	0	22
piRNA	822	706	1528
ribozyme	2	0	2
rRNA	6	30	36
rRNA_pseudogene	104	189	293
scaRNA	15	13	28
snoRNA	180	436	616
snRNA	260	700	960
tRNA	43	179	222
Total	2191	3610	5801

	False	True	Total
miRNA	0.8752260	0.1247740	1
misc_RNA	0.1630929	0.8369071	1
Mt_rRNA	1.0000000	0.0000000	1
Mt_tRNA	1.0000000	0.0000000	1
piRNA	0.5379581	0.4620419	1
ribozyme	1.0000000	0.0000000	1
rRNA	0.1666667	0.8333333	1
rRNA_pseudogene	0.3549488	0.6450512	1
scaRNA	0.5357143	0.4642857	1
snoRNA	0.2922078	0.7077922	1
snRNA	0.2708333	0.7291667	1
tRNA	0.1936937	0.8063063	1
Total	0.3776935	0.6223065	1

kable(with(clong, ctable(gene_biotype, community_flag)))

	False	True	Total
Hybrid	9	9	18
IG_C_gene	4	2	6
IG_C_pseudogene	0	1	1
IG_J_pseudogene	0	1	1
IG_V_gene	2	6	8
IG_V_pseudogene	4	4	8
lncRNA	6256	1509	7765
polymorphic_pseudogene	11	7	18
processed_pseudogene	2354	3939	6293
protein_coding	12945	4903	17848
TR_C_gene	5	0	5
TR_V_gene	8	6	14
TR_V_pseudogene	3	0	3
transcribed_processed_pseudogene	173	208	381
transcribed_unitary_pseudogene	75	18	93
transcribed_unprocessed_pseudogene	350	321	671
translated_processed_pseudogene	0	2	2
translated_unprocessed_pseudogene	0	1	1
unitary_pseudogene	21	8	29
unprocessed_pseudogene	438	560	998
Total	22658	11505	34163

	False	True	Total
Hybrid	0.5000000	0.5000000	1
IG_C_gene	0.6666667	0.3333333	1
IG_C_pseudogene	0.0000000	1.0000000	1
IG_J_pseudogene	0.0000000	1.0000000	1
IG_V_gene	0.2500000	0.7500000	1
IG_V_pseudogene	0.5000000	0.5000000	1
lncRNA	0.8056665	0.1943335	1
polymorphic_pseudogene	0.6111111	0.3888889	1
processed_pseudogene	0.3740664	0.6259336	1
protein_coding	0.7252913	0.2747087	1
TR_C_gene	1.0000000	0.0000000	1
TR_V_gene	0.5714286	0.4285714	1
TR_V_pseudogene	1.0000000	0.0000000	1
transcribed_processed_pseudogene	0.4540682	0.5459318	1
transcribed_unitary_pseudogene	0.8064516	0.1935484	1
transcribed_unprocessed_pseudogene	0.5216095	0.4783905	1
translated_processed_pseudogene	0.0000000	1.0000000	1
translated_unprocessed_pseudogene	0.0000000	1.0000000	1
unitary_pseudogene	0.7241379	0.2758621	1
unprocessed_pseudogene	0.4388778	0.5611222	1
Total	0.6632322	0.3367678	1
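Independently of the ctable helper used above, the same kind of summary can be sketched with base R alone, using table, addmargins and prop.table. The toy data frame below is made up for illustration and does not come from an MGcount run; only the column names mirror those of the communities table.

```r
# Toy stand-in for a multigraph_communities table (values are invented)
toy <- data.frame(
  transcript_biotype = c("miRNA", "miRNA", "miRNA", "snRNA", "snRNA", "tRNA"),
  community_flag     = c("False", "False", "True", "True", "True", "True"),
  stringsAsFactors   = FALSE
)

# Counts of features per biotype, split by community membership
counts <- with(toy, table(transcript_biotype, community_flag))

# Same counts with row and column totals appended (margins are named "Sum")
counts_tot <- addmargins(counts)

# Row-wise proportions: fraction of each biotype inside/outside a community
props <- prop.table(counts, margin = 1)

print(counts_tot)
print(round(props, 3))
```

Using margin = 1 normalizes each biotype row to sum to one, which is the layout of the proportion tables shown above.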

We are done!

T3- Explore output multi-mapping graph

Below, we use a few functions defined in “mg_visualize.R” to graphically explore the multi-mapping graph topologies.

The function ‘mg_build’ takes the multi-mapping graph adjacency matrix and the MG communities and creates an igraph object that can be explored with the igraph library. The next two functions (mg_plotset and mg_interactive) generate a few default plots from a multi-mapping graph igraph object.

library(Matrix)
library(igraph)
library(plotly)
source(paste0(root_dir,'R/mg_visualize.R'))

As an example, we load the small-RNA graph adjacency matrix and subset it to explore the sub-graph associated with microRNA features.

The adjacency matrix is stored as a sparse symmetric matrix in MatrixMarket format, which can be imported into R with the readMM function from the ‘Matrix’ R package. We also import the table containing all annotated features and the MG communities output.

dir.create(paste0(root_dir,'mgplots'))

## Import ml data and adjacency matrix for small assignation round
inputM <- readMM(paste0(root_dir,'outputs/multigraph_matrix_small.mtx'))
multiloci_table <- read.csv(paste0(root_dir,'outputs/multigraph_communities_small.csv'))
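To get a feel for the MatrixMarket format before loading real output, here is a minimal round-trip sketch. It assumes only the Matrix package; the matrix values and the temporary file are made up for illustration and are unrelated to the MGcount output files.

```r
library(Matrix)

# Toy symmetric adjacency matrix (values are invented for illustration)
adj <- sparseMatrix(i = c(1, 2, 1, 3), j = c(2, 1, 3, 1),
                    x = c(5, 5, 2, 2), dims = c(3, 3))

# Write it to a temporary MatrixMarket file and read it back
mtx_file <- tempfile(fileext = ".mtx")
writeMM(adj, mtx_file)
adj_back <- readMM(mtx_file)

# Compare dimensions and values after the round trip
print(dim(adj_back))
print(all(as.matrix(adj_back) == as.matrix(adj)))
```

readMM returns a sparse matrix in triplet form; it can be indexed with the usual [i, j] syntax or converted with as.matrix when it is small enough.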

Next, we subset the matrix by selecting the features in the ‘miRNA’ category. Each row in the communities table corresponds to a feature annotation with non-zero assignments in the small round. The row index of each feature defines its column and row position in the multi-mapping graph adjacency matrix.

## Extract microRNA matrix subset
btype <- 'miRNA'
idx <- which(multiloci_table$transcript_biotype == btype)
inM <- inputM[idx,idx]; ml <- multiloci_table[idx,]
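The correspondence between table rows and matrix positions can be checked on a toy case. The sketch below assumes only the Matrix package; the feature names and matrix values are hypothetical and merely imitate the layout described above.

```r
library(Matrix)

# Toy feature table: row k describes row/column k of the adjacency matrix
feats <- data.frame(
  transcript_name    = c("miR-a-3p", "miR-a-5p", "snoRNA-x", "miR-b-3p"),
  transcript_biotype = c("miRNA", "miRNA", "snoRNA", "miRNA"),
  stringsAsFactors   = FALSE
)

# Toy symmetric sparse adjacency matrix (upper triangle given, values invented)
adj <- sparseMatrix(i = c(1, 1, 2, 3), j = c(2, 4, 4, 3),
                    x = c(5, 2, 1, 7), dims = c(4, 4), symmetric = TRUE)

# Select the same positions in the table and in both matrix dimensions
idx       <- which(feats$transcript_biotype == "miRNA")
sub_adj   <- adj[idx, idx]
sub_feats <- feats[idx, ]

print(sub_feats$transcript_name)
print(as.matrix(sub_adj))
```

Subsetting rows and columns with the same index vector keeps the sub-matrix square and aligned with the rows of the subsetted table.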

The steps needed to convert the adjacency matrix and the communities table into an igraph object are wrapped in the mg_build function.

## Generate microRNA graph object
attr <- 'transcript_name'
g <- mg_build(inM, ml, attr)
## Drop vertices whose weight is below 1
g <- delete_vertices(g, V(g)[V(g)$weight < 1])
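The internals of mg_build are not shown in this guide. As a rough illustration of the general idea only, the sketch below builds a weighted undirected igraph object from a small symmetric matrix with standard igraph calls and then applies the same vertex filter. Treating the matrix diagonal as the vertex weight is an assumption made for this toy example, not a description of what mg_build does.

```r
library(igraph)

# Toy symmetric adjacency matrix with hypothetical feature names
feat <- c("miR-a-3p", "miR-a-5p", "miR-c", "miR-b-3p")
adj  <- matrix(0, 4, 4, dimnames = list(feat, feat))
adj[1, 2] <- adj[2, 1] <- 5
adj[1, 4] <- adj[4, 1] <- 2
diag(adj) <- c(10, 8, 0.5, 3)   # assumed per-feature weights (illustration only)

# Off-diagonal entries become weighted edges; the diagonal is ignored here
g_toy <- graph_from_adjacency_matrix(adj, mode = "undirected",
                                     weighted = TRUE, diag = FALSE)

# Attach the assumed vertex weights, then drop vertices with weight below 1
V(g_toy)$weight <- diag(adj)
g_toy <- delete_vertices(g_toy, V(g_toy)[V(g_toy)$weight < 1])

print(V(g_toy)$name)
print(E(g_toy)$weight)
```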

We can look for patterns in the feature gene symbols established by the HGNC and link them to specific colors in the visualization. Here we color annotations associated with a mature 3-prime (3p) microRNA in orange and annotations associated with a mature 5-prime (5p) microRNA in blue. To do so, we look for the “-3p” and “-5p” patterns in the transcript_name annotation symbol, which MGcount uses as the default feature_output for small RNA.

## Extract microRNA mature end (3p/5p) information from HUGO symbol
V(g)$color_HUGO1 <- 'grey'
V(g)$color_HUGO1[grep('-3p',V(g)$feat)] <- 'orange'
V(g)$color_HUGO1[grep('-5p',V(g)$feat)] <- 'dodgerblue'

## Extract microRNA class from HUGO symbol
## (one randomly sampled non-grey color per class)
V(g)$color_HUGO2 <- 'grey'
idx <- grep('hsa-miR',as.character(V(g)$feat))
hugo <- as.factor(gsub('.*?([0-9]+).*', '\\1', V(g)$feat[idx]))
V(g)$color_HUGO2[idx] <- sample(grDevices::colors()[grep('gr(a|e)y', grDevices::colors(), invert = TRUE)], 
                               length(unique(hugo)))[hugo]
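The class-to-color mapping above relies on two base R behaviours: pulling the first run of digits out of each symbol, and indexing a color vector with a factor, which uses the factor's integer codes. The sketch below shows both on made-up symbols. It uses regmatches with regexpr as a stand-in for the gsub call and a fixed two-color palette instead of random sampling, so the result is reproducible; the names feat_toy, family and pal are hypothetical.

```r
# Hypothetical feature symbols; the last one lacks the 'hsa-miR' prefix
feat_toy <- c("hsa-miR-323b-3p", "hsa-miR-323a-5p", "hsa-miR-21-5p", "hsa-let-7a-5p")

# Keep only symbols containing 'hsa-miR'
idx <- grep("hsa-miR", feat_toy)

# First run of digits in each selected symbol, used as the microRNA class
family <- regmatches(feat_toy[idx], regexpr("[0-9]+", feat_toy[idx]))
fam    <- factor(family)          # levels are sorted as strings: "21", "323"

# One color per class; indexing by a factor uses its integer codes
pal  <- c("tomato", "steelblue")
cols <- rep("grey", length(feat_toy))
cols[idx] <- pal[fam]

print(data.frame(feat_toy, cols))
```

Symbols from the same class (here 323a and 323b) receive the same color, and anything that does not match the prefix stays grey.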

The function mg_plotset generates a set of different plots of the multi-mapping graph and stores them as .png files with a user-given file path and prefix (plotfile argument).

We next use the magick R package to load a few of the generated .png files into R. Each vertex is an annotated feature whose size is proportional to its number of aligned reads. Each edge connects two features that share multi-mappers, with thickness proportional to the fraction of shared multi-mappers over the total alignments. Shaded grey areas delineate MG communities. Vertices are colored according to the attribute “customColor”, which we set to orange, blue and grey for “-3p”, “-5p” and neither pattern, respectively. This can be changed as desired.

## Define plot color
V(g)$customColor <- V(g)$color_HUGO1

## Standard plots
mg_plotset(g, plotfile = paste0(root_dir,'mgplots/mg_miRNA'))

Raw visualization of the graph

## Display plot
par(mar=c(0.01,0.01,0.01,0.01))
img <- magick::image_read(paste0(root_dir,'mgplots/mg_miRNA.png'))
plot(img)

Visualization of the graph colored by -3p/-5p patterns in microRNA transcript symbol

## Display plot
par(mar=c(0.01,0.01,0.01,0.01))
img <- magick::image_read(paste0(root_dir,'mgplots/mg_miRNA_color.png'))
plot(img)

Visualization of the graph colored by -3p/-5p patterns in microRNA transcript symbol and detected communities

## Display plot
par(mar=c(0.01,0.01,0.01,0.01))
img <- magick::image_read(paste0(root_dir,'mgplots/mg_miRNA_cl_color.png'))
plot(img)

Interactive visualization of the graph

To explore large graphs with interactive visualization tools such as zoom, we can use mg_interactive, which creates a plotly object from the graph. Here, the different communities are represented by colors.

mg_interactive(g, paste0(root_dir,'mgplots/mg_miRNA'))

We are done!

Integrating annotation sources

MGcount is a tool conceived to analyse heterogeneous datasets capturing diverse non-coding transcript types, but the scope of features it quantifies is bounded by the features annotated in the reference .gtf file used as input. Although MGcount can be executed with any .gtf annotation file (Ensembl, Gencode, etc.), we provide the option to integrate annotations from several databases so that quantification takes into account a more complete and/or up-to-date annotation set, especially for small regulatory RNAs (piRNA, tRF, microRNA, siRNA). These transcripts are not necessarily annotated in general databases.

In tutorial 1, we used a ready-made integrated .gtf for the human genome. In addition, the directory integrated_annotations_gtf that we downloaded in tutorial 1 provides integrated annotation .gtf files for Arabidopsis, mouse and nematode that can be used to run MGcount on the corresponding species. The following databases have been integrated for each species:

Arabidopsis thaliana: Ensembl, miRBase (microRNA) and RNAcentral (siRNA);
Homo sapiens: Ensembl, DASHR (piRNA and tRNA fragments [tRF]) and miRBase (microRNA);
Mus musculus: Ensembl, miRBase (microRNA) and RNAcentral (piRNA);
Caenorhabditis elegans: Ensembl and miRBase (microRNA).

To run MGcount on any other genome, we encourage you to follow the same procedure we used to generate these four .gtf files. The script we used to generate them is provided in the R folder of the MGcount GitHub repository (which can be downloaded individually with the command in tutorial 2). We hope the script can serve as a template for custom .gtf integration in other species.
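As a very rough idea of what combining annotation sources involves, the sketch below stacks two GTF-like tables and sorts the result by position using base R only. This is not the integration script from the repository: the records, the read_gtf and gtf_line helpers and the attribute strings are all invented for illustration, and a real integration also has to reconcile chromosome naming, attribute keys and duplicated entries.

```r
# The nine standard GTF columns
gtf_cols <- c("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attribute")

# Hypothetical helper: parse GTF text into a data frame
read_gtf <- function(txt) {
  read.delim(text = txt, header = FALSE, comment.char = "#", quote = "",
             col.names = gtf_cols, stringsAsFactors = FALSE)
}

# Hypothetical helper: build one tab-separated GTF record
gtf_line <- function(...) paste(..., sep = "\t")

# Two made-up single-record annotation sources
mirbase_txt <- gtf_line("1", "miRBase", "transcript", "1200", "1222", ".", "+", ".",
                        'transcript_name "hsa-miR-x-3p"; transcript_biotype "miRNA";')
ensembl_txt <- gtf_line("1", "ensembl", "gene", "1000", "5000", ".", "+", ".",
                        'gene_id "G1"; gene_biotype "protein_coding";')

# Stack both sources and sort by chromosome and start coordinate
merged <- rbind(read_gtf(mirbase_txt), read_gtf(ensembl_txt))
merged <- merged[order(merged$seqname, merged$start), ]

print(merged[, c("seqname", "source", "feature", "start", "end")])
```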