PASTA is a modular software pipeline for the analysis of RNA-Sequencing data. It provides an innovative algorithm for the accurate, unbiased identification of splice junctions; automatic generation of gene models from junction information, either annotation-based or de novo; and estimation of relative expression changes for individual exons and junctions or for entire isoforms.
The first component of the pipeline, called PASTA1, implements a novel splice junction detection algorithm based on patterned subsequence alignments and a detailed, species-specific model of intronic context. The method is highly sensitive and can reliably detect splice junctions even at low sequencing depths.
The program is highly configurable and easy to use. It is distributed as a command-line tool designed for inclusion in automated RNA-Seq analysis pipelines in a GNU/Linux environment.
PASTA1 is distributed as a 64-bit command-line executable for GNU/Linux. The downloadable package contains the program, documentation, and sample data. Source code is available upon request.
PASTA1 relies on the bowtie program. The following executables must therefore be in your PATH (alternatively, their locations can be specified using command-line arguments):
PASTA1 can also work with bowtie2. In this case the names of the three executables will be
Please see the PASTA1 tutorial for instructions on configuring and running PASTA1.
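Before a run, it can be useful to confirm that the aligner executables are actually reachable through PATH. The following is a minimal sketch in Python, not part of PASTA1. The function name find_missing is hypothetical, and the names in REQUIRED are an assumption: they are the standard executables shipped with bowtie, which may or may not match the exact set PASTA1 expects. For bowtie2, substitute the corresponding bowtie2 names.

```python
import shutil


def find_missing(executables):
    """Return the names in `executables` that cannot be found on PATH."""
    return [name for name in executables if shutil.which(name) is None]


# Assumption: the standard bowtie executable names; adjust to match
# the list required by your PASTA1 installation.
REQUIRED = ["bowtie", "bowtie-build", "bowtie-inspect"]

if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Not found in PATH:", ", ".join(missing))
    else:
        print("All required executables were found in PATH.")
```

If an executable is reported as missing, either add its directory to PATH or pass its location to PASTA1 on the command line, as described in the tutorial.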
PASTA1 produces the following two output files:
- junctions.bed (or the filename specified with the -outfile argument): a BED file containing the location of predicted splice junctions.
- alignments.sam (or the filename specified with the -sam argument): read alignments in SAM format.
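Because the junction file is in BED format, it can be read by any BED-aware tool. As a minimal sketch, assuming only the three mandatory BED columns (chromosome, start, end), the predicted junctions could be loaded in Python as shown below. The function name read_junctions is hypothetical, and any additional columns PASTA1 writes are ignored here because they are not described in this section.

```python
def read_junctions(lines):
    """Parse BED lines into (chrom, start, end) tuples.

    Only the three mandatory BED columns are used. Blank lines,
    comments, and 'track'/'browser' header lines are skipped.
    """
    junctions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        junctions.append((chrom, start, end))
    return junctions


if __name__ == "__main__":
    # Illustrative lines only; not real PASTA1 output.
    example = [
        "track name=junctions",
        "chr1\t100\t200\tj1\t5\t+",
        "chr2\t300\t450",
    ]
    for chrom, start, end in read_junctions(example):
        print(chrom, start, end)
```

To read an actual output file, pass an open file handle, for example `read_junctions(open("junctions.bed"))`.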
How it works
The first step in a PASTA run is a straightforward alignment of the RNA-Seq short reads to the reference genome, using bowtie. PASTA then considers the set of reads for which no full-length alignment was found. Each unaligned read is split at different points along its length, producing a set of fragment pairs. The user can control the splitting to make the process faster (a larger stepping distance, leading to fewer fragment pairs) or more accurate (a smaller stepping distance, leading to more fragment pairs). These patterned sequences are then aligned to the reference sequence, and the results are used by a logistic regression model to identify the exact position of exon-exon junctions. The model also takes into account organism-specific information, if available, such as expected intron length, branch-point sequence, and splice signals. The model coefficients are pre-computed and differ for each organism. The example configuration file included in the PASTA package provides the coefficients for the mouse genome.
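The trade-off between stepping distance and number of fragment pairs, and the general form of a logistic regression score, can be illustrated with a short Python sketch. This is not PASTA's implementation: the function names, the step and min_len parameters, and the feature values and coefficients are all hypothetical, and PASTA's actual splitting pattern and model features are not specified in this section.

```python
import math


def fragment_pairs(read, step, min_len):
    """Split `read` into (left, right) pairs at every `step`-th position.

    Split points are chosen so that both fragments are at least
    `min_len` bases long. A larger `step` yields fewer pairs.
    """
    pairs = []
    for cut in range(min_len, len(read) - min_len + 1, step):
        pairs.append((read[:cut], read[cut:]))
    return pairs


def logistic_score(features, coefficients, intercept):
    """Standard logistic regression: sigmoid of a weighted feature sum."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    read = "ACGTACGTACGTACGTACGT"  # illustrative 20-base read
    for step in (1, 4):
        pairs = fragment_pairs(read, step=step, min_len=6)
        print("step", step, "->", len(pairs), "fragment pairs")

    # Hypothetical features and coefficients, for illustration only.
    score = logistic_score([1.0, 0.5], [0.8, -0.3], intercept=-0.2)
    print("junction score:", round(score, 3))
```

In this sketch every pair concatenates back to the original read, so a smaller step simply samples more candidate split points, which corresponds to the more accurate but slower setting described above.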
PASTA: splice junction identification from RNA-Sequencing data
Tang S, Riva A.
BMC Bioinformatics. 2013; 14:116