PASA, acronym for Program to Assemble Spliced Alignments, is a Eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures, and to maintain gene structure annotation consistent with the most recently available experimental sequence data. PASA also identifies and classifies all splicing variations supported by the transcript alignments.
PASA was originally developed at The Institute for Genomic Research in 2002 as an effort to automatically improve gene structures in Arabidopsis thaliana. Since then, it has been applied to numerous Eukaryotic genome annotation projects including Rice, Aspergillus species, Plasmodium falciparum, Schistosoma mansoni, and Aedes aegypti. We have also applied it successfully, to a lesser extent, to Mouse and Human, among others.
Functions of PASA include:
model complete and partial gene structures based on assembled spliced alignments.
automatically incorporate gene structures based on transcript alignments into existing gene structure annotations, thereby maintaining annotations consistent with experimental evidence. Annotation updates include
modeling untranslated regions (UTRs)
exon additions, deletions, adjustments
addition of models for alternative splicing variants
merging genes
splitting genes
modeling novel genes
map polyadenylation sites to the genome
identify and classify all found splicing variations
PASA is composed of a pipeline of utilities that perform the following ordered set of tasks:
cleaning the transcripts
The seqclean utility, developed by the TIGR Gene Index group, is used to identify evidence of polyadenylation and strip the poly-A, trim vector, and discard low quality sequences.
mapping and aligning transcripts to the genome
GMAP or BLAT is used to map and align the transcripts to the genome. In case this alignment is deemed unsuitable (see below), sim4 can be used to realign the transcript to the mapped region of the genome. Only the single best alignment for each transcript is analyzed.
Validate nearly perfect alignments
PASA utilizes only near perfect alignments. These alignments are required to align with a specified percent identity (typically 95%) along a specified percent of the transcript length (typically 90%). Each alignment is also required to have consensus splice sites at all inferred intron boundaries: a GT or GC donor paired with an AG acceptor, or the AT-AC U12-type dinucleotide pair.
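The validation rule can be illustrated with a short Python sketch. This is not PASA's own code: the function name, its arguments, and the representation of intron boundaries as (donor, acceptor) dinucleotide pairs are assumptions made for illustration, and the default thresholds are simply the typical values quoted above.

```python
# Illustrative sketch only; names and data layout are hypothetical, not PASA's API.
CONSENSUS_SPLICE_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def is_valid_alignment(percent_identity, percent_aligned, intron_boundaries,
                       min_identity=95.0, min_aligned=90.0):
    """Return True if an alignment passes the near-perfect alignment test.

    intron_boundaries is a list of (donor, acceptor) dinucleotides, read on
    the transcribed strand, one pair per inferred intron.
    """
    if percent_identity < min_identity or percent_aligned < min_aligned:
        return False
    # Every inferred intron must use one of the accepted dinucleotide pairs.
    return all((donor.upper(), acceptor.upper()) in CONSENSUS_SPLICE_PAIRS
               for donor, acceptor in intron_boundaries)
```

For example, is_valid_alignment(98.0, 95.0, [("GT", "AG")]) passes, while the same alignment with a ("GT", "AC") intron would be rejected.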
Maximal assembly of spliced alignments
The valid transcript alignments are clustered based on genome mapping location and assembled into gene structures that include the maximal number of compatible transcript alignments. Compatible alignments are those that have identical gene structures in their region of overlap. The products are termed PASA maximal alignment assemblies. Those assemblies that contain at least one full-length cDNA are termed FL-assemblies; the rest are non-FL-assemblies.
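The notion of compatibility can be made concrete with a small Python sketch. It assumes each alignment is given as a list of (lend, rend) exon coordinates on the same genome sequence, and it treats two alignments as compatible when the introns falling inside their shared span are identical. Transcribed orientation is ignored here, and the function names are hypothetical; PASA's own assembler is the C++ pasa utility.

```python
# Illustrative sketch only; not the algorithm as implemented in PASA.
def introns(exons):
    """Return intron intervals implied by a list of (lend, rend) exons."""
    exons = sorted(exons)
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(exons, exons[1:])]


def clipped_introns(exons, lo, hi):
    """Introns of an alignment, clipped to the window [lo, hi]."""
    clipped = []
    for start, end in introns(exons):
        start, end = max(start, lo), min(end, hi)
        if start <= end:
            clipped.append((start, end))
    return clipped


def compatible(exons_a, exons_b):
    """True if two overlapping alignments share the same structure in their overlap."""
    lo = max(min(s for s, _ in exons_a), min(s for s, _ in exons_b))
    hi = min(max(e for _, e in exons_a), max(e for _, e in exons_b))
    if lo > hi:
        return False  # the alignments do not overlap at all
    return clipped_introns(exons_a, lo, hi) == clipped_introns(exons_b, lo, hi)
```

With this definition, an alignment that places an exon inside another alignment's intron is incompatible with it, whereas a single-exon alignment lying entirely within a shared exon is compatible with both.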
Grouping alternatively spliced isoforms
Alignment assemblies that map to the same genomic locus, significantly overlap, and are transcribed on the same strand, are grouped into clusters of assemblies.
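As a rough illustration of this grouping step, the Python sketch below clusters assemblies by single linkage: any two assemblies on the same genome sequence and strand whose spans overlap by at least min_overlap bases end up in the same cluster. The tuple layout, the function name, and the min_overlap parameter are assumptions; the text above does not state what PASA counts as a significant overlap.

```python
# Illustrative sketch only; the overlap criterion is a placeholder assumption.
def cluster_assemblies(assemblies, min_overlap=1):
    """Group assemblies given as (name, contig, strand, lend, rend) tuples.

    Returns a sorted list of clusters, each a list of assembly names.
    """
    parent = list(range(len(assemblies)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (_, contig_i, strand_i, lend_i, rend_i) in enumerate(assemblies):
        for j in range(i + 1, len(assemblies)):
            _, contig_j, strand_j, lend_j, rend_j = assemblies[j]
            overlap = min(rend_i, rend_j) - max(lend_i, lend_j) + 1
            if contig_i == contig_j and strand_i == strand_j and overlap >= min_overlap:
                parent[find(i)] = find(j)

    groups = {}
    for i, assembly in enumerate(assemblies):
        groups.setdefault(find(i), []).append(assembly[0])
    return sorted(groups.values())
```

Assemblies on opposite strands are never merged, even when their coordinates overlap.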
Automatic Genome Annotation
Given a set of existing gene structure annotations, which may include the latest annotation for a given genome or the results of a single ab-initio gene finder, a comparison to the PASA alignment assemblies is performed. Each alignment assembly is assigned a status identifier based on the results of the annotation comparison. The status identifier indicates whether or not the update is sanctioned as likely to improve the annotation, and the type of update that the assembly provides. There are over 40 different status identifiers (actually, about 20 since half correspond to FL-assemblies and the other half to non-FL-assemblies).
In the absence of any preexisting gene annotations, novel genes and alternative splicing isoforms of novel genes can be modeled.
At any time, regardless of any existing annotations, users can obtain candidate gene structures based on the longest open reading frame (ORF) found within each PASA alignment assembly. The output includes a fasta file for the proteins and a GFF3 file describing the gene structures. This is useful when applied to a previously uncharacterized genome sequence, allowing one to rapidly obtain a set of candidate gene structures for training various ab-initio gene prediction programs.
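The idea of a longest-ORF search can be sketched in Python as follows. This simplified version scans only the three forward reading frames of an assembly sequence and only considers complete ORFs that run from ATG to an in-frame stop codon. Whether PASA also accepts partial ORFs, or applies other criteria, is not stated here, so treat this as an assumption-laden illustration rather than PASA's method.

```python
# Illustrative sketch only; forward strand, complete ORFs (ATG..stop) only.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(sequence):
    """Return the longest ATG-to-stop ORF (stop codon included), or '' if none."""
    sequence = sequence.upper()
    best_start, best_end = 0, 0  # half-open interval of the best ORF so far
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if orf_start is None and codon == "ATG":
                orf_start = i
            elif orf_start is not None and codon in STOP_CODONS:
                if i + 3 - orf_start > best_end - best_start:
                    best_start, best_end = orf_start, i + 3
                orf_start = None
    return sequence[best_start:best_end]
```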
PASA runs on a UNIX/LINUX-based architecture. PASA involves components written in Perl and C++. Utilities used by PASA, including GMAP, are wrapped by Perl code. Results are stored primarily within a MySQL database, and are available for analysis using the companion suite of Web-based tools and command-line utilities. Running PASA to generate alignment assemblies requires only three inputs: a multi-fasta file for the genome, a multi-fasta file for the transcripts, and a file containing only the accessions of those entries in the transcript file that are considered full-length cDNAs. In order to compare the assemblies to existing gene structure annotations and to forcefully update those annotations based on the alignment assemblies, the user must integrate PASA into their own annotation system by implementing the available data adapters, as described in the Data Adapter section below.
Download the latest version of the PASA software straight from Sourceforge using CVS like so:
% cvs -d :pserver:anonymous@pasa.cvs.sourceforge.net:/cvsroot/pasa co PASA
In addition to the PASA software obtained here, you will need the following:
Relational Database
MySQL (www.mysql.com)
create a user/password with read-only access
create a user/password with all privileges
Webserver
Apache (www.apache.org)
Perl Modules from CPAN (www.cpan.org):
DBD::mysql
GD
Bioinformatics Tools:
GMAP v.2005-10-25 (gmap-2005-10-25.tar.gz) Important: Use this version only for now.
blat: (Jim Kent's homepage Download the blat suite under the Executables link)
fasta: (Fasta3 ftp site) Note that the fasta utility is bundled with other utilities as part of the Fasta3 suite. The fasta utility (ie. named fasta34) should be renamed (or symlinked) to fasta.
Note: The utilities provided by each software package above should be available via your PATH setting.
Move the PASA distribution to a location on your filesystem, such as /usr/local/bin/PASA. Henceforth, we'll refer to this location as $PASAHOME.
The PASA distribution includes the following utilities that you should build and centrally install:
pasa (the alignment assembler utility). $PASAHOME/pasa_cpp contains the source for the spliced alignment assembler utility. Build it like so:
% cd pasa_cpp
% make
Install the pasa binary in a central location (ie. /usr/local/bin/).
slclust (a clustering utility). $PASAHOME/SLCLUST contains the source for a single-linkage clustering utility. Build it like so:
% cd SLCLUST
% make install
Copy the bin/slclust utility to a central location.
sim4 $PASAHOME/SIM4_MOD contains a version of sim4 with a slightly modified output format. (generously supplied by Liliana Florea). Build it like so:
% cd SIM4_MOD/sim4.2002-03-03_mod/
% make
Install the sim4-mod binary in a central location.
seqclean $PASAHOME/seqclean provides the seqclean software developed by the TIGR Gene Index Group, and distributed along with PASA by permission of John Quackenbush.
Install the software by following the instructions provided.
cdbtools (fasta file indexing and entry retrieval) $PASAHOME/cdbtools provides the cdbyank and cdbfasta utilities. The CdbTools were written by Geo Pertea, formerly of the TIGR Gene Index group.
Build the software as per the instructions included and install them in a central location
After installing each of the software tools above, all that is needed before running PASA is to configure it. The PASA configuration relies on the file: $PASAHOME/pasa_conf/conf.txt
A template configuration file is provided at $PASAHOME/pasa_conf/pasa.CONFIG.template
Simply copy pasa.CONFIG.template to conf.txt and set the values for your MySQL database settings. You only need concern yourself with the following values:
PASA_ADMIN_EMAIL=(your email address)
MYSQLSERVER=(your mysql server name)
MYSQL_RO_USER=(mysql read-only username)
MYSQL_RO_PASSWORD=(mysql read-only password)
MYSQL_RW_USER=(mysql all privileges username)
MYSQL_RW_PASSWORD=(mysql all privileges password)
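Before launching PASA, it can be handy to confirm that these six values have actually been filled in. The Python sketch below is a hypothetical helper, not part of the PASA distribution. It assumes conf.txt consists of KEY=VALUE lines and that lines beginning with # are comments; the key names are the ones listed above.

```python
# Illustrative sketch only; assumes KEY=VALUE lines and '#' comment lines.
REQUIRED_KEYS = [
    "PASA_ADMIN_EMAIL",
    "MYSQLSERVER",
    "MYSQL_RO_USER",
    "MYSQL_RO_PASSWORD",
    "MYSQL_RW_USER",
    "MYSQL_RW_PASSWORD",
]


def parse_conf(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_settings(settings):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]
```

A typical use would be to read $PASAHOME/pasa_conf/conf.txt, pass its contents to parse_conf, and stop with an error message if missing_settings returns a non-empty list.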
Recursively copy the $PASAHOME area to the cgi-bin directory of your webserver. Change permissions on everything so that it is world executable (ie. % chmod -R 755 ./PASA ). Now, visit the URL for the status report page for the pasa database created during your pasa run.
http://yourServerName/cgi-bin/PASA/cgi-bin/status_report.cgi?db=$mysqldb
This will provide some summary statistics and links to additional web-based utilities for navigating the results from your pasa run.
Now that you have your base PASA URL, update your original configuration file at $PASAHOME/pasa_conf/conf.txt to set the value of BASE_PASA_URL=http://yourServerName/cgi-bin/PASA/cgi-bin/
As input to the command-line driven PASA pipeline, we need only three input files.
The genome sequence in a multiFasta file (ie. genome.fasta)
The transcript sequences in a multiFasta file (ie. transcripts.fasta)
A file containing the list of accessions corresponding to full-length cDNAs (ie. FL_accs.txt)
Have each of these files in the same working directory. Then, run the seqclean utility on your transcripts like so:
% seqclean transcripts.fasta
If you have a database of vector sequences (ie. UniVec), you can screen for vector as part of the cleaning process by running the following instead:
% seqclean transcripts.fasta -v /path/to/your/vectors.fasta
This will generate several output files, including transcripts.fasta.cln and transcripts.fasta.clean. Both of these can be used as inputs to PASA.
Copy $PASAHOME/pasa_conf/pasa.alignAssembly.Template.txt to your working directory as alignAssembly.config.
Edit this configuration file, replacing <__MYSQLDB__> with the name of the pasa mysql database to be created as part of this pasa run.
You need not make any other changes at this time. We'll discuss more extensive parameterization of PASA via the runtime configuration files later.
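If you prefer to script this edit rather than do it by hand, a minimal Python sketch follows. It only substitutes the <__MYSQLDB__> placeholder named above; the function name is hypothetical and nothing else in the template is touched.

```python
# Illustrative sketch only; a hypothetical helper, not part of PASA.
PLACEHOLDER = "<__MYSQLDB__>"


def fill_template(template_text, database_name):
    """Return the template text with the database placeholder filled in."""
    if PLACEHOLDER not in template_text:
        raise ValueError("template does not contain " + PLACEHOLDER)
    return template_text.replace(PLACEHOLDER, database_name)
```

For example, read the template file into a string, call fill_template(text, "pasa_sample_db"), and write the result to alignAssembly.config in your working directory.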
% $PASAHOME/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome.fasta -t transcripts.fasta.clean -T -u transcripts.fasta -f FL_accs.txt --USE_GMAP
This should execute the numerous steps that involve processing and assembling the transcript alignments. The pipeline generates several output files in your working directory, most notably:
tentative cDNA sequences for PASA alignment assemblies
coordinates for pasa alignment assemblies on the genome
coordinates for each underlying valid transcript alignment
a report for every transcript mapped to the genome, and an indicator for the success of the validation test. Those failing validation include a short cryptic message describing why.
Most of the remaining output files exist for auditing and tracking purposes, and are not of interest to most users. The primary access to the data is thru the companion web portal.
Sample inputs are provided in the $PASAHOME/sample_data directory. We'll use these inputs to demonstrate the breadth of the software application, including using sample DATA ADAPTERs to import existing gene annotations into the database and to export tentative structural updates.
For my PASA configuration file, I'm using $PASAHOME/pasa_conf/sample_test.conf symlinked to conf.txt in that same directory. In it, the mysql server and user/password information are set. This configuration file need only be established once per PASA software installation. You should configure your conf.txt file immediately if you haven't done so already.
The next steps explain the current contents of the sample_data directory. You need not redo these operations:
I've copied the ../pasa_conf/pasa.alignAssembly.Template.txt to alignAssembly.config and edited the pasa database name to pasa_sample_db.
My required input files exist as: genome_sample.fasta, all_transcripts.fasta, and FL_accs.txt
I already ran seqclean to generate files: all_transcripts.fasta.clean and all_transcripts.fasta.cln
You must execute the following steps yourself in order to demonstrate the software.
Run the PASA alignment assembly pipeline like so:
% ../scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome_sample.fasta -t all_transcripts.fasta.clean -T -u all_transcripts.fasta -f FL_accs.txt --USE_GMAP
There are two primary forms of data adapters. One is used to load the latest versions of annotations into the mysql database, and the other is to perform annotation updates based on the results of an annotation comparison. The Data Adapters need to be implemented by the user, although sample implementations are available and described here.
Data Adapters, termed here as hooks, are Perl modules that implement abstract interfaces specific for each operation. The $PASAHOME/pasa_conf/conf.txt file indicates the directory where these custom modules are to be located. In our sample_test.conf file, we have the following line:
HOOK_PERL_LIBS=__PASAHOME__/SAMPLE_HOOKS
which indicates that our example hooks are modules that are found in the $PASAHOME/SAMPLE_HOOKS directory. The actual modules that implement the hooks are specified in the lines:
HOOK_EXISTING_GENE_ANNOTATION_LOADER=Sample_annot_retriever::get_annot_retriever
and
HOOK_GENE_STRUCTURE_UPDATER=Sample_annot_updater::get_updater_obj
The HOOK_EXISTING_GENE_ANNOTATION_LOADER=Sample_annot_retriever::get_annot_retriever indicates that the module Sample_annot_retriever.pm found in the SAMPLE_HOOKS directory implements a method get_annot_retriever() which returns an object that inherits from and implements the functions of package PASA_UPDATES::Pasa_latest_annot_retrieval_adapter.
The HOOK_GENE_STRUCTURE_UPDATER=Sample_annot_updater::get_updater_obj indicates that the module Sample_annot_updater.pm found in the SAMPLE_HOOKS directory implements a method get_updater_obj() which returns an object that inherits from and implements the functions of package PASA_UPDATES::Pasa_update_adapter.
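The way a hook value maps onto a module file and a method name can be shown with a few lines of Python. The hooks themselves are Perl modules; this sketch only illustrates how a Module::method value from conf.txt is read, and the function name is hypothetical.

```python
# Illustrative sketch only; shows how a hook value is interpreted, nothing more.
def split_hook(value):
    """Split 'Module::method' into the module file name and the method name."""
    module, separator, method = value.rpartition("::")
    if not separator or not module or not method:
        raise ValueError("expected a value of the form Module::method")
    return module + ".pm", method
```

Applied to Sample_annot_retriever::get_annot_retriever, this yields the file Sample_annot_retriever.pm and the method get_annot_retriever.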
In our example, we have our original annotations supplied in gff3 format, and our gene annotation loader hook implementing a gff3 file reader to supply the genome annotations. We run the following script to load our annotations:
% ../scripts/Load_Current_Gene_Annotations.dbi -c alignAssembly.config -g genome_sample.fasta -P orig_annotations_sample.gff3
The above script calls the gene annotation loader hook as specified in our sample conf.txt file. The value provided to -P is passed as a parameter to the function called as the hook. In this example, the parameter is the name of the gff3 file that contains the annotations. In a different implementation of this hook (ie. at TIGR), this parameter is instead a set of values that are needed to connect to a relational database from which the annotations are extracted.
This system is designed to be flexible so that the annotations can be extracted from any source, relying on a custom implementation of the data adapter specified in the conf.txt file.
Now that the original annotations are loaded, we can perform a comparison of the PASA alignment assemblies to these preexisting gene annotations, to identify cases where updates can be automatically performed to gene structures in order to incorporate the transcript alignments.
I've copied the ../pasa_conf/pasa.annotationCompare.Template.txt file to our working directory as annotCompare.config. Then, I replaced the MYSQLDB=<__MYSQLDB__> line with MYSQLDB=pasa_sample_db as before with the alignAssembly.config file. Notice this config file contains numerous parameters that can be modified to tune the process to any genome of interest. We'll leave these values untouched for now, relying on the defaults used by PASA, and we'll revisit parameterization later. For most purposes, the defaults are well suited. Run the annotation comparison like so:
% ../scripts/Launch_PASA_pipeline.pl -c annotCompare.config -A -g genome_sample.fasta -t all_transcripts.fasta.clean
Once this finishes, you should revisit the status_report.cgi web page as described above under Setting Up the PASA Web Portal. There, you will be able to navigate the results of the comparison and examine the classifications for annotation updates assigned to each pasa alignment assembly.
After the annotation comparison, there will likely be some subset of PASA alignment assemblies that are classified as able to be successfully incorporated into gene structure annotations. To extract these and perform updates, we need to run the following script, which calls the hook to the gene structure update adapter.
% ../scripts/cDNA_annotation_updater.dbi -M "pasa_sample_db:bhaas-lx:access:access" -P null
Again, if we needed to pass some critical piece of information to the hook, such as database connection parameters, we would do that thru the -P option of the script. The example data adapter here does nothing but print the tentative successful gene structure updates to stdout, and so we simply pass null to the -P option just so the parameter won't be empty.
Note: It usually requires at least two cycles of annotation loading, annotation comparison, and annotation updates in order to maximize the incorporation of transcript alignments into gene structures. Updates made to gene structures in the first round often lead to the capacity to incorporate additional transcript alignments that did not fit well in the context of the earlier gene structures.
For demonstration purposes, PASA was applied to a set of 100 Arabidopsis BAC sequences, ~700,000 EST and cDNAs, and genemarkHMM ab-initio gene predictions, and the results are made accessible here.
If seqclean was used to clean the transcript sequences, and both the cleaned and original transcript databases were provided in the alignment assembly run of the PASA pipeline as described, then the polyadenylation sites as evidenced in the original transcript sequences and identified as part of the seqclean process were mapped to the genome. The termini of the polyadenylated transcripts are compared to the genome, and those transcripts that truly appear to be polyadenylated and not resulting from an artifact of internal priming to an A-rich region, are reported as candidate polyA sites. The genome coordinate reported as the polyA site is the nucleotide to which polyA is added, so it corresponds to the last non-polyA nucleotide of the polyadenylated transcript. An example of a candidate polyA site can be extracted from one of the output files (default output.polyAsite_analysis.out) like so:
// cdna:gi|51968615|dbj|AK175237.1|, annotdb_asmbl_id:68712, polyAcoord:50443, transcribedOrient:+, rend
CGCTTCTTATattacagggt
CGCTTCTTATAAAAAAAAAA
gi|51968615|dbj|AK175237.1| TransOrient (+) trimmedSeq: AAAAAAAAAA
OK polyA site candidate.
An additional fasta file (default output.polyAsites.fasta) summarizes all mapped polyA sites supported by the transcripts. A 100 bp segment of the genome sequence is extracted and oriented, and the last nucleotide in uppercase corresponds to the residue to which polyA is added in the processed transcript. The site corresponding to our example above is as follows:
>68712-50443_+ 1 transcripts: gi|51968615|dbj|AK175237.1|
ATCGACCACCCTCTTTTTTATAAGTAACTTTTCAAGATAACGCTTCTTATattacagggtctacttccattacaaatgcaataggtttgatggttaataa
The accession is bundled like so:
genome_accession - polyA_coordinate _ transcribed_orientation
The rest of the header indicates the number of transcripts supporting this polyA site followed by the list of those transcript accessions. The examples above were extracted from our sample data set provided. A more compelling example for Arabidopsis, using spliced transcripts only, is as follows:
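The header layout and the uppercase/lowercase convention lend themselves to a short parsing sketch in Python. The function names and the returned dictionary keys are hypothetical. The sketch assumes the header fields are separated by whitespace exactly as in the examples, and it splits the accession from the right so that genome accessions containing - or _ are still handled.

```python
# Illustrative sketch only; field names in the returned dict are hypothetical.
def parse_polya_header(header):
    """Parse a header such as '>68712-50443_+ 1 transcripts: acc1,acc2'."""
    accession, count, _label, transcripts = header.lstrip(">").split(None, 3)
    left, _, orientation = accession.rpartition("_")
    genome_accession, _, coordinate = left.rpartition("-")
    return {
        "genome_accession": genome_accession,
        "polyA_coordinate": int(coordinate),
        "transcribed_orientation": orientation,
        "transcript_count": int(count),
        "transcripts": transcripts.strip().split(","),
    }


def polya_site_offset(segment):
    """0-based offset of the last uppercase nucleotide, or -1 if there is none."""
    for i in range(len(segment) - 1, -1, -1):
        if segment[i].isupper():
            return i
    return -1
```

The offset returned by polya_site_offset marks the residue to which polyA is added within the extracted, oriented genome segment.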
>chr5-506542_- 44 transcripts: gi|86086725|gb|DR382484.1|DR382484,gi|86082384|gb|DR378143.1|DR378143,gi|86082270|gb|DR378029.1|DR378029,gi|86082193|gb|DR377952.1|DR377952,gi|86082172|gb|DR377931.1|DR377931,gi|86082156|gb|DR377915.1|DR377915,gi|86082123|gb|DR377882.1|DR377882,gi|86082071|gb|DR377830.1|DR377830,gi|86081971|gb|DR377730.1|DR377730,gi|86081887|gb|DR377646.1|DR377646,gi|86081885|gb|DR377644.1|DR377644,gi|86081868|gb|DR377627.1|DR377627,gi|86081709|gb|DR377466.1|DR377466,gi|86081657|gb|DR377414.1|DR377414,gi|86081635|gb|DR377392.1|DR377392,gi|86081559|gb|DR377316.1|DR377316,gi|86081550|gb|DR377307.1|DR377307,gi|86081543|gb|DR377300.1|DR377300,gi|86081529|gb|DR377286.1|DR377286,gi|86081252|gb|DR377009.1|DR377009,gi|86081247|gb|DR377004.1|DR377004,gi|86081239|gb|DR376996.1|DR376996,gi|86079014|gb|DR374771.1|DR374771,gi|86076986|gb|DR372743.1|DR372743,gi|85870703|gb|DR191655.1|DR191655,gi|85869935|gb|DR190887.1|DR190887,gi|85869920|gb|DR190872.1|DR190872,gi|85869608|gb|DR190560.1|DR190560,gi|85869452|gb|DR190404.1|DR190404,gi|85869353|gb|DR190305.1|DR190305,gi|85869352|gb|DR190304.1|DR190304,gi|85869340|gb|DR190292.1|DR190292,gi|85869337|gb|DR190289.1|DR190289,gi|85869336|gb|DR190288.1|DR190288,gi|85869335|gb|DR190287.1|DR190287,gi|85869329|gb|DR190281.1|DR190281,gi|85868471|gb|DR189423.1|DR189423,gi|85867798|gb|DR188750.1|DR188750,gi|85867058|gb|DR188010.1|DR188010,gi|49285508|gb|BP634256.1|BP634256,gi|32888810|gb|CB264037.1|CB264037,gi|32888295|gb|CB263522.1|CB263522,gi|32885705|gb|CB260932.1|CB260932,gi|32885650|gb|CB260877.1|CB260877
GTTTTATCTTTGTGACTTTATTAATCCTAAGACTATTATGGGTTTGTATTaaagtttgcttctttcttgctcactacacaattaagattcaagcccattg
Note: Polyadenylation sites identified here require that there is evidence of polyadenylation in the original transcript sequence. Other systems examine clusters of transcript alignment termini within windows. This is not done here yet as part of PASA. Only those polyA sites supported by experimental evidence of polyadenylation are reported.
PASA is a tool well suited to the identification and classification of alternative splicing isoforms as evidenced by incompatible transcript alignments. Overlapping alignments are incompatible when they have some structural difference within their overlapping region; because of this incompatibility, they are relegated to different but overlapping alignment assemblies. PASA performs an all-vs-all comparison among the clustered overlapping alignment assemblies to identify the following categories of splicing variations:
alternative donor or acceptor
retained or spliced intron
starts or ends in an intron
skipped or retained exons
alternate terminal exons
The automated alternative splicing analysis can be run like so as exemplified from the sample_data directory:
% ../scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -g genome_sample.fasta -t all_transcripts.fasta.clean --ALT_SPLICE
The results are available in the default output files:
output.alt_splice_label_combinations.dat: a tab-delimited listing that contains all unique splicing labels for each pasa alignment assembly labeled with a variation. For example:
genome pasa_acc assembly_cluster combinations_of_labels
68711 asmbl_2 1 ends_in_intron
68711 asmbl_6 3 alt_donor
68711 asmbl_4 3 alt_donor
68711 asmbl_10 6 alt_acceptor, retained_exon, skipped_exon
68711 asmbl_11 6 alt_acceptor, retained_exon, skipped_exon
68711 asmbl_9 6 alt_acceptor, retained_exon, skipped_exon
68711 asmbl_24 14 spliced_intron, starts_in_intron
68711 asmbl_23 14 retained_intron
...
A second output file provides the genome coordinates for each alternative splicing label applied to each corresponding pasa alignment assembly. For example:
genome_acc pasa_acc assembly_cluster altsplice_label genome_lend genome_rend transcribed_orient list_of_cdnas_supporting_variation
68711 asmbl_10 6 alt_acceptor 35633 35634 - gi|42468094|emb|BX819464.1|CNS0A8YA
68711 asmbl_11 6 alt_acceptor 35639 35640 - gi|6782248|emb|AJ271597.1|ATH271597
68711 asmbl_10 6 retained_exon 35448 35498 - gi|42468094|emb|BX819464.1|CNS0A8YA,gi|42528978|gb|BX835128.1|BX835128
68711 asmbl_11 6 skipped_exon 35448 35498 - gi|6782248|emb|AJ271597.1|ATH271597
68711 asmbl_10 6 retained_exon 36174 36227 - gi|42468094|emb|BX819464.1|CNS0A8YA,gi|42532609|gb|BX838526.1|BX838526
68711 asmbl_11 6 skipped_exon 36174 36227 - gi|6782248|emb|AJ271597.1|ATH271597
68711 asmbl_11 6 retained_exon 36268 36309 - gi|6782248|emb|AJ271597.1|ATH271597
68711 asmbl_10 6 skipped_exon 36268 36309 - gi|42468094|emb|BX819464.1|CNS0A8YA,gi|42532609|gb|BX838526.1|BX838526
68711 asmbl_11 6 retained_exon 36879 37028 - gi|6782248|emb|AJ271597.1|ATH271597
68711 asmbl_10 6 skipped_exon 36879 37028 - gi|42468094|emb|BX819464.1|CNS0A8YA,gi|42532609|gb|BX838526.1|BX838526
68711 asmbl_10 6 alt_acceptor 35633 35634 - gi|42468094|emb|BX819464.1|CNS0A8YA
68711 asmbl_9 6 alt_acceptor 35639 35640 - gi|11125656|emb|AJ294534.1|ATH294534,gi|13398925|emb|AJ276619.1|ATH276619
...
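For downstream analysis, the label-combination listing (output.alt_splice_label_combinations.dat) can be loaded with a few lines of Python. The sketch assumes the file is tab-delimited with a single header row, as the examples suggest, and that multiple labels within the last column are separated by commas. The function name and dictionary keys are hypothetical.

```python
# Illustrative sketch only; assumes one header row and tab-separated columns.
def parse_label_combinations(text):
    """Parse the label-combination listing into a list of row dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:  # skip the header row
        genome, pasa_acc, cluster, labels = line.split("\t")
        rows.append({
            "genome": genome,
            "pasa_acc": pasa_acc,
            "assembly_cluster": int(cluster),
            "labels": [label.strip() for label in labels.split(",")],
        })
    return rows
```

From the resulting rows it is straightforward to, for example, count how many assemblies carry a retained_intron label.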
The PASA web portal provides numerous reports, graphs, and illustrations to navigate the results of the automated alternative splicing analysis.
In our current working directory, there's a file clusters_of_valid_alignments.txt that contains all the clusters of valid alignments in a simple text format like so:
// cluster: number
accession,transcribed_orientation,lend-rend,lend-rend,...
...
The transcribed orientation is +,-, or ?. The ? orientation should be used only for single-exon transcript alignments for which the orientation of transcription is ambiguous. By default, PASA assigns all single-exon transcripts that lack evidence of polyadenylation to the ambiguous transcribed orientation. Given this input file, we can demonstrate the pasa alignment assembler like so:
% ../scripts/pasa_alignment_assembler_textprocessor.pl < clusters_of_valid_alignments.txt
Each cluster of transcript alignments is assembled separately and the results are outputted to stdout with illustrations.
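The cluster file format described above is simple enough to read with a short Python sketch, which may be useful when preparing or inspecting input for the assembler. The function name and the returned structure are hypothetical; the sketch assumes each cluster begins with a // cluster: line followed by one alignment per line.

```python
# Illustrative sketch only; a hypothetical reader for the text format above.
def parse_clusters(text):
    """Return {cluster_number: [(accession, orientation, [(lend, rend), ...]), ...]}."""
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("// cluster:"):
            current = int(line.split(":", 1)[1])
            clusters[current] = []
            continue
        if current is None:
            raise ValueError("alignment line found before any cluster header")
        accession, orientation, *segments = line.split(",")
        exons = [tuple(int(x) for x in segment.split("-")) for segment in segments]
        clusters[current].append((accession, orientation, exons))
    return clusters
```

Each alignment comes back as its accession, its transcribed orientation (+, -, or ?), and its list of exon coordinate pairs.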
Example input
// cluster: 52
gi|14532493|gb|AY039871.1|,-,38468-38715,38808-39953
gi|14532527|gb|AY039888.1|,-,38468-38715,38808-39953
gi|18655376|gb|AY077666.1|,-,38846-39847
gi|19801675|gb|AV782885.1|AV782885,-,38468-38715,38808-39255
gi|19839856|gb|AV805871.1|AV805871,-,38478-38715,38808-38972
gi|19861773|gb|AV819822.1|AV819822,-,38496-38715,38808-39021
gi|19864228|gb|AV822195.1|AV822195,?,39309-39953
gi|21403701|gb|AY084991.1|,-,38331-38715,38912-39950
gi|32362537|gb|CB074156.1|CB074156,?,38866-39212
gi|42467384|emb|BX819813.1|CNS0A8I9,-,38509-38715,38808-39898
gi|42467462|emb|BX820042.1|CNS0A8GI,-,38481-38715,38808-39873
gi|42467544|emb|BX820309.1|CNS0A8LV,-,38509-38715,38808-39907
gi|42467850|emb|BX818822.1|CNS0A905,-,38506-38715,38808-39907
gi|42468073|emb|BX819411.1|CNS0A8VM,-,38495-38715,38912-39907
gi|42468257|emb|BX820772.1|CNS0A8PI,-,38434-38715,38808-39907
gi|49289224|gb|BP637972.1|BP637972,-,38427-38715,38808-38892
gi|56086876|gb|BP562044.2|BP562044,?,39467-39919
gi|58799838|gb|BP779059.1|BP779059,-,38468-38715,38912-39063
gi|59847772|gb|BP811693.1|BP811693,?,39525-39918
gi|59898821|gb|BP837850.1|BP837850,?,39540-39918
gi|86056909|gb|DR352666.1|DR352666,?,39578-39950
gi|86056910|gb|DR352667.1|DR352667,?,39681-39894
gi|86056911|gb|DR352668.1|DR352668,?,39496-39950
gi|86056912|gb|DR352669.1|DR352669,?,39454-39907
gi|86056913|gb|DR352670.1|DR352670,?,39507-39950
gi|86056914|gb|DR352671.1|DR352671,?,39437-39919
gi|86084686|gb|DR380445.1|DR380445,-,38331-38715,38912-39127
gi|8678774|gb|AV519247.1|AV519247,-,38401-38715,38808-38918
gi|8682044|gb|AV522517.1|AV522517,-,38486-38715,38912-39124
gi|8700432|gb|AV538676.1|AV538676,-,38506-38715,38912-39282
Corresponding Output
Individual Alignments: (30)
0 --------------> <--------------------------------------- (a+/s-)gi|21403701|gb|AY084991.1|
1 --------------> <-------- (a+/s-)gi|86084686|gb|DR380445.1|DR380445
2 -----------> <---- (a+/s-)gi|8678774|gb|AV519247.1|AV519247
3 ----------> <--- (a+/s-)gi|49289224|gb|BP637972.1|BP637972
4 ----------> <---------------------------------------- (a+/s-)gi|42468257|emb|BX820772.1|CNS0A8PI
5 ---------> <------------------------------------------ (a+/s-)gi|14532493|gb|AY039871.1|
6 ---------> <------------------------------------------ (a+/s-)gi|14532527|gb|AY039888.1|
7 ---------> <---------------- (a+/s-)gi|19801675|gb|AV782885.1|AV782885
8 ---------> <------ (a+/s-)gi|58799838|gb|BP779059.1|BP779059
9 ---------> <------ (a+/s-)gi|19839856|gb|AV805871.1|AV805871
10 --------> <--------------------------------------- (a+/s-)gi|42467462|emb|BX820042.1|CNS0A8GI
11 --------> <-------- (a+/s-)gi|8682044|gb|AV522517.1|AV522517
12 --------> <------------------------------------- (a+/s-)gi|42468073|emb|BX819411.1|CNS0A8VM
13 --------> <-------- (a+/s-)gi|19861773|gb|AV819822.1|AV819822
14 --------> <---------------------------------------- (a+/s-)gi|42467850|emb|BX818822.1|CNS0A905
15 --------> <-------------- (a+/s-)gi|8700432|gb|AV538676.1|AV538676
16 -------> <---------------------------------------- (a+/s-)gi|42467384|emb|BX819813.1|CNS0A8I9
17 -------> <---------------------------------------- (a+/s-)gi|42467544|emb|BX820309.1|CNS0A8LV
18 -------------------------------------- (a+/s-)gi|18655376|gb|AY077666.1|
19 -------------- (a+/s?)gi|32362537|gb|CB074156.1|CB074156
20 ------------------------- (a+/s?)gi|19864228|gb|AV822195.1|AV822195
21 ------------------- (a+/s?)gi|86056914|gb|DR352671.1|DR352671
22 ----------------- (a+/s?)gi|86056912|gb|DR352669.1|DR352669
23 ------------------ (a+/s?)gi|56086876|gb|BP562044.2|BP562044
24 ------------------ (a+/s?)gi|86056911|gb|DR352668.1|DR352668
25 ----------------- (a+/s?)gi|86056913|gb|DR352670.1|DR352670
26 ---------------- (a+/s?)gi|59847772|gb|BP811693.1|BP811693
27 --------------- (a+/s?)gi|59898821|gb|BP837850.1|BP837850
28 --------------- (a+/s?)gi|86056909|gb|DR352666.1|DR352666
29 --------- (a+/s?)gi|86056910|gb|DR352667.1|DR352667
ASSEMBLIES: (2)
-----------> <------------------------------------------ (a-/s-)gi|8678774|gb|AV519247.1|AV519247/gi|49289224|gb|BP637972.1|BP637972/gi|42468257|emb|BX820772.1|CNS0A8PI/gi|14532493|gb|AY039871.1|/gi|14532527|gb|AY039888.1|/gi|19801675|gb|AV782885.1|AV782885/gi|19839856|gb|AV805871.1|AV805871/gi|42467462|emb|BX820042.1|CNS0A8GI/gi|19861773|gb|AV819822.1|AV819822/gi|42467850|emb|BX818822.1|CNS0A905/gi|42467384|emb|BX819813.1|CNS0A8I9/gi|42467544|emb|BX820309.1|CNS0A8LV/gi|18655376|gb|AY077666.1|/gi|32362537|gb|CB074156.1|CB074156/gi|19864228|gb|AV822195.1|AV822195/gi|86056914|gb|DR352671.1|DR352671/gi|86056912|gb|DR352669.1|DR352669/gi|56086876|gb|BP562044.2|BP562044/gi|86056911|gb|DR352668.1|DR352668/gi|86056913|gb|DR352670.1|DR352670/gi|59847772|gb|BP811693.1|BP811693/gi|59898821|gb|BP837850.1|BP837850/gi|86056909|gb|DR352666.1|DR352666/gi|86056910|gb|DR352667.1|DR352667
--------------> <--------------------------------------- (a-/s-)gi|21403701|gb|AY084991.1|/gi|86084686|gb|DR380445.1|DR380445/gi|58799838|gb|BP779059.1|BP779059/gi|8682044|gb|AV522517.1|AV522517/gi|42468073|emb|BX819411.1|CNS0A8VM/gi|8700432|gb|AV538676.1|AV538676/gi|19864228|gb|AV822195.1|AV822195/gi|86056914|gb|DR352671.1|DR352671/gi|86056912|gb|DR352669.1|DR352669/gi|56086876|gb|BP562044.2|BP562044/gi|86056911|gb|DR352668.1|DR352668/gi|86056913|gb|DR352670.1|DR352670/gi|59847772|gb|BP811693.1|BP811693/gi|59898821|gb|BP837850.1|BP837850/gi|86056909|gb|DR352666.1|DR352666/gi|86056910|gb|DR352667.1|DR352667
Assembly(1): orient(a-/s-) align: 38401(1461)-38715(1147)>YY....XX<38808(1146)-39953(1)
Assembly(2): orient(a-/s-) align: 38331(1427)-38715(1043)>YY....XX<38912(1042)-39953(1)
This system and its original application are described in:
Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res, 31, 5654-5666.
The PASA pipeline is an important component of the Eukaryotic genome annotation at The Institute for Genomic Research. I (Brian Haas) developed and actively maintain the software at the Institute. Please contact me directly at bhaas@tigr.org if you should have any questions or require assistance.