About PGN
PGN is a repository for plant EST sequence data located at Cornell. It
comprises an analysis pipeline and a website, and presently contains
mainly data from the Floral Genome Project.
However, PGN accepts submissions from other sources. This page gives
more information about the methods used in the analysis pipeline.
Sequence processing pipelines
PGN has developed a standard sequence analysis pipeline consisting of
base calling using Phred, vector and
E. coli sequence contamination screening, and unigene assembly. In
addition, a database was developed that also serves as the back end for
the Plant Genome Network website (www.pgn.cornell.edu). The analysis
pipeline and the sequence database are tightly integrated. The quality
screen consists of trimming low-quality sequence, based on Phred
scores, using a custom algorithm that works as follows. To extend the
high-quality sequence as far as possible given a particular quality
threshold, the sequence is scanned and, concomitantly, the difference
between the quality score and the quality threshold (termed the
"adjusted score") is integrated (summed) over the length of the
sequence. The high-quality sequence is defined as the region of
sequence in which the integrated adjusted score is maximal.
Importantly, this can include small regions of lower-scoring
nucleotides if they are "compensated" by higher-scoring downstream
sequence. Next, putative polyA tails are removed if they contain more
than 20 consecutive adenine residues. A contamination screen is
performed to remove E. coli
chromosomal sequences from the dataset. In a final quality screening
step, sequences shorter than a length threshold (150 bp),
sequences with more than 4% ambiguous base calls (Ns), and
low-complexity sequences (defined as sequences composed of more than
60% of one nucleotide) are rejected. The rejected
sequences are not used in unigene builds, but are retained in the
database along with information as to why they were rejected. For each
library, a quick evaluation assembly is generated using Phrap to evaluate the sequences for redundancy.
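As an illustrative sketch only, and not PGN's actual code, the trimming rule described above amounts to finding the contiguous region with the largest sum of adjusted scores, and the final screen is a short series of threshold checks. The function names, the default quality threshold of 15, and the rejection labels are assumptions made for this example; the 150 bp, 4% and 60% limits are taken from the text.

```python
from typing import List, Optional, Tuple


def trim_high_quality(quals: List[int], threshold: int = 15) -> Tuple[int, int]:
    """Return the half-open (start, end) region whose summed adjusted score
    (quality - threshold) is maximal; (0, 0) if no region scores above zero.

    The default threshold is a placeholder, not a value given by PGN.
    """
    best_sum, best_start, best_end = 0, 0, 0
    run_sum, run_start = 0, 0
    for i, q in enumerate(quals):
        if run_sum <= 0:
            # A non-positive running total cannot improve a later region,
            # so start a new candidate region at this base.
            run_sum, run_start = 0, i
        run_sum += q - threshold
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_start, best_end


def rejection_reason(seq: str,
                     min_length: int = 150,
                     max_n_fraction: float = 0.04,
                     max_single_base: float = 0.60) -> Optional[str]:
    """Return a label explaining why a trimmed sequence fails the final
    screen, or None if it passes. Labels are hypothetical."""
    seq = seq.upper()
    if not seq or len(seq) < min_length:
        return "too short"
    if seq.count("N") / len(seq) > max_n_fraction:
        return "too many ambiguous bases"
    if max(seq.count(base) for base in "ACGT") / len(seq) > max_single_base:
        return "low complexity"
    return None


# The single low-scoring base at index 4 is kept because the
# higher-scoring bases after it compensate for it.
quals = [5, 5, 30, 30, 10, 30, 30, 5, 5]
start, end = trim_high_quality(quals, threshold=15)
```

Returning a reason rather than a plain pass/fail flag mirrors the statement that rejected sequences are stored together with the reason for their rejection.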
Unigene building
During a sequencing project, unigene builds are generated at regular
intervals, and at the end of the project, combining all libraries for a
given organism. The sequences are first preclustered, and these
clusters are then assembled with the cap3 program (Huang et al, 1997).
Sequences are also checked for length, complexity, and contamination,
using the same parameters as for the evaluation builds, and extensive
chimera detection is performed. The builds are then uploaded to the database,
where each unigene is assigned a unique unigene ID. This ID
remains unique: it is not reused when the unigene set is built again
(such as when new sequence for a library becomes available). Instead,
every unigene in a subsequent build of the same libraries is assigned
a new ID. Unigenes from a newer build can be traced to the older
builds through the ESTs that they share, and a complete history of
unigene IDs across builds is available on the website.
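Tracing unigenes between builds through shared ESTs can be pictured with the following minimal sketch. It assumes each build is available as a mapping from unigene ID to the set of its member EST identifiers, and that an EST belongs to at most one unigene per build. The function name and the example IDs are hypothetical and do not reflect PGN's database schema.

```python
from collections import defaultdict
from typing import Dict, Set


def map_to_previous_build(new_build: Dict[str, Set[str]],
                          old_build: Dict[str, Set[str]]) -> Dict[str, Dict[str, int]]:
    """For each unigene in the new build, count how many of its ESTs
    were members of each unigene in the old build."""
    # Reverse index: EST identifier -> unigene ID in the old build.
    est_to_old = {}
    for old_id, ests in old_build.items():
        for est in ests:
            est_to_old[est] = old_id

    links = {}
    for new_id, ests in new_build.items():
        counts = defaultdict(int)
        for est in ests:
            old_id = est_to_old.get(est)
            if old_id is not None:  # ESTs new to this build have no ancestor
                counts[old_id] += 1
        links[new_id] = dict(counts)
    return links


# Hypothetical builds: two old unigenes merge into one new unigene,
# and a newly sequenced EST forms a unigene with no predecessor.
old = {"U1": {"est1", "est2"}, "U2": {"est3"}}
new = {"U10": {"est1", "est2", "est3"}, "U11": {"est4"}}
history = map_to_previous_build(new, old)
```

A new unigene with several old IDs in its entry indicates a merge, while an empty entry indicates a unigene built entirely from new sequence.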
Annotation of sequence data
For functional annotation, BLAST is used to find the best match
of each unigene sequence in the GenBank NR database and in the
complete coding sequences from Arabidopsis. These annotations are
stored in the database and serve as the primary source of annotation.
The annotation framework will be extended to Gene Ontology annotations in the
future.
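Selecting the best match per unigene could look like the sketch below. It assumes the BLAST results are available in the standard 12-column tabular format (query ID in the first column, subject ID in the second, bit score in the twelfth) and that "best" means highest bit score. The text does not state which output format or ranking criterion PGN uses, so both are assumptions, as are the function name and the example identifiers.

```python
from typing import Dict, Iterable, Tuple


def best_hits(lines: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Keep, for each query, the subject with the highest bit score."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Hypothetical tabular hits for two unigenes.
example = [
    "U1\tsubjA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t150.0",
    "U1\tsubjB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t180.0",
    "U2\tsubjC\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-5\t60.0",
]
annotations = best_hits(example)
```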