What's the difference between the various fasta format options in Progenesis QI for Proteomics and PLGS? - WKB201722

Article number: 201722

ENVIRONMENT

ProteinLynx Global Server (PLGS)
Progenesis QI for Proteomics

ANSWER

In the New Databank and Edit Databank options menu in PLGS, you are asked to choose a Fasta format from a list of options.

Similarly, in the Identify Peptides tab of Progenesis QI for Proteomics, you are asked to select both a fasta file and the format of that fasta file.

So what's the difference between these options, and why are there several options for what's supposed to be a standard amino acid sequence library file format?

All the options follow the basic fasta format; the difference is how they arrange the data in the header of each sequence within the fasta file. A FASTA record consists of a description line beginning with a “>” symbol, followed by one or more lines containing the amino acid sequence, typically in 60-character blocks. What follows the ">" symbol and precedes the sequence can vary in both content and order. Many variants use the pipe symbol "|", colons, or semicolons to separate fields.

PLGS and Progenesis QI for Proteomics both need to know which fasta variant is being used so that they can correctly read the information about each identification from the header and display it in the results. If you select a variant that does not match the fasta file you're using, header fields such as the accession number and description may be assigned or displayed incorrectly in the results. The PLGS 3.0.3 library import/editor tool has a test button that you can use to confirm that you've selected the right fasta variant when importing a fasta file into the sequence library.
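To make the record layout concrete, here is a minimal Python sketch of a reader that splits fasta text into (header, sequence) pairs. It is an illustration only and not how PLGS or Progenesis QI for Proteomics read the file; the function name read_fasta is hypothetical.

```python
from typing import Iterator, List, Tuple


def read_fasta(lines: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between sequence blocks
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # one block of the sequence
    if header is not None:
        yield header, "".join(chunks)


example = [
    ">IF3_AQUAE|O67653|SPT: Translation initiation factor IF-3.",
    "MSKLKEYRVNRQIRAKECRLIDENGQQIGIVPIEEALKIAEEKGLDLVEIAPQAKPPVCK",
    "IMDYGKFKYELKKKEREARKKQREHQIEVKDIRMKVRIDEHDLQVKLKHMREFLEEGDKV",
    "KVWLRFRGRENIYPELGKKLAERIINELSDIAEVEVQPKKEGNFMIFVLAPKRKK",
]
records = list(read_fasta(example))
header, sequence = records[0]
print(header)
print(len(sequence))
```

The header string returned here is the part whose layout differs between the fasta variants; the sequence handling is the same for all of them.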

Descriptions of some commonly used fasta variants:

FASTA STANDARD

Description line: >NAME|ACCESSION_NUMBER|DATABANK_OF_ORIGIN: DESCRIPTION

Example:

>IF3_AQUAE|O67653|SPT: Translation initiation factor IF-3.

MSKLKEYRVNRQIRAKECRLIDENGQQIGIVPIEEALKIAEEKGLDLVEIAPQAKPPVCK

IMDYGKFKYELKKKEREARKKQREHQIEVKDIRMKVRIDEHDLQVKLKHMREFLEEGDKV

KVWLRFRGRENIYPELGKKLAERIINELSDIAEVEVQPKKEGNFMIFVLAPKRKK
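As a rough illustration of how the FASTA STANDARD description line breaks down into fields, the following Python sketch splits the example header above. The function name and the dictionary keys are hypothetical, and the logic assumes the header follows the NAME|ACCESSION_NUMBER|DATABANK_OF_ORIGIN: DESCRIPTION layout exactly.

```python
def parse_standard_header(line: str) -> dict:
    """Split a '>NAME|ACCESSION|DATABANK: DESCRIPTION' line into fields."""
    body = line.strip()
    if body.startswith(">"):
        body = body[1:]
    # The first two pipes separate name, accession, and the remainder.
    name, accession, rest = body.split("|", 2)
    # The first colon separates the databank of origin from the description.
    databank, _, description = rest.partition(":")
    return {
        "name": name,
        "accession": accession,
        "databank": databank.strip(),
        "description": description.strip(),
    }


fields = parse_standard_header(
    ">IF3_AQUAE|O67653|SPT: Translation initiation factor IF-3."
)
print(fields)
```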

FASTA NCBI_EXPASY_STANDARD

This format comes in two forms: a 2-pipe version and a 4-pipe version. The description line of this particular data bank format is not shortened in any way. The Fasta NCBI EXPASY standard option in PLGS and Progenesis QI for Proteomics accepts both the 2-pipe and 4-pipe versions.

Description line (2 pipes): >NAME|ACCESSION_NUMBER|DATABANK_OF_ORIGIN: DESCRIPTION

Description line (4 pipes): >gi|NUMBER|DATABANK_OF_ORIGIN|ACCESSION_NUMBER|LOCUS_OR_NAME DESCRIPTION

Example of 2-pipe version:

>SP|PLASM_FALCI|(P08978) metal binding protein (DHHC domain) [Plasmodium falciparum 3D7]

MIIWCHIKCLCTNPGFLNETFHFVSDNTTEYDNNVQMCKKCNLLKIKRSHHCSVCDKCIMKMDHHCFWIN

SCVGLYNQKYFILLNFVRTKGKYNTNIIKHL

Example of 4-pipe version:

>gi|3845261|gb|AAC71934.1| metal binding protein (DHHC domain) [Plasmodium falciparum 3D7]

MIIWCHIKCLCTNPGFLNETFHFVSDNTTEYDNNVQMCKKCNLLKIKRSHHCSVCDKCIMKMDHHCFWIN

SCVGLYNQKYFILLNFVRTKGKYNTNIIKHL
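The two forms can be told apart by counting pipes in the description line. The Python sketch below does that and then splits the 4-pipe example into its fields. The function names are hypothetical, and treating a tail that starts with a space as "no locus given" is an assumption made for this example, not documented PLGS behavior.

```python
def pipe_count(line: str) -> int:
    """Number of pipe symbols in a description line."""
    return line.count("|")


def parse_ncbi_4pipe(line: str) -> dict:
    """Split '>gi|NUMBER|DATABANK|ACCESSION|LOCUS_OR_NAME DESCRIPTION'."""
    body = line.strip()
    if body.startswith(">"):
        body = body[1:]
    parts = body.split("|", 4)
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError("not a 4-pipe gi header")
    tail = parts[4]
    if tail[:1].isspace():
        # Assumption: nothing directly after the last pipe means no locus.
        locus, description = "", tail.strip()
    else:
        locus, _, description = tail.partition(" ")
        description = description.strip()
    return {
        "gi_number": parts[1],
        "databank": parts[2],
        "accession": parts[3],
        "locus": locus,
        "description": description,
    }


two_pipe = ">SP|PLASM_FALCI|(P08978) metal binding protein (DHHC domain) [Plasmodium falciparum 3D7]"
four_pipe = ">gi|3845261|gb|AAC71934.1| metal binding protein (DHHC domain) [Plasmodium falciparum 3D7]"
print(pipe_count(two_pipe), pipe_count(four_pipe))
ncbi_fields = parse_ncbi_4pipe(four_pipe)
print(ncbi_fields)
```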

FASTA STANDARD_SPACED

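This format has no pipes in the sequence headers; the fields are separated by spaces. In the example below, the accession number is enclosed in parentheses. Description line: >NAME ACCESSION_NUMBER DESCRIPTION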

Example:

>IF3_AQUAE (O67653) Translation initiation factor IF-3.

MSKLKEYRVNRQIRAKECRLIDENGQQIGIVPIEEALKIAEEKGLDLVEIAPQAKPPVCK

IMDYGKFKYELKKKEREARKKQREHQIEVKDIRMKVRIDEHDLQVKLKHMREFLEEGDKV

KVWLRFRGRENIYPELGKKLAERIINELSDIAEVEVQPKKEGNFMIFVLAPKRKK
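A space-separated header like the one above can be split with a regular expression. This Python sketch is an illustration only; the pattern and function name are hypothetical, and accepting the accession number with or without parentheses is an assumption based on the example.

```python
import re

# NAME, then ACCESSION (optionally in parentheses), then free-text DESCRIPTION.
SPACED_HEADER = re.compile(r"^>?(\S+)\s+\(?([^\s()]+)\)?\s+(.*)$")


def parse_spaced_header(line: str) -> dict:
    """Split a '>NAME ACCESSION DESCRIPTION' line into its three fields."""
    match = SPACED_HEADER.match(line.strip())
    if match is None:
        raise ValueError("header does not look like NAME ACCESSION DESCRIPTION")
    name, accession, description = match.groups()
    return {"name": name, "accession": accession, "description": description}


spaced_fields = parse_spaced_header(
    ">IF3_AQUAE (O67653) Translation initiation factor IF-3."
)
print(spaced_fields)
```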

FASTA Uniprot

A sample Uniprot entry is shown below. In this example, the accession number (Q4U9M9) is followed by a pipe symbol "|", then the entry name (104K_THEAN) and the description.

>Q4U9M9|104K_THEAN 104 kDa microneme-rhoptry antigen precursor (p104) - Theileria annulata

MKFLVLLFNILCLFPILGADELVMSPIPTTDVQPKVTFDINSEVSSGPLYLNPVEMAGVK

YLQLQRQPGVQVHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLK

EGDQWAPIPEDQYLARLQQLRQQIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTP

KNGHICKMVYDKNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLD

DKYAPISVQGYVATIPKLKDFAEPYHPIILDISDIDYVNFYLGDATYHDPGFKIVPKTPQ

CITKVVDGNEVIYESSNPSVECVYKVTYYDKKNESMLRLDLNHSPPSYTSYYAKREGVWV

TSTYIDLEEKIEELQDHRSTELDVMFMSDKDLNVVPLTNGNLEYFMVTPKPHRDIIIVFD

GSEVLWYYEGLENHLVCTWIYVTEGAPRLVHLRVKDRIPQNTDIYMVKFGEYWVRISKTQ

YTQEIKKLIKKSKKKLPSIEEEDSDKHGGPPKGPEPPTGPGHSSSESKEHEDSKESKEPK

EHGSPKETKEGEVTKKPGPAKEHKPSKIPVYTKRPEFPKKSKSPKRPESPKSPKRPVSPQ

RPVSPKSPKRPESLDIPKSPKRPESPKSPKRPVSPQRPVSPRRPESPKSPKSPKSPKSPK

VPFDPKFKEKLYDSYLDKAAKTKETVTLPPVLPTDESFTHTPIGEPTAEQPDDIEPIEES

VFIKETGILTEEVKTEDIHSETGEPEEPKRPDSPTKHSPKPTGTHPSMPKKRRRSDGLAL

STTDLESEAGRILRDPTGKIVTMKRSKSFDDLTTVREKEHMGAEIRKIVVDDDGTEADDE

DTHPSKEKHLSTVRRRRPRPKKSSKSSKPRKPDSAFVPSIIFIFLVSLIVGIL
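The Uniprot layout described above (accession number, pipe, entry name, then description) can be sketched in Python as follows. The function name and dictionary keys are hypothetical, and the code covers only the single-pipe layout shown in this article.

```python
def parse_uniprot_header(line: str) -> dict:
    """Split a '>ACCESSION|ENTRY_NAME DESCRIPTION' line into fields."""
    body = line.strip()
    if body.startswith(">"):
        body = body[1:]
    # Everything before the first pipe is the accession number.
    accession, _, rest = body.partition("|")
    # The entry name runs up to the first space; the rest is the description.
    entry_name, _, description = rest.partition(" ")
    return {
        "accession": accession,
        "entry_name": entry_name,
        "description": description.strip(),
    }


uniprot_fields = parse_uniprot_header(
    ">Q4U9M9|104K_THEAN 104 kDa microneme-rhoptry antigen precursor (p104) - Theileria annulata"
)
print(uniprot_fields["accession"], uniprot_fields["entry_name"])
```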

FASTA LONG_DESCRIPTION

Description line: >NAME DESCRIPTION

This format is used when the description is very long. In the ProteinLynx display, the description is truncated to fit into the viewing area.

Example:

>gp:AL034396_1 PID:5441319 Human DNA sequence from clone 1158B12 on chromosome Xp11.21-11.4 Contains the ZXDA gene for X-linked duplicated Zinc finger A, and MYCL1 (v-myc avian myelocytomatosis viral oncogene homolog 1, lung carcinoma derived) and KRT8 (Keratin 8, Cytokeratin 8, CYK8, Keratin type II skeletal 8) pseudogenes. Contains ESTs, an STS, GSSs and a CpG island, complete sequence; match: proteins: Sw:P98168 Sw:P98169. (gb:AL034396)

MEIPKLLPARGTLQGGGGGGIPAGGGRVHRGPDSPAGQVPTRRLLLPRGPQDGGPGRRRE

EASTASRGPGPSLFAPRPHQPSGGGDDFFLVLLDPVGGDVETAGSGQAAGPVLREEAKAG

PGLQGDESGANPAGCSAQGPHCLSAVPTPAPISAPGPAAAFAGTVTIHNQDLLLRFENGV

LTLATPPPHAWEPGAAPAQQPRCLIAPQAGFPQAAHPGDCPELRSDLLLAEPAEPAPAPA

PQEEAEGLAAALGPRGLLGSGPGVVLYLCPEALCGQTFAKKHQLKMHLLTHSSSQGQRPF

KCPLGGCGWTFTTSYKLKRHLQSHDKLRPFGCPAEGCGKSFTTVYNLKAHMKGHEQENSF

KCEVCEESFPTQAKLGAHQRSHFEPERPYQCAFSGCKKTFITVSALFSHNRAHFREQELF

SCSFPGCSKQYDKACRLKIHLRSHTGERPFLCDFDGCGWNFTSMSKLLRHKRKHDDDRRF

MCPVEGCGKSFTRAEHLKGHSITHLGTKPFVCPVAGCCARFSARSSLYIHSKKHLQDVDT

WKSRCPISSCNKLFTSKHSMKTHMVKRHKVGQDLLAQLEAANSLTPSSELTSQRQNDLSD

AEIVSLFSDVPDSTSAALLDTALVNSGILTIDVASVSSTLAGHLPANNNNSVGQAVDPPS

LMATSDPPQSLDTSLFFGTAATGFQQSSLNMDEVSSVSVGPLGSLDSLAMKNSSPEPQAL

TPSSKLTVDTDTLTPSSTLCENSVSELLTPAKAEWSVHPNSDFFGQEGETQFGFPNAAGN

HGSQKERNLITVTGSSFLV
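For the LONG_DESCRIPTION layout, the name is everything up to the first space and the rest is the description. The Python sketch below splits a shortened copy of the header above and truncates the description for display. The 60-character width and the "..." marker are assumptions chosen for illustration; the actual truncation applied in the ProteinLynx display is not specified here.

```python
def parse_long_description_header(line: str) -> dict:
    """Split a '>NAME DESCRIPTION' line at the first space."""
    body = line.strip()
    if body.startswith(">"):
        body = body[1:]
    name, _, description = body.partition(" ")
    return {"name": name, "description": description.strip()}


def truncate_for_display(text: str, width: int = 60) -> str:
    """Shorten text to at most `width` characters (hypothetical width)."""
    if len(text) <= width:
        return text
    return text[: width - 3].rstrip() + "..."


# Shortened copy of the example header, for illustration only.
long_header = (
    ">gp:AL034396_1 PID:5441319 Human DNA sequence from clone 1158B12 "
    "on chromosome Xp11.21-11.4 Contains the ZXDA gene for X-linked "
    "duplicated Zinc finger A"
)
long_fields = parse_long_description_header(long_header)
shown = truncate_for_display(long_fields["description"])
print(long_fields["name"])
print(shown)
```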

ADDITIONAL INFORMATION

id201722, isopropanol, SUPPLGS