What does Fasta file randomization do, and how does it affect search results? - WKB1211

Article number: 1211

ENVIRONMENT

PLGS
Progenesis QI for Proteomics

ANSWER

When performing MSe searches in PLGS, you have two options for randomizing the Fasta sequence database to create the dummy (decoy) entries used for false discovery rate estimation:

A - Create a pre-randomized databank using the Randomize button in the databank library manager, and use the resulting Fasta databank in the workflow, or

B - Use the original Fasta databank in the workflow, in which case iadbs.exe will create the dummy sequences "on the fly".

The ion accounting database search in Progenesis QI for Proteomics uses option B. However, if you generate a Fasta file containing randomized sequences using PLGS, you can then specify that Fasta databank in the Progenesis QI for Proteomics search options.

If you compare the search results produced using A and B, they can be quite different. The obvious questions are: why are they different, and which one is right?

1 - When you press the Randomize button in the PLGS databank library manager, it calculates the percentage of each amino acid in the original Fasta file, and then it creates random sequences with the same lengths as the input proteins and the same amino acid distribution. It then adds those random sequences to the list of original proteins and saves the new Fasta file.

2 - When you run a search with a non-randomized Fasta file, the iadbs executable creates randomized entries "on the fly" during the search process using exactly the same process as above, except it doesn't save the file.

3 - It's possible to get different results because the two randomizations create similar but not identical random proteins. The different randomized proteins can match to different accurate mass retention times (AMRTs) in the data. Because the search works as a series of iterative depletions, AMRTs that are matched to randomized entries are no longer available to match to real proteins. That is why one of the searches may appear to miss certain hits. If the searches match different real peptides and different proteins in the first pass, the subset database used in the second pass will also be different. As a result, the effect can snowball and produce rather large differences between the two sets of search results.

4 - To minimize the differences between searches with pre-randomized and "original" Fasta files, increase the false discovery rate to 100%. Even then, you may still see differences.

5 - It isn't possible to say one search result is more right than the other, and the additional hits you may see in the search results using one approach over the other are genuine matches.
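
The randomization described in points 1 to 3 can be illustrated with a short sketch. This is not the PLGS or iadbs.exe implementation; it only assumes what the answer above states: the amino acid percentages are calculated over the whole Fasta file, and one random sequence of the same length is generated per input protein and added to the original list. The function names, the "RANDOM_" prefix, the example sequences, and the seed values are all hypothetical and are used only to stand in for two separate randomization runs.

```python
import random
from collections import Counter


def amino_acid_frequencies(proteins):
    """Fraction of each residue across all sequences in the databank."""
    counts = Counter()
    for seq in proteins.values():
        counts.update(seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def make_decoys(proteins, seed):
    """One random sequence per protein: same length, residues drawn
    from the databank-wide amino acid distribution."""
    rng = random.Random(seed)  # hypothetical seed, one per randomization run
    freqs = amino_acid_frequencies(proteins)
    residues = sorted(freqs)
    weights = [freqs[aa] for aa in residues]
    decoys = {}
    for name, seq in proteins.items():
        # "RANDOM_" is an illustrative prefix, not the real PLGS naming
        decoys["RANDOM_" + name] = "".join(
            rng.choices(residues, weights=weights, k=len(seq))
        )
    return decoys


def append_decoys(proteins, seed):
    """Original proteins plus their randomized counterparts."""
    combined = dict(proteins)
    combined.update(make_decoys(proteins, seed))
    return combined


# Made-up example sequences
proteins = {
    "PROT1": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "PROT2": "MSDNGELEDKPPAPPVRMSSTIFSTGGKDPLSANHSL",
}

prerandomized = append_decoys(proteins, seed=1)  # stands in for option A
on_the_fly = append_decoys(proteins, seed=2)     # stands in for option B

for name in sorted(prerandomized):
    if name.startswith("RANDOM_"):
        print(name, prerandomized[name], on_the_fly[name])
```

Both runs keep the real proteins unchanged and produce random entries of matching lengths, but the random sequences themselves differ. Those differing entries are what can match different AMRTs and start the divergence described in point 3.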

ADDITIONAL INFORMATION

id1211, SUPPLGS