*Evaluating various normalization methods*

andreas.goteson@gu.se

UPDATED: 191024

**Normalization** has been a constant issue in this project. It turns out that referencing only a control channel in each set does not cancel out experimental bias.

This notebook evaluates three different normalization methods:

- **NOMAD**: a package on GitHub developed by Carl Murie. It was actually developed for iTRAQ, but the normalization should be applicable to any multiplexed MS data. There is also a published paper describing the package.
- **nonorm**: just a log2 transformation and median centering of all samples.
- **jhl_ag**: short for Jessica Holmén Larsson & Andreas Göteson. We ran a lot of simulations with various ways to normalize but ended up just adding a slight tweak to nonorm, as described in the flowchart.
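For concreteness, the nonorm step is simple enough to sketch in a few lines. This is a Python illustration with made-up numbers (the project's actual scripts are in R, and `nonorm` here is my own illustrative function, not the project's code):

```python
import numpy as np

def nonorm(intensities):
    """'nonorm'-style normalization: log2-transform, then subtract each
    sample's (column's) median so all samples are centered at 0."""
    log2 = np.log2(intensities)
    return log2 - np.median(log2, axis=0)  # median-center each sample

# Toy example: two samples (columns) differing by a constant 2x loading factor.
raw = np.array([[100., 200.],
                [400., 800.],
                [1600., 3200.]])
centered = nonorm(raw)

# After median centering, the constant offset between samples is gone.
print(np.allclose(centered[:, 0], centered[:, 1]))  # True
```

Median centering removes any sample-wide multiplicative factor (additive on the log scale), which is exactly the kind of loading/labeling difference it targets; it cannot touch peptide-specific biases.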

**Cohorts**: We have run 4 cohorts with TMT-MS (Fusion). Here, we'll look at three of them. The fourth is SBP_ML endogenous peptides (and has not been searched).

Cohort | file | fractionated? | #detected peptides | #sets | #samples
---|---|---|---|---|---
SBP Mölndal | ml | 12 fractions | 17022 | 16 | 144
SBP Stockholm | sthlm | No | 7418 | 38 | 342
SBP ADHD | adhd | 6 fractions | 12018 | 16 | 144

Detailed script is available at `~/Documents/MS_normalization/NOMAD_peptideNormalization.R`


Just a brief data overview before normalization


Some MS files have had one or two major outliers, typically blood-contaminated samples (assessed by high signals for blood-derived proteins). Here I reduce the data with PCA and plot a model overview together with each sample's score distance from the model.
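The outlier-flagging idea can be sketched numerically: fit a PCA, then rank samples by their scaled distance from the model center in score space. A minimal Python sketch on simulated data (the sample count, peptide count, and contamination effect are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 30 samples x 200 peptides, with one gross outlier
# (mimicking a blood-contaminated sample with inflated signal).
X = rng.normal(0, 1, size=(30, 200))
X[0] += 6.0

scores = PCA(n_components=2).fit_transform(X)

# Score distance from the model center, with each component scaled by its
# standard deviation (a simple Hotelling-T2-style statistic).
sd = np.sqrt(((scores / scores.std(axis=0, ddof=1)) ** 2).sum(axis=1))
print(int(np.argmax(sd)))  # index of the most extreme sample
```

A contaminated sample dominates the leading component, so its score distance stands far above the rest of the cohort.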


There is one similar outlier in sthlm, but no clear outliers in ml.

**SUMMARY**: I removed F117 127N from adhd and F20 127N from sthlm

The NOMAD package also includes a neat little plot to evaluate whether log2 transformation is necessary. *The ANOVA approach assumes a constant (homogeneous) variance across the peptide signal range for the normalization to be valid*. So this (artsy) plot simply displays the variation across the peptide signal (mean values); from my understanding it should be roughly a flat line, indicating that variance does not increase at high signals (high mean values). Let's look at the SBP datasets!

By the way, this is only relevant for peptides; I don't know why they added proteins to the plots.
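The flat-line criterion can also be checked numerically rather than visually: correlate per-peptide means with per-peptide variances before and after log2. A Python sketch on simulated data (the multiplicative-noise model and all numbers are made up for illustration):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
# Hypothetical peptide intensities with multiplicative noise, so the
# raw-scale variance grows with the mean (heteroscedastic).
true_level = rng.uniform(1e3, 1e6, size=2000)              # per-peptide level
raw = true_level[:, None] * rng.lognormal(0, 0.3, size=(2000, 10))

def mean_var_trend(mat):
    """Spearman correlation between per-peptide mean and variance."""
    return spearmanr(mat.mean(axis=1), mat.var(axis=1, ddof=1))[0]

print(mean_var_trend(raw) > 0.9)                 # strong mean-variance trend
print(abs(mean_var_trend(np.log2(raw))) < 0.2)   # roughly flat after log2
```

A correlation near zero after log2 corresponds to the "roughly flat line" the NOMAD plot is looking for.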

**sthlm**


SBP_ML looks similar, but take a look at adhd: it looks weird!

**adhd**


It's clear that the raw data have increasing variance at high signals. Log2-transforming the data makes it a bit fuzzy, but I'll stick with the log2 data.

I don't know what's going on **in adhd**.

Now let's compare how the methods perform in various ways. But first, let's retrieve the calls to see exactly how the NOMAD package was used:

**Peptide normalization**


**Protein assembly**


I used the same call for all datasets!

The NOMAD package includes a simple p-value histogram of ANOVA tests across the proteins (yep, proteins!). Let's look at the plots:
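The idea behind those p-histograms can be sketched: run a one-way ANOVA per protein against an experimental factor (e.g. TMT set) and look at the p-value distribution; residual batch bias piles p-values up near zero, while a bias-free factor gives a roughly uniform histogram. A Python sketch on simulated data (protein counts, set counts, and the injected set effect are all invented):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
n_prot, n_sets, per_set = 500, 4, 10

# Hypothetical protein matrix with a fixed per-set (batch) offset baked in,
# mimicking an uncorrected Run/TMTset effect.
set_labels = np.repeat(np.arange(n_sets), per_set)
set_effect = np.array([0.0, 1.0, -1.0, 0.5])
X = rng.normal(0, 1, size=(n_prot, n_sets * per_set)) + set_effect[set_labels]

# One-way ANOVA per protein: does abundance differ between sets?
pvals = np.array([
    f_oneway(*(row[set_labels == s] for s in range(n_sets))).pvalue
    for row in X
])

# With a real set bias, most proteins test significant.
print((pvals < 0.05).mean())
```

This is what makes the histogram a quick batch-effect diagnostic: the shape near zero tells you whether the factor still explains protein-level variation after normalization.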

**sthlm**


**ml**


**adhd**


So, a general theme seems to be that normalization (NOMAD & jhl_ag) cancels bias in iTRAQ/TMTch but **enriches bias in Run/TMTset**. How can that be?

Let's plot the scores by Run to see if we can identify the Run bias.


**SUMMARY**: jhl_ag moves the distribution just a touch towards 0, while NOMAD completely redistributes the data.

With NOMAD's protocol, no referencing to a control channel is made. So, there is one reference sample in each set that should be identical if normalization were flawless. Let's look at it: here I plot the CV across all assembled proteins, `CV = sd(x)/mean(x)`.
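The CV computation itself is just the formula above applied per protein across the reference channels. A minimal Python sketch with simulated reference-channel data (the protein count, set count, and noise level are invented):

```python
import numpy as np

def cv(x, axis=1):
    """Coefficient of variation, CV = sd(x) / mean(x), per protein."""
    return x.std(axis=axis, ddof=1) / x.mean(axis=axis)

rng = np.random.default_rng(3)
# Hypothetical reference-channel intensities: one reference sample per
# TMT set (columns), proteins in rows. Flawless normalization -> CV near 0.
ref = rng.normal(1000, 50, size=(300, 16))   # 300 proteins, 16 sets

cvs = cv(ref)
print(round(float(np.median(cvs)), 2))  # about 0.05 (= 50 / 1000)
```

A higher median CV after a given normalization means the supposedly identical reference samples still disagree, which is exactly the residual bias this check is meant to expose.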


With unsupervised dimensionality reduction techniques we can estimate structural effects across the dataset. Here I'm using three different algorithms: i) PCA; ii) t-SNE; iii) UMAP.

You could potentially define the number of clusters in an unsupervised way via Monte Carlo methods, but I won't get into that, since the statistics might get in the way of the message.
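To make the batch-structure check concrete, here is a minimal Python sketch of the first two algorithms on simulated data with an injected run effect (the sample counts, feature counts, and shift size are invented; UMAP would follow the same `fit_transform` pattern via the `umap-learn` package):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)
# Hypothetical protein matrix: 60 samples from two "runs" with a batch shift.
run = np.repeat([0, 1], 30)
X = rng.normal(0, 1, size=(60, 100)) + 2.0 * run[:, None]

# i) PCA: linear; a run effect of this size dominates PC1.
pc = PCA(n_components=2).fit_transform(X)
sep = abs(pc[run == 0, 0].mean() - pc[run == 1, 0].mean())
print(sep > 5)  # the runs separate clearly along PC1

# ii) t-SNE: nonlinear; perplexity must be smaller than the sample count.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)
```

If run labels dominate the embedding instead of biology, the normalization has not removed the experimental structure.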


It is probably not possible to remove the experimental bias in silico. This is a major limitation of this method.

But our best estimate would be: normalize peptides according to the jhl_ag protocol and assemble proteins with NOMAD.

- NOMAD normalization settings?
- NOMAD protein assembly settings?
- The PCA model in sthlm has only one component - why? Does it map to something, e.g. Run?

I don't understand the matrix they build up. Is this similar to the TMT mass-variance conversion tables we got? And isn't that input and adjusted for in PD 2.2? I think so, so I assumed the current peptide quantification was already adjusted for variance in TMT mass. But are these two things the same?

From my understanding: PSMs are quantified into Quan spectra, which are summarized into peptides with abundances. So why are there more Quan spectra than PSMs? And which Abundance should I work with? And does 'peptide *groups*' simply mean that many PSMs are grouped into one peptide?

- Which algorithm should be used to assemble proteins?
- How should duplicate peptides be handled? (The median seems intuitive, but the manual proposes including both independently.)