Evaluating various normalization methods
Normalization has been a constant issue in this project. It turns out that referencing to a control channel in each set alone does not cancel out experimental bias.
This notebook evaluates various normalization methods.
Cohorts: We have run four cohorts with TMT-MS (Fusion). Here, we'll look at three of them. The fourth is SBP_ML endogenous peptides (which have not been searched yet).
|SBP Mölndal|ml|12 fractions|17022|16|144|
|SBP ADHD|adhd|6 fractions|12018|16|144|
Detailed script is available at
Some MS files have had one or two major outliers, typically blood-contaminated samples (assessed by high scores on blood-based proteins). Here I reduce the data with PCA and plot a model overview together with each sample's score distance from the model.
There is one similar outlier in sthlm, but no clear outliers in ml.
SUMMARY: I removed F117 127N from adhd and F20 127N from sthlm
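The outlier screen above can be sketched in a few lines. This is a minimal stand-in for the actual script, not the original code: I assume a sample-by-protein matrix, run PCA via SVD, and rank samples by their standardized score distance from the model centre. All names and the toy data are illustrative.

```python
import numpy as np

def score_distances(X, n_comp=2):
    """PCA via SVD on the centred sample-by-protein matrix; return each
    sample's standardized distance from the model centre in score space."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_comp] * S[:n_comp]       # sample scores on the first PCs
    z = scores / scores.std(axis=0, ddof=1)   # standardize per component
    return np.sqrt((z ** 2).sum(axis=1))

# toy example: 10 samples x 50 proteins, sample 3 "blood contaminated"
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))
X[3] += 8.0                                   # strong systematic shift
d = score_distances(X)
print(np.argmax(d))                           # sample 3 sticks out
```

A sample like F117 127N would show up as the one with by far the largest distance, which is what motivated dropping it.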
The NOMAD package also includes a neat little plot to evaluate whether log2-transformation is necessary. The ANOVA approach assumes a constant (homogeneous) variance across the peptide signal range for the normalization to be valid. So this (artsy) plot just displays the variation across the peptide signal (mean values), and from my understanding it should be roughly a flat line, indicating that variance does not increase at high signals (high mean values). Let's look at the SBP datasets!
Btw, this is only relevant for peptides; I don't know why they added proteins to the plots.
SBP_ML looks similar but take a look at adhd - it looks weird!
It's clear that the raw data have increasing variance at high signals. Log2-transforming the data makes it a bit fuzzy, but... I'll just stick with log2 data.
I'm not sure what's going on in adhd.
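For reference, here is a rough numeric stand-in for that mean-variance check (not NOMAD's implementation): bin peptides by mean signal and compare the spread of within-peptide residuals per bin, raw vs log2. A flat profile across bins supports the constant-variance assumption. The data and function name are made up for illustration.

```python
import numpy as np

def spread_by_signal(X, n_bins=5):
    """X: peptides x channels. SD of within-peptide residuals per bin,
    with bins ordered by each peptide's mean signal."""
    means = X.mean(axis=1)
    resid = X - means[:, None]
    order = np.argsort(means)
    bins = np.array_split(order, n_bins)
    return np.array([resid[b].std() for b in bins])

# intensity-like toy data: multiplicative noise, so raw variance grows with signal
rng = np.random.default_rng(1)
raw = rng.lognormal(mean=8, sigma=1, size=(1000, 10))
print(spread_by_signal(raw))           # SD climbs across bins
print(spread_by_signal(np.log2(raw)))  # roughly flat after log2
```

On data like these the raw profile rises steeply while the log2 profile is essentially flat, which is the pattern the NOMAD plot is meant to reveal.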
Now let's compare how the methods perform in various ways. But first just retrieve the calls to see exactly how the NOMAD package was used:
I used the same call for all datasets!
The NOMAD package includes a simple p-histogram of ANOVA tests across the proteins (yep - proteins!). Let's look at the plots:
So, a general theme seems to be that normalization (NOMAD & jhl_ag) cancels bias in iTRAQ/TMTch but enriches bias in Run/TMTset. How can that be?
Let's just plot the scores by Run to see if we can identify the Run bias.
SUMMARY: jhl_ag moves the distribution just a smidge towards 0, while NOMAD completely redistributes the data.
With NOMAD's protocol, no referencing to a control channel is made. However, there is one reference sample in each set, which should be identical across sets if normalization were flawless. Let's look at it - here I plot the CV across all assembled proteins.
CV = sd(x)/mean(x)
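In code, the reference-channel check is just a row-wise CV over a proteins-by-sets matrix of reference abundances; lower CVs mean better normalization. The shapes and noise levels below are illustrative, not our data.

```python
import numpy as np

def protein_cv(ref):
    """ref: proteins x sets matrix of reference-channel abundances.
    CV = sd(x) / mean(x), computed per protein (row-wise)."""
    return ref.std(axis=1, ddof=1) / ref.mean(axis=1)

rng = np.random.default_rng(3)
good = rng.normal(100, 5, size=(200, 16))    # well-normalized: ~5% CV
bad = rng.normal(100, 25, size=(200, 16))    # residual set-to-set bias: ~25% CV
print(protein_cv(good).mean(), protein_cv(bad).mean())
```

Comparing the CV distributions from the two normalization protocols this way makes the ranking between them concrete.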
With unsupervised dimensionality reduction techniques we can estimate structural effects across the dataset. Here I'm using three different algorithms: i) PCA; ii) t-SNE; iii) umap.
You could potentially define the number of clusters in an unsupervised way via Monte Carlo, but I won't get into that, since the statistics might get in the way of the message.
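To make the structure check concrete, here is the PCA variant on simulated data (t-SNE and umap would slot in the same way via `sklearn.manifold.TSNE` and `umap.UMAP`): I fabricate a TMT-set batch effect and check that it dominates the first principal component. Everything here is a toy assumption, not our cohort data.

```python
import numpy as np

def pca_scores(X, n_comp=2):
    """Sample scores on the first principal components, via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_comp] * S[:n_comp]

rng = np.random.default_rng(4)
sets = np.repeat([0, 1], 24)                     # two TMT sets, 24 samples each
X = rng.normal(size=(48, 300))                   # 48 samples x 300 proteins
X += np.outer(sets, rng.normal(0, 1, 300)) * 3   # set-specific protein offsets

pc1 = pca_scores(X)[:, 0]
# if the set effect dominates, the two sets separate cleanly on PC1
print(pc1[sets == 0].mean(), pc1[sets == 1].mean())
```

If normalization worked, colouring such a score plot by TMT set should show no separation; persistent clustering by set is exactly the enriched Run/TMTset bias noted above.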
It is probably not possible to fully remove experimental bias in silico. This is a major limitation of the method.
But our best estimate would be: normalize peptides according to the jhl_ag protocol and assemble to proteins with NOMAD.
I don't understand the matrix they build up. Is it similar to the TMT mass-variance conversion tables we got? And isn't that input and adjusted for in PD2.2? I think so, so I assumed the current peptide quantification was already adjusted for variance in TMT mass. But are these two things the same?
From my understanding: PSMs are quantified as Quan spectra, which are summarized into peptides with abundances. So why are there more Quan spectra than PSMs? Which Abundance should I work with? And does 'peptide groups' simply mean that many PSMs are grouped into one peptide?