|
Graphical Abstract
|
|
FIG. 1. Peptide measurements are analogous to coin-flips: A, Peptides are labeled with isobaric tags encoding the two conditions, case and control (red and blue). Shown are examples of two peptides with identical relative abundance (true ratio) between case and control. B, After ionization and fragmentation, the relative signal of the fragments produced by the isobaric tag can be used to quantify relative peptide abundance. These quantification spectra contain information not only about the relative peptide abundance but also about the MS-signal, which is proportional to the number of ions. C, Estimating the posterior probability of the “true peptide fraction” θ becomes identical to estimating a coin’s fairness from a given number of head and tail measurements. The probability of the “true peptide fraction” θ is a beta distribution with shape parameters α and β, where α and β represent the number of ions in the two channels. D, Intuitively, the fewer ions we measure, the more the measured peptide ratio tends to deviate from the true ratio between case and control. A higher ion-count (top row) results in a tighter probability distribution than a low ion-count (bottom row). For more than two cases, this approach can be generalized with a Dirichlet distribution.
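The coin-flip analogy in panel C can be illustrated with a short sketch. It assumes, as the caption states, that the ion counts in the two channels serve directly as the two beta shape parameters. The function name `beta_credible_interval`, the Monte Carlo approach, and the example ion counts are illustrative assumptions, not the authors' implementation.

```python
import random

def beta_credible_interval(ions_case, ions_control, level=0.95, draws=20000, seed=0):
    """Monte Carlo credible interval for the true peptide fraction.

    Assumption: the posterior is Beta(ions_case, ions_control), i.e. the
    ion counts are used directly as shape parameters (both must be > 0).
    """
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(ions_case, ions_control) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper

# Hypothetical ion counts: same 1:1 ratio, very different ion totals.
high_lo, high_hi = beta_credible_interval(500, 500)
low_lo, low_hi = beta_credible_interval(5, 5)
print("high ion-count interval width:", high_hi - high_lo)
print("low ion-count interval width: ", low_hi - low_lo)
```

Both intervals are centered near 0.5, but the one built from more ions is much narrower, which is the behavior panel D describes.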
|
|
FIG. 2. Conversion of MS-signal into counts and assigning confidence to the measurement of a single peptide. A, We generated a sample in which all peptides are labeled with two different TMT-tags and mixed at an identical ~1:1 ratio. When we plot the observed peptide ratio in one channel versus the summed MS-signal in both channels, measurements with higher MS-signal converge on the true underlying fraction (dashed line). B, Assuming ion-statistics is the dominant noise source, we can fit the coefficients of variation (CVs) and obtain the conversion factor from MS-signal to the number of ions, or pseudo-counts. The data shown were obtained on an Orbitrap Lumos at 50K mass resolution. Our best estimate for the conversion factor is 2.0. C, Probability distributions of the “true peptide fraction” for three examples, color-coded to correspond to the three peptide data points in sub-figure (A). D, Histogram of the upper and lower bound values of the 95% confidence intervals. The true answer lies outside the 95% confidence interval for 2.0% and 2.5% of peptides because of over- and under-estimation, respectively. These percentages are roughly symmetric and in good agreement with the expected total of 5%.
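The idea behind panel B can be sketched as follows. If the ion count equals a conversion factor times the summed MS-signal and the noise is purely binomial, then the variance of the observed fraction is p(1 - p) divided by the ion count, so the squared deviation scaled by the signal has expectation p(1 - p) divided by the conversion factor. The moment estimator below is a simplified stand-in for the CV fit described in the caption. The function names and the simulated data are assumptions for illustration only.

```python
import random

def simulate_equal_mix(n_peptides=2000, conversion=2.0, seed=1):
    """Simulate a 1:1 two-channel mix with pure binomial (ion-counting) noise."""
    rng = random.Random(seed)
    fractions, signals = [], []
    for _ in range(n_peptides):
        signal = rng.randint(10, 200)            # hypothetical summed MS-signal
        n_ions = int(round(conversion * signal))  # ions = conversion * signal
        ions_a = sum(rng.random() < 0.5 for _ in range(n_ions))
        fractions.append(ions_a / n_ions)
        signals.append(signal)
    return fractions, signals

def estimate_conversion_factor(fractions, signals, true_fraction=0.5):
    """Moment estimate of ions per unit MS-signal.

    Uses E[(f - p)^2 * signal] = p * (1 - p) / conversion under binomial noise.
    """
    scaled = [(f - true_fraction) ** 2 * s for f, s in zip(fractions, signals)]
    mean_scaled = sum(scaled) / len(scaled)
    return true_fraction * (1.0 - true_fraction) / mean_scaled

fractions, signals = simulate_equal_mix()
print("estimated conversion factor:", estimate_conversion_factor(fractions, signals))
```

On the simulated data the estimate should land close to the conversion factor used to generate it. Real data would include other noise sources, so a fit restricted to the ion-statistics regime would be needed.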
|
|
FIG. 3. Considering only ion-statistics does not produce accurate confidence intervals at the protein level: A, To evaluate the confidence intervals of peptides from the same protein, we revisited our previously published experiment, in which we measured the localization of proteins between nucleus and cytoplasm in the frog oocyte. B, Blue discs show 50 measured peptides (RNC and MS-signal) assigned to the Ribosomal Protein L5 (RPL5). We show the beta posterior probabilities for two extreme peptides (leftmost in blue and rightmost in green). Note that the probability distributions of these two peptides are essentially mutually exclusive, i.e. even the most generous confidence intervals would exclude each other. Additionally, we show the distribution obtained by summing all the peptides together (magenta), which corresponds to unjustifiably tight confidence. This example illustrates that, when expressing confidence intervals at the protein level, we cannot assume that ion-statistics is the only source of measurement error in proteomics experiments. Rather, we have to integrate other sources of error, e.g. differences in sample handling.
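The overconfidence from pooling can be made concrete with a small sketch. It compares the standard deviation of a beta distribution built from the summed ion counts with the spread of the individual peptide fractions. The ion counts are made-up numbers chosen for illustration, not data from the experiment, and `beta_sd` is a hypothetical helper.

```python
import math

def beta_sd(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    total = a + b
    return math.sqrt(a * b / (total ** 2 * (total + 1)))

# Hypothetical (nuclear, cytoplasmic) ion counts for five peptides of one protein.
peptides = [(120, 380), (300, 300), (90, 210), (410, 390), (150, 450)]

# Pooling: sum all ions as if they came from one measurement.
pooled_a = sum(a for a, _ in peptides)
pooled_b = sum(b for _, b in peptides)
pooled_sd = beta_sd(pooled_a, pooled_b)

# Peptide-to-peptide disagreement: sample standard deviation of the fractions.
fractions = [a / (a + b) for a, b in peptides]
mean_fraction = sum(fractions) / len(fractions)
spread = math.sqrt(
    sum((f - mean_fraction) ** 2 for f in fractions) / (len(fractions) - 1)
)

print("pooled beta sd:       ", pooled_sd)
print("peptide fraction sd:  ", spread)
```

With these numbers the pooled distribution is far narrower than the disagreement among peptides, which is why a model with an additional peptide-level variance term is needed.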
|
|
FIG. 4. Schematic of the data generating process for modeling confidence for proteins with multiple peptide measurements. A, The “true protein ratio” can be distorted because of differences in sample handling (e.g. digestion and isobaric-labeling), giving rise to multiple peptides with differing “true peptide ratios”. The peptide ratios for the constituent peptides of a given protein are sampled from a probability distribution parameterized according to the “true protein ratio”. B, Each peptide is measured by the mass spectrometer. The confidence in quantification varies with the number of ions used to measure each peptide. The observed data are sampled, for each peptide separately, from a probability distribution given the true peptide ratio. C, The goal is to infer the underlying true protein ratio between the conditions and to generate confidence using the agreement between multiple peptide measurements and their respective MS-signals.
|
|
FIG. 5. Mathematical model for estimating the protein fraction and its confidence: The entire data generation process can be adequately described with the Beta-Binomial process (or the Dirichlet-Multinomial process for more than two cases). To draw the parallel with Figure 4, we consider the two-condition, three-peptide case. We assume an underlying beta distribution with mean μ and precision κ representing the probability distribution of the true protein fraction, from which the peptide fractions θi for the constituent peptides of a given protein are sampled. Given a true underlying peptide fraction θi, we can sample the number of ions in a channel, αi, from a binomial distribution. Each peptide is independently sampled from its respective binomial distribution with true peptide fraction θi and total number of ions ni.
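The two-level sampling described above can be sketched as a forward simulation. The sketch assumes the common mean-precision parameterization of the beta distribution, with shape parameters mean × precision and (1 - mean) × precision. The caption does not spell out this parameterization, so treat it as an assumption. The function name and the example values are hypothetical.

```python
import random

def simulate_protein(mean_fraction, precision, ion_totals, seed=0):
    """Forward-simulate the Beta-Binomial process for one protein.

    Assumption: Beta shape parameters are mean * precision and
    (1 - mean) * precision (mean-precision parameterization).
    Returns one (peptide_fraction, ions_in_channel, total_ions) per peptide.
    """
    rng = random.Random(seed)
    shape_a = mean_fraction * precision
    shape_b = (1.0 - mean_fraction) * precision
    peptides = []
    for total_ions in ion_totals:
        # Level 1: peptide fraction scattered around the protein fraction.
        peptide_fraction = rng.betavariate(shape_a, shape_b)
        # Level 2: binomial ion counting given that peptide fraction.
        ions_in_channel = sum(
            rng.random() < peptide_fraction for _ in range(total_ions)
        )
        peptides.append((peptide_fraction, ions_in_channel, total_ions))
    return peptides

# Two conditions, three peptides, hypothetical ion totals.
for fraction, ions, total in simulate_protein(0.3, 50.0, [40, 400, 4000]):
    print(f"peptide fraction={fraction:.3f}  ions={ions}/{total}")
```

Inference runs this process in reverse: given the observed ion counts, it estimates the protein-level mean and precision. That step requires a sampler or numerical integration and is not shown here.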
|
|
FIG. 6. Validating our method with a differential expression experiment. A, Six samples were prepared by mixing material from two species as follows. Six identical human samples (i.e. proportions across 6 channels were 1.0: 1.0: 1.0: 1.0: 1.0: 1.0) were mixed with an E. coli sample in two sets of three as shown (i.e. proportions across 6 channels were 1.0: 1.0: 1.0: 1.2: 1.2: 1.2). A mixture of peptides from the two proteomes was analyzed by LC-MS. B, Comparison of our method with a one-tailed t test for detecting significantly changing proteins. BACIQ can detect statistically significant changes without replicates, whereas the t test requires replicates. The ROC plot indicates that our method is superior to the t test when the same number of replicates is used. Even without replicates, our method (red) performs nearly as well as the t test with two replicates (dashed yellow). We achieve close to perfect detection of the significantly changing proteins by using the BACIQ analysis with three replicates (blue). C, A comparison of our method (with pooled variance across proteins) with compMS. BACIQ outperforms compMS, a method that ignores ion statistics in assigning confidence.
|
|
FIG. 7. Re-analysis of subcellular movement on Exportin-1 inhibition with BACIQ. A, On inhibition of Exportin-1 with Leptomycin B (LMB), we expect Exportin-1 substrates to move toward the nucleus. We compared the RNC in control and drug-treated samples to identify the proteins that confidently shift toward the nucleus regardless of their initial nucleocytoplasmic distribution. B, A scatterplot indicating the shift in RNC after LMB treatment. Most proteins seem unaffected by the treatment. Proteins above the diagonal move toward the nucleus and those below the diagonal move toward the cytoplasm. C, The raw peptide data and probability distributions of RNC for three different proteins (shown as discs in B). The blue curve represents the probability distribution of RNC before adding the drug and the orange curve represents the probability distribution of RNC after adding the drug. With our approach, we can detect movement toward the nucleus of as little as 1%. D, Applied to the entire dataset, BACIQ detects 750 putative Exportin-1 substrates (612 unique gene symbols) at a 5% false discovery rate (FDR). At identical FDR, BACIQ identifies ~2× more proteins as significantly moving than our previously published naive analysis. E, Venn diagram showing the overlap in unique gene symbols between the LMB responders at 5% FDR from this study, the cargo database based on Exportin-1 affinity experiments (Kirli et al.), and a database curated from the literature (NESDB). The overlap with both databases is highly significant, with p values of 5.5e-29 (Kirli et al.) and 3.1e-6 (NESDB) based on the hypergeometric test.
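The hypergeometric test used for the overlaps in panel E can be sketched with the standard library alone. The function below computes the probability of seeing at least the observed overlap by chance. The function name is hypothetical and the set sizes in the example are made-up placeholders, not the counts from this study.

```python
from math import comb

def hypergeom_overlap_pvalue(population, annotated, selected, overlap):
    """P(overlap >= observed) when `selected` items are drawn without
    replacement from `population` items, of which `annotated` are in the
    reference database."""
    denominator = comb(population, selected)
    upper = min(annotated, selected)
    numerator = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return numerator / denominator

# Hypothetical sizes: 8000 quantified genes, 300 in a reference database,
# 600 called as responders, 60 shared between the two sets.
print(hypergeom_overlap_pvalue(8000, 300, 600, 60))
```

The choice of background population (here, all quantified genes) strongly affects the p value, so it should match the set of proteins that could have been detected in the experiment.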
|