Using Simultaneous Spectrum Analysis

This file describes the process of creating data files for Simultaneous Spectrum Analysis, installing the required software, and running the example code. The output will be tables of peaks in a tab-delimitated text format. This format should be appropriate for input to most analysis packages (due to the structure of R tables, it may be necessary to manually insert a tab as the first character of some files). The R function also returns an R list object that can be manipulated directly in the R environment. All R code through version 1.2 was and will be developed using R version 1.9.1. It has been tested without incident in the R environment version 2.0.1.

Installing R

Download the R installation from http://www.r-project.org/, and install R using the instructions provided. (use http://cran.stat.ucla.edu/banner.shtml to go directly to an R download page mirrored in the USA)


Exporting data from CiphergenExpress Data Manager 2.1 (CDM)

Select the spectra that you want to export. We suggest resetting the normalization coefficients, removing filtering, and disabling baseline subtraction before exporting any spectra. You may, however, want to apply the Spectrum Alignment tool (check first for any bad spectra because bad spectra often have rather implausible changes made to the m/z axis). Go to the Spectrum menu and select "Export Spectra." Choose file type ".csv" and check only the box next to "Processed data." Click the export button and select a directory to hold the spectrum files.

Next create a metadata file: the metadata file will be three columns separated by tabs. The first column must contain the file names of spectra without a directory path and without a file extension (all file names must be unique), the second column must contain the sample name for each spectrum (replicates should have the same name), and the third column must contain a class identifier for that spectrum (there is no limit to the number of different class identifiers). To produce a metadata file from CiphergenExpress Data Manager, copy the contents of the "Name" column into the first column of an empty spreadsheet. Return to CDM and select all the cells in the column with sample identifiers (do not click on any column headings because that will change the order in which spectra are displayed), and copy them as the second column in the spreadsheet. Last, copy the column in CDM that contains group identifiers (usually the Group column) and paste those cells as the third column in the spreadsheet. Save the spreadsheet as tab-delimitated text. This is the metadata file.

Note: Additional columns are allowed to appear in the metadata file. Names for these columns should be given in the "additional.metadata" parameter in the call to getPeaks (for example, if the metadata contains extra columns for patient weight and height the code would read "additional.metadata = c("Weight","Height")" ). If any of the extra columns are named "Find.Peaks" the code to treat that column in the metadata file as a bit-mask specifying which spectra should be used to determine peak positions. If a Find.Peaks column is included an entry of "1" will cause that spectrum to be used in determining peaks, any other value will cause it to be excluded (this is useful if you have control spectra included with your experimental set, and you want to find peak areas on the control spectra without having them affect the peak finding). Peak magnitudes will still be calculated for spectra that are not included in peak-finding, and all spectra will still be used to determine normalization coefficients.

 

Running Simultaneous Spectrum Analysis

Download the R source code, R example code, C binary for R (Windows 95 or later, tested on Windows 2000, XP). For other operating systems, download the C source code and compile to an appropriate format for your environment (see the Writing R Extensions manual for instructions on how to do this). Download the sample data from our website and extract it to a folder.

Open the file example_R.R in a text editor. There are five lines that need to be modified, each is marked with "# <--". Change "C:\\Source code path\\Peaks_R_1.2.R" to the path of the source file Peaks_R_1.2.R, change "C:\\C_DLL_path\\Peaks_R.dll" to the location of the C DLL (note that all backslash characters must be given as two backslashes, \\ instead of \), change "C:\\input_directory" to the directory containing the example data, change "C:\\metadata.txt" to the path of the metadata file (meta.txt is the metadata included in the example data), and change "C:\\output_directory" to the directory where output files should be generated (If you receive a "file not found error", the first thing to check is that backslashes are \\). Save changes to the file before starting the next step

Open the R environment and use the File menu "Source R code" option to load the file example_R.R (R environments under operating systems other than Windows may require you to use the "source" command at the prompt). A warning message "DLL attempted to change FPU control word from 8001f to 9001f" may appear, this is not a problem. On a 2.6 GHz machine running Windows 2000, it takes less than five minutes to process the 90 spectra with a digitizer rate of 250 MHz going up to m/z 75,000. If R crashes, check that the metadata is correct.

The example code also performs a two-group Student's t test on peaks that pass the F-test with replicates averaged. This code appears immediately below the command that begins "peaks <- getPeaks(". The groups to be used are specified as "status1" and "status2". These are given as numbers, where numbers correspond to groups in the metadata file sorted in alphabetical order (the sample has groups Het, KO, WT, corresponding to 1, 2, 3). Only peaks with a p-value smaller than "p.threshold" are printed.

 

Setting analysis parameters

Using default parameters as provided in the example_R.1.2.1.R file will detect and quantify peaks using the settings as used in the paper. Parameters listed here are those that are likely to change. A complete list of parameters appears in the code comments.

min.peak.widths (optional) - sets the minimum spacing between peaks as a fraction of the current m/z (so 0.01 means minimum spacing of 1%). This can be given as a numeric vector, as in c(0.004, 0.008), when different peak widths are to be applied to different m/z ranges. The default is set for a Ciphergen PBS-IIC SELDI-TOF-MS using fractionated plasma with SPA (all default settings are for this instrument).

max.peak.widths (optional) - sets the largest peak width that will be tested when determining peak edges. Specified in the same manner as min.peak.widths.

m.z.regions (optional) - specifies where m/z regions start and end for each of the peak widths that was given. For example, c(3000, 18000) applies the first entry in each peak width vector to the m/z region from 3000 to 18000, and the second entry in each width vector to the range from 18000 to the end.

m.z.step (default = 0.5) - gives the m/z data resolution used when calculating peaks. Allocating the array of uniform m/z spectra is a major memory bottleneck. If you are having memory problems then try increasing the m.z.step so that fewer points per spectrum will be calculated.

F.test.threshold (default = 0.95) - this parameter sets the minimum confidence that peaks' magnitude exceeds their noise. Peaks with lower confidence on the F-test are not included in output files specified by output.path.screened or output.path.averaged.

chi.sq.threshold (default = 0.99) - specifies the confidence needed to exclude a bad spectrum. Lowing the value will result in a larger number of excluded spectra.

additional.metadata (required if metadata column present) - a character vector giving names to each column after the 3rd that appears in the metadata file.

Memory.Conserve (default = FALSE) - a Boolean value that if TRUE instructs the software to load spectra only on demand. This will greatly decrease the RAM required for most analyses and it may be necessary for large datasets. The downside is that it takes longer because each spectrum has to be loaded several times.

Peaks.Show (default = TRUE) - if this flag is TRUE then R will generate a plot showing the average spectrum and the central positions of each peak. Use the peakPlotMinMZ and peakPlotMaxMZ parameters to change the bounds of the plot.

Tips for determining parameters:

Look at spectra by eye and measure the distance from the center to the right edge. The fractional value of distance divided by center should be between the min and max peak spacing values. If the fractional value changes depending on m/z then divide the spectra into region using the m.z.regions parameter.

Run SSA with Peaks.Show equal to TRUE and check that peak positions are not off-center. If they are then try reducing min.peak.widths. Off-center peaks can result in large negative values (edge-finding performs poorly on misplaced peaks).

Also using Peaks.Show check to see if many obvious peaks are not being found. Try lowering the value of F.test.threshold if there are peaks being missed.

The default value of m.z.step is appropriate for peaks that are wider than about 6 m/z units. If your experiment gives sharper peaks then lower the m.z.step to be no more than 1/10 of the smallest peak width that you are interested in.

If you do not want spectra removed because the are implausible then set chi.sq.threshold to 1. This is useful for experiments where spectra need to form complete sequences or in cases where one or more samples are of a different type than the rest.

When the dataset is small (fewer than around 25 samples run in duplicate in our experience) the F-test looses its ability to distinguish real peaks from noise with high confidence. If many obvious peaks are being rejected as irreproducible then reduce the F-test threshold. Best results are obtained with 50+ samples run in duplicate.

Output

The peaks measured by SSA are compiled into output files given by the parameters "output.path.screened", "output.path.all", "output.path.averaged", and "output.path.peak.bounds". The output file given by "output.path.screened" will contain normalized peak magnitudes for every peak and spectrum that passed the F-test and Chi-squared test respectively. The output file "output.path.all" will contain the normalized peak magnitudes for every peak (including those that failed the F-test) from every spectrum that passed the Chi-squared test. The output file "output.path.averaged" will contain normalized peak magnitudes for peaks and spectra that passed their respective tests averaged by sample. The output file "output.path.peak.bounds" will be a table of three columns with the edges and center of every peak that was detected, as well as a column with TRUE/FALSE denoting whether that peak passed the F-test, and a column F.Confidence containing the confidence score that the peak received from the F-test. Note: the standard R output format for tables does not include a padding cell in the upper left corner, so the column labels are shifted one column left when these files are viewed as spreadsheets.

The R function creates output files and returns a list object with results of the analysis. Elements of the list are:

return.code - a numeric scalar containing a zero if the function returned correctly, a zero if the normalization failed for any reason.

spectrum.average (version 1.2) - a numeric vector containing the average spectrum on which peaks are detected. The names of the vector elements contain the m/z for each point.

peak.mask - a Boolean vector with length equal to the number of peaks detected before F-test screening. Specifies if each peak passed the F-test with the specified confidence.

F.confidence (version 1.2)- a numeric vector containing the F-test confidence values that each peak is more reproducible than expected by chance.

spectrum.mask - a Boolean vector with length equal to the number of spectra. Specifies whether each spectrum passed the chi-squared plausibility test.

exclude.confidence - numeric vector length one entry per spectrum, giving the p-value of the chi squared test that each spectrum is "bad."

peaks.normalized - an n x m matrix where n is the number of peaks, m is the number of spectra. Peak values are given after EM-normalization and every peak is included regardless of its performance on the F-test. Each entry corresponds to the magnitude that peak in each spectrum (technically the AUC divided by peak width).

peaks.screened - similar to peaks.normalized, but including only peaks that passed the F-test with given confidence.

peaks.averaged - an i x k matrix where i is the number of peaks passing the F-test and k is the number of individuals in the experiment (not the number of spectra). Replicate measurements of individuals have been averaged.

identities.screened - Vector of identities for columns (spectra) retained in of peaks.normalized and peaks.screened

identities.averaged - Vector of identities for columns (samples) appearing in peaks.averaged

scale.factors - Numeric vector of constants by which each spectrum was multiplied during normalization. These values should be checked for bias between groups.

peaks.bounds - A data frame with three columns, giving the center, left width, and right width of every peak

status - A vector containing group memberships for spectra in the order that they appear in the output matrices.

status.averaged - A vector containing group memberships for columns of averaged replicates in the order that they appear in the output matrix with averaged replicates.

metadata - names, the second has group membership, and additional columns are as specified.