Instructions for Handling SAM Output
Introduction
At this point the data should be analyzed, and one has a list of significant antigens to display in a way that makes interpreting the data manageable. Before displaying the data in a visually intuitive format, we need to interpret the statistical output and organize the information for the cluster/treeview programs.
Save Your Sheets
As mentioned in the previous topic (see Howto Analyze mArray Data), SAM creates two new Excel sheets entitled SAM Plot and SAM Output. Once SAM has created these sheets, any future analysis will overwrite their contents. If one is going to perform multiple analyses within a single Excel file, one must copy these two sheets to avoid losing them. It is also a good idea to print out the antigen list and graph, because future analyses tend to alter the graphical output. At a minimum, one needs to copy or rename the SAM Output sheet. To do this, right click on the SAM Output sheet tab and select Rename or Move or Copy…. If one selects the Move or Copy… option, be sure to check the Create a copy box in the window that appears.
Collect the Significant Antigens
Decide how many antigens to cluster based on their significance (q-value (%)). Remember that the q-value is the lowest false discovery rate at which the antigen is called significant, so it estimates the chance that the reported antigen is a false positive. The most common cut-off is <5%; however, <10% or even <15% can be used as well.
Highlight all antigens with a q-value below the desired cut-off. For the example below, I’ve chosen a cut-off of q-value <5%. Make sure to include the Row, Gene Name, … q-value (%) columns in the selection (see below).
Select Edit, Copy (Ctrl+C)
Select File, New… (Ctrl+N) – this file will be called Spreadsheet C for reference
Select the Sheet 2 tab, select cell A1 and select Edit, Paste (Ctrl+V)
Return to Spreadsheet SAM and select the sheet of the recently analyzed data. Highlight the complete data set including the antigen and sample names (see below). In this case, the complete set starts at A1 and extends to L30.
Select Edit, Copy (Ctrl+C)
Return to Spreadsheet C, Sheet 1
Right click on cell A1 and select Edit, Paste Special…
Select Values, and select OK
Highlight columns B, C, & D
Right click and select Insert
Select Spreadsheet C, Sheet 2 and highlight the antigens for cluster[1]
Select Edit, Copy (Ctrl+C)
Return to Spreadsheet C, Sheet 1
Right click on cell B3 and select Edit, Paste (Ctrl+V)
Enter the following formula in cell C3:[2] =MATCH(A3,B$3:B$16,0)
Select cell C3
Select Edit, Copy (Ctrl+C)
Highlight cells C3 – Cn
Select Edit, Paste (Ctrl+V)
Highlight cells C3 – Cn
Select Edit, Copy (Ctrl+C)
Select cell D3
Select Edit, Paste Special…
Select Values, and select OK
Highlight cells A3 – MN (where M is the final data column and N is the final data row)
Select Data, Sort…
Select Column D, check Ascending, check No Header Row, and click OK
Delete all rows after the last number in column D (that would be row 17 – row XX in this example)
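For readers who would rather script this selection than do it by hand, the steps above can be sketched in Python. This is only an illustration under stated assumptions: the function name select_significant, the input layout (lists of rows with the antigen name first) and the choice to return rows in the order of the significant list are all hypothetical and not part of SAM or Excel.

```python
def select_significant(sam_rows, data_rows, q_cutoff=5.0):
    """Return the data rows whose antigen passed the q-value cut-off.

    sam_rows:  (antigen_name, q_value_percent) pairs from the SAM Output sheet
    data_rows: [antigen_name, value1, value2, ...] lists from the data sheet
    Both layouts are assumptions made for this sketch.
    """
    # Keep only the antigens below the chosen cut-off (q-value in percent).
    significant = [name for name, q in sam_rows if q < q_cutoff]

    # Same idea as the spreadsheet match step: the 1-based position of each
    # antigen in the significant list; the first occurrence wins.
    position = {}
    for index, name in enumerate(significant, start=1):
        position.setdefault(name, index)

    # Antigens without a match (the #N/A rows in the spreadsheet) are dropped.
    kept = [row for row in data_rows if row[0] in position]

    # Order the surviving rows by their match position.
    kept.sort(key=lambda row: position[row[0]])
    return kept


sam = [("AgB", 1.2), ("AgA", 4.9), ("AgC", 7.5)]
data = [["AgA", 1.0, 2.0], ["AgB", 3.0, 4.0], ["AgC", 5.0, 6.0]]
selected = select_significant(sam, data)  # AgB and AgA pass the 5% cut-off
```

Raising q_cutoff to 10.0 or 15.0 corresponds to the looser cut-offs mentioned earlier.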
The Cluster Format
Select column B, right click within the highlighted region and select Delete
Select row 2, right click within the highlighted region and select Insert
Note that spelling and location matter in the following steps.
Select cell A1
Type Unique ID
Select cell A2
Type EWEIGHT
Select cell A3
Type EORDER
Select cell B1
Type GWEIGHT
Select cell C1
Type GORDER
Leave cells B2 – C3 empty
Add the number 1[3] to every cell beneath the Sample/Slide names in the EWEIGHT and EORDER rows (that would be cells D2 – N3 in this example)
Add the number 1 to every cell next to an Antigen name in the GWEIGHT and GORDER columns (that would be cells B4 – C17 in this example)
The final format should look like the image shown below.
At this point, one has completed the minimal format requirements to proceed. Congratulations!
Select File, Save As
Save the file in Excel (*.xls) format. It’s a good idea to identify the file as for_cluster.
Select File, Save As
Save the file in text (Tab delimited)(*.txt) format.
Excel will complain about multiple sheets. Click OK. Excel will then complain that the file contains features that are not compatible with the text format. Click Yes.
Select File, Close (or Exit)[4]
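The layout built in the steps above can also be produced with a short script. The sketch below is a minimal illustration, assuming the antigen names and values are already in memory; the function names to_cluster_rows and to_tab_delimited are hypothetical, and every weight and order cell is filled with 1 as in the manual procedure.

```python
import csv
import io


def to_cluster_rows(sample_names, antigen_rows):
    """Arrange the data in the layout described above, with all weights set to 1.

    sample_names: list of Sample/Slide names
    antigen_rows: (antigen_name, list_of_values) pairs, one per antigen
    """
    n = len(sample_names)
    rows = [
        # Row 1: the header labels, followed by the sample names.
        ["Unique ID", "GWEIGHT", "GORDER"] + list(sample_names),
        # Rows 2 and 3: the two cells under GWEIGHT and GORDER stay empty.
        ["EWEIGHT", "", ""] + [1] * n,
        ["EORDER", "", ""] + [1] * n,
    ]
    for name, values in antigen_rows:
        if len(values) != n:
            raise ValueError(f"{name}: expected {n} values, got {len(values)}")
        # Each antigen row carries a GWEIGHT and a GORDER of 1.
        rows.append([name, 1, 1] + list(values))
    return rows


def to_tab_delimited(rows):
    """Serialize the rows as tab-delimited text."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


text = to_tab_delimited(to_cluster_rows(["S1", "S2"], [("AgA", [0.5, 1.5])]))
```

Writing text to a .txt file gives a tab-delimited file of the same shape as the one saved from Excel.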
Adding values other than ones to intentionally bias the data
There are times when one wants to group samples or antigens together, either because the unbiased cluster did not do so or because one wants to highlight something about the data. Remember that cluster groups items based on their similarities or differences, much like a family or evolutionary tree. Unbiased grouping assumes no prior knowledge of similarities in the data, so the program uses the data alone to guess at them. As a result, cluster may guess at a family tree that is not satisfactory for displaying and interpreting the results. That is good reason to put values in the *WEIGHT columns or rows to influence how the data is clustered. In fact, one cannot consider oneself an expert at displaying data clearly until one has mastered the fine art, and it is definitely an art, of selecting weight values to influence clustering.
So, with that long-winded introduction, I shall not delay any longer in unveiling the secret art of clustering for superb images.
Values in the EORDER row influence the order of the sample list. The values can range from 1 – 10,000, where a value of 10,000 generally moves a sample to the left and a value of 1 moves it to the right.
Values in the GORDER column influence the order of the antigen list. The values can range from 0.1 – 10,000, where a value of 10,000 generally moves an antigen to the top and a value of 0.1 moves it to the bottom.
Values in the EWEIGHT row influence the order of the antigen list based on a particular sample. The values can range from 1 – 10,000, where a value of 10,000 generally moves antigens to the top and a value of 1 moves them to the bottom.
Values in the GWEIGHT column influence the order of the samples based on a particular antigen. The values can range from 0.1 – 10,000, where a value of 10,000 generally moves samples to the left and a value of 0.1 moves them to the right.
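When many files need the same bias, the weight and order cells can be filled programmatically. The following is a minimal sketch, assuming the ranges quoted above; the helper name build_weights and its parameters are hypothetical. Every cell defaults to 1 (no bias) and only the named samples or antigens receive another value.

```python
def build_weights(names, overrides=None, low=1.0, high=10000.0):
    """Return one weight (or order) value per name, defaulting to 1.

    names:     sample names (for EWEIGHT/EORDER) or antigen names
               (for GWEIGHT/GORDER)
    overrides: optional {name: value} mapping for the items to bias
    low, high: allowed range; pass low=0.1 for GWEIGHT and GORDER
    """
    overrides = overrides or {}

    # Catch misspelled names early, since spelling matters in this format.
    unknown = sorted(set(overrides) - set(names))
    if unknown:
        raise KeyError(f"not in the list of names: {unknown}")

    weights = []
    for name in names:
        value = overrides.get(name, 1)
        if not low <= value <= high:
            raise ValueError(f"{name}: {value} is outside {low} - {high}")
        weights.append(value)
    return weights


# Bias sample S3 with the largest EORDER value; the others keep a value of 1.
eorder = build_weights(["S1", "S2", "S3"], {"S3": 10000})
```

The resulting list goes into the EORDER row beneath the sample names; how strongly a given value shifts the display still has to be judged from the resulting image.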
[1] Some lists have many (>70) significant antigens. Since we like to show antigen lists of ~20 dominant antigens, it becomes necessary to choose antigens based on q-value scores and numerator values. Decide on a numerator threshold that will achieve this list size.
[2] Note that B$16 is used in this example because the antigen list spans cells B3 to B16. A different list might be longer or shorter, so in the general case B$16 should be B$n. The match function compares the text in cell A[i] with the text in cells B3 – B16. If an exact match is found, then cell C[i] will contain the location within the list B3 – B16 (a number between 1 and 14). If an exact match is not found, then cell C[i] will contain #N/A.
[3] One can also put values other than one in these rows and columns. A value of one in the row or column sets the weight of the node to one and means this node has no bias. Values other than one will bias the node to group into specific clusters. Note that biasing the clustering does not change the data in any way; it simply changes how the data is presented. See the section entitled Adding Values Other than Ones…, later in this document, for an explanation of biasing the data to influence the grouping of either antigens or samples.
[4] It’s essential to close the text file so you can open it with the cluster program.
Brian A. Kidd © 2004