
Utilizing Rim Coloring for Complex Data in GenMAPP

Because of microarray probe design, it is not uncommon for several probes to map to the same gene on a pathway. When multiple rows of data (for example, probes) provide conflicting data for a particular gene, GenMAPP applies a set of rules to determine how the gene is colored, to facilitate interpretation. We can take advantage of these rules to display complex data.

GenMAPP coloring rules

The GenMAPP gene box is made up of two distinct sections, a center and a rim. In the simple case where only one row of data links to a particular gene, both of these sections will be colored the same, but if there are multiple rows of data for a gene these sections can be used to display this discrepancy. There are two main rules:

Mode

If several rows of data link to a gene and contribute different colors, the predominating color, or mode, is used for the center of the gene, and the second color is used for the rim. The edge of the gene also appears dashed. For example, if 3 rows of data link to gene A, two of which contribute red coloring and one of which contributes blue coloring, the gene will be red with a blue rim and a dashed edge.

Tie

If several rows of data link to a gene, contribute different colors, and there is a tie between the colors, the row of data that appears first in the dataset (i.e., the original spreadsheet) colors the center of the gene and the second color is used for the rim. For example, if 2 rows of data link to gene A, the first of which contributes blue coloring and the second of which contributes red coloring, the center of the gene box will be blue and the rim will be red with a dashed border.
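The two rules can be sketched in a few lines of Python. This is only an illustration of the rules as described above, not GenMAPP's own code; the function name gene_box_colors is hypothetical. The input is the list of colors contributed by each row, in dataset order. Behavior for three or more distinct colors is not covered by the rules above, so the ranking used here for that case is an assumption.

```python
from collections import Counter

def gene_box_colors(row_colors):
    """Return (center, rim, dashed) for one gene.

    row_colors: colors contributed by each linked row, in dataset order.
    Hypothetical helper illustrating the Mode and Tie rules only.
    """
    if not row_colors:
        return None, None, False

    counts = Counter(row_colors)
    if len(counts) == 1:
        # All rows agree: center and rim share one color, solid edge.
        return row_colors[0], row_colors[0], False

    # Remember where each color first appears, for tie-breaking.
    first_seen = {}
    for index, color in enumerate(row_colors):
        first_seen.setdefault(color, index)

    # Mode rule: most frequent color first.
    # Tie rule: among equally frequent colors, the earliest row wins.
    ranked = sorted(counts, key=lambda c: (-counts[c], first_seen[c]))
    return ranked[0], ranked[1], True

print(gene_box_colors(["red", "blue", "red"]))  # Mode example from the text
print(gene_box_colors(["blue", "red"]))         # Tie example from the text
```

The strategy described in the next section relies on the Tie rule: with exactly two rows per gene, row order alone decides which color goes to the center and which to the rim.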

Strategy for utilizing rim coloring for complex data

With the above rules for coloring in mind, we can format our data to take advantage of the rules to display complex data using the center and the rim of the gene box. The strategy is to include each gene twice on separate rows. For one gene ID, each of these two rows will contain information about a specific data type. This will allow us to simultaneously view the two different data types in GenMAPP using the rim for the second row of data.

The alternative, and more common, approach would be to include the second data type ONLY as a separate column in the data for each row, rather than on a new row for each gene, and then use the striped gene feature in GenMAPP to view both types of data. The disadvantage of this approach is that the display can become messy when stripes are also used for each of the two data types. The figure below illustrates the difference between the two strategies.

A: Gene A is represented twice in the dataset, with two types of associated data. The first row for gene A contains real data for the Fold data type, but contains false data for the Splice p-value data type. The second row for gene A is opposite; it contains real data for the Splice p-value and false data for Fold. This data organization will result in the center of the gene coloring based on Fold and the rim coloring based on Splice p-value. The false data in this case can be any numeric that will not pass the criteria you plan to establish for Fold and Splice p-value. For example, if the criteria for the p-value will be "p-value < 0.05", entering a very large number for the p-value for the first row of data will work.

B: Similar to A, this strategy can also be used with the striped gene feature, using several columns of data for coloring the center of the gene.
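As a rough sketch of strategy A, the following Python builds the two rows for one gene. The function name, the column names and the two placeholder constants are hypothetical. The placeholder for Fold assumes a criterion that a value of 0 cannot pass (for example "Fold > 2" or "Fold < -2"); choose placeholders that fail the criteria you actually plan to use.

```python
# Placeholder ("false") values; both are assumptions for illustration.
FALSE_FOLD = 0.0        # assumed to fail criteria such as Fold > 2 or Fold < -2
FALSE_P_VALUE = 999.0   # very large, so it cannot pass "p-value < 0.05"

def rim_coloring_rows(gene_id, fold, splice_p_value):
    """Return the two dataset rows for one gene, in the order required.

    Row 1 carries the real Fold value and colors the center.
    Row 2 carries the real Splice p-value and colors the rim.
    """
    center_row = {"ID": gene_id, "Fold": fold, "Splice p-value": FALSE_P_VALUE}
    rim_row = {"ID": gene_id, "Fold": FALSE_FOLD, "Splice p-value": splice_p_value}
    return [center_row, rim_row]

for row in rim_coloring_rows("GeneA", 2.5, 0.01):
    print(row)
```

Because each row can pass only one of the two criteria, the two rows always tie, and the first row determines the center color.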

Preparing the data

GenMAPP analysis requires that the raw data be pre-processed into a form that GenMAPP can use. Pre-processing typically includes steps such as background adjustment, normalization and probe-level summarization, but since each experiment is unique it is not possible to recommend exactly which pre-processing should be done. The instructions below therefore list a set of typical pre-processing steps; they do not represent a solution for all datasets.

Example Data

For the purpose of these instructions, an example dataset is used. The data is from Affymetrix and profiles 11 adult human tissues across all known and predicted exons. From this data, both expression values and alternative splicing scores were generated.

Background adjustment, normalization and probe-level summarization

Background adjustment and normalization algorithms are typically included in array image processing applications, so these algorithms may be applied to your data by the core lab or facility that processes the arrays. Similarly, summarizing the data at the level of probes (for Affymetrix arrays) is commonly done at this stage, before the data reaches the end user. Since the many available algorithms all affect the data differently, consulting a statistician about your specific dataset is advised.

Example data: Expression values were summarized at the probe set level using the ExACT 1.0 software provided by Affymetrix using quantile normalization and sketch summarization.

Filtering

Although pre-filtering before GenMAPP analysis is not recommended for typical gene expression datasets, it is sometimes necessary for more complex datasets to reduce complexity and the number of measurements.

Example data: Associated DABG or “Detection Above Background” (a metric for comparing perfect-match probes to the distribution of background probes) p values were generated for all probe sets to determine the likelihood of expression. Probe sets were aligned to genes and exons based on the genomic coordinates provided for each probe set in the Affymetrix design-time annotation files (genome build 35) and in Ensembl. Probe sets that did not align to an Ensembl gene-encoding genomic locus were excluded from the analysis. The remaining probe sets were annotated according to the exon structures provided by Ensembl. Constitutive exons were identified from the Affymetrix annotation files (most over-represented exons in mRNAs or expressed sequence tags). To eliminate non-optimal hybridization results, probe sets with a DABG p value < 0.001 in fewer than 9% of all samples were filtered out.
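The final filtering step of the example can be sketched as follows. This is a minimal illustration, not the code used to produce the example dataset; the function and parameter names are hypothetical, and the cutoffs are the ones quoted above.

```python
def passes_dabg_filter(dabg_p_values, p_cutoff=0.001, min_fraction=0.09):
    """Decide whether one probe set is kept.

    dabg_p_values: one DABG p value per sample for this probe set.
    The probe set is kept when at least min_fraction of the samples
    have a DABG p value below p_cutoff.
    """
    if not dabg_p_values:
        raise ValueError("at least one sample is required")
    detected = sum(1 for p in dabg_p_values if p < p_cutoff)
    return detected / len(dabg_p_values) >= min_fraction

# With 11 samples, a single detected sample (1/11, about 9.1%) is enough.
print(passes_dabg_filter([0.0005] + [0.5] * 10))
print(passes_dabg_filter([0.5] * 11))
```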

Calculate metrics

Any type of metric or parameter can be used to color genes in GenMAPP, including text-based parameters. Calculating the metrics can be done in several ways, for example programmatically, in a database program or most commonly in Excel.

Example data: For the probe sets remaining after filtering, expression levels for those associated with constitutive exons were averaged per gene to obtain a gene expression intensity value. Expression values for non-constitutive probe sets and the summarized gene expression values were used to determine the likelihood of splicing using the MiDAS algorithm through the Affymetrix Power Tools application. For each gene measured, there are therefore two distinct metrics: the MiDAS p-value and the expression value.
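The per-gene averaging of constitutive probe sets could look like the sketch below. The input layout (gene ID, constitutive flag, expression value) and the function name are assumptions made for illustration; the MiDAS calculation itself is done in Affymetrix Power Tools and is not reproduced here.

```python
from collections import defaultdict
from statistics import mean

def gene_expression(probe_sets):
    """Average constitutive probe-set expression per gene.

    probe_sets: iterable of (gene_id, is_constitutive, expression) tuples.
    Returns a dict mapping gene_id to its mean constitutive expression.
    """
    values_by_gene = defaultdict(list)
    for gene_id, is_constitutive, expression in probe_sets:
        if is_constitutive:
            values_by_gene[gene_id].append(expression)
    return {gene: mean(values) for gene, values in values_by_gene.items()}

example = [
    ("G1", True, 5.0),
    ("G1", True, 7.0),
    ("G1", False, 100.0),  # non-constitutive: ignored for the gene-level value
]
print(gene_expression(example))
```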

Combining all data in one spreadsheet

To use GenMAPP, the complete dataset must be contained in one file. If the data is not immediately available in this summary format, all relevant files must be combined. Most often this means combining separate data files containing data for individual arrays into one file, but it can also mean combining different types of data.

Example data: To take advantage of GenMAPP's coloring rules, simply combining all measurements into one spreadsheet is not sufficient. For the example data we have two distinct metrics for each gene, the MiDAS p-value and the expression value. To utilize rim coloring for one of these metrics, the order of data in the input file is crucial; the first set of measurements will dictate coloring of the center of each gene box, while the second set of measurements will dictate rim coloring. To accomplish this, the dataset is constructed according to the figure below:

Format data

Before import to GenMAPP, the data needs to be formatted according to GenMAPP specifications. Briefly, this includes adding a System Code column containing a system code for each entry and organizing the columns to have a GenMAPP supported ID in the first column and the System Code as the second column. For details on how to do this, see the Expression Dataset Manager.

Example data: A System Code column was inserted as the second column, and filled with the System Code for Ensembl (En).
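As an illustration of this formatting step, the sketch below places the ID in the first column and the System Code in the second, and writes tab-delimited text. The function name, the "System Code" header text and the choice of tab-delimited output are assumptions; check the Expression Dataset Manager topic for the exact header and file format GenMAPP expects.

```python
import csv
import io

def format_for_genmapp(rows, system_code="En"):
    """Return tab-delimited text with ID first and System Code second.

    rows: list of dicts that each have an "ID" key plus data columns.
    system_code: "En" is the Ensembl code used in the example data.
    """
    data_columns = [name for name in rows[0] if name != "ID"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "System Code"] + data_columns)
    for row in rows:
        writer.writerow([row["ID"], system_code] + [row[c] for c in data_columns])
    return buffer.getvalue()

print(format_for_genmapp([{"ID": "ENSG1", "Fold": 2.5}]))
```

Row order is preserved, so the ordering that determines center and rim coloring is not disturbed by this step.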

Creating a GenMAPP dataset

Importing the data

Once the data is properly formatted, it can be imported to GenMAPP via the Expression Dataset Manager:

  1. Download and load the appropriate database in GenMAPP.
  2. In the Expression Dataset Manager, select File>New to begin the data import process. For details on data import, please refer to the Expression Dataset Manager.

Creating Color Sets

To create Color Sets for your dataset, use the Criteria Builder in the Expression Dataset Manager. To utilize rim coloring in GenMAPP, the Color Set should include criteria for coloring the central part of the gene box (expression in this case) AND criteria for rim coloring (splice score in this case). If multiple Color Sets for similar data are also used, there are some additional considerations for creating Color Sets:

Example data: A Color Set is created for each tissue, and contains criteria for the expression values as well as criteria for the splicing scores. The coloring and numerical cutoffs are the same for each Color Set and tissue.

Viewing the data

When the dataset contains the Color Sets you want to use for simultaneous display, selecting them for display is straightforward.

  1. First, load the appropriate database in GenMAPP, and open a pathway of interest.
  2. In the Color Sets drop-down list, select the Multiple Color Sets option.
  3. In the Multiple Color Sets window, select the relevant Color Sets for display with Ctrl+click, or click the All button to select all Color Sets. For example, if you have a time course experiment with one Color Set for each time point, you should select all time points for display. For detailed instructions on how to select multiple Color Sets, see the Drafting Board Toolbar.

Example data: All Color Sets were selected for display, resulting in one stripe for each tissue. For each stripe the center of the gene box represents the expression value for that tissue and the rim represents alternative splicing:

Note: Depending on how many Color Sets you choose for display and how complex these are, coloring the pathway may take a few seconds.

Legend

For the striped gene view, the Legend shows which Color Set is displayed in which stripe of the gene, and what each color represents. The coloring legend can be displayed for the first Color Set only or for each Color Set. This behavior is controlled in the Options menu. For readability, it may be convenient to display the Legend for only the first Color Set if many Color Sets are selected for display and they all share a similar color scheme. For details, please refer to the Legend.