% CompClustTk Tutorial
% Copyright (c) California Institute of Technology
%
% Authors: Brandon King
% $Revision: 1.12 $
% Modified $Date: 2005/04/14 19:03:35 $

\documentclass{article}
\usepackage{graphicx}
\usepackage{color}
\usepackage[margin=3cm,noheadfoot]{geometry}
\usepackage{apalike}
%\usepackage{makeidx}

\setlength{\parindent}{0in}
\setlength{\parskip}{2mm}
%\setlength{\textwidth}{6.5in}

%colors 
%\definecolor{darkblue}{rgb}{0,0,1}
\newcommand{\cb}{\color{blue}}

%\makeindex

%new commands
%\newcommand{\pymlink}{http://pymerase.sf.net}
\newcommand{\CompClustGui}{CompClustTk}
%document
\begin{document}
\bibliographystyle{apalike}

%title page

\begin{titlepage}
\title{\cb {\CompClustGui}\\Manual \& Tutorial}
\author{Brandon King \& Diane Trout\\
Copyright \copyright  California Institute of Technology}
%FIXME: Increment version before check-in.
\date{Version 0.2.0\\\today}
\maketitle
\thispagestyle{empty}
\end{titlepage}

\tableofcontents
\thispagestyle{empty}
\newpage
\setcounter{page}{1}

\section{\cb Introduction}

%\subsection{\cb Document Information}
%\subsubsection{\cb Status}
%\begin{tabular}{ll}
%\bf Section & \bf Status\\
%Intro - Purpose & Partially Written\\
%Intro - CompClust Background & Partially Written\\
%Intro - Where to get help? & To Partially Written\\
%Tutorial Background - Microarray & Mostly complete\\
%Tutorial Background - Datasets & Mostly complete\\
%Tutorial Background - Labelings & Mostly complete\\
%Tutorial Background - Cho Data & Mostly complete\\
%CompClustTk - Intro & Partially Written\\
%CompClustTk - Load Data Set & Mostly complete\\
%CompClustTk - Attach Labelings & Mostly complete\\
%CompClustTk - Clustering & Partially Written\\
%Analysis - Intro & Partially Written\\
%Analysis - Trajectory Summary & Mostly complete\\
%Analysis - Confusion Matrix & Mostly complete\\
%Analysis - ROC Analysis & To Be Written\\
%Analysis - PCA Explorer & To Be Written\\

%CompClustWeb - Intro & Partially Written\\
%CompClustWeb - Load Data Set & Mostly complete\\
%CompClustWeb - Attach Labelings & Mostly complete\\
%CompClustWeb - Clustering & Partially Written\\

%Advanced Analysis & To Be Written\\
%More of CompClust & Mostly complete\\
%%Trouble Shooting & To Be Written
%\end{tabular}
%
%\subsubsection{\cb What's New?}
%\begin{tabular}{ll}
%\bf Version & \bf What's New?\\
%v0.1.12 & Added analysis shell section
%v0.1.8  & Added references\\
%        & Tutorial Background - Cho Data is now mostly complete\\
%        & More of CompClust is now mostly complete\\
%        & \\
%v0.1.7  & Intro - 'Where to get help?' is now partially written\\
%        & Tutorial Background - 'Data Sets' is now mostly complete\\
%        & Tutorial Background - 'Labelings' is now mostly complete\\
%        & \\
%v0.1.6  & Document Information Section\\
%        & Minor Updates\\
%\end{tabular}

\subsection{\cb Purpose}
%FIXME: Needs a lot of work... Any thoughts Chris?

CompClust provides utilities designed around the needs of biologists
who explore various types of array data, such as microarray or ChIP
array data.

CompClust focuses on gaining a more quantitative and qualitative
understanding of clustering results and the relationships between
them.  The software was initially developed as a set of Python modules
designed to provide a convenient environment for interactive use from
the Python interpreter and for automating analyses as Python
scripts.  The generality of these modules has proved useful in
many applications within our lab for bioinformatics and genomic
analysis.  Although the most powerful and flexible way to use
CompClust is directly via the Python command line interface, we have
constructed CompClustTk and CompClustWeb as more familiar graphical
user interfaces (GUIs) to many of the most useful analysis tools.  The
GUIs focus on the analysis tools for comparative cluster analysis as
described in \cite{Hart_etal04}.

This tutorial contains general introductory information for CompClust
as well as specific information on how to use {\CompClustGui}.  Other
tutorials and documentation have been written as an introduction to
using CompClust via the python interpreter directly.


\subsection{\cb CompClust Background}
%FIXME: Should contain a brief history and purpose of
%compClust/pyMLX/IPlot and how they tie into the development of
%CompClustTk.

%FIXME-START: This is copied from the main web site and possibly
%should aim for a higher level overview/description, but is probably
%better than nothing.
CompClust is a Python package written using the pyMLX and IPlot APIs. It
provides software tools to explore and quantify relationships between
clustering results. Its development has been driven largely by the
requirements of microarray data analysis, but it can easily be used for other
types of biological array data and for data from other scientific domains.

Briefly, pyMLX provides efficient and convenient execution of many clustering
algorithms using an extendable library of algorithms. It also provides
many-to-many linkages between data features and annotations (such as cluster
labels, gene names, gene ontology information, etc.). These linkages
persist through user data manipulation. IPlot provides an abstraction of the
plotting process in which any arbitrary feature or derived feature of the data
can be projected onto any feature of the plot, including the X,Y coordinates of
points, marker symbol, marker size, marker/line color, etc. These plots are
intrinsically linked to the dataset, the View and the Labeling classes found
within pyMLX.

%FIXME-END: This is copied from the main web site and possibly should aim for a
%higher level overview/description, but is probably better than nothing.

\subsection{\cb Where to get help?}
If you need help with {\CompClustGui} or CompClust visit
http://woldlab.caltech.edu/compClust/. 

%FIXME: Add more information on who to contact?
Or you may send e-mail to Chris Hart (hartATcaltech.edu) or Brandon
King (kingbATcaltech.edu).

\subsection{\cb What's New}
To make the Trajectory Summary and Confusion Matrix plots
more consistent, the ``Plot All'' button on the Trajectory Summary
plot was removed. Left-clicking on either a Trajectory
Summary plot or a Confusion Matrix cell now brings up the appropriate
detail plot.

\newpage
\section{\cb Tutorial Background}
\subsection{\cb Microarray Data}

Users unfamiliar with microarrays in general are encouraged to familiarize
themselves with the biology and technology behind the data.  For this
tutorial we assume the microarray data are gene expression measurements across
several different conditions (e.g.\ different time points, different cell types,
different treatments, etc.).   The data need to be formatted into a gene
expression data matrix, where each row represents a gene and each column
represents a different condition.  A row in that matrix is a gene
expression vector, i.e.\ the expression measurements for that gene across every
assayed condition.

% FIXME: Look up current reviews, or web references

\subsection{\cb CompClust Datasets}
\label{CompClust Datasets}
CompClust Datasets contain vectors of data, which can be loaded into
{\CompClustGui} through a simple tab delimited text format.

Take microarray data for example. Suppose you have done an experiment with
four time points: hours 1, 2, 4, and 8. The columns of your data set
are the time points and the rows are the individual genes (see below).

\begin{tabular}{llll}
\bf \# Hour 1 & \bf Hour 2 & \bf Hour 4 & \bf Hour 8\\
0.72 & 0.56 & 0.32 & 0.06\\
0.01 & 0.15 & 0.80 & 0.73\\
0.97 & 0.95 & 0.91 & 0.94
\end{tabular}

Loading the above data set into {\CompClustGui} and then running a Trajectory
Summary plot (see section \ref{trajectorysummary}), you would see that the
first gene's trajectory (vector) starts high and decreases over the four time
points. Formatting your data into a tab delimited format similar to the one
shown above will allow you to load your own data sets into {\CompClustGui}.
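
As a rough sketch of what such a file might look like on disk, the table
above could be stored as shown below. This assumes that the header is a
single line starting with \texttt{\#}, as the table suggests, and that
columns are separated by a single tab character (rendered here as runs of
spaces). The exact conventions accepted by {\CompClustGui} may differ, so
the example files shipped with the program are the safer reference.

\begin{verbatim}
# Hour 1    Hour 2    Hour 4    Hour 8
0.72        0.56      0.32      0.06
0.01        0.15      0.80      0.73
0.97        0.95      0.91      0.94
\end{verbatim}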

\subsection{\cb CompClust Labelings}
\label{CompClust Labelings}
Adding a new labeling to any data set is fairly easy. All you need to do is
make a tab delimited text file with either one row or one column depending on
what type of labeling is appropriate. The only restriction is that the labeling
must match the dimensions of its data set: a row labeling needs one label per
row, and a column labeling needs one label per column.

For example, if you wanted to add a 'Gene Name' labeling to the data set in
section \ref{CompClust Datasets}, you would need a row labeling, i.e.\ one
column with three labels to match the three rows in the data set. Below is an
example of this labeling.

\begin{tabular}{l}
Gene Name 1\\
Gene Name 2\\
Gene Name 3
\end{tabular}

If you wanted to make your own cluster labeling (group labeling), you would
reuse the same label in one or more rows. For example, to create a
cluster labeling which groups Gene 1 and Gene 2 in one group and Gene 3 in
another group, you would create the following row labeling.

\begin{tabular}{l}
Cluster 1\\
Cluster 1\\
Cluster 2
\end{tabular}

One may wish to keep the time point hours as a column labeling as well. To
do this, create a tab delimited text file with one row as shown in the
column labeling below.

\begin{tabular}{llll}
Hour 1 & Hour 2 & Hour 4 & Hour 8\\
\end{tabular}

In practice, labeling files can be in either row form, with one
label per row, or column form, with one label per tab separated column.

One of the beauties of CompClust is that you can attach as many labelings as
you can think of. In {\CompClustGui} you will see dialogs asking you to select
cluster labelings, which are row labelings that separate your data into groups.
You will also be asked for primary and secondary labelings, which are
arbitrary row labelings that you may wish to attach. For example, when viewing
gene expression data in a plot, you may wish to attach a primary labeling of
gene names and a secondary labeling of descriptions.

Column labelings can currently only be taken advantage of by using CompClust
from Python, but in the future these features may be exposed in
{\CompClustGui}.


\subsection{\cb Cho Example Data}

{\CompClustGui} uses example data collected by Cho et al.\ (1998).
Briefly, they synchronized yeast cells using a CDC28 temperature
sensitive mutant.  After releasing the yeast cells from arrest, they
collected RNA from the cells every 10 minutes as the cells underwent
two rounds of cell division.  Using Affymetrix arrays, they assayed the
gene expression profile of every gene in yeast during this experiment
\cite{Cho_etal98}.  The resulting gene expression matrix has roughly
6000 genes by 17 time points.  We provide a subset of this matrix
which includes a total of 380 genes that were both selected by the
authors as exhibiting cell cycle dependency and meet a minimal noise
threshold \cite{Hart_et2005}.  Hart et al.\ (2005) provides an
introduction and theoretical basis for these tools and also provides a
case study highlighting the types of biological insights that can be
gleaned from analyses similar to those described here.

\newpage
\section{\cb CompClustTk}

\subsection{\cb Introduction}
%FIXME: The whole introduction seems to wordy to me so far. I'm also
%not sure if the information about ipython and the analysis history
%belong here. Maybe I should just mention that people can skip ahead to
%the analysis history and ipython section of the tutorial?

CompClustTk was designed to expose the basic functionality of CompClust. The
idea is to bring the CompClust analysis environment to the biologist.
Previously, a basic knowledge of Python programming was required in order to
use CompClust. This is still true for some of the most advanced analysis one
may wish to do.

We hope that CompClustTk will simplify learning to use CompClust by allowing
the user to use CompClust without knowing any Python. If you find CompClustTk
too limiting, there are a few tools which will help you to adjust to using
Python along with the GUI to do more advanced analysis---whenever you trigger
an action, the Python code you would have used to do the same thing in pure
Python will be stored in the 'Analysis History' section (View|Toggle Analysis
History).

For those of you who are feeling daring or just don't like GUIs that much, you
can use IPython to access the internals of the GUI, including any data sets or
labelings which you have loaded.

In the following sections of this tutorial, we will be using the Cho et al.\
(1998) Cell Cycling data mentioned above, located in your
CompClustTk/Examples/ChoCellCycling directory.

%\newpage
%
%\begin{figure}[h]
%\includegraphics[width=\textwidth]{tkImages/compClustTk-start}
%\caption{\cb CompClustTk mGUI} \end{figure}
%
%\newpage

\newpage
\subsection{\cb Load Data Set}
\label{loaddataset}
The first thing we need to do after starting CompClustTk is to load a
data set to work with. Click on 'File|Load Data Set' from the menu.

\begin{figure}[h]
\includegraphics[width=\textwidth]{tkImages/compClustTk-File-LoadDataSet}
\caption{\cb CompClustTk File | Load Data Set}
\end{figure}

\newpage
Locate the file named 'ChoCycling.dat' and click 'Open'.

\begin{figure}[h]
\includegraphics[width=\textwidth]{tkImages/compClustTk-LoadDataSet}
\caption{\cb Load Data Set Dialog}
\end{figure}

If everything went well, you should see a dialog box like the one
below telling you the dimensions of the data set you have just
loaded. Make sure that the numbers look reasonable, otherwise it may
be a sign that your data set was not formatted properly.


\begin{figure}[h]
  \begin{center}
  \includegraphics{tkImages/compClustTk-LoadDataSetComplete}
  \caption{\cb Load Data Set Complete}
  \end{center}
\end{figure}

\newpage
\subsection{\cb Attach Labelings}
\label{labelings}
Once you have successfully loaded your data set, it will be useful to
attach additional information to the data set. The generic name for
these annotations is 'Labelings'.

First, let's load the Gene row Labeling so we will be able to know
which rows (vectors) represent which genes. Select 'File|Load
Labeling' as shown below.

\begin{figure}[h]
\includegraphics[width=\textwidth]{tkImages/compClustTk-File-LoadLabeling}
\caption{\cb CompClustTk File | Load Labeling}
\end{figure}

\newpage
The Load Labeling dialog box requires a little more information than
the Load Data Set dialog box. Later when you are running an analysis,
you will need to select one or more labelings to use, so you should
choose a Labeling name which is meaningful. Also, you must tell the
program whether you plan to load a row or column labeling (See section
\ref{CompClust Labelings} for more information about CompClust
labelings).

\begin{figure}[h]
  \begin{center}
  \includegraphics{tkImages/compClustTk-File-LoadLabelingDialog}
  \caption{\cb Load Labeling Dialog}
  \end{center}
\end{figure}

Give your Gene Labeling a name like 'Gene Names', press 'Browse' and
select the file named 'CommonNames.rlab', and then choose the 'Row' radio
button as shown below.

\begin{figure}[h]
  \begin{center}
  \includegraphics{tkImages/compClustTk-File-LoadLabelingDialogGeneNames}
  \caption{\cb Load Labeling Dialog | Gene Names}
  \end{center}
\end{figure}

\newpage
Click 'Load' when you are ready. You should see a dialog box similar
to the one below if your Labeling loads successfully.

\begin{figure}[h]
  \begin{center}
  \includegraphics{tkImages/compClustTk-File-LoadLabelingDialogComplete}
  \caption{\cb Gene Name Load Labeling Complete}
  \end{center}
\end{figure}

Later, when we start analyzing our data, we will compare our
clustering results (coming up in section \ref{clustering}) to
classifications made by Cho. We should therefore attach the Labeling
file 'ChoClassification.rlab' as shown below. The rest of this tutorial
refers to this labeling as 'Cho Classification'.

\begin{figure}[h]
  \begin{center}
  \includegraphics{tkImages/compClustTk-File-LoadLabelingDialogChoClass}
  \caption{\cb Loading Cho Classification Labeling}
  \end{center}
\end{figure}

%FIXME: I should had the time course column labeling if I know how it
%would be useful in IPlot? 
%
%  If you have a DataseRowPlotView v, then this should work:
%     v.getDataMapper().setXAxisLabeling(<labeling>)
%  You'll then need to refresh the plot.

\newpage
\subsection{\cb Clustering}
\label{clustering}

\subsubsection{\cb Introduction to Clustering}
%FIXME: Needs more thought and maybe a more meaningful
%description... Chris, do you have any suggestions for this section?

Clustering algorithms are a general class of unsupervised machine
learning techniques which attempt to find an approximation to the
optimal partitioning (or separation) of a given data set into discrete
classes or clusters.  An optimal partition is defined as the
partitioning for which each data vector in a cluster is more similar
to all other data vectors with the same cluster membership than to
all other data vectors with different cluster memberships.  It can be
shown that this problem is NP-hard, i.e.\ computationally intractable.
Intuitively, to be certain you have found the ``true'' optimal
clustering you would have to search through all possible clusterings,
a set of possibilities that grows rapidly with the number of data points.
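
To get a feel for how quickly this search space grows, recall a standard
combinatorial result (not something specific to CompClust): the number of
ways to partition $n$ data vectors into exactly $k$ non-empty clusters is
the Stirling number of the second kind,
\[
S(n,k) = \frac{1}{k!}\sum_{j=0}^{k} (-1)^{j} {k \choose j} (k-j)^{n}
\approx \frac{k^{n}}{k!} \quad \mbox{for } n \gg k .
\]
For example, $S(4,2) = (2^{4} - 2\cdot 1^{4})/2! = 7$, so even four vectors
can be split into two clusters in seven different ways. For a data set of
$n=380$ genes and $k=5$ clusters, $S(380,5) \approx 5^{380}/5!$, which is on
the order of $10^{263}$ candidate clusterings. Exhaustive search is clearly
out of the question, which is why practical algorithms settle for an
approximation.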

In the context of microarray data analysis, clustering has become a
staple technique to provide insights into which genes have similar
behavior across the conditions being assayed.  Clusters essentially
form groups of co-expressed genes. Using these groups as a starting
point, biologists can then start to address the more interesting
questions regarding why these genes are co-expressed. For instance, are
the co-regulated genes part of a similar biochemical pathway? The
comparative tools provide utilities to compare clusterings, which can
elucidate such things as: how different parameters affect a
clustering; how different algorithms partition the data; and how
different experimental perturbations affect which cluster,
representing a similar expression response, a gene might belong
to. Such differences can have potentially profound effects on
downstream analysis.

Although we don't stress it in this tutorial, you can cluster
conditions (columns) just as easily as you can cluster genes (rows)
using the same techniques.

\newpage
\subsubsection{\cb DiagEM}
The first clustering program we will use is DiagEM.
%
DiagEM is an implementation of the expectation maximization (EM)
algorithm that attempts to fit data vectors to Gaussian clusters. The
DiagEM algorithm is so named because it only uses the diagonal of the
covariance matrix, so each cluster is described by its mean and a
separate variance for each dimension.
Because of this, the Gaussians discovered by DiagEM can only stretch or
shrink along the axes of the data space; they cannot be rotated.
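
In symbols, and as a sketch of the standard diagonal-covariance Gaussian
rather than a transcription of the DiagEM source code, a cluster with mean
$\mu = (\mu_1,\ldots,\mu_D)$ and per-dimension variances
$\sigma_1^2,\ldots,\sigma_D^2$ assigns a $D$-dimensional data vector $x$ the
density
\[
p(x \mid \mu, \sigma^2) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_d^2}}
\exp\left(-\frac{(x_d-\mu_d)^2}{2\sigma_d^2}\right) .
\]
This is a Gaussian whose covariance matrix is
$\Sigma = \mathrm{diag}(\sigma_1^2,\ldots,\sigma_D^2)$. Since every
off-diagonal entry is zero, the contours of equal density are ellipsoids
whose axes are parallel to the coordinate axes. The symbols $D$, $\mu$ and
$\sigma_d$ are introduced here only for illustration.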

In principle, the more common KMeans algorithm is a simplification of
the EM algorithm in which each cluster is limited to a simple
high-dimensional sphere instead of the ellipsoid, or even more complex
shapes, that can be described by a covariance matrix.
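
As a sketch of this relationship in standard textbook notation (not taken
from the CompClust source), KMeans assigns each data vector $x$ to the
cluster $k$ whose mean $\mu_k$ is closest in Euclidean distance,
\[
c(x) = \arg\min_{k} \sum_{d=1}^{D} (x_d - \mu_{k,d})^2 ,
\]
and then recomputes each $\mu_k$ as the average of the vectors assigned to
it. This corresponds to the hard-assignment limit of EM in which every
cluster shares the same spherical covariance $\Sigma_k = \sigma^2 I$, so all
dimensions are weighted equally and no cluster can be wider in one direction
than in another.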

The EM algorithm using a full covariance matrix can create Gaussians
that not only are of different widths in each dimension but can also
be rotated in different directions in the data space. Unfortunately,
biological datasets frequently have many conditions, which
corresponds to a high dimensional data space. Because of this
high dimensionality, there is rarely enough data to properly estimate
all of the parameters required to fill the full covariance matrix, and
thus we standardized on the Diagonal EM algorithm.
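
A quick parameter count illustrates the problem. Assuming one covariance
matrix per cluster, a full symmetric covariance matrix in $D$ dimensions has
$D(D+1)/2$ free parameters, while a diagonal one has only $D$:
\[
\frac{D(D+1)}{2}\bigg|_{D=17} = \frac{17 \cdot 18}{2} = 153
\qquad \mbox{versus} \qquad D = 17 .
\]
For a data set with 17 time points and $K=5$ clusters, that is
$5 \cdot 153 = 765$ covariance parameters for the full model against
$5 \cdot 17 = 85$ for the diagonal model, in addition to the
$5 \cdot 17 = 85$ mean values needed in either case. With only a few hundred
gene vectors available, the diagonal model is far easier to estimate
reliably.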



We will use most of the default parameters, but we will change the
number of clusters we would like DiagEM to create. Change K from two
(default) to five as we will compare our results to the Cho
Classification, which has been partitioned into five clusters which
represent five stages of the yeast cell cycle.

Choose 'DiagEM' from the 'Clustering' menu and then change K from 2 to
5 as shown below.

\begin{figure}[ht]
  \begin{center}
  \includegraphics[height=4in]{tkImages/compClustTk-Clustering-DiagEMDialog}
  \caption{\cb Clustering|DiagEM}
  \end{center}
\end{figure}

\newpage
When clustering is finished you should see a dialog box pop up similar
to the one below.

\begin{figure}[h]
  \begin{center}
  \includegraphics{tkImages/compClustTk-Clustering-DiagEMDialogComplete}
  \caption{\cb Clustering|DiagEM Complete}
  \end{center}
\end{figure}

\newpage
\subsubsection{\cb KMeans}
Since every clustering algorithm is different, each one may return
different results. We will compare the results of KMeans with DiagEM
and Cho's classifications in the analysis section of this
tutorial. Select 'KMeans' from the 'Clustering' menu and then change K
from 2 to 5. Click 'Cluster' to begin clustering. The KMeans dialog
should be similar to the one shown below.

\begin{figure}[h]
  \begin{center}
  \includegraphics[height=450pt]{tkImages/compClustTk-Clustering-KmeansDialog}
  \caption{\cb Clustering|KMeans}
  \end{center}
\end{figure}

\newpage
\subsection{\cb Analysis}
\label{Analysis}

\subsubsection{\cb Introduction}
Okay, so we now have several labelings: the ones that we loaded, and the
labelings created by DiagEM and KMeans. In this section we
will explore the data and view/compare the results of the clustering
algorithms. This will be done with a visualization tool called IPlot,
written by Chris Hart, which was built on top of pyMLX, written by Ben
Bornstein, Chris Hart, Lucas Scharenbroich, and Diane Trout.

At any time if you feel you have too many Tabs open, or you are done
with a plot, select 'Close Current Tab' from the 'Tabs' menu.

\subsubsection{\cb Trajectory Summary}
\label{trajectorysummary}
%FIXME: Chris, could you please describe this tool and how one might
%want to use it.

The Trajectory Summary is a great tool for viewing your data. Given a
Cluster Labeling and a Gene Labeling, this tool will allow you to
easily visualize and explore your data set. 

Let's start by choosing 'Trajectory Summary' from the 'Analysis'
menu. To start, select 'Cho Classification' for the 'Cluster
Labeling'. So that we know which row (vector) represents which Gene,
choose 'Gene Names' for 'Primary Labeling'. The 'Primary Labeling' is
the labeling which is displayed when you click on an individual
vector. Click 'Plot' when you are ready to view the Trajectory
Summary.

\begin{figure}[h]
  \begin{center}
  \includegraphics{tkImages/compClustTk-TrajectorySummaryPlotDialog-ChoClass}
  \caption{\cb Trajectory Summary Dialog - Cho Classification}
  \end{center}
\end{figure}

\newpage
Your 'Cho Classification' Trajectory Summary should look similar to
the image below. You should have five clusters separating the gene
expression data into five stages of the yeast cell cycle. The blue
lines are the mean trajectory for a given cluster. The red lines are
the standard deviation from the mean.
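
Written out, and using notation introduced here only for illustration, let
$C$ be a cluster containing $|C|$ genes and let $x_{it}$ be the expression
value of gene $i$ at time point $t$. The mean trajectory is then
\[
m_t = \frac{1}{|C|}\sum_{i \in C} x_{it} ,
\]
and the spread around it can be summarized by
\[
s_t = \sqrt{\frac{1}{|C|}\sum_{i \in C} (x_{it} - m_t)^2} .
\]
We assume here that the red lines are drawn at $m_t \pm s_t$; whether the
plot normalizes by $|C|$ or by $|C|-1$ is not something this tutorial
specifies, so treat the second formula as a sketch.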

This is a helpful view to get an idea of what your clusters are doing,
but if you want a more detailed view, left click on the plot. In this
case, let's look at the 'Late G1' cluster. Click on the plot for the
'Late G1' cluster now.

\begin{figure}[h]
  \begin{center}
  \includegraphics[width=\textwidth]{tkImages/compClustTk-TrajectorySummaryTab-ChoClass}
  \caption{\cb Trajectory Summary Tab - Cho Classification}
  \end{center}
\end{figure}

\newpage
You should get a plot that looks like the following. The coloring of
the trajectories is based on the expression level at time 0 by
default. In a future version we may expose the ability to easily
change the coloring scheme within the GUI itself, but for now, if you
want to change the coloring, you will have to use the
'Analysis Shell', which is discussed in section \ref{ipython}.

\begin{figure}[h]
  \begin{center}
  \includegraphics[width=\textwidth]{tkImages/compClustTk-Analysis-TrajectorySummary-ChoClass-G1-PlotAll}
  \caption{\cb Plot All - Late G1}
  \end{center}
\end{figure}

\newpage
If you click on any point in the plot, a box in the top right will
show up with the text 'Gene Name: (x-cord, y-cord)'. Click on the
point shown in the diagram below and you should see 'SCW11: (10.00,
4.92)'. If you control click on the point shown below, a new dialog
box appears showing the trajectory for 'SCW11' as shown in Figure 16
on the next page.

%% FIXME: replace Figure 16 with symbolic reference

\begin{figure}[h]
  \begin{center}
  \includegraphics[width=\textwidth]{tkImages/compClustTk-Analysis-TrajectorySummary-ChoClass-G1-PlotAll-Select}
  \caption{\cb Plot All - Late G1 - SCW11}
  \end{center}
\end{figure}

\newpage
Note that at the bottom of the figure below, there are two Labelings,
'Gene Names' and 'Cho Classification'. If you click on the 'Gene
Names' labeling you should see a pull-down menu show up in the bottom
right. If you then change this from 'default display' to 'Highlight
This Group', this gene will be highlighted in your 'Late G1 - Plot
All' display as shown in Figure 17 on the next page.

\begin{figure}[h]
  \begin{center}
  \includegraphics[width=\textwidth]{tkImages/compClustTk-Analysis-TrajectorySummary-ChoClass-GeneView-SCW11}
  %FIXME: Labeling 'Cho Clustering' should be 'Cho Classification'.
  \caption{\cb Gene View - SCW11}
  \end{center}
\end{figure}


\newpage
At this point you can 'Ctrl + Click' on other gene vectors and
highlight them as well. This is all we will cover on the Trajectory
Summary Plot for this tutorial.  Feel free to explore more on your own
or continue on to the next section of the tutorial.
\begin{figure}[h]
  \begin{center}
  \includegraphics[width=\textwidth]{tkImages/compClustTk-Analysis-TrajectorySummary-ChoClass-G1-PlotAll-Highlight}
  \caption{\cb Plot All - Late G1 - SCW11 Highlighted}
  \end{center}
\end{figure}

\newpage
\subsubsection{\cb Confusion Matrix}
%FIXME: Chris, could you please describe this tool and how one might
%want to use it.

Now that we have an idea of what our clusters look like from the
Trajectory Summary Plots, we will compare the 'Cho Classification',
'DiagEM w/ K=5', and 'K-means w/ K=5'. This will be done using a
Confusion Matrix Plot.

For starters, let's compare the 'Cho Classification'\footnote{See
section \ref{labelings} for information on loading labelings.} to
itself. Select 'Build Confusion Matrix' from the 'Analysis' menu. Then
select 'Cho Classification' for the '1st Clustering Labeling' and '2nd
Clustering Labeling' as shown in the figure below. Click 'Plot' when
you are ready to move on.

\begin{figure}[h]
  \begin{center}
  \includegraphics{tkImages/compClustTk-Analysis-ConfusionMatrix-Diaglog-ChoClass-vs-ChoClass}
  \caption{\cb Confusion Matrix Dialog - Cho vs. Cho}
  \end{center}
\end{figure}

\newpage
You should get a Confusion Matrix plot similar to the following
figure. Notice that there are two 'Trajectory Summary' sections being
displayed with white backgrounds (top row and last column). Each one
of these sections is a clustering, in this case 'Cho Classification'
versus itself. If you look at the five clusters in the top row, you'll
notice that I have super-imposed green and red bars in the figure
below. The green bars are highlighting the number of
genes\footnote{CompClust is capable of supporting other types of data
beyond gene expression data.} in a given cluster. The red bars are
highlighting the name of the clustering.

\begin{figure}[h]
  \begin{center}
  \includegraphics[width=\textwidth]{tkImages/compClustTk-Analysis-ConfusionMatrixTab-ChoClass-vs-ChoClass-HighlightedLabels}
  \caption{\cb Confusion Matrix Tab - Cho vs. Cho}
  \end{center}
\end{figure}

\newpage
What is this matrix telling us? It's showing us the number of members
of column Y that are showing up in row X. For example, if we look at
column 2 (M Phase Cluster) and compare it to row 2 (S Phase Cluster),
we see that the 'M Phase Cluster' has no members that are shared with
the 'S Phase Cluster' (see figure below). Later when we compare our
clustering results to Cho's, things won't be as clear as this. If you
look at row 1 (Late G1) vs column 5 (Late G1) you'll see that 134 out
of 134 members are shared between the two clusters (because they are
the same cluster).
\begin{figure}[h]
  \begin{center}
  \includegraphics[width=\textwidth]{tkImages/compClustTk-Analysis-ConfusionMatrixTab-ChoClass-vs-ChoClass-Row2-Col2}
  \caption{\cb Confusion Matrix Tab - Cho vs. Cho}
  \end{center}
\end{figure}

\newpage
Now let's move onto a more interesting comparison. Let's compare the
'Cho Classification' to our clustering of 'DiagEM w/ K=5'\footnote{See
section \ref{clustering} if you haven't run DiagEM yet.}. Select
'Build Confusion Matrix' from the 'Analysis' menu. Then select 'Cho
Classification' for the '1st Clustering Labeling' and
'DiagEM...k=5...' for the '2nd Clustering Labeling' as shown in the
figure below. Click 'Plot' when you are ready to move on.

\begin{figure}[h]
  \begin{center}
  \includegraphics[width=\textwidth]{tkImages/compClustTk-Analysis-ConfusionMatrix-Diaglog-ChoClass-vs-DiagEM}
  \caption{\cb Confusion Matrix Dialog - Cho vs. DiagEM}
  \end{center}
\end{figure}

\newpage
As you can see from the figure below, you get a much more interesting
plot. We have 'DiagEM' on the X axis and 'Cho Classification' on the Y
axis. Note that DiagEM gave the clusters numeric labels, as it had
no way of knowing any biologically related information for naming
purposes. It is the job of the user to try to figure out the meaning of
those cluster labels.

I'll start by describing the coloring scheme. The colors range from
red to blue, where pure red means that 0\% of the members for a given
row and column are shared by the two clusters, and pure blue means that
100\% of the members are shared between the two clusters. If we look at
column 1 (DiagEM Cluster \#4) and compare it to row 2 (Cho's S Phase
Cluster) we see that 2 out of 13 (15.38\%) members are shared between
the two clusters, which is why it has an orange color.\footnote{Note
that all color calculations are based on all the comparisons for a
given column. If you were to redo this plot as 'DiagEM' vs 'Cho'
instead of 'Cho' vs 'DiagEM', the results will probably be very
similar, but the color coding may change significantly. If we looked
at the same comparison of 'DiagEM Cluster \#4' vs 'Cho's S Phase
Cluster' in the 'DiagEM' vs 'Cho' plot, the color would be calculated
as 2 out of 74 (2.70\%), which would make it much more red than it is
in our current plot.}


\begin{figure}[h]
  \begin{center}
  \includegraphics[width=\textwidth]{tkImages/compClustTk-Analysis-ConfusionMatrixTab-ChoClass-vs-DiagEM}
  \caption{\cb Confusion Matrix Tab - Cho vs. DiagEM}
  \end{center}
\end{figure}

If you look at column 5 (DiagEM Cluster \#2) vs row 1 (Late G1) on the
figure on the previous page, you will notice that 117 members out of
151 are shared between the two clusters.
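
The arithmetic behind these cells can be written out explicitly. Using
notation introduced here only for illustration, let $n_{rc}$ be the number
of members shared by the cluster in row $r$ and the cluster in column $c$,
and let $N_c$ be the total number of members of the column cluster. The
fraction that determines the color of a cell is then
\[
f_{rc} = \frac{n_{rc}}{N_c} .
\]
For the two cells discussed above, this gives
\[
\frac{2}{13} \approx 15.38\% \qquad \mbox{and} \qquad
\frac{117}{151} \approx 77.48\% .
\]
If the same counts were instead normalized by the sizes of Cho's clusters
(74 members in S Phase and 134 in Late G1), the fractions would become
$2/74 \approx 2.70\%$ and $117/134 \approx 87.31\%$. The shared counts stay
the same in both cases; only the denominator, and therefore the color,
changes.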

%FIXME: Maybe I should add more information about some conclusions you
%can make based on the comparison of human clustered data vs computer
%clustered data.

%\paragraph{\cb NMI Score}
%FIXME: Needs describing

%\paragraph{\cb Linear Assignment - LA}
%FIXME: Needs describing

\newpage
\subsubsection{\cb ROC Analysis}

ROC Analysis is a tool to examine the overlap of the members of a
cluster with that of the surrounding space. It does this by plotting
the false positive rate on the X axis against the true positive rate on
the Y axis.

To visualize the way that the ROC curve is computed, imagine a
hypersphere that starts with zero size at the cluster center. As
the sphere grows, for every point you hit that is in the cluster you
increment along the Y axis, and for every point that you hit that is not
in the cluster you increment along the X axis. A perfect ROC curve would
have a vertical line at x=0, followed by a horizontal line at y=1.
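
One way to write this down is sketched below. It assumes that each data
vector $i$ has a distance $d_i$ to the cluster center; this tutorial does
not state which distance metric is used, so $d_i$ should be read as
generic. For a sphere of radius $r$, a cluster $C$ with $|C|$ members, and
$N$ data vectors in total,
\[
\mathrm{TPR}(r) = \frac{|\{ i \in C : d_i \le r \}|}{|C|} ,
\qquad
\mathrm{FPR}(r) = \frac{|\{ i \notin C : d_i \le r \}|}{N - |C|} .
\]
The ROC curve is the set of points $(\mathrm{FPR}(r), \mathrm{TPR}(r))$
traced out as $r$ grows from zero. Each cluster member that falls inside the
sphere raises the curve by $1/|C|$, and each non-member moves it right by
$1/(N-|C|)$. If every member is closer to the center than every non-member,
the curve reaches $y=1$ while still at $x=0$, which is the perfect case
described above.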

The way that {\CompClustGui} visualizes this shows the standard ROC
curve along the left, with histograms of the cluster members and
non-members on the right.

To visualize an ROC curve go to the Analysis menu and select Cluster
ROC analysis.

\begin{figure}[ht]
  \begin{center}
  \includegraphics[height=4in,keepaspectratio]{tkImages/compClustTk-Analysis-ROC}
  \caption{\cb Cluster ROC Analysis}
  \end{center}
\end{figure}

\newpage
The menu item will bring up a dialog box that allows you to select
which cluster labeling and which cluster to use. In this case, let's
select the Early G1 cluster from the Cho Classification.

Clustering Labeling specifies which clustering you want to explore,
while Cluster label specifies which cluster you want to consider the
``inside'' cluster for the analysis.

\begin{figure}[h]
  \begin{center}
  \includegraphics[height=2in,keepaspectratio]{tkImages/compClustTk-Analysis-ROC-ChoClass-Dialog}
  \caption{\cb ROC Analysis Dialog Box}
  \end{center}
\end{figure}

\newpage
Once selected, you can see the comparisons. On the left is the
standard ROC curve, and on the right is a plot of how many points were
found in each distance bin. Red represents cluster members, and blue
represents everything else.

The more separated those two histograms are, the better the clustering.

This is not a particularly well separated cluster. If one looks at the
histogram of distances, there are several data points that are
considered part of this cluster that are actually quite far from the
cluster center.
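
The two histograms can be pictured as simple counts per distance bin.
The following is a rough sketch in plain Python, assuming the distance
of every point from the cluster center is already known. The function
name, the bin count, and the sample distances are made up for
illustration and are not taken from {\CompClustGui}.

\begin{verbatim}
def distance_histograms(distances, in_cluster, n_bins):
    """Count members and non-members in equal-width distance bins.

    Illustrative helper, not part of CompClust.
    """
    largest = max(distances)
    if largest > 0:
        width = largest / float(n_bins)
    else:
        width = 1.0
    member_counts = [0] * n_bins
    other_counts = [0] * n_bins
    for d, inside in zip(distances, in_cluster):
        # The largest distance would land one past the end; clamp it.
        b = min(int(d / width), n_bins - 1)
        if inside:
            member_counts[b] += 1
        else:
            other_counts[b] += 1
    return member_counts, other_counts

distances = [0.1, 0.2, 0.5, 1.0, 2.0]
members = [True, True, False, True, False]
member_counts, other_counts = distance_histograms(distances, members, 4)
\end{verbatim}

In this made-up sample, one member falls in a bin beyond the first
non-member, which is the kind of overlap between the two histograms
that indicates a poorly separated cluster.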

\begin{figure}[h]
  \begin{center}
  \includegraphics[height=4in,keepaspectratio]{tkImages/compClustTk-Analysis-ROC-ChoClass-EarlyG1}
  \caption{\cb ROC Analysis Cho Classification of Early G1}
  \end{center}
\end{figure}

\newpage
In this case we look at one of the better clusters found by our EM
algorithm. You can see in the histogram that the elements in the
cluster rapidly taper off as one moves from the cluster center.

\begin{figure}[h]
  \begin{center}
  \includegraphics[height=4in,keepaspectratio]{tkImages/compClustTk-Analysis-ROC-DiagEM-5}
  \caption{\cb ROC Analysis DiagEM of cluster 5}
  \end{center}
\end{figure}

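To be fair, the EM algorithm we use builds clusters using a Gaussian
cloud, which tends to produce clusters that are concentrated around
their centers and therefore fare well in this kind of distance-based
analysis.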

\subsubsection{\cb PCA Explorer}
%FIXME: Chris, could you please describe this tool and how one might
%want to use it.
To Be Written

\subsection{\cb Advanced Analysis - IPython}
\label{ipython}

\subsubsection{\cb Introduction}
The ``Analysis Shell'' is basically just an IPython\footnote{IPython -
http://ipython.scipy.org/} command prompt which allows you to use
the Python\footnote{Python - http://www.python.org/} programming
language. This also gives you access to all of the CompClust Python
package and the GUI internals. What this means is that if the GUI
does NOT currently do something you would like it to, you can
probably make it happen using Python in the ``Analysis Shell''.

To get you started with Python, we recommend reading the Python
tutorial at http://docs.python.org/tut/tut.html.

\subsubsection{\cb Analysis Shell and Log 2 Transform Example}
One common thing you may wish to do that is not currently supported
in {\CompClustGui} is transforming your data set. The CompClust
Python package supports this extensively, among many other powerful
features. In this case, let's say you want to log2 transform your data
set. Load a data set as shown in section \ref{loaddataset} and then
launch the 'Analysis Shell' by going to the 'Analysis' menu and
choosing 'Analysis Shell'. You will find the 'Analysis Shell' in the
original shell window you used to launch {\CompClustGui} or in the
second window that was launched upon starting {\CompClustGui}.

To give you a glimpse of the many powerful things one can do with the
CompClust Python package, I'm going to show you how to do a log2
transform of the data set, but I'm going to do it by using a data set
'View' called the 'FunctionView'. The 'FunctionView' allows you to
transform your data set by passing in a function which will be applied
to every element of your data set. The function you pass to the
'FunctionView' needs to take one argument and return one value. In our
case, we will make a function which takes one element of the data
set, computes its log2, and returns the result. As you can imagine,
any function you can come up with can be applied using this method.

The 'FunctionView' is only one of the many 'Views' you can use on your
data set. The nice thing about a 'View' is that it doesn't actually
store a copy of the data set. It gets the data from the original
data set when you access the 'View'. This also means that any
'Labelings' you have attached to your original data set will also be
accessible by your 'View', even if your 'View' is only a subset of
your original data set. Since a 'View' implements all the functions a
data set object has, it's usable wherever a function asks for a data
set. This also means you can create another view from a view. Don't
worry if that didn't make much sense; basically what it means is that
it's relatively memory efficient and easy to use (from a programmer's
point of view).
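
The idea of a view that stores no data of its own can be illustrated
with a small stand-in class. This is a conceptual sketch in plain
Python, not the CompClust implementation: the class
\verb|LazyFunctionView| and its list-of-rows data are hypothetical, and
the real 'FunctionView' may work quite differently internally.

\begin{verbatim}
class LazyFunctionView(object):
    """Hypothetical stand-in for a view (not CompClust code).

    It keeps a reference to the source rows and applies func
    only when a row is accessed.
    """
    def __init__(self, source, func):
        self.source = source   # a reference, not a copy
        self.func = func

    def __len__(self):
        return len(self.source)

    def __getitem__(self, i):
        return [self.func(x) for x in self.source[i]]

rows = [[1.0, 2.0], [4.0, 8.0]]
doubled = LazyFunctionView(rows, lambda x: 2 * x)
# A view can wrap another view, since it offers the same interface.
squared = LazyFunctionView(doubled, lambda x: x * x)

# A change to the original data shows up through both views.
rows[0][0] = 3.0
first_doubled = doubled[0]
first_squared = squared[0]
\end{verbatim}

Because nothing is copied, stacking several views costs little memory,
at the price of recomputing values on each access.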

If you would like to see what views are available, type the following
from the analysis shell and then press TAB after typing the period:

\begin{verbatim}
views.
\end{verbatim}

If you are using the CompClust Python package from within Python
itself without a GUI, then you will need to import the views module by
typing the following command. If you're using the 'Analysis Shell',
then the following command has already been executed for you.

\begin{verbatim}
from compClust.mlx import views
\end{verbatim}

To get information on any particular view, or for any Python object
for that matter, type the variable/function/object name, then a '?'
and press enter. For example for the 'FunctionView' you would type:

\begin{verbatim}
views.FunctionView?<press-enter>
\end{verbatim}

Now on to the example. The first thing we are going to do is create the
log2 function we are going to pass to the 'FunctionView'. To do this,
we will need to load the 'math' Python module by typing:

\begin{verbatim}
import math
\end{verbatim}

We are going to use the math.log function, which takes two arguments,
number and base. Since the 'FunctionView' expects to receive a function
which takes only one argument, we need to wrap the math.log function
with the 'base' argument set to 2. Type the following to define the
log2 function:

\begin{verbatim}
def log2(x):
  return math.log(x, 2)
\end{verbatim}

Note that you need to press enter twice after writing the last line
of the function. This tells Python to go ahead and define the
function. If everything went well, your command prompt should look
like this:

\begin{verbatim}
In [5]:def log2(x):
   ...:    return math.log(x, 2)
   ...:

In [6]:
\end{verbatim}

If you write more Python code on the next line rather than pressing
enter twice, you'll probably end up with a SyntaxError like the
following:

\begin{verbatim}
In [5]:def log2(x):
   ...:    return math.log(x, 2)
   ...:log2(2)
------------------------------------------------------------
   File "<console>", line 3
     log2(2)
        ^
SyntaxError: invalid syntax
\end{verbatim}

Feel free to try out your new function by typing:

\begin{verbatim}
In [9]:log2(2)
Out[9]:1.0
\end{verbatim}
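
One thing to watch for: math.log raises a ValueError for zero or
negative numbers, so the log2 function above will fail if your data
set contains such values. A possible workaround, sketched here under
the assumption that clamping small values is acceptable for your data,
is to raise every value to a floor before taking the log. The name
\verb|safe_log2| and the default floor of 1.0 are illustrative
choices, not CompClust conventions.

\begin{verbatim}
import math

def safe_log2(x, floor=1.0):
    """Return log2 of x after raising values below floor up to floor."""
    return math.log(max(x, floor), 2)
\end{verbatim}

With the default floor, zero and negative values map to 0.0 instead of
raising an error. Whether that is appropriate depends on your data, so
treat the floor as something to choose deliberately.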

Now that we have our function, it's time to get the data set which
you've already loaded from within the GUI. To grab the data set, type
the following:

\begin{verbatim}
dataSet = gui.data['myDataSet']
\end{verbatim}

Since we are going to replace gui.data['myDataSet'] with the
transformed data set, if you want access to the original dataSet, you
should save the data set to a variable you can access later. If you
save it to the gui.data Python dictionary, then you will be able to
access the original data set even if you close the 'Analysis Shell'
and re-open it later. To do this type the following:

\begin{verbatim}
gui.data['originalDataSet'] = dataSet
\end{verbatim}

Now we will create the log 2 transformed view (a.k.a. data set). To do
this, call the 'FunctionView' with the data set and the log2 function
by typing:

\begin{verbatim}
log2DataSet = views.FunctionView(dataSet, log2)
\end{verbatim}

Now that you have the log 2 view of your data, if you want to be able
to view it in \CompClustGui, you will need to store it in
gui.data['myDataSet'] so that the GUI knows that you want it to use
the log 2 view when doing visualizations. To do this type the
following:

\begin{verbatim}
gui.data['myDataSet'] = log2DataSet
\end{verbatim}

That's it, now you can go back to the GUI and use your log 2
transformed data. At this point you can either quit out of the
'Analysis Shell' or leave it open; it's up to you. Note that if you
close the 'Analysis Shell', you will be able to launch it again, but
all of your local variables such as your log2 function will be lost.

If you don't want to lose what you've written, or you want to add a
lot of code all at once, IPython (a.k.a. Analysis Shell) will allow you
to use a text editor to write your code. To launch the default text
editor, type the following, where $<$path$>$ is the path to the file
you want to create/use.

\begin{verbatim}
edit <path>.py
\end{verbatim}

That command should launch a text editor for you to use. Type your
Python code and when you're done, save the file and exit out of the
text editor. IPython will then read in and execute your Python code. If
you get some sort of error or you want to make a change, just type the
same command as before and you will be able to modify the code some
more.

Using the edit command to write and load Python code will allow you
to quickly load, test, and edit your Python code. This can be used to
do automated loading of data/labelings or some advanced analysis and
then view your results from within the GUI.
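
As an illustration, the log2 example from earlier could be collected
into a file and loaded with the edit command. The sketch below uses
only the names shown in this section (gui.data, 'myDataSet',
'originalDataSet', views.FunctionView). The helper name
\verb|apply_log2_view| is made up, and passing gui and views in as
arguments is just one way to keep the function easy to rerun and test.

\begin{verbatim}
import math

def log2(x):
    return math.log(x, 2)

def apply_log2_view(gui, views):
    """Point gui.data['myDataSet'] at a log2 view of the original data.

    The original data set is saved once under 'originalDataSet', so
    running this again does not apply the transform twice.
    """
    data = gui.data
    if 'originalDataSet' not in data:
        data['originalDataSet'] = data['myDataSet']
    data['myDataSet'] = views.FunctionView(data['originalDataSet'], log2)
    return data['myDataSet']

# From the Analysis Shell, where gui and views already exist:
# apply_log2_view(gui, views)
\end{verbatim}

Keeping the steps in a function means the file can be loaded without
side effects, and the transform is applied only when you call it.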

By the way, to change the default editor, set the environment variable
'EDITOR' to the name of or the path to the editor you wish to use.

\subsubsection{\cb Advanced Analysis Shell}
If you want to make plots appear within the GUI from the 'Analysis
Shell', there are a few things you should know. When making plots or
adding your own Tkinter code from within the 'Analysis Shell', you may
get strange errors mentioning things like 'blt::graph' that won't make
much sense. This happens if you try to create a new root Tkinter
object when one already exists. The root Tkinter object can be found
at 'gui.parent'. If you want to create a new 'Pop Up' style window,
type the following from the 'Analysis Shell':

\begin{verbatim}
toplevel = Tkinter.Toplevel(master=gui.parent)
\end{verbatim}

This 'toplevel' object can be passed to just about any IPlot
visualization to become the parent window for that Plot or it can be
used as the parent window for new Tkinter widgets.

If you want new plots or new Tkinter widgets to show up in a tab in
the CompClustTk GUI, then you need to create a new page and use that
variable instead of a root Tkinter object or a 'toplevel' object. Type
the following to create the new 'page' object and then tell Python to
automatically select this new page for you:

\begin{verbatim}
page = gui.notebook.add('My New Tab')
gui.notebook.selectpage('My New Tab')
\end{verbatim}

If this doesn't make much sense and/or if you're now interested in
learning to make GUIs using Python and Tkinter, check out:
http://www.pythonware.com/library/tkinter/introduction/

\subsubsection{\cb CompClust Python Package}
If you're interested in learning more about what you can do with the
CompClust Python package, check out the tutorials in the next
section. When using the CompClust Python package, the analysis
possibilities are limited only by your imagination ... Well, that and
CPU power, RAM, and your understanding of Python and the CompClust
Python package. At least you don't have to be limited by what the GUI
can do.

\newpage
\section{\cb More of CompClust}

\subsection{\cb Other CompClust Tutorials}
The following tutorials can be found at
http://woldlab.caltech.edu/compClust/.

\subsubsection{\cb A Quick Start Guide to Microarray Analysis using
  CompClust}
``A Quick Start Guide to Microarray Analysis using CompClust'',
written by Christopher Hart, covers how to use the Python
CompClust environment to do microarray analysis. It may give you a
better understanding of the IPlot tools (Trajectory Summary, Confusion
Matrices, etc.). It will also teach you how to use some of the more
advanced features of CompClust which haven't been exposed to
{\CompClustGui}.

\subsubsection{\cb A First Tutorial on the MLX schema}
``A First Tutorial on the MLX schema'', written by Lucas Scharenbroich,
covers the MLX schema. Read it if you want to use the full power of
CompClust from Python.

%\subsection{\cb Future CompClust Work}
%FIXME: To Be Written
%To Be Written

\newpage
%\section{\cb Trouble Shooting}
%FIXME: To Be Written
%To Be Written

\section{\cb Acknowledgements}

\bibliography{refs}

%\subsection{\cb CompClustTk Installation}
%
%\subsubsection{\cb All Platforms}
%
%\subsubsection{\cb Linux Specific}
%
%\subsubsection{\cb MacOS Specific}
%
%\subsubsection{\cb Windows Specific}
%
%\subsection{\cb Known CompClustTk Issues}
%
%\subsubsection{\cb All Platforms}
%
%\subsubsection{\cb Linux Specific}
%
%\subsubsection{\cb MacOS Specific}
%
%\subsubsection{\cb Window Specific}
%
%\subsection{\cb Bugs and Feature Requests}
%
%\subsubsection{\cb Bug Reports}
%
%\subsubsection{\cb Feature Requests}

%\begin{tabular}{ll}
%\bf Output Modules & \bf Description\\
%CreateSQL & SQL Statements for creating database\\
%CreateDBAPI & Python Database API for a given database\\
%CreatePyTkWidgets & Python Tkinter GUI Widgets\\
%CreatePyTkDbWidgets & Python Tk Database GUI Widgets
%\end{tabular}

%\newpage

%\begin{figure}[h]
%\includegraphics[width=\textwidth]{images/Fig1-NewArgoUMLFile}
%\caption{\cb Getting Familiar with ArgoUML}
%\end{figure}

\end{document}
