Estimating Tree-Structured Covariance Matrices via Mixed-Integer Programming with an Application to Phylogenetic Analysis of Gene Expression

Title: Estimating Tree-Structured Covariance Matrices via Mixed-Integer Programming with an Application to Phylogenetic Analysis of Gene Expression
Publication Type: Reports
Year of Publication: 2008
Authors: Corrada Bravo H, Eng KH, Keles S, Wahba G, Wright S
Date Published: 2008
Institution: Department of Statistics, University of Wisconsin

We present a novel method for estimating tree-structured covariance matrices directly from
observed continuous data. A representation of these classes of matrices as linear combinations
of rank-one matrices indicating object partitions is used to formulate estimation as instances of
well-studied numerical optimization problems.
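
As an illustration of the representation described above, the following is a minimal sketch, not the report's own code. It assumes that each branch of the tree corresponds to a subset of leaves (a clade), that the rank-one matrix for a branch is the outer product of the clade's 0/1 indicator vector with itself, and that the nonnegative coefficient is the branch length. The function name `tree_covariance` and the 4-leaf example tree are hypothetical.

```python
def tree_covariance(n_leaves, clades):
    """Return sum_k d_k * v_k v_k^T, where v_k is the 0/1 indicator
    vector of the k-th clade and d_k >= 0 is its branch length."""
    cov = [[0.0] * n_leaves for _ in range(n_leaves)]
    for leaves, length in clades:
        if length < 0:
            raise ValueError("branch lengths must be nonnegative")
        # Adding d_k * v_k v_k^T raises every entry (i, j) with both
        # i and j inside the clade by d_k.
        for i in leaves:
            for j in leaves:
                cov[i][j] += length
    return cov


# Hypothetical 4-leaf tree with topology ((0,1),(2,3)).
clades = [
    ({0, 1, 2, 3}, 0.5),  # root branch shared by all leaves
    ({0, 1}, 1.0),        # internal branch above leaves 0 and 1
    ({2, 3}, 2.0),        # internal branch above leaves 2 and 3
    ({0}, 0.3), ({1}, 0.4), ({2}, 0.1), ({3}, 0.2),  # leaf branches
]
B = tree_covariance(4, clades)
```

Under these assumptions, the covariance between two leaves is the total length of the branches they share on the path from the root, so leaves in the same clade covary more strongly than leaves separated at the root.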
In particular, we present estimation based on projection where the covariance estimate is
the nearest tree-structured covariance matrix to an observed sample covariance matrix. The
problem is posed as a linear or quadratic mixed-integer program (MIP) where a setting of the
integer variables in the MIP specifies a set of tree topologies of the structured covariance matrix.
We solve these problems to optimality using efficient and robust existing MIP solvers. We also
show that the least squares distance method of Fitch and Margoliash (1967) can be formulated as
a quadratic MIP and thus solved exactly using existing, robust branch-and-bound MIP solvers.
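
To make the projection idea concrete, here is a small brute-force sketch for three objects. It is a stand-in for illustration only and is not the mixed-integer formulation of the report: instead of integer variables handled by a MIP solver, it enumerates the three possible topologies explicitly, and for each one fits nonnegative branch lengths by least squares in the Frobenius norm. The helper names (`indicator_outer`, `nnls_by_enumeration`, `project_three_leaves`) and the example matrix are hypothetical, and the exhaustive nonnegative least-squares routine is only practical for a handful of variables.

```python
from itertools import combinations

import numpy as np


def indicator_outer(n, leaves):
    """Rank-one matrix v v^T for the 0/1 indicator v of a leaf subset."""
    v = np.zeros(n)
    v[list(leaves)] = 1.0
    return np.outer(v, v)


def nnls_by_enumeration(A, b):
    """Nonnegative least squares for tiny problems: try every support,
    solve ordinary least squares on it, keep the best feasible fit."""
    m = A.shape[1]
    best_x, best_res = np.zeros(m), float(b @ b)  # all-zero solution
    for k in range(1, m + 1):
        for support in combinations(range(m), k):
            cols = list(support)
            coef, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
            if np.all(coef >= -1e-12):
                x = np.zeros(m)
                x[cols] = np.clip(coef, 0.0, None)
                res = float(np.sum((A @ x - b) ** 2))
                if res < best_res:
                    best_x, best_res = x, res
    return best_x, best_res


def project_three_leaves(S):
    """Nearest 3-leaf tree-structured covariance to S (squared Frobenius
    distance). The topology is the choice of which pair forms a clade."""
    n = 3
    base = [frozenset(range(n))] + [frozenset([i]) for i in range(n)]
    best = None
    for pair in combinations(range(n), 2):
        clades = base + [frozenset(pair)]
        A = np.column_stack([indicator_outer(n, c).ravel() for c in clades])
        lengths, res = nnls_by_enumeration(A, S.ravel())
        if best is None or res < best[2]:
            best = (pair, dict(zip(clades, lengths)), res)
    return best


# Hypothetical sample covariance generated by a tree in which
# objects 0 and 1 share an internal branch of length 1.0.
S = np.array([[1.8, 1.5, 0.5],
              [1.5, 1.9, 0.5],
              [0.5, 0.5, 1.1]])
pair, lengths, residual = project_three_leaves(S)
```

Explicit enumeration of this kind grows very quickly with the number of objects, which is the motivation for letting integer variables encode the topology and handing the search to a branch-and-bound solver.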
Our motivation for this method is the discovery of phylogenetic structure directly from
gene expression data. Recent studies have adapted traditional phylogenetic comparative anal-
ysis methods to expression data. Typically, these methods first estimate a phylogenetic tree
from genomic sequence data and subsequently analyze expression data. A covariance matrix
constructed from the sequence-derived tree is used to correct for the lack of independence in phy-
logenetically related taxa. However, recent results have shown that the hierarchical structure of
sequence-derived tree estimates is highly sensitive to the genomic region chosen to build them.
To circumvent this difficulty, we propose a stable method for deriving tree-structured covariance
matrices directly from gene expression as an exploratory step that can guide investigators in
their modelling choices for these types of comparative analysis.
We present a case study in phylogenetic analysis of expression in yeast gene families. Our
method is able to corroborate the presence of phylogenetic structure in the response of expression
in a subset of the gene families under particular experimental conditions. Additionally, when
used in conjunction with transcription factor occupancy data, our methods show that alternative
modelling choices should be considered when creating sequence-derived trees for this comparative