A Framework For Low-Level Data Fusion

This chapter discusses a framework for low-level data fusion, focusing on the analysis of multiple data sets measured on the same biological system. It emphasizes the importance of data integration and model-based data fusion, distinguishing between symmetric and asymmetric methods, and outlines various goals and methodologies for achieving effective data fusion. The chapter also provides motivating examples from fields such as microbial metabolomics and medical biology to illustrate the application of these methods.

CHAPTER 2

A Framework for Low-Level Data Fusion

Age K. Smilde*,1, Iven Van Mechelen§
* Biosystems Data Analysis, Swammerdam Institute for Life Sciences,
University of Amsterdam, Amsterdam, The Netherlands; § Research Group
on Quantitative Psychology and Individual Differences, KU Leuven,
Leuven, Belgium
1 Corresponding author

1. INTRODUCTION AND MOTIVATION

In this chapter we describe systematic ways to analyze multiple data
sets or data blocks simultaneously measured on the same system. Our
examples are from the field of (systems) biology, but the methods and
frameworks discussed are generic. The applications of these types of
analyses are wide ranging, from medical biology [1-3] to microbial
biology [4,5] to plant biology [6] and studies in mice [7,8] and rats [9].
This chapter draws heavily from earlier publications of the authors
[10,11].

1.1 Data Integration


Owing to the abundant availability of multiple data sets or data blocks
that have been measured on the same biological system, there is a
growing need to analyze and visualize such data blocks simultaneously
to arrive at a global understanding of the system in question. This
is generally called data integration or data fusion. Data integration can
comprise many things. One of the most basic ways of integrating data is
based on simple descriptive statistics such as correlations [12,13].
Association networks are very popular in this regard [14]. A more
demanding way of integrating data relies on models; we will refer
to this endeavor as model-based data fusion. The models underlying
model-based data fusion can be either ad hoc structures (often associated
with a primary goal of mere data reduction) or structures rooted in
substantive theories (such as biological accounts). Examples of the latter
include genome-scale models [15], whole body models [16], and models
of parts of a system. All forms of data integration or data fusion
can also be combined with clever or advanced types of visualization, as
exemplified by charts of association networks and visual representations
of genome-scale models.

Data Fusion Methodology and Applications
https://doi.org/10.1016/B978-0-444-63984-4.00002-8
Copyright © 2019 Elsevier B.V. All rights reserved.

1.2 Model-Based Data Fusion


Within the area of model-based data fusion, several distinctions can be
drawn. A first distinction one can make is between symmetric fusion and
predictive or asymmetric fusion, with one or several data blocks being
used to predict an outcome in the latter. Examples of asymmetric data
fusion methods are multiblock-PLS (partial least squares) [17-19],
PO-PLS (parallel and orthogonalised partial least squares) [20], and
SO-PLS (sequential and orthogonalised partial least squares) [21].
Latent-path models are special cases in this respect and are widely used
in management and food applications [22]. We will not discuss these types
of asymmetric fusion methods in this chapter.
Symmetric data fusion considers all data sets as taking the same roles
and as having equal importance. There is neither an importance hierarchy
nor a distinction between criterion (response or dependent variable) and
predictor (explanatory or independent variable) data blocks; in other
words, the blocks are exchangeable. Within symmetric data fusion, a
second distinction is on what level the fusion takes place. High-level data
fusion models each data set separately and takes all modeling results and
combines these, e.g., using majority voting schemes [23]. Midlevel fusion
first subjects each data set to some kind of preprocessing (such as a form
of variable selection) and then uses low-level methods to fuse the
preprocessed data. Low-level fusion fuses the different data sets without a
prior preprocessing of each of them. We focus in this chapter on low-level
data fusion.

1.3 Goals of Data Fusion


Many goals of model-based data fusion can be envisaged. In current
practice, these goals are usually implicit. By making them explicit one can
start looking for custom-made models and tools for data analysis.
A recurring element in goals of model-based data fusion is that many
of them pertain to differences or heterogeneity in each of the data blocks
under study. As an example, one may think of between-person or indi-
vidual differences, which are highly relevant in precision medicine and
nutritional interventions. Fusing data may help to find such differences
(e.g., in terms of to-be-estimated model parameters) and thereby facilitate
population stratification.
Four specific types of goals with regard to within-block differences can
be distinguished. A first one is purely exploratory in nature. The idea here
is to simply chart or describe the heterogeneity under study, possibly
making use of proper visualization tools. A second type of goal is to
capture some particular aspects of heterogeneity in each data set and to
subsequently look for a synthesis or consensus of these aspects across all
data sets. A third type of goal, which is becoming more and more popular,
is separating common versus distinctive sources of variation in the
respective data sets or data blocks. Such a separation may greatly simplify
a subsequent interpretation of the results. We describe these types of
applications in more detail in Section 4.
The three types of goals as explained earlier pertain to the study of
within-block heterogeneity or differences in themselves, whereas a fourth
type of goal pertains to the study of within-block differences in relation to
a known covariate. As an example, one may think of a covariate that
represents treatment condition (treatment vs. control, or different active
treatment conditions), as in data regarding an intervention or clinical trial.
In that case, one may wish to identify treatment effects, both in terms of
global effects for all data blocks as a whole and in terms of distinctive
effects for the different individual data blocks separately.

1.4 Motivating Examples


Some examples will serve to motivate discussing low-level data fusion
methods. The first example is from the field of microbial metabolomics
[24] and deals with integrating gas chromatography-mass spectrometry
(GC-MS) and liquid chromatography (LC)-MS data of the same microbial
system. The structure of this data set is visualized in Fig. 2.1 and can be
considered a case of two coupled two-way two-mode data blocks that are
connected through a common fermentation mode. The research question
is now to arrive at a global view of the metabolism given the two different
sets of measurements and what drives the differences in the fermentation
process. Hence, this is an example of exploratory analysis.

FIGURE 2.1 Coupled GC-MS and LC-MS data along the sampling mode.
[Two blocks, LC-MS (Metabolites 1) and GC-MS (Metabolites 2), sharing
the row mode "Experimental conditions."]

The second example is from the field of medical biology: the effects of
gastric bypass surgery on obese and diabetic subjects [3]. A total of 14
obese patients with diabetes mellitus type II (DM2) underwent gastric
bypass surgery, and blood samples were taken 4 weeks before and
3 weeks after surgery; on each occasion samples were taken both before
and after a meal. The blood samples were then analyzed on multiple
analytical platforms for the determination of amines, lipids, and
oxylipins. Clearly, there is an experimental design underlying the shared
sampling mode (see Fig. 2.2) and thus the goal is to establish (common
and distinct) treatment effects in the different data sets.
The third example is from analytical chemistry applied to resolving
mixtures of chemical compounds into their underlying spectra and
concentrations. This is visualized in Fig. 2.3 where nuclear magnetic
resonance (NMR) and LC-MS measurements are performed on the same
set of samples. Owing to the combined modeling, concentrations and
pure spectral profiles can be obtained for both the NMR and LC-MS data
[25,26].
We will make use of many figures like the ones in Figs. 2.1, 2.2, and 2.3.
We always depict matrices with the first (row) mode pertaining to the
sampling mode (each row of a matrix corresponds to a sample) and the
second (column) mode pertaining to the variables (each column repre-
sents a variable).

FIGURE 2.2 Amines, lipids, and oxylipins measured repeatedly on the same
persons. The data block "Design" contains the encoding of the underlying
experimental design (see text for details). [Four blocks, Design, Amines,
Lipids, and Oxylipins, sharing the row mode "Experimental conditions."]

FIGURE 2.3 NMR and LC-MS measurements performed on the same set of
mixtures of chemical compounds. [Two blocks, NMR (Chemical shifts) and
LC-MS (Features), sharing the row mode "Mixtures."]

2. DATA STRUCTURES

When discussing data fusion methods, a central notion is the idea of
coupled data. Without any coupling of the data, data-driven fusion is not
possible. There are different multiset data structures with corresponding
coupling characteristics.
The first case pertains to data coupled in the sampling mode. The
examples given earlier are all of this type. This is exemplified in
Fig. 2.4 by placing the sets of data next to each other: the measurements
(variables 1, 2, 3) are obtained on the same samples. Another possibility
is coupling along the variable mode, as visualized in Fig. 2.5, where now
the same variables are measured on different sets of samples. Hybrid cases
are also possible, as shown in Fig. 2.6, and of course there are many more
situations.

FIGURE 2.4 Coupling along the sampling mode. [Blocks Variables 1,
Variables 2, and Variables 3 placed side by side, sharing the row mode
"Samples."]

FIGURE 2.5 Coupling along the variables mode. [Blocks Samples 1,
Samples 2, and Samples 3 stacked on top of each other, sharing the column
mode "Variables."]

One special case is worth mentioning, namely when data sets share both
modes. Such structures are called multiway structures, and special
methods are in place for multiway analysis [27]. In this chapter, we
illustrate our framework with data coupled in the sampling mode, that is,
situations as shown in Fig. 2.4.
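
The two basic coupling types can be made concrete with a small sketch. The block names and sizes below are made up for illustration and do not correspond to any data set in this chapter; the only point is which mode the blocks have in common.

```python
import numpy as np

rng = np.random.default_rng(0)

# Coupling along the sampling mode (Fig. 2.4):
# the same I samples, different sets of variables per block.
I = 6
X1 = rng.normal(size=(I, 4))        # block 1: I x J1
X2 = rng.normal(size=(I, 3))        # block 2: I x J2
X_samples = np.hstack([X1, X2])     # I x (J1 + J2), blocks side by side

# Coupling along the variable mode (Fig. 2.5):
# the same J variables, different sets of samples per block.
J = 5
Y1 = rng.normal(size=(4, J))        # block 1: I1 x J
Y2 = rng.normal(size=(7, J))        # block 2: I2 x J
Y_variables = np.vstack([Y1, Y2])   # (I1 + I2) x J, blocks stacked

print(X_samples.shape)    # (6, 7)
print(Y_variables.shape)  # (11, 5)
```

Concatenation is used here only to show the shared mode; the framework of the next section does not require the blocks to be physically concatenated.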

3. FRAMEWORK FOR LOW-LEVEL DATA FUSION


In this section, we describe a novel model for data fusion [11]. This
model is generic in that it subsumes a very broad range of specific models
(both existing ones and ones yet to be developed) as special cases. The
generic model is a global model for the full set of coupled data
blocks. This global model consists of (1) a submodel for each data
block that accounts for the individual data entries in that block, along
with (2) a linking structure between these submodels. We first outline
those two aspects in the following text. Subsequently, we describe a few
existing examples of our generic proposal.

FIGURE 2.6 Partly coupled along variables and samples. [Labels:
Variables 1, Variables 2 (columns); Samples 1, Samples 2 (rows).]

3.1 Submodel per Data Block


The first ingredient of our framework is a submodel for each data block
as described in more detail elsewhere [28]. This submodel is made of two
parts: quantifications of the modes per data block and association rules
that define how these quantifications can be combined to model each
block. We denote the different blocks by $X_b$, $b = 1, \ldots, B$, with
sizes $I \times J_b$ for block $X_b$. We will describe each of those two
ingredients in more detail and use a two-way two-block case as a guiding
example.

3.1.1 Quantifications of Data Block Modes


The first constituent of the submodel is a quantification of the two
modes of the two blocks of data. Such quantifications can be seen as
reductions of the modes in question. For our example (fusing two blocks
of data), this is visualized in Fig. 2.7. The matrix $X_1$ has quantifiers
$A_1$ ($I \times P_1$) and $B_1$ ($J_1 \times Q_1$) for the first and
second modes, respectively. Likewise, there are quantifiers for the second
block: $A_2$ ($I \times P_2$) and $B_2$ ($J_2 \times Q_2$).
There are many alternatives for choosing quantifiers. When the
quantification matrix $A_1$ is real valued, it implies a representation of
the rows of $X_1$ as points in a low-dimensional (i.e., a $P_1$-dimensional)
space. Similar interpretations are available for the other quantification
matrices. Note that the dimensionalities of the quantification matrices for
the same block do not have to be equal: $P_1$ is not necessarily equal to
$Q_1$. The flexibility of using such quantifiers is further illustrated by
choosing $A_1$ to be the identity matrix (of size $I \times I$), resulting
in no reduction of the first mode of $X_1$. A special case exists for data
in which the two modes of the two-way data coincide (so-called one-mode
two-way data), such as in distance matrices: then the quantifiers for both
ways should be chosen to be the same.

FIGURE 2.7 Idea of a linking structure. [Blocks X1 and X2 with their
quantifiers A1, B1 and A2, B2; a linking function L(.) connects A1 and A2.]


3.1.2 Block-Specific Association Rule


Next to the quantifiers, it is necessary to define rules that associate the
quantifiers to model the different blocks. Again, there are many
alternatives, and the generic scheme for the first data block can be
written as:
\[
X_1 = f(A_1, B_1, W_1) + E_1, \tag{2.1}
\]
with $E_1$ denoting a matrix of residuals, $W_1$ ($P_1 \times Q_1$)
denoting a core matrix, and the function $f$ defining a mapping. This rather
abstract representation serves to show the flexibility, and a few examples
are given to clarify the concept of an association rule. For the other
block(s) similar rules can be made, but these association rules do not
necessarily have to be equal across all blocks.
The simplest case is to choose $W_1$ to be the identity matrix of size
($P_1 \times P_1$) (which implicitly assumes that $P_1 = Q_1$) and the
function $f$ to be the outer product of $A_1$ and $B_1$, that is,
$f(A_1, B_1) = A_1 B_1^T$. When minimizing the sum of squared
residuals in $E_1$ one obtains the familiar principal component model for
$X_1$. In the above-mentioned case of distance matrices and $W_1$ an
identity matrix, Eq. (2.1) reduces to:
\[
f(A_1, B_1)_{ij} = \left[ \sum_{p_1 = 1}^{P_1} \left( a_{ip_1} - b_{jp_1} \right)^2 \right]^{1/2}, \tag{2.2}
\]
which is known by the name multidimensional unfolding in the
psychological literature. Other association rules are discussed in [28].
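
The two association rules just mentioned can be sketched as follows. This is only an illustration under our own assumptions: the function names bilinear_rule and distance_rule are hypothetical, the data are random, and the principal component fit is obtained here with a truncated singular value decomposition.

```python
import numpy as np

def bilinear_rule(A, B, W=None):
    """f(A, B, W) = A W B^T; with W the identity (P = Q) this is A B^T."""
    if W is None:
        W = np.eye(A.shape[1])
    return A @ W @ B.T

def distance_rule(A, B):
    """f(A, B)_ij = Euclidean distance between row i of A and row j of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))
X = X - X.mean(axis=0)                 # column-centering

# Least-squares fit of the bilinear rule with P components (truncated SVD)
P = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
A = U[:, :P] * s[:P]                   # quantifier of the row mode, I x P
B = Vt[:P].T                           # quantifier of the column mode, J x P
E = X - bilinear_rule(A, B)            # residuals of the principal component model

# The same quantifiers combined through the distance rule of Eq. (2.2)
D = distance_rule(A, B)                # I x J matrix of distances
```

The residual sum of squares of the bilinear fit equals the sum of the squared discarded singular values, which is a quick check that the fit is the least-squares one.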

3.2 Linking Structure Between Different Submodels


The final ingredient of our framework ties the blocks together through
linking functions. For the two-block two-way case considered earlier we
have the two submodels for the two blocks:
\[
\begin{aligned}
X_1 &= f_1(A_1, B_1, W_1) + E_1 \\
X_2 &= f_2(A_2, B_2, W_2) + E_2
\end{aligned} \tag{2.3}
\]
and the linking should now be done by making assumptions about $A_1$
and $A_2$ because these represent the quantifiers for the shared mode, the
sample mode. This idea is also visualized in Fig. 2.7.
In our generic model, this mode sharing is captured through
constraints on the quantification matrices of the shared modes; these
constraints can be conceived as representing the linking structure of the
model. There are many alternatives for linking structures, and some of
them are shown in Fig. 2.8.
In principle, a broad range of constraints could be considered as
linking structure. The simplest of them is an identity constraint, which
is also used in most cases of low-level data fusion (see Fig. 2.8A). Such a
constraint simply implies that a shared mode is given the same
quantification in all submodels in which it shows up. The global model for
such a linking structure would be:
\[
\begin{aligned}
X_1 &= f_1(A, B_1, W_1) + E_1 \\
X_2 &= f_2(A, B_2, W_2) + E_2
\end{aligned} \tag{2.4}
\]
where the identity constraint becomes visible through the fact that the
quantification matrix $A$ no longer bears a block-specific subscript.

FIGURE 2.8 Different linking structures. For explanation, see text.

Rather than a full identity constraint on the quantification matrices of
shared modes, Fig. 2.8 also shows two partial identity constraints. The
first of these reads that a number of columns of the quantification
matrices of a shared mode are constrained to be identical, whereas other
columns are left unconstrained; through such a partial identity
constraint one may wish to capture both commonalities in the structures
of the linked data blocks (in terms of the identical quantification
columns) and distinctive aspects (in terms of the unconstrained columns)
(see Fig. 2.8C). This type of linking structure will be explained in more
detail in Section 4.
A second partial identity constraint reads that quantification matrices
of a shared mode are constrained to be identical with regard to the vast
majority of their rows (see Fig. 2.8D). This means that, for the vast
majority of the elements of the mode involved, but not for all, the
quantifications have to be the same (with elements that require different
quantifications having to be identified during the data-analytic process).
A special type of linkage structure may be needed if one of the shared
modes is the time mode (see Fig. 2.8B). In such cases, the linking structure
may have to account for lags in dynamics (indicated by the symbol s).
This is, for instance, the case if measurements of metabolites are
performed in blood and urine, where usually the metabolite appears
earlier in the blood.
We give examples of two other possible linkage structures. The first of
these pertains to the case of binary quantification matrices (which can be
conceived as membership matrices in some clustering). A constraint on
such matrices could read that the clustering as implied by the first
quantification matrix is nested in the second. As a special case of this, in
case one would consider partitioning matrices only, a nestedness
constraint would imply that the first partitioning is a refinement of the
second (i.e., the first partitioning then is to be obtained by splitting a
number of classes of the second one). As a second possibility, in case of
real-valued quantification matrices, one may require two quantifications
of the same mode to be in a space-subspace relation.
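
For partitioning matrices, the nestedness constraint can be checked directly from the membership matrices. The sketch below is ours; the helper names membership_matrix and is_refinement and the small label vectors are hypothetical and serve only to illustrate what "the first partitioning is a refinement of the second" means.

```python
import numpy as np

def membership_matrix(labels, n_clusters):
    """Binary quantification matrix: row i has a 1 in the column of its cluster."""
    M = np.zeros((len(labels), n_clusters), dtype=int)
    M[np.arange(len(labels)), labels] = 1
    return M

def is_refinement(M_fine, M_coarse):
    """True if every cluster of M_fine lies inside a single cluster of M_coarse."""
    overlap = M_fine.T @ M_coarse            # cluster-by-cluster co-occurrence counts
    return bool(np.all((overlap > 0).sum(axis=1) <= 1))

# Six elements of the shared mode, partitioned in three different ways
coarse = membership_matrix([0, 0, 0, 1, 1, 1], 2)
fine = membership_matrix([0, 0, 1, 2, 2, 3], 4)    # splits both coarse classes
other = membership_matrix([0, 1, 1, 1, 2, 2], 3)   # one class straddles both coarse classes

print(is_refinement(fine, coarse))    # True
print(is_refinement(other, coarse))   # False
```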

3.3 Examples of the Framework


In chemometrics, SUM-PCA (or Consensus-PCA) is a much used data
fusion method [18]. Related methods in psychometrics are multiple
factor analysis (MFA; [29]) and STATIS [30]. The exact relationships
between these methods have been published elsewhere [31] and will
not be repeated here. All methods fit within our generic framework.
This is illustrated here for our two-block case. Assuming that
appropriate preprocessing has taken place per data block, SUM-PCA
assumes that:
\[
\begin{aligned}
q_{m1} X_1 &= A B_1^T + E_1 = f(A, B_1) + E_1 \\
q_{m2} X_2 &= A B_2^T + E_2 = f(A, B_2) + E_2,
\end{aligned} \tag{2.5}
\]
where, depending on the specific method, different weights $q_{mk}$ are
assigned to the data blocks (see [31] for details). This is an example of a
data fusion model with an identity link where the blocks $X_1$ and $X_2$
have the sampling mode in common. It can easily be generalized to more than
two blocks of data. An example of the use of this method can be found in
[9], where metabolomics and gene expression data are coupled in a
toxicology experiment. Also, our example in Section 5.1 is of this kind,
using the MFA method.
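
A model of the form of Eq. (2.5) can be fitted in the least-squares sense by a singular value decomposition of the weighted, concatenated blocks. The following is a minimal sketch under our own assumptions: the function name fuse_identity_link is hypothetical, the data are random, and the block weights (scaling each block to unit Frobenius norm) are chosen only for illustration and are not the weights of SUM-PCA, MFA, or STATIS.

```python
import numpy as np

def fuse_identity_link(blocks, weights, n_components):
    """Common scores A and block loadings B_k with w_k X_k approximated by A B_k^T."""
    Xc = np.hstack([w * X for w, X in zip(weights, blocks)])
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = U[:, :n_components] * s[:n_components]       # shared quantifier of the sampling mode
    B = Vt[:n_components].T                          # loadings of all blocks, stacked
    splits = np.cumsum([X.shape[1] for X in blocks])[:-1]
    return A, np.split(B, splits, axis=0)            # one loading matrix per block

rng = np.random.default_rng(2)
I = 20
X1 = rng.normal(size=(I, 6))
X2 = rng.normal(size=(I, 4))
X1 -= X1.mean(axis=0)
X2 -= X2.mean(axis=0)

# Illustrative weights only: scale each block to unit Frobenius norm
weights = [1 / np.linalg.norm(X1), 1 / np.linalg.norm(X2)]

A, (B1, B2) = fuse_identity_link([X1, X2], weights, n_components=3)
E1 = weights[0] * X1 - A @ B1.T      # block-specific residuals
E2 = weights[1] * X2 - A @ B2.T
```

The identity link shows up in the code as a single score matrix A that is reused for both blocks.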
Simultaneous component analysis (SCA) already has a long history in
psychometrics [32,33]. It was developed for cases in which the same set of
variables has been measured in different sets of samples, for example,
stemming from different cultures. Assuming a suitable preprocessing, the
basic SCA model (SCA-P) for a two-block case can be cast as follows
within our generic framework:
\[
\begin{aligned}
X_1 &= A_1 B^T + E_1 = f(A_1, B) + E_1 \\
X_2 &= A_2 B^T + E_2 = f(A_2, B) + E_2,
\end{aligned} \tag{2.6}
\]
where again an identity link is used. This example shows that coupling
along the variables mode (see Fig. 2.5) can also be cast in our framework.
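
As a sketch of Eq. (2.6), the blocks can be stacked along the sample mode and decomposed once, so that the loadings B are shared and the scores are block specific. The data and sizes below are made up, and fitting by a truncated singular value decomposition of the stacked matrix is our choice for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
J, R = 5, 2
X1 = rng.normal(size=(10, J))          # first set of samples
X2 = rng.normal(size=(15, J))          # second set of samples, same variables
X1 -= X1.mean(axis=0)
X2 -= X2.mean(axis=0)

X = np.vstack([X1, X2])                # (I1 + I2) x J
U, s, Vt = np.linalg.svd(X, full_matrices=False)
B = Vt[:R].T                           # shared loadings, J x R
A = U[:, :R] * s[:R]                   # stacked scores
A1, A2 = A[:10], A[10:]                # block-specific scores

E1 = X1 - A1 @ B.T
E2 = X2 - A2 @ B.T
```

Because the loadings are orthonormal in this fit, the block scores are simply the projections of each block on B.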
Two members of the SCA family are further worth mentioning, because
they have been used in several fields of science: multilevel SCA (MSCA), as
used in psychometrics, functional genomics, and process chemometrics
[34-37], and ANOVA-SCA (ASCA [36,38-40]), as used in functional
genomics. We start by discussing ASCA, making use of the notation in
ASCA applications [35]. The typical background of ASCA is a set of
designed experiments in which functional genomics data are collected
from several subjects exposed to a treatment (k) and measured over time
(t). Assuming that proper preprocessing has been done, each block $X_b$
contains the measurements performed for treatment group $b$ and can be
modeled as:
\[
X_b = \mathbf{1} t_b^T P_1^T + T_t P_2^T + T_{bt} P_3^T + E_b, \tag{2.7}
\]
where $\mathbf{1}$ is a vector of ones of the proper order; $t_b$, $T_t$,
and $T_{bt}$ contain the scores representing the samples; $P_1$, $P_2$, and
$P_3$ contain the loadings; and $E_b$ contains the residuals. The matrices
$T_t$ and $T_{bt}$ have a specific structure, which is not important for
this chapter and which can be found elsewhere [40]. On rewriting Eq. (2.7),
we obtain:
\[
\begin{aligned}
X_b &= \left[ \mathbf{1} t_b^T \;\; T_t \;\; T_{bt} \right] \left[ P_1 \;\; P_2 \;\; P_3 \right]^T + E_b \\
    &= A_b B^T + E_b,
\end{aligned} \tag{2.8}
\]
which clearly is a special case of SCA-P.
The basic equation for the MSCA model reads as:
\[
X_b = \mathbf{1} t_b^T P_1^T + T_{bw} P_2^T + E_b, \tag{2.9}
\]
assuming again proper preprocessing. The block $X_b$ now represents an
individual subject (or object) measured, for example, over time (t). Then
$\mathbf{1} t_b^T P_1^T$ represents the between-subject variation and
$T_{bw} P_2^T$ the within-subject variation. Eq. (2.9) clearly shows that
MSCA is a special case of ASCA. Hence, MSCA also fits within our generic
framework.
An example of combining a PCA model for a two-way block $X_1$ and a
PARAFAC model for a three-way block $X_2$, both sharing the sampling
mode, reads:
\[
\begin{aligned}
X_1 &= A B_1^T + E_1 = f_1(A, B_1) + E_1 \\
X_2 &= A (B_2 \odot C_2)^T + E_2 = f_2(A, B_2, C_2) + E_2,
\end{aligned} \tag{2.10}
\]
where $B_2$ and $C_2$ are loading matrices pertaining to the second and
third modes of the properly matricized three-way array $X_2$ and $\odot$ is
the symbol for the Khatri-Rao product [27]. The area of analyzing multiway
arrays is already covered in some textbooks [27,41], and extensions of
fusing data sets in which multiway arrays are involved are also available
[25,26,42,43].
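
The second line of Eq. (2.10) can be verified numerically on a noise-free example. The sketch below builds a three-way array from PARAFAC factors, matricizes it, and compares it with the Khatri-Rao expression. The helper khatri_rao is our own small implementation, and the ordering of the columns (third mode varying fastest) is an assumption that has to match the way the array is matricized.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: column r equals kron(B[:, r], C[:, r])."""
    J, R = B.shape
    K, R2 = C.shape
    assert R == R2, "B and C need the same number of columns"
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

rng = np.random.default_rng(4)
I, J, K, R = 5, 3, 4, 2
A = rng.normal(size=(I, R))       # shared sample-mode quantifier
B2 = rng.normal(size=(J, R))      # second-mode loadings
C2 = rng.normal(size=(K, R))      # third-mode loadings

# PARAFAC structure: x_ijk = sum_r a_ir * b_jr * c_kr
X2_array = np.einsum('ir,jr,kr->ijk', A, B2, C2)

# Matricize to I x (J*K), with the third-mode index varying fastest
X2 = X2_array.reshape(I, J * K)

# Matricized model of Eq. (2.10) without residuals
X2_model = A @ khatri_rao(B2, C2).T
```

With a different unfolding convention the roles of B2 and C2 in the Khatri-Rao product are swapped, so the convention should be fixed before comparing results across software.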

4. COMMON AND DISTINCT COMPONENTS

4.1 Generic Model for Common and Distinct Components


This section leans heavily on previous work [10]. The two spaces
spanned by the columns of $X_1$ and $X_2$, $R(X_1)$ and $R(X_2)$, are
located in the same $I$-dimensional column-space $\mathbb{R}^I$; see
Fig. 2.9 for an illustration in three-dimensional space. Each variable is a
vector in this coordinate system indicating the level of that variable for
each sample (row). These variables are not explicitly shown in this figure
but lie within the space indicated by the blue and green column-spaces.

FIGURE 2.9 The $I$-dimensional space having $R(X_1)$ (blue; dark gray in
print version) and $R(X_2)$ (green; light gray in print version) as
subspaces. Only three axes of this $I$-dimensional space are drawn. The red
line (gray in print version) $X_{12C}$ represents the common subspace. For
the sake of illustration the dimensions of both column-spaces are equal
(two). This is not necessarily always the case. [Axis labels: Row 1, Row 2.]

If the two column-spaces intersect nontrivially (the zero vector is always
shared), then the intersection space is called the common space. In
Fig. 2.9, there is only one common direction (i.e., the common space is one
dimensional), but there can be more dimensions or none. The common
subspace will be called $R(X_{12C})$, where the subscript C stands for
“Common.” Note that $R(X_{12C}) \subseteq R(X_1)$ and
$R(X_{12C}) \subseteq R(X_2)$. The common part of the two blocks will in
most cases not span the whole of $R(X_1)$ and $R(X_2)$.
Some definitions regarding the rest of these spaces are therefore needed.
These subspaces representing the rest after identification of the common
part will be called “distinct” subspaces. The requirement is that the space
spanned by the columns in a block $X_b$ ($b = 1, 2$) is a direct sum of the
common space and the distinct space within that block. Hence, these two
parts within a block are linearly independent (two subspaces are linearly
independent if no nonzero vector in one subspace can be written as a linear
combination of the vectors of the other and vice versa). These subspaces
are called $R(X_{1D})$ and $R(X_{2D})$, where the subscript D stands for
“Distinct.”
There are several possibilities for selecting subspaces to be orthogonal
to each other. One option is to select $R(X_{1D})$ and $R(X_{2D})$ to be
orthogonal to $R(X_{12C})$. Another option is to select $R(X_{1D})$
orthogonal to $R(X_{2D})$ and, of course, it is also possible to not impose
orthogonality at all. Whether to impose orthogonality, and of which type,
depends on the application.
What we have accomplished now is decomposing $R(X_1)$ and $R(X_2)$ into
direct sums of spaces:
\[
\begin{aligned}
R(X_1) &= R(X_{12C}) \oplus R(X_{1D}) \\
R(X_2) &= R(X_{12C}) \oplus R(X_{2D})
\end{aligned} \tag{2.11}
\]
because $R(X_{12C}) \cap R(X_{1D}) = \{0\}$ and
$R(X_{12C}) \cap R(X_{2D}) = \{0\}$ [44]. Hence, it also holds that:
\[
\begin{aligned}
\dim R(X_1) &= \dim R(X_{12C}) + \dim R(X_{1D}) \\
\dim R(X_2) &= \dim R(X_{12C}) + \dim R(X_{2D})
\end{aligned} \tag{2.12}
\]
If the distinct-orthogonal-to-common option is chosen, then
$R(X_{12C}) \perp R(X_{1D})$ and $R(X_{12C}) \perp R(X_{2D})$. Note that,
for this case, given the common space, the decomposition is unique because
then $R(X_{1D})$ is the orthogonal complement of $R(X_{12C})$ within
$R(X_1)$ and likewise for $R(X_{2D})$ (although the basis within each
subspace is not unique if its dimension is higher than one). In the
nonorthogonal case, the distinct part can be defined by any set of linearly
independent vectors that are in the original spaces but not in the common
space. For a thorough description of direct sums of spaces, see [45].
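
For noise-free data, the common space and a distinct space can be computed explicitly. The sketch below is our own illustration and not a method from the cited work: it finds the intersection of two column-spaces through principal angles (cosines equal to one indicate shared directions) and then takes the distinct-orthogonal-to-common option. The function names, tolerances, and the constructed data (one common and one distinct direction per block) are assumptions made for the example.

```python
import numpy as np

def orth_basis(X, tol=1e-10):
    """Orthonormal basis of the column-space R(X)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, s > tol * s[0]]

def common_subspace(X1, X2, tol=1e-8):
    """Orthonormal basis of the intersection of R(X1) and R(X2)."""
    Q1, Q2 = orth_basis(X1), orth_basis(X2)
    U, cosines, _ = np.linalg.svd(Q1.T @ Q2)      # cosines of the principal angles
    n_common = int(np.sum(cosines > 1 - tol))
    return Q1 @ U[:, :n_common]

rng = np.random.default_rng(5)
I = 10
c, d1, d2 = rng.normal(size=(3, I))               # one common, two distinct directions
X1 = np.column_stack([c, d1]) @ rng.normal(size=(2, 4))   # rank 2, four variables
X2 = np.column_stack([c, d2]) @ rng.normal(size=(2, 5))   # rank 2, five variables

C = common_subspace(X1, X2)                       # basis of R(X12C), here I x 1
Q1 = orth_basis(X1)
D1 = orth_basis(Q1 - C @ (C.T @ Q1))              # R(X1D), orthogonal to the common part

# Dimension check of Eq. (2.12): dim R(X1) = dim R(X12C) + dim R(X1D)
print(Q1.shape[1], C.shape[1], D1.shape[1])       # 2 1 1
```

With real, noisy data the column-spaces generally intersect only in the zero vector, so an exact intersection as computed here is not available and the common part has to be estimated, as in the method of the next section.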

4.2 DISCO (Distinct and Common Components)


We will illustrate our generic model with the DISCO-SCA (or DISCO,
for short) method [46,47]. The first step in DISCO is to solve an SCA
problem to find scores $\tilde{A}$ ($I \times R$) and loadings $\tilde{B}$
($(J_1 + J_2) \times R$) of the concatenated matrix $[X_1 \mid X_2]$. Note
that we now use the SCA method for a shared sampling mode. The loading
matrix $\tilde{B}$ can be partitioned into $\tilde{B}_1$ ($J_1 \times R$)
and $\tilde{B}_2$ ($J_2 \times R$). Subsequently, the matrix $\tilde{B}$ is
orthogonally rotated to a simple structure reflecting distinct and common
components. For the sake of illustration, assume that $R = 3$; there is one
common and two distinct components (one for each block). Then $\tilde{B}$
is orthogonally rotated to a structure $B_{\text{target}}$ according to:
\[
\min_{Q^T Q = I} \left\| V * \left( \tilde{B} Q - B_{\text{target}} \right) \right\|^2 \tag{2.13}
\]
where $V$ is a matrix of zeros and ones selecting the elements across which
the minimization occurs, the symbol $*$ indicates the Hadamard or
elementwise product, and
\[
B_{\text{target}} =
\begin{bmatrix}
x & 0 & x \\
x & 0 & x \\
x & 0 & x \\
0 & x & x \\
0 & x & x \\
0 & x & x \\
0 & x & x
\end{bmatrix} \tag{2.14}
\]
where the symbol $x$ means an arbitrary value, not necessarily zero, and
$B = \tilde{B} Q = [B_1^T \mid B_2^T]^T$. This will result in the first
component being distinct for $X_1$, the second component being distinct for
$X_2$, and the third component being the common one. After finding the
optimal $Q$, the scores $\tilde{A}$ are counterrotated, resulting in
$A = \tilde{A} Q = [a_1 \; a_2 \; a_3]$, and the following decomposition is
obtained:
\[
\begin{aligned}
X_1 &= A B_1^T + E_1 = a_1 b_{11}^T + a_2 b_{12}^T + a_3 b_{13}^T + E_1 \\
X_2 &= A B_2^T + E_2 = a_1 b_{21}^T + a_2 b_{22}^T + a_3 b_{23}^T + E_2
\end{aligned} \tag{2.15}
\]
where $b_{11}$ gives loadings for the distinct component for $X_1$;
$b_{22}$ for the distinct component for $X_2$; and $b_{13}$, $b_{23}$ for
the common component. DISCO has been used in metabolomics [47] and in gene
expression analysis [48].
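
The ingredients of Eqs. (2.13) to (2.15) can be set up as in the sketch below. It does not carry out the minimization over $Q$, which requires an iterative algorithm; it only evaluates the criterion for a given orthogonal matrix and shows that rotating the loadings and counterrotating the scores leaves the fitted part of the model unchanged. The data are random, the block sizes follow the illustration of Eq. (2.14), and our reading that $V$ has ones exactly at the zero positions of the target is an assumption.

```python
import numpy as np

rng = np.random.default_rng(6)
I, J1, J2, R = 12, 3, 4, 3
X = rng.normal(size=(I, J1 + J2))          # concatenated [X1 | X2], synthetic
X -= X.mean(axis=0)

# Step 1: SCA solution for the shared sampling mode (truncated SVD)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
A_tilde = U[:, :R] * s[:R]                 # scores, I x R
B_tilde = Vt[:R].T                         # loadings, (J1 + J2) x R

# Step 2: target structure; V marks the loadings that should become zero
V = np.zeros((J1 + J2, R))
V[J1:, 0] = 1       # component 1 distinct for X1: zero loadings in block 2
V[:J1, 1] = 1       # component 2 distinct for X2: zero loadings in block 1
B_target = np.zeros((J1 + J2, R))          # the x entries are unconstrained

def disco_criterion(Q):
    """Value of the rotation criterion of Eq. (2.13) for an orthogonal Q."""
    return np.sum((V * (B_tilde @ Q - B_target)) ** 2)

# An arbitrary orthogonal matrix (NOT the optimized rotation)
Q, _ = np.linalg.qr(rng.normal(size=(R, R)))

B = B_tilde @ Q                            # rotated loadings
A = A_tilde @ Q                            # counterrotated scores
B1, B2 = B[:J1], B[J1:]                    # block-wise loadings of Eq. (2.15)

print(disco_criterion(np.eye(R)), disco_criterion(Q))
```

Because $Q$ is orthogonal, the products of scores and loadings before and after rotation are identical, so the rotation changes the interpretation of the components but not the fit.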
5. EXAMPLES

We give two examples in this section. The first one is an example of our
generic framework for data fusion and concerns the goal of exploring
relationships between two data blocks. The second one is an example of
common and distinct components and how these are affected by a
covariate, which is a treatment effect.

5.1 Microbial Metabolomics Example


As already introduced in Section 1.4, the microbial metabolomics
example concerns the fermentation of Escherichia coli and the use of this
information to construct improved production strains. The E. coli strain was
cultivated under different environmental conditions, and the metabolomes
of the resulting 28 fermentation samples were analyzed using
a GC-MS and an LC-MS method, resulting in 144 and 44 metabolites,
respectively. Hence, the blocks as visualized in Fig. 2.1 consist of LC-MS
($28 \times 44$) and GC-MS ($28 \times 144$) data with an underlying
experimental design [24].
The two data blocks were preprocessed by log-transformation and
subsequent column-centering. Then they were subjected to an MFA,
which is an instantiation of our data fusion framework (see Section 3.3).
The weights put on the GC versus the LC block in the MFA were 0.66 and
0.34, respectively. The proportion of variance accounted for by the
successive MFA components is shown in Fig. 2.10. Based on this figure,
we choose five MFA components to analyze the data.
The common scores were Varimax rotated to a simple structure [49].
This resulted in components in which the first two had contributions from
GC and LC, the third and fifth mostly from LC, and the fourth from GC.
The corresponding loadings are shown in Fig. 2.11. Some interpretation is
given here and can be based on the experimental design. The first MFA
component seems to consist of metabolic processes related to oxygen
limitation and to the early stationary growth phase (after approximately
40 h of fermentation), which might be the result of oxygen stress. The
metabolites fumarate, malate, aspartate, α-ketoglutarate, and
2-hydroxyglutarate are grouped together in the fourth MFA component
and are all related to succinate metabolism; these metabolites are more
abundant in samples with succinate as carbon source. A detailed
interpretation of the results is given elsewhere [31].

FIGURE 2.10 Proportion of explained variances for the MFA solutions. The
bars represent variances within each block by MFA components and the curve
represents explained variance within a block by separate component
analysis. [Two panels, GC block and LC block; horizontal axis: component
(1 to 10); vertical axis: proportion of variance accounted for.]

5.2 Medical Biology Example


The data set is a subset of a larger study on the effects of gastric bypass
surgery on obese and diabetic subjects [3]. Here, we focus on 14 obese
patients with DM2 who underwent gastric bypass surgery. A description
of the data was already given in Section 1.4. The three data blocks, amines
(A), lipids (L), and oxylipins (O), consist of 14 subjects × 4 samples = 56
rows and 34, 243, and 32 variables, respectively. All variables in all three
blocks were square-root transformed, to obtain more evenly distributed
data. Individual differences between subjects were removed by sub-
tracting each subject’s average profile. All variables were then scaled to
unit variance. The blocks were also scaled to unit norm before SCA, to
normalize scale differences between blocks.
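
The preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions: "unit variance" is taken as the sample variance (ddof=1), "unit norm" as the Frobenius norm of the block, and the function name, subject coding, and random data are hypothetical.

```python
import numpy as np

def preprocess_block(X, subject_ids):
    """Square-root transform, remove subject means, autoscale, scale block to unit norm."""
    Xt = np.sqrt(X)                                # assumes non-negative measurements
    for s in np.unique(subject_ids):
        rows = subject_ids == s
        Xt[rows] -= Xt[rows].mean(axis=0)          # subtract the subject's average profile
    Xt = Xt / Xt.std(axis=0, ddof=1)               # each variable to unit variance
    return Xt / np.linalg.norm(Xt)                 # whole block to unit Frobenius norm

# Hypothetical amine block: 14 subjects x 4 samples = 56 rows, 34 variables.
rng = np.random.default_rng(0)
subject_ids = np.repeat(np.arange(14), 4)
raw_amines = rng.gamma(shape=2.0, size=(56, 34))
A = preprocess_block(raw_amines, subject_ids)
```

After the block scaling, every block contributes the same total sum of squares to the concatenated matrix, so the 243-variable lipid block does not dominate the SCA merely because of its size.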
FIGURE 2.11 Heatmap of the loadings of the five rotated MFA components. Only 130 of the
188 metabolites are shown; for the other metabolites the identity was unknown.

Selecting the dimensions of the subspaces is more complicated when
the number of blocks increases. In this three-block example, we need to
decide the dimensions of seven subspaces: X123C, X12C, X13C, X23C, X1D,
X2D, and X3D. For DISCO, we start by deciding the sum of all the
dimensions, i.e., the number of SCA components. Explained variance as a
function of components for SCA is given in Fig. 2.12. The curve of
cumulative variance does not have a clear bend, which makes it hard to
decide the cutoff between structure and noise. To allocate the common
and distinct components we need to fix the number of SCA components
and then compare the fit values of Eq. (2.13) using different target
matrices. The computations are time consuming, as there are, e.g., 462
possible target matrices for the five-component model. To illustrate the
complexity in selecting the dimensions for the subspaces, we have cal-
culated all possible rotations for models with three to five SCA compo-
nents. The fit values are very similar, making it hard to conclude which
rotation gives the best fit. Looking further into the actual rotated score
vectors, we discover that many of the models agree on some of the sub-
spaces. We choose to interpret the five-component model with fit value
0.24, the best five-component model, which contains one component that
is common across all three blocks, two components common for A and L,
and one distinct component each for A and O. The decomposition of
each block is illustrated by pie charts in Fig. 2.13A–C. Notice that there is
a substantial contribution of one of the C-AL components also in the O
block (7%), which implies that this component could perhaps also be
regarded as common across all three blocks.
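
The figure of 462 target matrices is consistent with a simple counting argument: each of the five components is assigned to one of the seven subspaces, and the order of the components does not matter, which gives C(11, 5) = 462 allocations. The source does not spell out this argument, so the sketch below should be read as an assumption. The subspace labels follow the figure labels (C-ALO, C-AL, D-A, D-O); the labels C-AO, C-LO, and D-L are named by analogy, and the helper names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb

# Presence pattern of each subspace over the blocks (A, L, O).
SUBSPACES = {
    "C-ALO": (1, 1, 1),
    "C-AL": (1, 1, 0),
    "C-AO": (1, 0, 1),
    "C-LO": (0, 1, 1),
    "D-A": (1, 0, 0),
    "D-L": (0, 1, 0),
    "D-O": (0, 0, 1),
}

def allocations(n_components):
    """All unordered assignments of n_components components to the seven subspaces."""
    return list(combinations_with_replacement(SUBSPACES, n_components))

def target_pattern(allocation):
    """Blocks-by-components 0/1 pattern: 1 where a component may contribute to a block."""
    return [[SUBSPACES[name][b] for name in allocation] for b in range(3)]

for k in (3, 4, 5):
    print(k, len(allocations(k)), comb(k + 6, k))   # 84, 210, and 462 allocations

chosen = ("C-ALO", "C-AL", "C-AL", "D-A", "D-O")    # the model interpreted in the text
print(target_pattern(chosen))
```

Fitting one rotation per allocation for three, four, and five components therefore means 84 + 210 + 462 = 756 rotations, which makes clear why the search is time consuming.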

[Figure 2.12: SCA; bars for Amines, Lipids, and Oxylipins; x-axis: Component # (1 to 10); y-axis: Explained variation (%), 0 to 100.]

FIGURE 2.12 Explained variances for SCA. The bars represent variances within each
block, and the curve represents cumulative explained variance in all blocks combined.
[Figure 2.13: DISCO pie charts, one per block.
(A) Amines: C-ALO 19%, C-AL 19%, C-AL 5%, D-A 17%, D-O 2%.
(B) Lipids: C-ALO 28%, C-AL 28%, C-AL 11%, D-A 2%, D-O 2%.
(C) Oxylipins: C-ALO 31%, C-AL 7%, C-AL 2%, D-A 2%, D-O 14%.]

FIGURE 2.13 Subplots A, B, and C show the decomposition by DISCO for blocks A, L,
and O, respectively. Each segment represents a component (dimension).

[Figure 2.14: top panel, scores (legend: After surgery-After meal, After surgery-Before meal, Before surgery-After meal, Before surgery-Before meal); bottom panels, bar plots of Loadings (A), Loadings (L), and Loadings (O).]

FIGURE 2.14 Scores and loadings for the one-dimensional DISCO subspace common
between all three blocks.

To interpret the different subspaces, we plot the scores and loadings
from the DISCO model. Fig. 2.14 shows the one-dimensional subspace
that is common for all three blocks (C-ALO), which accounts for 19%,
28%, and 31% of the variation in A, L, and O, respectively. The scores are
shown in the top panel of Fig. 2.14. It is clear that the component contains
information related to both surgery and meal; the scores are increasing
after surgery and decreasing after the meal. The variables spanning this
dimension in each of the three blocks are shown in the bar plots of
Fig. 2.14 (bottom). The most striking observation is that the
branched-chain amino acids leucine, valine (and to a lesser extent isoleucine), and
L-2-aminoadipic acid (closely related to branched-chain amino acids) are
downregulated after surgery, which confirms earlier findings [3]. Amines
and lipids have more in common with each other than with oxylipins; both amines
and lipids are involved in central carbon and energy metabolism and
therefore they may show higher correlation among some amino acids and
some lipid groups (as reflected by the common subspace).
The two-dimensional subspace common between A and L is shown in
Fig. 2.15. These two components together account for 24% and 39% of the variation in the
A and L blocks, respectively, and they even explain 9% in the O block. We
also see groupings here according to both surgery and meal, especially in
the vertical dimension. Note that the two groups that were overlapping in
the C-ALO component (“before surgery-before meal” vs. “after surgery-
after meal”) are completely separated in this subspace. Plots of the dis-
tinct components (not shown) did not reveal clear patterns related to the
factors treatment and meal. Hence, all effects are seen in the common
parts, meaning that a large part of the metabolism is affected simulta-
neously by these two factors.
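
The percentages quoted for the two C-AL components are sums of the per-component values (19% + 5% = 24% for A, 28% + 11% = 39% for L, 7% + 2% = 9% for O). Such additivity holds whenever the score vectors are orthonormal, which is the case for SCA scores obtained from a singular value decomposition and is preserved by an orthogonal rotation. The sketch below checks this numerically on random stand-in data; the function names and the data are hypothetical, and the rotation step itself is not shown.

```python
import numpy as np

def sca_scores(blocks, n_components):
    """Orthonormal common scores from an SVD of the concatenated blocks."""
    U, _, _ = np.linalg.svd(np.hstack(blocks), full_matrices=False)
    return U[:, :n_components]

def explained_fraction(X, T):
    """Fraction of the sum of squares of X captured by each column of T."""
    return ((X.T @ T) ** 2).sum(axis=0) / (X ** 2).sum()

rng = np.random.default_rng(1)
blocks = [rng.normal(size=(56, p)) for p in (34, 243, 32)]   # stand-ins for A, L, O
blocks = [X / np.linalg.norm(X) for X in blocks]             # unit-norm blocks
T = sca_scores(blocks, n_components=5)

# Rows: blocks; columns: components.
per_component = np.array([explained_fraction(X, T) for X in blocks])

# Variation explained jointly by components 2 and 3, via projection on their span.
pair = T[:, 1:3]
joint = np.array([np.linalg.norm(pair @ (pair.T @ X)) ** 2 / np.linalg.norm(X) ** 2
                  for X in blocks])
print(np.allclose(joint, per_component[:, 1:3].sum(axis=1)))
```

With non-orthogonal scores the per-component values would no longer add up, so the pie-chart style decomposition of a block relies on this property.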
[Figure 2.15: score plot with numerically labeled points; x-axis: C-AL (A:5%, L:11%, O:2%); y-axis: C-AL (A:19%, L:28%, O:7%); legend: After surgery-After meal, After surgery-Before meal, Before surgery-After meal, Before surgery-Before meal.]

FIGURE 2.15 Scores for the two-dimensional DISCO subspace common between the
amine and lipids blocks.

6. CONCLUSIONS

We hope to have shown that fusing data sets can have advantages. Our
framework for data fusion and our generic model for finding common
and distinct components could serve as guidelines for how to perform
certain tasks in data fusion. We would like to stress that data fusion
requires several steps that should be carefully considered:
1. Consideration of the goal of the data fusion based on substantive
questions.
2. Careful examination as to the nature of the data sets to be fused (see
also [50]).
3. Casting the fusion problem in a formal mathematical model.
4. Selection of a proper global loss function related to the mathematical
model to estimate the parameters.
5. Estimating the parameters and performing a proper validation of
the whole model.
These steps may require several rounds of interactions between the key
players, e.g., biologists, data scientists, analytical chemists, to arrive at a
proper problem definition. This shows that fusing data is an inter-
disciplinary endeavor!

References
[1] O. Azimzadeh, W. Sievert, H. Sarioglu, J. Merl-Pham, R. Yentrapalli, M. Bakshi,
D. Janik, M. Ueffing, M. Atkinson, G. Multhoff, S. Tapio, Integrative proteomics and
targeted transcriptomics analyses in cardiac endothelial cells unravel mechanisms of
long-term radiation-induced vascular dysfunction, J. Proteome Res. 14 (2) (2015) 1203–1219.
[2] R. Higdon, R. Earl, L. Stanberry, C. Hudac, E. Montague, E. Stewart, I. Janko,
J. Choiniere, W. Broomall, N. Kolker, R. Bernier, E. Kolker, The promise of
multi-omics and clinical data integration to identify and target personalized healthcare
approaches in autism spectrum disorders, OMICS 19 (4) (2015) 197–208.
[3] M. Lips, J. Van Klinken, V. Van Harmelen, H. Dharuri, P. 't Hoen, J. Laros, G. Van
Ommen, I. Janssen, B. Van Ramshorst, B. Van Wagensveld, D. Swank, F. Van Dielen,
A. Dane, A. Harms, R. Vreeken, T. Hankemeier, J. Smit, H. Pijl, K. Willems van Dijk,
Roux-en-Y gastric bypass surgery, but not calorie restriction, reduces plasma branched
chain amino acids in obese women independent of weight loss or the presence of type 2
diabetes mellitus, Diabetes Care 37 (12) (2014) 3150–3156.
[4] P.H. Bradley, M.J. Brauer, J.D. Rabinowitz, O.G. Troyanskaya, Coordinated
concentration changes of transcripts and metabolites in Saccharomyces cerevisiae, PLoS Comput.
Biol. 5 (1) (2009) e1000270.
[5] M.T.A.P. Kresnowati, W.A. van Winden, M.J.H. Almering, A. ten Pierick, C. Ras,
T.A. Knijnenburg, P. Daran-Lapujade, J.T. Pronk, J.J. Heijnen, J.M. Daran, When
transcriptome meets metabolome: fast cellular responses of yeast to sudden relief of glucose
limitation, Mol. Syst. Biol. 2 (2006) 49.
[6] C. Caldana, T. Degenkolbe, A. Cuadros-Inostroza, S. Klie, R. Sulpice, A. Leisse,
D. Steinhauser, A. Fernie, L. Willmitzer, M. Hannah, High-density kinetic analysis of
