A Framework For Low-Level Data Fusion
Age K. Smilde*,1, Iven Van Mechelen§
* Biosystems Data Analysis, Swammerdam Institute for Life Sciences,
University of Amsterdam, Amsterdam, The Netherlands; § Research Group
on Quantitative Psychology and Individual Differences, KU Leuven,
Leuven, Belgium
1 Corresponding author
FIGURE 2.1 Coupled GC-MS and LC-MS data along the sampling mode.
[Figure labels: Metabolites 1, Metabolites 2 (columns); Experimental conditions (rows); LC-MS, GC-MS (blocks).]
bypass surgery, and blood samples were taken 4 weeks before and
3 weeks after surgery; on each occasion samples were taken both before
and after a meal. The blood samples were then analyzed on multiple
analytical platforms for the determination of amines, lipids, and
oxylipins. Clearly, there is an experimental design underlying the shared
sampling mode (see Fig. 2.2) and thus the goal is to establish (common
and distinct) treatment effects in the different data sets.
The third example is from analytical chemistry applied to resolving
mixtures of chemical compounds into their underlying spectra and
concentrations. This is visualized in Fig. 2.3 where nuclear magnetic
resonance (NMR) and LC-MS measurements are performed on the same
set of samples. Owing to the combined modeling, concentrations and
pure spectral profiles can be obtained for both the NMR and LC-MS data
[25,26].
We will make use of many figures like the ones in Figs. 2.1, 2.2, and 2.3.
We always depict matrices with the first (row) mode pertaining to the
sampling mode (each row of a matrix corresponds to a sample) and the
second (column) mode pertaining to the variables (each column represents
a variable).
FIGURE 2.2 Amines, lipids, and oxylipins measured repeatedly on the same persons.
The data block design contains the encoding of the underlying experimental design (see
text for details). [Figure label: Experimental conditions (rows).]
FIGURE 2.3 NMR and LC-MS measurements performed on the same set of mixtures of
chemical compounds. [Figure labels: Mixtures (rows); LC-MS, NMR (blocks).]
2. DATA STRUCTURES
blocks. This global model will consist of (1) a submodel for each data
block that accounts for the individual data entries in that block, along
with (2) a linking structure between these submodels. We first outline
those two aspects in the following text. Subsequently, we describe a few
existing examples of our generic proposal.
[Figure labels: data blocks X1 and X2; quantification matrices A1, A2, B1, and B2; linking structure L(.).]
is also used in most cases of low-level data fusion (see Fig. 2.8A). Such a
constraint simply implies that a shared mode is given the same quanti-
fication in all submodels in which it shows up. The global model for such
a linking structure would be:
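\[
\begin{aligned}
\mathbf{X}_1 &= f_1(\mathbf{A}, \mathbf{B}_1, \mathbf{W}_1) + \mathbf{E}_1 \\
\mathbf{X}_2 &= f_2(\mathbf{A}, \mathbf{B}_2, \mathbf{W}_2) + \mathbf{E}_2
\end{aligned}
\tag{2.4}
\]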
where the identity constraint becomes visible through the fact that the
quantification matrix A no longer bears a block-specific subscript.
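To make the identity link concrete, the following minimal sketch fits the simplest instance of such a global model. It assumes, purely for illustration, that both submodels are bilinear (X1 ≈ A B1^T and X2 ≈ A B2^T), that no additional parameters W are involved, and that an unweighted least-squares loss is used; the function name `fit_shared_scores` is hypothetical. Under these assumptions, a truncated singular value decomposition of the column-wise concatenated blocks yields a single A that is shared by both blocks.

```python
import numpy as np

def fit_shared_scores(X1, X2, R):
    """Least-squares fit of X1 ~ A B1^T and X2 ~ A B2^T with one shared A.

    Assumption: bilinear submodels and an unweighted loss, so the solution
    follows from a truncated SVD of the concatenated matrix [X1 X2].
    """
    X = np.hstack([X1, X2])                  # blocks share the row (sampling) mode
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    A = U[:, :R]                             # shared quantification of the row mode
    B = Vt[:R].T * s[:R]                     # stacked block-specific loadings
    J1 = X1.shape[1]
    return A, B[:J1], B[J1:]

# Noise-free toy data generated from one common A (illustrative only).
rng = np.random.default_rng(0)
A_true = rng.standard_normal((10, 2))
X1 = A_true @ rng.standard_normal((4, 2)).T
X2 = A_true @ rng.standard_normal((6, 2)).T
A, B1, B2 = fit_shared_scores(X1, X2, R=2)
```

Because the same matrix `A` enters both reconstructions, `A @ B1.T` and `A @ B2.T`, the shared mode receives one quantification, which is exactly what the absence of a block-specific subscript expresses.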
Rather than a full identity constraint on the quantification matrices of
shared modes, Fig. 2.8 also shows two partial identity constraints. The
first of these reads that a number of columns of the quantification
matrices of a shared mode are constrained to be identical, whereas other
columns are left unconstrained; through such a partial identity
constraint one may wish to capture both commonalities in the structures
of the linked data blocks (in terms of the identical quantification col-
umns) and distinctive aspects (in terms of the unconstrained columns)
(see Fig. 2.8C). This type of linking structure will be explained in more
detail in Section 4.
A second partial identity constraint reads that quantification matrices
of a shared mode are constrained to be identical with regard to the vast
majority of their rows (see Fig. 2.8D). This means that, for the vast
majority of the elements of the mode involved, but not for all, the
quantifications have to be the same (with elements that require different
quantifications having to be identified during the data-analytic process).
A special type of linkage structure may be needed if one of the shared
modes is the time mode (see Fig. 2.8B). In such cases, the linking structure
may have to account for lags in dynamics (indicated by the symbol τ).
This is, for instance, the case if metabolites are measured in both
blood and urine, where a metabolite usually appears earlier in the
blood than in the urine.
We give examples of two other possible linkage structures. The first of
these pertains to the case of binary quantification matrices (which can be
conceived as membership matrices in some clustering). A constraint on
such matrices could read that the clustering implied by the first
quantification matrix is nested in the second. As a special case, if only
partitioning matrices are considered, a nestedness constraint implies
that the first partitioning is a refinement of the second (i.e., the first
partitioning is obtained by splitting a number of classes of the second
one). As a second possibility, in the case of real-valued quantification
matrices, one may require two quantifications of the same mode to be in
a space-subspace relation.
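The nestedness constraint can be checked mechanically. The sketch below adopts one possible reading, namely that a clustering is nested in another if every cluster of the first is contained in at least one cluster of the second; for partitions this coincides with the refinement relation. The function name `is_nested` and the toy membership matrices are illustrative assumptions.

```python
import numpy as np

def is_nested(A1, A2):
    """True if every cluster (column) of A1 lies inside some cluster of A2.

    A1 and A2 are binary membership matrices with elements in the rows
    and clusters in the columns.
    """
    A1 = np.asarray(A1).astype(int)
    A2 = np.asarray(A2).astype(int)
    overlap = A1.T @ A2            # overlap[p, q]: elements shared by clusters p and q
    sizes = A1.sum(axis=0)         # number of elements in each cluster of A1
    # Cluster p is contained in cluster q when the overlap equals its full size.
    return bool(np.all((overlap == sizes[:, None]).any(axis=1)))

# Partition {1,2}, {3}, {4} refines partition {1,2,3}, {4}.
fine = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
coarse = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
fine_in_coarse = is_nested(fine, coarse)    # refinement holds
coarse_in_fine = is_nested(coarse, fine)    # the reverse does not
```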
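\[
\begin{aligned}
\mathbf{X}_1 &= \mathbf{A}_1 \mathbf{B}^T + \mathbf{E}_1 = f(\mathbf{A}_1, \mathbf{B}) + \mathbf{E}_1 \\
\mathbf{X}_2 &= \mathbf{A}_2 \mathbf{B}^T + \mathbf{E}_2 = f(\mathbf{A}_2, \mathbf{B}) + \mathbf{E}_2,
\end{aligned}
\tag{2.6}
\]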
where again an identity link is used. This example shows that coupling
along the variables mode (see Fig. 2.5) can also be cast in our framework.
Two members of the SCA family are further worth mentioning, because
they have been used in several fields of science: multilevel SCA (MSCA), as
used in psychometrics, functional genomics, and process chemometrics
([34–37]), and ANOVA-SCA (ASCA [36,38–40]), as used in functional
genomics. We start by discussing ASCA, making use of the notation in
ASCA applications ([35]). The typical background of ASCA is a set of
designed experiments in which functional genomics data are collected
from several subjects exposed to a treatment (k) and measured over time
(t). Assuming that proper preprocessing has been done, each block Xb
contains the measurements performed for treatment group b and can be
modeled as:
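\[
\mathbf{X}_b = \mathbf{1}\mathbf{t}_b^T \mathbf{P}_1^T + \mathbf{T}_t \mathbf{P}_2^T + \mathbf{T}_{bt} \mathbf{P}_3^T + \mathbf{E}_b,
\tag{2.7}
\]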
where 1 is a vector of ones of the proper order; tb, Tt, and Tbt contain the
scores representing the samples; P1, P2, and P3 contain the loadings, and
Eb contains the residuals. The matrices Tt and Tbt have a specific structure,
which is not important for this paper and which can be found elsewhere
([40]). On rewriting Eq. (2.7), we obtain:
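\[
\begin{aligned}
\mathbf{X}_b &= \left[\, \mathbf{1}\mathbf{t}_b^T \;\; \mathbf{T}_t \;\; \mathbf{T}_{bt} \,\right] \left[\, \mathbf{P}_1 \;\; \mathbf{P}_2 \;\; \mathbf{P}_3 \,\right]^T + \mathbf{E}_b \\
&= \mathbf{A}_b \mathbf{B}^T + \mathbf{E}_b,
\end{aligned}
\tag{2.8}
\]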
which clearly is a special case of SCA-P.
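The rewriting step from Eq. (2.7) to Eq. (2.8) is a plain block-matrix identity, which the following sketch verifies numerically. The dimensions and the random matrices are arbitrary assumptions; in particular, the special structure that ASCA imposes on Tt and Tbt is ignored here, because the identity holds regardless of that structure.

```python
import numpy as np

rng = np.random.default_rng(1)
n, J = 8, 5                       # rows in block b and number of variables (assumed)
r1, r2, r3 = 1, 2, 2              # numbers of components per term (assumed)

ones = np.ones((n, 1))            # the vector of ones in Eq. (2.7)
t_b = rng.standard_normal((r1, 1))
T_t = rng.standard_normal((n, r2))
T_bt = rng.standard_normal((n, r3))
P1, P2, P3 = (rng.standard_normal((J, r)) for r in (r1, r2, r3))

# Structural part of Eq. (2.7): a sum of three bilinear terms.
sum_form = ones @ t_b.T @ P1.T + T_t @ P2.T + T_bt @ P3.T

# Structural part of Eq. (2.8): concatenated scores times concatenated loadings.
A_b = np.hstack([ones @ t_b.T, T_t, T_bt])
B = np.hstack([P1, P2, P3])
product_form = A_b @ B.T

same = np.allclose(sum_form, product_form)
```

Since `B` carries no block index, all blocks share the same loadings, which is why the model has the SCA-P form.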
The basic equation for the MSCA model reads as:
\[
\mathbf{X}_b = \mathbf{1}\mathbf{t}_b^T \mathbf{P}_1^T + \mathbf{T}_{bw} \mathbf{P}_2^T + \mathbf{E}_b,
\tag{2.9}
\]
assuming again proper preprocessing. The block Xb now represents an
individual subject (or object) measured, for example, over time (t). Then
$\mathbf{t}_b^T \mathbf{P}_1^T$ represents the between-subject variation and
$\mathbf{T}_{bw} \mathbf{P}_2^T$ the within-subject variation. Eq. (2.9) clearly
shows that MSCA is a special case of ASCA. Hence, MSCA also fits within
our generic framework.
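\[
\begin{aligned}
\mathbf{X}_1 &= \mathbf{A} \mathbf{B}_1^T + \mathbf{E}_1 = f_1(\mathbf{A}, \mathbf{B}_1) + \mathbf{E}_1 \\
\mathbf{X}_2 &= \mathbf{A} (\mathbf{B}_2 \odot \mathbf{C}_2)^T + \mathbf{E}_2 = f_2(\mathbf{A}, \mathbf{B}_2, \mathbf{C}_2) + \mathbf{E}_2,
\end{aligned}
\tag{2.10}
\]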
where B2 and C2 are loading matrices pertaining to the second and third
modes of the properly matricized three-way array X2 and ⊙ is the symbol
for the Khatri-Rao product ([27]). The analysis of multiway arrays
is covered in several textbooks [27,41], and extensions to fusing
data sets in which multiway arrays are involved are also available
[25,26,42,43].
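As a small illustration of the second line of Eq. (2.10), the sketch below builds a noise-free three-way array from A, B2, and C2 and checks that its matricized form equals A times the transposed Khatri-Rao product. The helper `khatri_rao`, the dimensions, and the particular matricization (second-mode index running slowest within the columns) are assumptions; other unfolding conventions require swapping the order of the two factors.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: column r equals kron(B[:, r], C[:, r])."""
    J, R = B.shape
    K, R2 = C.shape
    if R != R2:
        raise ValueError("B and C must have the same number of columns")
    return np.einsum("jr,kr->jkr", B, C).reshape(J * K, R)

rng = np.random.default_rng(2)
I, J, K, R = 4, 3, 2, 2
A = rng.standard_normal((I, R))
B2 = rng.standard_normal((J, R))
C2 = rng.standard_normal((K, R))

# Trilinear array with entries x_ijk = sum_r a_ir * b_jr * c_kr.
X2_array = np.einsum("ir,jr,kr->ijk", A, B2, C2)

# Matricize to I x (J*K); in this unfolding, column index = j * K + k.
X2 = X2_array.reshape(I, J * K)
matches = np.allclose(X2, A @ khatri_rao(B2, C2).T)
```

The matrix `A` in this product is the same one that would appear in the bilinear model of the two-way block, which is how the identity link couples a matrix and a three-way array.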
FIGURE 2.9 The I-dimensional space having R(X1) (blue (dark gray in print version))
and R(X2) (green (light gray in print version)) as subspaces. Only three axes of this
I-dimensional space are drawn. The red line (gray in print version) X12C represents the
common subspace. For the sake of illustration the dimensions of both column-spaces are equal
(two). This is not necessarily always the case. [Figure labels: X12C, X1, X2, Row 1.]
4. COMMON AND DISTINCT COMPONENTS
variables are not explicitly shown in this figure but lie within the space
indicated by the blue and green column-spaces.
If the two column-spaces intersect nontrivially (the zero vector is always
shared), then the intersection space is called the common space. In Fig. 2.9,
there is only one common direction (i.e., the common space is one
dimensional), but there can be more dimensions or none. The common
subspace will be called R(X12C), where the subscript C stands for “Common.”
Note that R(X12C) ⊆ R(X1) and R(X12C) ⊆ R(X2). The common part of
the two blocks will in most cases not span the whole of R(X1) and R(X2).
Some definitions regarding the rest of these spaces are therefore needed.
These subspaces representing the rest after identification of the common
part will be called “distinct” subspaces. The requirement is that the space
spanned by the columns in a block Xb (b = 1, 2) is a direct sum of the
common space and the distinct space within that block. Hence, these two
parts within a block are linearly independent (two subspaces are linearly
independent if no nonzero vector in one subspace can be written as a linear
combination of the vectors of the other and vice versa). These subspaces
are called R(X1D) and R(X2D) where the subscript D stands for “Distinct.”
There are several possibilities for selecting subspaces to be orthogonal
to each other. One option is to select R(X1D) and R(X2D) to be orthogonal to
R(X12C). Another option is to select R(X1D) orthogonal to R(X2D) and, of
course, it is also possible to not impose orthogonality at all. Whether to
impose orthogonality, and of which type, depends on the application.
What we have accomplished now is decomposing R(X1) and R(X2) into
direct sums of spaces:
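\[
\begin{aligned}
R(\mathbf{X}_1) &= R(\mathbf{X}_{12C}) \oplus R(\mathbf{X}_{1D}) \\
R(\mathbf{X}_2) &= R(\mathbf{X}_{12C}) \oplus R(\mathbf{X}_{2D})
\end{aligned}
\tag{2.11}
\]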
because R(X12C) ∩ R(X1D) = {0} and R(X12C) ∩ R(X2D) = {0} [44]. Hence, it
also holds that:
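\[
\begin{aligned}
\dim R(\mathbf{X}_1) &= \dim R(\mathbf{X}_{12C}) + \dim R(\mathbf{X}_{1D}) \\
\dim R(\mathbf{X}_2) &= \dim R(\mathbf{X}_{12C}) + \dim R(\mathbf{X}_{2D})
\end{aligned}
\tag{2.12}
\]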
If the distinct-orthogonal-to-common option is chosen, then R(X12C) ⊥
R(X1D) and R(X12C) ⊥ R(X2D). Note that, for this case, given the common
space, the decomposition is unique because then R(X1D) is the orthogonal
complement of R(X12C) within R(X1) and likewise for R(X2D). The subspaces
are unique, but the bases within them are not if the subspaces have
dimension higher than one. In the nonorthogonal case, the distinct part can
be defined by any set of linearly independent vectors that are in the
original spaces but not in the common space. For a thorough description of
direct sums of spaces, see [45].
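The decomposition into common and distinct subspaces can be illustrated numerically for the distinct-orthogonal-to-common option. The sketch below is one possible construction for noise-free data, not a prescribed algorithm: it finds the intersection of the two column-spaces through principal angles (directions whose cosine equals one) and takes the distinct parts as orthogonal complements of the common space within each block. The function names, the tolerances, and the toy data are assumptions.

```python
import numpy as np

def orth_basis(X, tol=1e-10):
    """Orthonormal basis of R(X); absolute tolerance assumes data of order one."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, s > tol]

def common_and_distinct(X1, X2, tol=1e-8):
    """Bases for R(X12C), R(X1D), R(X2D), with distinct parts orthogonal to common."""
    Q1, Q2 = orth_basis(X1), orth_basis(X2)
    # Singular values of Q1^T Q2 are the cosines of the principal angles.
    U, cosines, _ = np.linalg.svd(Q1.T @ Q2, full_matrices=False)
    C = Q1 @ U[:, cosines > 1 - tol]           # directions shared by both spaces
    D1 = orth_basis(Q1 - C @ (C.T @ Q1))       # complement of C within R(X1)
    D2 = orth_basis(Q2 - C @ (C.T @ Q2))       # complement of C within R(X2)
    return C, D1, D2

# Two 2-dimensional column-spaces in a 6-dimensional space sharing one direction.
rng = np.random.default_rng(3)
basis = rng.standard_normal((6, 3))
X1 = basis[:, [0, 1]]
X2 = basis[:, [0, 2]]
C, D1, D2 = common_and_distinct(X1, X2)
```

In this toy case `C`, `D1`, and `D2` each have one column, so the dimensions add up as in Eq. (2.12). With noisy data an exact intersection generally does not exist, and a threshold on the principal angles becomes a modeling decision.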
where V is a matrix of zeros and ones selecting the elements across which
the minimization occurs, the symbol * indicates the Hadamard or elementwise
product, and

\[
\mathbf{B}_{\mathrm{target}} =
\begin{bmatrix}
x & 0 & x \\
x & 0 & x \\
x & 0 & x \\
0 & x & x \\
0 & x & x \\
0 & x & x \\
0 & x & x
\end{bmatrix}
\tag{2.14}
\]
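The role of the selection matrix can be sketched in code. The example below encodes the target of Eq. (2.14), marks the free entries x as missing, and evaluates a masked sum of squared deviations between rotated loadings and the target. Several points are assumptions made for illustration only: that the selection matrix has ones exactly at the zero positions of the target, that the criterion is the sum of squared selected deviations between B Q and the target, and the names `masked_loss`, `B`, and `Q`.

```python
import numpy as np

x = np.nan                                   # placeholder for free (unconstrained) entries
B_target = np.array([[x, 0, x],
                     [x, 0, x],
                     [x, 0, x],
                     [0, x, x],
                     [0, x, x],
                     [0, x, x],
                     [0, x, x]])

# Assumed selection matrix: 1 where the target prescribes a zero, 0 elsewhere.
V = (B_target == 0).astype(float)

def masked_loss(B, Q):
    """Sum of squared deviations from the target over the selected entries only."""
    target = np.nan_to_num(B_target)         # free entries are masked out by V anyway
    return float(np.sum((V * (B @ Q - target)) ** 2))

# Loadings that already have the target structure give zero loss without rotation.
rng = np.random.default_rng(4)
B = rng.standard_normal((7, 3)) * (1 - V)
loss_identity = masked_loss(B, np.eye(3))

# Mixing the first two components by 45 degrees destroys the structure.
c = np.sqrt(0.5)
Q_mix = np.array([[c, -c, 0.0], [c, c, 0.0], [0.0, 0.0, 1.0]])
loss_mixed = masked_loss(B, Q_mix)
```

Because the free entries are multiplied by zero in `V`, their values never enter the loss, so only the prescribed zeros steer the rotation.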
5. EXAMPLES
We give two examples in this section. The first one is an example of our
generic framework for data fusion and concerns the goal of exploring
relationships between two data blocks. The second one is an example of
common and distinct components and how these are affected by a
covariate, which is a treatment effect.
FIGURE 2.10 Proportion of explained variances for the MFA solutions. The bars represent
variances within each block by MFA components and the curve represents explained
variance within a block by separate component analysis. [Panels: GC block, LC block;
x-axis: component (1 to 10); y-axis: proportion of variance accounted for.]
FIGURE 2.11 Heatmap of the loadings of the five rotated MFA components. Only 130
of the 188 metabolites are shown; for the other metabolites the identity was unknown.
FIGURE 2.12 Explained variances for SCA. The bars represent variances within each
block, and the curve represents cumulative explained variance in all blocks combined.
[Panel title: SCA; legend: Amines, Lipids, Oxylipins; x-axis: Component # (1 to 10);
y-axis: Explained variation (%).]
FIGURE 2.13 Subplots A, B, and C show the decomposition by DISCO for blocks A, L,
and O, respectively. Each segment represents a component (dimension).
[Segment labels as extracted: C-AL:5%, D-O:2%, D-O:2%, C-AL:2%, C-AL:11%, D-A:2%,
C-AL:19%, C-AL:7%, D-O:14%, C-AL:28%, D-A:2%, D-A:17%.]
FIGURE 2.14 Scores and loadings for the one-dimensional DISCO subspace common
between all three blocks. [Panel: Scores; groups: After surgery-After meal,
After surgery-Before meal, Before surgery-After meal, Before surgery-Before meal.]
information related to both surgery and meal; the scores are increasing
after surgery and decreasing after the meal. The variables spanning this
dimension in each of the three blocks are shown in the bar plots of
Fig. 2.14 (bottom). The most striking observation is that the branched-chain
amino acids leucine, valine (and to a lesser extent leucine), and
L-2-aminoadipic acid (closely related to branched-chain amino acids) are
downregulated after surgery, which confirms earlier findings [3]. Amines and
lipids have more in common with each other than with oxylipins; both amines
and lipids are involved in central carbon and energy metabolism, and
some amino acids and some lipid groups may therefore be more strongly
correlated (as reflected by the common subspace).
The two-dimensional subspace common between A and L is shown in
Fig. 2.15. These two components together account for 24% and 39% of the
variation in the A and L blocks, respectively, and they even explain 9% in
the O block. We also see groupings here according to both surgery and meal,
especially in the vertical dimension. Note that the two groups that were
overlapping in the C-ALO component (“before surgery-before meal” vs.
“after surgery-after meal”) are completely separated in this subspace.
Plots of the distinct components (not shown) did not reveal clear patterns
related to the factors treatment and meal. Hence, all effects are seen in
the common parts, meaning that a large part of the metabolism is affected
simultaneously by these two factors.
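Percentages of this kind can be computed per block once a set of component scores is available. The sketch below shows one common way of doing so, under the assumptions that each block has been preprocessed (for example, column centered) and that the explained variation of a block is the share of its sum of squares lying in the space spanned by the scores. The function name `explained_per_block` and the toy data are hypothetical, and the figures reported in this chapter were not necessarily obtained in this exact way.

```python
import numpy as np

def explained_per_block(blocks, T):
    """Percentage of each block's sum of squares captured by the score space of T."""
    Q, _ = np.linalg.qr(T)                       # orthonormal basis of the score space
    return [100.0 * np.sum((Q @ (Q.T @ X)) ** 2) / np.sum(X ** 2) for X in blocks]

# Toy example: block 1 lies entirely along the score vector t, block 2 is orthogonal to it.
t = np.array([[0.5], [0.5], [-0.5], [-0.5]])
u = np.array([[0.5], [-0.5], [0.5], [-0.5]])
X1 = t @ np.array([[1.0, 2.0, 3.0]])
X2 = u @ np.array([[1.0, 1.0]])
percentages = explained_per_block([X1, X2], t)
```

A component that explains much variation in some blocks and almost none in others, as in this toy case, is what makes the labels "common" and "distinct" interpretable.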
FIGURE 2.15 Scores for the two-dimensional DISCO subspace common between the
amine and lipids blocks. [Score scatter plot with numbered points; axes:
C-AL (A:5%, L:11%, O:2%) and C-AL (A:19%, L:28%, O:7%); groups: After surgery-After meal,
After surgery-Before meal, Before surgery-After meal, Before surgery-Before meal.]
6. CONCLUSIONS
We hope to have shown that fusing data sets can have advantages. Our
framework for data fusion and our generic model for finding common
and distinct components could serve as guidelines for how to perform
certain tasks in data fusion. We would like to stress that data fusion
requires several steps that should be carefully considered:
1. Consideration of the goal of the data fusion based on substantive
questions.
2. Careful examination as to the nature of the data sets to be fused (see
also [50]).
3. Casting the fusion problem in a formal mathematical model.
4. Selection of a proper global loss function related to the mathematical
model to estimate the parameters.
5. Estimating the parameters and performing a proper validation of
the whole model.
These steps may require several rounds of interaction between the key
players (e.g., biologists, data scientists, and analytical chemists) to arrive
at a proper problem definition. This shows that fusing data is an
interdisciplinary endeavor!
References
[1] O. Azimzadeh, W. Sievert, H. Sarioglu, J. Merl-Pham, R. Yentrapalli, M. Bakshi,
D. Janik, M. Ueffing, M. Atkinson, G. Multhoff, S. Tapio, Integrative proteomics and
targeted transcriptomics analyses in cardiac endothelial cells unravel mechanisms of
long-term radiation-induced vascular dysfunction, J. Proteome Res. 14 (2) (2015) 1203–1219.
[2] R. Higdon, R. Earl, L. Stanberry, C. Hudac, E. Montague, E. Stewart, I. Janko,
J. Choiniere, W. Broomall, N. Kolker, R. Bernier, E. Kolker, The promise of multi-omics
and clinical data integration to identify and target personalized healthcare
approaches in autism spectrum disorders, OMICS 19 (4) (2015) 197–208.
[3] M. Lips, J. Van Klinken, V. Van Harmelen, H. Dharuri, P. 't Hoen, J. Laros, G. Van
Ommen, I. Janssen, B. Van Ramshorst, B. Van Wagensveld, D. Swank, F. Van Dielen,
A. Dane, A. Harms, R. Vreeken, T. Hankemeier, J. Smit, H. Pijl, K. Willems van Dijk,
Roux-en-Y gastric bypass surgery, but not calorie restriction, reduces plasma branched
chain amino acids in obese women independent of weight loss or the presence of type 2
diabetes mellitus, Diabetes Care 37 (12) (2014) 3150–3156.
[4] P.H. Bradley, M.J. Brauer, J.D. Rabinowitz, O.G. Troyanskaya, Coordinated
concentration changes of transcripts and metabolites in Saccharomyces cerevisiae,
PLoS Comput. Biol. 5 (1) (2009) e1000270.
[5] M.T.A.P. Kresnowati, W.A. van Winden, M.J.H. Almering, A. ten Pierick, C. Ras,
T.A. Knijnenburg, P. Daran-Lapujade, J.T. Pronk, J.J. Heijnen, J.M. Daran, When
transcriptome meets metabolome: fast cellular responses of yeast to sudden relief of
glucose limitation, Mol. Syst. Biol. 2 (2006) 49.
[6] C. Caldana, T. Degenkolbe, A. Cuadros-Inostroza, S. Klie, R. Sulpice, A. Leisse,
D. Steinhauser, A. Fernie, L. Willmitzer, M. Hannah, High-density kinetic analysis of