On linear dimension reduction based on diagonalization of scatter matrices for bioinformatics downstream analyses

Daniel Fischer a,∗, Klaus Nordhausen b and Hannu Oja c

a Natural Resources Institute Finland (Luke), Applied Statistical Methods, Myllytie 1, 31600 Jokioinen, Finland
b CSTAT - Computational Statistics, Institute of Statistics & Mathematical Methods in Economics, Vienna University of Technology, Wiedner Hauptstraße 7, A-1040 Vienna, Austria
c Department of Mathematics and Statistics, University of Turku, 20014 Turku, Finland

∗ Corresponding author. E-mail address: daniel.fischer@luke.fi (D. Fischer).

Published version: Fischer D., Nordhausen K., Oja H. (2020). On linear dimension reduction based on diagonalization of scatter matrices for bioinformatics downstream analyses. Heliyon 6(12), e05732. https://doi.org/10.1016/j.heliyon.2020.e05732. © The Author(s) 2020. Rights: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Keywords: Computer science; Mathematics; Statistics; Bioinformatics; Microbial genomics; Genomics; Transcriptomics; Dimension reduction; SIR; ICA

Abstract. Dimension reduction is often a preliminary step in the analysis of data sets with a large number of variables. Most classical dimension reduction methods, both supervised and unsupervised, such as principal component analysis (PCA), independent component analysis (ICA) or sliced inverse regression (SIR), can be formulated using one, two or several different scatter matrix functionals. Scatter matrices can be seen as different measures of multivariate dispersion; they might highlight different features of the data, and comparing them might reveal interesting structures. Such an analysis searches for a projection onto an interesting (signal) part of the data, and it is therefore also important to know the correct dimension of the signal subspace. These approaches usually make either no model assumptions or work in wide classes of semiparametric models. Theoretical results in the literature are, however, limited to the case where the sample size exceeds the number of variables, which is hardly ever true for data sets encountered in bioinformatics. In this paper, we briefly review the relevant literature and explore whether these dimension reduction tools can be used to find relevant and interesting subspaces for small-𝑛-large-𝑝 data sets. We illustrate the methods with a microarray data set of prostate cancer patients and healthy controls.
1. Introduction

In contemporary data analysis, linear dimension reduction is often the first step in reducing the number of variables. For a numeric 𝑝-variate random vector 𝒙 this means that a 𝑞 × 𝑝 transformation matrix 𝑾 with 𝑞 ≪ 𝑝 is searched for such that all relevant information for the analysis at hand is contained in 𝑾 𝒙. The question is then naturally how the "information" content is measured, but in general two types of linear dimension reduction can be distinguished:

Unsupervised dimension reduction: 𝒙 | 𝑾 𝒙 is considered as uninformative noise.
Supervised dimension reduction: 𝑦 ⟂ 𝒙 | 𝑾 𝒙 for some response variable 𝑦 of interest.

There is an abundance of supervised and unsupervised methods and, depending on the data and problem, they might give quite different results as they adopt different concepts of information. One prevalent supervised method is, for example, linear discriminant analysis (LDA); more recently developed non-linear dimension reduction methods include t-SNE (van der Maaten and Hinton, 2008), Isomap (Tenenbaum et al., 2000) and UMAP (McInnes et al., 2018). In the context of linear unsupervised dimension reduction, for example, principal component analysis (PCA) understands information as large variation, whereas independent component analysis (ICA) usually measures information as the degree of non-gaussianity. The dimension 𝑞 of the information or signal subspace is preselected visually or by using various information criteria or testing strategies. For a more detailed comparison of the two concepts with practical examples, please see Nordhausen and Oja (2018b).

Surprisingly many unsupervised and supervised linear dimension reduction methods can be expressed as a joint diagonalization problem of two scatter matrices, which yields a nice unifying theory in wide semiparametric models. A problem, however, is that the assumption 𝑛 > 𝑝 on the sample size 𝑛 is needed, and the technique cannot be applied directly to bioinformatics data where almost always 𝑛 ≪ 𝑝.

In this paper, we first provide in Section 2 a general but brief overview of linear dimension reduction methods based on the use of two scatter matrices. Section 3 introduces a genetic data set which is used to illustrate the ideas. Section 4 then explores whether there are ways to apply these methods in the case 𝑛 ≪ 𝑝, using the example data discussed in Section 3. A GitHub repository containing the R code and data to reproduce all results and figures is also available (https://github.com/fischuu/LinearDimensionReductionInBioinformatics).

2. Linear dimension reduction using two scatter matrices

Let 𝒙 denote a 𝑝-variate random vector with cdf 𝐹𝒙, and let the joint cdf of a random variable 𝑦 and 𝒙 be denoted by 𝐹𝒙,𝑦. An (unsupervised) scatter functional is then any 𝑝 × 𝑝 matrix-valued functional 𝑺 which is affine equivariant in the sense that

𝑺(𝐹𝑨𝒙+𝒃) = 𝑨 𝑺(𝐹𝒙) 𝑨⊤

for all nonsingular 𝑝 × 𝑝 matrices 𝑨 and all 𝑝-vectors 𝒃. Similarly, a supervised scatter functional is defined as any 𝑝 × 𝑝 matrix-valued functional 𝑺 which is affine equivariant in the sense that

𝑺(𝐹𝑨𝒙+𝒃,𝑦) = 𝑨 𝑺(𝐹𝒙,𝑦) 𝑨⊤,

with 𝑨 and 𝒃 as above. These functionals are in the following also denoted by 𝑺(𝒙) and 𝑺(𝒙; 𝑦). For a random sample 𝑿 (or (𝑿, 𝑦)), the estimates of the population values 𝑺(𝒙) and 𝑺(𝒙; 𝑦) are obtained as the values of the functionals at the corresponding empirical distributions.

There is a wide literature on scatter functionals. The covariance matrix cov(𝒙) serves as the first example, but a scatter matrix may also be based on the fourth moments:

cov4(𝒙) = 𝐸( 𝑟² (𝒙 − 𝐸(𝒙))(𝒙 − 𝐸(𝒙))⊤ ),  with 𝑟² = (𝒙 − 𝐸(𝒙))⊤ cov(𝒙)⁻¹ (𝒙 − 𝐸(𝒙)).

The M-functionals of scatter are implicitly defined by

𝑺(𝒙) = 𝐸( 𝑤(𝑟) (𝒙 − 𝑻(𝒙))(𝒙 − 𝑻(𝒙))⊤ ),

where 𝑻(𝒙) is a companion location functional and 𝑤(⋅) a weight function for 𝑟 = ((𝒙 − 𝑻(𝒙))⊤ 𝑺(𝒙)⁻¹ (𝒙 − 𝑻(𝒙)))^(1/2). Popular weight functions are, for example, the Huber weights

𝑤𝐻(𝑟) = 1∕𝜎² for 𝑟 ≤ 𝑐  and  𝑤𝐻(𝑟) = 𝑐∕(𝑟²𝜎²) for 𝑟 > 𝑐,

where 𝑐 is a user-specified tuning constant and 𝜎² a scaling factor. The weight function for Tyler's scatter matrix is 𝑤𝑇(𝑟) = 𝑝∕𝑟². For a general discussion of these and other (unsupervised) scatter matrices, see for example Tyler (1987); Rousseeuw and Hubert (2013); Dümbgen et al. (2015).
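To make the definitions above concrete, here is a minimal base-R sketch of the two sample scatter matrices used later by FOBI, the covariance matrix and the fourth-moment scatter cov4. The function name cov4_hat and the simulated data are our own illustration and not part of the original analysis; the ICS package provides ready-made versions.

```r
# Minimal sketch (ours, not from a package) of the fourth-moment scatter matrix
# cov4(x) = E( r^2 (x - E(x))(x - E(x))^T ), r^2 = (x - E(x))^T cov(x)^{-1} (x - E(x)).
cov4_hat <- function(X) {
  X  <- as.matrix(X)
  Xc <- sweep(X, 2, colMeans(X))                 # center the data
  r2 <- rowSums((Xc %*% solve(cov(X))) * Xc)     # squared Mahalanobis distances
  crossprod(Xc * r2, Xc) / nrow(X)               # average of r^2 (x - mean)(x - mean)^T
}

# Small illustration with simulated data (p = 3, n = 200)
set.seed(1)
X  <- cbind(rnorm(200), rexp(200), runif(200))
S1 <- cov(X)        # ordinary covariance matrix
S2 <- cov4_hat(X)   # scatter matrix based on fourth moments
```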
A supervised scatter functional used in sliced inverse regression (SIR) is

𝑺𝑆𝐼𝑅(𝒙; 𝑦) = 𝐸( 𝐸(𝒙 − 𝐸(𝒙) | 𝑦) 𝐸(𝒙 − 𝐸(𝒙) | 𝑦)⊤ ).   (1)

For more examples of supervised functionals, see, e.g., Liski et al. (2014).

Initially, scatter functionals were developed to provide robust and efficient competitors to the covariance matrix under multivariate normality or ellipticity assumptions. In elliptic models, the scatter functionals can be shown to be proportional to the covariance matrix (see, e.g., Nordhausen et al., 2011). A property of interest is also the so-called independence property (Nordhausen and Tyler, 2015), which states that if 𝒙 has independent components, then 𝑺(𝒙) is a diagonal matrix. The covariance matrix, the scatter matrix based on the fourth moments as well as the symmetrized versions of Huber's and Tyler's M-estimates all have this property; see for example Dümbgen (1998); Sirkiä et al. (2007).

Surprisingly many linear dimension reduction methods can be seen as a joint diagonalization of two scatter matrices. Let 𝑾 = 𝑾(𝒙) be a 𝑝 × 𝑝 transformation matrix (functional) which diagonalizes two scatter functionals 𝑺1 = 𝑺1(𝒙) and 𝑺2 = 𝑺2(𝒙) so that

𝑾 𝑺1 𝑾⊤ = 𝑰𝑝  and  𝑾 𝑺2 𝑾⊤ = 𝑫,

where 𝑫 is a 𝑝 × 𝑝 diagonal matrix with diagonal elements 𝑑1 ≥ … ≥ 𝑑𝑝. This method is called invariant coordinate selection (ICS) (Tyler et al., 2009) as, for any nonsingular 𝑝 × 𝑝 matrix 𝑨, 𝑾(𝒙)𝒙 = 𝑾(𝑨𝒙)𝑨𝒙 up to the signs of the components. The transformed 𝑾(𝒙)𝒙 in an invariant coordinate system may reveal intrinsic and hidden structures in the data. For an unsupervised 𝑺1 and a supervised 𝑺2, the transformation is known as supervised invariant coordinate selection (SICS) (Liski et al., 2014). ICS and SICS are very general methods with many possible applications, but we now have a look at only three well-known special cases.
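The joint diagonalization itself can be sketched in a few lines of base R: whiten the data with the symmetric inverse square root of 𝑺1 and then rotate with the eigenvectors of the whitened 𝑺2. The helper below is our own sketch, not the implementation in the ICS package; with S1_fun = cov and S2_fun = cov4_hat from the previous sketch it corresponds to the FOBI pair, and with a supervised functional as S2_fun it gives a SICS-type transform.

```r
# Sketch (ours) of the two-scatter transformation behind ICS/SICS: find W with
# W S1 W^T = I_p and W S2 W^T = D, where D is diagonal with decreasing entries.
ics_sketch <- function(X, S1_fun = cov, S2_fun = cov4_hat) {
  X  <- as.matrix(X)
  S1 <- S1_fun(X)
  S2 <- S2_fun(X)
  E1 <- eigen(S1, symmetric = TRUE)
  S1_inv_sqrt <- E1$vectors %*% diag(1 / sqrt(E1$values)) %*% t(E1$vectors)
  E2 <- eigen(S1_inv_sqrt %*% S2 %*% S1_inv_sqrt, symmetric = TRUE)
  W  <- t(E2$vectors) %*% S1_inv_sqrt                 # W S1 W^T = I_p, W S2 W^T = D
  list(W = W,
       D = E2$values,                                 # diagonal of D, d_1 >= ... >= d_p
       scores = sweep(X, 2, colMeans(X)) %*% t(W))    # transformed data W(x) x
}
```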
2.1. Principal component analysis

PCA is one of the most popular multivariate methods and fits into this framework with 𝑺1 ∶= 𝑰𝑝 and 𝑺2 ∶= cov. The principal components in 𝑾(𝒙)𝒙 are then ordered according to their variances, that is, the diagonal elements of 𝑫. The number of components 𝑞 to retain is often based on the cumulative proportion of explained variation or, visually, on finding the elbow in the scree plot which indicates that the variances of the remaining components are approximately equal. The idea behind checking the scree plot is formalized by assuming that, for some 𝑞, the principal values satisfy

𝑑1 ≥ … ≥ 𝑑𝑞 > 𝑑𝑞+1 = … = 𝑑𝑝 > 0.

The null hypothesis 𝐻0 ∶ 𝑞 = 𝑘 can then be tested by assuming ellipticity and using the variance of the 𝑝 − 𝑘 smallest estimated eigenvalues, say 𝑇𝑘, as a test statistic. The limiting distribution of 𝑛𝑇𝑞, properly scaled and when 𝑛 → ∞, is then a chi-squared distribution with (𝑝 − 𝑞 − 1)(𝑝 − 𝑞 + 2)∕2 degrees of freedom. For asymptotic and bootstrap tests and the estimates of 𝑞 based on these tests, see Schott (2006); Nordhausen et al. (2017a).

2.2. Independent component analysis

The independent component model is a semiparametric model with the assumption that

𝒙 = 𝑨𝒛 + 𝒃,

where 𝑨 is a full-rank 𝑝 × 𝑝 mixing matrix, 𝒃 specifies the location center of 𝒙, and 𝒛 is a standardized random 𝑝-vector of independent components so that 𝐸(𝒛) = 𝟎 and cov(𝒛) = 𝑰𝑝. The goal in ICA is to estimate an unmixing matrix 𝑾 to transform the data to the independent components (Nordhausen and Oja, 2018a).

The ICS matrix 𝑾 also solves the ICA problem if both scatter matrices 𝑺1 and 𝑺2 have the independence property. The eigenvalues in 𝑫 then provide componentwise kurtosis measures (Nordhausen et al., 2011). The most popular choice, providing the FOBI solution, is 𝑺1 ∶= cov and 𝑺2 ∶= cov4 with the componentwise Pearson kurtosis measures (see Cardoso (1989) and also Nordhausen and Virta (2019)). The 𝑞 non-gaussian independent components can be estimated in this model if their kurtosis values are distinct. In signal processing, it is generally thought that these 𝑞 non-gaussian components represent the signal part and the 𝑝 − 𝑞 gaussian components the noise part of the data.

For the FOBI approach, the eigenvalues for the gaussian components are 𝑝 + 2 and, to find the non-gaussian components, the strategy is to choose the components whose values differ most from 𝑝 + 2. A natural test statistic 𝑇𝑘 for 𝐻0 ∶ 𝑞 = 𝑘 is then the sum of the 𝑝 − 𝑘 smallest diagonal entries of (𝑫̂ − (𝑝 + 2)𝑰𝑝)² and, if the eighth moments are finite, the limiting null distribution of 𝑛𝑇𝑞 as 𝑛 → ∞ is the distribution of

2𝜎1 𝜒²_{(𝑝−𝑞−1)(𝑝−𝑞+2)∕2} + (2𝜎1 + 4(𝑝 − 𝑞)) 𝜒²_1

with independent chi-squared variables, where the constant 𝜎1 = Var(‖𝒛‖²) + 8 can be consistently estimated from the data. The bootstrap test versions are also easily available. Consistent estimates of 𝑞 can be found using various sequential testing strategies. Further, the ICA assumptions can be relaxed by allowing the non-gaussian components to be dependent; this is known as non-gaussian component analysis (NGCA). For these and further results, see Nordhausen et al. (2017a) and Nordhausen et al. (2017b).

Fig. 1. Scree-plot for the squared singular values.
Fig. 2. Pairwise scatter plots for the first four principal components.
Fig. 3. Independent components based on FOBI.
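As a sketch of the FOBI dimension test just described, the helper below (our own naming) computes the test statistic 𝑇𝑘 as the sum of the 𝑝 − 𝑘 smallest diagonal entries of (𝑫̂ − (𝑝 + 2)𝑰𝑝)²; the ICtest package used later in the paper provides the complete asymptotic and bootstrap tests.

```r
# Sketch (ours) of the FOBI test statistic for H0: q = k, as described above:
# T_k = sum of the p - k smallest diagonal entries of (D_hat - (p + 2) I_p)^2,
# where D_hat holds the estimated FOBI eigenvalues.
fobi_Tk <- function(D_hat, k) {
  p    <- length(D_hat)
  dev2 <- (D_hat - (p + 2))^2        # squared deviations from the gaussian value p + 2
  sum(sort(dev2)[seq_len(p - k)])    # sum of the p - k smallest entries
}

# Example (D_hat from the FOBI pair of the previous sketch); n * T_k is then
# compared with the limiting distribution given above:
# fobi <- ics_sketch(X, cov, cov4_hat)
# nrow(X) * fobi_Tk(fobi$D, k = 1)
```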
2.3. Sliced inverse regression

SIR is a supervised dimension reduction method originally suggested by Li (1991). It uses 𝑺1 ∶= cov and 𝑺2 ∶= 𝑺𝑆𝐼𝑅 as defined in equation (1). In practice, one discretizes 𝑦 and uses 𝑺𝑆𝐼𝑅(𝒙, 𝑦𝑑) where 𝑦𝑑 = ℎ ⇔ 𝑦 ∈ 𝑆ℎ, ℎ = 1, ..., 𝐻, with disjoint intervals (slices) 𝑆1, ..., 𝑆𝐻 which satisfy 𝑆1 ∪ ... ∪ 𝑆𝐻 = ℝ. Note that the rank of the sample version of 𝑺𝑆𝐼𝑅(𝒙, 𝑦𝑑) is at most 𝐻 − 1.

We then assume again the location-scatter model 𝒙 = 𝑨𝒛 + 𝒃, where now the standardized 𝒛 = (𝒛1⊤, 𝒛2⊤)⊤ satisfies (𝑦, 𝒛1⊤)⊤ ⟂ 𝒛2. In this partitioning, 𝒛1 is chosen to have the smallest possible dimension 𝑞; the subvectors 𝒛1 and 𝒛2 represent the signal and noise parts of 𝒛, respectively. Under this model and using cov(𝒙) and 𝑺𝑆𝐼𝑅(𝒙, 𝑦𝑑), the diagonal entries of 𝑫 are

𝑑1 ≥ … ≥ 𝑑𝑞 ≥ 𝑑𝑞+1 = … = 𝑑𝑝 = 0.

It is further required that 𝑑𝑞 > 𝑑𝑞+1 so that SIR can be assumed to find the full signal.

To test the hypothesis 𝐻0 ∶ 𝑞 = 𝑘, Nordhausen et al. (2017a) used the test statistic 𝑇𝑘 which is simply the sum of the 𝑝 − 𝑘 smallest diagonal entries of 𝑫̂. Then, for the true value 𝑞, 𝑛𝑇𝑞 has a limiting chi-squared distribution with (𝑝 − 𝑞)(𝐻 − 𝑞 − 1) degrees of freedom as 𝑛 goes to infinity. As in the PCA and FOBI cases, different sequential testing strategies can again be used to find a consistent estimate of 𝑞, and, for small 𝑛, a bootstrap testing strategy may be used. For these and further results, see again Nordhausen et al. (2017a). Bura and Cook (2001) derived the limiting distribution of the same test statistic under weaker conditions.
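A minimal sketch of the sample version of the supervised scatter matrix (1): the response is discretized into slices and the outer products of the slice means of the centered data are averaged with the slice proportions as weights. The function below is our own illustration; the quantile-based slicing is just one possible choice and, for a categorical response as in Section 4, the given groups are used directly.

```r
# Sketch (ours) of the sample SIR scatter matrix from equation (1).
sir_scatter <- function(X, y, H = 10) {
  X  <- as.matrix(X)
  Xc <- sweep(X, 2, colMeans(X))                       # centered data
  yd <- if (is.factor(y)) y else
    cut(y, breaks = unique(quantile(y, probs = seq(0, 1, length.out = H + 1))),
        include.lowest = TRUE)                         # slices S_1, ..., S_H
  S <- matrix(0, ncol(X), ncol(X))
  for (h in levels(yd)) {
    idx <- which(yd == h)
    if (length(idx) == 0) next
    m_h <- colMeans(Xc[idx, , drop = FALSE])           # slice mean of x - E(x)
    S   <- S + (length(idx) / nrow(X)) * tcrossprod(m_h)
  }
  S                                                    # rank at most H - 1
}
```

Together with cov as 𝑺1, this functional can be plugged into the two-scatter transform sketched in Section 2 to obtain SIR/SICS components.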
2.4. Comparison to other dimension reduction methods

The linear dimension reduction methods considered here rely on linear transformations of the original data into a lower-dimensional subspace. In contrast, non-linear dimension reduction methods like t-SNE, Isomap and UMAP apply more drastic transformations to the data; these methods do not assume that the original data lie in a linear subspace. Besides linear and non-linear methods for dimension reduction, feature selection methods and several classification methods, such as the Random Forest, can also be considered as dimension reduction methods. Even a simple linear regression can be used for dimension reduction.

2.5. The case of 𝑛 ≪ 𝑝

The use of the above methodology (PCA, FOBI, SIR) involves problems in the context of genetic data with 𝑛 ≪ 𝑝. Tyler (2010) showed that in this scenario, unfortunately, all scatter functionals with the affine equivariance property are proportional to the covariance matrix, prohibiting the subspace estimation. So to apply these methods, one should first reduce the dimension to some 𝑘 < 𝑛, for example by using the SVD, and then continue working in the resulting subspace.

Fig. 4. Independent components based on robust ICA.

The high-dimensionality problem is often approached in bioinformatics by performing PCA via the singular value decomposition (SVD). This is the quasi-standard first step in bioinformatics data analysis; see tools such as Genomatix, Genious or CLCbio that "solve" the problem of choosing a working dimension 𝑘 quite conveniently and simply by setting the default to 𝑘 = 2. Let 𝑿 be our centered 𝑝 × 𝑛 data matrix. Then the singular value decomposition (SVD) of 𝑿 with rank 𝑟 ≤ min{𝑝, 𝑛} is defined as

𝑿 = 𝑼𝑫𝑽⊤,

where 𝑼 = (𝒖1, ..., 𝒖𝑟) is a 𝑝 × 𝑟 matrix and 𝑽 = (𝒗1, ..., 𝒗𝑟) is an 𝑛 × 𝑟 matrix, both with orthonormal columns, and 𝑫 is an 𝑟 × 𝑟 diagonal matrix with positive diagonal elements in decreasing order. The number of variables is then reduced via PCA to 𝑘 with the transformation 𝑿∗ = (𝑼∗)⊤𝑿 where 𝑼∗ = (𝒖1, ..., 𝒖𝑘). The problem then naturally is how to choose 𝑘 without losing any (or too much) information.
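In the notation above (a centered 𝑝 × 𝑛 data matrix 𝑿 with variables in rows), the SVD reduction step can be sketched in base R as follows; the toy data, the 80% rule for choosing 𝑘 and the object names are our own illustration.

```r
# Sketch (ours) of the SVD reduction step: X = U D V^T and X* = (U*)^T X,
# where U* contains the first k left singular vectors.
set.seed(1)
X <- matrix(rnorm(2000 * 50), nrow = 2000)   # toy data: p = 2000 variables, n = 50 samples
X <- X - rowMeans(X)                         # center each variable (row)

s    <- svd(X)                               # X = U D V^T
expl <- cumsum(s$d^2) / sum(s$d^2)           # cumulative explained variation (squared singular values)
k    <- which(expl >= 0.80)[1]               # e.g. keep at least 80% of the variation

U_star <- s$u[, seq_len(k), drop = FALSE]    # first k left singular vectors
X_star <- t(U_star) %*% X                    # reduced k x n data matrix
```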
3. A microRNA expression data set

The analysis of genetic data sets is one of the driving forces behind developing tools for 𝑛 ≪ 𝑝 problems. For example, the current sample sizes in typical transcriptomic experiments range from just a few to a couple of hundred individuals due to the high costs of sequencing. Also, the storage of massive data is often almost prohibitive. It is therefore not likely that the case 𝑛 > 𝑝 will become realistic for genetic data in the foreseeable future.

The data set used in this work consists of human microRNA (miRNA) expressions from Agilent microarrays where 2125 different probes of 813 different miRNAs were used for subjects coming from three different groups: 76 healthy individuals, 78 patients with a mild form of prostate cancer (PrCa) and 35 with an aggressive type of PrCa. In total, 26 microarrays were used, and the hybridization took place at four different time points. The data set was originally analyzed in Fischer et al. (2014, 2015), and all relevant data are available from EMBL-EBI ArrayExpress (accession number E-MTAB-3397). Note that here 𝑛 = 189 may be seen as untypically large compared to the relatively "small" dimension 𝑝 = 2125. The goal with this data is to identify miRNAs which are either responsible for the development of PrCa (predisposition) or which could serve as biomarkers for the detection of PrCa (diagnosis) and, further, to distinguish the two different types of PrCa.

Please note that the findings described here are not only valid for microarray data but also for any other kind of more recent data such as whole-transcriptome sequencing (WTS). In WTS the 𝑛 ≪ 𝑝 problem is even more prominent as, compared to microarray data, usually all possible genes and transcripts are quantified, so that 𝑝 is of the order of magnitude of more than 20,000 while 𝑛 is often even smaller. Hence, dimension reduction methods are essential in WTS as well.

Fig. 5. Pairwise scatter plots for SIR components obtained from the 98-variate data and with the colors corresponding to the three health groups.

Table 1. The first 10 squared singular values from SVD.

      Squared singular value   Cumulative explained variation (%)
  1   49584782.29              50.3
  2   17020792.70              67.6
  3   10365087.69              78.1
  4    7799839.61              86.0
  5    3178859.60              89.2
  6    2036165.39              91.3
  7    1750698.70              93.1
  8    1386901.81              94.5
  9     937702.68              95.5
 10     846675.15              96.3

4. Application

In the analysis of the microRNA data, the problem was to identify those miRNAs which allow us to separate healthy subjects from PrCa cases and, if possible, even distinguish between the two different types of cancer. The following analysis was done entirely with R (R Core Team, 2017) using the packages ICS (Nordhausen et al., 2008) and ICtest (Nordhausen et al., 2017c).

As 𝑛 ≪ 𝑝, the SVD was performed first, with the results reported in Table 1 (the first 10 eigenvalues of cov) and the corresponding scree-plot in Fig. 1. The scree-plot suggests that four components might be reasonable, as they already explain more than 80% of the variation of the data (which has also sometimes been used as a rule). Note that as many as 98 components are needed to explain 99.99% of the variation.

Table 2. Ordered kurtosis values from FOBI and robust ICA using the first four principal components.

     FOBI     robust ICA
  1  9.7428   3.6838
  2  7.2991   0.8069
  3  5.3417   0.6101
  4  6.2365   0.5514

ICA in the four-variate subspace. We first start with the four principal components (𝑘 = 4) plotted in Fig. 2. The pairwise scatter plots for these first four principal components reveal one group that is separate from the bulk of the data, but not three groups with different cancer types as desired. This is because, in searching for subgroups of the data, the kurtosis measures are more natural than the variance as an information criterion. We therefore next try ICS for this four-variate data with two choices of the pairs of scatter matrices (𝑺1, 𝑺2), namely, (i) FOBI based on cov and cov4, and (ii) robust ICA based on the symmetrized versions of Tyler's and Huber's scatter matrices. Table 2 reports the estimated kurtosis measures, that is, the diagonal entries of 𝑫̂.

The test results reported in Table 3 suggest that the number of non-gaussian components is 𝑞 = 1. The four ICS components from FOBI and robust ICA are plotted in Figs. 3 and 4, respectively. In both cases, the first component seems to separate two groups which, unfortunately, are not at all connected to the three health status groups. After some detective work, it was found that the group of 24 subjects, highlighted with red color in the plots, were outliers from three microarrays created by a single person at different time points. Also, the second components seem just to find another small group of outliers in the data. Hence, based on these results, it can be concluded that ICA cannot find the groups of interest if only the first four principal components are used in the analysis. We repeated ICA with 98 principal components as well, but even in this case we could not find structures in the data to identify the health status of the subjects.

Table 3. The 𝑝-values for testing the hypotheses 𝐻0 ∶ 𝑞 = 𝑘, 𝑘 = 0, 1, 2, where 𝑞 is the number of non-gaussian components. The tests are asymptotic and bootstrap tests (with 500 bootstrap samples) using the FOBI approach for the four-variate data.

 𝐻0       Asymp     Boot
 𝑞 = 0    <0.0001   0.0020
 𝑞 = 1    0.1050    0.0818
 𝑞 = 2    0.3963    0.4711

Fig. 6. Pairwise scatter plots for SIR components obtained from the four-variate data and with the colors corresponding to the three health groups.

SIR in the 98-variate subspace. We next perform SIR and use the data with the 98 principal components. As we hope to separate the three health groups, it is natural to let 𝑦 indicate the group memberships. Note that, as 𝐻 = 3, the dimension of the interesting subspace is 𝑞 = 0, 1 or 2. The 𝑝-values for the asymptotic and bootstrap tests for 𝑞 = 0 and 𝑞 = 1 are reported in Table 4. The tests reject both 𝑞 = 0 and 𝑞 = 1 and therefore suggest that 𝑞 = 2. For visualization purposes we plot, in Fig. 5, the first five components and highlight the three response classes with different colors. We also applied SIR to the four-variate data, and the SICS components are plotted in Fig. 6. It is then clear that the information needed to separate the three health groups is lost if only the first four principal components are used.

Table 4. The 𝑝-values for testing 𝐻0 ∶ 𝑞 = 0 and 𝐻0 ∶ 𝑞 = 1. The tests are asymptotic and bootstrap tests (with 500 bootstrap samples) using the SIR approach for the 98-variate data.

 𝐻0       Asymp     Boot
 𝑞 = 0    ≤0.0001   0.0448
 𝑞 = 1    0.0001    0.0050
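To tie the pieces together, the following sketch strings the SVD reduction, the FOBI transform and the SIR/SICS transform into one pipeline on simulated placeholder data, reusing the helper functions from the earlier sketches (cov4_hat, ics_sketch, sir_scatter). It only mirrors the structure of the analysis in this section; the published results were obtained with the ICS and ICtest packages (including the asymptotic and bootstrap dimension tests) on the actual miRNA data.

```r
# Toy placeholder data, not the miRNA data: p variables in rows, n samples in columns.
set.seed(7)
p <- 500; n <- 150
X   <- matrix(rnorm(p * n), nrow = p)
grp <- factor(sample(c("control", "mild", "aggressive"), n, replace = TRUE))

# Step 1: SVD/PCA reduction (cf. Table 1 and Fig. 1)
X    <- X - rowMeans(X)                      # center each variable
s    <- svd(X)
expl <- cumsum(s$d^2) / sum(s$d^2)           # cumulative explained variation
k    <- 4                                    # as in Section 4, start with k = 4
Xk   <- t(t(s$u[, 1:k]) %*% X)               # n x k matrix of PC scores

# Step 2: ICS/FOBI on the reduced data (cf. Table 2 and Figs. 2-3)
fobi <- ics_sketch(Xk, S1_fun = cov, S2_fun = cov4_hat)
fobi$D                                       # kurtosis-type eigenvalues

# Step 3: SIR/SICS with the group labels as response (cf. Tables 3-4, Figs. 5-6)
sir <- ics_sketch(Xk, S1_fun = cov,
                  S2_fun = function(Z) sir_scatter(Z, grp))
pairs(sir$scores, col = as.integer(grp))     # pairwise scatter plots of the components
```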
5. Conclusions

In this paper, we discussed the use of two scatter matrices for unsupervised and supervised linear dimension reduction under broad model assumptions. The signal and noise spaces are then easily separated using the scatter matrices, and the eigenvalues of one scatter matrix with respect to the other are listed in 𝑫. The eigenvalues can also be used to decide what the dimension of the signal space is. The theory is, however, developed only for large-𝑛-small-𝑝 cases, which rules out genetic applications.

Following the general practice in bioinformatics, we tried to circumvent this problem by reducing the dimension using SVD (PCA) and hoping that the first few principal components capture all the relevant information. In our case study, we explored this strategy and used ICA and SIR for the four- and 98-variate data sets obtained in this way. ICA and SIR have been used to analyze bioinformatics data earlier also in Liebermeister (2002); Kong et al. (2008); Zhong et al. (2005); Fischer et al. (2017). Here we, however, wanted to explore the impact of the SVD step on the results, and the conclusion is that the number of principal components to retain is crucial. It seems advisable to keep the number of components as high as the further analysis tools allow without suffering from the curse of dimensionality.

Based on this study, it would be worthwhile to develop models and techniques which allow for the SVD preprocessing step and can then formalize the rules for the number of PCs to retain. The SVD step could, for example, be incorporated into the bootstrapping procedure to accommodate the variation coming from that step. Note, however, that for example El Karoui and Purdom (2018, 2019) argue that bootstrapping is in general difficult in the 𝑛 ≪ 𝑝 setting.

Declarations

Author contribution statement

D. Fischer: Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper. K. Nordhausen: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper. H. Oja: Conceived and designed the experiments; Contributed reagents, materials, analysis tools or data; Wrote the paper.

Funding statement

The work of KN was supported by the Austrian Science Fund (FWF) Grant number P31881-N32.

Data availability statement

Data associated with this study has been deposited at https://github.com/fischuu/LinearDimensionReductionInBioinformatics.

Declaration of interests statement

The authors declare no conflict of interest.

Additional information

No additional information is available for this paper.

Acknowledgements

The authors wish to acknowledge CSC – IT Center for Science, Finland, for computational resources. We thank two anonymous reviewers for helpful comments on an earlier draft of the manuscript.

References

Bura, E., Cook, R.D., 2001. Extending sliced inverse regression: the weighted chi-squared test. J. Am. Stat. Assoc. 96, 996–1003.
Cardoso, J.F., 1989. Source separation using higher order moments. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Glasgow, UK, pp. 2109–2112.
Dümbgen, L., 1998. On Tyler's M-functional of scatter in high dimension. Ann. Inst. Stat. Math. 50, 471–491.
Dümbgen, L., Pauly, M., Schweizer, T., 2015. M-functionals of multivariate scatter. Stat. Surv. 9, 32–105.
El Karoui, N., Purdom, E., 2018. Can we trust the bootstrap in high-dimensions? The case of linear models. J. Mach. Learn. Res. 19, 1–66.
El Karoui, N., Purdom, E., 2019. The bootstrap, covariance matrices and PCA in moderate and high-dimensions. In: Proceedings of Machine Learning Research: the 22nd International Conference on Artificial Intelligence and Statistics, vol. 89, pp. 2115–2124.
Fischer, D., Honkatukia, M., Tuiskula-Haavisto, M., Nordhausen, K., Cavero, D., Preisinger, R., Vilkki, J., 2017. Subgroup detection in genotype data using invariant coordinate selection. BMC Bioinform. 18, 173–181.
Fischer, D., Oja, H., Schleutker, J., Sen, P.K., Wahlfors, T., 2014. Generalized Mann-Whitney type tests for microarray experiments. Scand. J. Stat. 41, 672–692.
Fischer, D., Wahlfors, T., Mattila, H., Oja, H., Tammela, T., Schleutker, J., 2015. MiRNA profiles in lymphoblastoid cell lines of Finnish prostate cancer families. PLoS ONE 10, e0127427.
Kong, W., Vanderburg, C., Gunshin, H., Rogers, J., Huang, X., 2008. A review of independent component analysis application to microarray gene expression data. BioTechniques 45, 501–520.
Li, K.-C., 1991. Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 86, 316–327.
Liebermeister, W., 2002. Linear modes of gene expression determined by independent component analysis. Bioinformatics 18, 51–60.
Liski, E., Nordhausen, K., Oja, H., 2014. Supervised invariant coordinate selection. Statistics 48, 711–731.
van der Maaten, L., Hinton, G., 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.
McInnes, L., Healy, J., Saul, N., Großberger, L., 2018. UMAP: uniform manifold approximation and projection. J. Open Sour. Softw. 3, 861.
Nordhausen, K., Oja, H., 2018a. Independent Component Analysis: A Statistical Perspective. Wiley Interdisciplinary Reviews: Computational Statistics, vol. 10, p. e1440.
Nordhausen, K., Oja, H., 2018b. Robust nonparametric inference. Annu. Rev. Stat. Appl. 5, 473–500.
Nordhausen, K., Oja, H., Ollila, E., 2011. Multivariate models and the first four moments. In: Hunter, D.R., Richards, D.S.R., Rosenberger, J.L. (Eds.), Nonparametric Statistics and Mixture Models. World Scientific, Singapore, pp. 267–287.
Nordhausen, K., Oja, H., Tyler, D.E., 2008. Tools for exploring multivariate data: the package ICS. J. Stat. Softw. 28, 1–31.
Nordhausen, K., Oja, H., Tyler, D.E., 2017a. Asymptotic and bootstrap tests for subspace dimension. ArXiv e-prints.
Nordhausen, K., Oja, H., Tyler, D.E., Virta, J., 2017b. Asymptotic and bootstrap tests for the dimension of the non-Gaussian subspace. IEEE Signal Process. Lett. 24, 887–891.
Nordhausen, K., Oja, H., Tyler, D.E., Virta, J., 2017c. ICtest: estimating and testing the number of interesting components in linear dimension reduction. https://CRAN.R-project.org/package=ICtest. R package version 0.3.
Nordhausen, K., Tyler, D.E., 2015. A cautionary note on robust covariance plug-in methods. Biometrika 102, 573–588.
Nordhausen, K., Virta, J., 2019. An overview of properties and extensions of FOBI. Knowl.-Based Syst. 173, 113–116.
R Core Team, 2017. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Rousseeuw, P., Hubert, M., 2013. High-breakdown estimators of multivariate location and scatter. In: Becker, C., Fried, R., Kuhnt, S. (Eds.), Robustness and Complex Data Structures: Festschrift in Honour of Ursula Gather. Springer, Berlin, Heidelberg, pp. 49–66.
Schott, J.R., 2006. A high-dimensional test for the equality of the smallest eigenvalues of a covariance matrix. J. Multivar. Anal. 97, 827–843.
Sirkiä, S., Taskinen, S., Oja, H., 2007. Symmetrised M-estimators of scatter. J. Multivar. Anal. 98, 1611–1629.
Tenenbaum, J.B., de Silva, V., Langford, J.C., 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323.
Tyler, D.E., 1987. A distribution-free M-estimator of multivariate scatter. Ann. Stat. 15, 234–251.
Tyler, D.E., 2010. A note on multivariate location and scatter statistics for sparse data sets. Stat. Probab. Lett. 80, 1409–1413.
Tyler, D.E., Critchley, F., Dümbgen, L., Oja, H., 2009. Invariant coordinate selection. J. R. Stat. Soc. B 71, 549–592.
Zhong, W., Zeng, P., Ma, P., Liu, J.S., Zhu, Y., 2005. RSIR: regularized sliced inverse regression for motif discovery. Bioinformatics 21, 4169–4175.