On linear dimension reduction based on diagonalization of scatter matrices for bioinformatics downstream analyses

Daniel Fischer a,∗, Klaus Nordhausen b and Hannu Oja c

a Natural Resources Institute Finland (Luke), Applied Statistical Methods, Myllytie 1, 31600 Jokioinen, Finland
b CSTAT - Computational Statistics, Institute of Statistics & Mathematical Methods in Economics, Vienna University of Technology, Wiedner Hauptstraße 7, A-1040 Vienna, Austria
c Department of Mathematics and Statistics, University of Turku, 20014 Turku, Finland

∗ Corresponding author. E-mail address: daniel.fischer@luke.fi (D. Fischer).

Published version: Fischer D., Nordhausen K., Oja H. (2020). On linear dimension reduction based on diagonalization of scatter matrices for bioinformatics downstream analyses. Heliyon 6(12), e05732. https://doi.org/10.1016/j.heliyon.2020.e05732. © The Author(s) 2020. Rights: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Keywords: Computer science; Mathematics; Statistics; Bioinformatics; Microbial genomics; Genomics; Transcriptomics; Dimension reduction; SIR; ICA

Abstract. Dimension reduction is often a preliminary step in the analysis of data sets with a large number of variables. Most classical dimension reduction methods, both supervised and unsupervised, such as principal component analysis (PCA), independent component analysis (ICA) or sliced inverse regression (SIR), can be formulated using one, two or several different scatter matrix functionals. Scatter matrices can be seen as different measures of multivariate dispersion; they might highlight different features of the data, and comparing them might reveal interesting structures. Such an analysis searches for a projection onto an interesting (signal) part of the data, and it is therefore also important to know the correct dimension of the signal subspace. These approaches usually make either no model assumptions or work in wide classes of semiparametric models. Theoretical results in the literature are, however, limited to the case where the sample size exceeds the number of variables, which is hardly ever true for data sets encountered in bioinformatics. In this paper, we briefly review the relevant literature and explore whether these dimension reduction tools can be used to find relevant and interesting subspaces for small-𝑛-large-𝑝 data sets. We illustrate the methods with a microarray data set of prostate cancer patients and healthy controls.
1. Introduction

In contemporary data analysis, linear dimension reduction is often the first step in reducing the number of variables. For a numeric 𝑝-variate random vector 𝒙 this means that a 𝑞 × 𝑝 transformation matrix 𝑾 with 𝑞 ≪ 𝑝 is searched for such that all relevant information for the analysis at hand is contained in 𝑾 𝒙. The question is then naturally how the "information" content is measured, but in general two types of linear dimension reduction can be distinguished:

Unsupervised dimension reduction: 𝒙 | 𝑾 𝒙 is considered as uninformative noise.
Supervised dimension reduction: 𝑦 ⟂ 𝒙 | 𝑾 𝒙 for some response variable 𝑦 of interest.

There is an abundance of supervised and unsupervised methods and, depending on the data and problem, they might give quite different results as they adopt different concepts of information. One prevalent supervised method is, for example, linear discriminant analysis (LDA); more recently developed non-linear dimension reduction methods include t-SNE (van der Maaten and Hinton, 2008), Isomap (Tenenbaum et al., 2000) and UMAP (McInnes et al., 2018). In the context of linear unsupervised dimension reduction, for example, principal component analysis (PCA) understands information as large variation, whereas independent component analysis (ICA) usually measures information as the degree of non-gaussianity. The dimension 𝑞 of the information or signal subspace is preselected visually or by using various information criteria or testing strategies. For a more detailed comparison of the two concepts with practical examples, please see Nordhausen and Oja (2018b).

Surprisingly many unsupervised and supervised linear dimension reduction methods can be expressed as a joint diagonalization problem of two scatter matrices, which yields a nice unifying theory in wide semiparametric models. A problem, however, is that the assumption 𝑛 > 𝑝 on the sample size 𝑛 is needed, and the technique cannot be applied directly to bioinformatics data where almost always 𝑛 ≪ 𝑝.

In this paper, we first provide in Section 2 a general but brief overview of linear dimension reduction methods based on the use of two scatter matrices. Section 3 introduces a genetic data set which is used to illustrate the ideas. Section 4 then explores whether there are ways to apply these methods in the case 𝑛 ≪ 𝑝, using the example data discussed in Section 3. A GitHub repository containing the R code and data to reproduce all results and figures is also available (https://github.com/fischuu/LinearDimensionReductionInBioinformatics).

2. Linear dimension reduction using two scatter matrices

Let 𝒙 denote a 𝑝-variate random vector with cdf 𝐹𝒙, and let the joint cdf of a random variable 𝑦 and 𝒙 be denoted by 𝐹𝒙,𝑦. An (unsupervised) scatter functional is then any 𝑝 × 𝑝 matrix-valued functional 𝑺 which is affine equivariant in the sense that

𝑺(𝐹𝑨𝒙+𝒃) = 𝑨 𝑺(𝐹𝒙) 𝑨⊤

for all nonsingular 𝑝 × 𝑝 matrices 𝑨 and all 𝑝-vectors 𝒃. Similarly, a supervised scatter functional is defined as any 𝑝 × 𝑝 matrix-valued functional 𝑺 which is affine equivariant in the sense that

𝑺(𝐹𝑨𝒙+𝒃,𝑦) = 𝑨 𝑺(𝐹𝒙,𝑦) 𝑨⊤,

with 𝑨 and 𝒃 as above. These functionals are in the following also denoted by 𝑺(𝒙) and 𝑺(𝒙; 𝑦). For a random sample 𝑿 (or (𝑿, 𝑦)), the estimates of the population values 𝑺(𝒙) and 𝑺(𝒙; 𝑦) are obtained as the values of the functionals at the corresponding empirical distributions.

There is a wide literature on scatter functionals. The covariance matrix cov(𝒙) serves as the first example, but a scatter matrix may also be based on the fourth moments:

cov4(𝒙) = 𝐸( 𝑟² (𝒙 − 𝐸(𝒙))(𝒙 − 𝐸(𝒙))⊤ ),  with 𝑟² = (𝒙 − 𝐸(𝒙))⊤ cov(𝒙)⁻¹ (𝒙 − 𝐸(𝒙)).

The M-functionals of scatter are implicitly defined by

𝑺(𝒙) = 𝐸( 𝑤(𝑟) (𝒙 − 𝑻(𝒙))(𝒙 − 𝑻(𝒙))⊤ ),

where 𝑻(𝒙) is a companion location functional and 𝑤(⋅) a weight function for 𝑟 = ((𝒙 − 𝑻(𝒙))⊤ 𝑺(𝒙)⁻¹ (𝒙 − 𝑻(𝒙)))^(1/2). Popular weight functions are, for example, the Huber weights

𝑤𝐻(𝑟) = 1∕𝜎² for 𝑟 ≤ 𝑐  and  𝑤𝐻(𝑟) = 𝑐∕(𝑟²𝜎²) for 𝑟 > 𝑐,

where 𝑐 is a user-specified tuning constant and 𝜎² a scaling factor. The weight function for Tyler's scatter matrix is 𝑤𝑇(𝑟) = 𝑝∕𝑟². For a general discussion of these and other (unsupervised) scatter matrices, see for example Tyler (1987); Rousseeuw and Hubert (2013); Dümbgen et al. (2015).
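To make the definitions above concrete, here is a minimal base-R sketch of the two sample scatter matrices used later by FOBI, the covariance matrix and the fourth-moment scatter cov4. The function name cov4_hat and the simulated data are our own illustration and not part of the original analysis; the ICS package provides ready-made versions.

```r
# Minimal sketch (ours, not from a package) of the fourth-moment scatter matrix
# cov4(x) = E( r^2 (x - E(x))(x - E(x))^T ), r^2 = (x - E(x))^T cov(x)^{-1} (x - E(x)).
cov4_hat <- function(X) {
  X  <- as.matrix(X)
  Xc <- sweep(X, 2, colMeans(X))                 # center the data
  r2 <- rowSums((Xc %*% solve(cov(X))) * Xc)     # squared Mahalanobis distances
  crossprod(Xc * r2, Xc) / nrow(X)               # average of r^2 (x - mean)(x - mean)^T
}

# Small illustration with simulated data (p = 3, n = 200)
set.seed(1)
X  <- cbind(rnorm(200), rexp(200), runif(200))
S1 <- cov(X)        # ordinary covariance matrix
S2 <- cov4_hat(X)   # scatter matrix based on fourth moments
```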
A supervised scatter functional used in sliced inverse regression (SIR) is

𝑺𝑆𝐼𝑅(𝒙; 𝑦) = 𝐸( 𝐸(𝒙 − 𝐸(𝒙) | 𝑦) 𝐸(𝒙 − 𝐸(𝒙) | 𝑦)⊤ ).   (1)

For more examples of supervised functionals, see, e.g., Liski et al. (2014).

Initially, scatter functionals were developed to provide robust and efficient competitors to the covariance matrix under multivariate normality or ellipticity assumptions. In elliptic models, the scatter functionals can be shown to be proportional to the covariance matrix (see, e.g., Nordhausen et al., 2011). A property of interest is also the so-called independence property (Nordhausen and Tyler, 2015), which states that if 𝒙 has independent components, then 𝑺(𝒙) is a diagonal matrix. The covariance matrix, the scatter matrix based on the fourth moments as well as the symmetrized versions of Huber's and Tyler's M-estimates all have this property; see for example Dümbgen (1998); Sirkiä et al. (2007).

Surprisingly many linear dimension reduction methods can be seen as a joint diagonalization of two scatter matrices. Let 𝑾 = 𝑾(𝒙) be a 𝑝 × 𝑝 transformation matrix (functional) which diagonalizes two scatter functionals 𝑺1 = 𝑺1(𝒙) and 𝑺2 = 𝑺2(𝒙) so that

𝑾 𝑺1 𝑾⊤ = 𝑰𝑝  and  𝑾 𝑺2 𝑾⊤ = 𝑫,

where 𝑫 is a 𝑝 × 𝑝 diagonal matrix with diagonal elements 𝑑1 ≥ … ≥ 𝑑𝑝. This method is called invariant coordinate selection (ICS) (Tyler et al., 2009) as, for any nonsingular 𝑝 × 𝑝 matrix 𝑨, 𝑾(𝒙)𝒙 = 𝑾(𝑨𝒙)𝑨𝒙 up to the signs of the components. The transformed 𝑾(𝒙)𝒙 in an invariant coordinate system may reveal intrinsic and hidden structures in the data. For an unsupervised 𝑺1 and a supervised 𝑺2, the transformation is known as supervised invariant coordinate selection (SICS) (Liski et al., 2014). ICS and SICS are very general methods with many possible applications, but we now have a look at only three well-known special cases.
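The joint diagonalization itself can be sketched in a few lines of base R: whiten the data with the symmetric inverse square root of 𝑺1 and then rotate with the eigenvectors of the whitened 𝑺2. The helper below is our own sketch, not the implementation in the ICS package; with S1_fun = cov and S2_fun = cov4_hat from the previous sketch it corresponds to the FOBI pair, and with a supervised functional as S2_fun it gives a SICS-type transform.

```r
# Sketch (ours) of the two-scatter transformation behind ICS/SICS: find W with
# W S1 W^T = I_p and W S2 W^T = D, where D is diagonal with decreasing entries.
ics_sketch <- function(X, S1_fun = cov, S2_fun = cov4_hat) {
  X  <- as.matrix(X)
  S1 <- S1_fun(X)
  S2 <- S2_fun(X)
  E1 <- eigen(S1, symmetric = TRUE)
  S1_inv_sqrt <- E1$vectors %*% diag(1 / sqrt(E1$values)) %*% t(E1$vectors)
  E2 <- eigen(S1_inv_sqrt %*% S2 %*% S1_inv_sqrt, symmetric = TRUE)
  W  <- t(E2$vectors) %*% S1_inv_sqrt                 # W S1 W^T = I_p, W S2 W^T = D
  list(W = W,
       D = E2$values,                                 # diagonal of D, d_1 >= ... >= d_p
       scores = sweep(X, 2, colMeans(X)) %*% t(W))    # transformed data W(x) x
}
```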
2.1. Principal component analysis

PCA is one of the most popular multivariate methods and fits into this framework with 𝑺1 ∶= 𝑰𝑝 and 𝑺2 ∶= cov. The principal components in 𝑾(𝒙)𝒙 are then ordered according to their variances, that is, the diagonal elements of 𝑫. The number of components 𝑞 to retain is often based on the cumulative proportion of explained variation or, visually, on finding the elbow in the scree plot which indicates that the variances of the remaining components are approximately equal. The idea behind checking the scree plot is formalized by assuming that, for some 𝑞, the principal values satisfy

𝑑1 ≥ … ≥ 𝑑𝑞 > 𝑑𝑞+1 = … = 𝑑𝑝 > 0.

The null hypothesis 𝐻0 ∶ 𝑞 = 𝑘 can then be tested by assuming ellipticity and using the variance of the 𝑝 − 𝑘 smallest estimated eigenvalues, say 𝑇𝑘, as a test statistic. The limiting distribution of 𝑛𝑇𝑞, properly scaled and when 𝑛 → ∞, is then a chi-squared distribution with (𝑝 − 𝑞 − 1)(𝑝 − 𝑞 + 2)∕2 degrees of freedom. For asymptotic and bootstrap tests and the estimates of 𝑞 based on these tests, see Schott (2006); Nordhausen et al. (2017a).

2.2. Independent component analysis

The independent component model is a semiparametric model with the assumption that

𝒙 = 𝑨𝒛 + 𝒃,

where 𝑨 is a full-rank 𝑝 × 𝑝 mixing matrix, 𝒃 specifies the location center of 𝒙, and 𝒛 is a standardized random 𝑝-vector of independent components so that 𝐸(𝒛) = 𝟎 and cov(𝒛) = 𝑰𝑝. The goal in ICA is to estimate an unmixing matrix 𝑾 to transform the data to the independent components (Nordhausen and Oja, 2018a).

The ICS matrix 𝑾 also solves the ICA problem if both scatter matrices 𝑺1 and 𝑺2 have the independence property. The eigenvalues in 𝑫 then provide componentwise kurtosis measures (Nordhausen et al., 2011). The most popular choice, providing the FOBI solution, is 𝑺1 ∶= cov and 𝑺2 ∶= cov4 with the componentwise Pearson kurtosis measures (see Cardoso (1989) and also Nordhausen and Virta (2019)). The 𝑞 non-gaussian independent components can be estimated in this model if their kurtosis values are distinct. In signal processing, it is generally thought that these 𝑞 non-gaussian components represent the signal part and the 𝑝 − 𝑞 gaussian components the noise part of the data.

For the FOBI approach, the eigenvalues for the gaussian components are 𝑝 + 2 and, to find the non-gaussian components, the strategy is to choose the components whose values differ most from 𝑝 + 2. A natural test statistic 𝑇𝑘 for 𝐻0 ∶ 𝑞 = 𝑘 is then the sum of the 𝑝 − 𝑘 smallest diagonal entries of (𝑫̂ − (𝑝 + 2)𝑰𝑝)² and, if the eighth moments are finite, the limiting null distribution of 𝑛𝑇𝑞 as 𝑛 → ∞ is the distribution of

2𝜎1 𝜒²_{(𝑝−𝑞−1)(𝑝−𝑞+2)∕2} + (2𝜎1 + 4(𝑝 − 𝑞)) 𝜒²_1

with independent chi-squared variables, where the constant 𝜎1 = Var(‖𝒛‖²) + 8 can be consistently estimated from the data. The bootstrap test versions are also easily available. Consistent estimates of 𝑞 can be found using various sequential testing strategies. Further, the ICA assumptions can be relaxed by allowing the non-gaussian components to be dependent; this is known as non-gaussian component analysis (NGCA). For these and further results, see Nordhausen et al. (2017a) and Nordhausen et al. (2017b).

Fig. 1. Scree-plot for the squared singular values.
Fig. 2. Pairwise scatter plots for the first four principal components.
Fig. 3. Independent components based on FOBI.
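As a sketch of the FOBI dimension test just described, the helper below (our own naming) computes the test statistic 𝑇𝑘 as the sum of the 𝑝 − 𝑘 smallest diagonal entries of (𝑫̂ − (𝑝 + 2)𝑰𝑝)²; the ICtest package used later in the paper provides the complete asymptotic and bootstrap tests.

```r
# Sketch (ours) of the FOBI test statistic for H0: q = k, as described above:
# T_k = sum of the p - k smallest diagonal entries of (D_hat - (p + 2) I_p)^2,
# where D_hat holds the estimated FOBI eigenvalues.
fobi_Tk <- function(D_hat, k) {
  p    <- length(D_hat)
  dev2 <- (D_hat - (p + 2))^2        # squared deviations from the gaussian value p + 2
  sum(sort(dev2)[seq_len(p - k)])    # sum of the p - k smallest entries
}

# Example (D_hat from the FOBI pair of the previous sketch); n * T_k is then
# compared with the limiting distribution given above:
# fobi <- ics_sketch(X, cov, cov4_hat)
# nrow(X) * fobi_Tk(fobi$D, k = 1)
```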
2.3. Sliced inverse regression

SIR is a supervised dimension reduction method originally suggested by Li (1991). It uses 𝑺1 ∶= cov and 𝑺2 ∶= 𝑺𝑆𝐼𝑅 as defined in equation (1). In practice, one discretizes 𝑦 and uses 𝑺𝑆𝐼𝑅(𝒙, 𝑦𝑑) where 𝑦𝑑 = ℎ ⇔ 𝑦 ∈ 𝑆ℎ, ℎ = 1, ..., 𝐻, with disjoint intervals (slices) 𝑆1, ..., 𝑆𝐻 which satisfy 𝑆1 ∪ ... ∪ 𝑆𝐻 = ℝ. Note that the rank of the sample version of 𝑺𝑆𝐼𝑅(𝒙, 𝑦𝑑) is at most 𝐻 − 1.

We then assume again the location-scatter model 𝒙 = 𝑨𝒛 + 𝒃, where now the standardized 𝒛 = (𝒛1⊤, 𝒛2⊤)⊤ satisfies (𝑦, 𝒛1⊤)⊤ ⟂ 𝒛2. In this partitioning, 𝒛1 is chosen to have the smallest possible dimension 𝑞; the subvectors 𝒛1 and 𝒛2 represent the signal and noise parts of 𝒛, respectively. Under this model and using cov(𝒙) and 𝑺𝑆𝐼𝑅(𝒙, 𝑦𝑑), the diagonal entries of 𝑫 are

𝑑1 ≥ … ≥ 𝑑𝑞 ≥ 𝑑𝑞+1 = … = 𝑑𝑝 = 0.

It is further required that 𝑑𝑞 > 𝑑𝑞+1 so that SIR can be assumed to find the full signal.

To test the hypothesis 𝐻0 ∶ 𝑞 = 𝑘, Nordhausen et al. (2017a) used the test statistic 𝑇𝑘 which is simply the sum of the 𝑝 − 𝑘 smallest diagonal entries of 𝑫̂. Then, for the true value 𝑞, 𝑛𝑇𝑞 has a limiting chi-squared distribution with (𝑝 − 𝑞)(𝐻 − 𝑞 − 1) degrees of freedom as 𝑛 goes to infinity. As in the PCA and FOBI cases, different sequential testing strategies can again be used to find a consistent estimate of 𝑞, and, for small 𝑛, a bootstrap testing strategy may be used. For these and further results, see again Nordhausen et al. (2017a). Bura and Cook (2001) derived the limiting distribution of the same test statistic under weaker conditions.
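A minimal sketch of the sample version of the supervised scatter matrix (1): the response is discretized into slices and the outer products of the slice means of the centered data are averaged with the slice proportions as weights. The function below is our own illustration; the quantile-based slicing is just one possible choice and, for a categorical response as in Section 4, the given groups are used directly.

```r
# Sketch (ours) of the sample SIR scatter matrix from equation (1).
sir_scatter <- function(X, y, H = 10) {
  X  <- as.matrix(X)
  Xc <- sweep(X, 2, colMeans(X))                       # centered data
  yd <- if (is.factor(y)) y else
    cut(y, breaks = unique(quantile(y, probs = seq(0, 1, length.out = H + 1))),
        include.lowest = TRUE)                         # slices S_1, ..., S_H
  S <- matrix(0, ncol(X), ncol(X))
  for (h in levels(yd)) {
    idx <- which(yd == h)
    if (length(idx) == 0) next
    m_h <- colMeans(Xc[idx, , drop = FALSE])           # slice mean of x - E(x)
    S   <- S + (length(idx) / nrow(X)) * tcrossprod(m_h)
  }
  S                                                    # rank at most H - 1
}
```

Together with cov as 𝑺1, this functional can be plugged into the two-scatter transform sketched in Section 2 to obtain SIR/SICS components.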
2.4. Comparison to other dimension reduction methods

The linear dimension reduction methods considered here rely on linear transformations of the original data into a lower-dimensional subspace. In contrast, non-linear dimension reduction methods like t-SNE, Isomap and UMAP apply more drastic transformations to the data; these methods do not assume that the original data lie in a linear subspace. Besides linear and non-linear methods for dimension reduction, feature selection methods and several classification methods, such as the Random Forest, can also be considered as dimension reduction methods. Even a simple linear regression can be used for dimension reduction.

2.5. The case of 𝑛 ≪ 𝑝

The use of the above methodology (PCA, FOBI, SIR) involves problems in the context of genetic data with 𝑛 ≪ 𝑝. Tyler (2010) showed that in this scenario, unfortunately, all scatter functionals with the affine equivariance property are proportional to the covariance matrix, prohibiting the subspace estimation. So to apply these methods, one should first reduce the dimension to some 𝑘 < 𝑛, for example by using the SVD, and then continue working in the resulting subspace.

Fig. 4. Independent components based on robust ICA.

The high-dimensionality problem is often approached in bioinformatics by performing PCA via the singular value decomposition (SVD). This is the quasi-standard first step in bioinformatics data analysis; see tools such as Genomatix, Genious or CLCbio that "solve" the problem of choosing a working dimension 𝑘 quite conveniently and simply by setting the default to 𝑘 = 2. Let 𝑿 be our centered 𝑝 × 𝑛 data matrix. Then the singular value decomposition (SVD) of 𝑿 with rank 𝑟 ≤ min{𝑝, 𝑛} is defined as

𝑿 = 𝑼𝑫𝑽⊤,

where 𝑼 = (𝒖1, ..., 𝒖𝑟) is a 𝑝 × 𝑟 matrix and 𝑽 = (𝒗1, ..., 𝒗𝑟) is an 𝑛 × 𝑟 matrix, both with orthonormal columns, and 𝑫 is an 𝑟 × 𝑟 diagonal matrix with positive diagonal elements in decreasing order. The number of variables is then reduced via PCA to 𝑘 with the transformation 𝑿∗ = (𝑼∗)⊤𝑿 where 𝑼∗ = (𝒖1, ..., 𝒖𝑘). The problem then naturally is how to choose 𝑘 without losing any (or too much) information.
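In the notation above (a centered 𝑝 × 𝑛 data matrix 𝑿 with variables in rows), the SVD reduction step can be sketched in base R as follows; the toy data, the 80% rule for choosing 𝑘 and the object names are our own illustration.

```r
# Sketch (ours) of the SVD reduction step: X = U D V^T and X* = (U*)^T X,
# where U* contains the first k left singular vectors.
set.seed(1)
X <- matrix(rnorm(2000 * 50), nrow = 2000)   # toy data: p = 2000 variables, n = 50 samples
X <- X - rowMeans(X)                         # center each variable (row)

s    <- svd(X)                               # X = U D V^T
expl <- cumsum(s$d^2) / sum(s$d^2)           # cumulative explained variation (squared singular values)
k    <- which(expl >= 0.80)[1]               # e.g. keep at least 80% of the variation

U_star <- s$u[, seq_len(k), drop = FALSE]    # first k left singular vectors
X_star <- t(U_star) %*% X                    # reduced k x n data matrix
```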
3. A microRNA expression data set

The analysis of genetic data sets is one of the driving forces behind developing tools for 𝑛 ≪ 𝑝 problems. For example, the current sample sizes in typical transcriptomic experiments range from just a few to a couple of hundred individuals due to the high costs of sequencing. Also, the storage of massive data is often almost prohibitive. It is therefore not likely that the case 𝑛 > 𝑝 will become realistic for genetic data in the foreseeable future.

The data set used in this work consists of human microRNA (miRNA) expressions from Agilent microarrays where 2125 different probes of 813 different miRNAs were used for subjects coming from three different groups: 76 healthy individuals, 78 patients with a mild form of prostate cancer (PrCa) and 35 with an aggressive type of PrCa. In total, 26 microarrays were used, and the hybridization took place at four different time points. The data set was originally analyzed in Fischer et al. (2014, 2015), and all relevant data are available from EMBL-EBI ArrayExpress (accession number E-MTAB-3397). Note that here 𝑛 = 189 may be seen as untypically large compared to the relatively "small" dimension 𝑝 = 2125. The goal with this data is to identify miRNAs which are either responsible for the development of PrCa (predisposition) or which could serve as biomarkers for the detection of PrCa (diagnosis) and, further, to distinguish the two different types of PrCa.

Please note that the findings described here are not only valid for microarray data but also for any other kind of more recent data such as whole-transcriptome sequencing (WTS). In WTS the 𝑛 ≪ 𝑝 problem is even more prominent as, compared to microarray data, usually all possible genes and transcripts are quantified, so that 𝑝 is of the order of magnitude of more than 20,000 while 𝑛 is often even smaller. Hence, dimension reduction methods are essential in WTS as well.

Fig. 5. Pairwise scatter plots for SIR components obtained from the 98-variate data and with the colors corresponding to the three health groups.

Table 1. The first 10 squared singular values from SVD.

      Squared singular value   Cumulative explained variation (%)
  1   49584782.29              50.3
  2   17020792.70              67.6
  3   10365087.69              78.1
  4    7799839.61              86.0
  5    3178859.60              89.2
  6    2036165.39              91.3
  7    1750698.70              93.1
  8    1386901.81              94.5
  9     937702.68              95.5
 10     846675.15              96.3

4. Application

In the analysis of the microRNA data, the problem was to identify those miRNAs which allow us to separate healthy subjects from PrCa cases and, if possible, even distinguish between the two different types of cancer. The following analysis was done entirely with R (R Core Team, 2017) using the packages ICS (Nordhausen et al., 2008) and ICtest (Nordhausen et al., 2017c).

As 𝑛 ≪ 𝑝, the SVD was performed first, with the results reported in Table 1 (the first 10 eigenvalues of cov) and the corresponding scree-plot in Fig. 1. The scree-plot suggests that four components might be reasonable, as they already explain more than 80% of the variation of the data (which has also sometimes been used as a rule). Note that as many as 98 components are needed to explain 99.99% of the variation.

Table 2. Ordered kurtosis values from FOBI and robust ICA using the first four principal components.

     FOBI     robust ICA
  1  9.7428   3.6838
  2  7.2991   0.8069
  3  5.3417   0.6101
  4  6.2365   0.5514

ICA in the four-variate subspace. We first start with the four principal components (𝑘 = 4) plotted in Fig. 2. The pairwise scatter plots for these first four principal components reveal one group that is separate from the bulk of the data, but not three groups with different cancer types as desired. This is because, in searching for subgroups of the data, the kurtosis measures are more natural than the variance as an information criterion. We therefore next try ICS for this four-variate data with two choices of the pairs of scatter matrices (𝑺1, 𝑺2), namely, (i) FOBI based on cov and cov4, and (ii) robust ICA based on the symmetrized versions of Tyler's and Huber's scatter matrices. Table 2 reports the estimated kurtosis measures, that is, the diagonal entries of 𝑫̂.

The test results reported in Table 3 suggest that the number of non-gaussian components is 𝑞 = 1. The four ICS components from FOBI and robust ICA are plotted in Figs. 3 and 4, respectively. In both cases, the first component seems to separate two groups which, unfortunately, are not at all connected to the three health status groups. After some detective work, it was found that the group of 24 subjects, highlighted with red color in the plots, were outliers from three microarrays created by a single person at different time points. Also, the second components seem just to find another small group of outliers in the data. Hence, based on these results, it can be concluded that ICA cannot find the groups of interest if only the first four principal components are used in the analysis. We repeated ICA with 98 principal components as well, but even in this case we could not find structures in the data to identify the health status of the subjects.

Table 3. The 𝑝-values for testing the hypotheses 𝐻0 ∶ 𝑞 = 𝑘, 𝑘 = 0, 1, 2, where 𝑞 is the number of non-gaussian components. The tests are asymptotic and bootstrap tests (with 500 bootstrap samples) using the FOBI approach for the four-variate data.

 𝐻0       Asymp     Boot
 𝑞 = 0    <0.0001   0.0020
 𝑞 = 1    0.1050    0.0818
 𝑞 = 2    0.3963    0.4711

Fig. 6. Pairwise scatter plots for SIR components obtained from the four-variate data and with the colors corresponding to the three health groups.

SIR in the 98-variate subspace. We next perform SIR and use the data with the 98 principal components. As we hope to separate the three health groups, it is natural to let 𝑦 indicate the group memberships. Note that, as 𝐻 = 3, the dimension of the interesting subspace is 𝑞 = 0, 1 or 2. The 𝑝-values for the asymptotic and bootstrap tests for 𝑞 = 0 and 𝑞 = 1 are reported in Table 4. The tests reject both 𝑞 = 0 and 𝑞 = 1 and therefore suggest that 𝑞 = 2. For visualization purposes we plot, in Fig. 5, the first five components and highlight the three response classes with different colors. We also applied SIR to the four-variate data, and the SICS components are plotted in Fig. 6. It is then clear that the information needed to separate the three health groups is lost if only the first four principal components are used.

Table 4. The 𝑝-values for testing 𝐻0 ∶ 𝑞 = 0 and 𝐻0 ∶ 𝑞 = 1. The tests are asymptotic and bootstrap tests (with 500 bootstrap samples) using the SIR approach for the 98-variate data.

 𝐻0       Asymp     Boot
 𝑞 = 0    ≤0.0001   0.0448
 𝑞 = 1    0.0001    0.0050
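To tie the pieces together, the following sketch strings the SVD reduction, the FOBI transform and the SIR/SICS transform into one pipeline on simulated placeholder data, reusing the helper functions from the earlier sketches (cov4_hat, ics_sketch, sir_scatter). It only mirrors the structure of the analysis in this section; the published results were obtained with the ICS and ICtest packages (including the asymptotic and bootstrap dimension tests) on the actual miRNA data.

```r
# Toy placeholder data, not the miRNA data: p variables in rows, n samples in columns.
set.seed(7)
p <- 500; n <- 150
X   <- matrix(rnorm(p * n), nrow = p)
grp <- factor(sample(c("control", "mild", "aggressive"), n, replace = TRUE))

# Step 1: SVD/PCA reduction (cf. Table 1 and Fig. 1)
X    <- X - rowMeans(X)                      # center each variable
s    <- svd(X)
expl <- cumsum(s$d^2) / sum(s$d^2)           # cumulative explained variation
k    <- 4                                    # as in Section 4, start with k = 4
Xk   <- t(t(s$u[, 1:k]) %*% X)               # n x k matrix of PC scores

# Step 2: ICS/FOBI on the reduced data (cf. Table 2 and Figs. 2-3)
fobi <- ics_sketch(Xk, S1_fun = cov, S2_fun = cov4_hat)
fobi$D                                       # kurtosis-type eigenvalues

# Step 3: SIR/SICS with the group labels as response (cf. Tables 3-4, Figs. 5-6)
sir <- ics_sketch(Xk, S1_fun = cov,
                  S2_fun = function(Z) sir_scatter(Z, grp))
pairs(sir$scores, col = as.integer(grp))     # pairwise scatter plots of the components
```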
5. Conclusions

In this paper, we discussed the use of two scatter matrices for unsupervised and supervised linear dimension reduction under broad model assumptions. The signal and noise spaces are then easily separated using the scatter matrices, and the eigenvalues of one scatter matrix with respect to the other are listed in 𝑫. The eigenvalues can also be used to decide what the dimension of the signal space is. The theory is, however, developed only for large-𝑛-small-𝑝 cases, which rules out genetic applications.

Following the general practice in bioinformatics, we tried to circumvent this problem by reducing the dimension using SVD (PCA) and hoping that the first few principal components capture all the relevant information. In our case study, we explored this strategy and used ICA and SIR for the four- and 98-variate data sets obtained in this way. ICA and SIR have been used to analyze bioinformatics data earlier also in Liebermeister (2002); Kong et al. (2008); Zhong et al. (2005); Fischer et al. (2017). Here we, however, wanted to explore the impact of the SVD step on the results, and the conclusion is that the number of principal components to retain is crucial. It seems advisable to keep the number of components as high as the further analysis tools allow without suffering from the curse of dimensionality.

Based on this study, it would be worthwhile to develop models and techniques which allow for the SVD preprocessing step and can then formalize the rules for the number of PCs to retain. The SVD step could, for example, be incorporated into the bootstrapping procedure to accommodate the variation coming from that step. Note, however, that for example El Karoui and Purdom (2018, 2019) argue that bootstrapping is in general difficult in the 𝑛 ≪ 𝑝 setting.

Declarations

Author contribution statement

D. Fischer: Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper. K. Nordhausen: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper. H. Oja: Conceived and designed the experiments; Contributed reagents, materials, analysis tools or data; Wrote the paper.

Funding statement

The work of KN was supported by the Austrian Science Fund (FWF) Grant number P31881-N32.

Data availability statement

Data associated with this study has been deposited at https://github.com/fischuu/LinearDimensionReductionInBioinformatics.

Declaration of interests statement

The authors declare no conflict of interest.

Additional information

No additional information is available for this paper.

Acknowledgements

The authors wish to acknowledge CSC – IT Center for Science, Finland, for computational resources. We thank two anonymous reviewers for helpful comments on an earlier draft of the manuscript.

References

Bura, E., Cook, R.D., 2001. Extending sliced inverse regression: the weighted chi-squared test. J. Am. Stat. Assoc. 96, 996–1003.
Cardoso, J.F., 1989. Source separation using higher order moments. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Glasgow, UK, pp. 2109–2112.
Dümbgen, L., 1998. On Tyler's M-functional of scatter in high dimension. Ann. Inst. Stat. Math. 50, 471–491.
Dümbgen, L., Pauly, M., Schweizer, T., 2015. M-functionals of multivariate scatter. Stat. Surv. 9, 32–105.
El Karoui, N., Purdom, E., 2018. Can we trust the bootstrap in high-dimensions? The case of linear models. J. Mach. Learn. Res. 19, 1–66.
El Karoui, N., Purdom, E., 2019. The bootstrap, covariance matrices and PCA in moderate and high-dimensions. In: Proceedings of Machine Learning Research: the 22nd International Conference on Artificial Intelligence and Statistics, vol. 89, pp. 2115–2124.
Fischer, D., Honkatukia, M., Tuiskula-Haavisto, M., Nordhausen, K., Cavero, D., Preisinger, R., Vilkki, J., 2017. Subgroup detection in genotype data using invariant coordinate selection. BMC Bioinform. 18, 173–181.
Fischer, D., Oja, H., Schleutker, J., Sen, P.K., Wahlfors, T., 2014. Generalized Mann-Whitney type tests for microarray experiments. Scand. J. Stat. 41, 672–692.
Fischer, D., Wahlfors, T., Mattila, H., Oja, H., Tammela, T., Schleutker, J., 2015. MiRNA profiles in lymphoblastoid cell lines of Finnish prostate cancer families. PLoS ONE 10, e0127427.
Kong, W., Vanderburg, C., Gunshin, H., Rogers, J., Huang, X., 2008. A review of independent component analysis application to microarray gene expression data. BioTechniques 45, 501–520.
Li, K.-C., 1991. Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 86, 316–327.
Liebermeister, W., 2002. Linear modes of gene expression determined by independent component analysis. Bioinformatics 18, 51–60.
Liski, E., Nordhausen, K., Oja, H., 2014. Supervised invariant coordinate selection. Statistics 48, 711–731.
van der Maaten, L., Hinton, G., 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.
McInnes, L., Healy, J., Saul, N., Großberger, L., 2018. UMAP: uniform manifold approximation and projection. J. Open Sour. Softw. 3, 861.
Nordhausen, K., Oja, H., 2018a. Independent Component Analysis: A Statistical Perspective. Wiley Interdisciplinary Reviews: Computational Statistics, vol. 10, p. e1440.
Nordhausen, K., Oja, H., 2018b. Robust nonparametric inference. Annu. Rev. Stat. Appl. 5, 473–500.
Nordhausen, K., Oja, H., Ollila, E., 2011. Multivariate models and the first four moments. In: Hunter, D.R., Richards, D.S.R., Rosenberger, J.L. (Eds.), Nonparametric Statistics and Mixture Models. World Scientific, Singapore, pp. 267–287.
Nordhausen, K., Oja, H., Tyler, D.E., 2008. Tools for exploring multivariate data: the package ICS. J. Stat. Softw. 28, 1–31.
Nordhausen, K., Oja, H., Tyler, D.E., 2017a. Asymptotic and bootstrap tests for subspace dimension. ArXiv e-prints.
Nordhausen, K., Oja, H., Tyler, D.E., Virta, J., 2017b. Asymptotic and bootstrap tests for the dimension of the non-Gaussian subspace. IEEE Signal Process. Lett. 24, 887–891.
Nordhausen, K., Oja, H., Tyler, D.E., Virta, J., 2017c. ICtest: estimating and testing the number of interesting components in linear dimension reduction. https://CRAN.R-project.org/package=ICtest. R package version 0.3.
Nordhausen, K., Tyler, D.E., 2015. A cautionary note on robust covariance plug-in methods. Biometrika 102, 573–588.
Nordhausen, K., Virta, J., 2019. An overview of properties and extensions of FOBI. Knowl.-Based Syst. 173, 113–116.
R Core Team, 2017. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Rousseeuw, P., Hubert, M., 2013. High-breakdown estimators of multivariate location and scatter. In: Becker, C., Fried, R., Kuhnt, S. (Eds.), Robustness and Complex Data Structures: Festschrift in Honour of Ursula Gather. Springer, Berlin, Heidelberg, pp. 49–66.
Schott, J.R., 2006. A high-dimensional test for the equality of the smallest eigenvalues of a covariance matrix. J. Multivar. Anal. 97, 827–843.
Sirkiä, S., Taskinen, S., Oja, H., 2007. Symmetrised M-estimators of scatter. J. Multivar. Anal. 98, 1611–1629.
Tenenbaum, J.B., de Silva, V., Langford, J.C., 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323.
Tyler, D.E., 1987. A distribution-free M-estimator of multivariate scatter. Ann. Stat. 15, 234–251.
Tyler, D.E., 2010. A note on multivariate location and scatter statistics for sparse data sets. Stat. Probab. Lett. 80, 1409–1413.
Tyler, D.E., Critchley, F., Dümbgen, L., Oja, H., 2009. Invariant coordinate selection. J. R. Stat. Soc. B 71, 549–592.
Zhong, W., Zeng, P., Ma, P., Liu, J.S., Zhu, Y., 2005. RSIR: regularized sliced inverse regression for motif discovery. Bioinformatics 21, 4169–4175.