JSS Journal of Statistical Software June 2015, Volume 65, Issue 9. http://www.jstatsoft.org/ Mann-Whitney Type Tests for Microarray Experiments: The R Package gMWT Daniel Fischer University of Tampere Hannu Oja University of Turku Abstract We present the R package gMWT which is designed for the comparison of several treatments (or groups) for a large number of variables. The comparisons are made using certain probabilistic indices (PI). The PIs computed here tell how often pairs or triples of observations coming from different groups appear in a specific order of magnitude. Classical two and several sample rank test statistics such as the Mann-Whitney-Wilcoxon, Kruskal-Wallis, or Jonckheere-Terpstra test statistics are simple functions of these PI. Also new test statistics for directional alternatives are provided. The package gMWT can be used to calculate the variable-wise PI estimates, to illustrate their multivariate distribution and mutual dependence with joint scatterplot matrices, and to construct several classical and new rank tests based on the PIs. The aim of the paper is first to briefly explain the theory that is necessary to understand the behavior of the estimated PIs and the rank tests based on them. Second, the use of the package is described and illustrated with simulated and real data examples. It is stressed that the package provides a new flexible toolbox to analyze large gene or microRNA expression data sets, collected on microarrays or by other high-throughput technologies. The testing procedures can be used in an eQTL analysis, for example, as implemented in the package GeneticTools. Keywords: eQTL, Jonckheere-Terpstra test, Kruskal-Wallis test, Mann-Whitney test, permu- tation test, several samples, simultaneous testing, union-intersection test, U-statistic. 1. Introduction We consider nonparametric tests used in the analysis of gene or microRNA expression data sets with several treatments (groups). For each separate expression variable, the null hypoth- esis to be tested is that there is no difference between the distributions of the expression in different groups. To avoid strong (parametric) distributional assumptions, the alternatives are formulated using probabilities that pairs or triples of observations coming from different 2 gMWT: Generalized Mann-Whitney Tests in R groups are in a specific order of magnitude. The interesting probabilities are called proba- bilistic indices (PI), see also Thas, De Neve, Clement, and Ottoy (2012). The test statistics are based on natural estimates of these PIs, that is, the corresponding two and several sam- ple U-statistics. Classical several-sample rank test statistics such as the Kruskal-Wallis or Jonckheere-Terpstra test are special cases in this approach. Also, as the number of variables (microRNAs) is typically huge and the test statistics for different variables are dependent, we face a serious simultaneous testing problem. See Fischer, Oja, Sen, Schleutker, and Wahlfors (2014) for more details. The package gMWT (Fischer and Oja 2015) provides nonparametric tools for the comparison of several groups/ treatments when the number of variables is large, and is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package= gMWT. The tools are the following. (i) Computation of the PI estimates for the group comparisons. The probabilistic indices here are (a) the probability Ptt′ that a random observation from group t is smaller than a random observation from group t′, and (b) the probability Ptt′t′′ that observations from groups t, t′, t′′ appear in this same order. The tools are also given to produce the plots of variable-wise PIs. (ii) Computation of the p values of some classical and some new nonparametric tests for the comparison of several groups/treatments. The tests are based on the use of the probabilistic indices Ptt′ and Ptt′t′′ . Classical Mann-Whitney-Wilcoxon, Kruskal-Wallis and Jonckheere-Terpstra tests are included. (iii) Tools for the simultaneous testing problem. As the package is meant for the analysis of gene expression data for example, tools to control the family-wise error rate and/or the false discovery rate are provided as plots for expected versus observed rejected null hypotheses with the Simes (improved Bonferroni) and Benjamini-Hochberg rejection lines. A list of rejected null hypotheses may be obtained as well. Some standard nonparametric methods such as the Mann-Whitney and Kruskal-Wallis tests have been implemented in the R stats (R Core Team 2014) package. Linear rank statistics for the two and several sample location problems with ordered and unordered alternatives have been implemented also in the coin (Hothorn, Hornik, van de Wiel, and Zeileis 2008) package. Exact and permutation versions of the Jonckheere-Terpstra test are given by the package clinfun (Seshan 2014); this function is used in our package as the second option in our implementation. One contribution of our package gMWT is that these and several other nonparametric tests are collected with the same syntax under the same roof with a simultaneous testing possibility for several variables. The interfaces of the functions are tailored for large datasets with many groups and several variables, so that the application and comparisons of competing testing procedures are easier. With scatterplot matrices for the relevant PIs, it is also possible to illustrate and understand the joint variable-wise behavior of the standard tests. The structure of this paper is as follows. After a brief review of the theory in Section 2 we present some practical solutions in Section 3 for the computation of the PIs and the permutational p values of the corresponding tests. In Section 4 a general description of the package gMWT is given with a typical workflow for its use. We also discuss the calculation of Journal of Statistical Software 3 the PIs and their scatterplot matrices, and it is described how the tests are performed. Also, the tools for the multiple testing problem are described. In Section 5, the use of the package is illustrated with a simulated data set as well as with real genotype data. In the latter case, an expression quantitative trait locus (eQTL) analysis is performed with the packages gMWT and GeneticTools (Fischer 2014). 2. Statistical inference based on probabilistic indices 2.1. Null hypothesis and alternatives based on Ptt′ and Ptt′t′′ Consider first the univariate case and the comparison of T groups. Let xt1, . . . , xtNt be a random sample from a distribution with cumulative distribution function Ft, t = 1, . . . , T , and let the samples be independent. The total sample size is then N = N1 + · · · + NT . We wish to test the null hypothesis H0 : F1 = F2 = · · · = FT . The interesting alternatives are formulated using certain probabilistic indices. As ties may often be present, we write I(x, y) = I(x < y) + 1 2 I(x = y) and I(x, y, z) = I(x < y < z) + 1 2 I(x = y < z) + 1 2 I(x < y = z) + 1 6 I(x = y = z), with I(·) being the indicator function, which is 1 if the argument (·) is true and 0 else. The interesting alternatives are then given in terms of the probabilities Ptt′ = E(I(xt, xt′)) and Ptt′t′′ = E(I(xt, xt′ , xt′′)). Note that, as I(x, y) = I(x, y, z) + I(x, z, y) + I(z, x, y) the probabilities satisfy Ptt′ = Ptt′t′′ + Ptt′′t′ + Pt′′tt′ . Under the null hypothesis H0 : F1 = F2 = · · · = FT , for all t, t′, t′′, Ptt′ = 1 2 and Ptt′t′′ = 1 6 . We say that F1 and F2 are stochastically ordered and write F1 st F2 if F1(x) ≥ F2(x) ∀x ∈ R. Then Ft st Ft′ ⇒ Ptt′ ≥ 1 2 and Ft st Ft′ st Ft′′ ⇒ Ptt′t′′ ≥ 1 6 but the converse statements are not true. 4 gMWT: Generalized Mann-Whitney Tests in R In the comparison of T = 3 treatments interesting alternatives might then be formulated, for example, as H1 : P12 6= 12 or P13 6= 12 or P23 6= 12 , or H1 : P12 ≥ 12 or P13 ≥ 12 or P23 ≥ 12 with at least one strict inequality, or H1 : P13 ≥ 12 or P23 ≥ 12 with at least one strict inequality, or H1 : P123 > 1 6 . The tests will then be based on the estimates Pˆ12, Pˆ13, Pˆ23 and Pˆ123 and should be constructed keeping the interesting alternative in mind. 2.2. Estimation of Ptt′ and Ptt′t′′ The probabilities Ptt′ and Ptt′t′′ are naturally estimated by corresponding U-statistics Pˆtt′ = 1 NtNt′ Nt∑ i=1 Nt′∑ i′=1 I(xti, xt′i′) and Pˆtt′t′′ = 1 NtNt′Nt′′ Nt∑ i=1 Nt′∑ i′=1 Nt′′∑ i′′=1 I(xti, xt′i′ , xt′′i′′). A natural statistic for the comparison between group t and other groups is Pˆt = 1 N −Nt ∑ t′ 6=t Nt′Pˆtt′ . A general several-sample U-statistic theory can be used to find the (joint) limiting properties of Pˆt, Pˆtt′ and Pˆtt′t′′ under the null hypothesis. See, e.g., Chapter 5 in Serfling (1980). 2.3. Tests based on estimates Pˆtt′ and Pˆtt′t′′ As seen before, we have a hierarchy{ Pˆtt′t′′ } → { Pˆtt′ } → { Pˆt } and one can construct test statistics at different levels of this hierarchy. Some choices are the following. 1. Use test statistics Pˆtt′t′′ for H0 : Ft = Ft′ = Ft′′ vs. H1 : Ptt′t′′ 6= 16 . Of course one-sided alternatives are possible as well. Journal of Statistical Software 5 2. Use the Mann-Whitney (MW) test statistics Pˆtt′ for H0 : Ft = Ft′ vs. H1 : Ptt′ 6= 12 and the Jonckheere-Terpstra (JT) test statistics for H0 : F1 = · · · = FT vs. H1 : Ptt′ ≥ 12 for all t < t′ with at least one strict inequality. Note that F1 st F2 st · · · st FT with at least one strict inequality implies the latter H1. We have two versions of JT test statistic, namely, JT = ∑ t 1 2 . The test statistic can be found in Appendix A, see also Fischer et al. (2014) for the details. 3. Computational solutions 3.1. Fast computation of Pˆtt′ and Pˆtt′t′′ Consider the univariate case and write the N -vector x = (x1, x2, . . . , xN ) > = (x11, . . . , x1N1 , x21, . . . , x2N2 , . . . , xT1, . . . , xTNT ) > of observations coming from all T groups. The PIs are based on two N × N matrices Ist = Ist(x) and Ieq = Ieq(x) with the elements (Ist(x))ij = I(xi < xj) and (I eq(x))ij = I(xi = xj), i, j = 1, . . . , N . The matrices Ist = Ist(x) and Ieq = Ieq(x) can then be decomposed as Ist =  Ist11 I st 12 . . . I st 1T Ist21 I st 22 . . . I st 2T . . . . . . . . . . . . IstT1 I st T2 . . . I st TT  and Ieq =  I eq 11 I eq 12 . . . I eq 1T I eq 21 I eq 22 . . . I eq 2T . . . . . . . . . . . . I eq T1 I eq T2 . . . I eq TT  , where the Ni ×Nj submatrices Istij and Ieqij compare treatments i and j, i, j = 1, . . . , T . Then Pˆ tt′ = 1 NtNt′ 1>Nt ( Isttt′ + 1 2 I eq tt′ ) 1Nt′ , 6 gMWT: Generalized Mann-Whitney Tests in R 0 10 0 20 0 30 0 40 0 50 0 Simulated sample size Ti m e in s ec on ds 30 150 300 450 600 750 900 1050 l l l l Naive,R Submatrices,R Submatrices,C++ Naive,C++ Figure 1: Computation times for the p values using the permutation test version with test statistic Pˆ tt′t′′ . Three groups with equal group sizes were used, and the number of permu- tations in each case was 2000. Naive and submatrix approaches implemented in R and C++ are compared. where 1k is the notation for a k-vector full of ones. For the triples we get Pˆ tt′t′′ = 1 NtNt′Nt′′ 1>Nt ( Isttt′I st t′t′′ + 1 2 I eq tt′I st t′t′′ + 1 2 Isttt′I eq t′t′′ + 1 6 I eq tt′I eq t′t′′ ) 1Nt′′ . In case that no ties are present, the matrix Ieq is simply a zero matrix. Using a naive implementation, we would calculate the probabilities Pˆ tt′t′′ one by one while going through all NtNt′Nt′′ triple comparisons. In our submatrix approach we thus calculate the matrices Ist and Ieq only once and then use the submatrices to find the probabilities Pˆ tt′t′′ . This leads to improved calculation times especially for permutation versions of the tests. For the computation time comparisons in R and C++ (via Rcpp, Eddelbuettel and Franc¸ois 2011, and RcppArmadillo, Eddelbuettel and Sanderson 2014), see Figure 1. 3.2. Computation of p values If the null hypothesis H0 : F1 = · · · = FT is true, then Px ∼ x for all N × N permutation matrices P. By Px ∼ x we mean that the distributions of Px and x are the same. Matrix P is an N × N permutation matrix if it is obtained from an identity matrix IN by permuting its rows and/or columns. The number of distinct permutation matrices is N !. Note that our test statistics are functions of the matrices Ist and Ieq and that Ist(Px) = PIst(x)P> and Ieq(Px) = PIeq(x)P>. The exact p value from a permutation test using a test statistic Q = Q(Ist, Ieq) is then P { Q(PIstP>,PIeqP>) ≥ Q(Ist, Ieq) } , Journal of Statistical Software 7 where the probability is taken over N ! equally probable values of P. The p value can in practice be estimated by 1 M M∑ m=1 I { Q(PmI stP>m,PmI eqP>m) ≥ Q(Ist, Ieq) } , where P1, . . . ,PM is a random sample from a uniform distribution over the set of N × N permutation matrices. Naturally, the larger M , the better is the estimate of the exact p value. Approximate p values may also be based on the limiting joint normality of the estimates (U-statistics) Pˆt, Pˆtt′ , and Pˆtt′t′′ . If no ties are present, the test statistics based on the PIs are strictly distribution-free with limiting variances and covariances that are easily found. 4. The package gMWT 4.1. General features The R package gMWT can be used to calculate the variable-wise probabilistic indices Pˆt, Pˆtt′ , and Pˆtt′t′′ , to illustrate their joint distributions and dependence with scatterplot matrices, and to perform various rank tests based on Pˆt, Pˆtt′ and Pˆtt′t′′ described in Section 2.3. See Figure 2 for possible workflows. A practical application for the testing procedures is an eQTL analysis for a combined anal- ysis of microarray and genotype data. In Section 5.2 we illustrate the use of the packages GeneticTools and gMWT with the directional triple test for testing for eQTL. In the following, the input matrix X is a data matrix with observations as rows and variables as columns. The vector g indicates the group membership; its length is then the number of rows in X. 4.2. Computation of Pˆt, Pˆtt′ and Pˆtt′t′′ The estimated PIs, Pˆt, Pˆtt′ and Pˆtt′t′′ , are calculated using the command estPI(X, g, type = "pair", goi, mc = 1, order = TRUE) The options type = "single", "pair" or "triple" specify the PIs to be computed, that is, Pˆt, Pˆtt′ or Pˆtt′t′′ . The vector goi (”groups of interest”) specifies the groups (values of g) to be used in the comparisons. The option order specifies whether the PIs should be calculated for all possible pairs and triples (order = FALSE) or just for pairs and triples with increasing group labels. In the four group case, for example, an estPI call with the (default) options (type = "pair", order = TRUE) would calculate the estimated PIs Pˆ12, Pˆ13, Pˆ14, Pˆ23, Pˆ24, Pˆ34 and in case of using the parameters (type = "triple", order = TRUE) the estimated PIs Pˆ123, Pˆ124, Pˆ134, Pˆ234. For matrix valued X, the option mc can be used to execute the parallel calculation on mc-many cores in order to speed up the calculation (available only for Linux systems). The result of the function estPI is a list, containing a matrix probs with the PIs as rows and variables as columns. The other list items are the used parameters. 8 gMWT: Generalized Mann-Whitney Tests in R Data NxG G Variables: g=1,...,G N obs. n=1,...,N Groups t=1,...T single pairs triple Calculate Probabilistic Index Function: estPI Apply testFunction: gmw KW MW: 1 vs. All MW: pairs UIT JT JT* Triple Plot the test resultsFunction: plot plotProb plotProb Figure 2: Possible calculation workflows. Journal of Statistical Software 9 4.3. Scatterplots for Pˆt, Pˆtt′ and Pˆtt′t′′ Based on the probabilities calculated via estPI the package creates diagnostic scatterplot matrices for variable-wise PIs. The command is either plotPI(X, g, col, zoom = FALSE, highlight = NULL, hlCol = "red") if the diagnostic plots are produced directly from the data or pe <- estPI(X, g, type = "pair", goi) plot(pe, col, zoom = FALSE, highlight, hlCol) if the probabilities are first calculated by estPI. The function plotPI naturally allows also all options used in the estPI call. The type of the plots depends on the selected PIs; the pairwise scatterplots are then for all possible pairs of PIs. Additional plotting options are highlight, hlCol and zoom. Using the highlight option, indicated variables are plotted with a color specified in hlCol. If the Boolean flag zoomed is set, the plots are zoomed to the active area of the PIs. Without this flag, the bivariate plots are in [0, 1]× [0, 1]. 4.4. Tests based on Pˆt, Pˆtt′ and Pˆtt′t′′ The basic call for applying the testing procedure is gmw(X, g, goi, test = "mw", type = "permutation", prob = "pair", nper = 2000, alternative = "two.sided", mc = 1, output = "min", keepPM = FALSE, mwAkw = FALSE) where only input variables X and g are compulsory. If X is a matrix then the chosen test is applied variable-wise to the data and the results are reported as a matrix of p values. Specifying the option output = "full" leads to a more detailed list output with the same length as the number of different alternatives are tested and each list item contains then again a list with as many columns as there are in X. Each entry in the encapsulated list is a test result of class ‘htest’. The vector g gives the group numbers in a natural order. The groups used in the analysis can be specified via goi. If no goi is specified, all groups are used. With the option test, the test ("uit", "triple", "mw", "kw", "jt", "jt*") can be spec- ified. The option prob may be used only in the case of the Mann-Whitney test. For all pair- wise group comparisons one uses (prob = "pair") while option (prob = "single") compares each group to the rest of the data. The option type specifies whether the permutation type test ("permutation") or the asymptotical test version ("asymptotical") is used. For some standard tests the procedures from base R are available, and the Jonckheere-Terpstra test is implemented in the clinfun package. The option type = "external" allows the use of these test versions. A permutation type of test is available for all tests but, as the package is still under active development, the asymptotical versions are not yet available for all cases. The number of permutations is selected with the option nper. Different alternatives are avail- able whenever they are natural and are set with alternative = "smaller", "greater" and "two.sided". 10 gMWT: Generalized Mann-Whitney Tests in R If the Westfall & Young method, as proposed in Westfall and Young (1993), will be later applied for multiple testing, the permutation matrices used for the p value calculation must be stored for a later use. In that case the option keepPM = TRUE has to be set for the later maxT correction. For the option mwAkw = TRUE, pairwise Mann-Whitney tests are performed after the global Kruskal-Wallis test for a more detailed analysis of the differences between the groups. Please notice that, to keep the overall significance level, the second step is allowed only after the rejection by the Kruskal-Wallis test. Again, the additional option mc may be used to speed up the computation for a large number of variables if one works on a multiple core computer with Linux operating system. The amount of cores can be specified with the mc option. This option decreases the calculation time also on normal desktop computers drastically, since modern desktop computers usually have more than one core. However, using all available cores in the calculations can make the computer for the calculation time unusable for other tasks. Hence, for longer calculations we recommend to choose here mc = detectCores() - 1 if other tasks are performed simultaneously on the same computer. The full test output itself is a ‘htest’ R object and has the standard output showing the used data and the grouping object: R> library("gMWT") R> set.seed(123456) R> myData <- c(rnorm(50), rnorm(60), rnorm(40, 0.5, 1)) R> myGroups <- c(rep(1, 50), rep(2, 60), rep(3, 40)) R> gmw(myData, myGroups, test = "uit", type = "permutation", + alternative = "greater", output = "full") $`H1: Max(P13,P23) > 0.5` ********* Union-Intersection Test ********* data: Data:X, Groups:g, Order: max(P13,P23) obs.value = 2.8081, p-value = 0.077 alternative hypothesis: greater The attribute obs.value contains the value of the test statistic and the p.value contains the p value based on the selected test version. The minimal standard output looks like R> gmw(myData, myGroups, test = "uit", type = "permutation", + alternative = "greater") pValues H1: Max(P13,P23) > 0.5 0.077 Journal of Statistical Software 11 4.5. Multiple testing problem As the testing procedures are applied simultaneously for a large number of dependent vari- ables, we often face a severe multiple testing problem. Several attempts can be found in the literature to adjust the p values and/or the test levels so that the inference remains valid or the number of wrong decisions is as small as possible. The standard Bonferroni method and its improved version by Simes (1986) control the family-wise error rate (FWER). Ben- jamini and Hochberg (1995) suggested to control the false discovery rate (FDR) rather than the FWER. Their procedure, known as the Benjamini-Hochberg procedure, also controls the FDR at the same level and leads to the same practical rejection rule as the improved Bon- ferroni procedure. It can be shown that the procedures keep their control levels also under mild dependencies between the marginal test statistics. See, e.g., Fischer et al. (2014). For permutation type tests, a multiple testing procedure proposed by Westfall and Young controls the FWER and takes the dependence structure of the p values into account, see Westfall and Young (1993). The package gMWT allows a visualization of the p values as a plot of expected versus observed proportions of rejected null hypotheses (cumulative distribution of the observed p-values) by rejectionPlot(testResult, rejLine = "bh", alpha = 0.1, crit = NULL, xlim = c(0, 0.1)) The input testResult is a vector of variable-wise p values for a selected test statistic. It is also possible to pass a matrix testResult to the function. In that case each row is handled as an own test result and curves from different tests are shown in the same figure. The option col defines the colors for these curves. A solid line is the expected line under the null hypothesis. The FWER and FDR controlling lines are given by the option rejLine. Possible parameters are then "bonferroni", "bh" and "simes". The control level can be set with the option alpha. For the use of these rejection lines, see Fischer et al. (2014). For some examples, see Figures 6 and 7 in Section 5.1. Finally, the function getSigTests extracts the variables with rejected null hypotheses at a given FWER or FDR control level α. The function call getSigTests(X, alpha = 0.05, method) then provides a vector for the positions of these variables. Possible options for method are "plain", "bonferroni", "bh", "simes", "maxT", "csR" and "csD". Please keep in mind that in order to use the option "maxT" the permutation matrix of the test has to be available and that for this the option keepPM = TRUE has to be set when calling gmw. 5. Examples 5.1. Simulated data We illustrate the use of the gMWT package first with a simulated data set. We created a dataset with 500 variables and 3 groups with sample sizes N1 = N2 = N3 = 50. The 150×500 data matrix was obtained as follows. 1. Let zij , i, j = 0,±1,±2, . . . be independently identically distributed (iid) from N(0, 1). 12 gMWT: Generalized Mann-Whitney Tests in R 2. Let xij = 10∑ k=−10 ψkzi,j+k, i = 1, . . . , 150; j = 1, . . . , 500, where ψk = 11− |k|, k = 0,±1, . . . ,±10. This step thus makes the columns of the matrix X = (xij) dependent while the rows are iid. 3. The final data matrix is then X +  0 · 150(1/3) · 150 (2/3) · 150  s> where s is a random vector with 25 ones and 475 zeros. Thus the null hypothesis is not true for 25 variables indicated by s. For our illustration, we first calculate the variable-wise PIs, Pˆt, Pˆtt′ and Pˆtt′t′′ by R> ep1 <- estPI(X, groups, type = "single") R> ep2 <- estPI(X, groups) R> ep3 <- estPI(X, groups, type = "triple", order = FALSE) In the data set the 25 shifted variables are indicated by pickGenes. The scatterplot matrices in Figures 3, 4 and 5 are then obtained and the shifted observations highlighted as follows. As the same observations are used repeatedly for different PIs, the PIs are dependent as can easily be seen in the figures. R> plot(ep1, highlight = pickGenes, zoom = TRUE) R> plot(ep2, highlight = pickGenes, zoom = TRUE) R> plot(ep3, highlight = pickGenes, zoom = TRUE) Next, we simultaneously perform several tests for all the 500 variables. R> kw.results <- gmw(X, groups, test = "kw") R> jt.results <- gmw(X, groups, test = "jt") R> uit.results <- gmw(X, groups, test = "uit", alternative = "greater") R> triple.results <- gmw(X, groups, test = "triple", alternative = "greater") The results from the testing procedures are summarized in rejection plots, see Figures 6 and 7. In Figure 7 a Benjamini-Hochberg rejection line at the FDR control level α = 0.1 is provided as well. The null hypotheses with p values smaller than the (highest) crossing point of the straight rejection line and the curve are then rejected and at most 10% of the rejected hypotheses are then in fact true. The figures were created by Journal of Statistical Software 13 l l llll l llll ll ll l l l ll l lllll l l ll ll ll l llllllllll l llllll lll l l l l l llll l l l l l l llll lll lllll l l l l ll lllllll l ll l l l l llll ll l l l l ll llll l l l l ll ll ll l l l l l l l l l l llllll l l l lllll l lll ll l l l llll l ll l l l l l l l ll ll l l l l l l l l l ll lll l l ll l l llll l l lll l lll l l l l l l l l l l l ll l l lllll ll l l lll lll l llll ll l l l l ll l l ll lll l ll ll ll l l lll l lllll lll ll ll ll ll l l l l ll l llllllll l l ll ll l l lllll l l l l l l lll l ll l l l lll ll ll l l l l l l lllll l l l lll l l l P(2) P( 1) 0. 4 0. 6 0.4 0.6 l l llll l llll ll l l l l ll llll llll ll ll ll ll llllllllll l lll lll lll l l l l l lllll l l ll l lll lll llll llll lll lll lll l l ll l l l l l lll ll l l l l ll l lll l l l l lll l l lll llll l l l l l l llllll l l l llll l lll ll l l l llll l ll l l l l l l l ll ll l l l l l l l l l ll ll l l ll l l ll l l llll l ll l l l l l l ll l l l l ll l l lllll ll l l ll l llll llll ll l l ll lll l ll ll l l l l l ll l l lll lll l l llllll l lll lllll l l ll l l ll ll l l l lllll l l l l l l lll l ll l l l lll ll ll l l l l ll l l lllll ll l l l P(3) 0.4 0.6 P( 2) 0 0. 2 0. 4 0. 6 0. 8 1 ll ll lll lll l lllll l l llll ll lll l lllllll ll ll ll l llllllll l l llll l l ll l ll l ll l l l l lllll l lll ll l l l l ll lll ll l l llll l l l l l llll l l l l lll l llll lll ll l ll lll l ll l l l l l l l l l llll l l l l lll lll l l ll ll llll l l ll ll lll l lll lll l llll l lllll llll l lllll l ll ll llll l l lll l l l l l l lll l llll lllll ll l l l lllll l lllll l l ll lll l l l ll l lll ll l l ll l l lll l l ll llll ll ll ll l l l lll l l l l l llll lll lll l ll ll l l l l l l ll l l ll l l l l ll ll l l l l l l ll l l l l l ll l l l Figure 3: Scatterplot matrix for ep1. l l ll llllll l l ll l l l l l lll lllll l l ll ll l l ll ll lllllll l lllll l lll l l l l l lll l l l l l l l lll l ll l l ll l l l ll lll lll lll lllll l l l l l l lll l l l l l ll ll ll l l l l llll l l l l l l l l l l l l l l llll l l l l l lll lll l lll l ll ll l lll l l ll l l ll l l l l l ll l l l l l l l l ll ll l llllll l ll l llll ll lll l l l l ll l l l ll l lll l l l l lll l l llll ll l l llll l l l ll l l l llll ll l lll llllllll l l ll lll llll l l l l l l ll l ll ll l ll l l l ll l l l l lllll l l l l ll ll l l ll l l l l lll ll l l l l l l l l l ll ll l l l l ll l l l P(1<3) P( 1< 2) 0. 4 0. 6 0.4 0.6 0.8 l l ll llllll l l l ll l l l l l lll llll l ll ll l l ll ll lllllll l lllll l llll l l l l llll l l l l l l l llll l ll ll l l l l l l l ll ll lll ll l lll l l l l l l l ll l l l l ll ll ll l l l ll ll l l l l l l l l l l l l l l llll l l l l l lll lll l lll l ll ll l lll l l l l l lll l l l l ll l l l l l l l l ll lll l llllll l l l llll l l l l l l ll l l l l lll l l l lll lll l lllll ll l l l l ll l l l ll l ll l lll l l l l l ll ll ll ll l l ll l ll l l l l ll l lll l l l l llll l l lll l l l lllll l l l ll ll l l ll l l l l l lllll l l l l l lll l l ll l l l l lll l l l P(2<3) 0.4 0.6 P( 1< 3) 0 0. 2 0. 4 0. 6 0. 8 1 l l lllll llll l l ll l ll lll lll ll l l ll ll llllll llll llllllllll l ll l l lllll lllllllllllll lll l l l l lll llll ll l l l l lllll l l l l l l l l lll l l l l ll l l l ll l l ll l llll l llllllll ll lll l l lllll llll l l l l l ll ll l l l l ll l l l l l l l l l l lll l l l l llll lll lll l l l l l l l l l l l l l l lllll l ll l l l l l l l llll l l l l l l l ll lllll l ll l l l ll l l llll ll ll l l l ll ll ll l l ll lll l l l ll l l l l l ll llll l l l l l ll lll l ll ll lll l l l ll ll l l l l l l ll lll ll l l ll lll ll ll Figure 4: Scatterplot matrix for ep2. R> tests <- rbind(kw.results$p.values, jt.results$p.values, + uit.results$p.values, triple.results$p.values) R> rejectionPlot(tests, lCol = c("green", "red", "blue", "cyan")) R> rejectionPlot(tests, lCol = c("green", "red", "blue", "cyan"), + xlim = c(0, 0.2), rejLine = "bh", alpha = 0.1) The variables with rejected null hypotheses by a Kruskal-Wallis test at the FDR control level 0.10, for example, are obtained by R> getSigTests(kw.results, method = "bh", alpha = 0.1) For highlighting the results from one selected test, one can for example use the command R> plot(ep3, highlight = which(triple.results < 0.01))) This illustration shows that computing and looking at the scatterplots of the PIs may be a useful tool to understand the dependencies between marginal PIs and the corresponding test statistics as well as to detect hidden structures in the data. The rejection plot is a useful tool in the analysis of large datasets and in the multiple testing problem. 5.2. Application example: gMWT and eQTL analysis The testing procedures implemented in the package gMWT can be used in an eQTL analysis. In the following, we combine the gene expression data with the genotype data in order to find important single nucleotide polymorphisms (SNPs) that are associated or influential to the expression level of certain genes. As a first step, using the gene expression data, we identify the genes which are differentially expressed between cases and controls. Interesting genes are for example identified with the Wilcoxon tests (two groups) or with a Kruskal- Wallis test (several groups). After the identification of an interesting gene, we consider all 14 gMWT: Generalized Mann-Whitney Tests in R l l lllllllll l ll l ll l l llllllll ll ll l l l llllllll l lll ll l l l l l l l l l ll lllllll l l l ll ll lll ll lllll l l l l l l l l l l l ll l l ll l lllll l ll l l l l l l l llll ll ll l l l l l l ll llllll l l l l l llll l l ll l lll l ll l l l l l l l l ll l ll l l ll l lll l l l l l ll l ll l llll l l l lll ll l ll l l l l ll l l l l l l l ll ll l l llll lll l ll l ll l l lll ll l l l l l l l lll lll l lll lll l l ll l llll l ll ll lll l llll l l lll l llll l l l l lll l l l ll l l l l l l l l l l l l l l l l l lll llll l l ll l l ll ll l l P(1<3<2) P( 1< 2< 3) 0. 2 0. 4 0.2 l l ll lllllll ll l ll ll lllllll l l l l llll l llllllll l lllll ll l l l ll l l l lll l lll l lll ll ll lll ll l l ll ll l l l llll l ll l l ll l l lll lll ll llll l l l l l l l ll l l l lll l l l l l l ll l l ll l l l l llllllll ll ll l l l l l l l l l l l l l l l l lll l ll l l l l l ll l ll l llll ll l l l l l ll llll l l l lll l l l l l l l l l l ll ll l ll llll l llll l l l l l ll ll l l l ll ll l ll l l lll ll lll lll l l l l l l l l lllll l l l l ll l l l ll l l l l ll l l l l llllll l llll lll l l l llll l lll l lll l l l lll l l l l P(2<1<3) 0.2 l l lllllll l ll l ll l l l lllll llllll ll lllllllll l llll ll l l l ll l l lll lllll l lll ll ll llll l ll lll ll l l l l lll l ll l l ll l l l lll l lll l l l l l lllll ll lll l l ll l l l l ll ll l l l l l l l llll ll ll lll l lll l l l l l l l l l l ll ll l l l lll l ll l l l l ll l ll l ll l llll lll l l l l lllll l lll ll lll l l l l ll l l l l l l ll ll l l l llll lll l llll lllll l ll l l l l ll l l lll ll llll lll ll l l l l l lll ll l ll l l l lll ll l l l l l l l l l lllll l ll ll ll ll ll ll l l l l lll lllll lll ll l l l l l l l P(2<3<1) 0.2 l l llllllllll l l l ll l llllllll llll l l ll l lll lll l llll ll l l l l ll l l l lll lllll l lll l l l l l llll ll lllll ll l l l l lll l ll l l ll l llll ll llll l l l l l l l l ll l ll ll l l ll l l l l llllll l l l l ll l ll ll ll l l lll l l l l l l l l l l l l l ll l l l l ll l l l l l l l ll l l ll l ll l ll ll lll l l l ll l l l ll l lll l l l l l l ll l l l llll l l l l llll llll l l l l l ll l l l l lll l l l ll lll l ll lll l l lll l ll l l l l l l ll l l ll l l l l ll l l l l l lllll ll ll lll llll l lll ll l lllllll ll l l l l P(3<1<2) 0.2 l l lll ll lll l ll l ll ll llll l l l l llllll ll ll lllll ll l ll l ll ll l l l l l l l l l lllll l lll lll ll ll l ll lll l lll l l l lll l ll l l l l ll lll l l l l l l l l l l lllll ll lll l l ll l l lllll l l l llll ll ll l lll l l l lll l l l l l l l l l l ll l ll ll l l l l lll l l l l l ll l l l l ll l l lll lll l l l l l l l ll ll l ll l l lll l l l l lll l l ll l llll l l l l l l l llll llll l lllll l lll ll l l l l l ll l l l l l l l l ll l l l l lll l l l l l l l lll l l l l l ll lllll llll l l l l P(3<2<1) 0 0.2 P( 1< 3< 2) 0 0. 2 0. 4 0. 6 0. 8 1 l l ll ll l lll ll l ll l ll l ll ll l l l l l l l l ll ll ll lll l lllll l l l ll l l l l l l lll l l ll l l lllll l ll ll ll l l l lll l l ll lll l l l l llll l l l ll l l llll llllllllll ll l ll ll l l l l l l l l ll l l ll ll l l l l lllll l lll l l ll lllll lll l l l ll l l l ll ll l ll l l llll l l l ll l l ll ll l l ll ll l lll ll l llll l l l l ll lll lll ll l l l lll ll l l l l l ll l l l l l l l l l l ll l lll l l l l l ll l lll l l ll l l l l l l ll l ll l ll l l ll ll ll l ll l l l l l l llll lll l l l l l l l lll l ll l l ll l l lll l l l l l l l lll l l l l lll l l lll 0.2 l l ll ll l ll ll l ll l ll l ll ll l ll l l l l l l ll l l lll l lllllll l l l ll l l l l l l l lll l l ll l l l l lll ll llll l l llll l l ll lll l l l l lll l l ll l llll lllll lll ll ll ll l ll lll l ll l l ll l l l ll l l l l ll l l l ll ll l l l ll l l l l l ll l l l ll ll l ll ll l l ll l ll l l l l l l l l lll l l l ll ll l l l lll ll ll l l l l l l l l l l llll l ll l l l llll l ll l ll l l l ll ll l ll l l l l l l l l l ll l lll l l l l ll l ll l l l l l l l l llll ll l ll l l l ll l lll l l l ll l ll l l l l l ll l ll l l l lll l l l l l l ll ll l llll l l l l l llll l lll 0.2 l l ll ll l lll ll l l l ll l lll llll ll l l l l l l ll ll l l lll l llllllll l l l ll l l l l l l l lll l l ll l l lllll l l l l l lll lll l l lllll l ll ll l l l l l l l lll l l ll l lll l lllllll l l l llll l l ll l ll lllll ll l l l l l l lll l l ll l l l lllll l l l lll l l ll lll l l l l ll l ll l ll l l l l ll l l l l l l l lllll l l lll l l l l l ll l l l ll l ll l l l ll ll l l l l lllll ll l ll lll l ll l l l l l l l l l l l ll l l l l l ll l l l l l ll l ll ll ll ll l l llllll l lll l l l l l l ll l ll l lll l l l lll l l l l l l l l l l lll l l l l l l ll ll ll l llll ll l l l l ll ll l l l ll l 0.2 l l ll ll l l l ll l ll l ll l ll lll l l l l l l l l l l ll ll l ll l ll lllll l l l l l ll l l l l l l l l ll l l ll l l l llll ll ll l ll lll l ll ll lll l l l l ll l l l l l l ll ll l l l ll lll lll l lll l l ll l ll l l lll l l ll l l l l lllll l l ll ll l ll ll llllll ll l l ll l ll l ll l l l ll l l l l l l l lll l l lll ll l ll ll l l l ll l llll l lll l ll l ll l l l l lll l l ll lll l ll l l l l l l l ll l l ll l l l l ll ll l ll l l l l l ll l l ll l ll l l l ll l l l ll l lll l l l ll l l l l l l l l l l ll llll l l l l l l ll l l l lll l l l ll l l llllll l l l l ll 0 0.2 P( 2< 1< 3) 0 0. 2 0. 4 0. 6 0. 8 1 l l l lllllll lll l llll ll l l ll l l l l l lll l llll ll l l llll ll l l ll l ll l l llll l l llll l l lll l ll ll l ll l l l ll lll lllll l llll l ll lll l l lll llll ll ll l l l ll llll l l l ll ll l l l l ll l l l l l l l l l l ll l lll ll llll l ll l l lll ll ll ll l l l llll l ll ll ll l lll l ll l l l lll ll l l lll l l l l l l llll l ll l l l l l l l l l ll l ll l ll l l l l ll l llll l l ll l l lll ll l l l llll l l l ll l ll l ll l l l ll ll llll ll l lll l l l l l l l lll ll l l l l llll l l ll ll l l l ll l l l l l l ll l l ll l l l lll l l l l l l l ll lll llll l ll l l l l ll ll l l lll 0.2 l l l ll llllll l l ll ll lll ll l l l l l l l ll l l lll ll llll l ll ll l l lll l ll l l lll l l lllll l lll l l l l l llllll l l ll l lll l l l ll l lll l l lll l l lll lll ll l l l l l lll l ll l l l l ll l l l l ll l l ll l l l l l ll l lll lll llll l lll l l ll ll ll ll ll l ll l l l ll l l l l l l lll ll l l l l l l l l ll lll l ll llllll l l l llll l l l lll l l l ll l l l l l lllllll l l l l lll l l l ll l l l l l l l lll ll l lll l l ll l lll ll l l l l lll l l ll ll l l llll l l l l l lll ll l l llll l l l lll l lll l l l l ll l l l l ll l ll ll lll 0.2 l l ll ll l ll ll ll l llll ll ll l l l l l lll l l llll lll lll l lll l ll l l lll l ll l l ll ll l l llll l lllll l ll l l l l lllll lllllll ll l l l l ll ll ll ll l l l ll l llll l l l l ll ll l l l l ll l l ll l l l l l ll l l ll l ll l l l l l ll l ll lll l l ll ll l lll l l l l ll ll ll l l ll l l l ll l ll ll lll l l llll l ll l l lll l l ll l l l l l l ll l l lllll l l l l l l l lll l l l l ll ll l l llll ll lll l ll l ll l l l ll l l l l l l l ll l l lll l l ll ll l l ll lll l l l l l l l llll ll l l l l l llll l ll l l l l l l l l l l l ll ll l l ll 0 0.2 P( 2< 3< 1) 0 0. 2 0. 4 0. 6 0. 8 1 l l ll lll l ll ll l l l llll lllllll ll ll l lll lll l l l llll ll lll l l ll l l ll ll l l l l l lll l l l llll l l l l l lll lll llll l l l lllll l l l l l l l l lll ll ll l l llll l lll lllll lll ll ll ll l l l ll l l l ll l l llllll l l lll l l l ll l ll l l l ll l l ll l l l l l l l llll l l lll l l l lll l ll l l l l lll l lll l l l l l l l l l l l l l l l ll l l lll l l ll l l l ll l l l ll l l l llllll ll ll l l ll lll l ll l l l l ll ll l ll l l l ll l l lll l ll l ll l lllll l l l l l ll l ll l l ll l l l l l l l lll l l ll l l l lllllll l l l l l l ll l l ll ll l l llll l l ll 0.2 l l lll ll l l ll ll l l l llll ll l l llllll ll lll lll ll l l l ll l l l lll l l ll l l lllll l ll l l l lllll lll l l l l l l llll ll llllllll l l l ll l l l ll llllll ll l lll l l l ll ll lll lllllllll l l ll ll l l l ll l l l lll ll llllll l llll l l lll l l l l l l l lll l ll l ll l l ll l lll l l l l lll ll l ll l ll l ll l l l ll l l l ll ll l l l l l l ll l l l l llll l l l l l l l l l l l l ll l ll lll l ll l l l l ll llll l ll l l ll l l l l l l lllll l l ll ll ll l l l lllll lll l l l l lll lll l l l l ll ll l ll l l l llll l l l ll l l l ll lll l l l llll l l ll 0 0.2 P( 3< 1< 2) 0 0. 2 0. 4 0. 6 0. 8 1 l l llll l l l l lllll l ll lll l l l ll l l l l ll ll llll lll l llll l l llll l l l l l lll l l lll l l ll lllllll l lll l l l ll l l l l lll l l llllll l l l l ll l l l l l llll l lll l l l l l l ll l l ll l lllll l l l ll ll lll ll l ll ll l ll l llllll lll l ll l l ll ll l ll l lll l ll l l l l ll l ll l l ll l l l l l l lll lll l llll l ll lll l l l l l l l l l l l l ll l l l l llll l l l l l l l l l lll l lll l l l ll l l l lllll l l l l ll l l l l ll l ll ll lllll l l ll l l l l l l ll ll ll l l l l l l ll l ll l l l l l l l l llll l l lllll l ll l l l l l l l l l l l l l l l l l l l l ll l lll l l l ll Figure 5: Scatterplot matrix for ep3. SNPs located in its w-neighborhood (common values for w are 0.5MB or 1MB, where MB stands for megabases). The genotype information of these neighboring SNPs are then used to explain the expression of the respective center gene. For the SNPs, there are three different genotypes, the wild-type allele coded as 1 or (AA), the heterozygeous mutation 2 (AB) and the homozygeous mutation 3 (BB). Common platforms measure genotypes for 700k up to several million SNPs. See Figure 8 for a sketch on how gene expression values and genotype information are linked. A popular but unrealistic method to consider SNP-wise associations between genotype and gene expression is to fit a linear model. We use the triple test with the PIs Pˆ123 and Pˆ321 (increasing or decreasing trend) to avoid strong assumptions of linear dependence. If p1 and Journal of Statistical Software 15 0.0 0.2 0.4 0.6 0.8 1.0 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 Expected Ratio O bs er ve d Ra tio Figure 6: Rejection plot for α ∈ [0, 1]. 0.00 0.05 0.10 0.15 0.20 0. 00 0. 05 0. 10 0. 15 0. 20 Expected Ratio O bs er ve d Ra tio Figure 7: Rejection plot for α ∈ [0, 0.2]. Chromosomal position Ex pr es sio n va lu es o f d iff e re n t i nd ivi du al s l l l l l l SNP A Gene X SNP B SNP C l Probe Expr. A/A A/B B/B Figure 8: Sketch on how genotype information and gene expression are linked. p2 are the two p values from the corresponding directional tests, a two-sided p value for a SNP-wise test is obtained as min(2 · min(p1, p2), 1) and it is implemented in our other R package GeneticTools. The linear model approach is also an option there. In order to perform an eQTL analysis with GeneticTools the genotype data has to be present in ped/map file format, which is the standard output format of many programs. In addition to that a gene expression matrix is required. The basic call to perform an eQTL is then R> library("GeneticTools") R> setwd(file.path("Data", "Genotypes")) R> myEQTL <- eQTL(gex = geneEx, geno = "example", xAnnot = geneAnnotations, + windowSize = 0.5, method = "directional") We assume here, that the ped/map file pair is stored in the Folder file.paht("Data", "Genotypes") and is called example.ped and example.map. The file name is given to the eQTL function with the geno option. In case the SNP data has been imported already previously using the package snpStats (Clayton 2014) and its function read.pedfile() this object can be passed to the geno option instead of a file name. The gene expressions are stored in the matrix geneEx and each column refers to a gene and each row is an individual. 16 gMWT: Generalized Mann-Whitney Tests in R Gene1 − 1 Chr 1 : 8474334 − 9574334 Chromosomal Position in MB p− va lu e 8.5 8.72 8.94 9.16 9.38 9.6 0 0. 2 0. 4 0. 6 0. 8 1 M on o. N A l l l l l l l l ll l l l l l l l l l l l l l ll l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l ll l l l l l l llll l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l ll l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l Figure 9: eQTL-plot. The dashed line symbolizes the location of the linked gene and each dot represents the test result for a SNP test. The dotted line refers to the chosen significance level. In case that all individuals have the same genotype no test is performed and this is marked as “Mono.”. If no genotype information was available for a SNP, this is marked as NA. The expression matrix is specified with the gex option. The option xAnnot is important for the gene annotations and is a list where each list item refers to a gene from the columns of geneEx. The names have to match and the eQTL is only performed for matching pairs. In case that no annotation xAnnot is given to the function no window is used and all combinations are considered instead. Be aware that this might lead to a very long lasting calculation. Each list item in geneAnnotations is a matrix like R> geneAnnotations $Gene1 Chr Start End 1 1 8974334 9074334 $Gene2 Chr Start End 1 12 135633062 135738062 2 12 135735062 135838062 Journal of Statistical Software 17 This takes into account that certain probes have multiple locations in the genome and our method will test for all those locations. In case that the labeling of individuals between the expression data and the genotype data differs, there is an option to give a new vector of labels, called genoSamples. Here can new labels for the individuals in the genotype data be specified, that match with the row names of geneEx. It is important to note that the order of the rows and columns of all lists and matrices does not have to match. The function takes the smallest subsets of individuals and genes and takes then those SNPs, which are in a window around that gene. The window size can be specified with the option windowSize using the unit megabases (MB). After the eQTL is performed we can visualize the results with R> plot(myEQTL, which, file, sig) The which option specifies for which genes from geneEx the plots shall be created. If no option is given, then all plots are created. Because this might lead to a vast number of pictures, there is also an option file to specify a file name such that the plots are saved in a file with this name. The output of a single plot can be seen in Figure 9 with a chosen significance level sig = 0.1. This way we can see for interesting genes the behavior of the surrounding SNPs onto the gene expression. For small datasets it is also possible to check visually for interesting genes. For larger gene sets the function extractEQTL can be used to determine a set of interesting genes. We applied this function on real data e.g., in Siltanen et al. (2013). References Benjamini Y, Hochberg Y (1995). “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society B, 57(1), 289–300. Clayton D (2014). snpStats: SnpMatrix and XSnpMatrix Classes and Methods. R package version 1.16.0, URL http://www-gene.cimr.cam.ac.uk/clayton/. Eddelbuettel D, Franc¸ois R (2011). “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software, 40(8), 1–18. URL http://www.jstatsoft.org/v40/i08/. Eddelbuettel D, Sanderson C (2014). “RcppArmadillo: Accelerating R with High-Performance C++ Linear Algebra.” Computational Statistics & Data Analysis, 71, 1054–1063. Fischer D (2014). GeneticTools: Collection of Genetic Data Analysis Tools. R package version 0.3, URL http://CRAN.R-project.org/package=GeneticTools. Fischer D, Oja H (2015). gMWT: Generalized Mann-Whitney Type Tests. R package version 1.0, URL http://CRAN.R-project.org/package=gMWT. Fischer D, Oja H, Sen PK, Schleutker J, Wahlfors T (2014). “Generalized Mann-Whitney Type Tests for Microarray Experiments.” Scandinavian Journal of Statistics, 41(3), 672–692. 18 gMWT: Generalized Mann-Whitney Tests in R Hothorn T, Hornik K, van de Wiel MA, Zeileis A (2008). “Implementing a Class of Per- mutation Tests: The coin Package.” Journal of Statistical Software, 28(8), 1–23. URL http://www.jstatsoft.org/v28/i08/. Perlman MD (1969). “One-Sided Testing Problems in Multivariate Analysis.” The Annals of Mathematical Statistics, 40(2), 549–567. R Core Team (2014). R: A Language and Environment for Statistical Computing. R Founda- tion for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/. Serfling RJ (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons. Seshan VE (2014). clinfun: Clinical Trial Design and Data Analysis Functions. R package version 1.0.6, URL http://CRAN.R-project.org/package=clinfun. Siltanen S, Fischer D, Rantapero T, Laitinen V, Mpindi JP, Kallioniemi O, Wahlfors T, Schleutker J (2013). “ARLTS1 and Prostate Cancer Risk – Analysis of Expression and Regulation.” PLoS ONE, 8(8), e72040. Simes RJ (1986). “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika, 73(3), 751–754. Thas O, De Neve J, Clement L, Ottoy JP (2012). “Probabilistic Index Models.” Journal of the Royal Statistical Society B, 74(4), 623–671. Westfall PH, Young SS (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. John Wiley & Sons. Journal of Statistical Software 19 A. Union-intersection test for P13 and P23 We implemented two ways to calculate p values from the UIT, the first one is based on a permutation approach and the other one is based on asymptotical results. In both cases we first need to calculate the critical value c of the test statistic Q∗. For the test statistics S1 = Pˆ13 and S2 = Pˆ23, S = (S1, S2) > ∼ N2(µ,Σ) approximately with Σ = ( 1 ρ ρ 1 ) . Our UIT test statistic Q∗ is then Q∗ = I0 · 0 + I1 · (S2 − ρS1) 2 1− ρ2 + I2 · (S1 − ρS2)2 1− ρ2 + I3 · S >Σ−1S, with I0 = I(S1 ≤ ρS2, S2 ≤ ρS1), I1 = I(S1 < 0, S2 > ρS1), I2 = I(S1 > ρS2, S2 < 0), I3 = I(S1 ≥ 0, S2 ≥ 0). As I0 + I1 + I2 + I3 = 1, at most one of the three test statistics in the sum contributes to Q ∗. An approximate p value is then given by the approximation (Perlman 1969) P(Q∗ > c) = 1 2 P(χ21 > c) + cos−1 ρ 2pi P(χ22 > c). For a permutation test version, we permute the elements of the vector of the group variable M times with resulting values of test statistics Q∗1, Q∗2, . . . , Q∗M ; the approximate p value is then p = 1 M M∑ m=1 I(Q∗m ≥ Q∗). Affiliation: Daniel Fischer School of Health Sciences University of Tampere 33014 Tampere, Finland E-mail: Daniel.Fischer@luke.fi Hannu Oja Department of Mathematics and Statistics University of Turku 20014 Turku, Finland Journal of Statistical Software http://www.jstatsoft.org/ published by the American Statistical Association http://www.amstat.org/ Volume 65, Issue 9 Submitted: 2012-05-30 June 2015 Accepted: 2014-08-06