JSS Journal of Statistical Software
June 2015, Volume 65, Issue 9. http://www.jstatsoft.org/
Mann-Whitney Type Tests for Microarray
Experiments: The R Package gMWT
Daniel Fischer
University of Tampere
Hannu Oja
University of Turku
Abstract
We present the R package gMWT which is designed for the comparison of several
treatments (or groups) for a large number of variables. The comparisons are made using
certain probabilistic indices (PI). The PIs computed here tell how often pairs or triples
of observations coming from different groups appear in a specific order of magnitude.
Classical two and several sample rank test statistics such as the Mann-Whitney-Wilcoxon,
Kruskal-Wallis, or Jonckheere-Terpstra test statistics are simple functions of these PI. Also
new test statistics for directional alternatives are provided. The package gMWT can be
used to calculate the variable-wise PI estimates, to illustrate their multivariate distribution
and mutual dependence with joint scatterplot matrices, and to construct several classical
and new rank tests based on the PIs. The aim of the paper is first to briefly explain the
theory that is necessary to understand the behavior of the estimated PIs and the rank
tests based on them. Second, the use of the package is described and illustrated with
simulated and real data examples. It is stressed that the package provides a new flexible
toolbox to analyze large gene or microRNA expression data sets, collected on microarrays
or by other high-throughput technologies. The testing procedures can be used in an eQTL
analysis, for example, as implemented in the package GeneticTools.
Keywords: eQTL, Jonckheere-Terpstra test, Kruskal-Wallis test, Mann-Whitney test, permu-
tation test, several samples, simultaneous testing, union-intersection test, U-statistic.
1. Introduction
We consider nonparametric tests used in the analysis of gene or microRNA expression data
sets with several treatments (groups). For each separate expression variable, the null hypoth-
esis to be tested is that there is no difference between the distributions of the expression in
different groups. To avoid strong (parametric) distributional assumptions, the alternatives
are formulated using probabilities that pairs or triples of observations coming from different
2 gMWT: Generalized Mann-Whitney Tests in R
groups are in a specific order of magnitude. The interesting probabilities are called proba-
bilistic indices (PI), see also Thas, De Neve, Clement, and Ottoy (2012). The test statistics
are based on natural estimates of these PIs, that is, the corresponding two and several sam-
ple U-statistics. Classical several-sample rank test statistics such as the Kruskal-Wallis or
Jonckheere-Terpstra test are special cases in this approach. Also, as the number of variables
(microRNAs) is typically huge and the test statistics for different variables are dependent, we
face a serious simultaneous testing problem. See Fischer, Oja, Sen, Schleutker, and Wahlfors
(2014) for more details.
The package gMWT (Fischer and Oja 2015) provides nonparametric tools for the comparison
of several groups/ treatments when the number of variables is large, and is available from
the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=
gMWT. The tools are the following.
(i) Computation of the PI estimates for the group comparisons. The probabilistic indices
here are (a) the probability Ptt′ that a random observation from group t is smaller than
a random observation from group t′, and (b) the probability Ptt′t′′ that observations
from groups t, t′, t′′ appear in this same order. The tools are also given to produce the
plots of variable-wise PIs.
(ii) Computation of the p values of some classical and some new nonparametric tests for
the comparison of several groups/treatments. The tests are based on the use of the
probabilistic indices Ptt′ and Ptt′t′′ . Classical Mann-Whitney-Wilcoxon, Kruskal-Wallis
and Jonckheere-Terpstra tests are included.
(iii) Tools for the simultaneous testing problem. As the package is meant for the analysis
of gene expression data for example, tools to control the family-wise error rate and/or
the false discovery rate are provided as plots for expected versus observed rejected null
hypotheses with the Simes (improved Bonferroni) and Benjamini-Hochberg rejection
lines. A list of rejected null hypotheses may be obtained as well.
Some standard nonparametric methods such as the Mann-Whitney and Kruskal-Wallis tests
have been implemented in the R stats (R Core Team 2014) package. Linear rank statistics
for the two and several sample location problems with ordered and unordered alternatives
have been implemented also in the coin (Hothorn, Hornik, van de Wiel, and Zeileis 2008)
package. Exact and permutation versions of the Jonckheere-Terpstra test are given by the
package clinfun (Seshan 2014); this function is used in our package as the second option
in our implementation. One contribution of our package gMWT is that these and several
other nonparametric tests are collected with the same syntax under the same roof with a
simultaneous testing possibility for several variables. The interfaces of the functions are
tailored for large datasets with many groups and several variables, so that the application
and comparisons of competing testing procedures are easier. With scatterplot matrices for the
relevant PIs, it is also possible to illustrate and understand the joint variable-wise behavior
of the standard tests.
The structure of this paper is as follows. After a brief review of the theory in Section 2
we present some practical solutions in Section 3 for the computation of the PIs and the
permutational p values of the corresponding tests. In Section 4 a general description of the
package gMWT is given with a typical workflow for its use. We also discuss the calculation of
Journal of Statistical Software 3
the PIs and their scatterplot matrices, and it is described how the tests are performed. Also,
the tools for the multiple testing problem are described. In Section 5, the use of the package
is illustrated with a simulated data set as well as with real genotype data. In the latter case,
an expression quantitative trait locus (eQTL) analysis is performed with the packages gMWT
and GeneticTools (Fischer 2014).
2. Statistical inference based on probabilistic indices
2.1. Null hypothesis and alternatives based on Ptt′ and Ptt′t′′
Consider first the univariate case and the comparison of T groups. Let xt1, . . . , xtNt be a
random sample from a distribution with cumulative distribution function Ft, t = 1, . . . , T ,
and let the samples be independent. The total sample size is then N = N1 + · · · + NT . We
wish to test the null hypothesis
H0 : F1 = F2 = · · · = FT .
The interesting alternatives are formulated using certain probabilistic indices. As ties may
often be present, we write
I(x, y) = I(x < y) +
1
2
I(x = y)
and
I(x, y, z) = I(x < y < z) +
1
2
I(x = y < z) +
1
2
I(x < y = z) +
1
6
I(x = y = z),
with I(·) being the indicator function, which is 1 if the argument (·) is true and 0 else. The
interesting alternatives are then given in terms of the probabilities
Ptt′ = E(I(xt, xt′)) and Ptt′t′′ = E(I(xt, xt′ , xt′′)).
Note that, as
I(x, y) = I(x, y, z) + I(x, z, y) + I(z, x, y)
the probabilities satisfy
Ptt′ = Ptt′t′′ + Ptt′′t′ + Pt′′tt′ .
Under the null hypothesis H0 : F1 = F2 = · · · = FT , for all t, t′, t′′,
Ptt′ =
1
2
and Ptt′t′′ =
1
6
.
We say that F1 and F2 are stochastically ordered and write F1 st F2 if F1(x) ≥ F2(x) ∀x ∈ R.
Then
Ft st Ft′ ⇒ Ptt′ ≥ 1
2
and
Ft st Ft′ st Ft′′ ⇒ Ptt′t′′ ≥ 1
6
but the converse statements are not true.
4 gMWT: Generalized Mann-Whitney Tests in R
In the comparison of T = 3 treatments interesting alternatives might then be formulated, for
example, as
H1 : P12 6= 12 or P13 6= 12 or P23 6= 12 ,
or
H1 : P12 ≥ 12 or P13 ≥ 12 or P23 ≥ 12 with at least one strict inequality,
or
H1 : P13 ≥ 12 or P23 ≥ 12 with at least one strict inequality,
or
H1 : P123 >
1
6
.
The tests will then be based on the estimates Pˆ12, Pˆ13, Pˆ23 and Pˆ123 and should be constructed
keeping the interesting alternative in mind.
2.2. Estimation of Ptt′ and Ptt′t′′
The probabilities Ptt′ and Ptt′t′′ are naturally estimated by corresponding U-statistics
Pˆtt′ =
1
NtNt′
Nt∑
i=1
Nt′∑
i′=1
I(xti, xt′i′)
and
Pˆtt′t′′ =
1
NtNt′Nt′′
Nt∑
i=1
Nt′∑
i′=1
Nt′′∑
i′′=1
I(xti, xt′i′ , xt′′i′′).
A natural statistic for the comparison between group t and other groups is
Pˆt =
1
N −Nt
∑
t′ 6=t
Nt′Pˆtt′ .
A general several-sample U-statistic theory can be used to find the (joint) limiting properties
of Pˆt, Pˆtt′ and Pˆtt′t′′ under the null hypothesis. See, e.g., Chapter 5 in Serfling (1980).
2.3. Tests based on estimates Pˆtt′ and Pˆtt′t′′
As seen before, we have a hierarchy{
Pˆtt′t′′
}
→
{
Pˆtt′
}
→
{
Pˆt
}
and one can construct test statistics at different levels of this hierarchy. Some choices are the
following.
1. Use test statistics Pˆtt′t′′ for H0 : Ft = Ft′ = Ft′′ vs. H1 : Ptt′t′′ 6= 16 . Of course one-sided
alternatives are possible as well.
Journal of Statistical Software 5
2. Use the Mann-Whitney (MW) test statistics Pˆtt′ for H0 : Ft = Ft′ vs. H1 : Ptt′ 6= 12 and
the Jonckheere-Terpstra (JT) test statistics for H0 : F1 = · · · = FT vs. H1 : Ptt′ ≥ 12
for all t < t′ with at least one strict inequality. Note that F1 st F2 st · · · st FT
with at least one strict inequality implies the latter H1. We have two versions of JT
test statistic, namely,
JT =
∑
t<t′
NtNt′Pˆtt′ and JT
∗ =
∑
t<t′
Pˆtt′ .
3. For a fixed group t, use a Mann-Whitney test statistic Pˆt for H0 : F1 = · · · = FT vs.
H1 : F1 = ... = Ft−1 = Ft+1 = ... = FT 6= Ft. Use the Kruskal-Wallis test statistic
KW =
12
N(N + 1)
T∑
t=1
(Pˆt −Nt(N −Nt)/2)2
Nt
for H0 : F1 = · · · = FT vs. H1 : Ft 6= Ft′ for at least one pair t, t′. The alternative then
implies that Ptt′ 6= 12 for at least one pair t, t′.
4. Use a union-intersection test (UIT) to compare three groups t, t′, and t′′. The test
statistic is a combination of statistics Pˆtt′′ and Pˆt′t′′ and is meant for the alternative
max(Ptt′′ , Pt′t′′) >
1
2 . The test statistic can be found in Appendix A, see also Fischer
et al. (2014) for the details.
3. Computational solutions
3.1. Fast computation of Pˆtt′ and Pˆtt′t′′
Consider the univariate case and write the N -vector
x = (x1, x2, . . . , xN )
> = (x11, . . . , x1N1 , x21, . . . , x2N2 , . . . , xT1, . . . , xTNT )
>
of observations coming from all T groups. The PIs are based on two N × N matrices Ist =
Ist(x) and Ieq = Ieq(x) with the elements
(Ist(x))ij = I(xi < xj) and (I
eq(x))ij = I(xi = xj),
i, j = 1, . . . , N . The matrices Ist = Ist(x) and Ieq = Ieq(x) can then be decomposed as
Ist =

Ist11 I
st
12 . . . I
st
1T
Ist21 I
st
22 . . . I
st
2T
. . . . . . . . . . . .
IstT1 I
st
T2 . . . I
st
TT
 and Ieq =

I
eq
11 I
eq
12 . . . I
eq
1T
I
eq
21 I
eq
22 . . . I
eq
2T
. . . . . . . . . . . .
I
eq
T1 I
eq
T2 . . . I
eq
TT
 ,
where the Ni ×Nj submatrices Istij and Ieqij compare treatments i and j, i, j = 1, . . . , T .
Then
Pˆ tt′ =
1
NtNt′
1>Nt
(
Isttt′ +
1
2
I
eq
tt′
)
1Nt′ ,
6 gMWT: Generalized Mann-Whitney Tests in R
0
10
0
20
0
30
0
40
0
50
0
Simulated sample size
Ti
m
e 
in
 s
ec
on
ds
30 150 300 450 600 750 900 1050
l
l
l
l
Naive,R
Submatrices,R
Submatrices,C++
Naive,C++
Figure 1: Computation times for the p values using the permutation test version with test
statistic Pˆ tt′t′′ . Three groups with equal group sizes were used, and the number of permu-
tations in each case was 2000. Naive and submatrix approaches implemented in R and C++
are compared.
where 1k is the notation for a k-vector full of ones. For the triples we get
Pˆ tt′t′′ =
1
NtNt′Nt′′
1>Nt
(
Isttt′I
st
t′t′′ +
1
2
I
eq
tt′I
st
t′t′′ +
1
2
Isttt′I
eq
t′t′′ +
1
6
I
eq
tt′I
eq
t′t′′
)
1Nt′′ .
In case that no ties are present, the matrix Ieq is simply a zero matrix.
Using a naive implementation, we would calculate the probabilities Pˆ tt′t′′ one by one while
going through all NtNt′Nt′′ triple comparisons. In our submatrix approach we thus calculate
the matrices Ist and Ieq only once and then use the submatrices to find the probabilities
Pˆ tt′t′′ . This leads to improved calculation times especially for permutation versions of the
tests. For the computation time comparisons in R and C++ (via Rcpp, Eddelbuettel and
Franc¸ois 2011, and RcppArmadillo, Eddelbuettel and Sanderson 2014), see Figure 1.
3.2. Computation of p values
If the null hypothesis H0 : F1 = · · · = FT is true, then Px ∼ x for all N × N permutation
matrices P. By Px ∼ x we mean that the distributions of Px and x are the same. Matrix P
is an N × N permutation matrix if it is obtained from an identity matrix IN by permuting
its rows and/or columns. The number of distinct permutation matrices is N !.
Note that our test statistics are functions of the matrices Ist and Ieq and that
Ist(Px) = PIst(x)P> and Ieq(Px) = PIeq(x)P>.
The exact p value from a permutation test using a test statistic Q = Q(Ist, Ieq) is then
P
{
Q(PIstP>,PIeqP>) ≥ Q(Ist, Ieq)
}
,
Journal of Statistical Software 7
where the probability is taken over N ! equally probable values of P. The p value can in
practice be estimated by
1
M
M∑
m=1
I
{
Q(PmI
stP>m,PmI
eqP>m) ≥ Q(Ist, Ieq)
}
,
where P1, . . . ,PM is a random sample from a uniform distribution over the set of N × N
permutation matrices. Naturally, the larger M , the better is the estimate of the exact p value.
Approximate p values may also be based on the limiting joint normality of the estimates
(U-statistics) Pˆt, Pˆtt′ , and Pˆtt′t′′ . If no ties are present, the test statistics based on the PIs
are strictly distribution-free with limiting variances and covariances that are easily found.
4. The package gMWT
4.1. General features
The R package gMWT can be used to calculate the variable-wise probabilistic indices Pˆt, Pˆtt′ ,
and Pˆtt′t′′ , to illustrate their joint distributions and dependence with scatterplot matrices,
and to perform various rank tests based on Pˆt, Pˆtt′ and Pˆtt′t′′ described in Section 2.3. See
Figure 2 for possible workflows.
A practical application for the testing procedures is an eQTL analysis for a combined anal-
ysis of microarray and genotype data. In Section 5.2 we illustrate the use of the packages
GeneticTools and gMWT with the directional triple test for testing for eQTL.
In the following, the input matrix X is a data matrix with observations as rows and variables
as columns. The vector g indicates the group membership; its length is then the number of
rows in X.
4.2. Computation of Pˆt, Pˆtt′ and Pˆtt′t′′
The estimated PIs, Pˆt, Pˆtt′ and Pˆtt′t′′ , are calculated using the command
estPI(X, g, type = "pair", goi, mc = 1, order = TRUE)
The options type = "single", "pair" or "triple" specify the PIs to be computed, that is,
Pˆt, Pˆtt′ or Pˆtt′t′′ . The vector goi (”groups of interest”) specifies the groups (values of g) to be
used in the comparisons.
The option order specifies whether the PIs should be calculated for all possible pairs and
triples (order = FALSE) or just for pairs and triples with increasing group labels. In the four
group case, for example, an estPI call with the (default) options (type = "pair", order =
TRUE) would calculate the estimated PIs Pˆ12, Pˆ13, Pˆ14, Pˆ23, Pˆ24, Pˆ34 and in case of using the
parameters (type = "triple", order = TRUE) the estimated PIs Pˆ123, Pˆ124, Pˆ134, Pˆ234.
For matrix valued X, the option mc can be used to execute the parallel calculation on mc-many
cores in order to speed up the calculation (available only for Linux systems).
The result of the function estPI is a list, containing a matrix probs with the PIs as rows and
variables as columns. The other list items are the used parameters.
8 gMWT: Generalized Mann-Whitney Tests in R
Data NxG
G Variables: g=1,...,G
N obs.
n=1,...,N
Groups
t=1,...T
single pairs triple
Calculate Probabilistic Index
Function: estPI
Apply testFunction: gmw
KW
MW: 1 vs. All
MW: pairs
UIT
JT
JT*
Triple
Plot the test resultsFunction: plot
plotProb
plotProb
Figure 2: Possible calculation workflows.
Journal of Statistical Software 9
4.3. Scatterplots for Pˆt, Pˆtt′ and Pˆtt′t′′
Based on the probabilities calculated via estPI the package creates diagnostic scatterplot
matrices for variable-wise PIs. The command is either
plotPI(X, g, col, zoom = FALSE, highlight = NULL, hlCol = "red")
if the diagnostic plots are produced directly from the data or
pe <- estPI(X, g, type = "pair", goi)
plot(pe, col, zoom = FALSE, highlight, hlCol)
if the probabilities are first calculated by estPI. The function plotPI naturally allows also
all options used in the estPI call. The type of the plots depends on the selected PIs; the
pairwise scatterplots are then for all possible pairs of PIs.
Additional plotting options are highlight, hlCol and zoom. Using the highlight option,
indicated variables are plotted with a color specified in hlCol. If the Boolean flag zoomed is
set, the plots are zoomed to the active area of the PIs. Without this flag, the bivariate plots
are in [0, 1]× [0, 1].
4.4. Tests based on Pˆt, Pˆtt′ and Pˆtt′t′′
The basic call for applying the testing procedure is
gmw(X, g, goi, test = "mw", type = "permutation", prob = "pair",
nper = 2000, alternative = "two.sided", mc = 1, output = "min",
keepPM = FALSE, mwAkw = FALSE)
where only input variables X and g are compulsory.
If X is a matrix then the chosen test is applied variable-wise to the data and the results are
reported as a matrix of p values. Specifying the option output = "full" leads to a more
detailed list output with the same length as the number of different alternatives are tested
and each list item contains then again a list with as many columns as there are in X. Each
entry in the encapsulated list is a test result of class ‘htest’.
The vector g gives the group numbers in a natural order. The groups used in the analysis
can be specified via goi. If no goi is specified, all groups are used.
With the option test, the test ("uit", "triple", "mw", "kw", "jt", "jt*") can be spec-
ified. The option prob may be used only in the case of the Mann-Whitney test. For all pair-
wise group comparisons one uses (prob = "pair") while option (prob = "single") compares
each group to the rest of the data. The option type specifies whether the permutation type
test ("permutation") or the asymptotical test version ("asymptotical") is used. For some
standard tests the procedures from base R are available, and the Jonckheere-Terpstra test is
implemented in the clinfun package. The option type = "external" allows the use of these
test versions. A permutation type of test is available for all tests but, as the package is still
under active development, the asymptotical versions are not yet available for all cases. The
number of permutations is selected with the option nper. Different alternatives are avail-
able whenever they are natural and are set with alternative = "smaller", "greater" and
"two.sided".
10 gMWT: Generalized Mann-Whitney Tests in R
If the Westfall & Young method, as proposed in Westfall and Young (1993), will be later
applied for multiple testing, the permutation matrices used for the p value calculation must
be stored for a later use. In that case the option keepPM = TRUE has to be set for the later
maxT correction.
For the option mwAkw = TRUE, pairwise Mann-Whitney tests are performed after the global
Kruskal-Wallis test for a more detailed analysis of the differences between the groups. Please
notice that, to keep the overall significance level, the second step is allowed only after the
rejection by the Kruskal-Wallis test.
Again, the additional option mc may be used to speed up the computation for a large number of
variables if one works on a multiple core computer with Linux operating system. The amount
of cores can be specified with the mc option. This option decreases the calculation time also
on normal desktop computers drastically, since modern desktop computers usually have more
than one core. However, using all available cores in the calculations can make the computer
for the calculation time unusable for other tasks. Hence, for longer calculations we recommend
to choose here mc = detectCores() - 1 if other tasks are performed simultaneously on the
same computer.
The full test output itself is a ‘htest’ R object and has the standard output showing the
used data and the grouping object:
R> library("gMWT")
R> set.seed(123456)
R> myData <- c(rnorm(50), rnorm(60), rnorm(40, 0.5, 1))
R> myGroups <- c(rep(1, 50), rep(2, 60), rep(3, 40))
R> gmw(myData, myGroups, test = "uit", type = "permutation",
+ alternative = "greater", output = "full")
$`H1: Max(P13,P23) > 0.5`
********* Union-Intersection Test *********
data: Data:X, Groups:g, Order: max(P13,P23)
obs.value = 2.8081, p-value = 0.077
alternative hypothesis: greater
The attribute obs.value contains the value of the test statistic and the p.value contains the
p value based on the selected test version. The minimal standard output looks like
R> gmw(myData, myGroups, test = "uit", type = "permutation",
+ alternative = "greater")
pValues
H1: Max(P13,P23) > 0.5 0.077
Journal of Statistical Software 11
4.5. Multiple testing problem
As the testing procedures are applied simultaneously for a large number of dependent vari-
ables, we often face a severe multiple testing problem. Several attempts can be found in the
literature to adjust the p values and/or the test levels so that the inference remains valid
or the number of wrong decisions is as small as possible. The standard Bonferroni method
and its improved version by Simes (1986) control the family-wise error rate (FWER). Ben-
jamini and Hochberg (1995) suggested to control the false discovery rate (FDR) rather than
the FWER. Their procedure, known as the Benjamini-Hochberg procedure, also controls the
FDR at the same level and leads to the same practical rejection rule as the improved Bon-
ferroni procedure. It can be shown that the procedures keep their control levels also under
mild dependencies between the marginal test statistics. See, e.g., Fischer et al. (2014). For
permutation type tests, a multiple testing procedure proposed by Westfall and Young controls
the FWER and takes the dependence structure of the p values into account, see Westfall and
Young (1993).
The package gMWT allows a visualization of the p values as a plot of expected versus observed
proportions of rejected null hypotheses (cumulative distribution of the observed p-values) by
rejectionPlot(testResult, rejLine = "bh", alpha = 0.1, crit = NULL,
xlim = c(0, 0.1))
The input testResult is a vector of variable-wise p values for a selected test statistic. It is
also possible to pass a matrix testResult to the function. In that case each row is handled as
an own test result and curves from different tests are shown in the same figure. The option col
defines the colors for these curves. A solid line is the expected line under the null hypothesis.
The FWER and FDR controlling lines are given by the option rejLine. Possible parameters
are then "bonferroni", "bh" and "simes". The control level can be set with the option
alpha. For the use of these rejection lines, see Fischer et al. (2014). For some examples,
see Figures 6 and 7 in Section 5.1. Finally, the function getSigTests extracts the variables
with rejected null hypotheses at a given FWER or FDR control level α. The function call
getSigTests(X, alpha = 0.05, method) then provides a vector for the positions of these
variables. Possible options for method are "plain", "bonferroni", "bh", "simes", "maxT",
"csR" and "csD". Please keep in mind that in order to use the option "maxT" the permutation
matrix of the test has to be available and that for this the option keepPM = TRUE has to be
set when calling gmw.
5. Examples
5.1. Simulated data
We illustrate the use of the gMWT package first with a simulated data set. We created a
dataset with 500 variables and 3 groups with sample sizes N1 = N2 = N3 = 50. The 150×500
data matrix was obtained as follows.
1. Let zij , i, j = 0,±1,±2, . . . be independently identically distributed (iid) from N(0, 1).
12 gMWT: Generalized Mann-Whitney Tests in R
2. Let
xij =
10∑
k=−10
ψkzi,j+k, i = 1, . . . , 150; j = 1, . . . , 500,
where
ψk = 11− |k|, k = 0,±1, . . . ,±10.
This step thus makes the columns of the matrix X = (xij) dependent while the rows
are iid.
3. The final data matrix is then
X +
 0 · 150(1/3) · 150
(2/3) · 150
 s>
where s is a random vector with 25 ones and 475 zeros. Thus the null hypothesis is not
true for 25 variables indicated by s.
For our illustration, we first calculate the variable-wise PIs, Pˆt, Pˆtt′ and Pˆtt′t′′ by
R> ep1 <- estPI(X, groups, type = "single")
R> ep2 <- estPI(X, groups)
R> ep3 <- estPI(X, groups, type = "triple", order = FALSE)
In the data set the 25 shifted variables are indicated by pickGenes. The scatterplot matrices
in Figures 3, 4 and 5 are then obtained and the shifted observations highlighted as follows.
As the same observations are used repeatedly for different PIs, the PIs are dependent as can
easily be seen in the figures.
R> plot(ep1, highlight = pickGenes, zoom = TRUE)
R> plot(ep2, highlight = pickGenes, zoom = TRUE)
R> plot(ep3, highlight = pickGenes, zoom = TRUE)
Next, we simultaneously perform several tests for all the 500 variables.
R> kw.results <- gmw(X, groups, test = "kw")
R> jt.results <- gmw(X, groups, test = "jt")
R> uit.results <- gmw(X, groups, test = "uit", alternative = "greater")
R> triple.results <- gmw(X, groups, test = "triple", alternative = "greater")
The results from the testing procedures are summarized in rejection plots, see Figures 6
and 7. In Figure 7 a Benjamini-Hochberg rejection line at the FDR control level α = 0.1 is
provided as well. The null hypotheses with p values smaller than the (highest) crossing point
of the straight rejection line and the curve are then rejected and at most 10% of the rejected
hypotheses are then in fact true.
The figures were created by
Journal of Statistical Software 13
l
l
llll
l
llll
ll
ll
l
l
l
ll
l lllll l l
ll
ll
ll
l
llllllllll
l
llllll
lll
l
l
l
l
l
llll l
l
l
l l
l
llll
lll
lllll l l l l
ll
lllllll
l
ll
l
l
l
l
llll
ll
l
l
l
l
ll
llll
l
l
l
l
ll
ll
ll
l l l l
l
l
l
l
l
l
llllll
l
l
l
lllll
l
lll
ll
l
l
l
llll
l
ll
l
l
l
l
l
l
l
ll
ll
l
l
l
l
l
l
l
l
l
ll
lll
l
l
ll
l
l
llll
l
l
lll
l
lll
l
l
l
l
l
l l
l
l
l
l
ll
l
l
lllll
ll
l
l
lll
lll
l
llll
ll l l l
l ll
l
l
ll
lll
l
ll
ll
ll
l
l
lll
l
lllll
lll
ll
ll ll ll
l
l l l
ll l llllllll
l
l
ll
ll
l
l
lllll
l
l
l
l
l
l
lll
l
ll
l
l
l
lll
ll
ll
l
l
l
l
l
l
lllll l l l
lll l
l
l
P(2)
P(
1)
0.
4
0.
6
0.4 0.6
l
l
llll
l
llll
ll
l
l
l
l
ll
llll
llll
ll
ll
ll
ll
llllllllll
l
lll lll
lll
l
l
l
l
l
lllll
l
l
ll
l
lll
lll
llll
llll
lll
lll lll l
l
ll
l
l
l
l
l
lll
ll
l
l
l
l
ll
l lll
l
l
l
l
lll
l
l
lll
llll
l
l
l
l
l
l
llllll
l
l
l
llll
l
lll
ll
l
l
l
llll
l
ll
l
l
l
l
l
l
l
ll
ll
l
l
l
l
l
l
l
l
l
ll
ll
l
l
ll
l
l
ll
l
l
llll
l
ll l
l
l
l
l
l
ll
l
l
l
l
ll
l
l
lllll
ll
l
l ll
l
llll
llll
ll
l
l
ll
lll
l
ll
ll
l
l
l
l
l
ll l l
lll
lll
l
l
llllll
l
lll
lllll l l ll
l
l
ll
ll
l
l
l
lllll
l
l
l
l
l
l
lll
l
ll
l
l
l
lll
ll
ll
l
l
l
l
ll
l
l lllll
ll
l
l
l
P(3)
0.4 0.6
P(
2)
0
0.
2
0.
4
0.
6
0.
8
1
ll
ll
lll
lll
l
lllll
l
l
llll
ll lll
l
lllllll
ll
ll
ll
l
llllllll l
l
llll l
l
ll
l
ll
l
ll
l
l
l
l
lllll
l
lll
ll
l
l
l
l
ll
lll
ll
l
l
llll
l l
l l
l
llll
l
l l l lll
l
llll
lll
ll
l
ll
lll
l
ll
l
l
l
l
l
l
l
l
l
llll
l
l
l
l
lll
lll l l
ll
ll
llll
l
l
ll
ll
lll
l
lll lll l
llll l
lllll
llll l lllll
l
ll
ll
llll
l
l
lll
l
l
l l
l
l
lll
l
llll lllll
ll
l
l l
lllll
l lllll
l
l
ll
lll
l
l
l
ll
l lll
ll
l l
ll
l
l
lll
l l
ll
llll ll
ll
ll
l l
l
lll
l
l
l
l
l
llll
lll
lll
l ll
ll l
l
l l
l
l ll
l
l ll
l
l
l
l
ll
ll
l
l
l
l
l
l
ll
l
l
l
l
l
ll
l
l l
Figure 3: Scatterplot matrix for ep1.
l
l
ll
llllll
l
l
ll
l
l
l
l
l
lll
lllll
l l ll
ll
l
l
ll
ll
lllllll
l
lllll
l
lll
l
l
l
l
l
lll
l
l
l
l
l
l
l
lll
l
ll
l
l
ll
l l
l
ll
lll
lll
lll
lllll
l
l
l
l
l
l
lll
l
l
l
l
l
ll
ll
ll
l
l l
l llll l
l l l
l
l
l
l
l
l
l
l
l
l
llll
l
l
l
l
l
lll
lll
l
lll
l
ll
ll
l
lll
l
l
ll
l
l
ll l
l
l
l
l
ll
l
l l
l
l
l
l
l
ll
ll
l
llllll
l
ll
l
llll
ll
lll
l
l
l
l
ll l
l
l
ll
l
lll
l
l
l
l
lll
l
l
llll
ll
l
l llll
l
l
l
ll
l
l
l
llll
ll
l
lll
llllllll
l l
ll
lll
llll
l
l
l
l
l
l ll
l
ll
ll
l
ll
l
l
l ll l
l l
l
lllll
l
l
l
l
ll
ll
l
l
ll
l
l
l
l
lll ll
l
l
l
l
l
l
l l
l
ll
ll
l l l
l
ll
l
l
l
P(1<3)
P(
1<
2)
0.
4
0.
6
0.4 0.6 0.8
l
l
ll
llllll
l
l l
ll
l
l
l
l
l
lll
llll
l ll
ll
l
l
ll
ll
lllllll
l
lllll
l
llll
l
l
l
l
llll
l
l
l
l
l
l
l
llll
l
ll
ll
l
l
l l
l l
l
ll
ll
lll
ll
l lll
l
l
l
l
l
l
l
ll
l
l
l
l
ll
ll
ll
l
l l
ll ll l
l l l l
l
l
l
l
l
l
l
l
l
llll
l
l
l
l
l
lll
lll
l
lll
l
ll
ll
l
lll
l
l
l
l
l
lll
l
l
l
l
ll
l
l l
l
l
l
l
l
ll
lll
l
llllll
l
l
l
llll
l
l
l
l
l
l
ll
l
l
l
l
lll
l
l
l
lll
lll
l
lllll
ll l
l l l ll
l
l
l
ll
l
ll
l
lll l
l
l
l
l
ll
ll
ll
ll
l
l
ll
l
ll
l
l l
l ll
l
lll
l
l
l
l
llll
l
l
lll l
l
l
lllll
l
l
l
ll
ll
l
l
ll
l
l
l
l
l lllll
l
l
l
l
l lll
l
l
ll
l l l
l lll
l
l
l
P(2<3)
0.4 0.6
P(
1<
3)
0
0.
2
0.
4
0.
6
0.
8
1
l
l
lllll
llll
l
l
ll
l
ll
lll
lll
ll l
l
ll
ll
llllll
llll
llllllllll
l
ll
l
l
lllll lllllllllllll
lll l
l l
l
lll
llll
ll
l
l
l
l
lllll
l
l
l
l
l
l
l
l
lll
l l
l
l
ll
l
l
l
ll
l
l ll l llll
l
llllllll
ll
lll
l
l
lllll
llll
l
l l
l
l
ll
ll
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
lll
l
l
l
l llll
lll
lll
l
l
l
l
l
l
l l
l
l
l
l l
l
lllll
l
ll
l
l
l
l
l
l
l
llll
l l l
l
l
l
l
ll
lllll
l
ll
l l
l
ll
l
l
llll
ll
ll
l
l
l
ll ll
ll
l
l
ll
lll
l
l
l
ll
l
l
l
l
l
ll
llll
l
l
l
l
l
ll
lll
l
ll
ll
lll l
l
l
ll
ll
l
l l l
l
l
ll
lll
ll
l
l ll
lll
ll
ll
Figure 4: Scatterplot matrix for ep2.
R> tests <- rbind(kw.results$p.values, jt.results$p.values,
+ uit.results$p.values, triple.results$p.values)
R> rejectionPlot(tests, lCol = c("green", "red", "blue", "cyan"))
R> rejectionPlot(tests, lCol = c("green", "red", "blue", "cyan"),
+ xlim = c(0, 0.2), rejLine = "bh", alpha = 0.1)
The variables with rejected null hypotheses by a Kruskal-Wallis test at the FDR control level
0.10, for example, are obtained by
R> getSigTests(kw.results, method = "bh", alpha = 0.1)
For highlighting the results from one selected test, one can for example use the command
R> plot(ep3, highlight = which(triple.results < 0.01)))
This illustration shows that computing and looking at the scatterplots of the PIs may be a
useful tool to understand the dependencies between marginal PIs and the corresponding test
statistics as well as to detect hidden structures in the data. The rejection plot is a useful tool
in the analysis of large datasets and in the multiple testing problem.
5.2. Application example: gMWT and eQTL analysis
The testing procedures implemented in the package gMWT can be used in an eQTL analysis.
In the following, we combine the gene expression data with the genotype data in order to
find important single nucleotide polymorphisms (SNPs) that are associated or influential to
the expression level of certain genes. As a first step, using the gene expression data, we
identify the genes which are differentially expressed between cases and controls. Interesting
genes are for example identified with the Wilcoxon tests (two groups) or with a Kruskal-
Wallis test (several groups). After the identification of an interesting gene, we consider all
14 gMWT: Generalized Mann-Whitney Tests in R
l
l
lllllllll
l
ll
l
ll
l l
llllllll
ll ll l
l
l llllllll
l
lll
ll
l
l
l
l
l l
l
l
l
ll
lllllll
l
l
l
ll
ll
lll
ll
lllll
l
l
l
l
l
l
l l l l
l
ll
l
l
ll
l
lllll
l
ll
l
l
l
l
l
l
l
llll
ll
ll
l
l
l
l
l
l
ll
llllll
l
l
l
l
l llll l l
ll
l
lll
l
ll l
l
l
l
l
l
l
l
ll
l ll
l
l
ll
l
lll
l
l
l
l
l
ll
l
ll
l
llll
l
l
l
lll
ll
l
ll
l
l
l
l
ll
l
l
l
l
l
l
l
ll
ll
l
l
llll
lll
l
ll l ll
l
l
lll
ll
l
l l l
l
l
l
lll
lll
l
lll
lll l
l
ll
l llll
l ll ll
lll
l
llll
l
l
lll
l
llll l
l
l
l
lll
l
l
l ll
l
l
l
l
l
l
l
l
l
l
l
l
l l
l l
l
lll
llll
l
l
ll
l
l ll
ll
l
l
P(1<3<2)
P(
1<
2<
3)
0.
2
0.
4
0.2
l
l
ll lllllll
ll
l
ll
ll
lllllll l
l l l llll
l
llllllll
l
lllll
ll
l
l
l
ll
l
l
l
lll
l lll
l
lll
ll
ll
lll
ll
l
l ll
ll
l
l
l
llll
l
ll
l
l
ll
l
l
lll
lll
ll
llll
l
l
l
l
l
l
l
ll l
l l
lll
l
l
l
l
l l
ll l l ll
l
l
l
l
llllllll
ll
ll
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
lll
l
ll
l
l
l
l
l
ll
l
ll
l
llll
ll
l
l
l l
l
ll
llll
l
l
l
lll
l
l l
l
l
l
l
l
l
l
ll
ll
l
ll
llll
l
llll
l l
l
l l
ll
ll
l
l
l
ll
ll
l
ll
l
l
lll ll lll
lll
l
l l l
l l l
l
lllll
l
l
l
l
ll
l
l
l ll
l
l
l
l
ll
l
l
l
l
llllll
l
llll
lll
l l l
llll
l
lll l
lll
l l l lll
l
l
l
l
P(2<1<3)
0.2
l
l
lllllll
l
ll
l
ll
l
l
l lllll
llllll
ll
lllllllll
l
llll
ll
l
l
l
ll
l
l
lll
lllll l
lll
ll
ll
llll l
ll
lll
ll
l
l
l
l
lll
l
ll
l
l
ll
l
l l
lll
l lll
l
l
l
l
l
lllll
ll
lll
l
l
ll
l
l
l
l
ll
ll l l l
l
l
l
l llll
ll
ll
lll
l
lll
l
l
l
l
l
l
l
l
l
l
ll
ll
l
l
l
lll
l
ll
l
l
l
l
ll
l
ll
l
ll
l
llll
lll
l
l
l
l
lllll
l
lll
ll lll
l
l
l
l
ll
l
l
l
l
l
l
ll
ll
l
l
l
llll
lll
l
llll
lllll
l
ll
l
l
l
l
ll
l
l
lll
ll
llll
lll
ll
l l l
l
l
lll
ll
l
ll
l
l
l
lll
ll
l
l
l
l
l
l
l
l
l
lllll
l
ll ll
ll
ll
ll
ll l
l
l
l
lll
lllll lll ll l l
l
l
l
l
l
P(2<3<1)
0.2
l
l
llllllllll
l
l
l
ll
l
llllllll
llll l
l
ll l lll lll
l
llll
ll
l
l
l
l
ll
l
l
l
lll
lllll
l
lll
l l l
l l
llll
ll
lllll
ll
l
l
l
l
lll
l
ll
l
l
ll
l
llll
ll
llll l
l
l
l
l
l
l
l ll l
ll
ll
l
l
ll
l
l
l
l
llllll
l
l
l
l
ll l ll
ll
ll
l
l
lll l
l
l
l
l
l
l
l
l
l
l
l l
ll
l
l
l
l
ll
l
l
l
l
l
l
l
ll
l
l
ll
l ll
l
ll
ll
lll
l
l
l
ll
l
l
l ll
l
lll
l
l
l
l
l
l
ll
l
l l
llll
l l l
l
llll
llll
l
l
l
l
l
ll
l
l
l
l
lll
l
l l ll
lll
l ll
lll
l
l
lll
l
ll l l
l
l
l
l
ll
l
l
ll
l
l
l
l
ll
l
l
l
l
l
lllll
ll
ll
lll
llll
l
lll
ll
l lllllll
ll
l
l
l l
P(3<1<2)
0.2
l
l
lll ll lll
l
ll
l
ll
ll
llll l l l l
llllll
ll
ll
lllll ll
l
ll l ll
ll
l
l
l
l
l
l
l
l l
lllll
l
lll
lll
ll
ll l
ll
lll
l
lll
l
l
l
lll
l
ll
l
l
l
l
ll
lll
l
l l l
l
l
l
l
l
l
lllll
ll
lll
l
l
ll
l
l
lllll
l
l
l
llll
ll
ll
l
lll
l l
l
lll
l
l
l
l
l
l
l
l
l
l
ll
l
ll
ll
l
l l
l
lll
l
l
l
l
l
ll
l
l l l
ll
l
l
lll
lll
l
l
l
l
l
l
l
ll
ll
l
ll
l
l
lll
l
l l
l
lll
l
l
ll
l
llll
l
l
l
l
l
l l llll
llll
l
lllll
l
lll
ll
l
l l
l
l
ll
l
l
l l
l
l
l
l
ll
l
l
l
l
lll
l
l
l
l
l
l
l
lll
l
l
l
l
l
ll
lllll
llll
l
l
l
l
P(3<2<1)
0 0.2
P(
1<
3<
2)
0
0.
2
0.
4
0.
6
0.
8
1
l
l
ll
ll
l
lll
ll
l
ll
l
ll
l ll
ll
l
l
l
l
l
l
l
l
ll
ll
ll
lll
l
lllll
l
l
l
ll
l
l
l
l
l
l
lll
l
l
ll
l
l
lllll
l
ll
ll
ll
l
l
l lll
l
l
ll
lll
l
l
l
l
llll
l
l
l
ll
l
l
llll
llllllllll
ll
l
ll
ll
l
l
l
l
l l
l
l
ll l
l
ll
ll
l
l
l
l
lllll
l
lll
l
l
ll
lllll lll
l
l
l
ll
l
l
l
ll
ll
l
ll
l
l
llll
l
l
l
ll
l
l
ll
ll
l
l
ll
ll
l
lll
ll
l
llll
l
l l
l
ll
lll lll
ll
l
l
l
lll
ll
l
l l
l
l ll
l
l
l
l
l
l
l
l
l
l
ll
l
lll
l
l
l
l
l
ll
l
lll
l l
ll
l
l
l
l
l
l ll l
ll
l
ll
l
l
ll
ll
ll
l
ll
l
l l
l
l
l
llll
lll
l
l
l
l
l
l
l
lll
l
ll
l
l
ll
l
l lll
l
l
l
l
l
l
l
lll l
l l
l
lll
l
l
lll
0.2
l
l
ll
ll
l
ll
ll
l
ll
l
ll
l
ll
ll l
ll
l
l
l
l
l
l
ll
l
l
lll
l
lllllll
l
l
l
ll
l
l
l
l
l
l
l
lll
l
l
ll
l
l
l l lll
ll
llll
l l
llll
l
l
ll
lll
l
l
l
l
lll
l
l
ll
l
llll
lllll lll
ll
ll
ll
l
ll
lll
l
ll
l
l
ll
l
l
l ll
l
l l l
ll
l
l
l
ll
ll
l
l
l
ll
l
l
l
l l ll l l l ll
ll
l
ll
ll
l
l
ll
l
ll
l
l
l
l
l
l
l
l
lll
l
l l
ll
ll
l
l
l
lll
ll
ll l
l
l l
l
l
l
l l l
llll
l
ll
l
l
l
llll
l
ll
l
ll
l l
l
ll
ll
l
ll
l
l
l
l
l
l
l
l
l
ll
l
lll
l
l
l
l
ll
l
ll
l
l
l
l
l
l
l
l
llll
ll
l
ll
l
l
l
ll
l
lll
l
l
l
ll
l
ll
l
l
l
l
l
ll
l
ll
l
l
l
lll
l
l
l
l l l
ll
ll
l
llll l
l
l
l l llll
l
lll
0.2
l
l
ll
ll
l
lll
ll
l
l
l
ll
l
lll
llll
ll
l
l
l
l
l
l
ll
ll
l l
lll
l
llllllll
l
l
l
ll
l
l
l
l
l
l
l
lll
l
l
ll
l
l
lllll
l
l l
l l
lll
lll
l
l
lllll
l
ll
ll
l l l
l
l
l
l
lll
l
l
ll
l
lll
l
lllllll l l l llll l l
ll l
ll
lllll
ll
l
l
l
l
l
l
lll
l
l
ll
l
l
l
lllll
l
l
l
lll
l
l
ll
lll
l
l l
l
ll
l
ll
l
ll
l
l
l
l
ll
l
l
l l
l
l
l
lllll
l
l
lll
l
l l
l
l ll
l
l
l
ll
l
ll
l l l ll
ll l
l l
l
lllll
ll
l
ll
lll
l
ll
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
ll
l l
l
l
l ll
l
ll
ll
ll
ll
l
l
llllll
l
lll
l
l l
l
l
l ll
l
ll
l
lll
l
l
l
lll
l
l
l
l
l
l
l
l l
l
lll
l
l
l
l l
l
ll
ll
ll
l
llll
ll l l
l
l
ll
ll
l
l
l
ll l
0.2
l
l
ll
ll
l
l l
ll
l
ll
l
ll
l
ll
lll
l l
l
l
l
l
l
l
l
l
ll
ll
l
ll
l
ll lllll l l
l
l
l
ll
l
l
l
l
l
l
l
l ll
l
l
ll
l
l
l llll
ll
ll l
ll
lll
l
ll
ll
lll
l
l
l
l
ll
l
l
l
l
l
l
ll
ll l l l ll lll
lll
l
lll
l
l
ll
l
ll
l
l
lll
l
l
ll
l
l
l
l
lllll
l
l
ll
ll
l
ll
ll llllll
ll
l
l
ll
l
ll
l
ll
l
l
l
ll
l
l
l
l l
l
l
lll
l
l
lll ll
l
ll
ll
l
l l
ll
l
llll
l
lll
l
ll
l
ll
l
l
l
l
lll l
l
ll
lll
l
ll
l
l
l
l
l
l
l
ll
l
l ll
l
l
l
l
ll
ll
l
ll
l
l
l
l
l
ll
l
l
ll
l
ll
l
l
l
ll
l
l
l
ll
l
lll
l
l
l
ll
l
l l l
l
l
l
l
l
l
ll
llll
l
l
l
l
l
l
ll
l
l
l
lll l l
l
ll
l
l
llllll
l
l
l l ll
0 0.2
P(
2<
1<
3)
0
0.
2
0.
4
0.
6
0.
8
1
l
l
l
lllllll
lll
l
llll
ll
l
l
ll
l
l
l
l
l
lll
l
llll
ll
l
l
llll
ll
l
l
ll
l
ll
l
l
llll
l
l
llll l
l
lll
l
ll
ll l ll
l
l
l
ll
lll
lllll
l
llll
l
ll
lll
l l
lll
llll
ll
ll
l
l
l
ll
llll
l
l
l
ll
ll
l
l
l
l
ll
l
l
l l
l
l
l
l l
l
ll
l
lll
ll
llll
l
ll
l
l
lll
ll
ll
ll l
l l llll
l
ll
ll
ll
l
lll
l
ll
l
l
l
lll
ll
l
l
lll
l
l
l l
l l
llll
l
ll
l
l l l l
l l l l
ll
l
ll
l
ll
l
l
l
l ll
l
llll
l
l
ll
l
l
lll
ll
l
l
l
llll
l
l
l
ll
l
ll l ll
l
l l ll
ll
llll
ll
l
lll
l
l
l
l
l
l
l
lll
ll
l
l
l
l
llll
l
l
ll
ll
l
l
l
ll
l
l
l
l
l
l
ll
l
l
ll l
l
l
lll
l
l l
l
l
l
l
ll
lll
llll
l
ll
l
l
l l
ll
ll l
l lll
0.2
l
l
l
ll llllll
l
l
ll ll
lll
ll
l
l
l
l
l
l
l ll
l
l
lll ll
llll
l
ll
ll
l
l
lll
l
ll
l
l
lll
l
l
lllll
l
lll
l l
l l l
llllll
l
l
ll
l
lll
l l l ll
l
lll
l
l
lll
l
l
lll
lll
ll
l l
l
l
l
lll
l ll
l
l
l
l
ll
l
l
l
l
ll
l
l
ll
l
l
l
l
l
ll
l
lll
lll
llll
l
lll
l
l
ll
ll
ll
ll
ll
l
ll
l l
l ll
l
l
l
l
l
l
lll
ll
l
l
l
l
l
l
l
l
ll
lll
l ll
llllll
l
l
l
llll
l
l
l
lll l
l
l ll
l
l l l
l
lllllll
l
l
l
l
lll
l
l
l
ll
l
l
l l
l l l
lll
ll
l lll
l
l
ll
l
lll
ll
l
l l
l
lll
l
l
ll
ll
l
l
llll
l
l
l
l
l lll
ll
l
l
llll
l
l
l lll
l
lll
l
l
l
l
ll
l
l
l
l
ll
l
ll
ll
lll
0.2
l
l
ll
ll l ll ll
ll
l
llll
ll
ll
l l
l
l
l
lll
l
l
llll
lll
lll l
lll l
ll
l
l
lll
l
ll
l
l
ll ll
l
l
llll
l
lllll
l ll
l
l
l
l
lllll
lllllll
ll
l
l
l
l
ll
ll
ll
ll
l
l
l
ll l
llll
l
l
l
l
ll
ll
l
l
l
l
ll
l
l
ll
l
l
l
l
l
ll
l
l
ll
l
ll
l
l
l l
l
ll
l ll
lll
l
l
ll
ll
l
lll
l
l
l
l
ll
ll
ll l
l
ll
l
l
l
ll l
ll ll
lll
l
l llll
l
ll
l
l
lll
l
l
ll
l
l
l
l l l
ll
l
l lllll
l
l
l
l
l
l
l
lll
l
l
l
l
ll ll
l
l llll
ll
lll
l
ll l
ll l
l
l
ll
l
l
l
l
l
l
l
ll
l
l
lll
l
l
ll
ll
l
l
ll
lll l l
l
l
l
l
l
llll
ll
l
l
l
l
l
llll
l
ll
l
l
l l
l
l
l
l
l
l
l
ll
ll l
l ll
0 0.2
P(
2<
3<
1)
0
0.
2
0.
4
0.
6
0.
8
1
l
l
ll lll
l
ll
ll
l
l
l
llll
lllllll ll
ll l
lll
lll
l
l
l
llll
ll
lll
l
l
ll
l
l
ll ll
l
l
l
l
l
lll l l
l llll
l
l
l
l
l
lll
lll
llll l l l lllll
l
l
l
l l
l
l
l
lll
ll
ll l l
llll l
lll
lllll
lll
ll
ll
ll
l
l
l
ll
l
l
l
ll l
l
llllll
l
l
lll
l
l
l
ll
l
ll
l
l
l
ll l
l
ll
l
l l
l l
l
l
llll
l
l
lll
l
l l lll l
ll
l
l
l
l
lll
l
lll
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
ll
l
l
lll
l
l
ll l
l l
ll
l
l
l
ll l l
l
llllll
ll
ll
l
l
ll
lll
l
ll
l
l
l
l
ll
ll
l ll l l
l
ll
l l lll
l
ll
l
ll
l
lllll
l l l
l l ll
l
ll
l
l
ll
l l
l
l
l
l
l
lll
l
l
ll
l
l
l
lllllll
l
l
l
l
l
l
ll
l
l
ll
ll
l
l
llll
l
l ll
0.2
l
l
lll ll l
l
ll
ll
l
l
l
llll
ll l l llllll
ll
lll
lll
ll
l
l
l
ll l l
l
lll
l
l
ll
l
l
lllll
l
ll
l
l
l
lllll
lll l
l
l
l
l
l
llll
ll
llllllll
l
l
l
ll
l
l
l
ll
llllll
ll
l lll l
l l ll ll
lll
lllllllll
l
l
ll
ll
l
l
l
ll
l
l
l
lll
ll
llllll
l
llll
l
l lll
l
l
l l
l
l
l
lll
l
ll
l
ll
l l
ll
l
lll
l
l l
l
lll ll
l
ll
l
ll
l ll
l l
l
ll
l
l
l
ll
ll
l
l
l
l
l
l
ll
l
l
l
l
llll
l
l
l l l
l
l
l
l l
l l ll
l
ll lll
l
ll
l
l
l
l
ll
llll
l
ll
l
l ll
l
l
l
l
l l
lllll
l
l
ll
ll
ll l l
l
lllll
lll
l
l l
l
lll
lll
l
l
l
l
ll ll
l
ll
l
l
l
llll
l
l
l
ll
l
l
l
ll
lll
l
l
l
llll
l
l
ll
0 0.2
P(
3<
1<
2)
0
0.
2
0.
4
0.
6
0.
8
1
l
l
llll l
l l
l lllll
l
ll
lll
l
l
l
ll l l l
l
ll
ll
llll
lll
l
llll
l
l llll
l l l
l
l
lll
l
l
lll
l
l
ll
lllllll
l
lll
l
l
l
ll
l
l
l
l
lll
l
l
llllll
l
l
l
l
ll
l
l
l
l l
llll
l
lll
l
l
l
l
l l
ll l
l
ll
l
lllll
l
l
l
ll
ll
lll
ll
l
ll
ll
l
ll
l
llllll
lll
l
ll
l l ll
ll
l
ll
l
lll
l
ll
l
l
l
l
ll
l
ll
l
l
ll
l
l
l
l l
l
lll
lll
l
llll
l
ll
lll
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
llll
l
l
l l
l
l
l
l
l
lll
l
lll
l
l
l
ll
l
l
l
lllll
l
l
l
l
ll l
l l
l
ll
l
ll
ll
lllll
l
l
ll
l
l
l
l
l
l
ll
ll
ll
l
l
l
l
l
l
ll
l
ll
l
l
l
l
l l
l l
llll
l
l
lllll
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
lll l l l
ll
Figure 5: Scatterplot matrix for ep3.
SNPs located in its w-neighborhood (common values for w are 0.5MB or 1MB, where MB
stands for megabases). The genotype information of these neighboring SNPs are then used to
explain the expression of the respective center gene. For the SNPs, there are three different
genotypes, the wild-type allele coded as 1 or (AA), the heterozygeous mutation 2 (AB) and
the homozygeous mutation 3 (BB). Common platforms measure genotypes for 700k up to
several million SNPs. See Figure 8 for a sketch on how gene expression values and genotype
information are linked.
A popular but unrealistic method to consider SNP-wise associations between genotype and
gene expression is to fit a linear model. We use the triple test with the PIs Pˆ123 and Pˆ321
(increasing or decreasing trend) to avoid strong assumptions of linear dependence. If p1 and
Journal of Statistical Software 15
0.0 0.2 0.4 0.6 0.8 1.0
0.
0
0.
2
0.
4
0.
6
0.
8
1.
0
Expected Ratio
O
bs
er
ve
d 
Ra
tio
Figure 6: Rejection plot for α ∈ [0, 1].
0.00 0.05 0.10 0.15 0.20
0.
00
0.
05
0.
10
0.
15
0.
20
Expected Ratio
O
bs
er
ve
d 
Ra
tio
Figure 7: Rejection plot for α ∈ [0, 0.2].
Chromosomal position
Ex
pr
es
sio
n 
va
lu
es
 o
f d
iff
e
re
n
t i
nd
ivi
du
al
s
l
l
l
l
l
l
SNP A Gene X SNP B SNP C
l Probe Expr.
A/A
A/B
B/B
Figure 8: Sketch on how genotype information and gene expression are linked.
p2 are the two p values from the corresponding directional tests, a two-sided p value for a
SNP-wise test is obtained as min(2 · min(p1, p2), 1) and it is implemented in our other R
package GeneticTools. The linear model approach is also an option there.
In order to perform an eQTL analysis with GeneticTools the genotype data has to be present
in ped/map file format, which is the standard output format of many programs. In addition
to that a gene expression matrix is required. The basic call to perform an eQTL is then
R> library("GeneticTools")
R> setwd(file.path("Data", "Genotypes"))
R> myEQTL <- eQTL(gex = geneEx, geno = "example", xAnnot = geneAnnotations,
+ windowSize = 0.5, method = "directional")
We assume here, that the ped/map file pair is stored in the Folder file.paht("Data",
"Genotypes") and is called example.ped and example.map. The file name is given to the
eQTL function with the geno option. In case the SNP data has been imported already
previously using the package snpStats (Clayton 2014) and its function read.pedfile() this
object can be passed to the geno option instead of a file name. The gene expressions are
stored in the matrix geneEx and each column refers to a gene and each row is an individual.
16 gMWT: Generalized Mann-Whitney Tests in R
Gene1 − 1
Chr 1 : 8474334 − 9574334
Chromosomal Position in MB
p−
va
lu
e
8.5 8.72 8.94 9.16 9.38 9.6
0
0.
2
0.
4
0.
6
0.
8
1
M
on
o.
N
A
l
l
l
l
l
l
l
l
ll
l l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
llll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
Figure 9: eQTL-plot. The dashed line symbolizes the location of the linked gene and each
dot represents the test result for a SNP test. The dotted line refers to the chosen significance
level. In case that all individuals have the same genotype no test is performed and this is
marked as “Mono.”. If no genotype information was available for a SNP, this is marked as
NA.
The expression matrix is specified with the gex option.
The option xAnnot is important for the gene annotations and is a list where each list item
refers to a gene from the columns of geneEx. The names have to match and the eQTL is only
performed for matching pairs. In case that no annotation xAnnot is given to the function no
window is used and all combinations are considered instead. Be aware that this might lead
to a very long lasting calculation. Each list item in geneAnnotations is a matrix like
R> geneAnnotations
$Gene1
Chr Start End
1 1 8974334 9074334
$Gene2
Chr Start End
1 12 135633062 135738062
2 12 135735062 135838062
Journal of Statistical Software 17
This takes into account that certain probes have multiple locations in the genome and our
method will test for all those locations. In case that the labeling of individuals between the
expression data and the genotype data differs, there is an option to give a new vector of
labels, called genoSamples. Here can new labels for the individuals in the genotype data be
specified, that match with the row names of geneEx.
It is important to note that the order of the rows and columns of all lists and matrices does
not have to match. The function takes the smallest subsets of individuals and genes and takes
then those SNPs, which are in a window around that gene. The window size can be specified
with the option windowSize using the unit megabases (MB).
After the eQTL is performed we can visualize the results with
R> plot(myEQTL, which, file, sig)
The which option specifies for which genes from geneEx the plots shall be created. If no option
is given, then all plots are created. Because this might lead to a vast number of pictures,
there is also an option file to specify a file name such that the plots are saved in a file with
this name. The output of a single plot can be seen in Figure 9 with a chosen significance level
sig = 0.1.
This way we can see for interesting genes the behavior of the surrounding SNPs onto the gene
expression. For small datasets it is also possible to check visually for interesting genes. For
larger gene sets the function extractEQTL can be used to determine a set of interesting genes.
We applied this function on real data e.g., in Siltanen et al. (2013).
References
Benjamini Y, Hochberg Y (1995). “Controlling the False Discovery Rate: A Practical and
Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society B, 57(1),
289–300.
Clayton D (2014). snpStats: SnpMatrix and XSnpMatrix Classes and Methods. R package
version 1.16.0, URL http://www-gene.cimr.cam.ac.uk/clayton/.
Eddelbuettel D, Franc¸ois R (2011). “Rcpp: Seamless R and C++ Integration.” Journal of
Statistical Software, 40(8), 1–18. URL http://www.jstatsoft.org/v40/i08/.
Eddelbuettel D, Sanderson C (2014). “RcppArmadillo: Accelerating R with High-Performance
C++ Linear Algebra.” Computational Statistics & Data Analysis, 71, 1054–1063.
Fischer D (2014). GeneticTools: Collection of Genetic Data Analysis Tools. R package
version 0.3, URL http://CRAN.R-project.org/package=GeneticTools.
Fischer D, Oja H (2015). gMWT: Generalized Mann-Whitney Type Tests. R package version
1.0, URL http://CRAN.R-project.org/package=gMWT.
Fischer D, Oja H, Sen PK, Schleutker J, Wahlfors T (2014). “Generalized Mann-Whitney Type
Tests for Microarray Experiments.” Scandinavian Journal of Statistics, 41(3), 672–692.
18 gMWT: Generalized Mann-Whitney Tests in R
Hothorn T, Hornik K, van de Wiel MA, Zeileis A (2008). “Implementing a Class of Per-
mutation Tests: The coin Package.” Journal of Statistical Software, 28(8), 1–23. URL
http://www.jstatsoft.org/v28/i08/.
Perlman MD (1969). “One-Sided Testing Problems in Multivariate Analysis.” The Annals of
Mathematical Statistics, 40(2), 549–567.
R Core Team (2014). R: A Language and Environment for Statistical Computing. R Founda-
tion for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Serfling RJ (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons.
Seshan VE (2014). clinfun: Clinical Trial Design and Data Analysis Functions. R package
version 1.0.6, URL http://CRAN.R-project.org/package=clinfun.
Siltanen S, Fischer D, Rantapero T, Laitinen V, Mpindi JP, Kallioniemi O, Wahlfors T,
Schleutker J (2013). “ARLTS1 and Prostate Cancer Risk – Analysis of Expression and
Regulation.” PLoS ONE, 8(8), e72040.
Simes RJ (1986). “An Improved Bonferroni Procedure for Multiple Tests of Significance.”
Biometrika, 73(3), 751–754.
Thas O, De Neve J, Clement L, Ottoy JP (2012). “Probabilistic Index Models.” Journal of
the Royal Statistical Society B, 74(4), 623–671.
Westfall PH, Young SS (1993). Resampling-Based Multiple Testing: Examples and Methods
for p-Value Adjustment. John Wiley & Sons.
Journal of Statistical Software 19
A. Union-intersection test for P13 and P23
We implemented two ways to calculate p values from the UIT, the first one is based on a
permutation approach and the other one is based on asymptotical results. In both cases
we first need to calculate the critical value c of the test statistic Q∗. For the test statistics
S1 = Pˆ13 and S2 = Pˆ23, S = (S1, S2)
> ∼ N2(µ,Σ) approximately with
Σ =
(
1 ρ
ρ 1
)
.
Our UIT test statistic Q∗ is then
Q∗ = I0 · 0 + I1 · (S2 − ρS1)
2
1− ρ2 + I2 ·
(S1 − ρS2)2
1− ρ2 + I3 · S
>Σ−1S,
with
I0 = I(S1 ≤ ρS2, S2 ≤ ρS1), I1 = I(S1 < 0, S2 > ρS1),
I2 = I(S1 > ρS2, S2 < 0), I3 = I(S1 ≥ 0, S2 ≥ 0).
As I0 + I1 + I2 + I3 = 1, at most one of the three test statistics in the sum contributes to Q
∗.
An approximate p value is then given by the approximation (Perlman 1969)
P(Q∗ > c) =
1
2
P(χ21 > c) +
cos−1 ρ
2pi
P(χ22 > c).
For a permutation test version, we permute the elements of the vector of the group variable
M times with resulting values of test statistics Q∗1, Q∗2, . . . , Q∗M ; the approximate p value is
then
p =
1
M
M∑
m=1
I(Q∗m ≥ Q∗).
Affiliation:
Daniel Fischer
School of Health Sciences
University of Tampere
33014 Tampere, Finland
E-mail: Daniel.Fischer@luke.fi
Hannu Oja
Department of Mathematics and Statistics
University of Turku
20014 Turku, Finland
Journal of Statistical Software http://www.jstatsoft.org/
published by the American Statistical Association http://www.amstat.org/
Volume 65, Issue 9 Submitted: 2012-05-30
June 2015 Accepted: 2014-08-06