Jukuri, open repository of the Natural Resources Institute Finland (Luke) 
   
 
All material supplied via Jukuri is protected by copyright and other intellectual property rights. Duplication 
or sale, in electronic or print form, of any part of the repository collections is prohibited. Making electronic 
or print copies of the material is permitted only for your own personal use or for educational purposes.  For 
other purposes, this article may be used in accordance with the publisher’s terms. There may be 
differences between this version and the publisher’s version. You are advised to cite the publisher’s 
version. 

 
This is an electronic reprint of the original article.  
This reprint may differ from the original in pagination and typographic detail. 

 
Author(s): Enyew Negussie, Oscar González-Recio, Mara Battagin, Ali-Reza Bayat, Tommy 
Boland, Yvette de Haas, Aser Garcia-Rodriguez, Philip C. Garnsworthy, Nicolas 
Gengler, Michael Kreuzer, Björn Kuhla, Jan Lassen, Nico Peiren, Marcin Pszczola, 
Angela Schwarm, Hélène Soyeurt, Amélie Vanlierde, Tianhai Yan & Filippo Biscarini 
 

Title: Integrating heterogeneous across-country data for proxy-based random forest 
prediction of enteric methane in dairy cattle 

Year: 2022 

Version: Preprint version 

Copyright:   The Author(s) 2022 

Rights: CC BY 4.0 

Rights url: http://creativecommons.org/licenses/by/4.0/ 

 
Please cite the original version: 

Negussie E., González-Recio O., Battagin M., Bayat A.-R., Boland T., de Haas Y., Garcia-Rodriguez 
A., Garnsworthy P.C., Gengler N., Kreuzer M., Kuhla B., Lassen J., Peiren N., Pszczola M., Schwarm 
A., Soyeurt H., Vanlierde A., Yan T. & Biscarini F. (2022). Integrating heterogeneous across-country 
data for proxy-based random forest prediction of enteric methane in dairy cattle. Journal of Dairy 
Science 105(6): 5124-5140. https://doi.org/10.3168/jds.2021-20158. 

 
ABSTRACT

Direct measurements of methane (CH4) from indi-
vidual animals are difficult and expensive. Predictions 
based on proxies for CH4 are a viable alternative. 
Most prediction models are based on multiple linear 
regressions (MLR) and predictor variables that are not 
routinely available in commercial farms, such as dry 
matter intake (DMI) and diet composition. The use 
of machine learning (ML) algorithms to predict CH4 
emissions from across-country heterogeneous data sets 
has not been reported. The objectives were to compare 
performances of ML ensemble algorithm random for-
est (RF) and MLR models in predicting CH4 emissions 
from proxies in dairy cows, and assess effects of imput-
ing missing data points on prediction accuracy. Data 
on CH4 emissions and proxies for CH4 from 20 herds 
were provided by 10 countries. The integrated data set 
contained 43,519 records from 3,483 cows, with 18.7% 
missing data points imputed using k-nearest neighbor 
imputation. Three data sets were created, 3k (no missing
records), 21k (missing DMI imputed from milk, fat,
protein, body weight), and 41k (missing DMI, milk fat,
and protein records imputed). These data sets were
used to test scenarios (with or without DMI, imputed 
vs. nonimputed DMI, milk fat, and protein), and pre-
diction models (RF vs. MLR). Model predictive ability 
was evaluated within and between herds through 10-
fold cross-validation. Prediction accuracy was measured 
as correlation between observed and predicted CH4, 
root mean squared error (RMSE), and mean normalized
discounted cumulative gain (NDCG). Compared with models
without DMI, inclusion of DMI improved within- and
between-herd prediction accuracy to 0.77 (RMSE = 23.3%)
and 0.58 (RMSE = 31.9%), respectively, in RF, and to
0.50 (RMSE = 0.327) and 0.13 (RMSE = 42.71),
respectively, in MLR.
When missing DMI records were imputed, within and 
between-herd accuracy increased to 0.84 (RMSE = 
18.5%) and 0.63 (RMSE = 29.9%), respectively. In all 
scenarios, RF models out-performed MLR models. Re-
sults suggest routinely measured variables from dairy 
farms can be used in developing globally robust pre-
diction models for CH4 if coupled with state-of-the-art 
techniques for imputation and advanced ML algorithms 
for predictive modeling.
Key words: enteric methane, machine learning, 
prediction models, proxies for methane

Integrating heterogeneous across-country data for proxy-based 
random forest prediction of enteric methane in dairy cattle
Enyew Negussie,1*  Oscar González-Recio,2  Mara Battagin,3  Ali-Reza Bayat,4  Tommy Boland,5   
Yvette de Haas,6  Aser Garcia-Rodriguez,7  Philip C. Garnsworthy,8  Nicolas Gengler,9   
Michael Kreuzer,10  Björn Kuhla,11  Jan Lassen,12  Nico Peiren,13  Marcin Pszczola,14   
Angela Schwarm,15  Hélène Soyeurt,9  Amélie Vanlierde,16  Tianhai Yan,17  and Filippo Biscarini18  
1Animal Genomics and Breeding, Natural Resources Institute Finland (Luke), 31600 Jokioinen, Finland
2Department of Animal Breeding, Instituto Nacional de Investigacion y Tecnologia Agraria y Alimentaria (INIA-CSIC), 28040 Madrid, Spain
3Italian Brown Cattle Breeders’ Association, Verona, Italy
4Animal Nutrition, Natural Resources Institute Finland (Luke), 31600 Jokioinen, Finland
5Agriculture and Food Science Centre, School of Agriculture and Food Science, University College Dublin, Belfield, Belfield, Dublin 4, Ireland
6Animal Breeding and Genomics, Wageningen University and Research, 6700 AH Wageningen, the Netherlands
7Department of Animal Production, NEIKER—Basque Institute for Agricultural Research and Development, 01192 Arkaute, Spain
8School of Biosciences, University of Nottingham, Sutton Bonington Campus, Loughborough LE12 5RD, United Kingdom
9TERRA Teaching and Research Centre, Gembloux Agro-Bio Tech, University of Liège, 5030 Gembloux, Belgium
10ETH Zurich, Institute of Agricultural Sciences, Universitaetstrasse 2, 8092 Zurich, Switzerland
11Research Institute for Farm Animal Biology (FBN), Institute of Nutritional Physiology “Oskar Kellner,” Wilhelm-Stahl-Allee 2, 18196 Dummerstorf, 
Germany
12VikingGenetics, Ebeltoftvej 16, 8960 Randers, Denmark
13Institute for Agricultural and Fisheries Research (ILVO), Merelbeke, Belgium
14Department of Genetics and Animal Breeding, Poznan University of Life Sciences, Wołynska 33, 60-637 Poznan, Poland
15Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, PO Box 5003, 1432 Ås, Norway
16Productions in Agriculture Department, Walloon Agricultural Research Centre (CRA-W), BEL-5030 Gembloux, Belgium
17Livestock Production Science Branch, Agri-Food and Biosciences Institute, Hillsborough, Co. Down BT26 6DR, United Kingdom
18National Research Council, Institute of Agricultural Biology and Biotechnology (CNR-IBBA), Via Bassini 15, 20133 Milan, Italy

 
J. Dairy Sci. 105
https://doi.org/10.3168/jds.2021-20158
© 2022, The Authors. Published by Elsevier Inc. and Fass Inc. on behalf of the American Dairy Science Association®. 
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Received January 15, 2021.
Accepted February 9, 2022.
*Corresponding author: enyew.negussie@luke.fi

https://orcid.org/0000-0003-4892-9938
https://orcid.org/0000-0002-9106-4063
https://orcid.org/0000-0001-7309-6793
https://orcid.org/0000-0002-4894-0662
https://orcid.org/0000-0002-7433-130X
https://orcid.org/0000-0002-4331-4101
https://orcid.org/0000-0001-5519-6766
https://orcid.org/0000-0001-5131-3398
https://orcid.org/0000-0002-5981-5509
https://orcid.org/0000-0002-9978-1171
https://orcid.org/0000-0002-2032-5502
https://orcid.org/0000-0002-1338-8644
https://orcid.org/0000-0001-5500-1607
https://orcid.org/0000-0003-2833-5083
https://orcid.org/0000-0002-5750-2111
https://orcid.org/0000-0001-9883-9047
https://orcid.org/0000-0002-4619-1936
https://orcid.org/0000-0002-1994-5202
https://orcid.org/0000-0002-3901-2354


Journal of Dairy Science Vol. 105 No. 6, 2022

INTRODUCTION

Food production and agriculture face major challeng-
es under climate change, in terms of expected negative 
effect on productivity as well as implementation of sec-
toral actions to limit greenhouse gas (GHG) emissions. 
Sustainable farming, livestock husbandry, fisheries, and 
forestry can help countries identify opportunities for 
reducing emissions while addressing their food security, 
resilience, and rural development goals (FAO, 2016).

Agricultural activities contribute 10 to 14% of global 
anthropogenic GHG emissions (Jantke et al., 2020). 
Livestock production systems account for 40% of CH4 
emissions in agriculture (FAO, 2018), where the largest 
part originates from CH4 that is produced and released 
from the rumen. There is also an indirect contribution 
through, for example, feed-production activities,
deforestation, and manure (Cassandro, 2020). Agriculture,
particularly livestock, is increasingly being recognized
as both a contributor to climate change and a potential
victim of it (Cassandro et al., 2013; Cassandro, 2020).
However, within the agriculture sector, livestock
production has great potential for reducing GHG emissions
and a tremendous ability to contribute to climate change
mitigation and adaptation (Cassandro, 2020). This is in
part because, of the many available options for mitigation
of GHG, mitigation of CH4 is particularly efficient: given
its relatively short half-life, any mitigation effort is
expected to result in quick returns. The growing demand
for meat and milk, which is predicted to double by 2050
(Rojas-Downing et al., 2017), calls for an accurate
inventory of CH4 emissions for setting up effective and
sustainable mitigation strategies.

Direct measurement of CH4 emissions from individual 
animals using respiration chambers provides reliable 
information which can be used for national inventories, 
assessment of dietary mitigation strategies, genetic se-
lection, and calculation of energy loss through exhaled 
CH4 (Appuhamy et al., 2016). However, this approach 
is not suitable for large-scale assessment and is expen-
sive and labor intensive (Kebreab et al., 2006; Moraes 
et al., 2014; Negussie et al., 2017b). There have been 
several efforts in recent decades to develop low-cost 
and portable methods for direct measurement of CH4 
emissions in animals (Negussie et al., 2017a,b; Zhao et 
al., 2020). Although such handheld and portable ap-
plications for direct measurement have the potential 
for high throughput, they are generally based on CH4 
concentration as opposed to flux assays and are in some 
cases considered to be less accurate than respiration 
chambers (Garnsworthy et al., 2019). Instead, the use of 
combinations of proxies for CH4 has been suggested as a 
valid alternative to direct measurement of CH4. Proxies 
for CH4 are traits that are directly or indirectly related 

to CH4 and that can easily be measured on a large scale 
and at low-cost (Negussie et al., 2017a). Some prox-
ies (e.g., milk yield, milk composition, lactation stage) 
are easily and readily available from routine national 
recording schemes in many countries and thus their 
use may be promising in developing robust prediction 
models for CH4. In a comprehensive review, Negussie
et al. (2017a) highlighted that use of combinations of
readily available proxies for CH4 could increase accuracy
of CH4 predictions by 15 to 35%. This is mainly because
different proxies describe independent sources of
variation in CH4 emissions and one proxy can correct
for shortcomings of others.

Several equations have been developed for proxy-
based prediction of CH4 emission in dairy cattle using 
primarily some powerful yet expensive proxies such as 
feed intake and diet composition (Ellis et al., 2010; 
Hristov et al., 2013; Nielsen et al., 2013; Ramin and 
Huhtanen, 2013; Storlien et al., 2014; Appuhamy et 
al., 2016; Charmley et al., 2016). A comprehensive 
analysis of data consisting of these proxies was reported 
recently by Niu et al. (2018). They used multiple linear 
regression (MLR) models for prediction of enteric CH4 
emissions based on traits such as energy intake, diet 
composition, and milk yield as predictor variables (Niu 
et al., 2018). Unfortunately, large-scale availability of 
such data containing energy intake or DMI and diet 
composition is limited. They are especially difficult 
and expensive to record from commercial farms. When
available, they are mainly sourced from relatively
small numbers of animals or from a single herd, which
limits their applicability to other regions or production
systems. Furthermore, most of the prediction models
used so far are based on conventional statistical
methods fitting MLR models. Such models cannot approximate
potentially nonlinear relationships between proxies and
emissions without resorting to generalized additive model
extensions or to modeling nonlinear relationships
explicitly. Therefore, the use of low-cost
and routinely recorded traits (e.g., milk yield, milk 
composition, age, lactation stage) as predictor variables 
can be a practical option. For these reasons, a more 
comprehensive database needs to be collated to develop 
enteric CH4 emission prediction models at both global 
and regional scales (Niu et al., 2018) applying more 
flexible state-of-the-art statistical and analytical tools.

The smart farming revolution is a global trend based 
on key innovative technologies, such as Internet of 
things, cloud computing, big data, and machine learning 
(ML), which are reshaping modern agriculture (Wolfert 
et al., 2017). A vast array of sensors and phenotyping 
platforms for farm applications are now generating an 
enormous and continuous stream of data. Several of the 
above-mentioned proxies are already being generated 

Negussie et al.: PROXY-BASED RANDOM FOREST PREDICTION OF METHANE



from such applications and more are likely to follow. 
These data are high throughput, relatively low-cost 
and can be used for development of robust models for 
accurate prediction of CH4 emissions. Effective and ef-
ficient utilization of information contained in such large 
heterogeneous data sets requires advanced and versatile 
statistical tools, such as ML algorithms. In predictive
modeling, ML provides an excellent solution to identify
hidden trends in heterogeneous and noisy data sets,
and to accommodate nonlinear relationships between
variables (Zhang and Ma, 2012; Al-Jarrah et al., 2015).
Use of ML methods for proxy-based predictions of CH4
emissions from combined across-country heterogeneous
data sets has not yet been reported. Importantly, their
predictive performance in comparison with conventional
statistical methods using routinely recorded proxies
for CH4 remains unexplored. The main objectives of
the current study were (1) to combine heterogeneous
across-country data on routinely measured proxies for
CH4 into an integrated data set; (2) to apply an ML
ensemble algorithm, random forest (RF), to the proxy-based
prediction of CH4 and compare its performance
with that of MLR models; and (3) to explore the possibility
of imputing missing data points and compare accuracy
of CH4 prediction from imputed and nonimputed
data sets, because combining data from heterogeneous
across-country sources is bound to generate a proportion
of missing records.

MATERIALS AND METHODS

Data

Data on enteric CH4 production and proxies for CH4 
that are routinely collected from dairy farms were pro-
vided by 13 research centers from 10 European part-
ner countries (Belgium, Denmark, Finland, Germany, 
Ireland, the Netherlands, Poland, Spain, Switzerland 
and UK) of the METHAGENE consortium (EU-COST 
Action FA1302) on large-scale methane measurements 
on individual ruminants for genetic evaluations
(www.methagene.eu). The data sets were from 20 herds
covering a diverse geographical and production-systems
mix. Individual cow records from different breeds,
parity (from 1–3+), age and stage of lactation (DIM from
1–349) were combined into a large integrated data set.
Variables included in the combined data set were herd, 
breed, parity, DIM, BW, DMI, CH4 production, CH4 
measurement method, milk yield, milk fat, and milk 
protein. The breeds included were mainly Holstein
Friesian, Nordic Red, Brown Swiss, Norwegian Red, and
Nordic crosses. The majority (90%) of the herds kept
the Holstein Friesian breed, and only 2 herds kept breeds
other than Holstein Friesian. In all herds, BW was
measured and provided as weight in kilograms along
with the measurement of CH4. Because of cost and
associated technical difficulties, feed intake and other
diet-related information are not routinely recorded in
most commercial dairy herds. As a result, extensive diet
composition and feeding management information were not
provided for this study, except for DMI, which few of
the participating herds were able to provide. Methane
emissions were measured with 5 measurement techniques:
cattle respiration chambers, the SF6 tracer technique,
and 3 types of sniffers (F10, Gasmet, and NG Guardian).
A description of the methods and some measurement
details are provided in Garnsworthy et al. (2019). In
total, the combined data set included 47,129 repeated
records from 3,886 cows belonging to 5 dairy breeds
(Table 1). A detailed description of the data by
participating herd is provided in Supplemental Table S1
(https://doi.org/10.7910/DVN/BINDG9, Negussie, 2022).

Data Integration

Individual data sets collated from the 20 herds were 
standardized by making sure that all variables were 
expressed in the same units such as milk yield in kilo-
grams, protein and fat as percentages, and enteric CH4 
as grams per day. When expressed in liters per day, 
enteric CH4 production was converted to grams per day 
using the following conversion equation:

Table 1. Descriptive statistics of CH4 and the main proxy variables included in the integrated data set

Variable           No. of observations   Mean    SD      Minimum   Maximum
CH4, g/d           43,519                372.5   133.2   100       983
Milk yield, kg/d   43,507                31.3    9.2     9.0       89.5
Milk fat, %        23,783                3.84    0.64    3.23      9.91
Milk protein, %    23,783                3.63    0.55    3.00      6.94
DMI, kg/d          3,427                 20.2    3.3     7.4       39.3
BW, kg             43,392                585     114.5   478       955
DIM                43,519                156     87.5    1         349
Parity             43,472                1.99    1.6     1         14


	 CH4 (L/d) × 0.668 = CH4 (g/d),	 [1]

where 0.668 is the density of CH4 at normal tempera-
ture and pressure (20°C and 1 atm) (Engineering Tool-
Box, 2003, 2004). Weekly averages of CH4 production 
were calculated and used in data analyses. Categorical 
variables for breed and CH4 measurement method were 
standardized by making sure that all categories were 
labeled consistently across data sets. All date variables 
were standardized to a common DD-MM-YYYY for-
mat. After data standardization, data from the 20 indi-
vidual herds were combined into a large integrated data 
set. The combined data set was then filtered by retain-
ing only records with DIM ≤350 and CH4 ≤1,000 g/d. 
Outliers for enteric CH4 production (g/d; outside 3.5 
SD, within herd) were also excluded. The final filtered 
data set contained 43,519 records from 3,483 cows.
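
The standardization and filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline (which was written in R): the record layout (dictionaries with `herd`, `dim`, and `ch4` keys) and the function names are hypothetical, and applying the DIM and CH4 thresholds before the within-herd 3.5 SD rule is an assumption about the order of operations.

```python
from collections import defaultdict
from statistics import mean, stdev

CH4_DENSITY_G_PER_L = 0.668  # g/L at 20 °C and 1 atm (Equation 1)


def litres_to_grams(ch4_l_per_d):
    """Convert CH4 production from L/d to g/d (Equation 1)."""
    return ch4_l_per_d * CH4_DENSITY_G_PER_L


def filter_records(records, max_dim=350, max_ch4=1000.0, n_sd=3.5):
    """Keep records with DIM <= max_dim and CH4 <= max_ch4 (g/d), then drop
    records lying more than n_sd SD from the mean of their own herd.

    Assumption: hard thresholds are applied before the SD rule.
    """
    kept = [r for r in records if r["dim"] <= max_dim and r["ch4"] <= max_ch4]

    # Collect CH4 values per herd to compute within-herd mean and SD.
    by_herd = defaultdict(list)
    for r in kept:
        by_herd[r["herd"]].append(r["ch4"])

    result = []
    for r in kept:
        values = by_herd[r["herd"]]
        if len(values) < 2:
            result.append(r)  # SD is undefined for a single record
            continue
        m, s = mean(values), stdev(values)
        if s == 0 or abs(r["ch4"] - m) <= n_sd * s:
            result.append(r)
    return result
```

For example, `litres_to_grams(500)` gives approximately 334 g/d.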

Imputation of Missing Proxies

Integrated and filtered data from the combined 20 in-
dividual data sets contained 18.7% missing data points 
for DMI and milk fat and protein. A nonparametric 
k-nearest neighbor imputation approach (Troyanskaya 
et al., 2001) was used to impute missing proxy data 
to obtain a larger and complete data set to compare 
the predictive performance of algorithms on imputed 
and nonimputed data sets. The imputation process first 
involved calculation of Gower distances (Gower, 1971) 
between records as

\[ S_{ij} = \frac{\sum_{z=1}^{v} S_{ijz}\,\delta_{ijz}}{\sum_{z=1}^{v} \delta_{ijz}}, \tag{2} \]

where Sij is the similarity between samples i and j, δijz 
∈ {0, 1} is an indicator variable that specifies whether 
a comparison between samples i and j over variable z 
is possible (δijz = 1) or not (δijz = 0); Sijz is the
similarity coefficient at any given variable z.
Similarities were calculated differently depending on
whether variable z was (1) binary, using association
tables; (2) categorical, using Hamming distances; or
(3) quantitative, as 1 − |xi − xj|/(range). Similarity
coefficients were then averaged over the v variables for
which comparisons were possible, to give a general
similarity coefficient Sij between samples i and j. Based
on the matrix of Gower similarities, the k nearest
neighbors of any given record were selected, and missing
data points were imputed based on the average values
(quantitative variables) or majority vote (binary or
categorical variables) of the k neighbors. In the present
study, a value of k = 4 was chosen for the imputation.
In a few cases, the neighborhood did not contain enough
information to make imputation possible; these missing
values were left unimputed and the records were
consequently removed. This left 40,532 records in the
final imputed data set.
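
As an illustration of the procedure described above, the following Python sketch computes a Gower similarity over mixed variables and imputes a missing quantitative proxy from its k = 4 most similar donors. It is a simplified stand-in for the VIM implementation used in the study, not a reproduction of it: the function names, the dictionary record layout, and the `kinds`/`ranges` arguments are hypothetical, binary variables are treated like categorical ones (simple matching), and a value is left missing when fewer than k comparable donors exist.

```python
def gower_similarity(a, b, kinds, ranges):
    """Gower similarity between two records (dicts), as in Equation 2.

    kinds[z] is "num" or "cat"; ranges[z] is the observed range of numeric z.
    A variable missing (None) in either record is skipped (delta = 0).
    Returns None when no variable can be compared.
    """
    num, den = 0.0, 0
    for z, kind in kinds.items():
        if a.get(z) is None or b.get(z) is None:
            continue
        if kind == "num":
            s = 1.0 - abs(a[z] - b[z]) / ranges[z] if ranges[z] > 0 else 1.0
        else:
            s = 1.0 if a[z] == b[z] else 0.0
        num += s
        den += 1
    return num / den if den else None


def knn_impute(records, target, kinds, ranges, k=4):
    """Return copies of the records with missing numeric `target` values
    replaced by the mean of the k most similar donor records."""
    donors = [r for r in records if r.get(target) is not None]
    predictors = {z: t for z, t in kinds.items() if z != target}
    imputed = []
    for r in records:
        r = dict(r)
        if r.get(target) is None:
            scored = []
            for d in donors:
                s = gower_similarity(r, d, predictors, ranges)
                if s is not None:
                    scored.append((s, d[target]))
            scored.sort(key=lambda pair: pair[0], reverse=True)
            top = scored[:k]
            if len(top) == k:  # otherwise leave the value unimputed
                r[target] = sum(v for _, v in top) / k
        imputed.append(r)
    return imputed
```

Records whose target value is still `None` after this step would then be dropped, mirroring the removal of unimputed records described in the text.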

Data Analyses

After data integration and imputation, 3 data sets
were generated for predictive modeling: (1) 3k, containing
3,337 records with no missing data on any variable
(no imputation); (2) 21k, containing 21,215 records, in
which only the missing DMI records (84% of DMI records)
were imputed; and (3) 41k, containing 40,532 records
[i.e., all filtered records, in which all missing data
points (81% of DMI and 48% of milk fat and protein
records) were imputed]. All data sets comprised the 11
proxies used in this study to model CH4 production: breed
(5 classes), country (10 classes), herd (20 classes),
DIM, parity (3 classes), methane measurement technique
(5 classes), DMI (kg/d; imputed in the 21k and 41k data
sets), BW (kg), milk yield (kg/d), milk fat (%; imputed
in the 41k data set), and milk protein (%; imputed in
the 41k data set). The 3 data sets were all used to
predict CH4 emissions with both RF and MLR predictive
models, either including or not including DMI in the
model. Figure 1 provides a visual representation of the
data analysis scheme.

Prediction Models for CH4 Emissions from Proxies

Random Forest Predictive Model. Random for-
est is a supervised ML algorithm used for predictive 
modeling in both classification and regression problems 
(Breiman, 2001). Random forest generates and aggre-
gates a “forest” of trees built on resampled subsets of 
the data. In the current study, RF was used to predict 
individual CH4 emissions from imputed and nonim-
puted data sets of proxies for CH4. As a semiparametric 
ML ensemble algorithm, RF is robust to overfitting and 
able to capture complex interactions in data, thereby 
handling efficiently problems associated with complex 
and heterogeneous data structures (González-Recio and 
Forni, 2011). The RF works by building a large number
of decision trees on bootstrapped samples of the data
and random subsets of the predictor variables.
Predictions from the individual decision trees are then
averaged to get the final prediction for CH4 emissions:

\[ \hat{f}(\mathbf{x}_i) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(\mathbf{x}_i), \tag{3} \]

where xi is the vector of predictors for record i (i.e.,
CH4 proxies), and f̂b(xi) is the corresponding predicted
CH4 emission from the regression tree built on the bth
bootstrapped sample of the data. Each tree heuristically
minimizes the loss function (residual sum of squares)
through top-down greedy recursive binary splitting. The
number of trees in the forest (B), the number of
variables or proxies (m) randomly sampled at each node,
and the maximum size of each node for further partition
were tuned based on the lowest generalization error on
the out-of-bag samples. Convergence of the generalization
error was assessed from 1,000 trees onward, and the
lowest generalization error was obtained with m = 5 and
a minimum node size of 5. Equation 3 was used to build
RF models to accommodate the different scenarios with
and without DMI and imputation of missing DMI as well as
other missing proxy variables, as described above.
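
To make the averaging in Equation 3 concrete, the toy Python sketch below bags one-split regression trees ("stumps") fitted on bootstrap samples and averages their predictions. It is deliberately simplified and is not the randomForest implementation used in the study: real RF trees are grown deep and draw a fresh random subset of m predictors at every node, whereas here each stump draws its m candidate predictors once. All function names are hypothetical.

```python
import random


def fit_stump(X, y, features):
    """Find the single split (feature, threshold) with the lowest residual
    sum of squares among the candidate feature indices."""
    n = len(y)
    best = None
    for j in features:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            rss = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or rss < best[0]:
                best = (rss, j, t, ml, mr)
    if best is None:  # no split possible: always predict the sample mean
        m = sum(y) / n
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, x):
    j, t, ml, mr = stump
    return ml if x[j] <= t else mr


def fit_forest(X, y, n_trees=50, m=1, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and m random predictors."""
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feats = rng.sample(range(p), m)             # random predictor subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, x):
    """Equation 3: average the predictions of the B individual trees."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)
```

The records left out of each bootstrap sample (the out-of-bag samples) are what the tuning of B, m, and node size described above relies on; that step is omitted here for brevity.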

Multiple Linear Regression Model. To provide 
a benchmark against which performance of RF models 
can be compared, MLR models were run for all data 
sets and scenarios considered.

The basic MLR model for CH4 as a function of prox-
ies was:

\[ y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \tag{4} \]

where yi is methane production of individual i, β0 is
the intercept of the model, β1, …, βp are the regression
coefficients of the p proxy variables included in the
model, xij is the value for proxy j in animal i (for
categorical proxies, xij are indicator variables), and
εi is the residual term. The set of proxies j ∈ [1, p]
depends on the scenario considered (data set,
imputation). The QR factorization was used to solve the
MLR models. Although residual variance may vary between
subsets of the records (the most relevant in this study
being methane measurement techniques), homoscedasticity
was assumed for the MLR model. Equation 4 was applied
to the 3 data sets and all envisaged scenarios, as
with the RF models.
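
For readers unfamiliar with solving Equation 4 through a QR factorization, the following pure-Python sketch shows the idea: factor the design matrix X = QR, then solve the triangular system Rβ = Qᵀy by back-substitution. It is a hedged illustration only. It uses modified Gram-Schmidt for brevity, which is a different algorithm from the pivoted Householder QR behind `lm()` in R, and it assumes a design matrix of full column rank. The function names are hypothetical.

```python
def qr_decompose(A):
    """Thin QR factorization of an n x p matrix (list of rows) by modified
    Gram-Schmidt. Returns Q as a list of p orthonormal columns and R as a
    p x p upper-triangular matrix. Assumes full column rank."""
    n, p = len(A), len(A[0])
    Q = [[float(A[i][j]) for i in range(n)] for j in range(p)]
    R = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j):
            # Project the current column j onto the finished column k.
            R[k][j] = sum(Q[k][i] * Q[j][i] for i in range(n))
            Q[j] = [Q[j][i] - R[k][j] * Q[k][i] for i in range(n)]
        R[j][j] = sum(v * v for v in Q[j]) ** 0.5
        Q[j] = [v / R[j][j] for v in Q[j]]
    return Q, R


def fit_mlr(X, y):
    """Least-squares estimates [beta_0, beta_1, ..., beta_p] for Equation 4."""
    design = [[1.0] + [float(v) for v in row] for row in X]  # add intercept
    Q, R = qr_decompose(design)
    p = len(R)
    qty = [sum(Q[j][i] * y[i] for i in range(len(y))) for j in range(p)]
    beta = [0.0] * p
    for j in reversed(range(p)):  # back-substitution on R beta = Q'y
        rest = sum(R[j][k] * beta[k] for k in range(j + 1, p))
        beta[j] = (qty[j] - rest) / R[j][j]
    return beta
```

Categorical proxies such as herd or breed would enter `X` as 0/1 indicator columns, with one reference class dropped so that the design matrix keeps full column rank.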

Accuracy of Prediction

Predictive ability of RF and MLR models was
evaluated within and between herds in the 3 data sets
(3k, 21k, and 41k). In each data set, inclusion or
exclusion of DMI (measured or imputed) was tested.
The effect of imputing missing proxies was evaluated
indirectly by comparing the 3 data sets: no imputation
(3k), imputation of DMI (21k), and imputation of DMI and
milk fat and protein (41k). Predictive ability of the
models was estimated through 10-fold cross-validation
replicated 5 times. The data were split into 10
partitions: 9 were used to train the model and 1 was used
to test the model, until all 10 partitions had been used
once as the test set. Multiple records from the same cow
were always assigned to the same fold. In the within-herd
cross-validation, records from all herds were used both
in the training and test sets: CH4 records from a given
cow were predicted from data of herd mates plus cows
from other herds and countries. In the between-herd
cross-validation, all records from one herd (each of the
k herds in turn) were set aside as the test set, and the
remaining k − 1 herds were used to train the model
(Figure 1).

Figure 1. Cross-validation scheme used for estimation of the predictive ability from random forest and multiple linear regression models
under within- and between-herd prediction scenarios. Each model (data set size, inclusion of DMI, within- or between-herd scenario) was
replicated 5 times. r = Pearson correlation between observed and predicted CH4 values; RMSE = root mean squared error; NDCG = normalized
discounted cumulative gain.
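
The two cross-validation schemes can be sketched as simple generators in Python. This is an illustrative reading of the description above, not the authors' code: records are assumed to be dictionaries with hypothetical `cow` and `herd` keys, and cows are assigned to folds by shuffling the cow identifiers and dealing them out in turn.

```python
import random


def cow_grouped_folds(records, n_folds=10, seed=0):
    """Within-herd scheme: n-fold cross-validation in which all records of a
    cow fall in the same fold. Yields (train, test) lists of records."""
    cows = sorted({r["cow"] for r in records})
    random.Random(seed).shuffle(cows)
    fold_of = {cow: i % n_folds for i, cow in enumerate(cows)}
    for f in range(n_folds):
        train = [r for r in records if fold_of[r["cow"]] != f]
        test = [r for r in records if fold_of[r["cow"]] == f]
        yield train, test


def leave_one_herd_out(records):
    """Between-herd scheme: each herd in turn is the test set and the
    remaining herds form the training set. Yields (herd, train, test)."""
    for herd in sorted({r["herd"] for r in records}):
        train = [r for r in records if r["herd"] != herd]
        test = [r for r in records if r["herd"] == herd]
        yield herd, train, test
```

Replicating the within-herd scheme 5 times, as in the study, would amount to calling `cow_grouped_folds` with 5 different seeds.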

Three accuracy metrics were used to measure predictive
ability of the models: (1) Pearson correlation
between observed and predicted CH4 values; (2) root
mean squared error (RMSE), expressed as a percentage of
the mean of the observed response variable; and
(3) normalized discounted cumulative gain (NDCG). The
NDCG is a ranking metric developed in information
retrieval (Järvelin and Kekäläinen, 2002), which has
been applied to the evaluation of genomic selection
models (Blondel et al., 2015). The NDCG metric
evaluates the top individuals in the ranking, which
are supposed to be the most relevant when comparing
models. The top 20% emitters were considered here, and
their CH4 outputs were ranked based on observed and
predicted ranks; NDCG was calculated as:

\[ \mathrm{NDCG}(y, \hat{y}) = \frac{\sum_{i=1}^{k} y\bigl(\pi(\hat{y})\bigr)_i \cdot d(i)}{\sum_{i=1}^{k} y\bigl(\pi(y)\bigr)_i \cdot d(i)}, \tag{5} \]

where y(π(ŷ)) and y(π(y)) are the top k observed CH4
emission records according to their predicted and
observed ranking, respectively; d(i) = 1/log2(i + 1) is
the weight by which ranked values are discounted
(i ∈ [1, k]); and k is the number of individuals included
in the top 20%. The NDCG values range between 0 and 1,
with values close to 1 indicating better ability of the
model to correctly identify the most relevant individuals
(e.g., those cows that emit more CH4 than other cows).
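
The three accuracy metrics can be written compactly in Python, as sketched below. The function names are hypothetical. Two assumptions are worth flagging: the discount is taken to be the usual decreasing form 1/log2(i + 1) from Järvelin and Kekäläinen (2002), which keeps NDCG within [0, 1], and k is taken as the integer part of 20% of the number of records (at least 1).

```python
import math


def pearson_r(obs, pred):
    """Pearson correlation between observed and predicted values."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)


def rmse_percent(obs, pred):
    """RMSE expressed as a percentage of the mean observed value."""
    mse = sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)
    return 100.0 * math.sqrt(mse) / (sum(obs) / len(obs))


def ndcg(obs, pred, top_frac=0.2):
    """NDCG over the top fraction of records (Equation 5).

    Observed values of the records ranked highest by the predictions are
    discounted by 1 / log2(i + 1) and compared with the ideal ordering.
    """
    n = len(obs)
    k = max(1, int(top_frac * n))

    def dcg(order):
        return sum(
            obs[idx] / math.log2(pos + 1)
            for pos, idx in enumerate(order[:k], start=1)
        )

    by_pred = sorted(range(n), key=lambda i: pred[i], reverse=True)
    by_obs = sorted(range(n), key=lambda i: obs[i], reverse=True)
    return dcg(by_pred) / dcg(by_obs)
```

A model that ranks the highest emitters exactly as observed scores 1; a model that places low emitters at the top of its ranking scores closer to 0.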

Variable Importance

In predictive modeling, important proxies drive the
outcome of the model and have a significant effect on
accuracy of prediction. In RF, the relative importance
of the proxies included in the predictive models was
retrieved automatically. It was estimated by running
the out-of-bag samples through the RF trees after
randomly permuting the values of each proxy variable in
turn and comparing the resulting predictive accuracy (or
loss function) with that obtained from the original
(nonpermuted) data. The relative importance of proxies
was then scaled to the [0, 100] range, which provided
insights into the predictive and biological roles played
by the proxies in prediction of CH4 emissions from cows.
Two measures of variable importance were used: percent
increase in mean squared error and increase in node purity.
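
The permutation idea behind the first importance measure can be illustrated with a model-agnostic Python sketch. It departs from the RF procedure in several labeled ways: it permutes each predictor on a single evaluation set instead of tree-specific out-of-bag samples, it reports the absolute (not percent) increase in mean squared error, and it scales to [0, 100] by dividing by the largest increase, which is an assumption because the scaling rule is not spelled out in the text. The function name and arguments are hypothetical.

```python
import random


def permutation_importance(predict, X, y, seed=0):
    """Scaled permutation importance for each column of X.

    predict: function mapping one row (list of values) to a prediction.
    X: list of rows (lists); y: list of observed values.
    Returns one score per column, scaled so the largest equals 100.
    """
    rng = random.Random(seed)

    def mse(rows):
        return sum((predict(x) - t) ** 2 for x, t in zip(rows, y)) / len(y)

    baseline = mse(X)
    increases = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the link between column j and y
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        increases.append(max(mse(permuted) - baseline, 0.0))

    top = max(increases)
    return [100.0 * inc / top if top > 0 else 0.0 for inc in increases]
```

A predictor that the model ignores receives a score of 0, because shuffling it leaves the predictions unchanged.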

Software and Computing Environment

All data handling, processing, and analysis were
performed using the R environment for statistical
computing (https://r-project.org). Specifically, the R
package VIM (Kowarik and Templ, 2016) was used for
imputation of missing proxy data and the randomForest
package (Liaw and Wiener, 2002) was used for RF
predictions. For the MLR models we used the lm() function
from the stats R package, which uses QR factorization to
solve the model (https://r-project.org). Plots were
generated using the ggplot2 R package (Wickham, 2009).

RESULTS

Integrated Across-Country Data

Average enteric CH4 emission across the 20 herds in 
10 European countries was 372.5 g/d (±133.2; SD) and 
ranged from 280 to 543 g/d (Table 1). Figure 2
summarizes CH4 emissions by herd and shows that there
is marked variability between herds, with most
(two-thirds) of the measurements lying between 250 and
500 g/d. The effect of country of origin on methane
emissions was tested in a multiple regression model. In
a naive model, the effect of country of origin was
significant (P < 0.01). When co-dependencies between
records were accounted for (cows nested within herds,
nested within countries), the herd effect absorbed a
large portion of the variation from the country effect,
which was no longer significant (P > 0.01). This
indicates that CH4 emission levels vary between countries
mainly because of between-herd variation.

Regarding the distribution of methane, we checked the
per-country histograms visually and the descriptive
statistics numerically; the mean and median for CH4
corresponded very well. However, the formal Shapiro-Wilk
test indicated some deviation from normality in all
countries except 3. This may be linked to different
sample sizes and different CH4 variability among
countries. Additionally, these deviations are limited
(in the sense that distributions do not appear to be
binary, bimodal, or strongly skewed), and linear models
are known to be rather robust to deviations from
normality and distributional assumptions (Schielzeth
et al., 2020; Knief and Forstmeier, 2021). General
summary statistics for the different proxy variables in
the integrated data set are given in Table 1.

Prediction Accuracy of CH4 from RF Versus MLR Models

Within-Herd CH4 Prediction Accuracy.  Table 2 shows
average Pearson correlations, RMSE, and NDCG for all
predictive models and scenarios from both RF and MLR.
Figures 3, 4, and 5 show RF results for each of the 5
replicates per model and scenario. In the 3k data set,
when measured DMI was included in the RF model,
prediction accuracy measured as the Pearson correlation
between observed and predicted CH4, $r(y,\hat{y})$,
increased from 0.52 to 0.77, RMSE was reduced from 31.3
to 23.3, and NDCG increased from 0.75 to 0.89 (Table 2).
In the 21k data set, when missing DMI records were
imputed and included in the prediction model,
$r(y,\hat{y})$ increased from 0.80 to 0.84, RMSE declined
from 20.3 to 18.5, and NDCG increased from 0.91 to 0.92.
In the 41k data set, when all missing variables including
DMI were imputed and included in the prediction model,
$r(y,\hat{y})$ increased slightly from 0.81 to 0.82, RMSE
decreased from 20.6 to 20.0, and NDCG remained the same.
When moving from the 3k to the 21k and 41k data sets,
predictions became progressively less variable.
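
The 3 accuracy metrics reported above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names are hypothetical, RMSE is expressed as a percentage of the observed mean as in the footnote of Table 2, and the NDCG form (gain equal to the observed value, log2 rank discount, all animals ranked) is an assumed standard definition that may differ in detail from the one given in the Materials and Methods.

```python
import math

def pearson_r(obs, pred):
    """Pearson correlation between observed and predicted values."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)

def rmse_percent(obs, pred):
    """Root mean squared error as a percentage of the observed mean."""
    n = len(obs)
    mse = sum((o - p) ** 2 for o, p in zip(obs, pred)) / n
    return 100.0 * math.sqrt(mse) / (sum(obs) / n)

def ndcg(obs, pred):
    """Normalized discounted cumulative gain (assumed form, see lead-in)."""
    def dcg(values):
        # Rank 1 has discount log2(2) = 1, rank 2 has log2(3), and so on
        return sum(v / math.log2(rank + 2) for rank, v in enumerate(values))
    # Observed values ordered by decreasing prediction
    by_pred = [o for _, o in sorted(zip(pred, obs), key=lambda t: t[0], reverse=True)]
    ideal = sorted(obs, reverse=True)
    return dcg(by_pred) / dcg(ideal)

# Hypothetical observed and predicted CH4 (g/d) for 4 cows
obs = [300, 350, 400, 450]
pred = [310, 340, 420, 440]
print(pearson_r(obs, pred), rmse_percent(obs, pred), ndcg(obs, pred))
```

Because NDCG depends only on how well the predicted ranking reproduces the observed ranking, it can stay high even when RMSE is large, which helps when reading the 3 metrics side by side in Table 2.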

Within-herd prediction accuracy from RF models 
varied with CH4 measurement method. In general, CH4 
measurements from chambers tended to be predicted 
more accurately and more robustly (lower between-
replicate variability) than CH4 records from sniffers 
and SF6. This was especially true when DMI was not 
included in the model, although with DMI included in 
the model sniffers gave comparable results to chambers. 
The SF6 method almost always gave the least accurate
predictions, which were also highly variable across
replicates (Figures 3 to 5).

Marked differences were found between predictive
models (RF vs. MLR) in within-herd prediction accuracy.
Prediction accuracies from the MLR model were
consistently lower than RF across data sets and
scenarios. When prediction was made using MLR instead
of RF, $r(y,\hat{y})$ declined from 0.77 to 0.50, RMSE
increased from 23.3 to 32.7, and NDCG decreased from
0.89 to 0.73 for the 3k data set. For the 21k data set,
within-herd $r(y,\hat{y})$ declined from 0.84 to 0.75,
RMSE increased from 18.5 to 22.7, and NDCG decreased
from 0.92 to 0.90. Similarly, for the 41k data set,
within-herd $r(y,\hat{y})$ declined from 0.82 to 0.79,
RMSE increased from 20.0 to 21.4, and NDCG was the same
at 0.86 for both MLR and RF predictive models.

Figure 2. Distribution of mean methane production (g/d) across herds in the combined data set. Boxes correspond to the interquartile range
(IQR); whiskers are 1.5 times the IQR on both sides; dots describe data points which fall outside the ±1.5 × IQR boundaries around the box.

Between-Herd CH4 Prediction Accuracy.  For the
between-herd scenario, a general pattern toward more
accurate and especially less variable predictions with
increasing data size was again observed, although it was
somewhat less clear than in the within-herd scenario.
In the 3k data set, inclusion of measured DMI in the
prediction model increased between-herd $r(y,\hat{y})$
to 0.13 and 0.58, reduced RMSE to 42.7 and 31.9, and
increased NDCG to 0.56 and 0.82 in MLR and RF models,
respectively (Table 2). In the 21k data set, when
missing DMI was imputed, between-herd $r(y,\hat{y})$
increased to 0.14 and 0.63, RMSE decreased to 32.7 and
29.9, and NDCG increased to 0.87 and 0.83 in MLR and RF
models, respectively. In the 41k data set, when missing
proxy variables were imputed, between-herd
$r(y,\hat{y})$ increased to 0.21 and 0.39, RMSE
decreased to 39.6 and 33.6, and NDCG increased to 0.83
and 0.65 in MLR and RF models, respectively.

Between-herd CH4 prediction accuracies among herds
using different CH4 measurement methods are shown in
Figures 3 to 5. In general, whether or not DMI was
included in predictive models, between-herd prediction
accuracies for chamber CH4 measurement methods were
higher than for sniffer CH4 measurement methods. For
instance, when measured DMI was included or missing DMI
was imputed and included in the prediction model,
between-herd $r(y,\hat{y})$ for chamber measurement
methods was 20% higher than for sniffer measurement
methods. Similarly, RMSE for chamber CH4 measurement
methods was 30 to 50% lower than for sniffer measurement
methods. However, when the NDCG metric was used, only a
small difference was observed in between-herd
prediction accuracy between chamber and sniffer
measurement methods.

The RF and MLR predictive models differed in
between-herd prediction accuracy over the 3 different
data sets (Table 2). For instance, in the 3k data set,
between-herd $r(y,\hat{y})$ declined from 0.58 to 0.13,
RMSE increased from 31.9 to 42.7, and NDCG decreased
from 0.82 to 0.56 when MLR was used instead of the RF
model. Similarly, for the 21k data set, between-herd
$r(y,\hat{y})$ declined from 0.63 to 0.14, RMSE
increased from 29.9 to 32.7, and NDCG increased from
0.83 to 0.87. For the larger 41k data set, in which all
missing proxy variables were imputed, between-herd
$r(y,\hat{y})$ declined from 0.39 to 0.21, RMSE
increased from 33.6 to 39.6, and NDCG increased from
0.65 to 0.83. In general, the differences between the
models in $r(y,\hat{y})$, RMSE, and NDCG in Table 2 were
all significant (P < 0.0001) except for the 41k data
with imputation and DMI in the model, where the
difference was not significant (P = 0.28).
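
The size of the gap between RF and MLR in the between-herd scenario can be expressed as a relative change. The short sketch below uses the between-herd values from Table 2 for the models with DMI included; the helper `relative_change` and the dictionary layout are illustrative only.

```python
def relative_change(new, old):
    """Percent change of `new` relative to `old`."""
    return 100.0 * (new - old) / old

# Between-herd values from Table 2, DMI included, stored as (MLR, RF)
table2_between = {
    "3k":  {"r": (0.13, 0.58), "rmse": (42.7, 31.9)},
    "21k": {"r": (0.14, 0.63), "rmse": (32.7, 29.9)},
    "41k": {"r": (0.21, 0.39), "rmse": (39.6, 33.6)},
}

for size, metrics in table2_between.items():
    mlr_r, rf_r = metrics["r"]
    mlr_rmse, rf_rmse = metrics["rmse"]
    print(
        f"{size}: r changes by {relative_change(rf_r, mlr_r):.0f}% and "
        f"RMSE by {relative_change(rf_rmse, mlr_rmse):.0f}% with RF relative to MLR"
    )
```

For the 21k data set the relative gain in correlation is (0.63 - 0.14) / 0.14, that is, about 350%.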

Table 2. Within- and between-herd predictive ability of random forest (RF) and multiple linear regression (MLR) models with different data
sets (3k, 21k, 41k), measured by 3 accuracy metrics (Pearson correlation, RMSE, and NDCG) from 10-fold cross-validation averaged over
5 replicates

                                            Within-herd               Between-herd
Model  Size  Proxies  DMI       No. of CV   r (1)  RMSE (2) NDCG (3)  r      RMSE   NDCG
             imputed  included  replicates
RF     3k    No       No        5           0.52   31.3     0.75      0.32   35.4   0.73
RF     3k    No       Yes       5           0.77   23.3     0.89      0.58   31.9   0.82
RF     21k   No       No        5           0.80   20.3     0.91      0.33   33.1   0.65
RF     21k   Yes      Yes       5           0.84   18.5     0.92      0.63   29.9   0.83
RF     41k   Yes      No        5           0.81   20.6     0.86      0.20   34.8   0.55
RF     41k   Yes      Yes       5           0.82   20.0     0.86      0.39   33.6   0.65
MLR    3k    No       No        5           0.28   36.6     0.70      0.12   45.6   0.56
MLR    3k    No       Yes       5           0.50   32.7     0.73      0.13   42.7   0.56
MLR    21k   No       No        5           0.71   22.4     0.89      0.07   38.4   0.70
MLR    21k   Yes      Yes       5           0.75   22.7     0.90      0.14   32.7   0.87
MLR    41k   Yes      No        5           0.77   21.6     0.86      0.19   41.4   0.81
MLR    41k   Yes      Yes       5           0.79   21.4     0.86      0.21   39.6   0.83
(1) Pearson correlation, $r(y,\hat{y})$, between observed and predicted CH4 production.
(2) Root mean squared error (expressed as percentage of the mean CH4 production g/d).
(3) Mean normalized discounted cumulative gain.

Variable Importance

When building decision trees, RF computes how
much each variable contributes to the prediction,
which is a measure of variable importance. Two
measures of variable importance were used in the
present study: percent increase in mean square error
and increase in node purity, which are presented in
Figure 6. When DMI was not included in the prediction
model, the most relevant proxies for prediction were
DIM, milk yield, BW, milk fat, and milk protein. When
DMI was included in the prediction model, DMI ranked
first in variable importance, followed by milk yield,
DIM, milk fat, and BW. In general, breed, parity, and
CH4 measurement method ranked at the bottom of
variable importance.
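
The idea behind the percent increase in mean square error can be sketched with a simple permutation procedure: the values of one predictor are shuffled and the resulting loss in accuracy is recorded. The code below is a simplified illustration and not the study's implementation. The `%IncMSE` label in Figure 6 matches the measure reported by the R randomForest package, which permutes out-of-bag records tree by tree; here a fixed toy predictor and simulated data (all hypothetical) are used instead.

```python
import random

def mse(y, yhat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=1):
    """Mean percent increase in MSE after shuffling each column of X."""
    rng = random.Random(seed)
    base = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mse(y, [predict(row) for row in X_perm])
            increases.append(100.0 * (permuted - base) / base)
        importances.append(sum(increases) / n_repeats)
    return importances

# Simulated data: column 0 plays the role of DMI (kg/d), column 1 is pure noise
data_rng = random.Random(0)
X = [[data_rng.uniform(15, 25), data_rng.uniform(0, 1)] for _ in range(200)]
y = [20 * row[0] + data_rng.gauss(0, 10) for row in X]

def toy_predict(row):
    """Hypothetical fitted model that uses only the first predictor."""
    return 20 * row[0]

imp = permutation_importance(toy_predict, X, y)
print(imp)
```

A predictor the model relies on shows a large increase in error when shuffled, whereas a predictor the model ignores shows none, which is how the ranking of proxies in Figure 6 should be read.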

DISCUSSION

Accurate inventories of GHG emission are essential 
to reflect a country’s national emissions from livestock 
production systems. Productivity and associated emis-
sions intensity of livestock farming differ widely around 
the world and the potential for change is large (Niu et 
al., 2018). Understanding national, regional, and global 
variations in GHG emissions is therefore essential for 
concerted global actions to mitigate emissions (Hristov 
et al., 2018). Particularly at a time when there are un-
certainties in proportion of increase in CH4 emissions 
solely attributable to livestock sources (Hristov et al., 
2018), accurate estimation of CH4 emissions across 
different national borders and production systems is 
needed. Average CH4 production (g/d) from ruminants 
varies with diet, animal populations (e.g., species and 
breeds), production system, production level, and DIM 
(Bell et al., 2014; de Haas et al., 2011; Garnsworthy 
et al., 2012; Negussie et al., 2017b). Different esti-
mates of average CH4 production of dairy cows were 
reported from different production systems (Waghorn 
et al., 2008; O’Neill et al., 2011; Hellwing et al., 2013; 
Deighton et al., 2014; Negussie et al., 2017b; Niu et 
al., 2018). Using data compiled over 8 experiments and 
covering 30 diets, Hellwing et al. (2013) reported an 
average CH4 production of 412 g/d for Danish lactating 
cows. Negussie et al. (2017b), working on Nordic red 
cows, estimated an average CH4 production of 396 g/d, 
whereas Bayat et al. (2017), using chambers, reported 
a range from 335 to 492 g/d in an experiment designed 
to test different dietary treatments. Bell et al. (2014) 
reported an average CH4 production of 418 g/d with 
a range between 220 and 480 g/d in 1964 lactating 
Holstein Friesian cows across 21 UK herds. In Australia,
CH4 emission ranged from 369 to 458 g/d for cows fed
harvested pasture grass (Williams et al., 2013; Deighton
et al., 2014; Moate et al., 2014). From other pasture
grass-based dairy production systems, somewhat
different values have been reported. For instance,
O’Neill et al. (2011) reported CH4 emissions of 251 g/d
in cows fed harvested perennial ryegrass. In New
Zealand, Waghorn et al. (2008) reported CH4 emissions
ranging from 273 to 352 g/d in cows fed harvested
pasture grass. This selection of results demonstrates
the large between-country variability. In our combined
across-country data, estimated overall mean CH4
production was 372 g/d, which is in the middle of the
ranges described above. Our estimate also corresponds to
recent estimates from a combined regional data set
reported by Niu et al. (2018). Using a global data set
collated from the United States, European Union,
Australia, and New Zealand, the authors reported mean
CH4 production of 345 g/d per cow for European Union,
354 g/d per cow for the United States, and 347 g/d per
cow for the combined intercontinental data set.

Figure 3. Within-herd (top) and between-herd (bottom) prediction accuracies in terms of Pearson correlations $r(y,\hat{y})$ between observed and
predicted CH4 emissions (g/d) by CH4 measurement method and prediction scenarios within 3k, 21k, and 41k data sets. Colors indicate inclusion
(blue) or not (red) of DMI in the predictive model; shapes indicate imputation (triangles) or no imputation (dots) of missing proxies in the data
set. Data points represent the 5 replicates of each predictive model. The gray horizontal bars are average prediction accuracies per scenario.

Accuracy of Proxy-Based Prediction of CH4 Using RF Versus MLR Models

In a comprehensive review, Negussie et al. (2017a) 
concluded that whenever direct animal measurements 
are difficult and expensive to procure, use of combina-
tions of proxies for CH4 in empirical prediction equa-
tions has great potential. Empirical models have long 
been used involving different predictor variables to 
predict CH4 emissions from cows, some as early as the 
1930s (Kriss, 1931). There are several examples of such 
quantitative approaches to predict CH4 production in 
cattle using mainly dietary and animal factors as proxies 
(Kebreab et al., 2008; Ellis et al., 2010; Ramin and 
Huhtanen, 2013; Appuhamy et al., 2016; Niu et al., 
2018; Benaouda et al., 2019). Appuhamy et al. (2016) 
listed 40 such models that were developed in North 
America, Europe, Australia, and New Zealand. They 
suggested that comprehensive CH4 emission models 
need examining and testing against CH4 emission 
measurements from dairy cows in different regions of 
the world, an idea that was recently implemented by
Benaouda et al. (2019).

Figure 4. Within-herd (top) and between-herd (bottom) prediction accuracies in terms of root mean squared error (RMSE) by CH4 measurement
method and prediction scenarios within 3k, 21k, and 41k data sets. Colors indicate inclusion (blue) or no inclusion (red) of DMI in the
predictive model; shapes indicate imputation (triangles) or no imputation (dots) of missing proxies in the data sets. Data points represent the
5 replicates of each predictive model. The gray horizontal bars are average prediction accuracies per scenario.

So far, although many prediction models have been
reported, a closer look at most empirical models
indicates that there is still a range of limitations
that may preclude their practical applicability. These
limitations include the following: (1) Most prediction
models are not based on individual cow observations but
on treatment means from different studies. Depending on
sample size and other factors including measurement
methods, this can be associated with different degrees
of uncertainty (e.g., SD) (Appuhamy et al., 2016). (2)
In most cases, data sets used as an input were from a
single herd, a specific diet, or only a few labs
(Blaxter and Clapperton, 1965; Yan et al., 2000; Jentsch
et al., 2007; Ellis et al., 2010). (3) Most prediction
models were based on measurements from relatively small
numbers of animals, which may limit their broad
applicability. (4) Most prediction 
models used simple or MLR analyses without appropri-
ate modeling of the fixed and random components (Ra-
min and Huhtanen, 2013). In many instances, possible 
nonlinear relationships in the data were not taken into 
consideration, leading to biased estimates of param-
eters (St-Pierre, 2001). (5) As enteric CH4 emissions 
are strongly related to feed intake, almost all models 
included a measure of intake such as DMI, intake of 
gross energy or metabolizable energy, or fiber intake, 
as prime predictor variables. However, these variables 
are currently not readily available under commercial 
conditions and none of them is routinely measured 
on individual animals on-farm. Thus, there is a need 
to develop robust prediction models that do not rely 
completely on feed intake measures or estimates (Hris-
tov et al., 2018). Nevertheless, models without these 
variables (such as DMI or dietary composition) could 
be less accurate and thus during model development, 
it is essential to consider the trade-offs between cost, 
practicability and prediction accuracy (Appuhamy et 
al., 2016). (6) Finally, with the advent of the smart
farming revolution, new phenotyping platforms, such as
sensors, on-line recording, and imaging tools, have
started to generate an enormous amount of data on
proxies for CH4 from heterogeneous sources. Compilation, analysis 
and utilization of such large sets of information require 
the latest, robust and versatile statistical tools, which 
are now common in the ML approach. However, none 
of the CH4 prediction models reported so far in dairy 
cattle has attempted to use ML algorithms; and our 
study represents the first such effort in this direction. 
Machine learning algorithms have great potential for 
identifying hidden trends in unstructured heteroge-
neous data sets and offer predictive modeling that 
can accommodate nonlinear relationships among vari-
ables. Combining across-country heterogeneous data 
along with the application of ML algorithms should 
be a logical step toward developing robust and glob-
ally relevant CH4 prediction models (Negussie et al., 
2019). The robustness of ML methods is partly due to
their extraordinary ability to take in and use
information from heterogeneous sources. Consequently,
predictions from the ML-based RF model are reliable and
robust and are applicable under diverse production and
environmental conditions. Furthermore, the approach
outlined in the current study offers analytical tools
to support future attempts to build globally
representative large-scale GHG emission databases, on
the basis of which accurate regional and
intercontinental inventories as well as concerted
global mitigation strategies could be developed.

Figure 5. Within-herd (top) and between-herd (bottom) prediction accuracies in terms of mean normalized discounted cumulative gain
(NDCG) by CH4 measurement method and prediction scenarios within 3k, 21k, and 41k data sets. Colors indicate inclusion (blue) or no inclusion
(red) of DMI in the predictive model; shapes indicate imputation (triangles) or no imputation (dots) of missing proxies in the data sets. Data
points represent the 5 replicates of each predictive model. The gray horizontal bars are average prediction accuracies per scenario.

Within-Herd and Between-Herd Methane Prediction Accuracy

Both in within- and between-herd predictions, inclu-
sion of DMI, either measured or imputed, increased the 
predictive ability of RF and MLR models. As expected 
for decreasing marginal increments, the largest predic-
tion accuracy increase was observed for the 3k data set 
(from RF models: 17–48% for within-herd predictions 
and 1–83% for between-herd predictions, depending 
on the metrics used). In modeling the stage of
lactation, quadratic terms or higher-order polynomials
can be used to account for possible nonlinear
relationships. In Equation 4, an additive model with
linear terms only was chosen. We also tested an MLR
model with a quadratic DIM term, but no significant
changes in the results (negligible or no improvement in
predictive accuracy) were observed. The modeling of
lactation stage for CH4 prediction could be an area for
further investigation.

Between-herd predictions are more challenging than 
within-herd predictions. For between-herd predictions, 
RF showed a much better performance than MLR, with 
accuracy increasing by as much as 350% for $r(y,\hat{y})$ with 
21k data. This enhanced performance substantiates the 
robustness of RF predictions and the ability of RF 
models to make effective use of information coming 
from different herds with heterogeneous management 
and farm routines. Overall, across the 3 data sets and 
accuracy metrics, between-herd prediction accuracies 
were lower than within-herd prediction accuracies 
which is in line with the results reported in Wang and 
Bovenhuis (2019). This is probably because observa-
tions of an individual cow will predict the CH4 output 
of its herd mates with higher accuracy than predicting 
CH4 output of animals in other herds in the combined 
across-country data. Furthermore, diet composition 
and other factors that influence CH4 output, vary less 
within herds than between herds. Evaluating random 
cross-validation and block cross-validation (with farms 
as blocks) which corresponds to within- and between-
herd cross-validation in our study, Wang and Bovenhuis 
(2019) reported that random cross-validation could
result in an overoptimistic view of the ability of milk
IR spectra to predict CH4 emission and lead to
misleading conclusions. Roberts et al. (2017) explained
that when validation data are randomly selected for
cross-validation from the entire spatial domain,
training and validation data from nearby locations will
be dependent (spatial autocorrelation). Consequently, if
the objective is to project outside the spatial
structure of the training data, error estimates from
random cross-validations will be overly optimistic. To
address this, they suggested that blocks can be
designed across the spatial structure itself (i.e., in
contiguous geographic space). This effectively forces
testing on more spatially distant records, thus
decreasing optimism in error estimates, which
underscores the power and practicality of the
between-herd or block cross-validation as implemented
in the current study. In addition to cross-validation
with nonrandom blocks, carefully chosen modeling
objectives can offer more reliable error estimates
(Roberts et al., 2017).

Figure 6. Variable importance in terms of percent mean square error reduction (%IncMSE) and increase in node purity (IncNodePurity)
from the random forest model (A) without including DMI and (B) with inclusion of DMI in the prediction model.
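
The distinction between random (within-herd) and block (between-herd) cross-validation discussed above can be sketched as two ways of assigning records to folds. This is a minimal illustration under stated assumptions and not the study's code: the function names, the herd labels, and the number of folds in the example are hypothetical.

```python
import random

def random_folds(n_records, k, seed=0):
    """Random k-fold: records are shuffled, so herd mates fall in both train and test."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def herd_blocked_folds(herds, k, seed=0):
    """Block k-fold: whole herds are assigned to folds, so test herds are unseen in training."""
    unique = sorted(set(herds))
    random.Random(seed).shuffle(unique)
    fold_of = {herd: i % k for i, herd in enumerate(unique)}
    folds = [[] for _ in range(k)]
    for i, herd in enumerate(herds):
        folds[fold_of[herd]].append(i)
    return folds

def shared_herds(herds, test_idx):
    """Herds that appear in both the test fold and the remaining training records."""
    test_set = set(test_idx)
    test_herds = {herds[i] for i in test_set}
    train_herds = {h for i, h in enumerate(herds) if i not in test_set}
    return test_herds & train_herds

# Hypothetical data: 60 records spread evenly over 6 herds
herds = ["herd%d" % (i % 6) for i in range(60)]
for name, folds in [("random", random_folds(len(herds), 3)),
                    ("blocked", herd_blocked_folds(herds, 3))]:
    overlap = [len(shared_herds(herds, fold)) for fold in folds]
    print(name, "herds shared between train and test per fold:", overlap)
```

With random folds every test fold shares herds with its training data, whereas with blocked folds none does, which is why the blocked estimates are the less optimistic ones.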

When comparing methods used for CH4 measurement, 
records coming from respiration chambers consistently 
displayed the most accurate and least variable predic-
tions, across data sets, scenarios and accuracy metrics. 
Possible reasons for this include within-day variation in 
CH4 emissions, which may not be accounted for in spot-
sample sniffer techniques, and the influence of herd and 
environmental variability on sniffer measurements. In 
addition, sniffer measurements are influenced by cow 
activity, feeding behavior, and relationships between 
cow herd mates, which are excluded when cows are 
placed in chambers (Garnsworthy et al., 2019). As a 
result, sniffer data were not as robust as chamber data 
in predicting CH4 emission in other herds. Nevertheless, 
when within-herd prediction accuracies were compared, 
sniffer data were as accurate as chamber data in predic-
tion of CH4 as they are mostly tailored to specific herd 
environments. On the other hand, for the SF6 method, 
when DMI was added or missing proxies were imputed, 
within-herd prediction accuracies were close to estimates 
from the chamber herds. However, estimates from SF6 
were in general highly variable owing to the small num-
ber of observations available for the method.

Variable Importance and Effect of Imputation of Missing Data Points

In predictive statistics, it is fundamentally important 
to have an accurate model; however, it may also be 
desirable to have a model that is easy to interpret, and 
where variable features that contribute most to pre-
dictive ability can be identified. In the present study, 
relative contributions of proxy variables were provided 
by RF models. When DMI was not included in the 
model, the proxies that contributed most to prediction 
accuracy were DIM, milk yield, BW, milk fat, and milk 
protein. On the other hand, when included in the mod-
el, DMI was identified as the most important variable 
by all the metrics used to measure variable importance. 
In contrast, breed, parity, and CH4 measurement
method were the variables that contributed least to
prediction accuracy.

Adding DMI to the prediction models is expected 
to increase accuracy of prediction of CH4 because of 
the clear biological relationship between DMI and CH4 
production. For instance, in dairy cows that consume 
more feed, more CH4 is produced due to the greater 
availability of substrate for microbial fermentation 
(Hristov et al., 2018). Conversely, increased intake may 
potentially increase passage rate and shorten digesta 
retention time in the rumen, thus decreasing rumen 
fermentation and organic matter digestibility, which 
ultimately decreases CH4 per unit of feed (Boadi et 
al., 2004). Dry matter intake and ME intake are the 
variables most used for prediction of CH4 emission 
(Johnson and Johnson, 1995; Mills et al., 2003; Ellis et 
al., 2007). Consequently, prediction equations includ-
ing such energy intake variables showed low root mean 
square prediction error (RMSPE) and are therefore 
important in prediction of enteric CH4 emission (So-
brinho et al., 2019). Ellis et al. (2007) also confirmed 
that use of DMI in prediction equations for CH4 emis-
sion in cattle resulted in lower RMSPE. Sobrinho et al. 
(2019) working on Nellore cattle reported that equa-
tions that included intakes of DM, total carbohydrate, 
ME, cellulose and nonfiber carbohydrates were the 
most accurate for the prediction of enteric CH4 emis-
sion. In our study, however, DMI was measured in few 
herds, and data on other dietary variables were not 
available. Therefore, imputation of missing DMI data 
points and their inclusion in prediction models had a 
marked positive effect on prediction accuracies, espe-
cially on between-herd prediction accuracies. This was 
particularly true for herds using the sniffer method, the 
majority of which did not have measured DMI records. 
This is consistent with reports in the literature (Ap-
puhamy et al., 2016; Bayat et al., 2017). Appuhamy et 
al. (2016) evaluated 40 prediction equations using data 
that included measured or estimated DMI and some 
feed quality attributes. They reported that models us-
ing estimated DMI predicted enteric CH4 emissions as 
accurately as the measured DMI, provided DMI could 
be estimated with reasonable accuracy. They also re-
ported that enteric CH4 emissions from dairy cows can 
be predicted successfully (RMSPE = 12.7%) without 
DMI, but more accurately (RMSPE = 7.7%) with 
estimated DMI. Similarly, in our study, using hetero-
geneous across-country data RMSPE for RF models 
ranged from 23.3 to 31.3% when DMI was not in the 
prediction model and ranged from 18.5 to 20.3% when 
imputed DMI was included in the prediction model. 
This in general indicates that by imputing missing DMI
data points it is possible to achieve satisfactory
prediction of CH4 emissions provided the right
statistical and imputation methods are used.

Niu et al. (2018) evaluated the potential contribu-
tion of predictor variables by adding sequentially each 
of the predictor variables during a model development 
process. They observed that accuracy of prediction of 
CH4 production was improved in models that included 
DMI, diet composition, milk production and composi-
tion, and BW. In particular, complex models that used 
all available variable information consistently improved 
prediction performance compared with simpler models. 
However, models using only milk yield or diet composi-
tion were the least accurate. When DMI was removed 
from the model to predict CH4 production, ECM was 
selected instead due to its high correlation with DMI, 
but model predictive ability was reduced. This is con-
sistent with results of the current study where, when 
available, DMI was ranked first, followed by milk yield 
and milk compositional variables. Under the scenario 
when DMI was omitted from training of the regression 
trees, RF ranked milk yield and milk compositional 
traits at the top. Breed, parity and methods used for 
the measurement of CH4 ranked lowest, indicating their 
relatively low importance, which could also be due in 
part to their correlations with highly related variables.

Benefits of Heterogeneous Across-Country Data and Potentials of Machine Learning in Predictive Modeling

In recent years, marked progress has been made in 
developing empirical CH4 prediction models at national, 
regional, and global levels (Ellis et al., 2010; Appuhamy 
et al., 2016; Niu et al., 2018). Despite this progress,
much remains to be done. In particular, collating
diverse and heterogeneous intercontinental data into
one usable set, data harmonization, standardization,
model validation, and correction for heterogeneity of
variances in across-country data are all areas of
interest. The statistical methods most used in
developing CH4 predictive models to date have been
questioned because of their limitations (Hristov et al.,
2018), such as not including random effects of animals
or studies. Furthermore, models based on MLR assume
predominantly linear relationships between the
predictor variables and the target variable, although
nonlinear relationships in the data set may be equally
likely. To the best of our knowledge, the current 
study is one of the first attempts to predict CH4 by 
applying ML on combined data on low-cost routinely 
recorded proxies for CH4 from individual animals and 
from diverse international sources.

In our heterogeneous data set, because most herds 
had no measured DMI records, most of the missing 
DMI data points were imputed from routinely recorded 
proxy variables. The finding that imputed DMI records 
improved prediction accuracies clearly shows the poten-
tial that CH4 output could be predicted with reasonable 
accuracy from routinely available variables and esti-
mated DMI, provided that DMI is imputed accurately. 
This opens a great opportunity to include many herds 
in large global or regional databases for intensive data 
analysis, such as across-country genetic evaluations 
when direct measurements of DMI are not available. In 
general, addition of many routinely recorded predictor 
variables into the predictive model can contribute to 
increased prediction accuracy. Therefore, our results 
emphasize the great value of using proxy variables that 
are recorded routinely on-farm and imputing missing 
DMI observations to generate a reasonably accurate 
prediction of CH4 when direct measurements of CH4 
from individual animals are difficult or expensive to 
obtain on a large scale.
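
The principle of imputing missing DMI from routinely recorded proxies can be sketched as follows. The imputation method actually used is described in the Materials and Methods and is not reproduced here; this sketch swaps in a simple least-squares regression of DMI on a single proxy (milk yield), and the function names, field names, and records are all hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def impute_dmi(records):
    """Fill missing 'dmi' (None) from 'milk' using a line fitted on complete records."""
    complete = [r for r in records if r["dmi"] is not None]
    intercept, slope = fit_line(
        [r["milk"] for r in complete], [r["dmi"] for r in complete]
    )
    filled = []
    for r in records:
        missing = r["dmi"] is None
        dmi = intercept + slope * r["milk"] if missing else r["dmi"]
        # Flag imputed values so they can be traced in later analyses
        filled.append(dict(r, dmi=dmi, dmi_imputed=missing))
    return filled

# Hypothetical records: milk yield (kg/d) and DMI (kg/d); the last cow has no DMI
records = [
    {"cow": 1, "milk": 20.0, "dmi": 18.0},
    {"cow": 2, "milk": 30.0, "dmi": 22.0},
    {"cow": 3, "milk": 40.0, "dmi": 26.0},
    {"cow": 4, "milk": 35.0, "dmi": None},
]
filled = impute_dmi(records)
print(filled[3])
```

Keeping an explicit flag on imputed records is a design choice that makes it possible to compare scenarios with measured and imputed DMI, as done across the 3k, 21k, and 41k data sets.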

Because predictive ability of models is likely to be 
enhanced with increasing model complexity (Moraes 
et al., 2014), during model development the trade-off 
between availability of variable inputs on-farm and pre-
diction accuracy must be carefully considered. In this 
regard, a review by Negussie et al. (2017a) provided an 
extensive list of potential proxies with critical evalu-
ation of their attributes. The present study substan-
tiated the practical use of such proxies in improving 
the accuracy of prediction of CH4. Therefore, from now 
onward, relaxing the threshold on predictor variables 
to include low-cost and routinely available proxies will 
open possibilities to collate more and diverse
information on GHG emissions from as many regions and
production systems as possible. More diverse
information, when properly compiled and analyzed, will
have significant implications for improving the
accuracy of predictions. However, Hristov et al. (2018)
highlighted that one of the current challenges is to
make more data available and the next frontier should be the collation 
of GHG information from as wide and diverse livestock 
populations as possible to develop robust models that 
are globally applicable. At the same time, efforts should 
also be directed toward sharpening analytical tools by 
which important missing data points could be accu-
rately imputed from routinely measured proxies.

Future Considerations

Predictive modeling has a great significance and value 
particularly for traits that are difficult and expensive 
to record routinely under commercial farm conditions. 
This includes traits such as DMI and CH4 emission. 
The accuracy of predictive models can be influenced by 
the type and the completeness of the data used. In most
situations, especially with integrated data from
diverse sources, missing variables are quite common.
The imputation of missing variables from other
correlated predictor variables is a common solution to
increase sample size and hence accuracy, as shown in
this study. However, imputation techniques can introduce uncertainty 
in the data with trade-offs between increase in sample 
size and accuracy on one hand and possible increase in 
uncertainty on the other. Our results showed the ben-
efits of imputation in terms of predictive performance 
across the different scenarios. Nevertheless, future 
studies on the measurement of accuracy of imputations 
are warranted. Furthermore, heterogeneity of residual 
variances is one important area. In multilevel models, 
residual variance may vary between subsets of the data, 
such as between methane measurement techniques, 
and this was not accounted for in our MLR model in 
Equation 4, where homoscedasticity was assumed. In 
classical statistical modeling, notably in mixed linear 
models for animal genetics, heterogeneity of variance 
is known to bias the estimates of model solutions (e.g., 
Visscher and Hill, 1992). However, Hill (1984) on the 
contrary has shown that ignoring heterogeneity of 
variances decreases the efficiency of genetic evaluation 
procedures and consequently the response to selection. 
Generally, less obvious is the effect of heterogeneous 
variance on the performance of predictive models in 
ML, where most of the methods used (including RF) 
do not rely on the homoscedasticity assumption. In 
a simulation study, W. Ruth and T. Loughin (Simon 
Fraser University, Burnaby, BC, Canada; unpublished 
data) showed that heterogeneity of variance negatively 
affects the performance of single regression trees. How-
ever, less clear is how this could affect the performance 
of ensemble methods such as RF, which are based on a 
large number of trees that could counteract the nega-
tive effects of heterogeneous variance by reducing the 
variance of predictions through averaging over trees. 
Assessing the effect of heterogeneous variance on the 
predictive ability of ML models for prediction of CH4 
emission is an interesting topic that will be taken up in 
our follow-up studies.
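
The averaging argument above can be illustrated with a toy simulation. The sketch below is not the unpublished study cited above and is not a full RF: it uses one-split regression trees (stumps) and plain bootstrap aggregation, without random feature selection. It simulates a response whose residual SD depends on a hypothetical measurement technique and compares the variance of a single stump's prediction with that of an average over bootstrap stumps. All names, sample sizes, and parameter values are assumptions made for illustration only.

```python
import random
import statistics


def simulate(n, rng):
    """One data set: y = 10 + 5x + noise, where the residual SD is 0.5 or 3.0
    depending on a hypothetical measurement technique (heteroscedastic)."""
    data = []
    for _ in range(n):
        x = rng.random()
        sd = 3.0 if rng.random() < 0.5 else 0.5
        data.append((x, 10.0 + 5.0 * x + rng.gauss(0.0, sd)))
    return data


def fit_stump(data):
    """One-split regression tree chosen by least squares; returns a predictor."""
    pts = sorted(data)
    best = None
    for i in range(1, len(pts)):
        left = [y for _, y in pts[:i]]
        right = [y for _, y in pts[i:]]
        m_left = statistics.fmean(left)
        m_right = statistics.fmean(right)
        sse = (sum((y - m_left) ** 2 for y in left)
               + sum((y - m_right) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pts[i - 1][0] + pts[i][0]) / 2.0
            best = (sse, threshold, m_left, m_right)
    _, threshold, m_left, m_right = best
    return lambda x: m_left if x <= threshold else m_right


def fit_bagged(data, n_trees, rng):
    """Average of stumps, each fitted to a bootstrap resample of the data."""
    stumps = [fit_stump(rng.choices(data, k=len(data))) for _ in range(n_trees)]
    return lambda x: statistics.fmean([s(x) for s in stumps])


def prediction_variance(n_reps=60, n=30, n_trees=20, x0=0.5, seed=1):
    """Variance of the prediction at x0 across independently simulated data sets,
    for a single stump and for the bagged average."""
    rng = random.Random(seed)
    single, bagged = [], []
    for _ in range(n_reps):
        data = simulate(n, rng)
        single.append(fit_stump(data)(x0))
        bagged.append(fit_bagged(data, n_trees, rng)(x0))
    return statistics.variance(single), statistics.variance(bagged)


var_single, var_bagged = prediction_variance()
print(f"single stump: {var_single:.3f}  bagged stumps: {var_bagged:.3f}")
```

A lower variance for the bagged predictor in such a toy setting is only suggestive. Whether the same holds for RF on heterogeneous CH4 data, and how bias is affected, is the open question raised in this section.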

In conclusion, the present study describes a novel way
forward for developing accurate and robust CH4 prediction
models. These models will help in designing effective and
sustainable GHG mitigation strategies and will aid national,
regional, and global GHG inventories. The broad
applicability of such models requires collation of input
data from wide and heterogeneous sources. The approach
presented here helps to overcome the difficulty of procuring
predictor variables related to intake and diet composition
on-farm, and it underlines the need for versatile
statistical tools for compiling and analyzing unstructured,
heterogeneous across-country data. In this way, low-cost and
routinely measured proxy variables can be used to provide a
reasonably accurate prediction of CH4 when coupled with
imputation of missing DMI data points. As a predictive
model, the ML ensemble algorithm RF consistently gave more
accurate predictions than conventional multiple regression
models. This provides great potential for building a
globally representative, large-scale CH4 emission database,
on the basis of which accurate regional and intercontinental
inventories as well as a concerted global mitigation
strategy could be developed. Results from this study lay
strong foundations for our next thorough comparison of
various state-of-the-art ML methods for prediction of
dairy-cow CH4 emissions in a much larger integrated global
data set.

ACKNOWLEDGMENTS

This paper is the result of the concerted effort of all 
participants and support from the networks of COST 
Action FA1302 “METHAGENE: Large-scale methane 
measurements on individual ruminants for genetic 
evaluations.” The authors thank all individuals and 
groups who have directly or indirectly contributed to 
this work; special thanks are due to the technical and 
financial support from the COST Action FA1302 of the 
European Union. In addition, the financial and technical
support from all participating countries and research
centers involved in this work is gratefully acknowledged.
The authors have not stated any conflicts of interest.

REFERENCES

Al-Jarrah, O. Y., P. D. Yoo, S. Muhaidat, G. K. Karagiannidis, and
K. Taha. 2015. Efficient machine learning for big data: A review.
Big Data Research 2:87–93.
http://dx.doi.org/10.1016/j.bdr.2015.04.001.

Appuhamy, J. A. D. R. N., J. France, and E. Kebreab. 2016. Models
for predicting enteric methane emissions from dairy cows in North
America, Europe, and Australia and New Zealand. Glob. Chang.
Biol. 22:3039–3056. https://doi.org/10.1111/gcb.13339.

Bayat, A. R., L. Ventto, P. Kairenius, T. Stefański, H. Leskinen, I.
Tapio, E. Negussie, J. Vilkki, and K. J. Shingfield. 2017. Dietary
forage to concentrate ratio and sunflower oil supplement alter
rumen fermentation, ruminal methane emissions, and nutrient
utilization in lactating cows. Transl. Anim. Sci. 1:277–286.
https://doi.org/10.2527/tas2017.0032.

Bell, M. J., N. Saunders, R. H. Wilcox, E. M. Homer, J. R.
Goodman, J. Craigon, and P. C. Garnsworthy. 2014. Methane emissions
among individual dairy cows during milking quantified by
eructation peaks or ratio with carbon dioxide. J. Dairy Sci.
97:6536–6546. https://doi.org/10.3168/jds.2013-7889.

Benaouda, M., C. Martin, X. Li, E. Kebreab, A. N. Hristov, Z. Yu,
D. R. Yáñez-Ruiz, C. K. Reynolds, L. A. Crompton, J. Dijkstra,
A. Bannink, A. Schwarm, M. Kreuzer, M. McGee, P. Lund, A.
L. F. Hellwing, M. R. Weisbjerg, P. J. Moate, A. R. Bayat, K.
J. Shingfield, N. Peiren, and M. Eugène. 2019. Evaluation of the
performance of existing mathematical models predicting enteric
methane emissions from ruminants: Animal categories and dietary
mitigation strategies. Anim. Feed Sci. Technol. 255:114207.
https://doi.org/10.1016/j.anifeedsci.2019.114207.

Blaxter, K. L., and J. L. Clapperton. 1965. Prediction of the amount
of methane produced by ruminants. Br. J. Nutr. 19:511–522.
https://doi.org/10.1079/BJN19650046.

Blondel, M., A. Onogi, H. Iwata, and N. Ueda. 2015. A ranking
approach to genomic selection. PLoS One 10:e0128570.
https://doi.org/10.1371/journal.pone.0128570.

Boadi, D. A., K. M. Wittenberg, S. L. Scott, D. Burton, K. Buckley, J.
A. Small, and K. H. Ominski. 2004. Effect of low and high forage
diet on enteric and manure pack greenhouse gas emissions from a
feedlot. Can. J. Anim. Sci. 84:445–453.
https://doi.org/10.4141/A03-079.

Breiman, L. 2001. Random forests. Mach. Learn. 45:5–32.
https://doi.org/10.1023/A:1010933404324.

Cassandro, M. 2020. Animal breeding and climate change, mitigation
and adaptation. J. Anim. Breed. Genet. 137:121–122.
https://doi.org/10.1111/jbg.12469.

Cassandro, M., M. Marcello, and B. Stefanon. 2013. Genetic aspects of
enteric methane emission in livestock ruminants. Ital. J. Anim. Sci.
12:450–458. https://doi.org/10.4081/ijas.2013.e73.

Charmley, E., S. R. O. Williams, P. J. Moate, R. S. Hegarty, R. M.
Herd, V. H. Oddy, P. Reyenga, K. M. Staunton, A. Anderson,
and M. C. Hannah. 2016. A universal equation to predict
methane production of forage-fed cattle in Australia. Anim. Prod. Sci.
56:169–180. https://doi.org/10.1071/AN15365.

de Haas, Y., J. J. Windig, M. P. L. Calus, J. Dijkstra, M. de Haan, A.
Bannink, and R. F. Veerkamp. 2011. Genetic parameters for
predicted methane production and the potential for reducing enteric
emissions through genomic selection. J. Dairy Sci. 94:6122–6134.
https://doi.org/10.3168/jds.2011-4439.

Deighton, M. H., S. R. O. Williams, M. C. Hannah, R. J. Eckard,
T. M. Boland, W. J. Wales, and P. J. Moate. 2014. A modified
sulphur hexafluoride tracer technique enables accurate
determination of enteric methane emissions from ruminants. Anim. Feed
Sci. Technol. 197:47–63.
https://doi.org/10.1016/j.anifeedsci.2014.08.003.

Ellis, J. L., A. Bannink, J. France, E. Kebreab, and J. Dijkstra. 2010.
Evaluation of enteric methane prediction equations for dairy cows
used in whole farm models. Glob. Chang. Biol. 16:3246–3256.
https://doi.org/10.1111/j.1365-2486.2010.02188.x.

Ellis, J. L., E. Kebreab, N. E. Odongo, B. W. McBride, E. K. Okine,
and J. France. 2007. Prediction of methane production from dairy
and beef cattle. J. Dairy Sci. 90:3456–3466.
https://doi.org/10.3168/jds.2006-675.

Engineering ToolBox. 2003. Gases - densities. Accessed Jun. 11, 2018.
https://www.engineeringtoolbox.com/gas-density-d_158.html.

Engineering ToolBox. 2004. STP – standard temperature and
pressure and NTP – normal temperature and pressure. Accessed Jun.
11, 2018.
https://www.engineeringtoolbox.com/stp-standard-ntp-normal-air-d_772.html.

FAO (Food and Agriculture Organization of the United Nations).
2016. Greenhouse Gas Emissions from Agriculture, Forestry and
Other Land Use. FAO. http://www.fao.org/3/a-i6340e.pdf.

FAO (Food and Agriculture Organization of the United Nations).
2018. Enteric fermentation. FAOSTAT. Accessed Jun. 10, 2018.
http://www.fao.org/faostat/en/#data/ge.

Garnsworthy, P. C., J. Craigon, J. H. Hernandez-Medrano, and N.
Saunders. 2012. Variation among individual dairy cows in
methane measurements made on farm during milking. J. Dairy Sci.
95:3181–3189. https://doi.org/10.3168/jds.2011-4606.

Garnsworthy, P. C., G. F. Difford, M. J. Bell, A. R. Bayat, P.
Huhtanen, B. Kuhla, J. Lassen, N. Peiren, M. Pszczola, D. Sorg,
M. H. P. W. Visker, and T. Yan. 2019. Comparison of methods
to measure methane for use in genetic evaluation of dairy cattle.
Animals (Basel) 9:837. https://doi.org/10.3390/ani9100837.

González-Recio, O., and S. Forni. 2011. Genome-wide prediction of
discrete traits using Bayesian regressions and machine learning.
Genet. Sel. Evol. 43:7. https://doi.org/10.1186/1297-9686-43-7.

Gower, J. C. 1971. A general coefficient of similarity and some of
its properties. Biometrics 27:857–874.
https://doi.org/10.2307/2528823.

Hellwing, A. L. F., P. Lund, J. Madsen, and M. R. Weisbjerg. 2013.
Comparison of enteric methane production predicted from the
CH4/CO2 ratio and measured in respiration chambers. Adv. Anim.
Biosci. 4:557.

Hill, W. G. 1984. On selection among groups with heterogeneous
variance. Anim. Prod. 39:473–477.

Hristov, A. N., E. Kebreab, M. Niu, J. Oh, A. Bannink, A. R. Bayat,
T. M. Boland, A. F. Brito, D. P. Casper, L. A. Crompton, J.
Dijkstra, M. Eugène, P. C. Garnsworthy, N. Haque, A. L. F. Hellwing,
P. Huhtanen, M. Kreuzer, B. Kuhla, P. Lund, J. Madsen, C.
Martin, P. J. Moate, S. Muetzel, C. Muñoz, N. Peiren, J. M. Powell,
C. K. Reynolds, A. Schwarm, K. J. Shingfield, T. M. Storlien, M.
R. Weisbjerg, D. R. Yáñez-Ruiz, and Z. Yu. 2018. Symposium
review: Uncertainties in enteric methane inventories, measurement
techniques, and prediction models. J. Dairy Sci. 101:6655–6674.
https://doi.org/10.3168/jds.2017-13536.

Hristov, A. N., J. Oh, J. L. Firkins, J. Dijkstra, E. Kebreab, G.
Waghorn, H. P. S. Makkar, A. T. Adesogan, W. Yang, C. Lee, P. J.
Gerber, B. Henderson, and J. M. Tricarico. 2013. Special topics:
Mitigation of methane and nitrous oxide emissions from animal
operations: I. A review of enteric methane mitigation options. J.
Anim. Sci. 91:5045–5069. https://doi.org/10.2527/jas.2013-6583.

Jantke, K., M. J. Hartmann, L. Rasche, B. Blanz, and U. A. Schneider.
2020. Agricultural greenhouse gas emissions: Knowledge and
positions of German farmers. Land (Basel) 9:130.
https://doi.org/10.3390/land9050130.

Järvelin, K., and J. Kekäläinen. 2002. Cumulated gain-based
evaluation of IR techniques. ACM Trans. Inf. Syst. 20:422–446.
https://doi.org/10.1145/582415.582418.

Jentsch, W., M. Schweigel, F. Weissbach, H. Scholze, W. Pitroff, and
M. Derno. 2007. Methane production in cattle calculated by the
nutrient composition of the diet. Arch. Anim. Nutr. 61:10–19.
https://doi.org/10.1080/17450390601106580.

Johnson, K. A., and D. E. Johnson. 1995. Methane emissions from
cattle. J. Anim. Sci. 73:2483–2492.
https://doi.org/10.2527/1995.7382483x.

Kebreab, E., K. Clark, C. Wagner-Riddle, and J. France. 2006.
Methane and nitrous oxide emissions from Canadian animal agriculture:
A review. Can. J. Anim. Sci. 86:135–157.
https://doi.org/10.4141/A05-010.

Kebreab, E., K. A. Johnson, S. L. Archibeque, D. Pape, and T. Wirth.
2008. Model for estimating enteric methane emissions from United
States dairy and feedlot cattle. J. Anim. Sci. 86:2738–2748.
https://doi.org/10.2527/jas.2008-0960.

Knief, U., and W. Forstmeier. 2021. Violating the normality
assumption may be the lesser of two evils. Behav. Res. Methods
53:2576–2590. https://doi.org/10.1101/498931.

Kowarik, A., and M. Templ. 2016. Imputation with the R package
VIM. J. Stat. Softw. 74:1–16.
https://doi.org/10.18637/jss.v074.i07.

Kriss, M. 1931. A comparison of feeding standards for dairy cows,
with especial reference to energy requirements. J. Nutr. 4:141–161.
https://doi.org/10.1093/jn/4.1.141.

Liaw, A., and M. Wiener. 2002. Classification and regression by
randomForest. R News 2:18–22.
http://CRAN.R-project.org/doc/Rnews/.

Mills, J. A. N., E. Kebreab, C. M. Yates, L. A. Crompton, S. B.
Cammell, M. S. Dhanoa, R. E. Agnew, and J. France. 2003.
Alternative approaches to predicting methane emissions from dairy
cows. J. Anim. Sci. 81:3141–3150.
https://doi.org/10.2527/2003.81123141x.

Moate, P. J., S. R. O. Williams, C. Grainger, M. C. Hannah, E. N.
Ponnampalam, and R. J. Eckard. 2011. Influence of cold-pressed
canola, brewers grains and hominy meal as dietary supplements
suitable for reducing enteric methane emissions from lactating
dairy cows. Anim. Feed Sci. Technol. 166–167:254–264.
https://doi.org/10.1016/j.anifeedsci.2011.04.069.

Moraes, L. E., A. B. Strathe, J. G. Fadel, D. P. Casper, and E.
Kebreab. 2014. Prediction of enteric methane emissions from cattle.
Glob. Chang. Biol. 20:2140–2148.
https://doi.org/10.1111/gcb.12471.

Negussie, E. 2022. Supplemental Table S1. Harvard Dataverse, V1.
https://doi.org/10.7910/DVN/BINDG9.

Negussie, E., Y. de Haas, F. Dehareng, R. J. Dewhurst, J. Dijkstra, N.
Gengler, D. P. Morgavi, H. Soyeurt, S. van Gastelen, T. Yan, and
F. Biscarini. 2017a. Invited review: Large-scale indirect
measurements for enteric methane emissions in dairy cattle: A review of
proxies and their potential for use in management and breeding
decisions. J. Dairy Sci. 100:2433–2453.
https://doi.org/10.3168/jds.2016-12030.

Negussie, E., O. González-Recio, Y. de Haas, N. Gengler, H. Soyeurt,
N. Peiren, M. Pszczola, P. Garnsworthy, M. Battagin, A. R. Bayat,
J. Lassen, T. Yan, T. Boland, B. Kuhla, T. Strabel, A. Schwarm,
A. Vanlierde, and F. Biscarini. 2019. Machine learning ensemble
algorithms in predictive analytics of dairy cattle methane
emission using imputed versus non-imputed datasets. Page 40 in
Proceedings of 7th GGAA (Greenhouse Gas and Animal Agriculture)
Conference, Iguassu Falls, Brazil. Embrapa Southeast Livestock.

Negussie, E., J. Lehtinen, P. Mäntysaari, A. R. Bayat, A.-E. Liinamo,
E. A. Mäntysaari, and M. H. Lidauer. 2017b. Non-invasive
individual methane measurement in dairy cows. Animal 11:890–899.
https://doi.org/10.1017/S1751731116002718.

Nielsen, N. I., H. Volden, M. Åkerlind, M. Brask, A. L. F. Hellwing,
T. Storlien, and J. Bertilsson. 2013. A prediction equation for
enteric methane emission from dairy cows for use in NorFor. Acta
Agric. Scand. A Anim. Sci. 63:126–130.
https://doi.org/10.1080/09064702.2013.851275.

Niu, M., E. Kebreab, A. N. Hristov, J. Oh, C. Arndt, A. Bannink, A.
R. Bayat, A. F. Brito, T. Boland, D. Casper, L. A. Crompton,
J. Dijkstra, M. A. Eugène, P. C. Garnsworthy, M. N. Haque, A.
L. F. Hellwing, P. Huhtanen, M. Kreuzer, B. Kuhla, P. Lund, J.
Madsen, C. Martin, S. C. McClelland, M. McGee, P. J. Moate,
S. Muetzel, C. Muñoz, P. O’Kiely, N. Peiren, C. K. Reynolds, A.
Schwarm, K. J. Shingfield, T. M. Storlien, M. R. Weisbjerg, D.
R. Yáñez-Ruiz, and Z. Yu. 2018. Prediction of enteric methane
production, yield, and intensity in dairy cattle using an
intercontinental database. Glob. Chang. Biol. 24:3368–3389.
https://doi.org/10.1111/gcb.14094.

O’Neill, B. F., M. H. Deighton, B. M. O’Loughlin, F. J. Mulligan,
T. M. Boland, M. O’Donovan, and E. Lewis. 2011. Effects of a
perennial ryegrass diet or total mixed ration diet offered to
spring-calving Holstein-Friesian dairy cows on methane emissions, dry
matter intake, and milk production. J. Dairy Sci. 94:1941–1951.
https://doi.org/10.3168/jds.2010-3361.

Ramin, M., and P. Huhtanen. 2013. Development of equations for
predicting methane emissions from ruminants. J. Dairy Sci.
96:2476–2493. https://doi.org/10.3168/jds.2012-6095.

Roberts, D. R., V. Bahn, S. Ciuti, M. S. Boyce, J. Elith, G.
Guillera-Arroita, S. Hauenstein, J. J. Lahoz-Monfort, B. Schröder, W.
Thuiller, D. I. Warton, B. A. Wintle, F. Hartig, and C. F.
Dormann. 2017. Cross-validation strategies for data with temporal,
spatial, hierarchical, or phylogenetic structure. Ecography
40:913–929. https://doi.org/10.1111/ecog.02881.

Rojas-Downing, M. M., A. P. Nejadhashemi, T. Harrigan, and S. A.
Woznicki. 2017. Climate change and livestock: Impacts,
adaptation, and mitigation. Clim. Risk Manage. 16:145–163.
https://doi.org/10.1016/j.crm.2017.02.001.

Schielzeth, H., N. J. Dingemanse, S. Nakagawa, D. F. Westneat, H.
Allegue, C. Teplitsky, D. Réale, N. A. Dochtermann, L. Z.
Garamszegi, and Y. G. Araya-Ajoy. 2020. Robustness of linear
mixed-effects models to violations of distributional assumptions.
Methods Ecol. Evol. 11:1141–1152.
https://doi.org/10.1111/2041-210X.13434.

Sobrinho, T. L. P., R. H. Branco, E. Magnani, A. Berndt, R. C.
Canesin, and M. E. Z. Mercadante. 2018. Development and
evaluation of prediction equations for methane emission from Nellore
cattle. Acta Sci. Anim. Sci. 41:e42559.
https://doi.org/10.4025/actascianimsci.v41i1.42559.

St-Pierre, N. R. 2001. Invited review: Integrating quantitative findings
from multiple studies using mixed model methodology. J. Dairy
Sci. 84:741–755.
https://doi.org/10.3168/jds.S0022-0302(01)74530-4.

Storlien, T. M., H. Volden, T. Almøy, K. A. Beauchemin, T. A.
McAllister, and O. M. Harstad. 2014. Prediction of enteric methane
production from dairy cows. Acta Agric. Scand. A Anim. Sci.
64:98–109. https://doi.org/10.1080/09064702.2014.959553.

Troyanskaya, O., M. Cantor, G. Sherlock, P. Brown, T. Hastie, R.
Tibshirani, D. Botstein, and R. B. Altman. 2001. Missing value
estimation methods for DNA microarrays. Bioinformatics 17:520–525.
https://doi.org/10.1093/bioinformatics/17.6.520.

Visscher, P. M., and W. G. Hill. 1992. Heterogeneity of variance and
dairy-cattle breeding. Anim. Sci. 55:321–329.
https://doi.org/10.1017/S0003356100021012.

Waghorn, G. C., H. Clark, V. Taufa, and A. Cavanagh. 2008.
Monensin controlled-release capsules for methane mitigation in
pasture-fed dairy cows. Aust. J. Exp. Agric. 48:65–68.
https://doi.org/10.1071/EA07299.

Wang, Q., and H. Bovenhuis. 2019. Validation strategy can result in an
overoptimistic view of the ability of milk infrared spectra to
predict methane emission of dairy cattle. J. Dairy Sci. 102:6288–6295.
https://doi.org/10.3168/jds.2018-15684.

Wickham, H. 2009. Ggplot2: Elegant Graphics for Data Analysis. 2nd
ed. Springer Nature.

Williams, S. R. O., T. Clarke, M. C. Hannah, L. C. Marett, P. J.
Moate, M. J. Auldist, and W. J. Wales. 2013. Energy partitioning
in herbage-fed dairy cows offered supplementary grain during an
extended lactation. J. Dairy Sci. 96:484–494.
https://doi.org/10.3168/jds.2012-5787.

Wolfert, S., L. Ge, C. Verdouw, and M.-J. Bogaardt. 2017. Big data in
smart farming–A review. Agric. Syst. 153:69–80.
https://doi.org/10.1016/j.agsy.2017.01.023.

Yan, T., R. E. Agnew, F. J. Gordon, and M. J. Porter. 2000. The
prediction of methane energy output in dairy and beef cattle offered
grass silage-based diets. Livest. Prod. Sci. 64:253–263.
https://doi.org/10.1016/S0301-6226(99)00145-1.

Zhang, C., and Y. Ma. 2012. Random forest. Pages 157–175 in
Ensemble Machine Learning: Methods and Applications. A. Cutler,
D. R. Cutler, and J. R. Stevens, ed. Springer.
https://doi.org/10.1007/978-1-4419-9326-7_5.

Zhao, Y., X. Nan, L. Yang, S. Zheng, L. Jiang, and B. Xiong. 2020.
A review of enteric methane emission measurement techniques
in ruminants. Animals (Basel) 10:1004.
https://doi.org/10.3390/ani10061004.

ORCIDS

Enyew Negussie  https://orcid.org/0000-0003-4892-9938
Oscar González-Recio  https://orcid.org/0000-0002-9106-4063
Mara Battagin  https://orcid.org/0000-0001-7309-6793
Ali-Reza Bayat  https://orcid.org/0000-0002-4894-0662
Tommy Boland  https://orcid.org/0000-0002-7433-130X
Yvette de Haas  https://orcid.org/0000-0002-4331-4101
Aser Garcia-Rodriguez  https://orcid.org/0000-0001-5519-6766
Philip C. Garnsworthy  https://orcid.org/0000-0001-5131-3398
Nicolas Gengler  https://orcid.org/0000-0002-5981-5509
Michael Kreuzer  https://orcid.org/0000-0002-9978-1171
Björn Kuhla  https://orcid.org/0000-0002-2032-5502
Jan Lassen  https://orcid.org/0000-0002-1338-8644
Nico Peiren  https://orcid.org/0000-0001-5500-1607
Marcin Pszczola  https://orcid.org/0000-0003-2833-5083
Angela Schwarm  https://orcid.org/0000-0002-5750-2111
Hélène Soyeurt  https://orcid.org/0000-0001-9883-9047
Amélie Vanlierde  https://orcid.org/0000-0002-4619-1936
Tianhai Yan  https://orcid.org/0000-0002-1994-5202
Filippo Biscarini  https://orcid.org/0000-0002-3901-2354

