Jukuri, open repository of the Natural Resources Institute Finland (Luke)

Author(s): Enyew Negussie, Oscar González-Recio, Mara Battagin, Ali-Reza Bayat, Tommy Boland, Yvette de Haas, Aser Garcia-Rodriguez, Philip C. Garnsworthy, Nicolas Gengler, Michael Kreuzer, Björn Kuhla, Jan Lassen, Nico Peiren, Marcin Pszczola, Angela Schwarm, Hélène Soyeurt, Amélie Vanlierde, Tianhai Yan & Filippo Biscarini
Title: Integrating heterogeneous across-country data for proxy-based random forest prediction of enteric methane in dairy cattle
Year: 2022
Version: Preprint version
Copyright: The Author(s) 2022
Rights: CC BY 4.0
Rights url: http://creativecommons.org/licenses/by/4.0/

Please cite the original version: Negussie E., González-Recio O., Battagin M., Bayat A.-R., Boland T., de Haas Y., Garcia-Rodriguez A., Garnsworthy P.C., Gengler N., Kreuzer M., Kuhla B., Lassen J., Peiren N., Pszczola M., Schwarm A., Soyeurt H., Vanlierde A., Yan T. & Biscarini F. (2022). Integrating heterogeneous across-country data for proxy-based random forest prediction of enteric methane in dairy cattle. Journal of Dairy Science 105(6): 5124-5140. https://doi.org/10.3168/jds.2021-20158.

ABSTRACT

Direct measurements of methane (CH4) from individual animals are difficult and expensive.
Predictions based on proxies for CH4 are a viable alternative. Most prediction models are based on multiple linear regressions (MLR) and predictor variables that are not routinely available in commercial farms, such as dry matter intake (DMI) and diet composition. The use of machine learning (ML) algorithms to predict CH4 emissions from across-country heterogeneous data sets has not been reported. The objectives were to compare performances of the ML ensemble algorithm random forest (RF) and MLR models in predicting CH4 emissions from proxies in dairy cows, and to assess effects of imputing missing data points on prediction accuracy. Data on CH4 emissions and proxies for CH4 from 20 herds were provided by 10 countries. The integrated data set contained 43,519 records from 3,483 cows, with 18.7% missing data points imputed using k-nearest neighbor imputation. Three data sets were created: 3k (no missing records), 21k (missing DMI imputed from milk, fat, protein, body weight), and 41k (missing DMI, milk fat, and protein records imputed). These data sets were used to test scenarios (with or without DMI, imputed vs. nonimputed DMI, milk fat, and protein) and prediction models (RF vs. MLR). Model predictive ability was evaluated within and between herds through 10-fold cross-validation. Prediction accuracy was measured as correlation between observed and predicted CH4, root mean squared error (RMSE), and mean normalized discounted cumulative gain (NDCG). Compared with models without DMI, inclusion of DMI improved within- and between-herd prediction accuracy to 0.77 (RMSE = 23.3%) and 0.58 (RMSE = 31.9%) in RF and to 0.50 (RMSE = 32.7%) and 0.13 (RMSE = 42.7%) in MLR, respectively. When missing DMI records were imputed, within- and between-herd accuracy increased to 0.84 (RMSE = 18.5%) and 0.63 (RMSE = 29.9%), respectively. In all scenarios, RF models outperformed MLR models.
Results suggest routinely measured variables from dairy farms can be used in developing globally robust prediction models for CH4 if coupled with state-of-the-art techniques for imputation and advanced ML algorithms for predictive modeling.

Key words: enteric methane, machine learning, prediction models, proxies for methane

Integrating heterogeneous across-country data for proxy-based random forest prediction of enteric methane in dairy cattle

Enyew Negussie,1* Oscar González-Recio,2 Mara Battagin,3 Ali-Reza Bayat,4 Tommy Boland,5 Yvette de Haas,6 Aser Garcia-Rodriguez,7 Philip C. Garnsworthy,8 Nicolas Gengler,9 Michael Kreuzer,10 Björn Kuhla,11 Jan Lassen,12 Nico Peiren,13 Marcin Pszczola,14 Angela Schwarm,15 Hélène Soyeurt,9 Amélie Vanlierde,16 Tianhai Yan,17 and Filippo Biscarini18

1Animal Genomics and Breeding, Natural Resources Institute Finland (Luke), 31600 Jokioinen, Finland
2Department of Animal Breeding, Instituto Nacional de Investigacion y Tecnologia Agraria y Alimentaria (INIA-CSIC), 28040 Madrid, Spain
3Italian Brown Cattle Breeders’ Association, Verona, Italy
4Animal Nutrition, Natural Resources Institute Finland (Luke), 31600 Jokioinen, Finland
5Agriculture and Food Science Centre, School of Agriculture and Food Science, University College Dublin, Belfield, Dublin 4, Ireland
6Animal Breeding and Genomics, Wageningen University and Research, 6700 AH Wageningen, the Netherlands
7Department of Animal Production, NEIKER - Basque Institute for Agricultural Research and Development, 01192 Arkaute, Spain
8School of Biosciences, University of Nottingham, Sutton Bonington Campus, Loughborough LE12 5RD, United Kingdom
9TERRA Teaching and Research Centre, Gembloux Agro-Bio Tech, University of Liège, 5030 Gembloux, Belgium
10ETH Zurich, Institute of Agricultural Sciences, Universitaetstrasse 2, 8092 Zurich, Switzerland
11Research Institute for Farm Animal Biology (FBN), Institute of Nutritional Physiology “Oskar Kellner,” Wilhelm-Stahl-Allee 2,
18196 Dummerstorf, Germany
12VikingGenetics, Ebeltoftvej 16, 8960 Randers, Denmark
13Institute for Agricultural and Fisheries Research (ILVO), Merelbeke, Belgium
14Department of Genetics and Animal Breeding, Poznan University of Life Sciences, Wołynska 33, 60-637 Poznan, Poland
15Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, PO Box 5003, 1432 Ås, Norway
16Productions in Agriculture Department, Walloon Agricultural Research Centre (CRA-W), BEL-5030 Gembloux, Belgium
17Livestock Production Science Branch, Agri-Food and Biosciences Institute, Hillsborough, Co. Down BT26 6DR, United Kingdom
18National Research Council, Institute of Agricultural Biology and Biotechnology (CNR-IBBA), Via Bassini 15, 20133 Milan, Italy

J. Dairy Sci. 105
https://doi.org/10.3168/jds.2021-20158
© 2022, The Authors. Published by Elsevier Inc. and Fass Inc. on behalf of the American Dairy Science Association®. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Received January 15, 2021. Accepted February 9, 2022.
*Corresponding author: enyew.negussie@luke.fi

https://orcid.org/0000-0003-4892-9938
https://orcid.org/0000-0002-9106-4063
https://orcid.org/0000-0001-7309-6793
https://orcid.org/0000-0002-4894-0662
https://orcid.org/0000-0002-7433-130X
https://orcid.org/0000-0002-4331-4101
https://orcid.org/0000-0001-5519-6766
https://orcid.org/0000-0001-5131-3398
https://orcid.org/0000-0002-5981-5509
https://orcid.org/0000-0002-9978-1171
https://orcid.org/0000-0002-2032-5502
https://orcid.org/0000-0002-1338-8644
https://orcid.org/0000-0001-5500-1607
https://orcid.org/0000-0003-2833-5083
https://orcid.org/0000-0002-5750-2111
https://orcid.org/0000-0001-9883-9047
https://orcid.org/0000-0002-4619-1936
https://orcid.org/0000-0002-1994-5202
https://orcid.org/0000-0002-3901-2354

Journal of Dairy Science Vol. 105 No.
6, 2022

INTRODUCTION

Food production and agriculture face major challenges under climate change, in terms of both the expected negative effect on productivity and the implementation of sectoral actions to limit greenhouse gas (GHG) emissions. Sustainable farming, livestock husbandry, fisheries, and forestry can help countries identify opportunities for reducing emissions while addressing their food security, resilience, and rural development goals (FAO, 2016). Agricultural activities contribute 10 to 14% of global anthropogenic GHG emissions (Jantke et al., 2020). Livestock production systems account for 40% of CH4 emissions in agriculture (FAO, 2018), the largest part of which originates from CH4 that is produced in and released from the rumen. There is also an indirect contribution through, for example, feed-production activities, deforestation, and manure (Cassandro, 2020). Agriculture, particularly livestock, is increasingly being recognized as both a contributor to climate change and a potential victim of it (Cassandro et al., 2013; Cassandro, 2020). However, within the agriculture sector, livestock production has great potential for reducing GHG emissions and for contributing to climate change mitigation and adaptation (Cassandro, 2020). This is in part because, of the many available options for mitigation of GHG, mitigation of CH4 is particularly efficient: given its relatively short half-life, any mitigation effort is expected to result in quick returns. The growing demand for meat and milk, which is predicted to double by 2050 (Rojas-Downing et al., 2017), calls for an accurate inventory of CH4 emissions for setting up effective and sustainable mitigation strategies.

Direct measurement of CH4 emissions from individual animals using respiration chambers provides reliable information that can be used for national inventories, assessment of dietary mitigation strategies, genetic selection, and calculation of energy loss through exhaled CH4 (Appuhamy et al., 2016). However, this approach is expensive and labor intensive, and is not suitable for large-scale assessment (Kebreab et al., 2006; Moraes et al., 2014; Negussie et al., 2017b). There have been several efforts in recent decades to develop low-cost and portable methods for direct measurement of CH4 emissions in animals (Negussie et al., 2017a,b; Zhao et al., 2020). Although such handheld and portable applications for direct measurement have the potential for high throughput, they are generally based on CH4 concentration as opposed to flux assays and are in some cases considered to be less accurate than respiration chambers (Garnsworthy et al., 2019). Instead, the use of combinations of proxies for CH4 has been suggested as a valid alternative to direct measurement of CH4. Proxies for CH4 are traits that are directly or indirectly related to CH4 and that can easily be measured on a large scale and at low cost (Negussie et al., 2017a). Some proxies (e.g., milk yield, milk composition, lactation stage) are readily available from routine national recording schemes in many countries, and thus their use may be promising in developing robust prediction models for CH4. In a comprehensive review, Negussie et al. (2017a) highlighted that use of combinations of readily available proxies for CH4 could increase accuracy of CH4 predictions by 15 to 35%. This is mainly because different proxies describe independent sources of variation in CH4 emissions and one proxy can correct for shortcomings of others.

Several equations have been developed for proxy-based prediction of CH4 emission in dairy cattle, primarily using powerful yet expensive proxies such as feed intake and diet composition (Ellis et al., 2010; Hristov et al., 2013; Nielsen et al., 2013; Ramin and Huhtanen, 2013; Storlien et al., 2014; Appuhamy et al., 2016; Charmley et al., 2016). A comprehensive analysis of data consisting of these proxies was reported recently by Niu et al. (2018). They used multiple linear regression (MLR) models for prediction of enteric CH4 emissions based on traits such as energy intake, diet composition, and milk yield as predictor variables (Niu et al., 2018). Unfortunately, large-scale availability of data containing energy intake or DMI and diet composition is limited. Such data are especially difficult and expensive to record on commercial farms. When available, they are mainly sourced from relatively small numbers of animals or from a single herd, which limits their applicability to other regions or production systems. Furthermore, most of the prediction models used so far are based on conventional statistical methods fitting MLR models. Such models cannot approximate potentially nonlinear relationships between proxies and emissions unless they resort to generalized additive model extensions or to modeling nonlinear relationships explicitly. Therefore, the use of low-cost and routinely recorded traits (e.g., milk yield, milk composition, age, lactation stage) as predictor variables can be a practical option. For these reasons, a more comprehensive database needs to be collated to develop enteric CH4 emission prediction models at both global and regional scales (Niu et al., 2018), applying more flexible state-of-the-art statistical and analytical tools.

The smart farming revolution is a global trend based on key innovative technologies, such as the Internet of things, cloud computing, big data, and machine learning (ML), which are reshaping modern agriculture (Wolfert et al., 2017). A vast array of sensors and phenotyping platforms for farm applications are now generating an enormous and continuous stream of data. Several of the above-mentioned proxies are already being generated from such applications and more are likely to follow. These data are high throughput and relatively low cost, and can be used for development of robust models for accurate prediction of CH4 emissions. Effective and efficient utilization of information contained in such large heterogeneous data sets requires advanced and versatile statistical tools, such as ML algorithms. In predictive modeling, ML provides an excellent solution to identify hidden trends in heterogeneous and noisy data sets, and to accommodate nonlinear relationships between variables (Zhang and Ma, 2012; Al-Jarrah et al., 2015). Use of ML methods for proxy-based predictions of CH4 emissions from combined across-country heterogeneous data sets has not yet been reported. Importantly, their predictive performance in comparison with conventional statistical methods, using routinely recorded proxies for CH4, remains unexplored.
The main objectives of the current study were (1) to combine heterogeneous across-country data on routinely measured proxies for CH4 into an integrated data set; (2) to apply an ML ensemble algorithm, random forest (RF), to the proxy-based prediction of CH4 and compare its performance with that of MLR models; and (3) to explore the possibility of imputing missing data points and compare accuracy of CH4 prediction from imputed and nonimputed data sets, because combining data from heterogeneous across-country sources is bound to generate a proportion of missing records.

MATERIALS AND METHODS

Data

Data on enteric CH4 production and proxies for CH4 that are routinely collected from dairy farms were provided by 13 research centers from 10 European partner countries (Belgium, Denmark, Finland, Germany, Ireland, the Netherlands, Poland, Spain, Switzerland, and UK) of the METHAGENE consortium (EU-COST Action FA1302) on large-scale methane measurements on individual ruminants for genetic evaluations (www.methagene.eu). The data sets were from 20 herds covering a diverse mix of geographical regions and production systems. Individual cow records from different breeds, parities (1 to 3+), ages, and stages of lactation (DIM 1 to 349) were combined into a large integrated data set. Variables included in the combined data set were herd, breed, parity, DIM, BW, DMI, CH4 production, CH4 measurement method, milk yield, milk fat, and milk protein. The breeds included were mainly Holstein Friesian, Nordic Red, Brown Swiss, Norwegian Red, and Nordic crosses. The majority (90%) of the herds kept the Holstein Friesian breed and only 2 herds kept breeds other than Holstein Friesian. In all herds, BW was measured and provided as weight in kilograms along with measurement of CH4. Because of cost and associated technical difficulties, feed intake and other diet-related information are not routinely recorded in most commercial dairy herds.
As a result, extensive diet composition and feeding management information were not provided for this study, except for DMI, which a few of the participating herds were able to provide. Methane emissions were measured with 5 measurement techniques: cattle respiration chambers, the SF6 tracer technique, and sniffers (F10, Gasmet, and NG Guardian). A description of the methods and some measurement details are provided in Garnsworthy et al. (2019). In total, the combined data set included 47,129 repeated records from 3,886 cows, belonging to 5 dairy breeds (Table 1). A detailed description of the data by participating herd is provided in Supplemental Table S1 (https://doi.org/10.7910/DVN/BINDG9, Negussie, 2022).

Table 1. Descriptive statistics of CH4 and the main proxy variables included in the integrated data set

| Variable | No. of observations | Mean | SD | Minimum | Maximum |
|---|---|---|---|---|---|
| CH4, g/d | 43,519 | 372.5 | 133.2 | 100 | 983 |
| Milk yield, kg/d | 43,507 | 31.3 | 9.2 | 9.0 | 89.5 |
| Milk fat, % | 23,783 | 3.84 | 0.64 | 3.23 | 9.91 |
| Milk protein, % | 23,783 | 3.63 | 0.55 | 3.00 | 6.94 |
| DMI, kg/d | 3,427 | 20.2 | 3.3 | 7.4 | 39.3 |
| BW, kg | 43,392 | 585 | 114.5 | 478 | 955 |
| DIM | 43,519 | 156 | 87.5 | 1 | 349 |
| Parity | 43,472 | 1.99 | 1.6 | 1 | 14 |

Data Integration

Individual data sets collated from the 20 herds were standardized by making sure that all variables were expressed in the same units: milk yield in kilograms, protein and fat as percentages, and enteric CH4 as grams per day. When expressed in liters per day, enteric CH4 production was converted to grams per day using the following conversion equation:

$$\mathrm{CH_4\ (L/d)} \times 0.668 = \mathrm{CH_4\ (g/d)}, \qquad [1]$$

where 0.668 is the density of CH4 (g/L) at normal temperature and pressure (20°C and 1 atm) (Engineering ToolBox, 2003, 2004).
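As a small illustration, Equation 1 amounts to a single multiplication. The sketch below is not the authors' code; the function name `ch4_litres_to_grams` and the example value of 500 L/d are hypothetical.

```python
# Density of CH4 at 20 °C and 1 atm in g/L (the 0.668 factor of Equation 1).
CH4_DENSITY_G_PER_L = 0.668


def ch4_litres_to_grams(litres_per_day: float) -> float:
    """Convert enteric CH4 production from L/d to g/d (Equation 1)."""
    if litres_per_day < 0:
        raise ValueError("CH4 production cannot be negative")
    return litres_per_day * CH4_DENSITY_G_PER_L


# Hypothetical input value, not taken from the study data.
example_g_per_day = ch4_litres_to_grams(500.0)
print(round(example_g_per_day, 1))
```

Keeping the factor in a named constant makes it easy to swap in a density for other reference conditions if a contributing herd reports volumes at a different temperature.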
Weekly averages of CH4 production were calculated and used in data analyses. Categorical variables for breed and CH4 measurement method were standardized by making sure that all categories were labeled consistently across data sets. All date variables were standardized to a common DD-MM-YYYY format. After data standardization, data from the 20 individual herds were combined into a large integrated data set. The combined data set was then filtered by retaining only records with DIM ≤350 and CH4 ≤1,000 g/d. Outliers for enteric CH4 production (g/d; outside 3.5 SD, within herd) were also excluded. The final filtered data set contained 43,519 records from 3,483 cows.

Imputation of Missing Proxies

Integrated and filtered data from the combined 20 individual data sets contained 18.7% missing data points for DMI and milk fat and protein. A nonparametric k-nearest neighbor imputation approach (Troyanskaya et al., 2001) was used to impute missing proxy data, to obtain a larger and complete data set on which to compare the predictive performance of algorithms on imputed and nonimputed data. The imputation process first involved calculation of Gower similarities (Gower, 1971) between records as

$$S_{ij} = \frac{\sum_{z=1}^{v} \delta_{ijz} S_{ijz}}{\sum_{z=1}^{v} \delta_{ijz}}, \qquad [2]$$

where $S_{ij}$ is the similarity between samples i and j; $\delta_{ijz} \in \{0, 1\}$ is an indicator variable that specifies whether a comparison between samples i and j over variable z is possible ($\delta_{ijz} = 1$) or not ($\delta_{ijz} = 0$); and $S_{ijz}$ is the similarity coefficient at any given variable z. Similarities were calculated differently depending on whether variable z was binary, categorical, or quantitative: (1) association tables; (2) Hamming distances; (3) $1 - |x_i - x_j|/\mathrm{range}$. Similarity coefficients were then averaged over the v variables for which comparisons were possible, to give a general similarity coefficient $S_{ij}$ between samples i and j.
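To make Equation 2 concrete, the following minimal sketch computes a Gower similarity for mixed categorical and quantitative variables and uses it to fill one missing value from the most similar records. It is an illustration under stated assumptions, not the VIM implementation used in the study: the function names, the toy records (breed, milk yield, DMI), the variable ranges, and the choice of k = 2 for the toy data are all hypothetical.

```python
from collections import Counter


def gower_similarity(a, b, kinds, ranges):
    """Equation 2: mean per-variable similarity over comparable variables.

    a, b   : records as lists; None marks a missing value
    kinds  : "num" or "cat" for each variable
    ranges : range (max - min) of each numeric variable, None for categorical
    """
    total, comparable = 0.0, 0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # delta = 0: comparison not possible
        comparable += 1  # delta = 1
        if kind == "num":
            sim = 1.0 - abs(x - y) / rng if rng else 1.0
        else:
            sim = 1.0 if x == y else 0.0
        total += sim
    return total / comparable if comparable else 0.0


def knn_impute(records, target, kinds, ranges, k):
    """Return a copy of records[target] with missing values filled from
    the k most similar records that have the variable observed."""
    rec = records[target]
    filled = list(rec)
    for z, value in enumerate(rec):
        if value is not None:
            continue
        donors = [r for idx, r in enumerate(records)
                  if idx != target and r[z] is not None]
        donors.sort(key=lambda r: gower_similarity(rec, r, kinds, ranges),
                    reverse=True)
        nearest = [r[z] for r in donors[:k]]
        if not nearest:
            continue  # no usable neighbor: value stays missing
        if kinds[z] == "num":
            filled[z] = sum(nearest) / len(nearest)  # average of neighbors
        else:
            filled[z] = Counter(nearest).most_common(1)[0][0]  # majority vote
    return filled


# Hypothetical toy records: [breed, milk yield (kg/d), DMI (kg/d)].
records = [
    ["HF", 30.0, 20.0],
    ["HF", 32.0, 22.0],
    ["BS", 20.0, 15.0],
    ["HF", 31.0, None],  # DMI missing
]
kinds = ["cat", "num", "num"]
ranges = [None, 12.0, 7.0]

print(knn_impute(records, 3, kinds, ranges, k=2))
```

In this toy case the two Holstein Friesian records with similar milk yield are the nearest neighbors, so the missing DMI is filled with their average.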
Based on the matrix of Gower similarities, the k nearest neighbors of any given record were selected, and missing data points were imputed from the average value (quantitative variables) or majority vote (binary or categorical variables) of the k neighbors. In the present study, a value of k = 4 was chosen for the imputation. In a few cases, the neighborhood did not have enough information to make imputation possible; these missing values were left unimputed and the records were consequently removed. This left 40,532 records in the final imputed data set.

Data Analyses

After data integration and imputation, 3 data sets were generated for predictive modeling: (1) 3k, containing 3,337 records with no missing data on any variable (no imputation); (2) 21k, containing 21,215 records, in which only the missing DMI records (84%) were imputed; and (3) 41k, containing 40,532 records [i.e., all filtered records, in which all missing data points (81% of DMI and 48% of milk fat and protein records) were imputed]. All data sets comprised the 11 proxies used in this study to model CH4 production: breed (5 classes), country (10 classes), herd (20 classes), DIM, parity (3 classes), methane measurement technique (5 classes), DMI (kg/d; imputed in the 21k and 41k data sets), BW (kg), milk yield (kg/d), milk fat (%; imputed in the 41k data set), and milk protein (%; imputed in the 41k data set). The 3 data sets were all used to predict CH4 emissions with both RF and MLR predictive models, either including or not including DMI in the model. Figure 1 provides a visual representation of the adopted data analysis.

Prediction Models for CH4 Emissions from Proxies

Random Forest Predictive Model. Random forest is a supervised ML algorithm used for predictive modeling in both classification and regression problems (Breiman, 2001). Random forest generates and aggregates a “forest” of trees built on resampled subsets of the data.
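The idea of growing trees on resampled data and aggregating them can be sketched in a few lines. The example below is a deliberately simplified illustration, not the randomForest implementation used in the study: each "tree" is a single-split regression stump, the predictor subset is drawn once per stump (which coincides with drawing it per node only because a stump has one node), and the toy data (milk yield, DMI, CH4) and all function names are hypothetical.

```python
import random


def fit_stump(X, y, features):
    """Find the single binary split with the lowest residual sum of squares."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            rss = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or rss < best[0]:
                best = (rss, f, threshold, ml, mr)
    if best is None:  # no valid split: fall back to the sample mean
        mean = sum(y) / len(y)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, ml, mr = stump
    return ml if row[f] <= threshold else mr


def fit_forest(X, y, n_trees, m, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and m random predictors."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        features = rng.sample(range(p), m)          # random predictor subset
        forest.append(fit_stump([X[i] for i in idx],
                                [y[i] for i in idx], features))
    return forest


def predict_forest(forest, row):
    """Aggregate by averaging the predictions of all trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


# Hypothetical toy data: [milk yield (kg/d), DMI (kg/d)] -> CH4 (g/d).
X = [[20.0 + 2 * i, 16.0 + i] for i in range(8)]
y = [300.0 + 20 * i for i in range(8)]

forest = fit_forest(X, y, n_trees=50, m=1, seed=1)
print(predict_forest(forest, X[0]), predict_forest(forest, X[-1]))
```

A full random forest grows deep trees by applying the same split search recursively; the stump keeps the sketch short while preserving the bootstrap, random-subset, and averaging steps.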
In the current study, RF was used to predict individual CH4 emissions from imputed and nonimputed data sets of proxies for CH4. As a semiparametric ML ensemble algorithm, RF is robust to overfitting and able to capture complex interactions in data, thereby efficiently handling problems associated with complex and heterogeneous data structures (González-Recio and Forni, 2011). The RF works by building a large number of decision trees on bootstrapped samples of the data and random subsets of the predictor variables. Predictions from each single decision tree are then averaged to get the final prediction for CH4 emissions:

$$\hat{f}(\mathbf{x}_i) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(\mathbf{x}_i), \qquad [3]$$

where $\mathbf{x}_i$ is the vector of predictors for record i (i.e., CH4 proxies), and $\hat{f}_b(\mathbf{x}_i)$ is the corresponding predicted CH4 emission from the regression tree built on the bth bootstrapped sample of the data. Each tree heuristically minimizes the loss function (residual sum of squares) through top-down greedy recursive binary splitting. The number of trees in the forest (B), the number of variables or proxies (m) randomly sampled at each node, and the minimum node size for further partitioning were tuned based on the lowest generalization error on the out-of-bag samples. Convergence of the generalization error was assessed from 1,000 trees onward, and the lowest generalization error was obtained with m = 5 and minimum node size = 5. Equation 3 was used to build RF models to accommodate the different scenarios with and without DMI and imputation of missing DMI as well as other missing proxy variables, as described above.

Multiple Linear Regression Model. To provide a benchmark against which performance of RF models can be compared, MLR models were run for all data sets and scenarios considered.
The basic MLR model for CH4 as a function of proxies was:

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad [4]$$

where $y_i$ is methane production of individual i, $\beta_0$ is the intercept of the model, $\beta_1, \ldots, \beta_p$ are the regression coefficients of the proxy variables included in the model, $x_{ij}$ is the value for proxy j in animal i (for categorical proxies, $x_{ij}$ are indicator variables), and $\varepsilon_i$ is the residual term. The proxies indexed by j ∈ [1, p] depend on the scenario considered (data set, imputation). The QR factorization was used to solve the MLR models. Although residual variance may vary between subsets of the records (most relevantly in this study, between methane measurement techniques), homoscedasticity was assumed for the MLR model. Equation 4 was applied to the 3 data sets and all envisaged scenarios, the same as with RF models.

Accuracy of Prediction

Predictive ability of RF and MLR models was evaluated within and between herds in the 3 data sets (3k, 21k, and 41k). In each data set, inclusion or exclusion of DMI (measured or imputed) was tested. The effect of imputing missing proxies was evaluated indirectly by comparing the 3 data sets: no imputation (3k), imputation of DMI (21k), and imputation of DMI and milk fat and protein (41k). Predictive ability of the models was estimated through 10-fold cross-validation replicated 5 times.

Figure 1. Cross-validation scheme used for estimation of the predictive ability from random forest and multiple linear regression models under within- and between-herd prediction scenarios. Each model (data set size, inclusion of DMI, within- or between-herd scenario) was replicated 5 times. r = Pearson correlation between observed and predicted CH4 values; RMSE = root mean squared error; NDCG = normalized discounted cumulative gain.

The data were split into 10 partitions: 9 were used to train the model and one was used to test the model, until all 10 partitions had been used once as the test set. Multiple records from the same cow were always assigned to the same fold. In the within-herd cross-validation, records from all herds were used in both the training and test sets: CH4 records from a given cow were predicted from data of herd mates plus cows from other herds and countries. In the between-herd cross-validation, all records from one herd (each of the k herds in turn) were set aside as the test set, and the remaining k − 1 herds were used to train the model (Figure 1).

Three accuracy metrics were used to measure predictive ability of the models: (1) Pearson correlation between observed and predicted CH4 values, (2) root mean squared error (RMSE), as a percentage of the mean of the observed response variable, and (3) normalized discounted cumulative gain (NDCG). Normalized discounted cumulative gain is a ranking metric developed in information retrieval (Järvelin and Kekäläinen, 2002), which has been applied to evaluation of genomic selection models (Blondel et al., 2015). The NDCG metric evaluates the top individuals in the ranking, which are supposed to be the most relevant when comparing models. The top 20% emitters were considered here and their CH4 outputs were ranked based on observed and predicted ranks; NDCG was calculated as:

$$\mathrm{NDCG}(y, \hat{y}) = \frac{\sum_{i=1}^{k} \left[ y_{\pi(\hat{y})} \right]_i \cdot d(i)}{\sum_{i=1}^{k} \left[ y_{\pi(y)} \right]_i \cdot d(i)}, \qquad [5]$$

where $y_{\pi(\hat{y})}$ and $y_{\pi(y)}$ are the top k observed CH4 emission records according to either their predicted or observed ranking; $d(i) = 1/\log_2(i + 1)$ is the weight by which ranked values are discounted (i ∈ [1, k]); and k is the number of individuals included in the top 20%.
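The three accuracy metrics can be sketched as follows. This is an illustrative reading of the definitions above rather than the authors' code: each ranked value is discounted by log2(i + 1), the top fraction defaults to 20%, and the function names and toy CH4 values are hypothetical.

```python
import math


def pearson(obs, pred):
    """Pearson correlation between observed and predicted values."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)


def rmse_percent(obs, pred):
    """RMSE expressed as a percentage of the mean observed value."""
    mse = sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)
    return 100.0 * math.sqrt(mse) / (sum(obs) / len(obs))


def ndcg_top(obs, pred, fraction=0.2):
    """NDCG over the top `fraction` of records (Equation 5)."""
    k = max(1, round(fraction * len(obs)))
    # Observed values of the k records ranked highest by prediction.
    ranked = sorted(zip(pred, obs), key=lambda t: t[0], reverse=True)
    by_pred = [o for _, o in ranked][:k]
    # Observed values of the k records ranked highest by observation.
    by_obs = sorted(obs, reverse=True)[:k]
    dcg = sum(v / math.log2(i + 1) for i, v in enumerate(by_pred, start=1))
    ideal = sum(v / math.log2(i + 1) for i, v in enumerate(by_obs, start=1))
    return dcg / ideal


# Hypothetical observed and predicted CH4 values (g/d).
obs = [300.0, 350.0, 400.0, 450.0, 500.0, 320.0, 380.0, 420.0, 470.0, 520.0]
pred = [310.0, 340.0, 390.0, 470.0, 480.0, 330.0, 400.0, 410.0, 450.0, 530.0]

print(pearson(obs, pred), rmse_percent(obs, pred), ndcg_top(obs, pred))
```

Because NDCG only looks at the top of the ranking, a model can score well on it while having a modest overall correlation, which is why the three metrics are reported side by side.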
The NDCG values range between 0 and 1, with values close to 1 indicating better ability of the model to correctly identify the most relevant individuals (e.g., the cows that emit more CH4 than other cows).

Variable Importance

In predictive modeling, important proxies drive the outcome of the model and have a significant effect on accuracy of prediction. In RF, the relative importance of proxies included in the predictive models was automatically retrieved. The relative importance of proxies was estimated by running the out-of-bag samples through the RF trees after randomly permuting the values of each proxy variable and comparing the resulting predictive accuracy (or loss function) with that obtained from the original (nonpermuted) data. The relative importance of proxies was then scaled to the [0, 100] range, which provided insights into the predictive and biological roles played by the proxies in prediction of CH4 emissions from cows. Two measures of variable importance were used: percent increase in mean squared error and increase in node purity.

Software and Computing Environment

All data handling, processing, and analysis were performed using the R environment for statistical computing (https://r-project.org). Specifically, the R package VIM (Kowarik and Templ, 2016) was used for imputation of missing proxy data and the R package randomForest (Liaw and Wiener, 2002) was used for RF predictions. For the MLR models, we used the lm() function from the stats R package, which uses QR factorization to solve the model (https://r-project.org). Plots were generated using the ggplot2 R package (Wickham, 2009).

RESULTS

Integrated Across-Country Data

Average enteric CH4 emission across the 20 herds in 10 European countries was 372.5 g/d (±133.2; SD) and ranged from 280 to 543 g/d (Table 1).
Figure 2 summarizes CH4 emissions by herd and shows that there is marked variability between herds, with most (two-thirds) measurements lying between 250 and 500 g/d. The effect of country of origin on methane emissions was tested in a multiple regression model. In a naive model, the effect of country of origin was significant (P < 0.01). When accounting for co-dependencies between records (cows nested within herds, nested within countries), the herd effect absorbed a large portion of the variation from the country effect, which was no longer significant (P > 0.01). This indicates that CH4 emission levels vary between countries mainly because of between-herd variation.

Regarding the distribution of CH4, visual inspection of the per-country histograms and numerical inspection of the descriptive statistics showed that the mean and median for CH4 correspond very well. However, the formal Shapiro-Wilk test indicated some deviations from normality in all countries except 3. This may be linked to different sample sizes and different CH4 variability among countries. Additionally, these deviations are limited (in the sense that distributions do not appear to be binary, bimodal, or strongly skewed) and linear models are known to be rather robust to deviations from normality and distributional assumptions (Schielzeth et al., 2020; Knief and Forstmeier, 2021). General summary statistics for the different proxy variables in the integrated data set are given in Table 1.

Prediction Accuracy of CH4 from RF Versus MLR Models

Within-Herd CH4 Prediction Accuracy. Table 2 shows average Pearson correlations, RMSE, and NDCG for all predictive models and scenarios from both RF and MLR. Figures 3, 4, and 5 show RF results for each of the 5 replicates per model and scenario.
In the 3k data set, when measured DMI was included in the RF model, prediction accuracy measured as Pearson correlation between observed and predicted CH4, r(y, ŷ), increased from 0.52 to 0.77, RMSE was reduced from 31.3 to 23.3, and NDCG increased from 0.75 to 0.89 (Table 2). In the 21k data set, when missing DMI records were imputed and included in the prediction model, r(y, ŷ) increased from 0.80 to 0.84, RMSE declined from 20.0 to 18.5, and NDCG increased from 0.91 to 0.92. In the 41k data set, when all missing variables including DMI were imputed and included in the prediction model, r(y, ŷ) increased slightly from 0.81 to 0.82, RMSE decreased from 20.6 to 20.0, and NDCG remained the same. When moving from the 3k to the 21k and 41k data sets, predictions became progressively less variable.

Within-herd prediction accuracy from RF models varied with CH4 measurement method. In general, CH4 measurements from chambers tended to be predicted more accurately and more robustly (lower between-replicate variability) than CH4 records from sniffers and SF6. This was especially true when DMI was not included in the model, although with DMI included in the model sniffers gave results comparable to chambers. The SF6 technique almost always gave the least accurate predictions and was very variable across replicates (Figures 3 to 5).

Marked differences in within-herd prediction accuracy were found between predictive models (RF vs. MLR). Prediction accuracies from the MLR model were consistently lower than those from RF across data sets and scenarios. For the 3k data set, when prediction was made using MLR instead of RF, r(y, ŷ) declined from 0.77 to 0.50, RMSE increased from 23.3 to 32.7, and NDCG decreased from 0.89 to 0.73. For the 21k data set, within-herd r(y, ŷ) declined from 0.84 to 0.75, RMSE increased from 18.5 to 22.7, and NDCG decreased from 0.92 to 0.90. Similarly, for the 41k data set, within-herd r(y, ŷ) declined from 0.82 to 0.79, RMSE increased from 20.0 to 21.4, and NDCG was the same at 0.86 for both MLR and RF predictive models.

Figure 2. Distribution of mean methane production (g/d) across herds in the combined data set. Boxes correspond to the interquartile range (IQR); whiskers are 1.5 times the IQR on both sides; dots describe data points that fall outside the ±1.5 × IQR boundaries around the box.

Between-Herd CH4 Prediction Accuracy. For the between-herd scenario, a general pattern toward more accurate and especially less variable predictions with increasing data size was again observed, although it was somewhat less clear than in the within-herd scenario. In the 3k data set, inclusion of measured DMI in the prediction model increased between-herd r(y, ŷ) to 0.13 and 0.58, reduced RMSE to 42.7 and 31.9, and increased NDCG to 0.56 and 0.82 in MLR and RF models, respectively (Table 2). In the 21k data set, when missing DMI was imputed, between-herd r(y, ŷ) increased to 0.14 and 0.63, RMSE decreased to 32.7 and 29.9, and NDCG decreased to 0.87 and 0.83 in MLR and RF models, respectively. In the 41k data set, when missing proxy variables were imputed, between-herd r(y, ŷ) increased to 0.21 and 0.39, RMSE decreased to 39.6 and 33.6, and NDCG decreased to 0.83 and 0.65 in MLR and RF models, respectively.

Between-herd CH4 prediction accuracies among herds using different CH4 measurement methods are shown in Figures 3 to 5. In general, whether or not DMI was included in predictive models, between-herd prediction accuracies for chamber CH4 measurement methods were higher than for sniffer CH4 measurement methods.
For instance, when measured DMI was included or missing DMI was imputed and included in the prediction model, between-herd r(y, ŷ) for chamber measurement methods was 20% higher than for sniffer measurement methods. Similarly, RMSE for chamber CH4 measurement methods was 30 to 50% lower than for sniffer measurement methods. However, when the NDCG metric was used, only a small difference was observed in between-herd prediction accuracy between chamber and sniffer measurement methods.

The RF and MLR predictive models had varied between-herd prediction accuracy over the 3 different data sets (Table 2). For instance, in the 3k data set, between-herd r(y, ŷ) declined from 0.58 to 0.13, RMSE increased from 31.9 to 42.7, and NDCG decreased from 0.82 to 0.56 when the MLR model was used instead of the RF model. Similarly, for the 21k data set, between-herd r(y, ŷ) declined from 0.63 to 0.14, RMSE increased from 29.9 to 38.4, and NDCG increased from 0.83 to 0.87. For the larger 41k data set, in which all missing proxy variables were imputed, between-herd r(y, ŷ) declined from 0.39 to 0.21, RMSE increased from 33.6 to 39.6, and NDCG increased from 0.65 to 0.83. In general, the differences between the models in r(y, ŷ), RMSE, and NDCG in Table 2 were all significant (P < 0.0001) except for the 41k data with imputation and DMI in the model, where the difference was not significant (P = 0.28).

Table 2. Within- and between-herd predictive ability of random forest (RF) and multiple linear regression (MLR) models with different data sets (3k, 21k, 41k), measured by 3 accuracy metrics (Pearson correlation, RMSE, and NDCG) from 10-fold cross-validation, averaged over 5 replicates

Prediction  Size  Proxies  DMI       No. of CV   Within-herd                Between-herd
model             imputed  included  replicates  r(y, ŷ)¹  RMSE²  NDCG³     r(y, ŷ)  RMSE  NDCG
RF          3k    No       No        5           0.52      31.3   0.75      0.32     35.4  0.73
RF          3k    No       Yes       5           0.77      23.3   0.89      0.58     31.9  0.82
RF          21k   No       No        5           0.80      20.3   0.91      0.33     33.1  0.65
RF          21k   Yes      Yes       5           0.84      18.5   0.92      0.63     29.9  0.83
RF          41k   Yes      No        5           0.81      20.6   0.86      0.20     34.8  0.55
RF          41k   Yes      Yes       5           0.82      20.0   0.86      0.39     33.6  0.65
MLR         3k    No       No        5           0.28      36.6   0.70      0.12     45.6  0.56
MLR         3k    No       Yes       5           0.50      32.7   0.73      0.13     42.7  0.56
MLR         21k   No       No        5           0.71      22.4   0.89      0.07     38.4  0.70
MLR         21k   Yes      Yes       5           0.75      22.7   0.90      0.14     32.7  0.87
MLR         41k   Yes      No        5           0.77      21.6   0.86      0.19     41.4  0.81
MLR         41k   Yes      Yes       5           0.79      21.4   0.86      0.21     39.6  0.83

¹ Pearson correlation (r) between observed and predicted CH4 production.
² Root mean squared error (expressed as a percentage of the mean CH4 production, g/d).
³ Mean normalized discounted cumulative gain.

Variable Importance

When building decision trees, RF computes how much each variable contributes to the prediction, which is a measure of variable importance. Two measures of variable importance were used in the present study: percent increase in mean square error and increase in node purity, which are presented in Figure 6. When DMI was not included in the prediction model, the most relevant proxies for prediction were DIM, milk yield, BW, milk fat, and milk protein. When DMI was included in the prediction model, DMI ranked first in variable importance, followed by milk yield, DIM, milk fat, and BW. In general, breed, parity, and CH4 measurement method ranked at the bottom of variable importance.

DISCUSSION

Accurate inventories of GHG emission are essential to reflect a country’s national emissions from livestock production systems. Productivity and associated emissions intensity of livestock farming differ widely around the world and the potential for change is large (Niu et al., 2018).
Understanding national, regional, and global variations in GHG emissions is therefore essential for concerted global actions to mitigate emissions (Hristov et al., 2018). Particularly at a time when there are uncertainties in the proportion of the increase in CH4 emissions solely attributable to livestock sources (Hristov et al., 2018), accurate estimation of CH4 emissions across different national borders and production systems is needed.

Average CH4 production (g/d) from ruminants varies with diet, animal populations (e.g., species and breeds), production system, production level, and DIM (Bell et al., 2014; de Haas et al., 2011; Garnsworthy et al., 2012; Negussie et al., 2017b). Different estimates of average CH4 production of dairy cows were reported from different production systems (Waghorn et al., 2008; O’Neill et al., 2011; Hellwing et al., 2013; Deighton et al., 2014; Negussie et al., 2017b; Niu et al., 2018). Using data compiled over 8 experiments and covering 30 diets, Hellwing et al. (2013) reported an average CH4 production of 412 g/d for Danish lactating cows. Negussie et al. (2017b), working on Nordic Red cows, estimated an average CH4 production of 396 g/d, whereas Bayat et al. (2017), using chambers, reported a range from 335 to 492 g/d in an experiment designed to test different dietary treatments. Bell et al. (2014) reported an average CH4 production of 418 g/d with a range between 220 and 480 g/d in 1,964 lactating Holstein Friesian cows across 21 UK herds. In Australia (Williams et al., 2013; Deighton et al., 2014; Moate et al., 2014), CH4 emission ranged from 369 to 458 g/d for cows fed harvested pasture grass. From other pasture grass-based dairy production systems, somewhat different values have been reported. For instance, O’Neill et al. (2011) reported CH4 emissions of 251 g/d in cows fed harvested perennial ryegrass. In New Zealand, Waghorn et al.
(2008) reported CH4 emissions ranging from 273 to 352 g/d in cows fed harvested pasture grass. This selection of results demonstrates the large between-country variability. In our combined across-country data, estimated overall mean CH4 production was 372 g/d, which is in the middle of the ranges described above. Our estimate also corresponds to recent estimates from a combined regional data set reported by Niu et al. (2018). Using a global data set collated from the United States, European Union, Australia, and New Zealand, the authors reported mean CH4 production of 345 g/d per cow for the European Union, 354 g/d per cow for the United States, and 347 g/d per cow for the combined intercontinental data set.

Figure 3. Within-herd (top) and between-herd (bottom) prediction accuracies in terms of Pearson correlations r(y, ŷ) between observed and predicted CH4 emissions (g/d) by CH4 measurement method and prediction scenarios within 3k, 21k, and 41k data sets. Colors indicate inclusion (blue) or not (red) of DMI in the predictive model; shapes indicate imputation (triangles) or no imputation (dots) of missing proxies in the data set. Data points represent the 5 replicates of each predictive model. The gray horizontal bars are average prediction accuracies per scenario.

Accuracy of Proxy-Based Prediction of CH4 Using RF Versus MLR Models

In a comprehensive review, Negussie et al. (2017a) concluded that whenever direct animal measurements are difficult and expensive to procure, use of combinations of proxies for CH4 in empirical prediction equations has great potential. Empirical models involving different predictor variables have long been used to predict CH4 emissions from cows, some as early as the 1930s (Kriss, 1931).
There are several examples of such quantitative approaches to predict CH4 production in cattle using mainly dietary and animal factors as proxies (Kebreab et al., 2008; Ellis et al., 2010; Ramin and Huhtanen, 2013; Appuhamy et al., 2016; Niu et al., 2018; Benaouda et al., 2019). Appuhamy et al. (2016) listed 40 such models that were developed in North America, Europe, Australia, and New Zealand. They suggested that comprehensive CH4 emission models need examining and testing against CH4 emission measurements from dairy cows in different regions of the world, an idea that was recently implemented in Benaouda et al. (2019). So far, although many prediction models have been reported, a closer look at most empirical models indicates that there is still a range of limitations that may preclude their practical applicability. These limitations include the following: (1) Most prediction models are not based on individual cow observations but on treatment means from different studies. Depending on sample size and other factors including measurement methods, this can be associated with different degrees of uncertainty (e.g., SD) (Appuhamy et al., 2016). (2) In most cases, data sets used as an input were from a single herd, a specific diet, or from only a few labs (Blaxter and Clapperton, 1965; Yan et al., 2000; Jentsch et al., 2007; Ellis et al., 2010). (3) Most prediction models were based on measurements from relatively small numbers of animals, which may limit their broad applicability. (4) Most prediction models used simple or MLR analyses without appropriate modeling of the fixed and random components (Ramin and Huhtanen, 2013). In many instances, possible nonlinear relationships in the data were not taken into consideration, leading to biased estimates of parameters (St-Pierre, 2001). (5) As enteric CH4 emissions are strongly related to feed intake, almost all models included a measure of intake such as DMI, intake of gross energy or metabolizable energy, or fiber intake, as prime predictor variables. However, these variables are currently not readily available under commercial conditions and none of them is routinely measured on individual animals on-farm. Thus, there is a need to develop robust prediction models that do not rely completely on feed intake measures or estimates (Hristov et al., 2018). Nevertheless, models without these variables (such as DMI or dietary composition) could be less accurate and thus, during model development, it is essential to consider the trade-offs between cost, practicability, and prediction accuracy (Appuhamy et al., 2016). (6) Finally, with the advent of the smart farming revolution, new phenotyping platforms such as sensors, on-line recording, and imaging tools have started to generate an enormous amount of data on proxies for CH4 from heterogeneous sources. Compilation, analysis, and utilization of such large sets of information require the latest, robust, and versatile statistical tools, which are now common in the ML approach. However, none of the CH4 prediction models reported so far in dairy cattle has attempted to use ML algorithms, and our study represents the first such effort in this direction.

Figure 4. Within-herd (top) and between-herd (bottom) prediction accuracies in terms of root mean squared error (RMSE) by CH4 measurement method and prediction scenarios within 3k, 21k, and 41k data sets. Colors indicate inclusion (blue) or no inclusion (red) of DMI in the predictive model; shapes indicate imputation (triangles) or no imputation (dots) of missing proxies in the data sets. Data points represent the 5 replicates of each predictive model. The gray horizontal bars are average prediction accuracies per scenario.
Machine learning algorithms have great potential for identifying hidden trends in unstructured heterogeneous data sets and offer predictive modeling that can accommodate nonlinear relationships among variables. Combining across-country heterogeneous data along with the application of ML algorithms should be a logical step toward developing robust and globally relevant CH4 prediction models (Negussie et al., 2019). The robustness of ML methods is partly due to their extraordinary ability to input and use information from heterogeneous sources. Consequently, predictions from the ML model RF are reliable and robust and are applicable under diverse production and environmental conditions. Furthermore, the approach outlined in the current study offers analytical tools to support future attempts to build globally representative large-scale GHG emission databases, on the basis of which accurate regional and intercontinental inventories as well as concerted global mitigation strategies could be developed.

Figure 5. Within-herd (top) and between-herd (bottom) prediction accuracies in terms of mean normalized discounted cumulative gain (NDCG) by CH4 measurement method and prediction scenarios within 3k, 21k, and 41k data sets. Colors indicate inclusion (blue) or no inclusion (red) of DMI in the predictive model; shapes indicate imputation (triangles) or no imputation (dots) of missing proxies in the data sets. Data points represent the 5 replicates of each predictive model. The gray horizontal bars are average prediction accuracies per scenario.

Within-Herd and Between-Herd Methane Prediction Accuracy

Both in within- and between-herd predictions, inclusion of DMI, either measured or imputed, increased the predictive ability of RF and MLR models.
As expected for decreasing marginal increments, the largest prediction accuracy increase was observed for the 3k data set (from RF models: 17–48% for within-herd predictions and 1–83% for between-herd predictions, depending on the metrics used). In modeling the stage of lactation, quadratic terms or high-order polynomials can be used to account for possible nonlinear relationships. In Equation 4, an additive model with linear terms only was chosen. We also tested an MLR model with a quadratic DIM term, but no significant changes in the results (negligible or no improvements in predictive accuracy) were observed. The modeling of lactation stage for CH4 prediction could be an area for further investigation.

Between-herd predictions are more challenging than within-herd predictions. For between-herd predictions, RF showed a much better performance than MLR, with accuracy increasing by as much as 350% for r(y, ŷ) with 21k data. This enhanced performance substantiates the robustness of RF predictions and the ability of RF models to make effective use of information coming from different herds with heterogeneous management and farm routines. Overall, across the 3 data sets and accuracy metrics, between-herd prediction accuracies were lower than within-herd prediction accuracies, which is in line with the results reported in Wang and Bovenhuis (2019). This is probably because observations of an individual cow will predict the CH4 output of its herd mates with higher accuracy than the CH4 output of animals in other herds in the combined across-country data. Furthermore, diet composition and other factors that influence CH4 output vary less within herds than between herds.
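The difference between the two validation schemes can be made concrete with a short sketch. It is an illustration under stated assumptions, not the procedure used in the study: the function names, the round-robin assignment of herds to folds, and the fixed seed are all hypothetical choices. In the within-herd scheme, records are assigned to folds at random, so herd mates of a test cow usually appear in the training data; in the between-herd scheme, all records of a herd fall in the same fold, so a herd is never in training and test data at once.

```python
import random
from collections import defaultdict

def within_herd_folds(n_records, k=10, seed=0):
    """Random k-fold split over records; herds are mixed across folds."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def between_herd_folds(herd_ids, k=10, seed=0):
    """Block k-fold split: every record of a herd lands in the same fold."""
    by_herd = defaultdict(list)
    for i, herd in enumerate(herd_ids):
        by_herd[herd].append(i)
    herds = sorted(by_herd)
    random.Random(seed).shuffle(herds)
    # With fewer herds than requested folds, use one fold per herd.
    folds = [[] for _ in range(min(k, len(herds)))]
    for j, herd in enumerate(herds):
        folds[j % len(folds)].extend(by_herd[herd])
    return folds

# Hypothetical example: 12 records from 3 herds.
herd_ids = ["A"] * 5 + ["B"] * 3 + ["C"] * 4
random_folds = within_herd_folds(len(herd_ids), k=3)
block_folds = between_herd_folds(herd_ids, k=3)
```

Each fold in turn serves as the test set while the model is trained on the remaining folds; only the way records are grouped into folds differs between the two schemes.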
Evaluating random cross-validation and block cross-validation (with farms as blocks), which correspond to within- and between-herd cross-validation in our study, Wang and Bovenhuis (2019) reported that random cross-validation could result in an overoptimistic view of the ability of milk IR spectra to predict CH4 emission and lead to misleading conclusions. Roberts et al. (2017) explained that when validation data are randomly selected for cross-validation from the entire spatial domain, training and validation data from nearby locations will be dependent (spatial autocorrelation). Consequently, if the objective is to project outside the spatial structure of the training data, error estimates from random cross-validations will be overly optimistic. To address this, they suggested that blocks can be designed across the spatial structure itself (i.e., in contiguous geographic space). This effectively forces testing on more spatially distant records, thus decreasing optimism in error estimates, which underscores the power and practicality of the between-herd or block cross-validation as implemented in the current study. In addition to cross-validation with nonrandom blocks, carefully chosen modeling objectives can offer more reliable error estimates (Roberts et al., 2017).

Figure 6. Variable importance in terms of percent mean square error reduction (%IncMSE) and increase in node purity (IncNodePurity) from the random forest model (A) without including DMI and (B) with inclusion of DMI in the prediction model.

When comparing methods used for CH4 measurement, records coming from respiration chambers consistently displayed the most accurate and least variable predictions, across data sets, scenarios, and accuracy metrics.
Possible reasons for this include within-day variation in CH4 emissions, which may not be accounted for in spot-sample sniffer techniques, and the influence of herd and environmental variability on sniffer measurements. In addition, sniffer measurements are influenced by cow activity, feeding behavior, and relationships between herd mates, which are excluded when cows are placed in chambers (Garnsworthy et al., 2019). As a result, sniffer data were not as robust as chamber data in predicting CH4 emission in other herds. Nevertheless, when within-herd prediction accuracies were compared, sniffer data were as accurate as chamber data in prediction of CH4, as they are mostly tailored to specific herd environments. On the other hand, for the SF6 method, when DMI was added or missing proxies were imputed, within-herd prediction accuracies were close to estimates from the chamber herds. However, estimates from SF6 were in general highly variable owing to the small number of observations available for the method.

Variable Importance and Effect of Imputation of Missing Data Points

In predictive statistics, it is fundamentally important to have an accurate model; however, it may also be desirable to have a model that is easy to interpret, and where the variables that contribute most to predictive ability can be identified. In the present study, relative contributions of proxy variables were provided by RF models. When DMI was not included in the model, the proxies that contributed most to prediction accuracy were DIM, milk yield, BW, milk fat, and milk protein. On the other hand, when included in the model, DMI was identified as the most important variable by all the metrics used to measure variable importance. In contrast, breed, parity, and CH4 measurement method were the variables that contributed least to prediction accuracy.
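The idea behind an error-based importance measure such as the percent increase in mean square error can be sketched with a generic permutation procedure: the values of one predictor are shuffled, which breaks its link with CH4, and the resulting increase in prediction error is recorded. The sketch below is a simplified illustration and not the computation performed inside the RF software used in the study; in particular, it works on whatever data are passed in rather than on out-of-bag records, it reports the raw rather than the normalized increase in MSE, and the toy predictor (20 g of CH4 per kg of DMI) is a hypothetical stand-in for a fitted model.

```python
import random

def mse(y, y_hat):
    """Mean squared error between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def permutation_importance(predict, rows, y, n_repeats=5, seed=0):
    """Mean increase in MSE after shuffling each predictor in turn.

    predict: function mapping one record (dict) to a predicted value.
    rows:    list of records, each a dict of predictor name -> value.
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in rows])
    importance = {}
    for name in rows[0]:
        increases = []
        for _ in range(n_repeats):
            column = [r[name] for r in rows]
            rng.shuffle(column)
            # Copy each record with only the current predictor replaced.
            shuffled = [dict(r, **{name: v}) for r, v in zip(rows, column)]
            increases.append(mse(y, [predict(r) for r in shuffled]) - baseline)
        importance[name] = sum(increases) / n_repeats
    return importance

# Hypothetical toy data: CH4 depends on DMI only, parity is irrelevant.
rows = [{"dmi": d, "parity": p}
        for d, p in [(16, 1), (18, 2), (20, 1), (22, 3), (24, 2), (26, 1)]]
y = [20 * r["dmi"] for r in rows]
scores = permutation_importance(lambda r: 20 * r["dmi"], rows, y)
```

In this toy case shuffling DMI inflates the error whereas shuffling parity leaves it unchanged, which mirrors the ranking pattern described above, where DMI ranked first and parity near the bottom.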
Adding DMI to the prediction models is expected to increase accuracy of prediction of CH4 because of the clear biological relationship between DMI and CH4 production. For instance, in dairy cows that consume more feed, more CH4 is produced due to the greater availability of substrate for microbial fermentation (Hristov et al., 2018). Conversely, increased intake may potentially increase passage rate and shorten digesta retention time in the rumen, thus decreasing rumen fermentation and organic matter digestibility, which ultimately decreases CH4 per unit of feed (Boadi et al., 2004). Dry matter intake and ME intake are the variables most used for prediction of CH4 emission (Johnson and Johnson, 1995; Mills et al., 2003; Ellis et al., 2007). Consequently, prediction equations including such energy intake variables showed low root mean square prediction error (RMSPE) and are therefore important in prediction of enteric CH4 emission (Sobrinho et al., 2019). Ellis et al. (2007) also confirmed that use of DMI in prediction equations for CH4 emission in cattle resulted in lower RMSPE. Sobrinho et al. (2019), working on Nellore cattle, reported that equations that included intakes of DM, total carbohydrate, ME, cellulose, and nonfiber carbohydrates were the most accurate for the prediction of enteric CH4 emission. In our study, however, DMI was measured in few herds, and data on other dietary variables were not available. Therefore, imputation of missing DMI data points and their inclusion in prediction models had a marked positive effect on prediction accuracies, especially on between-herd prediction accuracies. This was particularly true for herds using the sniffer method, the majority of which did not have measured DMI records. This is consistent with reports in the literature (Appuhamy et al., 2016; Bayat et al., 2017). Appuhamy et al.
(2016) evaluated 40 prediction equations using data that included measured or estimated DMI and some feed quality attributes. They reported that models us- ing estimated DMI predicted enteric CH4 emissions as accurately as the measured DMI, provided DMI could be estimated with reasonable accuracy. They also re- ported that enteric CH4 emissions from dairy cows can be predicted successfully (RMSPE = 12.7%) without DMI, but more accurately (RMSPE = 7.7%) with estimated DMI. Similarly, in our study, using hetero- geneous across-country data RMSPE for RF models ranged from 23.3 to 31.3% when DMI was not in the prediction model and ranged from 18.5 to 20.3% when imputed DMI was included in the prediction model. This in general indicates that by imputing missing DMI data points it is possible to achieve satisfactory predic- Negussie et al.: PROXY-BASED RANDOM FOREST PREDICTION OF METHANE Journal of Dairy Science Vol. 105 No. 6, 2022 tion of CH4 emissions provided the right statistical and imputation methods are used. Niu et al. (2018) evaluated the potential contribu- tion of predictor variables by adding sequentially each of the predictor variables during a model development process. They observed that accuracy of prediction of CH4 production was improved in models that included DMI, diet composition, milk production and composi- tion, and BW. In particular, complex models that used all available variable information consistently improved prediction performance compared with simpler models. However, models using only milk yield or diet composi- tion were the least accurate. When DMI was removed from the model to predict CH4 production, ECM was selected instead due to its high correlation with DMI, but model predictive ability was reduced. This is con- sistent with results of the current study where, when available, DMI was ranked first, followed by milk yield and milk compositional variables. 
Under the scenario in which DMI was omitted from training of the regression trees, RF ranked milk yield and milk compositional traits at the top. Breed, parity, and methods used for the measurement of CH4 ranked lowest, indicating their relatively low importance, which could also be due in part to their correlations with highly related variables.

Benefits of Heterogeneous Across-Country Data and Potentials of Machine Learning in Predictive Modeling

In recent years, marked progress has been made in developing empirical CH4 prediction models at national, regional, and global levels (Ellis et al., 2010; Appuhamy et al., 2016; Niu et al., 2018). Despite this progress, much remains to be done. In particular, collating diverse and heterogeneous intercontinental data into one usable set, data harmonization, standardization, model validation, and correction for heterogeneity of variances in across-country data are all areas of interest. The statistical methods most used in developing CH4 predictive models to date have been questioned because of their limitations (Hristov et al., 2018), such as not including random effects of animals or studies. Furthermore, models based on MLR assume predominantly linear relationships among the features of the target variables, although nonlinear relationships in the data set may be equally likely. To the best of our knowledge, the current study is one of the first attempts to predict CH4 by applying ML to combined data on low-cost, routinely recorded proxies for CH4 from individual animals and from diverse international sources. In our heterogeneous data set, because most herds had no measured DMI records, most of the missing DMI data points were imputed from routinely recorded proxy variables.
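Imputation of a missing value from correlated proxies by k-nearest neighbors can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name, the standardization of the proxies, the Euclidean distance, the unweighted mean of the k donors, and the example values are all hypothetical choices. A record with missing DMI receives the average DMI of the k complete records that are closest to it in terms of the routinely recorded proxies.

```python
import math

def knn_impute(records, target, features, k=3):
    """Fill missing `target` values with the mean of the k nearest donors.

    records:  list of dicts; a missing target is stored as None.
    features: proxy names used to measure similarity (must be complete).
    Returns a new list; the input records are not modified.
    """
    donors = [r for r in records if r[target] is not None]
    # Standardize each proxy so that no single unit dominates the distance.
    stats = {}
    for f in features:
        values = [r[f] for r in records]
        mean = sum(values) / len(values)
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
        stats[f] = (mean, sd)

    def z(r):
        return [(r[f] - stats[f][0]) / stats[f][1] for f in features]

    completed = []
    for r in records:
        if r[target] is None:
            nearest = sorted(donors, key=lambda d: math.dist(z(r), z(d)))[:k]
            r = dict(r, **{target: sum(d[target] for d in nearest) / len(nearest)})
        completed.append(r)
    return completed

# Hypothetical example: one cow without a measured DMI (kg/d).
cows = [
    {"milk": 20, "bw": 600, "dmi": 18.0},
    {"milk": 22, "bw": 610, "dmi": 19.0},
    {"milk": 40, "bw": 700, "dmi": 26.0},
    {"milk": 21, "bw": 605, "dmi": None},
]
filled = knn_impute(cows, target="dmi", features=["milk", "bw"], k=2)
```

The choice of k and of the proxies used for matching controls the trade-off between smoothing and fidelity, so the accuracy of the imputed values should be checked against records with measured DMI wherever these are available.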
The finding that imputed DMI records improved prediction accuracies clearly shows that CH4 output could potentially be predicted with reasonable accuracy from routinely available variables and estimated DMI, provided that DMI is imputed accurately. This opens a great opportunity to include many herds in large global or regional databases for intensive data analysis, such as across-country genetic evaluations, when direct measurements of DMI are not available. In general, addition of many routinely recorded predictor variables into the predictive model can contribute to increased prediction accuracy. Therefore, our results emphasize the great value of using proxy variables that are recorded routinely on-farm and imputing missing DMI observations to generate a reasonably accurate prediction of CH4 when direct measurements of CH4 from individual animals are difficult or expensive to obtain on a large scale.

Because predictive ability of models is likely to be enhanced with increasing model complexity (Moraes et al., 2014), during model development the trade-off between availability of variable inputs on-farm and prediction accuracy must be carefully considered. In this regard, a review by Negussie et al. (2017a) provided an extensive list of potential proxies with critical evaluation of their attributes. The present study substantiated the practical use of such proxies in improving the accuracy of prediction of CH4. Therefore, from now onward, relaxing the threshold on predictor variables to include low-cost and routinely available proxies will open possibilities to collate more and diverse information on GHG emissions from as many regions and production systems as possible. More and diverse information, when properly compiled and analyzed, will have significant implications for improving the accuracy of predictions. However, Hristov et al.
(2018) highlighted that one of the current challenges is to make more data available and that the next frontier should be the collation of GHG information from as wide and diverse livestock populations as possible to develop robust models that are globally applicable. At the same time, efforts should also be directed toward sharpening analytical tools by which important missing data points could be accurately imputed from routinely measured proxies.

Future Considerations

Predictive modeling has great significance and value, particularly for traits that are difficult and expensive to record routinely under commercial farm conditions. This includes traits such as DMI and CH4 emission. The accuracy of predictive models can be influenced by the type and the completeness of the data used. In most situations, especially with integrated data from diverse sources, missing variables are quite common. The imputation of missing variables from other correlated predictor variables is a common solution to increase sample size and hence accuracy, as shown in this study. However, imputation techniques can introduce uncertainty in the data, with trade-offs between increase in sample size and accuracy on one hand and possible increase in uncertainty on the other. Our results showed the benefits of imputation in terms of predictive performance across the different scenarios. Nevertheless, future studies on the measurement of accuracy of imputations are warranted. Furthermore, heterogeneity of residual variances is one important area. In multilevel models, residual variance may vary between subsets of the data, such as between methane measurement techniques, and this was not accounted for in our MLR model in Equation 4, where homoscedasticity was assumed.
In classical statistical modeling, notably in mixed linear models for animal genetics, heterogeneity of variance is known to bias the estimates of model solutions (e.g., Visscher and Hill, 1992). However, Hill (1984) has shown that ignoring heterogeneity of variances decreases the efficiency of genetic evaluation procedures and consequently the response to selection. Generally, less obvious is the effect of heterogeneous variance on the performance of predictive models in ML, where most of the methods used (including RF) do not rely on the homoscedasticity assumption. In a simulation study, W. Ruth and T. Loughin (Simon Fraser University, Burnaby, BC, Canada; unpublished data) showed that heterogeneity of variance negatively affects the performance of single regression trees. However, less clear is how this could affect the performance of ensemble methods such as RF, which are based on a large number of trees that could counteract the negative effects of heterogeneous variance by reducing the variance of predictions through averaging over trees. Assessing the effect of heterogeneous variance on the predictive ability of ML models for prediction of CH4 emission is an interesting topic that will be taken up in our follow-up studies.

In conclusion, the present study describes a novel way forward for developing accurate and robust CH4 prediction models. These models will help in designing effective and sustainable GHG mitigation strategies as well as aiding national, regional, and global GHG inventories. The broad applicability of such models requires collation of input data from wide and heterogeneous sources. It helps to overcome the difficulty of procuring predictor variables related to intake and diet composition on-farm, and the need for versatile statistical tools for compiling and analysis of unstructured, heterogeneous across-country data.
In this way, low-cost and routinely measured proxy variables can be used to provide a reasonably accurate prediction of CH4 when coupled with imputation of missing DMI data points. As a predictive model, the ML ensemble algorithm RF consistently gave more accurate predictions than conventional multiple regression models. This provides a great potential for building a globally representative large-scale CH4 emission database on the basis of which an accurate regional and intercontinental inventory as well as a concerted global mitigation strategy could be developed. Results from this study lay strong foundations for our next thorough comparison of various state-of-the-art ML methods for prediction of dairy-cow CH4 emissions in a much larger integrated global data set.

ACKNOWLEDGMENTS

This paper is the result of the concerted effort of all participants and support from the networks of COST Action FA1302 “METHAGENE: Large-scale methane measurements on individual ruminants for genetic evaluations.” The authors thank all individuals and groups who have directly or indirectly contributed to this work; special thanks are due to the technical and financial support from the COST Action FA1302 of the European Union. In addition, all financial and technical support from all participating countries and research centers involved in this work is gratefully acknowledged. The authors have not stated any conflicts of interest.

REFERENCES

Al-Jarrah, O. Y., P. D. Yoo, S. Muhaidat, G. K. Karagiannidis, and K. Taha. 2015. Efficient machine learning for big data: A review. Big Data Research 2:87–93. http://dx.doi.org/10.1016/j.bdr.2015.04.001.
Appuhamy, J. A. D. R. N., J. France, and E. Kebreab. 2016. Models for predicting enteric methane emissions from dairy cows in North America, Europe, and Australia and New Zealand. Glob. Chang. Biol. 22:3039–3056. https://doi.org/10.1111/gcb.13339.
Bayat, A. R., L. Ventto, P. Kairenius, T. Stefański, H.
Leskinen, I. Tapio, E. Negussie, J. Vilkki, and K. J. Shingfield. 2017. Dietary forage to concentrate ratio and sunflower oil supplement alter rumen fermentation, ruminal methane emissions, and nutrient utilization in lactating cows. Transl. Anim. Sci. 1:277–286. https://doi.org/10.2527/tas2017.0032.
Bell, M. J., N. Saunders, R. H. Wilcox, E. M. Homer, J. R. Goodman, J. Craigon, and P. C. Garnsworthy. 2014. Methane emissions among individual dairy cows during milking quantified by eructation peaks or ratio with carbon dioxide. J. Dairy Sci. 97:6536–6546. https://doi.org/10.3168/jds.2013-7889.
Benaouda, M., C. Martin, X. Li, E. Kebreab, A. N. Hristov, Z. Yu, D. R. Yáñez-Ruiz, C. K. Reynolds, L. A. Crompton, J. Dijkstra, A. Bannink, A. Schwarm, M. Kreuzer, M. McGee, P. Lund, A. L. F. Hellwing, M. R. Weisbjerg, P. J. Moate, A. R. Bayat, K. J. Shingfield, N. Peiren, and M. Eugène. 2019. Evaluation of the performance of existing mathematical models predicting enteric methane emissions from ruminants: Animal categories and dietary mitigation strategies. Anim. Feed Sci. Technol. 255:114207. https://doi.org/10.1016/j.anifeedsci.2019.114207.
Blaxter, K. L., and J. L. Clapperton. 1965. Prediction of the amount of methane produced by ruminants. Br. J. Nutr. 19:511–522. https://doi.org/10.1079/BJN19650046.
Blondel, M., A. Onogi, H. Iwata, and N. Ueda. 2015. A ranking approach to genomic selection. PLoS One 10:e0128570. https://doi.org/10.1371/journal.pone.0128570.
Boadi, D. A., K. M.
Wittenberg, S. L. Scott, D. Burton, K. Buckley, J. A. Small, and K. H. Ominski. 2004. Effect of low and high forage diet on enteric and manure pack greenhouse gas emissions from a feedlot. Can. J. Anim. Sci. 84:445–453. https://doi.org/10.4141/A03-079.
Breiman, L. 2001. Random forests. Mach. Learn. 45:5–32. https://doi.org/10.1023/A:1010933404324.
Cassandro, M. 2020. Animal breeding and climate change, mitigation and adaptation. J. Anim. Breed. Genet. 137:121–122. https://doi.org/10.1111/jbg.12469.
Cassandro, M., M. Marcello, and B. Stefanon. 2013. Genetic aspects of enteric methane emission in livestock ruminants. Ital. J. Anim. Sci. 12:450–458. https://doi.org/10.4081/ijas.2013.e73.
Charmley, E., S. R. O. Williams, P. J. Moate, R. S. Hegarty, R. M. Herd, V. H. Oddy, P. Reyenga, K. M. Staunton, A. Anderson, and M. C. Hannah. 2016. A universal equation to predict methane production of forage-fed cattle in Australia. Anim. Prod. Sci. 56:169–180. https://doi.org/10.1071/AN15365.
de Haas, Y., J. J. Windig, M. P. L. Calus, J. Dijkstra, M. de Haan, A. Bannink, and R. F. Veerkamp. 2011. Genetic parameters for predicted methane production and the potential for reducing enteric emissions through genomic selection. J. Dairy Sci. 94:6122–6134. https://doi.org/10.3168/jds.2011-4439.
Deighton, M. H., S. R. O. Williams, M. C. Hannah, R. J. Eckard, T. M. Boland, W. J. Wales, and P. J. Moate. 2014. A modified sulphur hexafluoride tracer technique enables accurate determination of enteric methane emissions from ruminants. Anim. Feed Sci. Technol. 197:47–63. https://doi.org/10.1016/j.anifeedsci.2014.08.003.
Ellis, J. L., A. Bannink, J. France, E. Kebreab, and J. Dijkstra. 2010. Evaluation of enteric methane prediction equations for dairy cows used in whole farm models. Glob. Chang. Biol. 16:3246–3256. https://doi.org/10.1111/j.1365-2486.2010.02188.x.
Ellis, J. L., E. Kebreab, N. E. Odongo, B. W.
McBride, E. K. Okine, and J. France. 2007. Prediction of methane production from dairy and beef cattle. J. Dairy Sci. 90:3456–3466. https://doi.org/10.3168/jds.2006-675.
Engineering ToolBox. 2003. Gases - densities. Accessed Jun. 11, 2018. https://www.Engineeringtoolbox.Com/Gas-Density-d_158.Html.
Engineering ToolBox. 2004. STP – standard temperature and pressure and NTP – normal temperature and pressure. Accessed Jun. 11, 2018. https://www.engineeringtoolbox.com/stp-standard-ntp-normal-air-d_772.html.
FAO (Food and Agriculture Organization of the United Nations). 2016. Greenhouse Gas Emissions from Agriculture, Forestry and Other Land Use. FAO. http://www.fao.org/3/a-i6340e.pdf.
FAO (Food and Agriculture Organization of the United Nations). 2018. Enteric fermentation. FAOSTAT. Accessed June 10, 2018. http://www.fao.org/faostat/en/#data/ge.
Garnsworthy, P. C., J. Craigon, J. H. Hernandez-Medrano, and N. Saunders. 2012. Variation among individual dairy cows in methane measurements made on farm during milking. J. Dairy Sci. 95:3181–3189. https://doi.org/10.3168/jds.2011-4606.
Garnsworthy, P. C., G. F. Difford, M. J. Bell, A. R. Bayat, P. Huhtanen, B. Kuhla, J. Lassen, N. Peiren, M. Pszczola, D. Sorg, M. H. P. W. Visker, and T. Yan. 2019. Comparison of methods to measure methane for use in genetic evaluation of dairy cattle. Animals (Basel) 9:837. https://doi.org/10.3390/ani9100837.
González-Recio, O., and S. Forni. 2011. Genome-wide prediction of discrete traits using Bayesian regressions and machine learning. Genet. Sel. Evol. 43:7. https://doi.org/10.1186/1297-9686-43-7.
Gower, J. C. 1971. A general coefficient of similarity and some of its properties. Biometrics 27:857–874. https://doi.org/10.2307/2528823.
Hellwing, A. L. F., P. Lund, J. Madsen, and M. R. Weisbjerg. 2013.
Comparison of enteric methane production predicted from the CH4/CO2 ratio and measured in respiration chambers. Adv. Anim. Biosci. 4:557.
Hill, W. G. 1984. On selection among groups with heterogeneous variance. Anim. Prod. 39:473–477.
Hristov, A. N., E. Kebreab, M. Niu, J. Oh, A. Bannink, A. R. Bayat, T. M. Boland, A. F. Brito, D. P. Casper, L. A. Crompton, J. Dijkstra, M. Eugène, P. C. Garnsworthy, N. Haque, A. L. F. Hellwing, P. Huhtanen, M. Kreuzer, B. Kuhla, P. Lund, J. Madsen, C. Martin, P. J. Moate, S. Muetzel, C. Muñoz, N. Peiren, J. M. Powell, C. K. Reynolds, A. Schwarm, K. J. Shingfield, T. M. Storlien, M. R. Weisbjerg, D. R. Yáñez-Ruiz, and Z. Yu. 2018. Symposium review: Uncertainties in enteric methane inventories, measurement techniques, and prediction models. J. Dairy Sci. 101:6655–6674. https://doi.org/10.3168/jds.2017-13536.
Hristov, A. N., J. Oh, J. L. Firkins, J. Dijkstra, E. Kebreab, G. Waghorn, H. P. S. Makkar, A. T. Adesogan, W. Yang, C. Lee, P. J. Gerber, B. Henderson, and J. M. Tricarico. 2013. Special topics: Mitigation of methane and nitrous oxide emissions from animal operations: I. A review of enteric methane mitigation options. J. Anim. Sci. 91:5045–5069. https://doi.org/10.2527/jas.2013-6583.
Jantke, K., M. J. Hartmann, L. Rasche, B. Blanz, and U. A. Schneider. 2020. Agricultural greenhouse gas emissions: Knowledge and positions of German farmers. Land (Basel) 9:130. https://doi.org/10.3390/land9050130.
Järvelin, K., and J. Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20:422–446. https://doi.org/10.1145/582415.582418.
Jentsch, W., M. Schweigel, F. Weissbach, H. Scholze, W. Pitroff, and M. Derno. 2007. Methane production in cattle calculated by the nutrient composition of the diet. Arch. Anim. Nutr. 61:10–19. https://doi.org/10.1080/17450390601106580.
Johnson, K. A., and D. E. Johnson. 1995. Methane emissions from cattle. J. Anim. Sci.
73:2483–2492. https://doi.org/10.2527/1995.7382483x.
Kebreab, E., K. Clark, C. Wagner-Riddle, and J. France. 2006. Methane and nitrous oxide emissions from Canadian animal agriculture: A review. Can. J. Anim. Sci. 86:135–157. https://doi.org/10.4141/A05-010.
Kebreab, E., K. A. Johnson, S. L. Archibeque, D. Pape, and T. Wirth. 2008. Model for estimating enteric methane emissions from United States dairy and feedlot cattle. J. Anim. Sci. 86:2738–2748. https://doi.org/10.2527/jas.2008-0960.
Knief, U., and W. Forstmeier. 2021. Violating the normality assumption may be the lesser of two evils. Behav. Res. Methods 53:2576–2590. https://doi.org/10.1101/498931.
Kowarik, A., and M. Templ. 2016. Imputation with the R package VIM. J. Stat. Softw. 74:1–16. https://doi.org/10.18637/jss.v074.i07.
Kriss, M. 1931. A comparison of feeding standards for dairy cows, with especial reference to energy requirements. J. Nutr. 4:141–161. https://doi.org/10.1093/jn/4.1.141.
Liaw, A., and M. Wiener. 2002. Classification and regression by randomForest. R News 2:18–22. http://CRAN.R-project.org/doc/Rnews/.
Mills, J. A. N., E. Kebreab, C. M. Yates, L. A. Crompton, S. B. Cammell, M. S. Dhanoa, R. E. Agnew, and J. France. 2003. Alternative approaches to predicting methane emissions from dairy cows. J. Anim. Sci. 81:3141–3150. https://doi.org/10.2527/2003.81123141x.
Moate, P. J., S. R. O. Williams, C. Grainger, M. C. Hannah, E. N. Ponnampalam, and R. J. Eckard. 2011. Influence of cold-pressed canola, brewers grains and hominy meal as dietary supplements suitable for reducing enteric methane emissions from lactating dairy cows. Anim. Feed Sci. Technol. 166–167:254–264. https://doi.org/10.1016/j.anifeedsci.2011.04.069.
Moraes, L. E., A. B. Strathe, J. G. Fadel, D. P. Casper, and E. Kebreab. 2014. Prediction of enteric methane emissions from cattle. Glob. Chang. Biol. 20:2140–2148.
https://doi.org/10.1111/gcb.12471.
Negussie, E. 2022. Supplemental Table S1. Harvard Dataverse, V1. https://doi.org/10.7910/DVN/BINDG9.
Negussie, E., Y. de Haas, F. Dehareng, R. J. Dewhurst, J. Dijkstra, N. Gengler, D. P. Morgavi, H. Soyeurt, S. van Gastelen, T. Yan, and F. Biscarini. 2017a. Invited review: Large-scale indirect measurements for enteric methane emissions in dairy cattle: A review of proxies and their potential for use in management and breeding decisions. J. Dairy Sci. 100:2433–2453. https://doi.org/10.3168/jds.2016-12030.
Negussie, E., O. González-Recio, Y. de Haas, N. Gengler, H. Soyeurt, N. Peiren, M. Pszczola, P. Garnsworthy, M. Battagin, A. R. Bayat, J. Lassen, T. Yan, T. Boland, B. Kuhla, T. Strabel, A. Schwarm, A. Vanlierde, and F. Biscarini. 2019. Machine learning ensemble algorithms in predictive analytics of dairy cattle methane emission using imputed versus non-imputed datasets. Page 40 in Proceedings of 7th GGAA (Greenhouse Gas and Animal Agriculture) Conference, Iguassu Falls, Brazil. Embrapa Southeast Livestock.
Negussie, E., J. Lehtinen, P. Mäntysaari, A. R. Bayat, A.-E. Liinamo, E. A. Mäntysaari, and M. H. Lidauer. 2017b. Non-invasive individual methane measurement in dairy cows. Animal 11:890–899. https://doi.org/10.1017/S1751731116002718.
Nielsen, N. I., H. Volden, M. Åkerlind, M. Brask, A. L. F. Hellwing, T. Storlien, and J. Bertilsson. 2013. A prediction equation for enteric methane emission from dairy cows for use in NorFor. Acta Agric. Scand. A Anim. Sci. 63:126–130. https://doi.org/10.1080/09064702.2013.851275.
Niu, M., E. Kebreab, A. N. Hristov, J. Oh, C. Arndt, A. Bannink, A. R. Bayat, A. F. Brito, T. Boland, D. Casper, L. A. Crompton, J. Dijkstra, M. A. Eugène, P. C.
Garnsworthy, M. N. Haque, A. L. F. Hellwing, P. Huhtanen, M. Kreuzer, B. Kuhla, P. Lund, J. Madsen, C. Martin, S. C. McClelland, M. McGee, P. J. Moate, S. Muetzel, C. Muñoz, P. O’Kiely, N. Peiren, C. K. Reynolds, A. Schwarm, K. J. Shingfield, T. M. Storlien, M. R. Weisbjerg, D. R. Yáñez-Ruiz, and Z. Yu. 2018. Prediction of enteric methane production, yield, and intensity in dairy cattle using an intercontinental database. Glob. Chang. Biol. 24:3368–3389. https://doi.org/10.1111/gcb.14094.
O’Neill, B. F., M. H. Deighton, B. M. O’Loughlin, F. J. Mulligan, T. M. Boland, M. O’Donovan, and E. Lewis. 2011. Effects of a perennial ryegrass diet or total mixed ration diet offered to spring-calving Holstein-Friesian dairy cows on methane emissions, dry matter intake, and milk production. J. Dairy Sci. 94:1941–1951. https://doi.org/10.3168/jds.2010-3361.
Ramin, M., and P. Huhtanen. 2013. Development of equations for predicting methane emissions from ruminants. J. Dairy Sci. 96:2476–2493. https://doi.org/10.3168/jds.2012-6095.
Roberts, D. R., V. Bahn, S. Ciuti, M. S. Boyce, J. Elith, G. Guillera-Arroita, S. Hauenstein, J. J. Lahoz-Monfort, B. Schröder, W. Thuiller, D. I. Warton, B. A. Wintle, F. Hartig, and C. F. Dormann. 2017. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40:913–929. https://doi.org/10.1111/ecog.02881.
Rojas-Downing, M. M., A. P. Nejadhashemi, T. Harrigan, and S. A. Woznicki. 2017. Climate change and livestock: Impacts, adaptation, and mitigation. Clim. Risk Manage. 16:145–163. https://doi.org/10.1016/j.crm.2017.02.001.
Schielzeth, H., N. J. Dingemanse, S. Nakagawa, D. F. Westneat, H. Allegue, C. Teplitsky, D. Réale, N. A. Dochtermann, L. Z. Garamszegi, and Y. G. Araya-Ajoy. 2020. Robustness of linear mixed-effects models to violations of distributional assumptions. Methods Ecol. Evol. 11:1141–1152.
https://doi.org/10.1111/2041-210X.13434.
Sobrinho, T. L. P., R. H. Branco, E. Magnani, A. Berndt, R. C. Canesin, and M. E. Z. Mercadante. 2018. Development and evaluation of prediction equations for methane emission from Nellore cattle. Acta Sci. Anim. Sci. 41:e42559. https://doi.org/10.4025/actascianimsci.v41i1.42559.
St-Pierre, N. R. 2001. Invited review: Integrating quantitative findings from multiple studies using mixed model methodology. J. Dairy Sci. 84:741–755. https://doi.org/10.3168/jds.S0022-0302(01)74530-4.
Storlien, T. M., H. Volden, T. Almøy, K. A. Beauchemin, T. A. McAllister, and O. M. Harstad. 2014. Prediction of enteric methane production from dairy cows. Acta Agric. Scand. A Anim. Sci. 64:98–109. https://doi.org/10.1080/09064702.2014.959553.
Troyanskaya, O., M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17:520–525. https://doi.org/10.1093/bioinformatics/17.6.520.
Visscher, P. M., and W. G. Hill. 1992. Heterogeneity of variance and dairy-cattle breeding. Anim. Sci. 55:321–329. https://doi.org/10.1017/S0003356100021012.
Waghorn, G. C., H. Clark, V. Taufa, and A. Cavanagh. 2008. Monensin controlled-release capsules for methane mitigation in pasture-fed dairy cows. Aust. J. Exp. Agric. 48:65–68. https://doi.org/10.1071/EA07299.
Wang, Q., and H. Bovenhuis. 2019. Validation strategy can result in an overoptimistic view of the ability of milk infrared spectra to predict methane emission of dairy cattle. J. Dairy Sci. 102:6288–6295. https://doi.org/10.3168/jds.2018-15684.
Wickham, H. 2009. Ggplot2: Elegant Graphics for Data Analysis. 2nd ed. Springer Nature.
Williams, S. R. O., T. Clarke, M. C. Hannah, L. C. Marett, P. J. Moate, M. J. Auldist, and W. J. Wales. 2013.
Energy partitioning in herbage-fed dairy cows offered supplementary grain during an extended lactation. J. Dairy Sci. 96:484–494. https://doi.org/10.3168/jds.2012-5787.
Wolfert, S., L. Ge, C. Verdouw, and M.-J. Bogaardt. 2017. Big data in smart farming–A review. Agric. Syst. 153:69–80. https://doi.org/10.1016/j.agsy.2017.01.023.
Yan, T., R. E. Agnew, F. J. Gordon, and M. J. Porter. 2000. The prediction of methane energy output in dairy and beef cattle offered grass silage-based diets. Livest. Prod. Sci. 64:253–263. https://doi.org/10.1016/S0301-6226(99)00145-1.
Zhang, C., and Y. Ma. 2012. Random forest. Pages 157–175 in Ensemble Machine Learning: Methods and Applications. A. Cutler, D. R. Cutler, and J. R. Stevens, ed. Springer. https://doi.org/10.1007/978-1-4419-9326-7_5.
Zhao, Y., X. Nan, L. Yang, S. Zheng, L. Jiang, and B. Xiong. 2020. A review of enteric methane emission measurement techniques in ruminants. Animals (Basel) 10:1004. https://doi.org/10.3390/ani10061004.

ORCIDS

Enyew Negussie https://orcid.org/0000-0003-4892-9938
Oscar González-Recio https://orcid.org/0000-0002-9106-4063
Mara Battagin https://orcid.org/0000-0001-7309-6793
Ali-Reza Bayat https://orcid.org/0000-0002-4894-0662
Tommy Boland https://orcid.org/0000-0002-7433-130X
Yvette de Haas https://orcid.org/0000-0002-4331-4101
Aser Garcia-Rodriguez https://orcid.org/0000-0001-5519-6766
Philip C.
Garnsworthy https://orcid.org/0000-0001-5131-3398
Nicolas Gengler https://orcid.org/0000-0002-5981-5509
Michael Kreuzer https://orcid.org/0000-0002-9978-1171
Björn Kuhla https://orcid.org/0000-0002-2032-5502
Jan Lassen https://orcid.org/0000-0002-1338-8644
Nico Peiren https://orcid.org/0000-0001-5500-1607
Marcin Pszczola https://orcid.org/0000-0003-2833-5083
Angela Schwarm https://orcid.org/0000-0002-5750-2111
Hélène Soyeurt https://orcid.org/0000-0001-9883-9047
Amélie Vanlierde https://orcid.org/0000-0002-4619-1936
Tianhai Yan https://orcid.org/0000-0002-1994-5202
Filippo Biscarini https://orcid.org/0000-0002-3901-2354