Jukuri, open repository of the Natural Resources Institute Finland (Luke)

All material supplied via Jukuri is protected by copyright and other intellectual property rights. Duplication or sale, in electronic or print form, of any part of the repository collections is prohibited. Making electronic or print copies of the material is permitted only for your own personal use or for educational purposes. For other purposes, this article may be used in accordance with the publisher’s terms. There may be differences between this version and the publisher’s version. You are advised to cite the publisher’s version.

This is an electronic reprint of the original article. This reprint may differ from the original in pagination and typographic detail.

Author(s): Peter Rubbens, Stephanie Brodie, Tristan Cordier et al.
Title: Machine learning in marine ecology: an overview of techniques and applications
Year: 2023
Version: Publisher’s version
Copyright: The author(s) 2023
Rights: CC BY 4.0
Rights url: https://creativecommons.org/licenses/by/4.0/

Please cite the original version: Peter Rubbens, Stephanie Brodie, Tristan Cordier et al. (2023) Machine learning in marine ecology: an overview of techniques and applications. ICES Journal of Marine Science. doi:10.1093/icesjms/fsad100

ICES Journal of Marine Science, 2023, 0, 1–25
DOI: 10.1093/icesjms/fsad100
Review Article

Machine learning in marine ecology: an overview of techniques and applications

Peter Rubbens 1,2, Stephanie Brodie 3, Tristan Cordier 4,5, Diogo Destro Barcellos 6, Paul Devos 7, Jose A. Fernandes-Salvador 8, Jennifer I Fincham 9, Alessandra Gomes 6, Nils Olav Handegard 10, Kerry Howell 11, Cédric Jamet 12, Kyrre Heldal Kartveit 10, Hassan Moustahfid 13, Clea Parcerisas 1,7, Dimitris Politikos 14, Raphaëlle Sauzède 15, Maria Sokolova 16, Laura Uusitalo 17,18, Laure Van den Bulcke 19,20, Aloysius T. M. van Helmond 21, Jordan T. Watson 22,23, Heather Welch 3, Oscar Beltran-Perez 24, Samuel Chaffron 25,26, David S. Greenberg 27, Bernhard Kühn 28, Rainer Kiko 29,30, Madiop Lo 31, Rubens M. Lopes 6, Klas Ove Möller 32, William Michaels 33, Ahmet Pala 34,10, Jean-Baptiste Romagnan 35, Pia Schuchert 36, Vahid Seydi 37, Sebastian Villasante 38, Ketil Malde 10,39, and Jean-Olivier Irisson 29,*

ORCID iDs: https://orcid.org/0000-0001-9172-5514, https://orcid.org/0000-0002-9708-9042, https://orcid.org/0000-0001-7466-0288, https://orcid.org/0000-0002-9955-2184, https://orcid.org/0000-0001-9992-5334, https://orcid.org/0000-0002-6380-4052, https://orcid.org/0000-0002-5143-5253, https://orcid.org/0000-0002-1686-0377, https://orcid.org/0000-0002-7886-6424, https://orcid.org/0000-0002-7851-9107, https://orcid.org/0000-0002-9709-073X, https://orcid.org/0000-0003-3118-2161, https://orcid.org/0000-0001-7381-1849, https://orcid.org/0000-0003-4920-3880

1 Flanders Marine Institute (VLIZ), 8400 Oostende, Belgium
2 Kytos BV, Technologiepark-Zwijnaarde 82, 9052 Gent, Belgium
3 Institute of Marine Science, University of California Santa Cruz, Santa Cruz, CA 95064, USA
4 Department of Genetics and Evolution, University of Geneva, 1205 Geneva, Switzerland
5 NORCE Climate, NORCE Norwegian Research Centre AS, Bjerknes Centre for Climate Research, Jahnebakken 5, 5007 Bergen, Norway
6 Oceanographic Institute, University of São Paulo, Praça do Oceanográfico, 191, 05508-120, São Paulo, Brazil
7 Department of Information Technology, Research group WAVES, Ghent University, Tech Lane Ghent Science Park, 126, B-9058 Gent, Belgium
8 AZTI, Marine Research, Basque Research and Technology Alliance (BRTA). Txatxarramendi Ugartea z/g, 48395 Sukarrieta, Spain
9 Cefas, Pakefield Road, Lowestoft, Suffolk NR33 0HT, UK
10 Institute of Marine Research, Nykirkekaien 1, 5005 Bergen, Norway
11 School of Biological and Marine Sciences, University of Plymouth, Drake Circus, Plymouth PL4 8AA, UK
12 Université du Littoral Côte d’Opale, CNRS, Univ. Lille, IRD, UMR 8187, LOG, Laboratoire d’Océanologie et de Géosciences, F-62930 Wimereux, France
13 National Oceanic and Atmospheric Administration, US Integrated Ocean Observing System, Silver Spring, MD 20910, USA
14 Institute of Marine Biological Resources and Inland, Hellenic Centre for Marine Research, 16452 Argyroupoli, Greece
15 Sorbonne Université, CNRS, Institut de la Mer de Villefranche, FR3761, F-06230 Villefranche-Sur-Mer, France
16 Wageningen University and Research, Droevendaalsesteeg 1, Building 107, 6708 PB Wageningen, The Netherlands
17 Finnish Environment Institute, Latokartanonkaari 11, FI-00790 Helsinki, Finland
18 Natural Resources Institute Finland (Luke), Latokartanonkaari 9, FI-00790 Helsinki, Finland
19 Flanders Research Institute for Agriculture, Fisheries and Food, Marine Research, Jacobsenstraat 1, 8400 Ostend, Belgium
20 Department of Data Analysis and Mathematical Modelling, Knowledge-based Systems Research Group, University of Ghent, Coupure Links 653, 9000 Gent, Belgium
21 Wageningen University and Research, Wageningen Marine Research, 1976 CP IJmuiden, The Netherlands
22 Present affiliation: Pacific Islands Ocean Observing System, University of Hawai‘i at Mānoa, 1680 East West Road, POST 815, Honolulu HI 96822, USA
23 Auke Bay Laboratory, National Oceanic and Atmospheric Administration, 17609 Pt. Lena Loop Rd., Juneau, AK 99801, USA
24 Leibniz Institute for Baltic Sea Research Warnemünde (IOW), Seestrasse 15, 18119 Rostock, Germany
25 Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
26 Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans GOSEE, F-75016 Paris, France
27 Institute of Coastal Systems, Helmholtz-Zentrum Hereon, Max-Planck-Straße 1, 21502 Geesthacht, Germany
28 Johann Heinrich von Thünen Institute of Sea Fisheries, Herwigstraße 31, 27572 Bremerhaven, Germany
29 Sorbonne Université, CNRS, Laboratoire d’Océanographie de Villefranche, LOV, F-06230 Villefranche-sur-Mer, France
30 GEOMAR Helmholtz Centre for Ocean Research Kiel, 24148 Kiel, Germany
31 Aix Marseille Univ., Univ. Toulon, CNRS, IRD, Mediterranean Institute of Oceanography, F-13009 Marseille, France
32 Institute of Carbon Cycles, Helmholtz-Zentrum Hereon, Max-Planck-Straße 1, 21502 Geesthacht, Germany
33 NOAA, National Marine Fisheries Service, Office of Science and Technology, Silver Spring, MD 20910, USA
34 Department of Mathematics, University of Bergen, Allégaten 41, 5007 Bergen, Norway
35 DECOD (Ecosystem Dynamics and Sustainability), IFREMER, INRAe, Institut-Agro-Agrocampus Ouest, rue de L’île d’Yeu, 44311 Nantes Cedex 3, France
36 Agri-Food & Biosciences Institute (AFBI), Environment and Marine Sciences Division, 18a Newforge Lane, Belfast BT9 5PX, UK
37 Centre for Applied Marine Science, Bangor University, Menai Bridge LL59 5AB, UK
38 EqualSea Lab-Cross-Research in Environmental Technologies (CRETUS), Department of Applied Economics, University of Santiago de Compostela, Santiago de Compostela 15782, Spain
39 Department of Informatics, University of Bergen, Allégaten 41, 5007 Bergen, Norway
* Corresponding author: tel: +33 (0)4 93 76 38 04; e-mail: irisson@normalesup.org.

Received: 29 September 2022; Revised: 14 April 2023; Accepted: 26 May 2023

© The Author(s) 2023. Published by Oxford University Press on behalf of International Council for the Exploration of the Sea. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsad100/7236451 by Natural Resources Institute Finland (Luke) user on 14 August 2023

Machine learning covers a large set of algorithms that can be trained to identify patterns in data. Thanks to the increase in the amount of data and computing power available, it has become pervasive across scientific disciplines. We first highlight why machine learning is needed in marine ecology. Then we provide a quick primer on machine learning techniques and vocabulary.
We built a database of ∼1000 publications that implement such techniques to analyse marine ecology data. For various data types (images, optical spectra, acoustics, omics, geolocations, biogeochemical profiles, and satellite imagery), we present a historical perspective on applications that proved influential, can serve as templates for new work, or represent the diversity of approaches. Then, we illustrate how machine learning can be used to better understand ecological systems, by combining various sources of marine data. Through this coverage of the literature, we demonstrate an increase in the proportion of marine ecology studies that use machine learning, the pervasiveness of images as a data source, the dominance of machine learning for classification-type problems, and a shift towards deep learning for all data types. This overview is meant to guide researchers who wish to apply machine learning methods to their marine datasets.

Keywords: acoustics, ecology, image, machine learning, omics, profiles, remote sensing, review.

Figure 1. Deep learning is a subdomain of machine learning, which is itself a subdomain of artificial intelligence, as illustrated. Specific methods are mentioned in each subdomain.

What is machine learning and why does marine ecology need it?

The term “machine learning” (ML) has become omnipresent in both the scientific literature and everyday news.
Its first use dates back to the late 1950s: regarding a game of checkers, Arthur Samuel, an electrical engineer at IBM, stated that “a computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program” by using so-called “machine-learning procedures” (Samuel, 1959, p. 219). In its broadest definition, an ML system improves its performance by extracting information from data (Mitchell, 1997). In contrast to traditional computer programs, which encode a solution designed by the programmer, an ML system can learn to solve a task without being provided an explicit recipe. Instead, the task is learned by providing the system with examples, i.e. data. The ability to produce a solution to a problem that is not representable mechanistically can be extremely powerful, but it depends crucially on selecting an appropriate representation of the problem (an “objective” function) and on having adequate data from which to learn.

Although the two terms are often used interchangeably in popular literature, ML is a subdomain of the larger field of artificial intelligence (AI), which encompasses knowledge representation, logic models, algorithms, and computational methods capable of intelligent behaviour (Figure 1). Within ML, the subfield of deep learning (DL; LeCun et al., 2015) has advanced rapidly over the last decade. DL systems use large neural networks (Table 1) to extract relevant features from raw data and learn from them, instead of requiring explicit engineering of those features. These data are often complex (such as images or sounds) and big (thousands to millions of records). In this review, we cover ML and therefore include DL.
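To make the idea of learning from examples by optimizing an objective function concrete, here is a minimal sketch that is not part of the original article. It recovers a linear relation from noisy examples by gradient descent on the mean squared error, rather than having the relation written into the program. The synthetic data, the function names (`mean_squared_error`, `fit_line`), the learning rate, and the number of steps are all illustrative assumptions.

```python
import random


def mean_squared_error(a, b, xs, ys):
    """Objective function: average squared gap between predictions and data."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys, learning_rate=0.01, n_steps=5000):
    """Learn slope a and intercept b by gradient descent on the objective."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Prediction errors with the current parameters.
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to a and b.
        grad_a = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Move the parameters a small step against the gradient.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Hypothetical examples: a noisy linear relation y = 2x + 1.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]

a, b = fit_line(xs, ys)
print(f"learned slope={a:.2f}, intercept={b:.2f}")
```

The program never contains the rule "multiply by 2 and add 1"; it only contains the objective and the examples, which is why both need to be chosen with care.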
Table 1. Definitions of machine learning algorithms commonly used in marine ecology studies and cited in this review.

Method: Description
Decision tree (DT): A hierarchy (“tree”) of successive decision criteria based on the input variables, in order to label data instances. Popular implementations include the classification and regression trees (CART) and C4.5 algorithms.
Random forest (RF): An ensemble method that combines predictions of multiple decision trees, in which each tree is trained on a bootstrap resample of the data.
Gradient boosted trees, boosted regression trees (BRTs): An ensemble method that combines decision trees, each working on the residuals of the previous one. Gradient boosting in general is a smart way of combining multiple “weak” learners.
Matrix factorization: Method to find a representation of the input data in fewer variables by decomposing the original data matrix into two latent matrices of fewer dimensions.
k-nearest neighbours (kNN): New data points are labelled according to the average/majority label of their k nearest neighbours. The value of k must be set beforehand.
Linear discriminant analysis (LDA): A multivariate Gaussian distribution is fitted to the inputs for each class, in which each distribution has its own mean but a shared covariance matrix. New data instances are assigned to the distribution with the highest conditional probability.
Support vector machine (SVM): An algorithm that tries to find an optimally separating linear boundary in a large transformed space of the input variables.
Naive Bayes (NB): Classifier that combines Bayes’ theorem with the “naive” assumption that all variables are independent from each other; a type of Bayesian network.
Bayesian network: A directed acyclic graph in which each vertex (node) has a probability distribution or a conditional distribution, conditional on the value of its parents.
Gaussian mixture model (GMM): A distribution (i.e. “mixture”) of multiple normal distributions is fitted to the data. Data points are then assigned to the closest normal distribution.
k-means: Data instances are clustered into k groups, in which the within-cluster variance is minimized.
Artificial neural network (ANN): An algorithm that combines layers of nodes or “neurons”. Each neuron receives as input a weighted linear combination of the outputs of neurons in the previous layer. Nodes in the first layer represent the input variables. Weights are updated incrementally using training data, starting from the last layer. Predictions are done by feeding new data through the layers once the weights are set.
Self-Organizing Map (SOM): An unsupervised artificial neural network, in which nodes in a grid are optimized to match with groups of similar data points.
Convolutional neural network (CNN): A class of methods within deep learning. A convolutional neural network takes an array as input, and performs convolutions and reductions (pooling) on it to extract features, which are then fed to an artificial neural network.
Region-based convolutional neural network (R-CNN): R-CNN is a deep learning architecture designed to recognize, localize (bounding box), and classify multiple objects in an image. Mask R-CNN is a variant that is able to recognize and classify multiple objects in an image and to segment them by assigning individual pixels to each object.
Deep belief network (DBN): A form of deep learning in which a generative graphical model is composed out of multiple layers of latent variables. Layers are connected with each other, but the nodes within a layer are not.
Long short-term memory (LSTM): A type of recurrent neural network that falls under deep learning. These types of networks are used to predict sequences of data. LSTMs provide feedback connections within the network, while many deep learning architectures only provide feedforward connections.

The success of ML is associated with the increase in computational power over the last 20 years (Mitchell, 1999), but also with the increasing volume of available data (Jordan and Mitchell, 2015), which led to the development of a broader diversity of algorithms, implemented in widely available software. Scientists from many disciplines outside of computer science are now actively applying ML methods, and marine sciences are no exception, as exemplified by a recent themed set in this journal (Beyan and Browman, 2020). Most examples in this themed set actually relate to ecological questions, within which a central focus is the detection and quantification of the abundance and distribution of living organisms.

ML is promising in marine ecology for several reasons. (i) Modern instruments produce large volumes of data (Tanhua et al., 2019; Guidi et al., 2020) that require scaling up their processing; the flexibility and adaptability of ML methods make them a natural choice for such automation. (ii) This automation can also help to reduce the biases necessarily introduced by manual processing (e.g. Culverhouse et al., 2003), hence improving reproducibility. (iii) Finally, ML is adept at handling high degrees of uncertainty (i.e. dealing with noise in the data) associated with unknown underlying mechanisms or with non-stationary processes; ML methods therefore often yield high predictive power (Baker et al., 2018) and are increasingly used to gain an understanding of ecological processes (Lucas, 2020).

Within marine sciences, ML has been used more extensively in some subdomains. Specialized reviews have already covered some applications. For example, Liu and Weisberg (2011) reviewed the use of Self-Organizing Maps in oceanography; Culverhouse et al. (2006), Benfield et al. (2007), and Irisson et al. (2022) reviewed ML techniques for the taxonomic classification of plankton images; Reichstein et al. (2019) gave an overview of DL for earth sciences, including oceanic applications; and Malde et al. (2020) provided a brief review on recent developments in DL and highlighted both opportunities and challenges for its adoption in marine sciences.

For researchers whose expertise is outside of computer sciences, ensuring a proper application of ML methods and keeping track of new developments is challenging. The aim of the present review is to serve as a resource for marine ecologists who want to apply ML to their own data. To that effect, the section “A quick primer on machine learning” serves as a starting point for non-practitioners and introduces relevant vocabulary. The section “The setup of the database and its tags” describes our survey of the literature and the resulting structured database, on which the rest of the review is built. From it, we identified that ML is used at two stages in ecological research: (i) to process the raw data collected and extract ecologically meaningful datasets from it and then (ii) to combine these ecology-ready datasets together, and with others, to improve our understanding of ecological systems. Therefore, the section “Machine learning to extract ecological information from observational data” describes applications where ML was used to generate ecological datasets from various raw data types: images and video, optical spectra of single cells, acoustics, omics, geolocation records, and ocean colour imagery and biogeochemical profiles.
The section “Machine learning to improve ecological understanding” describes how ML can be used to gain knowledge on the relationships between species and their environment (section “Predicting species abundance and distribution”), among species (section “Capturing dynamic ecological relationships”), and between us, humans, and marine ecosystems (sections “Summarizing ecosystems through regionalization” and “Supporting human decisions on ecosystem management”). Finally, the section “Discussion and perspectives” concludes on the commonalities among ML applications, suggests what is currently limiting them in ecology, and gives a general outlook of the field.

A quick primer on machine learning

In this section, we provide a short overview of the different tasks that ML can achieve, the overall process that ML studies go through in the context of marine ecology, and then present different ML algorithms and software tools that implement them. Interested readers are invited to consult classic texts to deepen their understanding, either in an introductory manner (James et al., 2013) or in a more mathematically oriented one (Friedman et al., 2001).

ML approaches are often divided between supervised and unsupervised. Supervised systems are given a set of input data points and their corresponding output (measurements or labels assigned by experts). The output is often called the target or response variable. In this case, an ML system learns the mapping from the input variables to the output variable (e.g.
predict fish diversity from environmental variables; Smoliński and Radtke, 2017). Supervised systems can further be divided into classification, where the output is categorical and the task is to assign a class to input data (e.g. classify plankton taxa from images; Gorsky et al., 2010), and regression, where the output is continuous or at least ordered (e.g. predict nutrient concentrations from hydrological variables; Sauzède et al., 2017). A supervised task relevant to marine ecology is object detection: the ML system locates objects of interest in a form of regression, often of their bounding box (e.g. detect benthic organisms in images of the seafloor; Liu and Wang, 2021). Finally, sometimes, the target variable is only available for a subset of data points, a situation called semi-supervised learning.

Unsupervised systems are given input data only and search for patterns without the availability of a target variable. For instance, unsupervised methods can aim to cluster data points together based on a definition of similarity (e.g. define distinct bioregions based on community compositions; Sonnewald et al., 2020), to define simpler representations for the data while retaining salient properties, also known as dimensionality reduction (e.g. represent correlations between environmental variables through the first two dimensions of a principal component analysis; Zhao and Costello, 2019), or to construct a model for the distribution of the data (e.g. produce a smooth map of the density of active fishing vessels from point records; Kroodsma et al., 2018).

Additional steps can be performed before or within an ML pipeline. An important part of many ML systems is the preprocessing (e.g. feature normalization or smoothing) of input variables, in order to make them as relevant as possible. Feature extraction derives new informative variables from initial, raw ones (e.g. automated extraction of measurements from an image; Hu and Davis, 2005).
Feature selection eliminates less relevant variables, either to improve performance or to gain explainability thanks to a simpler system (e.g. removal of correlated variables; Thomas et al., 2018). Finally, the covariance structure in the input variables can be used to impute missing values, which are common in field-collected data, or detect outliers, i.e. values that go beyond the expected range of covariance (e.g. a dissolved oxygen concentration too high given the temperature of the water).

The general process for tackling an ML task is shown in Figure 2, and the successive steps are described in its caption. Of course, depending on the approach and study case, some steps will be modified. For example, target variables are not available in the case of unsupervised learning (step 2). In DL, feature extraction is included in the model (step 4). In many situations, cross-validation is used in lieu of a dedicated validation set (step 6): the training set is split into subsets, the model is trained on all subsets but one, and validated on this remaining one; this process is repeated until each subset has been held out once. Comparisons with an external dataset (step 8), although important, are rarely performed due to the lack of such independent data. Finally, many ML models are never deployed (step 9), but serve to describe and understand a particular dataset.

Diverse ML algorithms have been developed to solve a large variety of tasks. In Table 1, we provide a brief description of those commonly used in marine ecology publications.

Finally, several open-source software libraries implement many ML methods under a consistent interface. Thus, once one understands the general process (as highlighted in Figure 2), exploring various methods is relatively easy and progress can be quick.
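As an illustration of the general process and of the cross-validation procedure described above, the following sketch is offered under stated assumptions rather than taken from the article. It sets aside a test set, uses 5-fold cross-validation on the training set to choose the hyperparameter k of a k-nearest neighbours classifier, and then assesses the retained model once on the test set. The synthetic two-feature data, the labels `taxon_a` and `taxon_b`, and the helper names (`knn_predict`, `accuracy`, `cross_validate`) are hypothetical. Only the Python standard library is used, and the folds are not stratified by label.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Label a point by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], point))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def accuracy(train_x, train_y, eval_x, eval_y, k):
    """Fraction of evaluation points whose label is predicted correctly."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(eval_x, eval_y))
    return hits / len(eval_x)


def cross_validate(xs, ys, k, n_folds=5):
    """Hold out each fold once, train on the others, average the accuracy."""
    scores = []
    for fold in range(n_folds):
        held_out = list(range(fold, len(xs), n_folds))
        held_set = set(held_out)
        kept = [i for i in range(len(xs)) if i not in held_set]
        scores.append(accuracy(
            [xs[i] for i in kept], [ys[i] for i in kept],
            [xs[i] for i in held_out], [ys[i] for i in held_out], k))
    return sum(scores) / n_folds


# Hypothetical labelled data: two features per object, two classes.
rng = random.Random(42)
data = ([((rng.gauss(0, 1), rng.gauss(0, 1)), "taxon_a") for _ in range(60)] +
        [((rng.gauss(3, 1), rng.gauss(3, 1)), "taxon_b") for _ in range(60)])
rng.shuffle(data)
xs = [features for features, _ in data]
ys = [label for _, label in data]

# Set aside 20% of the data as a test set; the rest is the training set.
n_test = len(xs) // 5
test_x, test_y = xs[:n_test], ys[:n_test]
train_x, train_y = xs[n_test:], ys[n_test:]

# Compare hyperparameter values by cross-validation on the training set.
cv_scores = {k: cross_validate(train_x, train_y, k) for k in (1, 3, 5, 7)}
best_k = max(cv_scores, key=cv_scores.get)

# Freeze the choice of k and assess final performance on the test set.
test_accuracy = accuracy(train_x, train_y, test_x, test_y, best_k)
print(f"best k={best_k}, test accuracy={test_accuracy:.2f}")
```

The test set plays no part in choosing k, so the final accuracy is not inflated by the hyperparameter search. In practice, the libraries listed in this section provide tested implementations of each of these steps.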
The better-known libraries of relevance for marine ecology are scikit-learn (https://scikit-learn.org/; Pedregosa et al., 2011) and, more recently, TensorFlow (https://www.tensorflow.org; Abadi et al., 2016) and PyTorch (https://pytorch.org; Paszke et al., 2019) in Python, the tidymodels collection of packages in R (https://www.tidymodels.org/; Kuhn and Wickham, 2020), Flux (https://fluxml.ai/) in Julia, or Weka in Java (https://www.cs.waikato.ac.nz/ml/weka/).

The setup of the database and its tags

As a basis for this paper, we built a database of literature references covering the application of ML methods to marine data (supplemented by a few additional works, outside of this scope, but providing context and cited in this review). In its broadest definition, ML covers a wide array of methods and data types. Because many methods have been applied in marine ecology, it is extremely challenging to make an exhaustive inventory. Therefore, the goal of this database is instead to showcase the diversity of ML applications to marine data. To do so, multiple keyword-based searches in various scholarly databases were performed by the authors, including the keywords “machine learning”, “marine”, “ecology”, and variations thereof. The results were complemented with the personal libraries of the authors, who span a range of specialties. This (already large) nucleus of papers was further grown using the references cited within them, starting from the most recent and going backwards in time. This procedure was iterative (the references of the newly added papers being also examined) and the search was stopped after several rounds of such tentative additions did not yield any new reference.
After assembling this large body of potentially relevant literature, the authors screened the suitability of each paper for its inclusion in the database according to the following criteria: (i) the paper is peer-reviewed, (ii) its “Methods” section describes the ML approach used, and (iii) it applies this approach to a marine dataset. Because some classical statistical techniques can be perceived as ML, we further reduced the scope to studies that follow the general process of Figure 2 (i.e. include a validation and/or a test dataset). Then, papers were organized through tags, defining the type of data they analyse (“data:∗” tag), ML tasks achieved (“task:∗”), the algorithms used (“method:∗”), and other useful characteristics (e.g. availability of code and/or data; “meta:∗”). The content of the resulting database, organized according to data type, is summarized in Figure 3.

This selection process still yielded over 1000 papers, which cannot all be described in this review. To decide which ones to cite, we considered the following additional criteria: the paper (i) has been widely adopted by the research community (e.g. is cited very often, defines a method widely applied), (ii) is easily reproducible because its methodology is well-described and/or code and data are publicly available, or (iii) is representative of a body of work not covered by criteria (i) or (ii).

Figure 2. The general process of (supervised) machine learning. After being collected (1), data need to be labelled (2), which means associating the inputs with a number or a name as output (l = 1, 2, or 3 in the example). The data are then split into training, validation, and test datasets (3) while taking their structure into account (e.g. ensure that all labels are represented in each dataset). Each input in the training set can be summarized into features (4). The (transformed) training set is used to train the model (5), by minimizing a loss function (L) that computes the value of one or several performance metrics (M). The validation set undergoes the same transformation as the training set, if any, and is then used to evaluate the predictive performance of the model, ideally with the same metric(s) (6). Several versions of the model can be trained with different hyperparameters (i.e. settings, noted h∗) of the machine learning system, and the one with the best performance on the validation set is retained. At this point, the model is frozen and its final performance is assessed on the test set (7). If external information, different from the original data, is available, it should be used to ensure that model predictions are reasonable, in addition to achieving a given performance (8). Finally, the model is ready to be deployed and used with newly collected data (9).

While this database is not exhaustive, the methodical approach described above should avoid overt biases and large omissions. We therefore consider it representative of the diversity of approaches and of the relative volume of research in various domains. More importantly, we hope it will be continuously maintained and updated by its users. To do so, users can browse the library online (https://www.zotero.org/groups/2325748/wgmlearn/library) and, if they wish to contribute, register to the WGMLEARN Zotero group (https://www.zotero.org/groups/2325748/wgmlearn/), indicating what their contribution would be. The library in its state at the time of submission is available as Supplementary Material (S1).
Machine learning to extract ecological information from observational data

A first set of extensive and successful applications of ML is the processing of raw inputs (images, sounds, sequences, etc.) into ecologically meaningful data, often in the form of tables with samples (locations, times, etc.) as rows and variables (taxa densities, biogeochemical quantities, etc.) as columns.

Quantifying marine objects from images and video

Methods to segment and classify objects of interest from images or video are not sensitive to whether the object is a fish, a bird, or a piece of plastic debris. Yet, the processing of this dominant (Figure 3) type of data has a long history that is often siloed within specific communities, sometimes with reason. For example, object segmentation is very different for benthic objects lying over a complex background than for pelagic ones, imaged over a rather uniform background. Therefore, the literature is presented separately for benthos, marine macrolitter, nekton, and plankton. The commonalities among the methods used for these data, and others, are highlighted in the “Discussion and perspectives” section.

Figure 3. Treemap representation of the papers in the database that can be categorized according to the type of data they use. The area of each rectangle is proportional to the number of papers (written in brackets). The broad data types are bold and coloured with a given hue. Sub-types, when they exist, are in variations of the same hue.

Benthos

Underwater imaging of the benthic environment has grown considerably in the last few decades. We reviewed over 100 papers that used ML to process such data and all were published after 2000 (the earliest is Soriano et al., 2001). Almost half of these focused on habitat mapping (e.g. Porskamp et al., 2018) and coral reefs and their inhabitants (e.g. Villon et al., 2018), followed by studies focussing on the detection of benthic invertebrates (e.g. Kiranyaz et al., 2010, 2011). The most used algorithms included support vector machines (SVMs), random forest (RF), convolutional neural networks (CNNs), k-nearest neighbours (kNN), and classification and regression trees (CARTs). Those were used mostly for image/pixel classification (in more than half of the studies) on their own or with other algorithms, as previously pointed out by Lopez-Vazquez et al. (2020). More recently, object-based classification has replaced pixel-based classification (Zhang et al., 2013), especially using CNNs, which reached much higher performance (Gómez-Ríos et al., 2019; Piechaud et al., 2019).

Growth in this field has naturally been accompanied by an increase in the number of images of benthic fauna and habitats. Though ML offers promise towards unlocking the catalogue of unused benthic images, many challenges remain. There are growing concerns regarding the identification of available data for training, the pre-training of deep nets, and the handling of class imbalance in training datasets. For example, of the millions of images acquired each year on coral reef surveys, just 1–2% are labelled (Beijbom et al., 2012).
Lumini et al. (2020) compared several CNN architectures and found that combinations of several models (i.e. ensembles) were the most successful for image classification of coral (and plankton) datasets. Fincham et al. (2020), who classified images from across multiple benthic habitats, found an imbalance in their training data due to the frequency of habitat occurrence, which was countered by using data augmentation to artificially expand the training set by flipping, scaling, and rotating images.

The challenge in accessing high-quality training datasets is now being addressed through developments such as standardized reference catalogues (Althaus et al., 2015; Fisher et al., 2016; Howell et al., 2019), wide adoption of specialized annotation software such as BIIGLE 2 (Langenkämper et al., 2017), SQUIDLE+ (Williams and Friedman, 2018), and CoralNet (Beijbom et al., 2015), and annotated image databases, e.g. FathomNet (Boulais et al., 2020). In addition, the development of user-friendly software such as VIAME and Superannotate is making ML more accessible to benthic ecologists. However, for researchers to best apply these tools, much remains to be learned regarding model performance under different conditions (e.g. depending on the number of classes used), on training dataset size, on the use of single models versus ensembles of models, etc. (Durden et al., 2021).

Macrolitter

Each year, tonnes of human-created waste litter the sea surface, seafloor, and shorelines and pose a major threat to oceanic ecosystems and coastal communities (NOAA, 2014). Extensive surveys and research are conducted worldwide to assess litter distributions and concentrations in coastal areas and the open sea, to identify litter accumulation zones through numerical models, and to design management actions to promote litter removal and recycling (NOAA, 2016; Madricardo et al., 2020).
To quantify marine litter, video and photography-based monitoring is increasingly adopted and deployed on bottom trawl or nets, autonomous underwater vehicles, remotely operated vehicles, unmanned aerial systems, and drones. However, litter identification is mainly done by humans, which is time-consuming, costly, and often very subjective, creating the need for automatic approaches (Canals et al., 2020).

Region-based convolutional neural networks (R-CNNs), designed for object detection, have been increasingly applied to automatically detect and classify beached (Watanabe et al., 2019), floating (Lieshout et al., 2020), and seafloor (Politikos et al., 2021) macrolitter items. Additionally, traditional CNN classifiers have been used to categorize litter types from segmented images (Garcia-Garin et al., 2021). Such studies have generally shown that the classification and detection performance of neural networks is high for floating litter (>80%) but often lower for underwater and seafloor litter, which can be attributed to the challenges of underwater imagery (various camera angles, zoom levels, light shadings, litter buried in the seabed). Several authors have used open and experimental datasets for their analysis, focusing mainly on the predictive performance of the algorithms. The applicability of ML for marine macrolitter research has been recently reviewed in more detail (Politikos et al., 2023).

Ultimately, DL has the potential to support monitoring of marine litter by providing automatic, rapid, and scalable solutions. Nevertheless, a collection of images and video recordings from real-world environments and more effective algorithms are needed to support litter assessment goals set by stakeholders (Politikos et al., 2023).
Finally, new imaging technologies such as infrared detection (Inada et al., 2001) or Raman imaging (Gallager, 2019), which can identify plastics at least in a laboratory setting, could be implemented and integrated with ML techniques for improved results.

Nekton

Monitoring of nekton informs decision-making for biodiversity conservation and sustainable fisheries management. Imaging surveys constitute a non-invasive complement to conventional monitoring. However, they yield large datasets, and ML has come into play to automate and speed up data processing. Nekton monitoring from images is challenging due to the diversity of tasks that need to be solved (e.g. species classification but also morphometric estimations) and the very different conditions in which images are collected (e.g. both underwater and on ships).

In early fish imaging studies, classic ML methods were used with data obtained in controlled, experimental setups. For example, in Storbeck and Daan (2001), the image acquisition system consisted of a camera and a laser, which allowed obtaining images but also information on fish volume. They classified six species of fish with 95% accuracy using a shallow artificial neural network (ANN) based on fish contour features. In Zion et al. (2007), three edible fish species were sorted using a minimum Mahalanobis distance classifier that combined geometric features and object contours as inputs to yield an accuracy of >96%.

In situ monitoring of nekton is largely focused on fish as well, but those studies present additional challenges due to the wide variations in observation conditions.
Datasets are typically collected with underwater cameras (Fisher et al., 2016), but larger organisms, such as marine mammals, are also monitored via satellite images. Here also, before the development of CNNs, global features were used together with background modelling to detect and track objects under water. For example, Spampinato et al. (2010) developed a Gaussian mixture model (GMM) and moving average algorithm followed by an adapting mean-shift algorithm to detect and track fish in in situ videos, with an 85% success rate. Hu et al. (2012) reached >97% accuracy in the classification of fish images based on texture and colour features, using two kinds of SVMs. Another approach to handle the change in appearance of objects underwater is to consider the information from a sequence of frames in a video rather than from only one frame, as done in Shafait et al. (2016) where, for ten species, accuracy ranged from 71 to 100%. Finally, artificial alterations in images (i.e. data augmentation) are a common way to improve performance and generalization in CNNs. Allken et al. (2019) trained an Inception 3 architecture on 5000 data-augmented images per species to reach 94% accuracy on a test set, while the baseline model, trained on the 70 original images, reached an accuracy between 50 and 71%. Bogucki et al. (2019) used a combination of three CNNs to detect and identify North Atlantic right whales in aerial and satellite images. The first CNN located the whales in satellite images, the second detected key points on the whales’ heads in aerial survey images, and the third identified the whales.

Images of nektonic organisms are also collected outside of the water, by camera systems deployed on fishing vessels (known as electronic monitoring), which have replaced some on-board fishery observers. Deep models have become valuable tools to process the videos collected (Helmond et al., 2020).
Specifically, Mask R-CNN was provided pixel-level masks and bounding boxes around organisms to automatically monitor catches on-board fishing vessels (Tseng and Kuo, 2020). Such masks can be used to automatically measure the length of the detected objects. In Garcia et al. (2020), segmentation performance was assessed with the intersection over union (IoU) metric computed between the predicted and ground truth masks. In this study, 1605 images were used to train a Mask R-CNN model, and an average IoU of 0.89 was obtained on 200 independent test images. Notably, in more challenging images where fishes overlapped, the IoU was lower, as expected.

Plankton

Because planktonic organisms are often micrometric to millimetric, high magnification is needed to image them, which implies a short depth of field and can lead to many out-of-focus objects. In addition, in situ, images are dominated by morphologically diverse detrital particles that are similar in size to living organisms. Finally, the organisms themselves are also incredibly diverse. Therefore, the automatic classification of such images is a difficult and interesting ML problem, and remains a major bottleneck for their exploitation.

The first attempts at machine-based classification of plankton images derived various features from the images: statistical moments (which capture size, average lightness, etc.), Fourier transforms of the contour of the object, texture patterns (Tang et al., 1998), and, later, grey-level co-occurrence matrices (Hu and Davis, 2005). Those features were input into a classifier, often an ANN (Tang et al., 1998), an SVM (Hu and Davis, 2005), or a combination of both to classify images into a limited (mostly fewer than ten) number of taxa.

These approaches matured and the next decade saw the rise of their application for numerous ecological studies. The most influential papers of this period are associated with popular instruments and software. For instance, Grosjean et al.
(2004) and Gorsky et al. (2010), while presenting the ZooScan, highlighted that (i) the performance of different classifiers is largely similar and therefore mostly determined by the original features, (ii) this performance decreases strongly when the number of taxa to classify increases, and (iii) with 8 taxa, predictive power saturates beyond 300 example images per taxon in the training set. Sosik and Olson (2007) presented the Imaging FlowCytoBot and described in detail the reasoning and process to derive features particularly relevant for phytoplankton, from the original images. Despite the large number of papers, applications of those techniques at broad spatial and temporal scales are still rare (but see Irigoien et al., 2009).

The next evolution in this research was the increasing use of CNNs, particularly since 2015, owing to a plankton image classification competition run on Kaggle.com (Robinson et al., 2017; Figure 4). However, the thoroughness of papers using this technique is inconsistent and many are published in conference proceedings that are difficult to access. By contrast, Ellen et al. (2019) provide an extensive overview of the setup of a CNN from scratch, including the choice of its parameters, the inclusion of classic image features and other external information into the CNN’s classifier, and compare the CNN’s performance with the more classical approaches described above.

CNNs will likely be increasingly relied upon in the future. Their implementation within dedicated plankton imaging software such as EcoTaxa (Picheral et al., 2017) or the IFCB dashboard will facilitate their routine use by a wide community of ecologists (https://ifcb-data.whoi.edu/dashboard). The separation of their feature-extraction part from their classification part seems like a promising avenue for transfer
learning (i.e. using a model initially trained on one, often general, dataset to quickly “fine-tune” it on another dataset, here a plankton one; Orenstein and Beijbom, 2017), unsupervised classification (Schroeder et al., 2020), and active learning procedures, whereby only few images representative of the diversity of the dataset are shown to the user (Bochinski et al., 2019).

[Figure 4 plot: stacked proportions (y-axis: Proportion, 0% to 100%) per time bin (x-axis, from [1978, 1995] to 2020); legend: Artificial Neural Network (non-convolutional), Convolutional Neural Network, Discriminant Analysis, k Nearest Neighbours, Random Forest, Support Vector Machine.]
Figure 4. Classifiers used for plankton image recognition through time. The plot displays the proportions rather than absolute numbers. The time bins on the x-axis are not regular. The plot highlights the quick adoption of support vector machines and their current decline, the rise and fall of random forests, the increase of convolutional neural networks (particularly since 2015), and their current dominance.

An important problem in plankton image datasets, like in many other biological ones, is class imbalance (a few classes dominate the samples). Among several solutions, generating synthetic images in the rare classes using a generative adversarial network (GAN) was recently tested (Li et al., 2021).
Alternatively, quantification approaches, which do not aim to perfectly classify each individual object but rather to directly derive concentration estimates for each class, deal intrinsically with the distribution among classes (Gonzalez et al., 2019).

Identifying microorganisms from single-cell spectra

Flow cytometry has been used since the 1990s to study marine microbial communities. In flow cytometry, scatter and fluorescent properties of individual particles are measured at very high rates (i.e. hundreds to thousands of particles per second). Although most researchers manually analyse the resulting “cytograms”, automated methods have become available for the analysis of such microbial flow cytometry data (Rubbens and Props, 2021).

Since the 1990s and early 2000s, artificial neural networks (ANNs) have been developed to identify up to 72 lab-grown phytoplankton species using flow cytometry (Boddy et al., 2001). Supervised single-cell classifiers were then successfully applied for the identification of heterotrophic bacteria as well, by combining flow cytometry with a nucleic acid stain in most cases. Besides ANNs, support vector machines (SVMs), linear discriminant analysis (LDA), and random forests (RFs) have been successfully applied in this setup (Rajwa et al., 2008; Rubbens et al., 2017). These lab-based studies have demonstrated the usefulness of the information captured by flow cytometry for bacterial and phytoplankton identification. However, it is difficult to transfer this knowledge directly to samples taken from the field. As the identity of species present is often unknown, labels are not available to train supervised models. When analysing field samples, unsupervised clustering approaches are therefore used to group together cells that have similar optical properties. Examples include Gaussian mixture models (GMMs), graph-based clustering, and self-organizing maps (SOMs) (Hyrkas et al., 2016; Sgier et al., 2016; Bowman et al., 2017).
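
To make the idea of unsupervised clustering of cells more concrete, the sketch below fits a two-component Gaussian mixture to a single optical channel with a plain expectation-maximization loop. It is only an illustration under stated assumptions: the "log-fluorescence" values are simulated, the number of populations (two) is fixed in advance, and the helper names (`normal_pdf`, `fit_gmm_1d`, `assign`) are hypothetical. Real cytometry data are multivariate and would normally be handled with a dedicated library.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_gmm_1d(values, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture by expectation-maximization."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                      # crude initialization at the data extremes
    sds = [(hi - lo) / 4.0, (hi - lo) / 4.0]
    weights = [0.5, 0.5]
    n = len(values)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each cell.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300   # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate the weight, mean, and spread of each component.
        for j in range(2):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            sds[j] = max(math.sqrt(var), 1e-6)
    return weights, means, sds

def assign(x, weights, means, sds):
    """Index of the component with the highest responsibility for x."""
    dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    return dens.index(max(dens))

# Simulated field sample (hypothetical): two unlabelled populations of cells.
rng = random.Random(1)
cells = [rng.gauss(1.0, 0.3) for _ in range(300)] + [rng.gauss(3.0, 0.4) for _ in range(700)]
rng.shuffle(cells)

weights, means, sds = fit_gmm_1d(cells)
clusters = [assign(x, weights, means, sds) for x in cells]
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

No labels are used at any point: the two populations are recovered from the shape of the distribution alone, which is why this family of methods suits field samples of unknown composition.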
In some cases, cell populations do not form distinct patches that can be isolated by clustering: when the complexity of microbial communities is high (i.e. many taxa) or the resolution is limited (e.g. due to the instrumental setup or when studying heterotrophic organisms). Cytometric fingerprinting approaches do not try to identify cell populations; instead, they focus on modelling the multivariate distribution of the cytometric data, by defining informative regions in this distribution and recording cell counts or densities in those regions. Often, binning approaches are employed, although more advanced strategies have become available as well, e.g. by overclustering the data using a GMM (Rubbens et al., 2021) or by an automated deleting, merging, and shrinking of Gaussian mixtures (Bruckmann et al., 2022).

A few hybrid approaches have been proposed for freshwater samples, in which information from laboratory cultures is used to analyse natural samples. RF classification was used to differentiate noise from signal using lab-grown cultures and then used to remove the noise in natural samples (Thomas et al., 2018). Learned representations of lab-grown cultures can also be used as proxies to describe the dynamics of a microbial community in a natural sample (Özel Duygan et al., 2020).

Raman spectroscopy is an alternative, information-rich, single-cell technology for the identification of marine microorganisms. Spectra typically contain many more variables than traditional flow cytometry data; therefore, the use of convolutional neural networks (CNNs) should be beneficial to summarize this information and get to single-particle identification. When a CNN was trained on Raman spectroscopy data, it reached a high classification accuracy for 13 marine microorganisms (∼95%), but this was similar to that of SVM and LDA, probably because of the low sample size (Liu et al., 2020).
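
Returning to the binning-based cytometric fingerprints described earlier in this section, the following sketch shows one way such a fingerprint could be computed. It is a simplified illustration, not the procedure of any cited study: the two channels (scatter, fluorescence), the fixed 0 to 4 range, the equal-width grid, the simulated samples, and the summed absolute difference used to compare fingerprints are all assumptions.

```python
import random

def fingerprint(cells, n_bins=4, lo=0.0, hi=4.0):
    """Relative cell density in each cell of an n_bins x n_bins grid over two channels."""
    counts = [0] * (n_bins * n_bins)
    width = (hi - lo) / n_bins
    for scatter, fluo in cells:
        # Clip to the grid so that every cell falls in exactly one bin.
        i = min(max(int((scatter - lo) / width), 0), n_bins - 1)
        j = min(max(int((fluo - lo) / width), 0), n_bins - 1)
        counts[i * n_bins + j] += 1
    return [c / len(cells) for c in counts]

def simulate_sample(rng, n, centre):
    """Hypothetical sample: n cells scattered around one (scatter, fluorescence) centre."""
    return [(rng.gauss(centre[0], 0.5), rng.gauss(centre[1], 0.5)) for _ in range(n)]

rng = random.Random(42)
sample_a = simulate_sample(rng, 2000, (1.0, 1.0))
sample_b = simulate_sample(rng, 2000, (3.0, 2.5))

fp_a = fingerprint(sample_a)
fp_b = fingerprint(sample_b)

# One simple way to compare two fingerprints: summed absolute difference (0 = identical).
difference = sum(abs(a - b) for a, b in zip(fp_a, fp_b))

# Tiny worked case on a 2 x 2 grid: four cells, three occupied bins.
tiny = fingerprint([(0.5, 0.5), (0.5, 0.5), (3.5, 3.5), (3.5, 0.5)], n_bins=2)
print(tiny)  # [0.5, 0.0, 0.25, 0.25]
```

The fingerprint is a fixed-length vector whatever the number of cells, so samples can be compared or fed to downstream models without ever delineating cell populations.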

Describing ecosystems with acoustics

Light attenuates faster in water than in air, limiting cameras to observing a small volume (albeit at high resolution). Sound propagates over long distances and is used to monitor the ocean interior. Sound also samples larger volumes of water than towed nets and can be used in areas that are otherwise difficult to reach, such as deep water and rough bathymetry. Both active sensors, which emit sound and measure the returned echoes, either from organisms in the water column or from the seabed, and passive sensors, which just “listen”, are commonly used in marine science. The following text is organized along those categories.

[Figure 5 plot: grid of methods (rows) by time period (columns); axes: Date, Method; colour scale: Number of references (0 to 6). Row totals: Convolutional Neural Network n=8, Random Forest n=11, Support Vector Machine n=6, Principal Component Analysis n=10, k means n=9, Artificial Neural Network n=14, Mixture Model n=5, Discriminant Analysis n=6. Column totals: [1988, 1998] n=5, ]1998, 2008] n=5, ]2008, 2013] n=14, ]2013, 2016] n=9, ]2016, 2018] n=15, ]2018, 2020] n=21.]
Figure 5. Evolution of the methods used for target classification from active acoustics data, through time. The labels give the total number of references per row or column of the plot. The colour is proportional to the number of references published in the time period of the column and using the method of the row. A single reference can appear in several rows if it uses several methods.

Active acoustics for target classification

Active acoustics are widely used in fisheries and aquaculture to evaluate the spatial and temporal distributions of organisms, measure their size distribution, and calculate population structure, as well as characterize the behaviours of species. In all cases, the analysis starts with the identification
of the returned echo, also called target classification (Korneliussen, 2018). This process frequently involves manually checking, cleaning, processing, and scrutinizing the echogram features. Target objects are then delineated and ascribed to species using “expert” knowledge gained from biological samples. This heavy dependence on manual operations makes the process time-consuming and vulnerable to bias; scalable and reproducible methods, such as ML-based approaches, are therefore needed.

Early attempts to automate target classification typically used deterministic features computed from the data, using details in individual echo pulses (Rose and Leggett, 1988) and/or school-based features like shape or energy, as well as auxiliary information like location or depth; this information was passed to a range of classifiers. Artificial neural networks (ANNs) were used early on (Cabreira et al., 2009; Figure 5). Random forests (RFs) were used with school-based features and auxiliary information, for fish identification (Fallon et al., 2016). Peña (2018) recently reviewed clustering techniques for acoustic data and concluded that expectation-maximization (EM) clustering is the only technique that properly separates acoustic signatures (and noise), after a supervised initialization.

Wideband or multi-frequency echosounders added the frequency dimension to the data, which allowed for improved discriminatory power. Using the frequency response usually involved averaging over certain ping- or range-bins and comparing the scatter distributions to the properties of known aggregations.
Using the full broadband echo spectrum, an RF classifier was successful in classifying individual fishes (Gugele et al., 2021).

More recently, convolutional neural networks (CNNs) were used to classify the entire echogram and identify the primary species on patches of echo (Hirama et al., 2017; Figure 5). Shang and Li (2018) used simulated data to compare different classifiers using different features and CNNs reached the best performance. Regions of interest, identified on real data, were more accurately identified by CNNs with various architectures (ResNet, DenseNet, Inception) than by a support vector machine (SVM) classifier working on traditional manual features (Rezvanifar et al., 2019). CNNs have also been used for pixel-level predictions (i.e. segmentation) on raw acoustic data, using a U-net architecture (Brautaset et al., 2020) or Mask-Regional CNN (Marques et al., 2021), trained on manually labelled data. Such supervised methods require large amounts of training data, while recently developed semi-supervised methods allowed only ∼10% of the training data to be labelled (Choi et al., 2021).

Active acoustics for seabed and sediment mapping

Active acoustics are also used to map seabed topography and sediment cover, which condition the type of benthic biological community that can develop. Various methods reach high spatial resolutions and accuracy, such as single-beam echosounders, sidescan sonars, and reflection seismographs, but multi-beam echosounders (MBES) are the most cost-effective for mapping large areas (Anderson et al., 2008; Brown et al., 2011). Bathymetry and backscatter data (and their derivatives) are interpreted in order to characterize the type of seabed substrate. For a thorough description of conventional sea bottom classification systems, see the extensive work of Hamilton (2001).

One major challenge for seabed mapping is that the manual interpretation of seabed features from acoustic data is very time-consuming and highly subjective.
This explains the increased interest for automated approaches, including inversion algorithms, image-processing techniques, and, mostly, ML (Brown et al., 2011; Stephens and Diesing, 2014).

In the 1990s, the early ML approaches were ANNs, e.g. Stewart et al. (1994), who successfully classified three different seafloor types based on sidescan sonar data. Dartnell and Gardner (2004) used hierarchical decision trees (DTs) trained on four types of images (backscatter intensity and three variance images). Using 60 ground truth sediment samples, they predicted seafloor types in Santa Monica Bay with an accuracy of 72%, which was better than other automated classification methods at the time.

Since then, a variety of ML methods have been scrutinized through comparative studies (Ierodiaconou et al., 2011; Stephens and Diesing, 2014; Shao et al., 2021). The classification algorithms were very diverse, covering tree-based methods (DT; random forest, RF; Quick Unbiased and Efficient Statistical Tree, QUEST; Classification Rule with Unbiased Interaction Selection and Estimation, CRUISE; etc.), SVMs, maximum-likelihood classifiers (MLCs), and ANNs. In many cases, ML-based approaches were not significantly different from one another but were vastly superior to the usual manual interpretation procedures.

Recently, Cui et al. (2021) demonstrated how a deep belief network (DBN) based on fuzzy ranking feature optimization can be used to map sediment distribution over large areas. Fuzzy ranking is a technique used to identify the feature combination, derived from the MBES data, that is most appropriate for the DBN to correctly classify the seabed sediment type. The accuracy of the DBN proved higher than that of five other supervised classification models (DT, RF, SVM, MLC, and ANN).

Passive acoustics monitoring

Passive acoustic recordings are a reliable and cost-effective method to monitor habitat use, distribution, density, and behaviour of species over space and time. They can be obtained from boats, autonomous devices (either fixed or moving ones), cabled stations, and animal tags, making them usable in a variety of situations (Kowarski and Moors-Murphy, 2021). However, because of their relative ease of use, hydrophones quickly generate large datasets that require automation to extract information from them (Gibb et al., 2019).

The most common approach to process acoustic data is to detect and classify specific acoustic events in a supervised manner. Sound source classification studies have primarily focused on shipping (Zaugg et al., 2010) and mammals’ vocalizations (66 out of the 101 references we recorded; Bittle and Duncan, 2013). In the latter, detection and classification algorithms have been used to identify species (Bermant et al., 2019), specific calls (Bergler et al., 2019), or even dialects and individuals (Brown et al., 2010). ML can also be used to localize the position of, or estimate the range to, a certain source without the need to model the sound propagation (Niu et al., 2017), outperforming conventional matched field processing methods. Another application is to relate properties of the source with characteristics of the sound, through regression; these properties included the size of male sperm whales (Beslin et al., 2018) or fish abundance (Rowell et al., 2017).
In addition, ML can be used for acoustic source separation, a problem known as the cocktail party problem (Bermant, 2021). Finally, approaches to characterize entire habitats from their soundscape have also been explored (Lin et al., 2019).

A common approach is to extract human-engineered features from the sound and use them as input for an ML algorithm. These features can be derived from the time, frequency, or cepstral domain (transformation of the data to highlight periodic signals), or based on the full image of the spectrogram, a visual representation of sound intensity per frequency as a function of time (Sharma et al., 2020). The algorithms used for classification include SVMs (Jarvis et al., 2008), RFs (Malfante et al., 2018), Gaussian mixture models (GMMs; Roch et al., 2011), and k-means (Weilgart and Whitehead, 1997), among others. More focus has been put on identifying which features are relevant for the classification and characterization of sound events than on which classifier performs best. Often these features or other rule-based signal processing techniques are also used to first segment the data, and then ML is used to classify the detected segments.

Advances in image and speech-recognition algorithms have been applied to underwater sound, reducing the amount of preprocessing and improving performance and generalization (Schröter et al., 2019). In DL approaches, sound is often converted into a spectrogram, which is considered as an image and input into a convolutional neural network (CNN) for classification, regression, or feature extraction and clustering (Bermant et al., 2019; Thomas et al., 2020). However, recently some models have been developed that are applied directly on the waveform (Roch et al., 2021).

In the marine context, sounds of interest can be very sparsely occurring and datasets can comprise long periods of time. This leads to highly imbalanced datasets.
This imbalance is usually solved by first detecting and then classifying the detected sounds, where the detection step is a rule-based signal-processing algorithm and the classification step is a DL approach (Stowell, 2022). However, the biggest limitation for the application of ML to passive acoustic recordings is the lack of knowledge regarding which sounds are produced by which species, because visual surveys to associate sound with images of the species are often impossible. This leads to a lack of data annotation and limits the usage of supervised ML approaches. To compensate for the lack of ground-truth data, unsupervised clustering algorithms are being developed to acquire general information about the ecology of certain habitats (Ozanich et al., 2021).

Profiling biological communities with environmental genomics

The study of nucleic acids obtained from an environmental sample is termed environmental genomics (or meta-omics). In marine ecology studies, the genetic information usually comes from a community of organisms rather than from a single specimen, which is our focus here. Metabarcoding (amplification by polymerase chain reaction and sequencing of a taxonomically informative gene) allows documenting biological communities in terms of species presence and proportions. Metagenomics (shotgun sequencing of a complex mixture of genomic DNA) provides information on random sections of genomes, allowing us to gain insight into both taxonomy and functions. Metatranscriptomics (shotgun sequencing of isolated RNA transcripts) provides similar information for genes active at the time of sampling.

ML approaches have long been used for genomics data analysis. This includes both translating raw signals into nucleotides using base-calling algorithms (Wick et al., 2019) and sequence data analysis.
For instance, hidden Markov models have been extensively used for functional annotations, multiple sequence alignments (Yoon, 2009), and more recently for the detection of viral signatures in metagenomic datasets (Ponsero and Hurwitz, 2019). However, few studies have applied ML to strictly marine meta-omics data. We therefore provide a general overview of the analysis of metabarcoding data and highlight some ML applications to marine data.

Metabarcoding datasets are usually processed by well-established bioinformatics software, e.g. QIIME 2 (Bolyen et al., 2019), which translates raw sequences into statistically exploitable species-to-sites count matrices. Sequences are often grouped into operational taxonomic units (OTUs) or amplicon sequence variants (ASVs) based on their similarity. These sequence units then serve as a proxy for species/strains to document biodiversity changes. Current algorithms to cluster sequences into OTUs or ASVs are VSEARCH (Rognes et al., 2016), which relies on an arbitrary similarity cutoff to delineate OTUs (e.g. 97%), SWARM (Mahé et al., 2015), which aggregates neighbouring sequences to abundant, supposedly genuine, seed sequences, or DADA2 (Callahan et al., 2016), which uses base calling values to separate spurious from genuine sequences. These two latter methods find more “natural” boundaries of OTUs and, as such, can be considered as unsupervised approaches. Some OTUs are then assigned a taxonomic name based on similarities with known sequences from curated databases (e.g. PR2, Guillou et al., 2013; SILVA, Quast et al., 2013).
To this end, several ML-based methods have been developed, including naive Bayes (NB) classifiers (the RDP classifier, Wang et al., 2007) and classification trees using k-mer distributions across sequences (Murali et al., 2018). More recent work successfully applied convolutional neural networks (CNNs) to process and taxonomically annotate raw metabarcoding data faster, without relying on OTUs or ASVs (Flück et al., 2022).

Resulting OTU-to-site count matrices are then amenable to biodiversity analysis using compositionality-aware multivariate statistics (Quinn et al., 2019). For example, ML allows routine monitoring of the impact of industries on marine biodiversity. Based on metabarcoding datasets labelled with ecological states obtained by conventional methods, random forest (RF) models can be trained to assess the ecological status of new samples, based on their metabarcoding profiles alone. This is faster and more cost-effective than conventional morpho-taxonomy approaches, enabling biomonitoring programs to be scaled up in space and time (Cordier et al., 2018; Frühe et al., 2021).

Network ecology research has been developed on interactions between macro-organisms (e.g. plant-pollinator interaction networks), but many interactions remain difficult to observe and validate. This is especially true within microbial communities, for which statistical frameworks have been developed to detect co-occurrence patterns and include them into more holistic ecological studies.
ML techniques can be used to predict species interactions (Vacher et al., 2016; Bohan et al., 2017) and can outperform generalized linear models in identifying trait-matching combinations (Pichler et al., 2020). Microbial networks can be inferred from genomics data (Faust and Raes, 2012; Lima-Mendez et al., 2015) as a means to predict putative biotic interactions, which opens new avenues for understanding the links between marine microbial communities and the large-scale functioning of marine ecosystems (Guidi et al., 2016; Chaffron et al., 2021). Finally, ML is expected to improve our capacity to analyse massive meta-datasets composed of numerous collated cross-study genomics data, by controlling for covariates (Wirbel et al., 2021).

Quantifying and mapping fishing pressure from geolocation data

Fishing and shipping activities exert considerable pressure on marine ecosystems. They are often tracked using vessel monitoring systems (VMSs) or the automatic identification system (AIS), which transmits vessel locations at regular intervals (Thoya et al., 2021). VMSs are required by fisheries management agencies for many commercial fishing vessels and the data are often confidential. AIS is designed for maritime safety, for any type of vessel, and the data are more broadly accessible. These data are often extensively processed using ML to identify vessel and gear types (Russo et al., 2011; Marzuki et al., 2018; Taconet et al., 2019).

Many studies have classified fishing vs. non-fishing behaviours using artificial neural networks (ANNs; Bertrand et al., 2008; Russo et al., 2014) and random forests (RFs; Ducharme-Barth and Ahrens, 2017; Behivoke et al., 2021). To do so, the movement characteristics of vessels across space, time, and habitats are often studied and summarized before being provided to the ML classifier. Kroodsma et al.
(2018) trained convolutional neural networks (CNNs) with AIS data to identify fishing vs. non-fishing behaviours and fishing gear types, producing the first map of the global footprint of fisheries (Taconet et al., 2019).

The outputs of these models have been used not only to assess fishing pressure but also in ecological studies to estimate noise impacts (Allen et al., 2018), assess marine spatial planning or monitor conservation areas (Robards et al., 2016; White et al., 2020), identify species distribution (Le Guyader et al., 2016), minimize mammal strike risk (Fournier et al., 2018), and mitigate bycatch (Richards et al., 2021).

To integrate fishing activity with the rest of the ecosystem, ML efforts on fishery geolocation data have used an expanded suite of predictor variables. For example, several studies used boosted regression trees (BRTs) to relate fishing locations with environmental information (e.g. sea surface temperature) and then predict dynamic maps of fishing activity from environmental data (Soykan et al., 2014; Crespo et al., 2018). Other studies added bio-economic considerations into fisher location-choice frameworks, with ANNs (Dreyfus-Leon and Kleiber, 2001; Russo et al., 2019). By characterizing fishing behaviours using these broader features (e.g. environment, bio-economics), ML approaches provide a valuable foundation for operational, dynamic, ocean management tools that support ecosystem-based fishery management in near real-time (Hazen et al., 2018).

Deriving biogeochemical variables from satellite images and float profiles

Historically, most in situ measurements used for the characterization of ocean biogeochemical processes were acquired using ships, resulting in critical undersampling at a global scale. Advances in remote sensing (by ocean colour satellites) and in situ robots now allow sampling marine bio-optical variables at unprecedented spatio-temporal resolution (Claustre et al., 2020). Yuan et al.
(2020) provide a review of applications of DL to environmental remote sensing for estimating atmospheric, land, and oceanic physical, chemical, optical, and biogeochemical variables. One section is dedicated to the use of ML for the retrieval of remotely sensed ocean colour parameters, mainly focussed on the estimation of the chlorophyll-a concentration. However, ML has also been applied to remote-sensing data to derive fields of inherent optical properties of the seawater (Ioannou et al., 2011, 2013), pCO2 (Landschützer et al., 2015), primary production (Mattei et al., 2018), phytoplankton community composition (Stock and Subramaniam, 2020), particulate organic carbon (Liu et al., 2021), dissolved inorganic carbon (Roshan and DeVries, 2017), and nitrogen fixation rate (Tang et al., 2019), as well as to perform atmospheric correction (Jamet et al., 2005; Brajard et al., 2012) (Figure 6).

One remarkable example of using ML for ocean science is the synergy between satellite observations and in situ profiles, in particular from the Argo programs (>100000 currently). Sauzède et al. (2016) used a multi-layer perceptron to extend surface bio-optical properties to depth. This produces four-dimensional (i.e. longitude, latitude, depth, and time) fields of biogeochemical variables at global or regional scales, which fill in situ observational gaps. Such continuous fields are particularly valuable for the initialization and validation of biogeochemical models. They are now reaching operational status since four-dimensional fields of chlorophyll-a concentration and particulate organic carbon generated by these methods have recently been made publicly available on the European online portal Copernicus Marine Environment Monitoring Service.
Figure 6. Machine learning methods used with satellite imagery data. Artificial neural networks (in blue shades), and, in particular, multi-layer perceptrons, dominate the literature that was reviewed. [Labels shown in the figure: Bayesian network, multi-layer perceptron, artificial neural networks, trees, self-organising map, convolutional neural network, fuzzy logic, genetic algorithm, support vector machine, gradient boosting, random forest, other.]

Finally, ML methods are also used to estimate the more scarcely measured biogeochemical variables from the more commonly measured physical ones. For example, an ANN was trained to predict nutrient concentrations and carbonate system variables from over 250000 profiles of pressure, temperature, salinity, and oxygen concentration (Bittig et al., 2018). The predictor variables can be measured with very high accuracy by autonomous floats and now ANN-based methods can spatially and temporally populate the fields of nutrients and carbon variables, which were previously loosely resolved. MLPs have also been used to predict the phytoplankton community composition from profiles of fluorescence of chlorophyll-a (Sauzède et al., 2015a), making it possible to gather and homogenize tens of thousands of fluorescence profiles available from historical databases, which could not be integrated in global analyses before (Sauzède et al., 2015b).
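The regression setting described above (predicting a scarcely measured variable from commonly measured ones) can be sketched with a very small multi-layer perceptron written in plain Python. Everything in this sketch is an assumption made for illustration: the synthetic predictors and target, the names `train_mlp` and `mse`, the network size, and the learning rate. It is not the model of Bittig et al. (2018) or Sauzède et al., which rely on far more data, predictors, and tuning.

```python
import math
import random


def train_mlp(X, y, n_hidden=4, lr=0.1, n_epochs=1000, seed=0):
    """Fit a one-hidden-layer perceptron (tanh units, linear output) by
    full-batch gradient descent on half the mean squared error.
    Returns a function mapping one input vector to a prediction."""
    rng = random.Random(seed)
    n, n_in = len(X), len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0

    def forward(x):
        h = [math.tanh(sum(w * v for w, v in zip(W1[j], x)) + b1[j])
             for j in range(n_hidden)]
        return h, sum(w * hj for w, hj in zip(W2, h)) + b2

    for _ in range(n_epochs):
        gW1 = [[0.0] * n_in for _ in range(n_hidden)]
        gb1 = [0.0] * n_hidden
        gW2 = [0.0] * n_hidden
        gb2 = 0.0
        for x, target in zip(X, y):
            h, out = forward(x)
            err = out - target  # derivative of 0.5 * err**2 w.r.t. the output
            gb2 += err
            for j in range(n_hidden):
                gW2[j] += err * h[j]
                dh = err * W2[j] * (1.0 - h[j] ** 2)  # back-propagate through tanh
                gb1[j] += dh
                for i in range(n_in):
                    gW1[j][i] += dh * x[i]
        # Gradient step, averaged over the batch.
        b2 -= lr * gb2 / n
        for j in range(n_hidden):
            W2[j] -= lr * gW2[j] / n
            b1[j] -= lr * gb1[j] / n
            for i in range(n_in):
                W1[j][i] -= lr * gW1[j][i] / n

    return lambda x: forward(x)[1]


def mse(predict, X, y):
    """Mean squared error of a prediction function over a dataset."""
    return sum((predict(x) - t) ** 2 for x, t in zip(X, y)) / len(y)


# Synthetic, standardized predictors (hypothetical stand-ins for, e.g.,
# temperature and salinity) on a 5 x 5 grid.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
X = [[a, b] for a in grid for b in grid]
# Invented smooth target (hypothetical stand-in for a nutrient concentration).
y = [0.8 * a - 0.5 * b + 0.3 * a * b for a, b in X]

model = train_mlp(X, y)
mean_y = sum(y) / len(y)
baseline_mse = mse(lambda x: mean_y, X, y)  # constant predictor, for reference
trained_mse = mse(model, X, y)
print(f"baseline MSE = {baseline_mse:.4f}, trained MSE = {trained_mse:.4f}")
```

The comparison with a constant predictor is only a sanity check on the training data; a real application would evaluate on profiles held out from training.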
Machine learning to improve ecological understanding

Once ecology-ready tables of data have been extracted from raw sources (see section “Machine learning to extract information from observational data”), they can be analysed to gain a better understanding of socio-ecological marine systems (this section). Such studies traditionally use statistics, often multivariate, and modelling to capture relationships between observed variables; this task is also amenable to ML. In this section, we highlight how ML techniques are used to relate species to their environment and, in particular, predict species distributions, detect dynamic interactions involving several species, and, finally, inform ecosystem management by partitioning the environment into easier-to-understand units through regionalization and fueling monitoring and decision-support tools.

This field is even more difficult to map through literature searches than the more technical studies presented in the previous section. Some searches with relevant keywords yielded >10000 results, while others with minor differences yielded only hundreds. Therefore, in this section, even more than in the previous one, we focus on presenting papers that showcase different approaches.

Predicting species abundance and distribution

The ability of ML approaches to capture complex and non-linear relationships, as well as their ability to work with missing and heterogeneous data, has driven their popularity for the analysis of species–environment relationships.

When data are sparse or heterogeneous, often also leading to high uncertainty, Bayesian ML methods have proven useful. Fernandes et al. (2010) predicted fish recruitment using a naive Bayes (NB) classifier relying on spawning stock biomass, climate, and weather data. Fernandes et al.
(2013) used multi-dimensional Bayesian networks for a similar task and found that predicting three species simultaneously doubled the chance of being correct, compared to three single-species models. Lehikoinen et al. (2019) used tree-augmented NB models to evaluate the influence of various environmental factors, all heterogeneous in type and in spatio-temporal resolution, on coastal fish abundance. They note that some environmental factors are not relevant to predict average abundances, but are important for extreme ones.

Tree-based ensemble models such as random forests (RFs) and boosted regression trees (BRTs) have also proven useful with ecological data thanks to their versatility and ease of use. Knudby et al. (2010) found tree-based methods superior to linear models in predicting species richness, biomass, and diversity in coral reefs based on habitat variables. Suikkanen et al. (2021) used RF regression to analyse the relationships of zoo- and phytoplankton (particularly cyanobacteria) in multi-decadal (but relatively sparse) monitoring data to find whether relationships found in experiments could also be seen in field data.

Species distribution models (SDMs) are frequently applied to perform spatially explicit analyses of ecological data. They quantify the relationship between species occurrence or abundance and their environment and can then be used to predict their potential geographical distribution (Guisan and Thuiller, 2005; Elith and Leathwick, 2009). A significant body of literature compared the performance of ML-based SDMs with multivariate linear regression or climate envelope methods, generally finding that ML methods yield better predictive performance but are prone to overfitting (e.g. Derville et al., 2018). The most widely applied ML method for SDMs is Maxent, with over 6000 published papers, which showcases the power and broad applicability of ML for ecological inference (Phillips and Dudík, 2008; Elith et al., 2011).
Maxent works with records of a species present at given points in space and iteratively maximizes the probability of presence at these points, predicted from functions of environmental variables at the same points (Phillips et al., 2006). But many other ML approaches are also used in species distribution modelling, such as decision trees (DTs; Hunt et al., 2020), BRTs (Elith et al., 2008; Cimino et al., 2020), RFs (Reiss et al., 2011), support vector machines (SVMs; Knudby, 2010; Vestbo et al., 2018), and artificial (Benkendorf, 2020) and convolutional neural networks (CNNs; Deneu et al., 2021). These models have been applied to resolve a diverse range of ecological and conservation issues, including understanding species ecology (Brodie et al., 2018), responses to current and future environmental change (Hindell et al., 2020), threat overlap (Welch et al., 2018), and the design and evaluation of spatial management scenarios (Stock et al., 2020; Smith et al., 2021). Across all applications, communicating the uncertainty of SDMs to stakeholders is critical. In general, estimating uncertainty within ML-based SDMs is difficult, and most solutions underestimate model uncertainty (Beale and Lennon, 2012; Watling et al., 2015; Brodie et al., 2020). However, new approaches, such as Bayesian additive regression trees, are emerging and improving our estimation of uncertainty (Carlson, 2020).
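To make the idea of a species distribution model concrete, the sketch below fits the probability of presence as a function of a single environmental variable. It deliberately uses logistic regression with a quadratic temperature term, a much simpler stand-in for the ML methods named above (it is not Maxent, a BRT, or an RF). The survey data, the function names `fit_sdm` and `suitability`, and the hyperparameters are all invented for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_sdm(temps, presence, lr=0.5, n_iter=2000):
    """Fit P(presence) = sigmoid(b + w1*t + w2*t**2), where t is the
    standardized temperature, by gradient ascent on the mean log-likelihood.
    Returns a function mapping a temperature to a predicted probability."""
    n = len(temps)
    mean = sum(temps) / n
    sd = math.sqrt(sum((t - mean) ** 2 for t in temps) / n)
    feats = [((t - mean) / sd, ((t - mean) / sd) ** 2) for t in temps]
    b = w1 = w2 = 0.0
    for _ in range(n_iter):
        gb = g1 = g2 = 0.0
        for (x1, x2), obs in zip(feats, presence):
            resid = obs - sigmoid(b + w1 * x1 + w2 * x2)
            gb += resid
            g1 += resid * x1
            g2 += resid * x2
        b += lr * gb / n
        w1 += lr * g1 / n
        w2 += lr * g2 / n

    def suitability(temp):
        x1 = (temp - mean) / sd
        return sigmoid(b + w1 * x1 + w2 * x1 ** 2)

    return suitability


# Invented survey: one record per degree from 10 to 24 degrees C;
# the hypothetical species is recorded only between 15 and 19 degrees C.
temps = [float(t) for t in range(10, 25)]
presence = [1 if 15 <= t <= 19 else 0 for t in temps]

suitability = fit_sdm(temps, presence)
for t in (10.0, 17.0, 24.0):
    print(f"{t:.0f} degrees C: predicted probability of presence = {suitability(t):.2f}")
```

Applying `suitability` to a gridded temperature field would give a map of predicted habitat; the sketch gives no estimate of the uncertainty discussed above, which would require, for example, refitting on resampled data.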
Capturing dynamic ecological relationships

As climate variability and long-term change drive non-stationarity in ecosystems, more research is needed to see how ML approaches can improve our ability to predict and forecast potentially changing species relationships with their environment and other species. Latent (hidden) variable modelling provides one way to detect an underlying systemic change, or to approximate an ecosystem component that is not represented in the dataset. Trifonova et al. (2015) modelled the North Sea ecosystem using dynamic Bayesian networks with hidden variables (DBN-HVs), and concluded that a hidden variable in the model managed to learn the zooplankton biomass variations in all modelled areas. Trifonova et al. (2017) used this model to predict ecosystem responses under different scenarios. Uusitalo et al. (2018) and Maldonado et al. (2019) created a DBN-HV model for the central Baltic Sea food web and found that the hidden variables replicated the regime shift, i.e. the drastic change in the ecosystem organization that has been reported by Alheit et al. (2005) and others. These studies exemplify the ability to combine data analytics and domain knowledge through ML to build explanatory models that provide new insight into ecosystem functioning. Sander et al. (2017) used DBNs to infer ecological relationships, but note that presence–absence data may not provide enough signal for these models. Pichler et al. (2020) evaluated the ability of multiple ML methods to infer species interactions in the terrestrial domain, but similar approaches could be applied to marine data.

Summarizing ecosystems through regionalization

Because the ocean is spatially and temporally heterogeneous, dividing it into various types of regions (bioregions, ecoregions, provinces, essential habitats, etc.) provides a means of simplifying and summarizing this heterogeneity into units amenable to further analysis and management.
This approach was pioneered by Longhurst et al. (1995), who defined 57 biogeochemical provinces mainly using regional variation of remotely sensed chlorophyll-a. In more recent years, ML techniques have been adopted to provide more objective classifications. For example, bioregions have been defined based on chlorophyll-a dynamics using k-means clustering (Mayot et al., 2016) and hierarchical Iso Cluster classification (Welch et al., 2016). Multiple biophysical variables have been used as input to multivariate unsupervised clustering to define pelagic habitats (Hobday, 2011; Reygondeau et al., 2018) or track the spatial variability of ocean water masses (Phillips et al., 2020). The concentration of biological organisms derived from survey data (Santora, 2012), ecosystem models (Sonnewald et al., 2020), and species distribution models (Welch and McHenry, 2018) has also been integrated into classifiers to define ecoregions. Such ecoregions can be useful for spatial planning purposes since they are quite close to the biological targets of such management procedures (Douglass et al., 2014).

Supporting human decisions on ecosystem management

Finally, we also need to evaluate human–ecosystem interactions and define management strategies that support the health and sustainable use of marine ecosystems. These strategies are often defined in intergovernmental texts (e.g. the EU Marine Strategy Framework Directive) that summarize them in terms of quantifiable objectives; ML can help assess those objectives. For example, the likelihood of reaching the goals set by the European Union’s Water Framework Directive in Finland was modelled using Bayesian networks (Fernandes et al., 2012). In another example, the accuracy of the automatic classification of plankton images was assessed by checking whether it could provide zooplankton indicators for the EU’s Marine Strategy Framework Directive (Uusitalo et al., 2016).
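How a Bayesian network turns such a question into a probability can be illustrated with a toy three-node chain: a management measure influences nutrient load, which in turn influences ecological status. The structure, the variable names, and every probability below are hypothetical and chosen only to keep the arithmetic easy to follow; they are not taken from Fernandes et al. (2012).

```python
# Conditional probability tables for a three-node chain:
#   measure (implemented?) -> nutrient load -> ecological status.
# All probabilities are invented for illustration.
P_MEASURE = {True: 0.5, False: 0.5}        # prior on the management measure
P_LOW_LOAD = {True: 0.7, False: 0.3}       # P(load = low | measure)
P_GOOD_STATUS = {"low": 0.8, "high": 0.2}  # P(status = good | load)


def p_good_given_measure(measure):
    """P(status = good | measure), obtained by summing out the nutrient load."""
    p_low = P_LOW_LOAD[measure]
    return p_low * P_GOOD_STATUS["low"] + (1.0 - p_low) * P_GOOD_STATUS["high"]


def p_measure_given_good():
    """P(measure implemented | status = good), by Bayes' rule."""
    joint = {m: P_MEASURE[m] * p_good_given_measure(m) for m in (True, False)}
    return joint[True] / (joint[True] + joint[False])


print(round(p_good_given_measure(True), 2))   # 0.62
print(round(p_good_given_measure(False), 2))  # 0.38
print(round(p_measure_given_good(), 2))       # 0.62
```

With these made-up numbers, implementing the measure raises the probability of reaching good status from 0.38 to 0.62. Real applications involve many more nodes, with tables estimated from data or elicited from experts.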
Early warning regarding specific health indices or potentially harmful species is another area where the fast throughput of ML approaches can improve our practice. A major effort has been spent in predicting algal blooms affecting recreational activities, fisheries, and shellfish farming (Campbell et al., 2013; Fernandes-Salvador et al., 2021). Similar approaches are also used for predicting fish recruitment (Dreyfus-León and Chen, 2007; Fernandes et al., 2010) or forecasting litter accumulations on beaches (Granado et al., 2019; Hernández-González et al., 2019).

An international commitment to protect 10% of the ocean by 2020 showcased the importance of spatial planning as a management tool for marine resources (Grorud-Colvert et al., 2019). ML methods, such as automated plankton image classification, are used to monitor and inform the creation of marine protected areas (Muñoz et al., 2017; Benedetti et al., 2019). Dedman et al. (2017) developed a tool to simplify the use of marine spatial planning tools based on boosted regression trees. Bayesian networks in combination with geographical information systems are being used to analyse conflicting uses, e.g. how to reallocate aquaculture and different fishing fleets with minimal harm (Coccoli et al., 2018; Gimpel et al., 2018), to plan the locations of new activities such as wind energy (Pınarbaşı et al., 2019), or to consider social and economic aspects in addition to environmental ones (Pınarbaşı et al., 2017; Laurila-Pant et al., 2019).

The efficient management of marine ecosystems would require taking decisions that are informed by the current and future states of these systems. ML can be used to build such decision support tools. For example, fish abundance and recruitment are good indicators of the status of fish stocks, and are used to set fishing regulations.
However, small pelagic fish recruitment does not follow traditional stock–recruitment relationships, which is why environmental conditions were used to forecast recruitment using ML-based regression (Chen and Ware, 1999; Fernandes et al., 2015) and to influence fisheries advice (Fernandes et al., 2009). The ML-based species distribution models described above have been integrated into operational, dynamic, ocean management tools (Hazen et al., 2018; Abrahms et al., 2019), in which management and policy recommendations update regularly in response to changes in biological, environmental, economic, and societal conditions (Welch et al., 2019).

Figure 7. Amount of references per time period using one of the four most common ML methods in the database. To avoid being misled by the global increase in the number of scientific publications, in any field, the amount is expressed as the proportion of the total number of references published in marine ecology in each time period (defined as the result of the query “WC = (Ecology) AND TS = (marine OR sea OR ocean)” on the Web of Science, i.e. Web of Science category is Ecology and title, abstract, or keywords contain “marine”, “sea”, or “ocean”). All curves increase through time, which means that ML is becoming more common within the field of marine ecology. [Axes: year, from 1990 to 2020; proportion of references using the method within the field of marine ecology, from 0.000 to 0.015. Series: artificial neural network (non-convolutional), convolutional neural network, random forest, support vector machine.]
Discussion and perspectives

General trends in machine learning applications: data, methods, and tasks

The diversity in the sections above shows that ML is now used in many fields of marine ecology, albeit at different levels of advancement. Several factors can account for the success of the application of ML in a given scientific domain. Based on the examples above, a major one seems to be the type of data: applications of ML were more successful when they could rely on techniques developed and tested in other fields, which could be repurposed to marine ecology because data were of the same type. This contributes to explaining the disproportionate number of applications of ML to images and videos from cameras, which constitute ∼45% of the references in the database, to which ∼15% of references using satellite imagery can be added (Figure 3). Many of those applications benefited from advances in ML motivated by the ubiquity of images in everyday life. For example, several CNN architectures were developed to classify general-purpose image datasets (often ImageNet; Deng et al., 2009), and when they were successful at this task, they also proved relevant for marine applications; for example, the ResNet architecture alone is used in at least 60 papers in the database. Beyond architectures, the weights that result from training CNNs on such large generic datasets are freely distributed by companies (to promote their technology) and can be slightly modified by a short retraining on a marine dataset to yield domain-specific tools (e.g. detect fishes in recordings from underwater cameras). This is called fine-tuning and requires far fewer resources than training from scratch, while yielding very good results.
This general approach, called transfer learning, is ubiquitous in the applications of CNNs reviewed above. On the other hand, single-cell spectra obtained from cytometry, for example, constitute a very peculiar type of data and therefore do not benefit from ready-made models; applications of ML to such data are therefore more difficult and scarcer. While sequences of nucleic acids are not common in everyday life, their analysis could still benefit from architectures and pre-trained weights designed for Natural Language Processing, since both are sequences of tokens (e.g. Quang and Xie, 2016). However, practically, omics often rely on well-established bioinformatics pipelines, which are not specific to questions in marine ecology and in which some steps do not involve ML; this contributes to explaining the relative scarcity of references from this large field here.

In terms of methods, the four most used algorithms in marine ecological research were, in increasing order of popularity, support vector machines (SVMs), random forests (RFs), convolutional neural networks (CNNs), and non-convolutional artificial neural networks (ANNs; mostly multi-layer perceptrons). ANNs have been used for a long time, which partly explains why they top the list of algorithms; SVMs, and then RFs, came after 2000; since 2013, the usage of CNNs has increased steeply and now they are the ML method most commonly found in new publications (Figure 7). The timing of the usage of those methods in marine ecology largely reflects their appearance or popularization in general: 1995 for SVMs (Cortes and Vapnik, 1995), 2001 for RFs (Breiman, 2001), and 2012 for CNNs (Krizhevsky et al., 2012); this highlights an early adoption of ML innovations by the marine ecology community. In addition, after the initial adoption, the proportion of studies using them among all studies in marine ecology has grown steeply (Figure 7), which is further evidence of a particular interest for ML approaches in this community.
The growth of CNNs, which have progressed the fastest, is associated with their popularity for several data types. Indeed, CNNs take so-called “tensors” as input: multidimensional arrays of numbers. Any type of data that can be made to look like an array within which the proximity between similar numbers is meaningful is amenable to being processed by CNNs. For example, while sounds can be treated as such, most acoustics records can also be represented as spectrograms (intensity as a function of time and frequency), which are tensors and can be processed with models initially designed for images (Stowell, 2022). Finally, depending on the output shape and the loss function used, the same network architecture can be used for regression, classification, object detection, etc. (Goodwin et al., 2022).

Among the papers tagged in the database, ML algorithms are most often used to perform classification (∼60% of references) or regression (∼20%), and, finally, object extraction (detection or segmentation, ∼15%). Yet, the classification of signals, at least, first requires their extraction from the original data (e.g. the detection of an event in a continuous acoustic recording, the segmentation of an organism from an image), so the discrepancy in usage is puzzling. Actually, most automated signal extraction is performed using rules deterministically applied to the raw data. Those rules can be as simple as thresholding (e.g. considering all adjacent dark pixels in an image as objects of interest) but are often much more complex and require both domain expertise to design and signal processing know-how to implement. This hindered the development of automated solutions and explains why objects of interest were (and are still) often extracted manually from underwater videos or acoustics recordings in operational deployments (e.g. Solsona-Berga et al., 2020).
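The simplest rule mentioned above, thresholding followed by grouping adjacent dark pixels, can be sketched in a few lines of plain Python. The function name `extract_objects`, the choice of 4-connectivity, the threshold value, and the toy image are assumptions made for illustration; operational pipelines typically add background correction, size filters, and other domain-specific rules.

```python
from collections import deque


def extract_objects(image, threshold):
    """Group 4-connected pixels darker than `threshold` into objects.

    `image` is a list of rows of grey levels (0 = black, 255 = white).
    Returns a list of objects, each a list of (row, col) pixel coordinates.
    """
    n_rows = len(image)
    n_cols = len(image[0]) if n_rows else 0
    seen = [[False] * n_cols for _ in range(n_rows)]
    objects = []
    for r in range(n_rows):
        for c in range(n_cols):
            if seen[r][c] or image[r][c] >= threshold:
                continue
            # Breadth-first flood fill starting from this unvisited dark pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < n_rows and 0 <= nx < n_cols
                            and not seen[ny][nx]
                            and image[ny][nx] < threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            objects.append(pixels)
    return objects


# Toy 4 x 5 image: two dark blobs on a light background.
toy_image = [
    [250, 250, 250, 250, 250],
    [250,  10,  20, 250, 250],
    [250,  15, 250, 250,  30],
    [250, 250, 250, 250,  25],
]
blobs = extract_objects(toy_image, threshold=100)
print([len(b) for b in blobs])  # [3, 2]
```

Each extracted pixel list could then be cropped from the image and passed to a classifier, which is the two-step extract-then-classify workflow described in the text.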
DL should enable ecologists to forgo some of the expertise in signal processing and allow extracting signals of interest only from labels placed on a subset of the data. The relative scarcity of their application has likely several explanations. First, deep models for object detection/segmentation are newer (Girshick et al., 2014) than for classification (Lecun et al., 1998) and their applications lag accordingly. Second, they are a bit more complex to set up than classifiers: drawing bounding boxes or segmentation masks is more time-consuming than sorting files into folders, training classifiers can often start from just this set of sorted raw files, while object detectors/segmenters require text files in a specific format containing the labels linked to the raw data files, etc. However, as labelling tools (e.g. Labelbox), architectures, and reference datasets (e.g. Katija et al., 2022) continue to improve, such applications are likely to explode in the future.

Finally, supervised ML approaches are much more common than unsupervised ones. This is partly linked with the dominance of classification tasks in the references reviewed. Supervised classification is the archetype of task where ML techniques outperform all others: mimic a simple human action, learn it only from examples generated by humans, and be evaluated almost solely on the quality of the prediction.

Limitations for the application of machine learning

Machine learning is particularly effective when the primary concern of ecologists aligns with the performance metric optimized by the technique (e.g.