A Classification-Based Machine Learning Approach to the Prediction of Cyanobacterial Blooms in Chilgok Weir, South Korea

Journal Article (2022)

Author(s)

J. Kim (IHE Delft Institute for Water Education, TU Delft - Civil Engineering & Geosciences, Human Resources Development Institute)

Andreja Jonoski (IHE Delft Institute for Water Education)

D.P. Solomatine (TU Delft - Civil Engineering & Geosciences, IHE Delft Institute for Water Education, Russian Academy of Sciences)

Research Group

Water Resources

Keywords

Machine learning; Feature selection; Cyanobacterial blooms; Classification algorithm; Imbalanced dataset; Oversampling

DOI related publication

https://doi.org/10.3390/w14040542 (final published version)

To reference this document use

https://resolver.tudelft.nl/uuid:e11bdfc7-f0e5-4358-9410-c5e9f67eb54b

Publication Year

2022

Language

English

Journal title

Water

Issue number

4

Volume number

14

Article number

542

Collections

Institutional Repository

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Cyanobacterial blooms arise from a complex interplay of water quality, climate, and hydrological factors. This study presents machine learning models that predict the occurrence of these blooms efficiently and effectively. The dataset was divided into two, three, or four classes based on the cyanobacterial cell density one week ahead, which served as the target variable. We developed 96 machine learning models for Chilgok Weir using four classification algorithms: k-Nearest Neighbor, Decision Tree, Logistic Regression, and Support Vector Machine. In the modeling methodology, we first performed feature selection, which removes features irrelevant to the target variable, by applying ANOVA (Analysis of Variance) and resolving multicollinearity among the inputs. Next, we adopted an oversampling method to address the imbalance in the dataset. The best performance was achieved by models using the two-class datasets, with an accuracy of 80% or more. By comparison, models using the three-class datasets reached a lower accuracy of approximately 60%. Moreover, while models that used logCyano (logarithm of cyanobacterial cell density) as a feature showed high accuracy overall, several two-class models built in combination with air temperature and NO3-N (nitrate nitrogen) also exceeded 80% accuracy. We conclude that very accurate classification-based machine learning models can be developed with two features related to cyanobacterial blooms. This demonstrates that efficient and effective models are possible with a small number of inputs.
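The preprocessing steps named in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The threshold, the toy values, and all function names are hypothetical, and simple random oversampling stands in for the oversampling method, which the abstract does not specify. The sketch labels each sample by its cell density one week ahead, scores one feature with a one-way ANOVA F statistic, and balances the classes.

```python
import random
from statistics import mean


def label_by_density(density_next_week, thresholds):
    """Class index = number of thresholds the density meets or exceeds."""
    return sum(density_next_week >= t for t in thresholds)


def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across the classes."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority rows until each class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for r, y in zip(rows, labels):
        by_class.setdefault(y, []).append(r)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for r in members + extra:
            out_rows.append(r)
            out_labels.append(y)
    return out_rows, out_labels


# Made-up toy values, not data from the study.
densities = [120, 300, 80, 15000, 500, 22000, 90, 700]  # cells/mL, one week ahead
air_temp = [12.0, 15.5, 10.0, 27.0, 16.0, 29.5, 11.0, 18.0]  # degrees Celsius
THRESHOLDS = [1000]  # hypothetical cut-off giving two classes

labels = [label_by_density(d, THRESHOLDS) for d in densities]
f_stat = anova_f(air_temp, labels)  # a larger F suggests a more relevant feature
rows, balanced = random_oversample([[t] for t in air_temp], labels)
```

Passing two or three thresholds to `label_by_density` would yield the three-class and four-class variants. In practice, oversampling is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data.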

Files

Water_14_00542.pdf

(pdf | 29.3 Mb)