Dublin Core
Title
Machine Learning-Driven Prediction of Heart Strokes
Abstract
Heart strokes remain one of the leading health risks in the world today. Timely prediction can significantly improve patient outcomes and healthcare resource allocation. This study aims to harness machine learning techniques to develop efficient predictive models for the early detection of heart strokes.
The research is based on a dataset created by combining five different datasets. The combined dataset encompasses patient demographics, clinical measurements, and historical medical records. The analysis focused on five machine learning models: Logistic Regression, Decision Tree, K-Nearest Neighbors, Random Forest, and Support Vector Machine.
The goal was not only to test different algorithms, but also to understand how data preparation, feature selection, and model choice impact the final results. The models were trained and tested on both the original dataset and an extended version, where new features were added by combining existing ones.
The results showed that Logistic Regression, Decision Tree, and K-Nearest Neighbors (KNN) performed better on the original data. The Decision Tree model achieved an accuracy of 87.8% and an F1 score of 0.881, while Logistic Regression and KNN attained F1 scores of 0.850 and 0.849, respectively. In contrast, Random Forest and Support Vector Machine (SVM) showed significant improvements with the extended dataset. Random Forest performed best overall, with an F1 score of 0.920 and an accuracy of 91.6% on the extended dataset.
SVM also benefited from the extended dataset, improving its F1 score from 0.879 to 0.892, which highlights how some models can leverage additional features for improved generalization.
Predictive models of this kind could help detect stroke risk earlier, allowing for timely intervention and prevention, thereby reducing the burden of strokes on healthcare systems and improving patient care. Limitations include data quality and availability, as well as potential bias in healthcare records.
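For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how accuracy and the F1 score are derived from the counts of a binary confusion matrix. The function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration; they are not taken from the study's data or code.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives, and true negatives for the positive (stroke) class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, NOT results from the study.
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=90)
print(metrics)
```

Because F1 ignores true negatives, it can differ noticeably from accuracy when the classes are imbalanced, which is one reason to report both figures for each model.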
Keywords
heart stroke prediction, machine learning, Decision Tree Classifier (DTC), Random Forest Classifier (RFC), K-Nearest Neighbor (KNN), Logistic Regression (LR), Support Vector Machine (SVM), data, dataset
