Effective Primary Healthcare Differential Diagnosis: A Machine Learning Approach

Abstract

Primary health care facilities are usually the first point of contact for patients seeking medical help. However, misdiagnosis at this stage of the clinical encounter is still quite prevalent. Misdiagnosis can be harmful to the patient, and even when it is not, it increases both the financial cost of arriving at the correct diagnosis, borne by the patient, and the pressure on the capacity of the medical system. The focus of this thesis is an evaluation of machine learning models that can produce a differential diagnosis of possible patient conditions from presented symptoms. In this project, a systematic approach to the acquisition and generation of data relevant to the task is presented. This approach sidesteps one of the major barriers to the application of artificial intelligence methods in the health care domain: access to data. With a generated dataset of approximately 5 million records, containing 801 conditions and 376 symptoms, three machine learning models, namely Naive Bayes, Random Forest, and a Multilayer Perceptron (MLP), are evaluated and compared on the generated data using accuracy, precision, and top-5 accuracy as evaluation metrics. The Naive Bayes model achieves 58.8% accuracy, 63.3% precision, and 85.3% top-5 accuracy. The Random Forest achieves 57.1% accuracy, with a precision of 61.2% and a top-5 accuracy of 84.5%. The MLP model performs similarly to Naive Bayes, with an accuracy of 58.8%, a precision of 63.0%, and a top-5 accuracy of 85.5%. The number of symptoms expressed per condition was shown to have a strong effect on the achieved metric scores. When evaluated on a generated dataset with at least 5 symptoms per condition, the accuracy lay between 80.2% and 83.6%, the precision between 84.2% and 87.6%, and the top-5 accuracy between 95.7% and 96.6% across all evaluated models. For a better understanding of the potential efficacy of these models in a real-world setting, a number of possible real-world scenarios are proposed, new datasets are generated based on these scenarios, and the trained models are evaluated on them. It is shown that model performance is closely related to the relevance and number of observed symptoms for each condition: a higher number of symptoms expressed per condition results in higher model performance. It is also shown that model performance degrades considerably when the new datasets differ substantially from the original generated data. The models perform especially poorly when symptoms not usually associated with a condition are presented, even when the presentation probability of those symptoms is low.
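As a rough illustration of the evaluation setup described above, the sketch below trains the three named model families and reports accuracy, precision, and top-5 accuracy with scikit-learn. Only the model types and metrics come from the abstract; the data shapes, hyperparameters, and random stand-in data are hypothetical placeholders, not the thesis's actual dataset or configuration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, top_k_accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: each row is a multi-hot symptom vector and each
# label is a condition index. The thesis uses 376 symptoms and 801
# conditions over ~5M records; a small random stand-in is used here.
rng = np.random.default_rng(0)
n_samples, n_symptoms, n_conditions = 2000, 30, 10
X = rng.integers(0, 2, size=(n_samples, n_symptoms))
y = rng.integers(0, n_conditions, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Naive Bayes": BernoulliNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # predict_proba gives per-condition scores, which is what a
    # differential diagnosis ranking (and top-5 accuracy) needs.
    proba = model.predict_proba(X_test)
    print(
        f"{name}: "
        f"acc={accuracy_score(y_test, y_pred):.3f}  "
        f"prec={precision_score(y_test, y_pred, average='weighted', zero_division=0):.3f}  "
        f"top5={top_k_accuracy_score(y_test, proba, k=5, labels=model.classes_):.3f}"
    )
```

Top-5 accuracy is the natural headline metric for this task: a differential diagnosis is a ranked shortlist, so the model is credited whenever the true condition appears among its five highest-scoring candidates rather than only when it is ranked first.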