A Comparison Between Naïve Bayes and Random Forest to Predict Breast Cancer

Accurate diagnosis of breast cancer is highly beneficial, as breast cancer is the second-leading cause of cancer death among women in the US, after lung cancer. This study compares two machine learning approaches to diagnosing breast cancer using a publicly available dataset comprising features computed from a digitized image of a fine needle aspirate (FNA). We employ two machine learning techniques, namely Naïve Bayes and Random Forest, and measure the accuracy of the diagnosis. The two classifiers are implemented using information from 569 patients and 31 features. According to the findings, the Random Forest classifier performed better than the Naïve Bayes method, reaching an accuracy of 97.82%. Furthermore, classification accuracy can be improved with appropriate use of a feature selection technique. The manuscript also explains the feature selection technique used in the study, discusses the analysis procedure, and describes the dataset and the performance indicators.


Introduction
Breast cancer is considered the most common type of cancer among women worldwide (World cancer report, 2008). Furthermore, it is estimated that 23 out of 124 women will die of breast cancer annually (Cancer Statistics Review, 2012). Although mammography, Fine Needle Aspiration (FNA), and surgical biopsy are the main techniques for diagnosing breast cancer, FNA is considered the most important diagnostic technique for detecting breast cancer in its early stages (Fiuzy et al., 2012). Further studies of FNA can be found in Fiuzy et al. (2012) and Saxena and Burse (2012). In this study, we aim to use machine learning techniques to diagnose breast cancer and to assess the accuracy of the diagnosis, using the Breast Cancer Wisconsin Data (Dua and Graff, 2019), which were collected using the FNA technique. Machine learning is a branch of artificial intelligence and a method of data analysis that automates the model building process. Easy identification of patterns in the data and the ability of models to improve over time are two of the main advantages of machine learning over traditional data classification techniques.
This study aims to classify subjects, based on the characteristics of their breast biopsies, into one of two groups indicating whether or not the subject has cancer. According to the literature, Naïve Bayes and Random Forest are two popular machine learning techniques. Therefore, in this study we employ both techniques to classify the above data. The rest of this manuscript is organized as follows. Section 2 discusses the two data classification algorithms, followed by the data analysis in Section 3. In Section 4, the results are discussed, and Section 5 concludes the manuscript.

Methodology
In this manuscript, we implemented two machine learning techniques to classify data: Naïve Bayes (NB) and Random Forest (RF).
Consider the n-dimensional feature set $X = (x_1, x_2, \dots, x_n)$ and let $Y = (y_1, y_2)$ be a 2-dimensional vector (classes). In this study, $n$ is the number of features.

Naïve Bayes (NB)
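NB is considered a simple and accurate classification algorithm. Owing to the flexibility of the algorithm, it has a wide range of applications (Arar & Ayan, 2017). By Bayes' theorem, and under the assumption that the features are conditionally independent given the class, the posterior probability of a class $y$ satisfies $P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$. Therefore, our aim is to find $y$ such that the above expression is maximized. This means we need to find $y$, which is $\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$.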
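As an illustration of the maximization described above, the following is a minimal sketch of a Naïve Bayes classifier. It assumes Gaussian class-conditional densities for the numerical features, which the manuscript does not specify. The function names and the toy data are hypothetical and are not taken from the study.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance positive (an assumption of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class y maximizing log P(y) + sum_i log P(x_i | y)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for value, mean, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Toy data invented for illustration: two features, classes "B" and "M".
X_toy = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 4.0], [3.2, 3.8], [2.8, 4.2]]
y_toy = ["B", "B", "B", "M", "M", "M"]
nb_model = fit_gaussian_nb(X_toy, y_toy)
pred_low = predict_gaussian_nb(nb_model, [1.1, 2.1])
pred_high = predict_gaussian_nb(nb_model, [3.1, 3.9])
```

Working with log-probabilities rather than the raw product avoids numerical underflow when the number of features is large, and it does not change which class attains the maximum.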

Random Forest (RF)
Random Forest extends the popular decision tree algorithm by combining a large number of decision trees. This approach aims to reduce the variance of a single decision tree (Couronné, 2018). Each decision tree is constructed by selecting a random collection of variables (features). The resulting collection of random trees is called a Random Forest, or RF for short. RF is considered one of the most accurate classification algorithms (Breiman, 2001; Biau and Scornet, 2016). Another characteristic of RF is its suitability for unbalanced and missing data (Shah et al., 2014) compared with alternative techniques. Further experimental and theoretical work on RF can be found in Bernard et al. (2007) and Breiman (2001).
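The idea of combining many randomized trees can be sketched as follows. This is a deliberately simplified illustration, not the implementation used in the study: each tree is a depth-1 tree (a stump) trained on a bootstrap sample with a random subset of features, and the forest predicts by majority vote. All function names, parameter values, and the toy data are assumptions made for this sketch.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, feature_ids):
    """Choose the (feature, threshold) split with the lowest weighted Gini impurity."""
    best = None  # (impurity, feature, threshold, left_label, right_label)
    for f in feature_ids:
        for threshold in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or impurity < best[0]:
                best = (impurity, f, threshold, majority(left), majority(right))
    if best is None:  # no valid split exists: always predict the majority class
        label = majority(y)
        return (feature_ids[0], float("inf"), label, label)
    return best[1:]

def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Train each stump on a bootstrap sample using a random subset of features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # sample rows with replacement
        feature_ids = rng.sample(range(len(X[0])), n_features)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature_ids))
    return forest

def predict_forest(forest, x):
    """Majority vote over the predictions of all trees."""
    votes = [left if x[f] <= t else right for f, t, left, right in forest]
    return majority(votes)

# Toy data invented for illustration: two features, classes "B" and "M".
X_toy = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 4.0], [3.2, 3.8], [2.8, 4.2]]
y_toy = ["B", "B", "B", "M", "M", "M"]
forest = fit_forest(X_toy, y_toy)
vote_low = predict_forest(forest, [0.5, 1.5])
vote_high = predict_forest(forest, [3.5, 4.5])
```

Averaging the votes of trees trained on different bootstrap samples and feature subsets is what reduces the variance relative to a single tree. A practical forest would grow deeper trees than the stumps used here.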

Feature Selection Technique
To improve the classification accuracy, a feature selection technique was applied. Its purpose is to reduce the dimension of the dataset. In other words, instead of considering all the variables (31 features) of the data, it attempts to retain only the most important features that affect the classification. Here we selected a feature selection technique called Recursive Feature Elimination (RFE), which removes the least significant features from the data until a pre-specified number of features remains. RFE is easy to configure and use, and it is effective at selecting features that have a significant relationship with the target variable. With RFE, insignificant features are eliminated recursively, using the dependency and the correlation of the variables in the dataset.
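The recursive elimination loop can be sketched as below. The manuscript does not state which estimator ranks the features, so this sketch assumes a small logistic regression fitted by gradient descent and uses the absolute weight of each feature as its importance. The function names, the step count, the learning rate, and the toy data are all hypothetical.

```python
import math
from itertools import product

def fit_logistic(X, y, steps=200, lr=0.5):
    """Fit logistic regression by batch gradient descent and return the feature weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        residuals = []
        for row, target in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            residuals.append(1.0 / (1.0 + math.exp(-z)) - target)
        for j in range(d):
            w[j] -= lr * sum(r * row[j] for r, row in zip(residuals, X)) / n
        b -= lr * sum(residuals) / n
    return w

def recursive_feature_elimination(X, y, n_select):
    """Refit the model and drop the weakest remaining feature until n_select remain."""
    active = list(range(len(X[0])))
    while len(active) > n_select:
        subset = [[row[j] for j in active] for row in X]
        weights = fit_logistic(subset, y)
        weakest = min(range(len(active)), key=lambda k: abs(weights[k]))
        del active[weakest]
    return active

# Toy data invented for illustration: all eight combinations of three +/-1 features,
# where only feature 0 determines the class.
X_toy = [list(map(float, combo)) for combo in product([-1, 1], repeat=3)]
y_toy = [1 if row[0] > 0 else 0 for row in X_toy]
selected = recursive_feature_elimination(X_toy, y_toy, n_select=1)
```

Because the model is refitted after every removal, the importance of each remaining feature is re-evaluated in the context of the features that are still present, which distinguishes RFE from a one-shot ranking.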

Dataset
This study was carried out on the Breast Cancer Wisconsin dataset, which was obtained from a publicly available source. The dataset consists of data on 569 patients with 31 features. The class variable is the diagnosis of the breast tissue [Benign, Malignant]. The remaining features were computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe characteristics such as radius (mean of distances from center to points on the perimeter), texture (standard deviation of grayscale values), perimeter, area, smoothness (local variation in radius lengths), compactness, concavity (severity of concave portions of the contour), concave points (number of concave portions of the contour), symmetry, and fractal dimension.

Quantifying the Performance
To quantify the accuracy of classification, we use sensitivity, specificity, and precision as the performance indicators. All of these indicators are based on the confusion matrix, which cross-tabulates the actual and the predicted classes: sensitivity $= TP/|P|$, specificity $= TN/|N|$, and precision $= TP/(TP + FP)$. Here, TP denotes the true positives (true instances predicted correctly), FP the false positives (false instances predicted as true), TN the true negatives (false instances predicted correctly), |P| the total number of true instances, and |N| the total number of false instances in the testing sample. Furthermore, we define the F-measure as the harmonic mean of the precision and the sensitivity, $F = 2 \cdot \text{precision} \cdot \text{sensitivity} / (\text{precision} + \text{sensitivity})$. Using several performance indicators gives more insight into the accuracy of the prediction. For instance, precision is the proportion of relevant information out of all the retrieved information, which is a valuable indicator in almost all applications. Sensitivity measures the true-positive recognition rate. It is particularly useful where correctly identifying positives is very important, such as in security checking. In contrast, specificity measures the true-negative recognition rate, and it is useful in areas such as diagnosing health conditions prior to treatment.
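The indicators described above can be computed directly from the four cells of the confusion matrix, as in the following minimal sketch. Here FN denotes the false negatives (true instances predicted as false), so that the number of true instances is TP + FN and the number of false instances is TN + FP. The function name and the counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the performance indicators from confusion-matrix counts."""
    positives = tp + fn  # number of actual positive (true) instances
    negatives = tn + fp  # number of actual negative (false) instances
    sensitivity = tp / positives
    specificity = tn / negatives
    precision = tp / (tp + fp)
    # Harmonic mean of precision and sensitivity.
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (positives + negatives)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }

# Hypothetical counts chosen for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these counts, the precision is 40/50 = 0.8, the sensitivity is 40/45, and the accuracy is 85/100 = 0.85. A single accuracy figure can therefore hide differences between the error rates on the two classes.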

Results
According to Tables 1 and 2, both algorithms perform better with the introduction of the feature selection algorithm. Although specificity and precision for NB move in the opposite direction after feature selection is introduced, RF shows significant improvement with feature selection. Comparing the two classification algorithms, it is clear that RF outperforms NB.

Discussion
At present, machine learning has made an impact in most fields, and its influence on the healthcare industry is immense. The skills that healthcare professionals such as physicians can gain from machine learning have taken healthcare to new heights. In this manuscript, we aimed to use machine learning to diagnose cancer based on the characteristics of a breast biopsy taken using the Fine Needle Aspiration (FNA) technique. After employing two machine learning algorithms, Naïve Bayes and Random Forest, we found that both techniques can classify cancer patients effectively. Of the two, Random Forest outperformed Naïve Bayes, and the prediction accuracy can be further improved with an appropriate selection of the feature set. According to the experimental data, the highest accuracy of 97.82% was reached with Random Forest by selecting only 17 features from a total of 31. In a future study, it will be important to consider other parameters of the classification techniques to improve the classification accuracy further. Furthermore, these techniques can be used effectively in other data classification applications.
Furthermore, when choosing a machine learning algorithm for data classification, one needs to consider the execution time and the simplicity of the model. The NB model is very simple, executes quickly, carries a lower risk of overfitting the data, and achieves higher accuracy with categorical data than with numerical data. Unfortunately, issues with probability calculations and the assumption of independence among predictors are some of the limitations of the NB model. On the other hand, RF has less variability in its predictions because it combines multiple trees, and it can handle a higher volume of data effectively. Two notable drawbacks of RF are the complexity of the model and the higher possibility of overfitting. It is recommended to tune its hyperparameters to minimize the impact of overfitting.