Malignant cell classification across digitized images of 3-dimensionally fragmented fine needle aspirate (FNA) samples of a breast mass
Author
Oscar Cardec
Published
June 21, 2020
Introduction
Breast cancer is “the most common cause of cancer deaths among women worldwide”. In the United States, it is second only to lung cancer in cancer-related deaths, making it a critical national health issue. Statistics show breast cancer as the most frequently diagnosed cancer in women in 140 out of 184 countries, and survival and remission are closely linked with early detection and intervention (Henderson 2015).
Early signs of irregular cell growth are detected by sampling and analyzing nuclear changes and parameters with diagnostic tools. The results of these nuclear morphometry tests are evaluated for structural deviations, which are indicative of a cancer diagnosis (Narasimha, Vasavi, and Harendra Kumar 2013). Given the importance of accuracy in these evaluations, one may ask how the medical industry can use machine learning models such as neural networks to augment the diagnosis and judgement behind such vital medical assessments.
The following is a post-study report of numerous breast cancer preventive screenings, scrutinizing cell nuclei parameters in order to classify the specimens as either malignant or benign. The data have been previously categorized; the intent here is to employ a neural network methodology to replicate this categorization and measure the algorithm’s effectiveness in supporting medical professionals in the identification and early detection of breast carcinoma.
Data
The employed data, Breast Cancer Wisconsin - Diagnostic, comes from the UCI Machine Learning Repository. The multivariate set contains 569 instances and 32 attributes, as described below. As previously stated, the data set features quantitative observations representative of the images obtained by means of a fine needle aspirate (FNA). These digitized samples were studied, measured and recorded, ultimately enabling the classification of every instance as malignant or benign. Essentially, there are 10 real-valued features per cell; however, given the 3-dimensional fragmentation of each cell sample, this produces a total of 30 observations across 3 planes (William Wolberg 1993). A minimal loading sketch follows the variable list below.
Variables list across each plane
1 - ID Name :: Identification number of the sample
2 - Diagnosis :: Dependent variable label. M = Malignant, B = Benign
3 - Feature_1 Radius :: Mean of distances from center to points on the perimeter
4 - Feature_2 Texture :: Standard deviation of gray-scale values
5 - Feature_3 Perimeter :: Perimeter of the cell nucleus
6 - Feature_4 Area :: Area of the cell nucleus
7 - Feature_5 Smoothness :: Local variation in radius lengths
8 - Feature_6 Compactness :: Perimeter^2 / Area - 1.0
9 - Feature_7 Concavity :: Severity of concave portions of the contour
10 - Feature_8 Concave Points :: Number of concave portions of the contour
11 - Feature_9 Symmetry :: Symmetry of the cell nucleus
12 - Feature_10 Fractal Dimension :: "Coastline approximation" - 1
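For reference, a minimal sketch of how such a data frame could be assembled from the UCI file. The local file name wdbc.data and the Feature_1 through Feature_30 column names are assumptions that simply mirror the naming used throughout this report.

# load the raw UCI file (no header row); assumes a local copy named "wdbc.data"
df <- read.csv("wdbc.data", header = FALSE)

# first two columns are the sample ID and the diagnosis label (M/B);
# the remaining 30 columns are the real-valued features across the 3 planes
colnames(df) <- c("ID", "Diagnosis", paste0("Feature_", 1:30))

# the ID carries no predictive information, so it is dropped
df <- df[, -1]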
Density exploration of diagnosis across four specific variables: radius, texture, area, and concavity
library(ggplot2)    # density plots
library(gridExtra)  # grid.arrange / arrangeGrob

# Explore density
g1 <- ggplot(data = df) + theme_minimal() +
  geom_density(mapping = aes(Feature_1, fill = Diagnosis), col = "darkgrey",
               show.legend = FALSE, alpha = 0.5) +
  labs(title = "Density of Diagnosis per Attribute | Red = Benignant, Blue = Malignant",
       y = " ", x = "Radius")
g2 <- ggplot(data = df) + theme_minimal() +
  geom_density(mapping = aes(Feature_2, fill = Diagnosis), col = "darkgrey",
               show.legend = FALSE, alpha = 0.5) +
  labs(title = "", y = " ", x = "Texture")
g3 <- ggplot(data = df) + theme_minimal() +
  geom_density(mapping = aes(Feature_4, fill = Diagnosis), col = "darkgrey",
               show.legend = FALSE, alpha = 0.5) +
  labs(title = "", y = " ", x = "Area")
g4 <- ggplot(data = df) + theme_minimal() +
  geom_density(mapping = aes(Feature_7, fill = Diagnosis), col = "darkgrey",
               show.legend = FALSE, alpha = 0.5) +
  labs(title = "", y = " ", x = "Concavity")
grid.arrange(arrangeGrob(g1, g2, g3, g4), nrow = 1)
Key insight: these density plots are a useful alternative for illustrating continuous data points. The selected variables are just examples of what can quickly be plotted to identify potential relationships between a feature and the dependent variable.
Pre-Visualization of Data
Here I use a conditional inference tree to estimate relationships across the data and to see how it is recursively partitioned by the algorithm’s criteria. The main intent is to get an idea of what to expect regarding the classification of this data set.
library(dplyr)     # data manipulation
library(partykit)  # ctree and its plotting panels (attaches grid for gpar)

# mutate diagnosis character to numeric
df <- df |> mutate(Diagnosis = ifelse(Diagnosis == "M", 1, 0))

# CTREE model and plot
model <- ctree(Diagnosis ~ ., data = df)
plot(model, type = "extended", ep_args = list(justmin = 8),
     main = "Breast Cancer | Preliminary Analysis",
     drop_terminal = FALSE, tnex = 1.5,
     gp = gpar(fontsize = 12, col = "darkblue"),
     inner_panel = node_inner(model, fill = c("white", "green"), pval = TRUE),
     terminal_panel = node_barplot(model, fill = rev(c("darkred", "lightgrey")),
                                   beside = TRUE, ymax = 1.0, just = c(0.95, 0.5),
                                   ylines = TRUE, widths = 1.0, gap = 0.05,
                                   reverse = FALSE, id = TRUE))
Scatterplot Matrix (SPLOM)
This scatterplot matrix portrays the immediate correlations between the included variables and possible multicollinearity among them. The selected features were limited to the first plane only (items 1 to 10); the other two planes include similar features. My immediate take was to consider a method for dimensionality reduction, even when planning a neural network classification model.
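A comparable SPLOM can be drawn with GGally’s ggpairs; the call below is an illustrative sketch (the choice of ggpairs and of the first-plane features is an assumption on my part), not necessarily the code behind the original figure.

library(GGally)  # ggpairs; attaches ggplot2

# scatterplot matrix of the first-plane features, coloured by diagnosis
ggpairs(df,
        columns = paste0("Feature_", 1:10),              # first-plane features
        mapping = aes(colour = as.factor(Diagnosis)),
        upper   = list(continuous = wrap("cor", size = 3)))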
As defined, the purpose of dimensionality reduction is to find a method that can represent a given data set using a smaller number of features while still preserving the original data’s properties. There are different methods to accomplish this, for example LDA, PCA, t-SNE, and UMAP, but I ultimately decided to use PCA for feature extraction. The following steps illustrate how the method identifies the eigenvectors with the largest eigenvalues of the covariance matrix. As a result, I create a subset of the data using the variables with the greatest influence on variance.
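To make that step concrete, the eigen-decomposition underlying PCA can be reproduced directly from the covariance matrix of the standardized features; a minimal sketch:

# PCA "by hand": eigenvectors of the covariance matrix of the standardized features
X   <- scale(df[, -1])     # drop the Diagnosis column, centre and scale
eig <- eigen(cov(X))       # eigenvalues come back sorted in decreasing order

# proportion of total variance explained by each principal component
prop_var <- eig$values / sum(eig$values)
head(round(prop_var, 3))

# the leading eigenvectors (loadings) show which features drive the top components
head(eig$vectors[, 1:2])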
library(FactoMineR)  # PCA
library(factoextra)  # get_eigenvalue, get_pca_var, fviz_* helpers

# pca of original data
res.pca <- PCA(df, scale.unit = TRUE, graph = FALSE, ncp = 4)
eig.val <- get_eigenvalue(res.pca)
var     <- get_pca_var(res.pca)

# Color by cos2 values: quality on the factor map
fviz_pca_var(res.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)
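Before sub-selecting features, the eigenvalues and variable contributions returned above can be inspected to see how much variance the leading components capture and which features drive them. A short illustrative follow-up (how many components and how many top contributors to show are my assumptions):

# eigenvalues and cumulative variance explained
head(eig.val)

# scree plot of the leading components
fviz_screeplot(res.pca, addlabels = TRUE)

# variable contributions to the first two principal components
fviz_contrib(res.pca, choice = "var", axes = 1, top = 15)
fviz_contrib(res.pca, choice = "var", axes = 2, top = 15)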
Sub-selected Features
Here is my source code to create a subset based on the PCA, followed by conditioning the target variable as a factor for down-sampling purposes. The down-sampling approach ensures an equal number of target outcomes and avoids any model bias toward one class or the other. Once the down-sampling is complete, I scale and center the data to improve model performance.
library(caret)  # downSample

# selection of principal components
pdf <- df |> select(Diagnosis, Feature_1, Feature_2, Feature_3, Feature_4,
                    Feature_6, Feature_7, Feature_8, Feature_10, Feature_11,
                    Feature_13, Feature_14, Feature_16, Feature_18, Feature_20,
                    Feature_21, Feature_23, Feature_24, Feature_26, Feature_27,
                    Feature_28, Feature_30)

# mutating as a factor for downsampling
pdf <- pdf |> dplyr::mutate(Diagnosis = as.factor(Diagnosis))

# class definition
target <- "Diagnosis"

# downsampling
set.seed(12345)
downsampled_df <- downSample(x = pdf[, colnames(pdf) != target], y = pdf[[target]])
downsampled_df <- cbind(downsampled_df, downsampled_df$Class)
colnames(downsampled_df)[ncol(downsampled_df)] <- target

# subset, mutate, and scale
df2 <- pdf
df2 <- df2 |> mutate(Diagnosis = as.character(Diagnosis),
                     Diagnosis = as.numeric(Diagnosis))
df2[, -1] <- scale(df2[, -1])
Final summary statistics across the data set.
Data summary

Name                    df2
Number of rows          569
Number of columns       22
Column type frequency   numeric: 22
Group variables         None

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd     p0    p25    p50    p75   p100  hist
Diagnosis              0              1   0.37  0.48   0.00   0.00   0.00   1.00   1.00  ▇▁▁▁▅
Feature_1              0              1   0.00  1.00  -2.03  -0.69  -0.21   0.47   3.97  ▂▇▃▁▁
Feature_2              0              1   0.00  1.00  -2.23  -0.73  -0.10   0.58   4.65  ▃▇▃▁▁
Feature_3              0              1   0.00  1.00  -1.98  -0.69  -0.24   0.50   3.97  ▃▇▃▁▁
Feature_4              0              1   0.00  1.00  -1.45  -0.67  -0.29   0.36   5.25  ▇▃▂▁▁
Feature_6              0              1   0.00  1.00  -1.61  -0.75  -0.22   0.49   4.56  ▇▇▂▁▁
Feature_7              0              1   0.00  1.00  -1.11  -0.74  -0.34   0.53   4.24  ▇▃▂▁▁
Feature_8              0              1   0.00  1.00  -1.26  -0.74  -0.40   0.65   3.92  ▇▃▂▁▁
Feature_10             0              1   0.00  1.00  -1.82  -0.72  -0.18   0.47   4.91  ▆▇▂▁▁
Feature_11             0              1   0.00  1.00  -1.06  -0.62  -0.29   0.27   8.90  ▇▁▁▁▁
Feature_13             0              1   0.00  1.00  -1.04  -0.62  -0.29   0.24   9.45  ▇▁▁▁▁
Feature_14             0              1   0.00  1.00  -0.74  -0.49  -0.35   0.11  11.03  ▇▁▁▁▁
Feature_16             0              1   0.00  1.00  -1.30  -0.69  -0.28   0.39   6.14  ▇▃▁▁▁
Feature_18             0              1   0.00  1.00  -1.91  -0.67  -0.14   0.47   6.64  ▇▇▁▁▁
Feature_20             0              1   0.00  1.00  -1.10  -0.58  -0.23   0.29   9.84  ▇▁▁▁▁
Feature_21             0              1   0.00  1.00  -1.73  -0.67  -0.27   0.52   4.09  ▆▇▃▁▁
Feature_23             0              1   0.00  1.00  -1.69  -0.69  -0.29   0.54   4.28  ▇▇▃▁▁
Feature_24             0              1   0.00  1.00  -1.22  -0.64  -0.34   0.36   5.92  ▇▂▁▁▁
Feature_26             0              1   0.00  1.00  -1.44  -0.68  -0.27   0.54   5.11  ▇▅▁▁▁
Feature_27             0              1   0.00  1.00  -1.30  -0.76  -0.22   0.53   4.70  ▇▅▂▁▁
Feature_28             0              1   0.00  1.00  -1.74  -0.76  -0.22   0.71   2.68  ▅▇▅▃▁
Feature_30             0              1   0.00  1.00  -1.60  -0.69  -0.22   0.45   6.84  ▇▃▁▁▁
Visualization of variables as defined by the PCA across the first and second dimensions.
# pca of subset data
res.pca <- PCA(df2, scale.unit = FALSE, graph = FALSE, ncp = 4)
eig.val <- get_eigenvalue(res.pca)
var     <- get_pca_var(res.pca)

# Color by cos2 values: quality on the factor map
fviz_pca_var(res.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)
Training of the neural network uses the R library neuralnet. This powerful algorithm “is based on the resilient backpropagation without weight backtracking and additionally modifies one learning rate, either the learningrate associated with the smallest absolute gradient (sag) or the smallest learningrate (slr) itself. The learning rates in the grprop algorithm are limited to the boundaries defined in learningrate.limit.” (Fritsch, Guenther, and Wright 2019)
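As an illustration of the intended setup, a minimal neuralnet sketch follows; the 70/30 split, hidden-layer sizes, and 0.5 decision threshold are assumptions for demonstration, not the report’s tuned configuration.

library(neuralnet)

# illustrative 70/30 train/test split on the scaled subset
set.seed(12345)
idx   <- sample(seq_len(nrow(df2)), size = floor(0.7 * nrow(df2)))
train <- df2[idx, ]
test  <- df2[-idx, ]

# build the formula explicitly (older neuralnet versions do not expand "~ .")
predictors <- setdiff(names(train), "Diagnosis")
f <- as.formula(paste("Diagnosis ~", paste(predictors, collapse = " + ")))

# feed-forward network with two hidden layers (layer sizes are assumptions)
nn <- neuralnet(f, data = train, hidden = c(5, 3),
                linear.output = FALSE,   # classification, so keep the output activation
                stepmax = 1e6)

# predicted probabilities on the held-out set, thresholded at 0.5
pred <- compute(nn, test[, predictors])$net.result
table(Predicted = ifelse(pred > 0.5, 1, 0), Actual = test$Diagnosis)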
Narasimha, Aparna, B. Vasavi, and M. L. Harendra Kumar. 2013. “Significance of Nuclear Morphometry in Benign and Malignant Breast Aspirates.” International Journal of Applied and Basic Medical Research 3 (1): 22. https://doi.org/10.4103/2229-516x.112237.
William Wolberg, Olvi Mangasarian. 1993. “Breast Cancer Wisconsin (Diagnostic).” UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B.