Supervised learning classifications using the R package partykit conditional inference trees
Author
Oscar Cardec
Published
October 22, 2020
Introduction
Cardiotocograms (CTGs) have long been instrumental in clinical medicine. Obstetricians use these measurements and classifications to obtain detailed information about the fetus and the mother before and during labor. In 2018, an article published in the Journal of Clinical Medicine detailed the practicality of CTG. The same article noted that the interpretation of these sensor readings depends mainly on the observer, which makes consistent interpretation difficult and pushes the limits of what the naked eye can detect. What happens if the interpreter misses a key detail? What could a particular combination of diagnostic signals mean? What time-sensitive conditions might these measurements expose that require immediate action? These are a few examples of the concerns raised by the continued practice of purely visual assessment of a CTG (Zhao, Zhang, and Deng 2018).
The following exploration assesses CTGs using the conditional inference tree (ctree) model. It aims to show how the algorithm can expedite and enhance the interpretation of CTG readings while appraising multiple fetal measurements simultaneously. The study also aims to identify potential hidden patterns that may require further attention.
Data
The analyzed data comes from the UCI Machine Learning Repository (D. Campos 2000) and consists of measurements of fetal heart rate (FHR) and other important characteristics identified and recorded within each cardiotocogram. All CTGs were classified by three subject matter experts, who by unanimous agreement assigned response labels based on the fetal state and/or the detected morphological patterns. The following list gives the meaning of each variable according to the UCI repository:
LB - FHR baseline (beats per minute)
AC - # of accelerations per second
FM - # of fetal movements per second
UC - # of uterine contractions per second
DL - # of light decelerations per second
DS - # of severe decelerations per second
DP - # of prolonged decelerations per second
ASTV - percentage of time with abnormal short term variability
MSTV - mean value of short term variability
ALTV - percentage of time with abnormal long term variability
MLTV - mean value of long term variability
Width - width of FHR histogram
Min - minimum of FHR histogram
Max - maximum of FHR histogram
Nmax - # of histogram peaks
Nzeros - # of histogram zeros
Mode - histogram mode
Mean - histogram mean
Median - histogram median
Variance - histogram variance
Tendency - histogram tendency
CLASS - FHR pattern class code (1 to 10)
NSP - fetal state class code (N=normal; S=suspect; P=pathologic)
Exploratory Data Analysis
Exploratory data analysis confirms that the data contains 2126 observations and 23 variables. The following is a preview of the first six observations after the data was ingested with as_tibble.
The following code chunks provide a basic assessment of specific attributes and areas of importance, such as the variability of observations, the presence of missing values, the mean, and the standard deviation.
# How much variability does the main predictor show?
lbx <- IQR(df$LB)
summary(df$LB)
Min. 1st Qu. Median Mean 3rd Qu. Max.
106.0 126.0 133.0 133.3 140.0 160.0
Note: the IQR of the LB attribute equals 14, which is small relative to the overall range of 54 (106 to 160), indicating that most values are clustered around the middle. The following histogram confirms the small IQR.
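As a quick check on the note above, the IQR can be recomputed from the quartiles printed in the summary output. This is a minimal sketch that uses only the printed values rather than the dataset itself; the variable names are illustrative.

```r
# Quartiles and extremes of LB, copied from the summary() output above
lb_summary <- c(min = 106, q1 = 126, median = 133, q3 = 140, max = 160)

# The IQR is the distance between the third and first quartiles
lb_iqr <- unname(lb_summary["q3"] - lb_summary["q1"])

# Full range, for comparison
lb_range <- unname(lb_summary["max"] - lb_summary["min"])

# Share of the full range covered by the middle 50% of observations
iqr_share <- lb_iqr / lb_range

print(lb_iqr)
print(lb_range)
print(round(iqr_share, 2))
```

In other words, the middle half of the observations spans roughly a quarter of the full range of LB.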
hist(df$LB,
     breaks = 12,
     main = "Histogram of FHR Baseline",
     xlab = "(beats per minute)",
     border = "darkblue",
     col = "lightgrey",
     labels = FALSE)
# Are there any missing values present?
colSums(is.na(df))
LB AC FM UC DL DS DP ASTV
0 0 0 0 0 0 0 0
MSTV ALTV MLTV Width Min Max Nmax Nzeros
0 0 0 0 0 0 0 0
Mode Mean Median Variance Tendency CLASS NSP
0 0 0 0 0 0 0
One Sample t-test
data: df$LB
t = 624.59, df = 2125, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
132.8853 133.7224
sample estimates:
mean of x
133.3039
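The output above is consistent with a one-sample t-test of LB against a mean of zero, for example a call such as t.test(df$LB); the call itself is not shown, so that is an assumption. The reported numbers are enough to reconstruct the standard error, the sample standard deviation, and the confidence interval, which the following sketch does using only the values printed above.

```r
# Values copied from the t-test output above
x_bar  <- 133.3039   # sample mean
t_stat <- 624.59     # t statistic for H0: mean = 0
dof    <- 2125       # degrees of freedom, so n = dof + 1
n      <- dof + 1

# t = (x_bar - 0) / se, so the standard error follows directly
se <- x_bar / t_stat

# se = s / sqrt(n), so the sample standard deviation is
s <- se * sqrt(n)

# 95% confidence interval: x_bar +/- critical t value * se
t_crit <- qt(0.975, df = dof)
ci <- c(lower = x_bar - t_crit * se, upper = x_bar + t_crit * se)

print(round(s, 2))
print(round(ci, 3))
```

The recovered standard deviation of roughly 9.84 agrees with the 2-s.d. cut-offs of 113.62 and 152.99 used later in the analysis.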
# Very first graph representation with manual boundary calculations
# NOTE: m, std, upr, and lwr are not defined in the code shown here; the four
# definitions below are assumed (mean, standard deviation, and 1-s.d. bounds).
m <- mean(df$LB)
std <- sd(df$LB)
upr <- m + std
lwr <- m - std
upr2 <- m + (std * 2)
lwr2 <- m - (std * 2)

# Plot LB distribution boundaries
plot.new()
plot(df$LB, type = "l", col = "grey51", ylab = "LB", main = "1 & 2 Standard Deviations")
abline(h = m, col = "blue")
abline(h = upr, col = "orange", lty = 2)
abline(h = lwr, col = "orange", lty = 2)
abline(h = upr2, col = "red", lty = 2)
abline(h = lwr2, col = "red", lty = 2)
text(-65, 134, "mean:133.30", col = "blue", adj = c(0, -.1))
text(-65, upr, round(upr, 2), col = "black", adj = c(0, -.1))
text(-65, lwr, round(lwr, 2), col = "black", adj = c(0, -.1))
text(-65, upr2, round(upr2, 2), col = "black", adj = c(0, -.1))
text(-65, lwr2, round(lwr2, 2), col = "black", adj = c(0, -.1))
# LB observations higher than 2 s.d.
lba <- sum(df$LB > 152.99)  # 39
# LB observations lower than 2 s.d.
lbb <- sum(df$LB < 113.62)  # 44
lba + lbb  # = 83 obs outside of 2 s.d.
[1] 83
# Proportion of obs within 2 s.d. (between() comes from dplyr)
sum(between(df$LB, 113.62, 152.99)) / nrow(df)
[1] 0.9609595
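For context, the observed share can be compared with what a normal distribution would predict for a 2-s.d. band. The sketch below uses only the counts reported above plus base R's pnorm; whether LB is actually normally distributed is not established here, so the comparison is indicative only.

```r
n_total   <- 2126      # observations in the dataset
n_outside <- 39 + 44   # above and below the 2-s.d. cut-offs, as reported above

# Share of observations inside the 2-s.d. band
observed_within <- (n_total - n_outside) / n_total

# Share a normal distribution places within 2 standard deviations of the mean
expected_within <- pnorm(2) - pnorm(-2)

print(round(observed_within, 4))
print(round(expected_within, 4))
```

The observed share (about 96.1%) sits slightly above the normal-theory value (about 95.4%), which suggests that LB has somewhat lighter tails than a normal distribution over this band.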
# Exclude non-original measurements, rename targeted values
df[12:22] <- NULL
df$NSP <- as.numeric(df$NSP)
# Enumeration of labels with the factor function
df$NSP <- factor(df$NSP, levels = 1:3, labels = c("Normal", "Suspect", "Pathologic"))
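# Visualization of original NSP
nsp_counts <- table(df$NSP)
# barplot() returns the bar midpoints, which are needed to place the count labels;
# ylim leaves headroom so the label above the tallest bar is not clipped
bp <- barplot(nsp_counts,
              main = "Original NSP Distribution",
              xlab = "Fetal State Classification",
              ylab = "Frequency",
              col = c(3, 7, 2),
              ylim = c(0, max(nsp_counts) * 1.1))
text(bp, nsp_counts, labels = as.character(nsp_counts), pos = 3)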
# Additional way to preview the distribution of attributes
df[, 1:12] %>%
  gather() %>%
  ggplot(aes(value)) +
  theme_light() +
  labs(title = "FHR Measurement Distributions") +
  theme(axis.text.x = element_text(angle = 90)) +
  facet_wrap(~key, scales = "free", shrink = TRUE) +
  geom_bar(mapping = aes(value), color = "darkblue", fill = "lightgrey")
In progress …
# Summary of DF after encoding the label vector as numbers.
summary(df)
LB AC FM UC
Min. :106.0 Min. :0.000000 Min. :0.000000 Min. :0.000000
1st Qu.:126.0 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.002000
Median :133.0 Median :0.002000 Median :0.000000 Median :0.004000
Mean :133.3 Mean :0.003178 Mean :0.009481 Mean :0.004366
3rd Qu.:140.0 3rd Qu.:0.006000 3rd Qu.:0.003000 3rd Qu.:0.007000
Max. :160.0 Max. :0.019000 Max. :0.481000 Max. :0.015000
DL DS DP ASTV
Min. :0.000000 Min. :0.000e+00 Min. :0.0000000 Min. :12.00
1st Qu.:0.000000 1st Qu.:0.000e+00 1st Qu.:0.0000000 1st Qu.:32.00
Median :0.000000 Median :0.000e+00 Median :0.0000000 Median :49.00
Mean :0.001889 Mean :3.293e-06 Mean :0.0001585 Mean :46.99
3rd Qu.:0.003000 3rd Qu.:0.000e+00 3rd Qu.:0.0000000 3rd Qu.:61.00
Max. :0.015000 Max. :1.000e-03 Max. :0.0050000 Max. :87.00
MSTV ALTV MLTV NSP
Min. :0.200 Min. : 0.000 Min. : 0.000 Normal :1655
1st Qu.:0.700 1st Qu.: 0.000 1st Qu.: 4.600 Suspect : 295
Median :1.200 Median : 0.000 Median : 7.400 Pathologic: 176
Mean :1.333 Mean : 9.847 Mean : 8.188
3rd Qu.:1.700 3rd Qu.:11.000 3rd Qu.:10.800
Max. :7.000 Max. :91.000 Max. :50.700
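The NSP counts in the summary above show that the three classes are far from balanced. A minimal sketch, using only those printed counts, converts them to percentages; the variable names are illustrative.

```r
# NSP class counts, copied from the summary() output above
nsp_counts <- c(Normal = 1655, Suspect = 295, Pathologic = 176)

# Convert counts to shares of the full dataset
nsp_share <- nsp_counts / sum(nsp_counts)

print(sum(nsp_counts))
print(round(100 * nsp_share, 1))
```

A model that always predicted Normal would already be right about 78% of the time, so overall accuracy alone is a weak yardstick for a classifier on this data.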
# Split the data into training and test sets
set.seed(1234)
ind <- sample(2, nrow(df), replace = TRUE, prob = c(0.70, 0.30))
train.data <- df[ind == 1, ]
test.data <- df[ind == 2, ]
# Run the method on the training data
myFormula <- NSP ~ .
model <- ctree(myFormula, data = train.data)
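The split above assigns each row independently to group 1 or 2 with probabilities 0.70 and 0.30, so the resulting set sizes are only approximately 70/30. The sketch below reproduces that assignment on row indices alone, assuming the same seed and the 2126-row count reported earlier; it needs neither the dataset nor partykit, and the exact sizes may differ across R versions with different sampling defaults.

```r
set.seed(1234)
n_rows <- 2126  # number of observations reported in the exploratory analysis

# Draw a group label (1 = training, 2 = test) for every row
ind <- sample(2, n_rows, replace = TRUE, prob = c(0.70, 0.30))

# Realized group sizes and training share
split_sizes <- table(ind)
train_share <- mean(ind == 1)

print(split_sizes)
print(round(train_share, 3))
```

If an exact or class-stratified split matters, sampling row indices within each NSP level is one possible alternative.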
Zhao, Zhidong, Yang Zhang, and Yanjun Deng. 2018. “A Comprehensive Feature Analysis of the Fetal Heart Rate Signal for the Intelligent Assessment of Fetal State.” Journal of Clinical Medicine 7 (8): 223. https://doi.org/10.3390/jcm7080223.