Assessment of different classification algorithms and their respective results
Author
Oscar Cardec
Published
November 12, 2021
Introduction
On any given day, thousands of flights maneuver throughout the U.S. national airspace. All of them are constantly monitored by one entity, the Federal Aviation Administration (FAA). The FAA’s primary mission is to ensure that flight operations are conducted efficiently and to the highest levels of safety and security. In support of this endeavor, the continuous monitoring and accurate prediction of an aircraft’s position is a vital process across aeronautics and the FAA’s mission. Accurate flight forecasting can have a significant impact on business schedules, transportation logistics, and even environmental protection. In today’s era of big data and technological advances, monitoring en-route flights is an imperative.
Disclaimer: The views and opinions expressed in this report are those of the author and do not necessarily reflect the views or positions of any of the entities referred to herein.
The following assessment builds on a previously conducted analysis (Paglione et al. 2010), which documented a comprehensive evaluation of lateral deviations across numerous aircraft. For context, a lateral deviation measures how far an aircraft’s actual position diverges from its authorized flight route. Here I assess and identify alternative options for sustaining air operations management, using modern machine learning algorithms to detect lateral anomalies in aircraft tracks. The assessment employs up-to-date statistical analyses, compares results with the previous findings, and introduces a more sophisticated approach to improving the tracking of civil and military aviation on a near real-time basis.
Data
To accomplish this, historical data from numerous flights is utilized. The data comprises continuous and categorical observations, including the aircraft’s altitude, measurement times, calculated distances from the targeted route, lateral and vertical statuses, and suggested corrective heading, among others.
The original data encompasses 20 control centers within the Continental United States, averaging around 500,000 observations per center. That aggregates to over 10,000,000 measurements nationwide within less than a 24-hour window. Analyzing such a volume is costly in computational power and time. For that reason, I take a sampled-data approach, assuming that the sample is representative of the entire population and that, statistically speaking, inferences may be generalized to it. The following diagram provides a basic depiction of the variables involved.
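The per-center sampling step itself is not shown in the original code. The following is a minimal sketch of how such a sample could be drawn, where `rawdf` is a small simulated stand-in for the ~10-million-row national data set (the 10% proportion and the seed are illustrative assumptions):

```r
# Hypothetical sketch: draw the same proportion of rows from every
# control center so inferences generalize across centers.
# `rawdf` here is a toy stand-in for the full national data set.
rawdf <- data.frame(
  artcc  = rep(c("ZAU", "ZNY", "ZMA", "ZLA"), each = 1000),
  redLat = rnorm(4000)
)

set.seed(42)  # reproducible sampling
idx <- unlist(lapply(split(seq_len(nrow(rawdf)), rawdf$artcc),
                     function(i) sample(i, round(0.10 * length(i)))))
sampled <- rawdf[idx, ]

table(sampled$artcc)  # 100 observations per center
```

Stratifying by center, rather than sampling the pooled data, preserves each ARTCC's share of observations in the sample.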
Exploratory Data Analysis
As mentioned, the sampled data contains approximately 1.9 million flight observations from 4 specific Air Route Traffic Control Centers (ARTCC), namely Chicago (ZAU), New York (ZNY), Miami (ZMA), and Los Angeles (ZLA). These observations capture the attributes of an aircraft while cruising from one fix point or ARTCC to another, with data recorded in 10-second increments.
Note: During the exploratory steps the data is ingested and analyzed from a descriptive-statistics standpoint. The 14 variables (of different format types) are confirmed, along with the total of 1.9 million observations. Notice that the “buildTime” variable is given as cumulative seconds and that “latAdherStatus” is a character type. The “latAdherStatus” (lateral adherence status) variable describes how distant the aircraft is from its authorized route (threshold recorded in nautical miles); this multi-level categorical attribute is used as the target, or dependent, variable.
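A hedged sketch of this ingestion-and-inspection step follows; the temporary CSV written below is a tiny toy stand-in, not the original source file, and only illustrates the kind of checks applied:

```r
# Toy stand-in for the real source file (2 rows, 3 of the 14 variables)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(acid = c("SWA123", "AAL456"),
                     buildTime = c(61210, 61220),
                     latAdherStatus = c("innerInConf", "midNonConf")),
          tmp, row.names = FALSE)

rawdf <- read.csv(tmp, stringsAsFactors = FALSE)

str(rawdf)      # confirm variable names, types, and observation count
summary(rawdf)  # descriptive statistics per column
```

On the real data, `str()` confirms the 14 variables and 1.9 million observations, and exposes `buildTime` as numeric cumulative seconds and `latAdherStatus` as character.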
Note: After gaining a basic understanding of the data set’s structure, I selected a particular set of variables, renamed them for easier understanding, converted the categorical labels to numeric factor levels for efficient computation, and defined a transformation function to convert time values from numeric to HMS format.
# data selection
mdata <- rawdf %>%
  as_tibble() %>%
  dplyr::select(acid, cid, buildTime, stCenterTm, endCenterTm, redLat,
                angle2NextFix, latAdherStatus, artcc)

# factorization of lateral adherence status
levels <- c("innerInConf", "midInConf", "midNonConf", "outerNonConf", "endOfRoute")
labels <- c("1", "2", "3", "4", "5")
mdata$latAdherStatus <- factor(mdata$latAdherStatus, levels = levels, labels = labels)

# variable renaming
maindf <- dplyr::rename(.data = mdata,
                        AircraftID = acid,
                        ComputerID = cid,
                        MeasureTaken = buildTime,
                        ControlStartTm = stCenterTm,
                        ControlEndTm = endCenterTm,
                        LateralDeviation = redLat,
                        CorrectionAngle = angle2NextFix,
                        LateralStatus = latAdherStatus,
                        ARTCC = artcc)

# time conversion
pacman::p_load(lubridate, hms)
timeconvert <- function(x) {
  # transform time period from cumulative seconds to hms format
  mt <- seconds_to_period(x)
  mtstring <- sprintf('%02d:%02d:%02d', mt@hour, minute(mt), second(mt))
  hms::as_hms(mtstring)
}

# transformations with dplyr::mutate
df <- maindf %>%
  dplyr::mutate(ARTCC = as.factor(ARTCC),
                UniqueID = paste(AircraftID, ComputerID, sep = "_"),
                Airline = substring(AircraftID, 1, 3),
                Airline = as.factor(Airline),
                MeasureTaken = as.numeric(MeasureTaken),
                CorrectionAngle = as.character(CorrectionAngle),
                LateralDeviation = as.double(LateralDeviation),
                xMeasureTaken = timeconvert(MeasureTaken),
                xControlStartTm = timeconvert(ControlStartTm),
                xControlEndTm = timeconvert(ControlEndTm))

df$CorrectionAngle <- as.double(df$CorrectionAngle)

# replace NAs introduced by coercion with an out-of-range value
df$CorrectionAngle[is.na(df$CorrectionAngle)] <- max(df$CorrectionAngle, na.rm = TRUE) + 1

head(df, 10)
Note: Next, I portray the distributions of selected variables. Considering the following graphics, I anticipate significant variance across the target variable. The original data appears consistent around the mean, with the exception of some outliers.
# lateral deviations across observations
par(mfrow = c(1, 1))
plot(df$LateralDeviation, type = "l", ylab = "Nautical Miles",
     col = "darkblue", main = "Lateral Deviations")
Note: The correction angle attribute refers to the degrees required to regain the proper heading in order to reach the next fix point associated with the flight plan. The following is a granular view of the correction angle data observations.
par(mfrow = c(2, 1))
# before
hist(sel1$CorrectionAngle, main = "Correction Angle Histogram",
     xlab = "Degrees", border = "white", col = "darkred", labels = FALSE)
# after log transform
hist(log2(sel1$CorrectionAngle), main = "Correction Angle Histogram\n(Log2)",
     xlab = "Degrees", border = "white", col = "darkblue", labels = FALSE)
Note: Next I display each ARTCC and the behavior of its aircraft relative to their assigned routes. In short, the distribution of deviations remains uniform when separated by center.
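The code behind the per-center graphic is not shown in the original. A minimal sketch follows, using simulated data in place of the real engineered data frame `df` (ZMA is deliberately given higher spread here to mirror the dispersion pattern discussed below; the values are illustrative, not real measurements):

```r
# Simulated stand-in for the engineered data frame `df`
set.seed(7)
df <- data.frame(
  ARTCC = factor(rep(c("ZAU", "ZNY", "ZMA", "ZLA"), each = 500)),
  LateralDeviation = c(rnorm(500, sd = 1), rnorm(500, sd = 1),
                       rnorm(500, sd = 3), rnorm(500, sd = 1))
)

# side-by-side distributions of lateral deviation, one box per center
boxplot(LateralDeviation ~ ARTCC, data = df,
        col = "darkgrey", border = "darkblue",
        ylab = "Nautical Miles", main = "Lateral Deviation by ARTCC")
```

Grouped boxplots make differences in spread between centers immediately visible, which is what surfaces ZMA's higher variability.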
Note: One key insight: the above graph shows ZMA as the control center with the highest variability, or dispersion. This trait makes sense considering that ZMA (Miami) sits in a significantly different location, more susceptible to varied weather elements and to offshore inbound and transient air traffic.
ssel3 <- sel1

# function to calculate IQR-based bounds
calc_iqr_bounds <- function(data) {
  Q <- quantile(data, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- IQR(data, na.rm = TRUE)
  lower <- Q[1] - 1.5 * iqr
  upper <- Q[2] + 1.5 * iqr
  return(list(lower = lower, upper = upper))
}

# calculation of bounds
bounds <- calc_iqr_bounds(ssel3$LateralDeviation)
NOutliers <- subset(ssel3$LateralDeviation,
                    ssel3$LateralDeviation > bounds$lower &
                    ssel3$LateralDeviation < bounds$upper)

# distributions with and without outliers
par(mfrow = c(1, 2))
boxplot(ssel3$LateralDeviation, col = "darkgrey", border = "darkred",
        horizontal = FALSE, main = "Distribution with Outliers")
boxplot(NOutliers, col = "darkgrey", border = "darkblue",
        main = "Distribution without Outliers")
Note: To better characterize these deviations, I calculated additional measures by applying moving-average filters: a weighted moving average with a backward window of two positions, the absolute value of that weighted moving average, and moving averages of order 4 and 7 (via forecast::ma).
pacman::p_load(forecast, pracma)
df1 <- ssel3
df1["WMA"] <- pracma::movavg(df1$LateralDeviation, n = 2, type = "w")
df1["Abs_WMA"] <- abs(df1$WMA)
df1["ma4_lateraldev"] <- forecast::ma(df1$LateralDeviation, order = 4)
df1["ma7_lateraldev"] <- forecast::ma(df1$LateralDeviation, order = 7)
summary(df1)
ARTCC Airline UniqueID MeasureTaken
ZAU:549977 SWA : 190572 Length:1988068 Min. :61210
ZLA:547118 AAL : 158419 Class :character 1st Qu.:68980
ZMA:473121 UAL : 101846 Mode :character Median :74820
ZNY:417852 AWE : 97578 Mean :74696
DAL : 85748 3rd Qu.:80410
COA : 73412 Max. :86390
(Other):1280493
xMeasureTaken xControlStartTm xControlEndTm LateralDeviation
Length:1988068 Length:1988068 Length:1988068 Min. :-748.9390
Class1:hms Class1:hms Class1:hms 1st Qu.: -0.8658
Class2:difftime Class2:difftime Class2:difftime Median : 0.0043
Mode :numeric Mode :numeric Mode :numeric Mean : 0.3468
3rd Qu.: 0.8736
Max. : 596.9131
CorrectionAngle LateralStatus WMA Abs_WMA
Min. : 0.0000 1:802138 Min. :-748.7875 Min. : 0.0000
1st Qu.: 0.8003 2:228148 1st Qu.: -0.8656 1st Qu.: 0.2012
Median : 2.5434 3:161562 Median : 0.0045 Median : 0.8702
Mean : 24.9097 4:711040 Mean : 0.3468 Mean : 7.4450
3rd Qu.: 17.2016 5: 85180 3rd Qu.: 0.8747 3rd Qu.: 3.5313
Max. :180.9997 Max. : 596.6656 Max. :748.7875
ma4_lateraldev ma7_lateraldev
Min. :-747.9141 Min. :-747.3517
1st Qu.: -0.8684 1st Qu.: -0.8775
Median : 0.0046 Median : 0.0051
Mean : 0.3467 Mean : 0.3467
3rd Qu.: 0.8808 3rd Qu.: 0.8919
Max. : 595.2119 Max. : 594.3059
NA's :4 NA's :6
Note: To conclude the EDA, I went from a limited understanding of the variables to a robust, feature-engineered data frame while maintaining the original characteristics of the data. At this point, the tidy data frame is ready to be split into training and test sets for the classification models.
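A minimal sketch of that split follows; the 70/30 ratio and the simulated two-column `df` are illustrative assumptions standing in for the engineered data frame, not the original setup:

```r
# Simulated stand-in for the tidy, feature-engineered data frame
set.seed(123)
df <- data.frame(
  LateralStatus = factor(sample(1:5, 1000, replace = TRUE)),
  Abs_WMA       = abs(rnorm(1000))
)

# random 70/30 train/test partition by row index
train_idx <- sample(seq_len(nrow(df)), size = 0.7 * nrow(df))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

c(train = nrow(train), test = nrow(test))  # 700 / 300
```

With a target this imbalanced in the real data (class 1 alone holds ~800k of 1.9M rows), a stratified split, for example with caret's createDataPartition, would preserve the class proportions in both sets.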