Item Details

Applying Genetic Algorithms to the Problem of Variable Selection in Large Datasets With Interaction Terms

Gan, Chee Chun
Thesis/Dissertation; Online
Learmonth, Gerard
Variable selection is a key step in the development of predictive models. When the dataset is relatively small, greedy algorithms such as stepwise selection perform well at selecting informative variables. However, as the size of the dataset increases, the challenges faced by such variable selection methods increase rapidly. The addition of interaction terms drastically increases the complexity of the variable selection problem, rendering greedy stepwise selection ineffective. Past research on the topic has seldom included the effect of interaction terms on predictive modeling. Part of the reason may be the aforementioned difficulty of variable selection in a large dataset. Another possibility is the tradeoff between model accuracy and complexity, where the benefits of including interaction terms may be marginal. However, in certain applications such as medical diagnosis models, even a marginal increase in predictive ability may lead to significant improvements in terms of lives saved. In addition, information obtained during the variable selection process, such as which interaction terms are significant, may guide future research into why such interactions exist among certain primary predictors. A genetic algorithm (GA) is developed in this study to handle the expanded search space of primary and interaction terms for variable selection. While GAs have been used for variable selection in the past, the chromosome formulation and selection process must be modified to accommodate interaction terms in large datasets. The GA framework is highly flexible and can handle a wide variety of models simply by choosing the appropriate fitness function. Experimental runs show that there is a benefit to including interaction terms in addition to main effects in large datasets.
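As an illustration of the general idea described in the abstract, the following is a minimal sketch of a GA that searches over main effects and pairwise interaction terms. It is not the thesis's implementation: the binary chromosome layout, the tournament selection, uniform crossover, elitism, and every name here (`candidate_terms`, `toy_fitness`, `run_ga`, the `informative` set) are assumptions made for this example. In particular, `toy_fitness` is a hypothetical stand-in for a model-based score such as validation accuracy.

```python
import random
from itertools import combinations


def candidate_terms(n_vars):
    """Main effects as 1-tuples, followed by all pairwise interactions as 2-tuples."""
    mains = [(i,) for i in range(n_vars)]
    pairs = list(combinations(range(n_vars), 2))
    return mains + pairs


def toy_fitness(chromosome, terms, informative, penalty=0.1):
    """Hypothetical score: reward informative terms, penalize model size.

    A real application would fit a model on the selected terms instead.
    """
    selected = {t for bit, t in zip(chromosome, terms) if bit}
    return len(selected & informative) - penalty * len(selected)


def run_ga(terms, fitness, pop_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Simple GA over bit strings; one bit per candidate term (assumed encoding)."""
    rng = random.Random(seed)
    n = len(terms)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        elite = max(pop, key=fitness)
        history.append(fitness(elite))
        next_pop = [elite[:]]  # elitism: carry the best individual forward
        while len(next_pop) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            a = max(rng.sample(pop, 3), key=fitness)
            b = max(rng.sample(pop, 3), key=fitness)
            # Uniform crossover, then independent bit-flip mutation.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness), history


# Usage: 6 primary variables give 6 main effects + 15 pairwise interactions.
terms = candidate_terms(6)
informative = {(0,), (3,), (1, 4)}  # made-up "true" terms for the toy score


def score(chromosome):
    return toy_fitness(chromosome, terms, informative)


best, history = run_ga(terms, score)
selected = [t for bit, t in zip(best, terms) if bit]
print(selected, score(best))
```

Swapping `score` for a function that fits a model on the selected columns and returns a validation metric changes the model being searched without touching the GA loop, which is the kind of flexibility the abstract attributes to the choice of fitness function. Note that the number of candidate terms grows quadratically with the number of primary variables, so this flat encoding is only practical for small examples.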
University of Virginia, Department of Systems Engineering, PHD (Doctor of Philosophy), 2016
Libra ETD Repository
In Copyright