Wednesday, April 3, 2013

Journey Across Learning Methods and Model Development


This is going to be a slightly more technical post exploring tools that
come more or less standard with R. To be precise, the purpose of the
post is to explore the performance characteristics of different
inference approaches and to walk through the journey of building,
validating and refining the models.

To set the stage, I had been looking for an appropriate example dataset
to run this kind of comparative analysis on. Thanks to a recent
interaction, I was made aware of the UCI machine learning datasets and,
in particular, was pointed to the census income dataset.

This website is a great compendium of datasets, appropriately broken
down into training and test sets, along with metadata describing the
data elements, the context of the dataset's publication, and some
commentary around the current set of results.

The dataset consists of various demographic, financial, and educational
variables (a mix of categorical and continuous variables) and an
outcome variable indicating whether the subject has an annual income
above $50,000. The variables are described below (a quick sketch of
pulling the data into R follows the list):

age: integer
workclass: categorical
fnlwgt: some indexing factor for normalization
education: categorical
education.num: number of years of education, integer
marital.status: categorical
occupation: categorical
relationship: categorical
race: categorical
sex: categorical
capital.gain: continuous ($-gains)
capital.loss: continuous ($-loss)
hours.per.week: integer
native.country: categorical, only about 30 or so different countries are tabulated
income.flag: whether income is above or below $50,000 -- outcome variable
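
For reference, here is a minimal sketch of how the data can be pulled
into R. The local file names are my own choice for illustration, and the
unknown values coded as "?" are simply left as their own factor level here:

# Column names as listed above; the raw UCI files are assumed to be saved
# locally as adult.data (train) and adult.test (test).
cols <- c("age", "workclass", "fnlwgt", "education", "education.num",
          "marital.status", "occupation", "relationship", "race", "sex",
          "capital.gain", "capital.loss", "hours.per.week",
          "native.country", "income.flag")

train <- read.csv("adult.data", header = FALSE, col.names = cols,
                  strip.white = TRUE)
test  <- read.csv("adult.test", header = FALSE, col.names = cols,
                  strip.white = TRUE, skip = 1)  # the test file starts with a comment line

# Recode the outcome as a 0/1 flag; the test file labels carry a trailing
# period (">50K."), hence the pattern match.
train$income.flag <- as.integer(grepl(">50K", train$income.flag))
test$income.flag  <- as.integer(grepl(">50K", test$income.flag))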

What is the Analysis Methodology?


As a first pass, I figured I would use naive estimation approaches with
various classes of learning models: logistic regression, random
forests, support vector machines, and gradient boosting methods.

So, the naive model simply tries to fit the outcome variable against
the rest of the variables in the dataset. The so-called model formula
looks as follows:

outcome ~ age + workclass + fnlwgt + education + education.num + marital.status + occupation + relationship + race + sex + capital.gain + capital.loss + hours.per.week + native.country

This will form the basis of most of the investigations below. I had to
make a few tweaks in order for this to work (e.g. remove
native.country, since it results in too many categorical levels and
some packages can't handle that).
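
As a concrete reference point, here is a hedged sketch of what the
naive fits might look like in R. The package choices (randomForest,
e1071 for the SVM, and gbm) and the specific settings are my
assumptions rather than a record of exactly what was run:

# Naive formula with native.country dropped, as discussed above.
fmla <- income.flag ~ age + workclass + fnlwgt + education + education.num +
  marital.status + occupation + relationship + race + sex +
  capital.gain + capital.loss + hours.per.week

library(randomForest)
library(e1071)
library(gbm)

fit.log <- glm(fmla, data = train, family = binomial)          # logistic regression
fit.rf  <- randomForest(update(fmla, factor(income.flag) ~ .),
                        data = train, ntree = 200)             # random forest
fit.svm <- svm(update(fmla, factor(income.flag) ~ .),
               data = train, probability = TRUE)               # support vector machine
fit.gbm <- gbm(fmla, data = train,
               distribution = "bernoulli", n.trees = 100)      # out-of-the-box gbm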

There are a number of interesting metrics to look at, but for this
purpose I am focusing on two important ones -- the AUC (area under the
ROC curve) and the contingency table of the outcomes. The AUC is
important as it tells us how well the model separates the two classes;
as a rule of thumb, a model with an AUC above 0.8 is empirically
considered a good one. The contingency table allows us to quickly look
at the trade-off between false positives (in this case, classifying
someone as having income greater than $50K when they are actually below
that earning threshold) and false negatives (the model forecasts below
$50K, but in actuality the income is higher than $50K).
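
As an aside, computing the AUC in R is straightforward. The sketch
below uses the ROCR package and the hypothetical logistic fit from the
earlier sketch; the original analysis may well have used a different package:

library(ROCR)

# Predicted probabilities for the test dataset from the logistic fit.
p.test <- predict(fit.log, newdata = test, type = "response")

# ROCR takes the scores and the true 0/1 labels.
pred <- prediction(p.test, test$income.flag)
auc  <- performance(pred, measure = "auc")@y.values[[1]]
auc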


Another interesting thing to consider when building models is to see
how the accuracy measures degrade between the training and testing
datasets.

First of all, let us look at the AUC measures tabulated below. (Note:
some values are missing due to data issues -- I will resolve and
repost).

Before we go any further, let me explain how I have chosen the cutoffs
for the logistic regression model. The naive approach is to classify
based upon the predicted probability being higher or lower than 0.5.
Another possibility is to look at the distribution of positive vs.
negative cases in the training dataset and use that to choose a
threshold for the test dataset. In this case, roughly 24% of the
observations in the training dataset have an income level above $50K.
So, I picked roughly the (100 - 24) = 76th percentile of the predicted
probabilities as the cutoff threshold. This resulted in using 0.39 as
the discriminant on the probability scale.
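
In R, that quantile-based cutoff can be computed along these lines
(again reusing the hypothetical objects from the earlier sketches):

# Share of >$50K cases in the training data (roughly 0.24).
pos.rate <- mean(train$income.flag)

# Use the matching upper quantile of the predicted training probabilities
# as the cutoff; per the discussion above this lands near 0.39.
p.train <- predict(fit.log, newdata = train, type = "response")
cutoff  <- quantile(p.train, probs = 1 - pos.rate)
cutoff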

Now we are ready to dive into the details. First of all, we see that
random forest performs very well on the training dataset (not a
surprise -- it is known for overfitting), and that issue becomes
obvious as we go to the test dataset. It suffers the largest decline in
quality, going from roughly 0.863 to 0.799.

The surprising finding is that the naive gradient boosting method is
not performing very well.

Logistic regression is very stable going from training to test dataset
(as expected), but stays (just) below the random forest metrics.

AUC        train       test
rf         0.8627366   0.7994727
svm        0.7603794   --
log_0.5    0.7687262   0.7649428
log_0.39   0.8001345   0.7958619
gbm        0.5754289   0.5727721


Next, let us take a look at the misclassification errors. What we are
interested in here is seeing where the forecast differs from the actual
income flag. We are going to use a contingency (or truth) table for
that.

Each row represents the actual outcome and each column the forecast.
For example, there are a total of (24707 + 13) = 24720 cases where the
income is below $50K in the training dataset. 0 indicates the case of
below-$50K income (for either the forecast or the actuals). The cells
of interest are really (0,1) and (1,0), which are, respectively, the
false positives and the false negatives.

We are going to define the error as the total misclassifications
divided by the total number of observations (~32K in training and ~16K
in test).
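
In R terms, the truth table and the error rate for, say, the logistic
model on the training data might be computed as below (a sketch that
reuses the hypothetical cutoff from above):

# Forecasts at the chosen cutoff, compared against the actual flags.
forecast <- as.integer(p.train > cutoff)
tab <- table(actual = train$income.flag, forecast = forecast)
tab

# Misclassifications (false positives + false negatives) over all observations.
err <- (tab["0", "1"] + tab["1", "0"]) / sum(tab)
err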

Now, let's look at how these methods performed. We will begin with the
random forests (which performed quite well). Even though the error rate
has inched up from 10.5% on the training data to 14.9% on the test
data, we find that it does a pretty good job at classification, as seen
by the roughly equal numbers of false positives and false negatives. On
a separate note, this may or may not be the desired outcome, depending
upon what we intend to do with the result. For example, if we are going
to use this to offer an individual a $1MM loan, then we might bias more
towards false negatives -- we want to avoid approving someone whose
actual income is below $50K for such a large loan amount.

RF (0.39 probability threshold)
           train                      test
             0     1                    0     1
       0 22855  1865              0 11137  1298
       1  1561  6280              1  1141  2705

error:   0.1052179                0.1498065

Next, let's look at the logistic regressions. We see an interesting
dynamic here -- the 0.5 threshold gives a better overall error
rate. However, it tends to skew towards allowing a much higher number
of false negatives. Again, depending upon the usage, either one of
these two alternatives may serve well.
Logistic Regression (0.5 threshold)
           train                      test
             0     1                    0     1
       0 23037  1683              0 11578   857
       1  3093  4748              1  1543  2303

error:   0.1466785                0.1474111

Logistic Regression (0.39 threshold)
           train                      test
             0     1                    0     1
       0 22074  2646              0 11086  1349
       1  2295  5546              1  1153  2693

error:   0.151746                 0.1536761


Finally, this takes us to the gradient boosting method. We have already
seen poor out-of-the-box performance on this, so the results are not
very surprising. The error rates are much higher and the degradation
going from training to test is also significant.

More work is needed to tune the gbm method before we can make a
definitive call (a rough tuning sketch follows the results table below)!

gbm
           train                      test
             0     1                    0     1
       0 24707    13              0  8425  4010
       1  6654  1187              1   570  3276

error:   0.2047542                0.2813095
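
For what it's worth, the usual tuning knobs for gbm are the number of
trees, the shrinkage (learning rate) and the interaction depth. The
sketch below shows one plausible direction, with cross-validation to
pick the number of trees; the particular values are illustrative and
not necessarily what the follow-up analysis will use:

# A more deliberate gbm fit: many small steps plus cross-validation.
fit.gbm2 <- gbm(fmla, data = train, distribution = "bernoulli",
                n.trees = 3000, shrinkage = 0.01, interaction.depth = 3,
                cv.folds = 5)

# Pick the iteration with the lowest cross-validated error, then score
# the test dataset at that iteration.
best.iter <- gbm.perf(fit.gbm2, method = "cv")
p.gbm     <- predict(fit.gbm2, newdata = test,
                     n.trees = best.iter, type = "response")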


It should be noted that this is getting us into the ballpark of some of the published research. As per the dataset's web pages, the error rates achieved by some of the documented approaches are in the 14.05% to 20% range, with a good cluster in the 14%-15% range. In comparison, our methods are close to 15%.

This was a quick post and I didn't have time to go into a number of
other details. I plan on following up this post with a few more
iterations showcasing incremental improvement in each of the
approaches and comparing additional metrics to understand the relative
merits of these approaches.



We will be looking at a number of new things: more visualizations, experimentation with regularization methods, and some exploration of interaction and higher-order training terms.
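
As a teaser, the regularization experiments will likely look something
along the lines of the glmnet sketch below; the lasso penalty
(alpha = 1) and the example interaction term are illustrative
assumptions on my part:

library(glmnet)

# glmnet wants a numeric design matrix, so expand the factors explicitly;
# an illustrative interaction term is tacked onto the naive formula.
fmla.int <- update(fmla, . ~ . + education.num:hours.per.week)
x.train  <- model.matrix(fmla.int, data = train)[, -1]
x.test   <- model.matrix(fmla.int, data = test)[, -1]

# Cross-validated, lasso-penalized logistic regression.
fit.net <- cv.glmnet(x.train, train$income.flag, family = "binomial", alpha = 1)
p.net   <- predict(fit.net, newx = x.test, s = "lambda.min", type = "response")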

Stay tuned!