Wednesday, April 3, 2013

Journey Across Learning Methods and Model Development


This is a slightly more technical post exploring tools that come more
or less standard with R. To be precise, the purpose of the post is to
compare the performance characteristics of different inference
approaches and to walk through the journey of building, validating,
and refining the models.

To set the stage, I had been looking for an appropriate example
dataset to run this kind of comparative analysis. Thanks to a recent
interaction, I was made aware of the UCI machine learning repository
and in particular was pointed to the census income dataset.

The site is a great compendium of datasets, each appropriately broken
down into training and test sets, along with metadata describing the
data elements, the context of the dataset's publication, and some
commentary on the current set of results.

The dataset consists of various demographic, financial, and
educational variables (a mix of categorical and continuous variables)
and an outcome variable that indicates whether the subject's annual
income is above $50,000. The variables are listed below:

age: integer
workclass: categorical
fnlwgt: some indexing factor for normalization
education: categorical
education.num: number of years of education, integer
marital.status: categorical
occupation: categorical
relationship: categorical
race: categorical
sex: categorical
capital.gain: continuous ($-gains)
capital.loss: continuous ($-loss)
hours.per.week: integer
native.country: categorical, only about 30 or so different countries are tabulated
income.flag: whether income is above or below $50,000 -- outcome variable

What is the Analysis Methodology?


As a first pass, I decided to use naive estimation approaches with
several classes of learning models: logistic regression, random
forests, support vector machines, and gradient boosting methods.

So, the naive model simply fits the outcome variable against the rest
of the variables in the dataset. The so-called model formula looks as
below:

outcome ~ age + workclass + fnlwgt + education + education.num + marital.status + occupation + relationship + race + sex + capital.gain + capital.loss + hours.per.week + native.country

This will form the basis of most investigations below. I had to make a
few tweaks for this to work (e.g. remove native.country, since it
results in too many categorical levels and some packages can't handle
that many).
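For concreteness, here is a minimal sketch of how the naive fits might
look in R. The post does not name the exact packages, so the
randomForest, e1071 (for the SVM) and gbm packages, the data frame
names train/test, the ">50K" label, and the n.trees setting are all
assumptions on my part.

library(randomForest)   # random forest
library(e1071)          # support vector machine
library(gbm)            # gradient boosting

# Assumed: `train` holds the UCI training file with income.flag as a
# two-level factor and native.country already dropped.
form <- income.flag ~ age + workclass + fnlwgt + education + education.num +
  marital.status + occupation + relationship + race + sex +
  capital.gain + capital.loss + hours.per.week

fit_rf  <- randomForest(form, data = train)              # classification forest
fit_svm <- svm(form, data = train, probability = TRUE)   # SVM with probability estimates
fit_log <- glm(form, data = train, family = binomial)    # logistic regression

# gbm's bernoulli loss wants a 0/1 numeric response, so refit on a recoded copy
train01 <- transform(train, income.flag = as.numeric(income.flag == ">50K"))
fit_gbm <- gbm(form, data = train01, distribution = "bernoulli", n.trees = 1000)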

There are a number of interesting metrics to look at, but for this
purpose I am focusing on two -- the AUC and the contingency table of
the outcomes. The AUC tells us how good the model is at discriminating
between the two classes; as a rule of thumb, an AUC above 0.8 is
considered a good model. The contingency table lets us quickly look at
the trade-off between false positives (in this case, classifying
someone as earning more than $50K when they are actually below that
threshold) and false negatives (the model forecasts below $50K, but
the actual income is higher).
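As a reference point, this is roughly how the AUC can be computed in R.
I am using the pROC package here as one convenient choice (the post
does not say which package was actually used), and fit_log/train/test
refer to the assumed objects from the sketch above.

library(pROC)

p_train <- predict(fit_log, type = "response")                  # fitted P(income > $50K) on train
p_test  <- predict(fit_log, newdata = test, type = "response")  # same on the held-out test file

auc(roc(train$income.flag, p_train))   # training AUC
auc(roc(test$income.flag,  p_test))    # test AUC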


Another interesting thing to consider when building models is to see
how the accuracy measures degrade between the training and testing
datasets.

First of all, let us look at the AUC measures tabulated below. (Note:
some values are missing due to data issues -- I will resolve and
repost).

Before we go any further, let me explain how I chose the cutoff for
the logistic regression model. The naive approach is to classify based
upon the probability being higher or lower than 0.5. Another
possibility is to look at the distribution of positive vs negative
cases in the training dataset and use that to choose a threshold for
the test dataset. In this case, roughly 24% of the training
observations have an income above $50K, so I picked roughly the
(100 - 24) = 76th percentile of the predicted probabilities as the
cutoff. This resulted in a discriminant of 0.39 on the probability
scale.
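A minimal sketch of that cutoff calculation, reusing the assumed
p_train/p_test vectors from above and again assuming the positive
class is labeled ">50K":

prevalence <- mean(train$income.flag == ">50K")          # roughly 0.24 in the training data
cutoff     <- quantile(p_train, probs = 1 - prevalence)  # ~76th percentile, about 0.39 here
pred_39    <- as.numeric(p_test > cutoff)                # 1 = forecast above $50K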

Now we are ready to dive into the details. First of all, random forest
performs very well on the training dataset (no surprise there -- it is
known for overfitting), and that issue becomes obvious on the test
dataset: it suffers the largest decline in quality, going from roughly
0.86 to 0.80.

The surprising finding is that the naive gradient boosting method is
not performing very well.

Logistic regression is very stable going from training to test dataset
(as expected), but stays (just) below the random forest metrics.

AUC         train       test
rf          0.8627366   0.7994727
svm         0.7603794
log_0.5     0.7687262   0.7649428
log_0.39    0.8001345   0.7958619
gbm         0.5754289   0.5727721


Next, let us take a look at the misclassification errors. What we are
interested in here is when the forecast differs from the actual income
flag. We are going to use a contingency (or truth) table for that.

Each row represents the actual outcomes. For example, there are a
total of (24707 + 13) = 24720 cases where the income is below $50K in
the training dataset. 0 indicates income below $50K (for both the
forecast and the actuals). The cells of interest are (0,1) and (1,0),
which are the false positives and false negatives respectively.

We define the error as the total number of misclassifications divided
by the total number of observations (~32K in training and ~16K in
test).
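Given a vector of predictions and the actual flags, the truth table
and the error rate fall out directly; a sketch, again using the
assumed objects from the earlier snippets:

# truth tables (rows = actuals, columns = forecasts)
conf_train <- table(actual = train01$income.flag,
                    predicted = as.numeric(p_train > cutoff))
conf_test  <- table(actual = as.numeric(test$income.flag == ">50K"),
                    predicted = pred_39)

# error = off-diagonal counts / total observations
err_rate <- function(tab) 1 - sum(diag(tab)) / sum(tab)
err_rate(conf_train)
err_rate(conf_test)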

Now, let's look at how these methods performed. We will begin with the
random forests (which performed quite well). Even though the error
rate inches up from 10.5% on the training data to 14.9% on the test
data, it does a pretty good job at classification, as seen by the
roughly equal numbers of false positives and false negatives. On a
separate note, this may or may not be the desired outcome depending
upon what we intend to do with the result. For example, if we are
going to use this to offer an individual a $1MM loan, then we might
bias towards more false negatives (and fewer false positives) -- we
want to minimize the number of people approved for such a large loan
whose income is actually below $50K.

RF (probability cutoff 0.39)
         train                      test
             0     1                    0     1
       0 22855  1865              0 11137  1298
       1  1561  6280              1  1141  2705

error: 0.1052179 (train)     0.1498065 (test)

Next, let's look at the logistic regressions. We see an interesting
dynamic here -- the 0.5 threshold gives a better overall error
rate. However, it skews towards allowing a much higher number of false
negatives. Again, depending upon the usage, either one of these two
alternatives may serve well.

Logistic Regression (0.5 threshold)
         train                      test
             0     1                    0     1
       0 23037  1683              0 11578   857
       1  3093  4748              1  1543  2303

error: 0.1466785 (train)     0.1474111 (test)

Logistic Regression (0.39 threshold)
         train                      test
             0     1                    0     1
       0 22074  2646              0 11086  1349
       1  2295  5546              1  1153  2693

error: 0.151746 (train)      0.1536761 (test)


Finally, this takes us to the gradient boosting method. We have
already seen poor out-of-the-box performance here, so the results are
not very surprising. The error rates are much higher and the
degradation going from training to test is also significant.

More work is needed to tune the gbm method before we can make a
definitive call!

gbm
         train                      test
             0     1                   0    1
       0 24707    13              0 8425 4010
       1  6654  1187              1  570 3276

error: 0.2047542 (train)     0.2813095 (test)


It should be noted that this gets us into the ballpark of some of the published research. As per the dataset's web pages, the error rates achieved by various approaches are in the 14.05% to 20% range, with a good cluster around 14%-15%. In comparison, our methods are close to 15%.

This was a quick post and I didn't have time to go into a number of
other details. I plan on following up this post with a few more
iterations showcasing incremental improvement in each of the
approaches and comparing additional metrics to understand the relative
merits of these approaches.



We will be looking at a number of new things: more visualizations, experimentation with regularization methods, and some exploration of interaction and higher-order terms.

Stay tuned!

Friday, August 31, 2012

Migration of Consumer Web Technologies into Enterprise World



As we trace the history of computing, it was initially affordable only to large organizations with significant financial wherewithal. The introduction of the personal computer ushered in an era where end consumers could enjoy some of that computational power.

With the rise of companies such as SAP, enterprise scale software became more prevalent. There were a host of companies that would design a solution and sell that solution to other large corporations (HR Management systems, Accounting systems, CRM systems etc).

However, with the rise of open-source technologies (Linux, Apache, Tomcat and numerous others), a renaissance came into full bloom. Interestingly, this phenomenon was driven by what we now call "web consumer companies" -- companies that ran websites and allowed end-users to interact with them directly through a browser. No expensive licenses or on-site installs were required.

An early pioneer in this area was Google (if you think about it carefully, Google's biggest contribution is probably its huge infrastructure rather than advertisements), which pioneered distributed computing on commodity hardware. Out of that effort came things such as MapReduce (and its ecosystem), BigTable, etc. This sparked a revolution, with other web companies (Facebook, Twitter, LinkedIn, Yahoo, etc.) helping each other by contributing heavily to open source initiatives.

This was a good thing and was driven entirely by necessity. In the early 2000s, WebLogic and IIS ruled the enterprise software vendor world, but that came to a rapid end with the development of Apache and Tomcat, which relegated them to obscurity.

So, what good are these technologies to existing enterprises?


Well, a new spate of companies has realized that these same concepts can be packaged up and re-sold to traditional enterprises (these customers still like large, packaged, supported software -- they don't think it is critical to employ software developers to use the open source systems directly). These companies are now seeing quite a bit of success. Here are some examples of such companies and how well they are doing:

1. SuccessFactors: By definition, web consumer companies were based in the cloud (in the sense that the data resided with the company). SuccessFactors combined that concept with automating the tedious HR review process. It was acquired by SAP in Dec 2011 for $3.4B (http://seekingalpha.com/article/312020-sap-s-successfactors-acquisition-a-1-2-billion-wealth-transfer) -- supposedly to enable SAP to adopt the same cloud-resident model.

2. Yammer: This was kind of a social network for the enterprises. Microsoft gobbled them up recently (June 2012) for $1.2B.

3. Workday: This is still a company in the works, but the founders essentially re-invented the HR management wheel in the cloud. Their revenues are growing fast (2011: $150MM; 2012 projection: $300MM -- http://blogs.wsj.com/digits/2012/08/30/workday-discloses-finances-plans-for-founders-control/). They will likely replace things such as the SAP/Oracle HR suites.

4. LinkedIn: I mention this since it is one of the few companies that is dual facing -- a consumer-facing part that allows users to create profiles and an enterprise-facing part that allows companies to improve their hiring efficiency. Additionally, they have figured out how to make the financial model work on both ends (a $20/month individual subscription vs likely a 6-figure enterprise license, an educated guess). They have managed to grow their revenues at a very healthy pace and have maintained their $10B valuation. They will likely displace companies such as Taleo, Dice.com, Monster etc.

5. Cloudera/MapR/HortonWorks: These companies are trying to make the core Hadoop software digestible for large enterprises. I expect them to be valued quite highly. They are likely in the fray to severely curtail revenues for storage and database companies such as Oracle, SAP, EMC etc.

6. Stripe/Square: These are the new breed of startups that have embraced mobile devices natively and will likely be successful in displacing significant business from Visa, MasterCard etc.

So, where does this lead in the future?


I expect a large number of new companies to flourish. They will need to know how to use already successful technologies and marry that with an interesting enterprise problem to create substantial value.

If you know of any more emerging companies with this theme, I would love to hear about that!

Monday, December 13, 2010

What is Retail Analytics?

By retail analytics, I loosely mean the analytics used by retailers or
retail goods manufacturers to run their operations efficiently. This
definition is inclusive of all departments within a retail
organization as well as of all retailers and manufacturers.


For example, the finance group will be more interested in forecasting
the overall revenues and margins for the business based upon past
history. The merchandising division would be more interested in
similar numbers broken out by each of the merchandising categories
that they manage. The catch is that these analyses may not be
consistent with one another. For example, one would expect the sum of
the merchandisers' forecasts to equal the finance department's
forecast -- but they often don't!

Each of these analytics is served by some combination of internal
teams, boutique consulting and other established service providers. We
will look at the important players in another blog post.



How can retail analytics be classified?

We can loosely classify these analytics into three major buckets --
supply side analytics, demand side analytics, and consumer
analytics. Each of them is elaborated below


What are supply side analytics?

Supply side analytics refers to the analytics that affect the supply
of goods, typically related to inventory and supplier management and
other associated activities, as explained below.


Inventory replenishment decisions: The main decisions for
inventory management are determining what products to reorder, in
what quantities, and at what times. There is a large amount of
downstream uncertainty in terms of merchandising plans and variation
in demand induced by external factors such as weather, the overall
economy, holiday periods, etc., which makes these decisions very challenging.


Initial order quantities (for long lead-time and seasonal
items): This refers primarily to fashion or seasonal goods where
the time from conceiving a new design to delivering the final
product is fairly long (6-12 months). To complicate matters, many
of the products in this sector are obsolete at the end of the
season, so getting the initial order quantities right is crucial.
If you order too much, the final margins will be squeezed by
liquidation pressure towards the end of the season. On the other
hand, under-ordering leads to premature sellout and loss of
additional revenues. Apart from determining the initial order
quantities, this problem can also be controlled downstream by
better design of markdown and promotion events.

Store allocation and replenishment strategies: Frequently, the
finished products are delivered to warehouses or distribution
centers. This is then followed by allocation and delivery over
time to various stores. Roughly speaking, the goal is to stock
the warehouses close to the amount that the stores served by them
are expected to sell. Again, an imbalance leads to lost revenues
or lower margins due to the increased cost of liquidation or
cross-selling. The optimization approach deals both with the
upfront stocking of the warehouses and with subsequent decisions
about allocating and scheduling trucks to minimize various costs.

Other supply side decisions
There are some other decisions taken on the supply side that don't
fall into one neat category. Some of them are outlined below:

Supplier portfolio management: Each business unit (retailer or
manufacturer) has many dozens of suppliers. These suppliers are
chosen for a number of reasons -- some provide unique products,
others are used to spread the demand, yet others are used for
geographical reasons. There are ways to optimize this portfolio
based upon various criteria such as meeting unexpected demand
spikes quickly and minimizing long-term costs and administrative
expenses.

Hedging (currency, commodities, etc.): With the increasing pace of
globalization, it becomes imperative to manage not only currency
risk but also commodity risk. Commodity prices can fluctuate
considerably and lead to unexpected cost increases in the finished
products, so identifying and controlling for the most important
commodities becomes very important. This primarily applies to
manufacturers that assemble finished items from raw materials, but
it can come into play unexpectedly. For example, Southwest Airlines
had a significantly lower cost base due to long-term crude oil
futures.


What are the demand side analytics?

Demand side analytics closely mirror the end consumer's thinking,
except that they are applied in aggregate across consumers and
geographical regions.

It starts with trying to understand what the consumer's needs are
(what is she looking for?) and making subsequent pricing and promotion
decisions based upon consumer affordability and marketplace dynamics.
Further, when the consumer is in the store, how does she navigate it?
How does she interact with the store (coupon usage, end-cap displays,
center-store displays, center-aisle displays, etc.)?

Finally, understanding the consumption patterns would help retailers
design their merchandising to promote up-sell and cross-sell
opportunities.



These can then be further divided into following groups:

Assortment optimization: Assortment refers to the collection of
products/services carried by a retailer. In most cases the
fundamental decision a merchandiser faces is how to change the
assortment to realize better gains -- what products to carry, when
to carry them, and in what locations. She is influenced by various
parties: vendors who want to gain market share through a particular
retail channel; an internal marketing group that identifies consumer
needs based upon past history, surveys, or market intelligence; the
merchandising strategy of selling private label brands (e.g. a
retailer's own version of cola) vs national brands (e.g. Coke or
Pepsi); internal targets for sales and profits; and consideration of
store footprints (the amount and type of space available vary by
store format and size -- larger stores have more, and more types of,
aisles than smaller stores).

Pricing optimization: The most fundamental financial decision is
setting the prices for the products being sold. This has become an
advanced art in itself, with retailers practicing various strategies
-- everyday low prices, hi-low pricing, dollar bins, etc. More
specifically, this refers to setting the everyday (or base) price
for fast-moving consumer goods (FMCG) and setting the initial price
for fashion goods. Setting prices is a very complex process due to
the number of SKUs in a category and the number of stores being
served. Also, the optimal price varies not only with the
geographical region of the store location, but also with the
assortment carried at the store and the prices set by competing
retailers on similar items. A national chain with 1000 stores and a
medium-sized category of 500 items could potentially have to deal
with 500,000 price variables, each of which could take on up to 100
different price points -- this is where mathematical optimization
plays a lead role in quickly finding the best set of prices.

Promotion optimization: Promotions are typically short term price
reductions in conjunction with the appropriate marketing message
(email message, newspaper flier etc). Determining what subset of
products to promote, how deep to run the promotion, how long to
run a promotion and how frequently to carry out promotions within
a category make this an order of magnitude more difficult than
the pricing decisions. Further, this is a very tactical activity
that requires frequent adjustments. In many cases this also
involves protracted negotiations with the manufacturers.

Clearance/Markdown optimization:
This usually applies to products that are towards the end of their
lifecycle -- either due to normal seasonal changes (swimming
clothes at the end of summer) or due to the introduction of newer
products (introducing the iPhone 4 causes the iPhone 3GS to fall
into this category). The primary goal here is to run down the
inventory with the smallest discounts taken from the price
point. Since there are usually other ways in which the leftover
inventory can be liquidated (shipped back to the manufacturer,
sold to other outlets that specialize in markdown inventory,
etc.), the decision making becomes complicated.

Shelf/Planogram optimization: The theory is that if related
(or frequently co-purchased) items are located in close vicinity,
it is easier for customers to find them, which drives additional
sales (think about the batteries placed close to many gadget and
toy aisles). Apart from this consideration, there are other
requirements such as carrying all of the assortment, giving premium
placement to some subset of products, etc. Retailers also think
carefully about the facings -- whether the broader side of the
product is visible or just the much thinner side. An excellent
example is books placed with the cover facing the consumer versus
the spine facing the consumer.

Display optimization
Apart from stocking the regular shelves, a select set of products
can also be placed in other prominent locations. These are
usually located in high traffic/visibility areas such as the ends
of the aisles (end-cap displays), the center of the store
(center-store displays), checkout aisles (checkout displays), and
the middle of the aisles (center-aisle displays). The number of
such fixtures is limited and also varies by store format -- so
utilizing them optimally leads to higher yields.

Check-out stand optimization (impulse buys)
Although we have mentioned the checkout displays in the previous
section, they deserve separate attention. They have a captive
audience as customers wait in line and are usually stocked with
impulse buys (candy, magazines, etc.) that also happen to have
higher yields.

Feature/Advertisement optimization
Feature or advertisement refers to the messaging provided to
customers. Traditionally this was restricted to print media
(newspapers, mail flyers, magazines), radio, and
television. However, with the advent of the internet this sector
has exploded (or splintered) with the availability of email
messages, paid-search advertising (Google), social media
advertising, banner advertising (rich media), coupon sites,
etc. The advent of smartphones is fueling a new generation of
capabilities that includes personalized delivery of features based
upon geo-location and other preferences. A newer set of avenues
has opened up through the placement of audio/video display devices
in stores, at gas stations, and in elevators. These new media
differ from the traditional media in one important way --
advertisements can be targeted more specifically and the responses
can be measured directly in some cases.

What are consumer-specific analytics?
Since demand is driven by individual consumers, the premise is
that by cultivating a better understanding of consumers, their
habits and preferences, one can cater much better to their needs
and realize bigger benefits. For example, a retailer could
identify that a significant portion of their consumers demand
organic food items that the retailer is not carrying. In other
cases, such analysis can uncover customer needs that are not
served by a particular retailer (e.g. a customer buying baby food
from a store but not buying diapers).


The Customer Relationship Management (CRM) groups have
traditionally conducted very detailed research to capitalize on
these findings. Some of the major initiatives are listed below.

Determine consumer preferences and tailor offerings
This refers to identifying gaps in assortment (both products and
services) that a substantial group of customers would be interested
in. After analysis, the retailer can decide to selectively expand
in these areas. One great example is the addition of bakery
sections in many grocery retail stores. Such analysis can be key
for very seasonal verticals, where catering to the changing tastes
of consumers leads to higher sales and margins.



1-1 specialized offers (upsell/cross-sell opportunities)
As one starts conducting this detailed analysis, it becomes clear
that customers are actually quite different from one another --
there are varying levels of brand loyalty, retailer loyalty, and
category loyalty. One can influence consumer behavior at the
individual level by identifying needs and price sensitivity and
providing good value -- for example, a customized coupon on bulk
purchases: 10% off a purchase of $50 or more in December.



Loyalty and credit card programs
The only way to understand customer behavior is to have access to
detailed transactions over time. This is achieved by enrolling
customers into a loyalty or credit card program (usually a
retailer-branded credit card) since it allows for identification of
customers and the ability to track them over time.

Although the programs are voluntary, it is easy to achieve customer
penetration upwards of 90% by properly incentivizing the
customers. This trove of information is just being mined for
marketing and sales purposes. I should point out that there is some
resistance to this due to privacy concerns, which will hopefully
get resolved to strike a good balance between privacy and allowing
for mining of insights.



Coupon and other offer programs
The whole purpose of doing customer-level analysis and marketing is
to influence customer behavior, and price is the most important
lever for doing so. Providing the right coupons to an individual or
a group of customers (large or small) can lead to increased sales
without sacrificing margins. This can be done selectively to
exclude the occasional shoppers who only buy on deep discounts.



How does online retailing impact consumer insights?

With internet technologies, not only is it easy to set up an
ecommerce store and serve customers, but it also becomes very easy
to track everything that the customer does -- how frequently does
the customer return to the website? How much time does she spend
browsing before making a purchase? How many different categories
has she browsed? All these learnings can then be used to provide
real-time offers to potential customers -- converting them from
mere browsers into actual purchasers. Social media will probably
herald a new era of customer marketing that is just now unfolding.


Are there any other analytics?

Localization:
Although this is covered to some degree in other sections
(assortment, pricing), it is important enough to be called out
separately. In a nutshell, the emphasis is on understanding the
differences in consumer behavior across stores and tailoring the
regional stores to better serve their constituent consumers. On
the assortment side, this could mean a wider variety of ethnic (or
special interest) items to reflect the local population. Pricing
and promotions also need to be tailored specifically to reflect
consumer choices (e.g. senior discount days, or promotion of
canned beans around Cinco de Mayo).



Store Location & Operations
Retailers are continuously evaluating their needs in terms of
opening or closing stores and expanding or remodeling existing
stores. These are large capital expenditures, so getting an
accurate read on the future demand implied by a possible decision
is very useful. For example, if a retailer wants to open 5 new
stores, they can find the optimal locations based upon demographic
information, market data, competitor activity, etc. Similar
analysis is helpful in determining which stores should add a new
department (a bakery, a florist's stand, etc.).

Operations refers to streamlining and optimizing store operations --
for example, how and when to restock shelves and how to design
processes for efficiency. This is mentioned more for completeness
than for any specific analytics application.

Wednesday, September 29, 2010

What makes for a good inference?

What is inference and why does it matter?

Data Proliferation [source]


We live in a sea of data and information, but rarely does information
translate to insight! What domain experts look for most is the ability
to make better decisions based upon insights (the beer-and-diapers
story comes to mind -- although it is an urban legend). A better
example is The Economist's subscription experiment:

  • Offer A: Online subscription for $60
  • Offer B: Print-only subscription for $120
  • Offer C: Print and online subscription for $120

The insight here was that people will switch to the print+online offer,
as it is perceived to be the best deal. The key is in providing a set
of choices that results in the desired outcome.

In analytics, we try hard to find such insights and help influence
business decisions. However, very frequently the decisions taken are
very mundane -- Should the large size of the cereal box be 20oz or 21oz?


What are commonly used inference methods?

So, how do analysts go from data to insights?
If you come from a statistics background, you would fall back on the
concept of hypothesis testing. You propose a hypothesis and try to
find the odds that it holds (frequently by checking whether the null
hypothesis can be rejected). The idea is simple but elegant. Suppose
you observe a person who is 99 years old and then try to answer the
question: what is the expected life span? Clearly you can't say it's
99 years, as you don't know whether that observation is an outlier or
the norm.

What are some of the drawbacks of statistical inference methods?

Hypothesis testing makes this approach precise -- it can tell you how
the odds change as a function of the number and type of your
observations. However, a lot of us frequently forget about the
assumptions that go into hypothesis testing, and as a result the
conclusions may not be reliable. (Does your data really have a normal
error distribution? Is your data really independent/uncorrelated as
required? For example, if your data is some type of time series, it
probably has a fair degree of autocorrelation.)


I have an interesting anecdote about this. We were testing a set of
models which had a couple of hundred coefficients. We found that about
5% were flagged as non-zero at a p-value of 5%. If you were to blindly
accept all the significant outcomes based on the p-values, you would
lose the big picture. A 5% p-value says that you can expect 5% Type I
errors (rejecting the null hypothesis incorrectly), and that is
exactly what we found in the data. But wait, that is not sufficient.
We also checked the coefficient values and found them to be close to
zero (0.1, 0.05, etc.), and hence it made sense to conclude that the
current formulation wasn't finding any significant coefficients.

Below is another example, using a t-test on two populations. This is usually used for determining whether there are significant differences between two datasets. It has many different applications -- clinical trials for testing the efficacy of a drug, A/B testing of offers in both the online and offline worlds, etc.

The simulation below demonstrates that it is possible to conclude incorrectly if the results are based on a single statistical measure (the p-value). The confidence interval provides another measure, one with the nice property of converging as the sample size increases.

The first example is a simulation of randomly generated uniform [0,1] (or U(0,1)) random variables. The second example uses standard normal (N(0,1)) random variables with mean = 0 and variance = 1.


CI and p-values for simulated data



We generate two samples from the same population (uniform [0,1] i.i.d
random variables) and then run a t-test with the null hypothesis that
the means are equal. A low p-value (<10% or 5%) would cause us to
reject the null hypothesis and conclude that the means are likely
different.

As you can see, even as the sample size increases the p-value tends to
bounce around (including some cases where it dips below the 10%
threshold -- not shown here). On the other hand, the confidence
interval does a much better job of indicating that the means are
identical. Notice how quickly the spread narrows.
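The exact code I used for the figures isn't shown here, but a minimal
sketch of this kind of simulation looks like the following (the sample
sizes and seed are arbitrary choices for illustration):

set.seed(42)
sizes <- c(50, 100, 500, 1000, 5000, 10000)

results <- t(sapply(sizes, function(n) {
  x <- runif(n); y <- runif(n)   # two samples from the same U(0,1) population
  tt <- t.test(x, y)             # H0: the means are equal
  c(n = n, p.value = tt$p.value,
    ci.lower = tt$conf.int[1], ci.upper = tt$conf.int[2])
}))
results   # the p-value bounces around; the CI for the difference tightens around 0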

Now the same experiment run on N(0,1). I have added points for the sample means, which are indeed close to 0 (the expected value).




If we pick a p-value threshold of 10%, we will reject the null
hypothesis in some of the cases. Apart from chance occurrences, when
you have large sample sizes the variances could be small enough to
give you this result.

So, how do I know this is indeed the case? I used a random number
generator to generate two sets with the _same_ parameters (uniform
random numbers U[0,1] in the first example and N(0,1) random numbers
in the second).

Next, I ran a simulation where I generated two independent vectors of 1000 N(0,1) variables and computed the p-value and confidence intervals. The figure shows that the confidence interval and sample mean (yellow) estimates are indeed clustered around 0, but the p-value (green) bounces around quite a bit. The subsequent figure shows the histogram of the p-values from the same experiment. As expected, we find that roughly 5% of the p-values are less than 0.05!
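A minimal sketch of that repeated experiment (the number of
repetitions and the seed here are arbitrary choices of mine):

set.seed(7)
pvals <- replicate(1000, t.test(rnorm(1000), rnorm(1000))$p.value)

mean(pvals < 0.05)   # close to 0.05 even though the two populations are identical
hist(pvals)          # roughly uniform on [0, 1] under the null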



How can you improve inference techniques?


What we need is a higher degree of confidence in our
inferences. However, since we are reducing a large amount of data to a
few outcomes, it becomes necessary to look at multiple things. As one
gains experience, it becomes easier to refine the key metrics and
focus on the right subset.

1. Most statistical packages (R, SAS) report a number of useful
statistics when estimating a model -- understand what each of them
means and look at them carefully. Look at the p-values, t-statistics,
AIC/BIC values, deviance values, reduction from the null deviance,
degrees of freedom, etc.

2. The model estimation is influenced by data, and in some cases more
so by some data points than others. Once you understand this, looking
for outliers and leverage (which shows the relative contribution of
the data points to the coefficient estimates) becomes the norm.

3. Confidence bands/intervals: These give you a much better feel for
the range of values that the outcome can take. For example, in an
election where one of the candidates is leading 55% to 45% and the
margin of error (the difference between the midpoint and one end of
the confidence interval) is 6%, the final outcome could still go the
other way.


4. Withhold some data from your training set and run the estimated
models on this withheld set to see how well they fit that data --
this is one of the most important tests you can run to ensure the
validity of your model. There are many creative ways of doing this:
cross-validation, the leave-one-out method, withholding specific data
points, etc.

5. Bring in the context of the problem at hand to see which
problem-related metrics you can use. In the retail sector, the most
important metric is the forecast accuracy for demand or sales. In the
clinical trial area it could be a measurement of related outcomes --
longevity, survival time, etc.

6. Find out the basic approaches advocated by academic or industry
practitioners (Google search is your friend!) -- see what works for
others and what doesn't. If nothing else, it will expedite your model
development and validation process.

7. Train models on simulated data. This is a great way to ensure that
there are no coding/interpretation issues. You should be able to
simulate fairly complex data and recover all the driving parameters
that went into the model. In fact, this can serve as a very good
automated test suite to validate model enhancements; a minimal sketch
follows below.
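Here is that idea sketched with a simple linear model (the
coefficients, noise level, and sample size are arbitrary choices for
illustration):

set.seed(1)
n  <- 5000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n, sd = 0.5)   # known "true" parameters

fit <- lm(y ~ x1 + x2)
coef(fit)      # should recover approximately (2, 1.5, -0.8)
confint(fit)   # the intervals should cover the true values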

Friday, September 3, 2010

Technology and Business

Why worry about technology?

Throughout our history, technology has been a big driver of productivity. Beyond that, it can sometimes drastically change the way we do things. For example, horses were the best mode of transportation mankind had for nearly 3000 years until the Model-T came along (or railroads powered by steam engines, if you look a little further back).

Horse-carriage based fire station, circa 1901, MA [source]


The latest evolution in technology is using computers and the internet to better conduct business. Thanks to these technologies, one can do many more things these days, and do them from almost anywhere (thanks to the now ubiquitous smartphones). This enormous computing power and universal accessibility allows us to do many things very effectively. The downside is that people can become addicted to their devices ("crackberry"), which can negatively impact their social relationships.

If one is afraid to pioneer, someone else will, and the results could
be catastrophic. Pioneering doesn't necessarily mean being the first
to adopt, but it does mean adapting technology quickly to one's own
unique circumstances. The technology landscape is littered with such
examples: Microsoft with Word, Excel and Internet Explorer; Apple with
their first GUI-based operating system (Xerox PARC was the precursor);
the online auctioneer eBay; Facebook.

Who are some of the innovators in leveraging technology?

When it comes to using pure software technology, the following companies stand out for what they have done and how they have revolutionized their respective businesses. (There are also a large number of startups innovating very rapidly, to be visited in a future post.)

Google: They pioneered the concept of quality over quantity and managed to keep costs extremely low by hiring the very best people and using commodity hardware. They were so successful at this that they forbade employees from revealing the number of computers in their datacenters -- currently rumored to be around one million. They ended up inventing new computing paradigms to handle internet scale (MapReduce for distributed computation and BigTable/GFS for robust distributed storage with fast retrieval -- Hadoop is the open source version of this technology stack). On the front end, they evolved a very rich user experience using AJAX (via JavaScript).

Googleplex, Mountain View, CA source

Paul Buchheit, an early Google engineer and the initial author of Gmail, has an interesting take on Google: On Gmail, AdSense...

Amazon: In many ways they are similar to Google and have been a pioneer in using service-oriented architecture (SOA). This allows multiple business logic/analytics components to be decoupled. For example, on a typical Amazon webpage different services serve out the product details, pricing information, availability/inventory information, and reviews. This allows them to tweak their website very quickly (granted, their UI is not the sleekest, but for that you need Apple-type focus). They are bringing the same level of rigor to their order fulfillment and distribution centers. Their technologies have spawned a new dimension in current computing -- cloud computing: you can store and manage data and run applications on on-demand hardware, cheaply.

Interesting article describing the culture and innovation at Amazon.
An interesting look at Amazon.com's history

Netflix: They have been using technology to improve the customer experience of enjoying movies. They started off by leveraging the postal service as a delivery mechanism, but the bar-coded envelopes with DCs located in all major areas were only possible due to technology advances. Their innovative use of movie recommendations, and now streaming directly to the computer/TV, ensures their technological and business dominance. Hollywood Video has closed and Blockbuster is on the verge of closing. Note that Blockbuster tried very hard to copy Netflix and was innovative in trying to use its physical store footprint. However, Netflix's advantage was too great! Netflix achieves this by cultivating a culture that works well for them. For example, their vacation policy is to take as much vacation as needed :-)

UPS: We seldom realize how much the commercial shippers have changed the landscape. Remember the time when you placed a mail order and then waited... and waited... to receive the shipment. Providing all the data online and through the web meant that users had visibility, but more importantly that data allowed them to plan their logistics and fleets in almost real time. They were able to look ahead and use predictive analytics in a very efficient way. They got so good that they can manage the supply chain for some other businesses!

Samsung: 20 years ago almost no one in the US knew about Samsung. However, they decided to make "great and relevant design" the centerpiece of their long-term strategy (starting in 1995) and have managed to become the leader in many areas (as of 2010, they have the largest market share in LCD televisions in the US, well ahead of Sony). Their pairing up with IDEO, the establishment of a large-scale design institute in Seoul, and the creation of many other satellite design facilities allowed them to execute on the vision. They also devised new ways of managing design, conception, prototyping, and production rollout. A pointer to this change is the number of design awards Samsung has won internationally, including the Industrial Design Excellence Awards (IDEA) -- more than 100 since 2000. [Ref: BusinessWeek; 12/6/2004, Issue 3911, p88-96]

They have managed to grow their global revenues six-fold in roughly the last 10 years, from $20B USD to over $120B USD; the actual trajectory is charted below:

SAMSUNG: Revenue and Operating Income Growth
[Graph produced in R]

  
How to best use technology?

Technology has to be understood and then utilized. Startups are incubators for many new ideas -- some will succeed but many will also fail. So, there's always something new to try out. We should always be on the lookout for new ideas and ways to test them quickly and cheaply (the fail-fast philosophy). Below are some thoughts:
  • Improved/faster access to information enabling faster decision making: Reporting and BI are supposed to deliver this, but we need to be thinking about newer ways of surfacing this information. Traditionally RDBMS/EDW with SQL have been the main tools, but with internet-scale data we should think about newer technologies and access mechanisms (Map-Reduce, noSQL options, distributed computing on commodity hardware etc.) We will look at this landscape more closely in a future posting.
  • Using technology for streamlining processes and increasing productivity
  • Using technology to learn about the website, customers preferences and making changes -- this is usually achieved by the web-log analysis or running various A/B tests
What to watch out for?

Technology promises many gains, but the field is rife with projects that don't deliver, are over budget, or are simply not efficient. In most cases this reflects a lack of understanding of how to use technology rather than a problem with the technology itself. In other words, we need to adopt some of the same mindset as the successful software companies. Here's one way of trying to do it:
  • Startup vs large company: what works in a startup may not apply to a larger organization for a variety of reasons (e.g. mission critical systems have different requirements than a web2.0 website). This is an important consideration in the choice of tools and technologies.
  • Prototype vs Production: it's not a question of either, but of doing both effectively. Prototyping allows you to lay out proofs of concept quickly, but then taking it to production quality takes a much bigger effort. The best analogy is an architect's rendering vs a structural engineer's detailed plans for a building. In order to make this effective, you need people who are comfortable doing both in an interchangeable way -- I think owning the process from concept, to prototype, to productization leads to much better outcomes. The reason why productization is a bigger effort is the additional requirements around performance, scalability, integration with other systems, and other requirements such as security/authentication.
  • Software development and testing methodology: As the saying goes, bugs and lines of code go hand in hand. Instead of fighting the bugs, it is much better to adopt methodologies that minimize the introduction of bugs (modular architecture; decoupling of components such as business logic, presentation layer, and data interfaces), methodologies that catch bugs quickly (iteration on full system design), and tools/techniques that help with this productivity (automated testing ensures that the software is tested as soon as it is written; bug tracking systems give very good visibility into the number of defects and the overall quality of the software).
  • Choice of software and hardware tools: There are many options available with varying degrees of freedom (underlying code you can change), varying total cost of ownership (TCO), various stacks (open source LAMP stack, java stack, MS based .NET or VB stack). There is no magic bullet and we have to be guided by able and experienced architects in the choice of these tools. From the business side, we should help the process by specifying the acceptable behavior based upon business needs -- e.g. report should load in under 5 seconds, it should support at least 50 simultaneous business users, the peak load occurs on Sun night and is X units of jobs/work started
  • People, people, people: The quality of people doing work and leading the effort is by far the most important component. If you have the best people and _trust_ them, they will tend to deliver above expectations. Some of the processes and reducing interruption from non-development parties could aid this process too. I am a big believer in individual empowerment, good incentives, and communication to generate superior results. I think that spending time in understanding individual strengths, project needs (for matching the right people) and structuring the team are very valuable in growing a productive team.

Wednesday, August 18, 2010

What is the Current State of Retail?

Why is retailing important?

Retailing has been a big part of economies for a long time now. It has evolved considerably over the last 2000 years, from simple services, to organized marketplaces, to sophisticated supply chains and logistics (the silk route, ocean voyages, etc). Apart from the historical context, most modern societies rely upon retailing for survival: we do not produce everything we need ourselves and have to rely upon retailing to satisfy needs ranging from food and clothing to banking.

Retailing is pretty big, making up somewhere between 15% and 30% of total GDP in the US (and a comparable share in other first-world countries). The higher-end estimate includes traditional retailers, food and food service providers, and motor vehicle and parts providers. The growth of each of the major retail components, along with the economy, is displayed below.



Retail and GDP growth in US, copyright




How has retailing evolved over the ages?

The earliest markets (or trades) developed based upon the barter system and then evolved to using silver or gold rings, bars, and coins. The introduction of money changed markets and trading forever, leading to new institutions and ways of managing money (the banking system, insurance, shares and public markets, and now Wall Street).

They were frequently located at convenient locations such as the town centers or near the coast to allow for easy and quick service.

Many of these older characteristics do persist in various markets in the world today.


The department stores as we know them originated in the 19th century. The earliest references I could find are for Le Bon Marché in Paris and the Bainbridge department store in Newcastle, England (now John Lewis Newcastle), both opened in 1838.


Traditional Fish Market source



Retailing appears to be a mundane area, but it is full of innovation and evolution. As a result, modern day retailing presents a very different look and behavior in terms of cost structure, service model, and assortment breadth. In terms of more recent evolution, retailing was long dominated by independent stores and merchants, at most housed in a common facility to aid trade. This gave rise to specialization in terms of producers, transporters, wholesalers/traders, and consumer outlets.


In the last 50 years or so the landscape underwent more transformations and we started seeing chains of retail outlets spread out over large geographical areas. Some examples have become iconic -- Walmart, Kroger, etc.

Over time, these chains started becoming specialized and focused on certain areas (now referred to as verticals) -- grocery, drugstores, mass merchandisers, apparel, shoes, furniture, kitchen, sports and outdoors, children and toys, electronics and entertainment etc



How has internet influenced retailing?

Finally, with the internet explosion, we see a new breed of retailers -- e-tailers. This gave a new opportunity to independent retailers because of the low barriers to entry (setting up a website is a lot cheaper than a physical store) and the advantage of using technology for improved productivity. Slowly, most brick and mortar retailers are catching up on this front and have a significant presence on the web. However, only a tiny share of their total business comes from e-tailing; consequently, they are not the innovators.


Amazon has been the trailblazer in this area, starting out really early (1996?) by selling books. Their revenues for 2009, at $24B, represent a very high and sustained growth rate since their founding. There are many other fine examples in this area:

Zappos.com: The pioneering shoe retailer acquired by Amazon.com in 2009 for close to $1B.

Overstock.com: Operating as a big clearing house for the surpluses through the internet channel

Newegg/TigerDirect: Started out as specialty outlets that have now grown considerably to provide a wide range of products and services within the electronics and appliances sector


The end result of this evolution means that retailing landscape is extremely competitive. Operating margins for retailers are usually low (typically around 5%) compared to other sectors such as finance and software (25% or higher).

Operating Margins of Retailers vs Others. copyright


Stiff competition means that relatively small advantages determine growth versus failure. During each business cycle a number of retail chains have collapsed, to be replaced in many cases by new entrants. The 800-pound gorilla dominating the landscape is, of course, Walmart. Almost every retailer needs to compete against Walmart for survival, which leads to very interesting strategies by the different retailers.


Walmart usually comes to dominate each segment it enters, wiping out much of the competition. That happened to the music stores (Tower Records, etc.) when Walmart and other stores started carrying music. Best Buy and Walmart played a big part in the demise of Circuit City.

To their credit, WalMart has revolutionized the global supply chain and distribution network. I consider that as another milestone in the retailing evolution.

How are social networking, web 2.0 influencing retailing?

With the continued evolution of web technologies, a new type of channel and product assortment is being introduced: electronic goods and virtual items. Apple iTunes represents the biggest success story, with the sale of music and custom applications. Internet games such as Farmville, virtual-life setups such as Second Life, and video game communities connected by the internet (Xbox, Wii, etc.) allow people to buy completely virtual goods with real money (Farmville revenues are in excess of $100MM).

Finally, the recent popularity of social networking is spawning a new breed of enablers -- Woot.com, groupon.com, gilt.com, etc. -- that allow for the experience of collective and social buying. We will look at this area in more detail in a future post.


Next, we will explore each of the retailing verticals in detail.

Saturday, August 14, 2010

Why this blog?

I have a keen interest in combining aspects of retailing, applying analytics and technology to business problems, and organizing people for the best long-term outcomes.


Why Retail?

Retailing is ubiquitous. Retailing is huge. Almost everyone can relate to the retailing experience in one form or another.

Why Analytics and Technology?

The 1990s witnessed a revolution in many areas due to computer technology (ERP systems, supply-chain systems, financial planning systems, etc). Then, with the advent of the internet, it went through another revolution, giving rise to entities such as the Amazon e-tailer and Apple iTunes.

At the same time, analytics was witnessing tremendous growth in applications -- from airline reservation and pricing systems to the auction-based analytics in the Google AdSense/AdWords system. We are living in an era where technology and analytics can be combined to drive better results.


Why People?

At the end of the day, what makes any enterprise successful is the people. Some groups of people (and companies) are more successful than others -- I want to understand this so that we can build or be part of successful teams.


What's different about this blog?

The goal is to present posts that reflect thoughts and supporting evidence derived from applying analytics and technology. I also hope to have invited posts that reflect a similar spirit.

What's next?

We will start off with an understanding of the retailing landscape and its brief history through time. Then, we will look at the technology landscape in retailing.

Like it? Hate it? Need Changes?

Please leave a comment. How else can I improve it :-)