Today, brands collect vast amounts of data about their customers: browsing behavior on their websites, in-store transactions, social media interactions, second-party data, ad response data, in-app behavior, and so on.

Assuming all of this data is available to a brand in a single datastore, how could it be used to predict the value of a customer?

In this post, we will take a hypothetical training dataset with a fairly large number of attributes representing customer activity, interactions and transactions with a brand. The dataset also contains a known customer lifetime value (CLTV) for each record.

Using this dataset, we will train a model that can predict the value of a new customer from the same kind of collected data.

The whole process can be split into 4 main steps:

  • Understanding the data
  • Preparing the data
  • Building the learner
  • Making the prediction

I plan to write this post over the span of a few weeks. Subscribe to get notified of updates to this post.

Understanding the Data

We will start with importing the required packages and defining some commonly used functions in this example.

In [127]:
#Import libraries
import pandas

from IPython.display import display, HTML
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import plotly.offline as offline
import plotly.graph_objs as go
import common_func as mymodule 

offline.init_notebook_mode()

#Initialize your own modules. Comment this out if you have no modules of your own
display(mymodule.init())

#Define presentation attributes
TEXT_FONT='Open Sans'
CODE_FONT='monospace'
sns.set_style({'font.family': CODE_FONT})
COLOR = '#2980B9'
SECONDARY_COLOR = '#111'
NUMERICS = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']



#Define utility functions
def display_as_table(t):
    return pandas.DataFrame(t)

Load the training dataset into a dataframe.

In [128]:
dataset = pandas.read_csv("../input/train.csv") 

Load the test dataset into a dataframe, keep a copy of its id column, and drop the id column from both datasets since we do not want it to be part of training or prediction.

In [129]:
dataset_test = pandas.read_csv("../input/test.csv")
ID = dataset_test['id']
dataset.drop('id',axis=1,inplace=True)
dataset_test.drop('id',axis=1,inplace=True)

#Set the target column: the customer lifetime value that we will be predicting
TARGET = 'cltv'

Take a look at some sample data by transposing the view so each attribute is shown as a row below.

In [130]:
dataset.head(5).transpose()
Out[130]:
0 1 2 3 4
cat1 A A A B A
cat2 B B B B B
cat3 A A A A A
cat4 B A A B B
cat5 A A B A A
cat6 A A A A A
cat7 A A A A A
cat8 A A A A A
cat9 B B B B B
cat10 A B B A B
cat11 B A B A A
cat12 A A B A B
cat13 A A B A A
cat14 A A A A A
cat15 A A A A A
cat16 A A A A A
cat17 A A A A A
cat18 A A A A A
cat19 A A A A A
cat20 A A A A A
cat21 A A A A A
cat22 A A A A A
cat23 B A A B B
cat24 A A A A A
cat25 A A A A A
cat26 A A A A A
cat27 A A A A A
cat28 A A A A A
cat29 A A A A A
cat30 A A A A A
... ... ... ... ... ...
cat102 A A A A A
cat103 A A B A A
cat104 I E E E D
cat105 E E F E E
cat106 G I H I K
cat107 J K F K G
cat108 G K A K B
cat109 BU BI AB BI H
cat110 BC CQ DK CS C
cat111 C A A C C
cat112 AS AV C N Y
cat113 S BM AF AE BM
cat114 A A A A A
cat115 O O I O K
cat116 LB DP GK DJ CK
cont1 0.7263 0.330514 0.261841 0.321594 0.273204
cont2 0.245921 0.737068 0.358319 0.555782 0.15999
cont3 0.187583 0.592681 0.484196 0.527991 0.527991
cont4 0.789639 0.614134 0.236924 0.373816 0.473202
cont5 0.310061 0.885834 0.397069 0.422268 0.704268
cont6 0.718367 0.438917 0.289648 0.440945 0.178193
cont7 0.33506 0.436585 0.315545 0.391128 0.247408
cont8 0.3026 0.60087 0.2732 0.31796 0.24564
cont9 0.67135 0.35127 0.26076 0.32128 0.22089
cont10 0.8351 0.43919 0.32446 0.44467 0.2123
cont11 0.569745 0.338312 0.381398 0.327915 0.204687
cont12 0.594646 0.366307 0.373424 0.32157 0.202213
cont13 0.822493 0.611431 0.195709 0.605077 0.246011
cont14 0.714843 0.304496 0.774425 0.602642 0.432606
cltv 2213.18 1283.6 3005.09 939.85 2763.85

131 rows × 5 columns

Display the number of records and columns in this dataset.

In [131]:
dataset.shape
Out[131]:
(188318, 131)

Looking at the statistics of this dataset, we find that the count for each continuous variable is 188318, the same as the total number of records seen above. This means no values are missing, which is good.

In [132]:
display_as_table(dataset.describe().loc['count'])
Out[132]:
count
cont1 188318.0
cont2 188318.0
cont3 188318.0
cont4 188318.0
cont5 188318.0
cont6 188318.0
cont7 188318.0
cont8 188318.0
cont9 188318.0
cont10 188318.0
cont11 188318.0
cont12 188318.0
cont13 188318.0
cont14 188318.0
cltv 188318.0
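The describe() output above only covers the numeric columns. As a quick extra check (a small sketch, not one of the original cells), we could also confirm that none of the categorical columns have missing values:

#Count missing values in every column, categorical and continuous;
#an empty result means nothing is missing
missing_counts = dataset.isnull().sum()
display_as_table(missing_counts[missing_counts > 0])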

Continuous Attributes

Let us store the names of the continuous variables in a list, since we will want to examine this data further.

In [133]:
dataset_cont = dataset.select_dtypes(['float64','int64'])
cont_cols = dataset_cont.columns.tolist()

Now let's look at the skewness of the continuous variables. The cltv column shows the highest skewness, at ~3.8.

In [134]:
display_as_table(dataset_cont.skew())
Out[134]:
0
cont1 0.516424
cont2 -0.310941
cont3 -0.010002
cont4 0.416096
cont5 0.681622
cont6 0.461214
cont7 0.826053
cont8 0.676634
cont9 1.072429
cont10 0.355001
cont11 0.280821
cont12 0.291992
cont13 0.380742
cont14 0.248674
cltv 3.794958

We then plot the distribution of each continuous variable as a violin plot, to see if any patterns emerge.

In [135]:
#Plot each continuous variable (excluding the target, which is the last column)
for i in range(len(cont_cols)-1):
    sns.violinplot(y=cont_cols[i], data=dataset_cont, color=COLOR)
    plt.show()

The following patterns emerge:

  • cont1 has many values near 0.5
  • cont2 has a greater density around certain values which is causing several spikes.
  • cont3 shows a high density of values near the median. The remaining values seem to be concentrated just below the first quartile and just above the third quartile.
  • cont5 has a very high density of values around 0.3
  • cont6 and cont7 both have the highest density of values around the first quartile, with cont7 having a much higher density around the first quartile than cont6
  • cont8 is shaped similar to cont5, except that the values seem more variably dispersed than cont5
  • cont9 and cont1 appear to be identical in distribution, with cont9 having many values near 0.4
  • cont11 and cont12 seem to be almost identical in their distribution
  • cont13 has values concentrated around the first quartile and third quartile, with the first quartile area having a larger and more uniform density distribution
  • cont14 has an almost identical pattern above and below the median

When we plot the target variable 'cltv' we see that the values range from 0 to over 120,000.

In [136]:
sns.violinplot(y=cont_cols[14], data=dataset_cont,color=COLOR)
plt.show()  

Since these values are several orders of magnitude larger than the other continuous variables, we will need to normalize them so that modeling techniques applied to this dataset work better. All values of the cltv variable are positive, so we can use a logarithmic transformation.

In [137]:
#Apply log(1+x) to the cltv column, replacing its values in place
dataset[TARGET] = np.log1p(dataset[TARGET])
sns.violinplot(data=dataset,y=TARGET,color=COLOR)  
plt.show()
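As a quick check (not part of the original cells), we can recompute the skewness of the transformed target; it should now be far closer to 0 than the ~3.8 we saw earlier:

#Skewness of the log-transformed cltv column
display_as_table(dataset[[TARGET]].skew())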

We can see that the transformed cltv is now within a similar order of magnitude as the other attributes. Next, let us quantify some of the correlations we saw between the attributes.

In [138]:
data_corr = dataset_cont.corr()

This produces a matrix of correlations between each pair of attributes.

In [139]:
display(data_corr)
cont1 cont2 cont3 cont4 cont5 cont6 cont7 cont8 cont9 cont10 cont11 cont12 cont13 cont14 cltv
cont1 1.000000 -0.085180 -0.445431 0.367549 -0.025230 0.758315 0.367384 0.361163 0.929912 0.808551 0.596090 0.614225 0.534850 0.056688 -0.010237
cont2 -0.085180 1.000000 0.455861 0.038693 0.191427 0.015864 0.048187 0.137468 -0.032729 0.063526 0.116824 0.106250 0.023335 -0.045584 0.141528
cont3 -0.445431 0.455861 1.000000 -0.341633 0.089417 -0.349278 0.097516 -0.185432 -0.417054 -0.325562 0.025271 0.006111 -0.418203 -0.039592 0.111053
cont4 0.367549 0.038693 -0.341633 1.000000 0.163748 0.220932 -0.115064 0.528740 0.328961 0.283294 0.120927 0.130453 0.179342 0.017445 -0.035831
cont5 -0.025230 0.191427 0.089417 0.163748 1.000000 -0.149810 -0.249344 0.009015 -0.088202 -0.064967 -0.151548 -0.148217 -0.082915 -0.021638 -0.011355
cont6 0.758315 0.015864 -0.349278 0.220932 -0.149810 1.000000 0.658918 0.437437 0.797544 0.883351 0.773745 0.785144 0.815091 0.042178 0.040967
cont7 0.367384 0.048187 0.097516 -0.115064 -0.249344 0.658918 1.000000 0.142042 0.384343 0.492621 0.747108 0.742712 0.288395 0.022286 0.119799
cont8 0.361163 0.137468 -0.185432 0.528740 0.009015 0.437437 0.142042 1.000000 0.452658 0.336588 0.302381 0.315904 0.476402 0.043539 0.030508
cont9 0.929912 -0.032729 -0.417054 0.328961 -0.088202 0.797544 0.384343 0.452658 1.000000 0.785697 0.608000 0.626656 0.642028 0.074154 0.014456
cont10 0.808551 0.063526 -0.325562 0.283294 -0.064967 0.883351 0.492621 0.336588 0.785697 1.000000 0.702896 0.713812 0.707876 0.041808 0.020236
cont11 0.596090 0.116824 0.025271 0.120927 -0.151548 0.773745 0.747108 0.302381 0.608000 0.702896 1.000000 0.994384 0.466247 0.047293 0.099806
cont12 0.614225 0.106250 0.006111 0.130453 -0.148217 0.785144 0.742712 0.315904 0.626656 0.713812 0.994384 1.000000 0.478677 0.050267 0.098724
cont13 0.534850 0.023335 -0.418203 0.179342 -0.082915 0.815091 0.288395 0.476402 0.642028 0.707876 0.466247 0.478677 1.000000 0.047543 -0.004022
cont14 0.056688 -0.045584 -0.039592 0.017445 -0.021638 0.042178 0.022286 0.043539 0.074154 0.041808 0.047293 0.050267 0.047543 1.000000 0.019298
cltv -0.010237 0.141528 0.111053 -0.035831 -0.011355 0.040967 0.119799 0.030508 0.014456 0.020236 0.099806 0.098724 -0.004022 0.019298 1.000000

In the above matrix we want to find which attribute pairs are highly correlated, i.e. have a correlation coefficient greater than a chosen threshold. The coefficient ranges from 0.0 to 1.0; the closer it is to 1.0, the stronger the correlation.

We identify all attribute pairs above the threshold, and exclude coefficients equal to 1 since those are just attributes correlated with themselves.

In [140]:
THRESHOLD = 0.85
data_correlated = data_corr[(data_corr > THRESHOLD) & (data_corr < 1) ]

Let's also drop all rows (axis=0) and columns (axis=1) that don't have any correlated data, to make the information more readable. This produces a matrix of the attribute pairs that are highly correlated.

In [141]:
data_correlated.dropna(axis=0,how='all').dropna(axis=1,how='all').fillna('')
Out[141]:
          cont1     cont6     cont9     cont10    cont11    cont12
cont1                         0.929912
cont6                                   0.883351
cont9     0.929912
cont10              0.883351
cont11                                            0.994384
cont12                                  0.994384

This matches some of our findings from the violin plots. For example, we had observed that cont11 and cont12 look almost identical, and indeed their correlation coefficient is 0.994384, which is very high.

Attributes that are highly correlated give us an opportunity to reduce the number of features in our learning model: correlated features influence the outcome in a similar manner, so we can use one instead of both. Based on the table above, we can safely remove one attribute from each pair: cont11 or cont12, cont1 or cont9, and cont6 or cont10. A quick sketch of this is shown below.
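A minimal sketch of that reduction (illustrative only; the remaining cells in this post keep all columns, and which member of each pair we keep is an arbitrary choice):

#Drop one attribute from each highly correlated pair
dataset_reduced = dataset.drop(['cont1', 'cont6', 'cont11'], axis=1)
print(dataset_reduced.shape)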

Categorical Attributes

Let us find the possible categorical columns by plotting each non-numeric column against its number of unique values.

In [142]:
#Define a dataframe that will hold the plot values

df_cat = pandas.DataFrame(columns=['attribute',
                            'num_unique_values','unique_values'])
df_cat['num_unique_values'] = df_cat['num_unique_values'].astype(int)

#populate the dataframe with the non numeric attributes and 
#the number of unique values 
row = 0
df_nonnumeric = dataset.select_dtypes(exclude=NUMERICS)
#Get the list of possible categorical columns
categorical_cols = df_nonnumeric.columns

for col in categorical_cols:
    df_cat.loc[row] = [col, len(dataset[col].unique()),
                       dataset[col].unique()]
    row = row + 1
        
layout = go.Layout(
    autosize=False,    
    height=2000,  
    font=dict(family=CODE_FONT),
    xaxis=dict(
        title='Number of unique values',
        titlefont=dict(     
            family=TEXT_FONT,
            size=16,
            color=COLOR
        ),        
    ),
    yaxis=dict(
        title='Attribute',
        titlefont=dict(  
            family=TEXT_FONT,         
            size=16,
            color=COLOR 
        ),             
    ),    
    annotations=[
        dict(x=xi,y=yi,
             text=str(xi),
             xanchor='left',
             yanchor='center',
             showarrow=False,
        ) for xi, yi in zip(df_cat['num_unique_values'], 
                df_cat['attribute'])]
)
 
data = [
    go.Bar(
        y=df_cat['attribute'], 
        x=df_cat['num_unique_values'],
        orientation = 'h',
        marker = dict(   
        color=COLOR,            
        line = dict(
            color = SECONDARY_COLOR,
            width = 1)
    )
    )
]

fig = go.Figure(data=data,layout=layout)
offline.iplot(fig,show_link=False)

We can see here that cat1 through cat88 have fewer than 5 labels each, cat89 through cat98 have between 5 and 8, and cat99 through cat116 have many more. Here is what these labels look like.

In [143]:
df_cat
Out[143]:
attribute num_unique_values unique_values
0 cat1 2 [A, B]
1 cat2 2 [B, A]
2 cat3 2 [A, B]
3 cat4 2 [B, A]
4 cat5 2 [A, B]
5 cat6 2 [A, B]
6 cat7 2 [A, B]
7 cat8 2 [A, B]
8 cat9 2 [B, A]
9 cat10 2 [A, B]
10 cat11 2 [B, A]
11 cat12 2 [A, B]
12 cat13 2 [A, B]
13 cat14 2 [A, B]
14 cat15 2 [A, B]
15 cat16 2 [A, B]
16 cat17 2 [A, B]
17 cat18 2 [A, B]
18 cat19 2 [A, B]
19 cat20 2 [A, B]
20 cat21 2 [A, B]
21 cat22 2 [A, B]
22 cat23 2 [B, A]
23 cat24 2 [A, B]
24 cat25 2 [A, B]
25 cat26 2 [A, B]
26 cat27 2 [A, B]
27 cat28 2 [A, B]
28 cat29 2 [A, B]
29 cat30 2 [A, B]
... ... ... ...
86 cat87 4 [B, C, D, A]
87 cat88 4 [A, D, E, B]
88 cat89 8 [A, B, C, E, D, H, I, G]
89 cat90 7 [A, B, C, D, F, E, G]
90 cat91 8 [A, B, G, C, D, E, F, H]
91 cat92 7 [A, H, B, C, D, I, F]
92 cat93 5 [D, C, A, B, E]
93 cat94 7 [B, D, C, A, F, E, G]
94 cat95 5 [C, D, E, A, B]
95 cat96 8 [E, D, G, B, F, A, I, C]
96 cat97 7 [A, E, C, G, D, F, B]
97 cat98 5 [C, D, A, E, B]
98 cat99 16 [T, D, P, S, R, K, E, F, N, J, C, M, H, G, I, O]
99 cat100 15 [B, L, I, F, J, H, C, M, A, G, O, N, K, D, E]
100 cat101 19 [G, F, O, D, J, A, C, Q, M, I, L, R, S, E, N, ...
101 cat102 9 [A, C, B, D, G, E, F, H, J]
102 cat103 13 [A, B, C, F, E, D, G, H, I, L, K, J, N]
103 cat104 17 [I, E, D, K, H, F, G, P, C, J, L, M, N, O, B, ...
104 cat105 20 [E, F, H, G, I, D, J, K, M, C, A, L, N, P, T, ...
105 cat106 17 [G, I, H, K, F, J, E, L, M, D, A, C, N, O, R, ...
106 cat107 20 [J, K, F, G, I, M, H, L, E, D, O, C, N, A, Q, ...
107 cat108 11 [G, K, A, B, D, I, F, H, E, C, J]
108 cat109 84 [BU, BI, AB, H, K, CD, BQ, M, G, BL, L, AL, N,...
109 cat110 131 [BC, CQ, DK, CS, C, EB, DW, AM, AI, EG, CL, BS...
110 cat111 16 [C, A, G, E, I, M, W, S, K, O, Q, U, F, B, Y, D]
111 cat112 51 [AS, AV, C, N, Y, J, AH, K, U, E, AK, AI, AE, ...
112 cat113 61 [S, BM, AF, AE, Y, AX, H, K, L, A, J, AK, N, M...
113 cat114 19 [A, J, E, C, F, L, N, I, R, U, O, B, Q, V, D, ...
114 cat115 23 [O, I, K, P, Q, L, J, R, N, M, H, G, F, A, S, ...
115 cat116 326 [LB, DP, GK, DJ, CK, LO, IE, LY, GS, HK, DC, M...

116 rows × 3 columns

Preparing the Data

In order to apply ML algorithms, we will need to convert the categorical attribute labels to numeric data. We will look at two ways to do this. The first is one-hot encoding with scikit-learn's LabelEncoder and OneHotEncoder.

In [144]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

labels={}
cats = []

for col in categorical_cols:
    #Take the union of the labels seen in the training and test sets
    train_labels = dataset[col].unique()
    test_labels = dataset_test[col].unique()
    labels[col] = list(set(train_labels) | set(test_labels))

for col in categorical_cols:
    label_encoder = LabelEncoder()
    label_encoder.fit(labels[col])
    feature = label_encoder.transform(dataset[col])
    num_rows = dataset.shape[0]
    #Reshape the 1D label array into a column vector of shape (num_rows, 1)
    #e.g. if the feature array is [0,0,1], it is reshaped to [[0],[0],[1]]
    feature = feature.reshape(num_rows, 1)
    onehot_encoder = OneHotEncoder(sparse=False, n_values=len(labels[col]))
    feature = onehot_encoder.fit_transform(feature)
    cats.append(feature)

#make a 2D array
encoded_cats = np.column_stack(cats)
print(encoded_cats.shape)

dataset_encoded = np.concatenate(
    (encoded_cats,dataset_cont),axis=1)

print(dataset_encoded.shape)
(188318, 1176)
(188318, 1191)

The second approach is to use pandas get_dummies to convert the labels to dummy indicator columns.

In [145]:
dataset_encoded_df = pandas.concat([pandas.get_dummies(df_nonnumeric),
                            dataset_cont],axis=1)
dataset_encoded_df.shape
Out[145]:
(188318, 1154)
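One caveat with this approach (not handled in the original cell): get_dummies only builds columns from labels that occur in the frame it is given, so a label appearing only in the test file would have no corresponding column here. A minimal sketch of aligning the training and test dummy frames with pandas align:

#Encode the test categoricals the same way, then align columns with the
#training dummies; any label missing from one frame gets a column of zeros
test_dummies = pandas.get_dummies(dataset_test.select_dtypes(exclude=NUMERICS))
train_dummies = pandas.get_dummies(df_nonnumeric)
train_dummies, test_dummies = train_dummies.align(test_dummies, join='outer',
                                                  axis=1, fill_value=0)
print(train_dummies.shape, test_dummies.shape)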

We have now created a dataset with all numeric values by encoding the categorical attributes. Note that this frame has 1154 columns versus 1191 from the first approach, for the reason just described: get_dummies built its columns only from the training labels, while the LabelEncoder approach used the union of training and test labels. Next, we will split the dataset into training and testing sets.

In [146]:
# Generate a random sample boolean array
msk = np.random.rand(len(dataset_encoded_df)) < 0.8
# Apply the mask to the dataframe so it returns the "True" rows
train = dataset_encoded_df[msk]
# Apply the mask to the dataframe so it returns the "False" rows
test = dataset_encoded_df[~msk]

print ("Total population: ", len(dataset_encoded_df))
print("Records in Training set: ", len(train))
print("Records in Test set: ", len(test))
Total population:  188318
Records in Training set:  150974
Records in Test set:  37344
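The same split could also be done with scikit-learn's train_test_split, which makes it easy to fix a random seed; a minimal sketch (shown only as an alternative, the mask-based split above is what we use from here on):

from sklearn.model_selection import train_test_split

#Equivalent 80/20 split with a fixed seed so the result is reproducible
train_alt, test_alt = train_test_split(dataset_encoded_df, test_size=0.2,
                                       random_state=42)
print("Records in Training set: ", len(train_alt))
print("Records in Test set: ", len(test_alt))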

Building the Model

Linear Regression

We fit an ordinary least squares model on the training split: the cltv column becomes our y, and all other feature columns become our X.

In [152]:
import statsmodels.api as sm

feature_cols = [col for col in dataset_encoded_df.columns 
    if col not in [TARGET]]

#Fit an ordinary least squares model on the training split:
#y is the log-transformed cltv, X is every other encoded feature column
X = sm.add_constant(train[feature_cols])
y = train[TARGET]
res = sm.OLS(y, X).fit()
print(res.summary())
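With the model fit, a rough way to sanity check it (a sketch of the "Making the prediction" step, which this post will cover in more detail later) is to predict on the held-out rows and compare against the actual values on the original cltv scale, remembering to invert the earlier log1p transform with expm1:

#Build the test design matrix the same way as the training one
X_holdout = sm.add_constant(test[feature_cols])

#Predict, invert the log1p transform, and compute mean absolute error
predicted_cltv = np.expm1(res.predict(X_holdout))
actual_cltv = np.expm1(test[TARGET])
print("Mean absolute error: ", np.mean(np.abs(predicted_cltv - actual_cltv)))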