Today, brands collect vast amounts of data about their customers: browsing behavior on their websites, in-store transactions, social media interactions, second-party data, ad response data, in-app behavior, and so on.

Assuming all of this data is available to a brand in a single datastore, how could it be used to predict the value of a customer?

In this post, we will take a hypothetical training dataset with a fairly large number of attributes representing customer activity, interactions and transactions with a brand. The dataset also contains a known customer lifetime value (CLTV) for each record.

Using this dataset, we will train a model that can predict the value of a new customer from the same kind of collected data.

The whole process can be split into 4 main steps:

  • Understanding the data
  • Preparing the data
  • Building the learner
  • Making the prediction

I plan to write this post over the span of a few weeks. Subscribe to get notified of updates to this post.

Understanding the Data

We will start with importing the required packages and defining some commonly used functions in this example.

In [127]:
#Import libraries
import pandas

from IPython.display import display, HTML
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import plotly.offline as offline
import plotly.graph_objs as go
import common_func as mymodule 

offline.init_notebook_mode()

#Initialize your own modules. Comment this out if you have no modules of your own
display(mymodule.init())

#Define presentation attributes
TEXT_FONT='Open Sans'
CODE_FONT='monospace'
sns.set_style({'font.family': CODE_FONT})
COLOR = '#2980B9'
SECONDARY_COLOR = '#111'
NUMERICS = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']



#Define utility functions
def display_as_table(t):
    return pandas.DataFrame(t)

Load the training dataset into a dataframe.

In [128]:
dataset = pandas.read_csv("../input/train.csv") 

Load the test dataset into a dataframe, keep a copy of its id column, and drop the id column from both datasets since we do not want it to be part of training or prediction.

In [129]:
dataset_test = pandas.read_csv("../input/test.csv")
ID = dataset_test['id']
dataset.drop('id',axis=1,inplace=True)
dataset_test.drop('id',axis=1,inplace=True)

#Set the target column: the customer lifetime value that we will be predicting
TARGET = 'cltv'

Take a look at some sample data by transposing the view so each attribute is shown as a row below.

In [130]:
dataset.head(5).transpose()
Out[130]:
0 1 2 3 4
cat1 A A A B A
cat2 B B B B B
cat3 A A A A A
cat4 B A A B B
cat5 A A B A A
cat6 A A A A A
cat7 A A A A A
cat8 A A A A A
cat9 B B B B B
cat10 A B B A B
cat11 B A B A A
cat12 A A B A B
cat13 A A B A A
cat14 A A A A A
cat15 A A A A A
cat16 A A A A A
cat17 A A A A A
cat18 A A A A A
cat19 A A A A A
cat20 A A A A A
cat21 A A A A A
cat22 A A A A A
cat23 B A A B B
cat24 A A A A A
cat25 A A A A A
cat26 A A A A A
cat27 A A A A A
cat28 A A A A A
cat29 A A A A A
cat30 A A A A A
... ... ... ... ... ...
cat102 A A A A A
cat103 A A B A A
cat104 I E E E D
cat105 E E F E E
cat106 G I H I K
cat107 J K F K G
cat108 G K A K B
cat109 BU BI AB BI H
cat110 BC CQ DK CS C
cat111 C A A C C
cat112 AS AV C N Y
cat113 S BM AF AE BM
cat114 A A A A A
cat115 O O I O K
cat116 LB DP GK DJ CK
cont1 0.7263 0.330514 0.261841 0.321594 0.273204
cont2 0.245921 0.737068 0.358319 0.555782 0.15999
cont3 0.187583 0.592681 0.484196 0.527991 0.527991
cont4 0.789639 0.614134 0.236924 0.373816 0.473202
cont5 0.310061 0.885834 0.397069 0.422268 0.704268
cont6 0.718367 0.438917 0.289648 0.440945 0.178193
cont7 0.33506 0.436585 0.315545 0.391128 0.247408
cont8 0.3026 0.60087 0.2732 0.31796 0.24564
cont9 0.67135 0.35127 0.26076 0.32128 0.22089
cont10 0.8351 0.43919 0.32446 0.44467 0.2123
cont11 0.569745 0.338312 0.381398 0.327915 0.204687
cont12 0.594646 0.366307 0.373424 0.32157 0.202213
cont13 0.822493 0.611431 0.195709 0.605077 0.246011
cont14 0.714843 0.304496 0.774425 0.602642 0.432606
cltv 2213.18 1283.6 3005.09 939.85 2763.85

131 rows × 5 columns

Display the number of records and columns in this dataset.

In [131]:
dataset.shape
Out[131]:
(188318, 131)

Looking at the statistics of this dataset, we find that the count for each continuous variable is 188318, the same as the total number of records seen above. This means no values are missing, which is good.

In [132]:
display_as_table(dataset.describe().loc['count'])
Out[132]:
count
cont1 188318.0
cont2 188318.0
cont3 188318.0
cont4 188318.0
cont5 188318.0
cont6 188318.0
cont7 188318.0
cont8 188318.0
cont9 188318.0
cont10 188318.0
cont11 188318.0
cont12 188318.0
cont13 188318.0
cont14 188318.0
cltv 188318.0
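The describe() output above only covers the numeric columns. As a quick extra check (a small sketch, not one of the original cells), we could also confirm that none of the categorical columns have missing values:

#Count missing values in every column, categorical and continuous;
#an empty result means nothing is missing
missing_counts = dataset.isnull().sum()
display_as_table(missing_counts[missing_counts > 0])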

Continuous Attributes

Let us store the names of the continuous variables in a list, since we will want to examine this data further.

In [133]:
dataset_cont = dataset.select_dtypes(['float64','int64'])
cont_cols = dataset_cont.columns.tolist()

Now let's look at the skewness of the continuous variables. The cltv column shows the highest skewness, at ~3.8.

In [134]:
display_as_table(dataset_cont.skew())
Out[134]:
0
cont1 0.516424
cont2 -0.310941
cont3 -0.010002
cont4 0.416096
cont5 0.681622
cont6 0.461214
cont7 0.826053
cont8 0.676634
cont9 1.072429
cont10 0.355001
cont11 0.280821
cont12 0.291992
cont13 0.380742
cont14 0.248674
cltv 3.794958

We then plot the distribution of each continuous variable as a violin plot, to see if any patterns emerge.

In [135]:
#Plot each continuous variable (excluding the target, which is the last column)
for i in range(len(cont_cols)-1):
    sns.violinplot(y=cont_cols[i], data=dataset_cont, color=COLOR)
    plt.show()

The following patterns emerge:

  • cont1 has many values near 0.5
  • cont2 has a greater density around certain values which is causing several spikes.
  • cont3 shows a high density of values near the median. The remaining values seem to be concentrated just below the first quartile and just above the third quartile.
  • cont5 has a very high density of values around 0.3
  • cont6 and cont7 both have the highest density of values around the first quartile, with cont7 having a much higher density around the first quartile than cont6
  • cont8 is shaped similar to cont5, except that the values seem more variably dispersed than cont5
  • cont9 and cont1 appear to be identical in distribution, with cont9 having many values near 0.4
  • cont11 and cont12 seem to be almost identical in their distribution
  • cont13 has values concentrated around the first quartile and third quartile, with the first quartile area having a larger and more uniform density distribution
  • cont14 has an almost identical pattern above and below the median

When we plot the target variable 'cltv' we see that the values range from 0 to over 120,000.

In [136]:
sns.violinplot(y=cont_cols[14], data=dataset_cont,color=COLOR)
plt.show()  

Since these values are several orders of magnitude larger than the other continuous variables, we will need to normalize them so that modeling techniques applied to this dataset work better. All values of the cltv variable are positive, so we can use a logarithmic transformation.

In [137]:
#Apply log(1+x) to the cltv column, replacing its values in place
dataset[TARGET] = np.log1p(dataset[TARGET])
sns.violinplot(data=dataset,y=TARGET,color=COLOR)  
plt.show()
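As a quick check (not part of the original cells), we can recompute the skewness of the transformed target; it should now be far closer to 0 than the ~3.8 we saw earlier:

#Skewness of the log-transformed cltv column
display_as_table(dataset[[TARGET]].skew())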

We can see that the transformed cltv is now within a similar order of magnitude as the other attributes. Next, let us quantify some of the correlations we saw between the attributes.

In [138]:
data_corr = dataset_cont.corr()

This produces a matrix of correlations between each pair of attributes.

In [139]:
display(data_corr)
cont1 cont2 cont3 cont4 cont5 cont6 cont7 cont8 cont9 cont10 cont11 cont12 cont13 cont14 cltv
cont1 1.000000 -0.085180 -0.445431 0.367549 -0.025230 0.758315 0.367384 0.361163 0.929912 0.808551 0.596090 0.614225 0.534850 0.056688 -0.010237
cont2 -0.085180 1.000000 0.455861 0.038693 0.191427 0.015864 0.048187 0.137468 -0.032729 0.063526 0.116824 0.106250 0.023335 -0.045584 0.141528
cont3 -0.445431 0.455861 1.000000 -0.341633 0.089417 -0.349278 0.097516 -0.185432 -0.417054 -0.325562 0.025271 0.006111 -0.418203 -0.039592 0.111053
cont4 0.367549 0.038693 -0.341633 1.000000 0.163748 0.220932 -0.115064 0.528740 0.328961 0.283294 0.120927 0.130453 0.179342 0.017445 -0.035831
cont5 -0.025230 0.191427 0.089417 0.163748 1.000000 -0.149810 -0.249344 0.009015 -0.088202 -0.064967 -0.151548 -0.148217 -0.082915 -0.021638 -0.011355
cont6 0.758315 0.015864 -0.349278 0.220932 -0.149810 1.000000 0.658918 0.437437 0.797544 0.883351 0.773745 0.785144 0.815091 0.042178 0.040967
cont7 0.367384 0.048187 0.097516 -0.115064 -0.249344 0.658918 1.000000 0.142042 0.384343 0.492621 0.747108 0.742712 0.288395 0.022286 0.119799
cont8 0.361163 0.137468 -0.185432 0.528740 0.009015 0.437437 0.142042 1.000000 0.452658 0.336588 0.302381 0.315904 0.476402 0.043539 0.030508
cont9 0.929912 -0.032729 -0.417054 0.328961 -0.088202 0.797544 0.384343 0.452658 1.000000 0.785697 0.608000 0.626656 0.642028 0.074154 0.014456
cont10 0.808551 0.063526 -0.325562 0.283294 -0.064967 0.883351 0.492621 0.336588 0.785697 1.000000 0.702896 0.713812 0.707876 0.041808 0.020236
cont11 0.596090 0.116824 0.025271 0.120927 -0.151548 0.773745 0.747108 0.302381 0.608000 0.702896 1.000000 0.994384 0.466247 0.047293 0.099806
cont12 0.614225 0.106250 0.006111 0.130453 -0.148217 0.785144 0.742712 0.315904 0.626656 0.713812 0.994384 1.000000 0.478677 0.050267 0.098724
cont13 0.534850 0.023335 -0.418203 0.179342 -0.082915 0.815091 0.288395 0.476402 0.642028 0.707876 0.466247 0.478677 1.000000 0.047543 -0.004022
cont14 0.056688 -0.045584 -0.039592 0.017445 -0.021638 0.042178 0.022286 0.043539 0.074154 0.041808 0.047293 0.050267 0.047543 1.000000 0.019298
cltv -0.010237 0.141528 0.111053 -0.035831 -0.011355 0.040967 0.119799 0.030508 0.014456 0.020236 0.099806 0.098724 -0.004022 0.019298 1.000000

In the above matrix we want to find which attribute pairs are highly correlated, i.e. have a correlation coefficient greater than a chosen threshold. The coefficient ranges from 0.0 to 1.0; the closer it is to 1.0, the stronger the correlation.

We identify all attribute pairs above the threshold, and exclude coefficients equal to 1 since those are just attributes correlated with themselves.

In [140]:
THRESHOLD = 0.85
data_correlated = data_corr[(data_corr > THRESHOLD) & (data_corr < 1) ]

Let's also drop all rows (axis=0) and columns (axis=1) that don't have any correlated data, to make the information more readable. This produces a matrix of the attribute pairs that are highly correlated.

In [141]:
data_correlated.dropna(axis=0,how='all').dropna(axis=1,how='all').fillna('')
Out[141]:
          cont1     cont6     cont9     cont10    cont11    cont12
cont1                         0.929912
cont6                                   0.883351
cont9     0.929912
cont10              0.883351
cont11                                            0.994384
cont12                                  0.994384

This matches some of our findings from the violin plots. For example, we had observed that cont11 and cont12 look almost identical, and indeed their correlation coefficient is 0.994384, which is very high.

Attributes that are highly correlated give us an opportunity to reduce the number of features in our learning model: correlated features influence the outcome in a similar manner, so we can use one instead of both. Based on the table above, we can safely remove one attribute from each pair: cont11 or cont12, cont1 or cont9, and cont6 or cont10. A quick sketch of this is shown below.
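A minimal sketch of that reduction (illustrative only; the remaining cells in this post keep all columns, and which member of each pair we keep is an arbitrary choice):

#Drop one attribute from each highly correlated pair
dataset_reduced = dataset.drop(['cont1', 'cont6', 'cont11'], axis=1)
print(dataset_reduced.shape)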

Categorical Attributes

Let us find the possible categorical columns by plotting each non-numeric column against its number of unique values.

In [142]:
#Define a dataframe that will hold the plot values

df_cat = pandas.DataFrame(columns=['attribute',
                            'num_unique_values','unique_values'])
df_cat['num_unique_values'] = df_cat['num_unique_values'].astype(int)

#populate the dataframe with the non numeric attributes and 
#the number of unique values 
row = 0
df_nonnumeric = dataset.select_dtypes(exclude=NUMERICS)
#Get the list of possible categorical columns
categorical_cols = df_nonnumeric.columns

for col in categorical_cols:
    df_cat.loc[row] = [col, len(dataset[col].unique()),
                       dataset[col].unique()]
    row = row + 1
        
layout = go.Layout(
    autosize=False,    
    height=2000,  
    font=dict(family=CODE_FONT),
    xaxis=dict(
        title='Number of unique values',
        titlefont=dict(     
            family=TEXT_FONT,
            size=16,
            color=COLOR
        ),        
    ),
    yaxis=dict(
        title='Attribute',
        titlefont=dict(  
            family=TEXT_FONT,         
            size=16,
            color=COLOR 
        ),             
    ),    
    annotations=[
        dict(x=xi,y=yi,
             text=str(xi),
             xanchor='left',
             yanchor='center',
             showarrow=False,
        ) for xi, yi in zip(df_cat['num_unique_values'], 
                df_cat['attribute'])]
)
 
data = [
    go.Bar(
        y=df_cat['attribute'], 
        x=df_cat['num_unique_values'],
        orientation = 'h',
        marker = dict(   
        color=COLOR,            
        line = dict(
            color = SECONDARY_COLOR,
            width = 1)
    )
    )
]

fig = go.Figure(data=data,layout=layout)
offline.iplot(fig,show_link=False)

We can see here that cat1 through cat88 have fewer than 5 labels each, cat89 through cat98 have between 5 and 8, and cat99 through cat116 have many more. Here is what these labels look like.

In [143]:
df_cat
Out[143]:
attribute num_unique_values unique_values
0 cat1 2 [A, B]
1 cat2 2 [B, A]
2 cat3 2 [A, B]
3 cat4 2 [B, A]
4 cat5 2 [A, B]
5 cat6 2 [A, B]
6 cat7 2 [A, B]
7 cat8 2 [A, B]
8 cat9 2 [B, A]
9 cat10 2 [A, B]
10 cat11 2 [B, A]
11 cat12 2 [A, B]
12 cat13 2 [A, B]
13 cat14 2 [A, B]
14 cat15 2 [A, B]
15 cat16 2 [A, B]
16 cat17 2 [A, B]
17 cat18 2 [A, B]
18 cat19 2 [A, B]
19 cat20 2 [A, B]
20 cat21 2 [A, B]
21 cat22 2 [A, B]
22 cat23 2 [B, A]
23 cat24 2 [A, B]
24 cat25 2 [A, B]
25 cat26 2 [A, B]
26 cat27 2 [A, B]
27 cat28 2 [A, B]
28 cat29 2 [A, B]
29 cat30 2 [A, B]
... ... ... ...
86 cat87 4 [B, C, D, A]
87 cat88 4 [A, D, E, B]
88 cat89 8 [A, B, C, E, D, H, I, G]
89 cat90 7 [A, B, C, D, F, E, G]
90 cat91 8 [A, B, G, C, D, E, F, H]
91 cat92 7 [A, H, B, C, D, I, F]
92 cat93 5 [D, C, A, B, E]
93 cat94 7 [B, D, C, A, F, E, G]
94 cat95 5 [C, D, E, A, B]
95 cat96 8 [E, D, G, B, F, A, I, C]
96 cat97 7 [A, E, C, G, D, F, B]
97 cat98 5 [C, D, A, E, B]
98 cat99 16 [T, D, P, S, R, K, E, F, N, J, C, M, H, G, I, O]
99 cat100 15 [B, L, I, F, J, H, C, M, A, G, O, N, K, D, E]
100 cat101 19 [G, F, O, D, J, A, C, Q, M, I, L, R, S, E, N, ...
101 cat102 9 [A, C, B, D, G, E, F, H, J]
102 cat103 13 [A, B, C, F, E, D, G, H, I, L, K, J, N]
103 cat104 17 [I, E, D, K, H, F, G, P, C, J, L, M, N, O, B, ...
104 cat105 20 [E, F, H, G, I, D, J, K, M, C, A, L, N, P, T, ...
105 cat106 17 [G, I, H, K, F, J, E, L, M, D, A, C, N, O, R, ...
106 cat107 20 [J, K, F, G, I, M, H, L, E, D, O, C, N, A, Q, ...
107 cat108 11 [G, K, A, B, D, I, F, H, E, C, J]
108 cat109 84 [BU, BI, AB, H, K, CD, BQ, M, G, BL, L, AL, N,...
109 cat110 131 [BC, CQ, DK, CS, C, EB, DW, AM, AI, EG, CL, BS...
110 cat111 16 [C, A, G, E, I, M, W, S, K, O, Q, U, F, B, Y, D]
111 cat112 51 [AS, AV, C, N, Y, J, AH, K, U, E, AK, AI, AE, ...
112 cat113 61 [S, BM, AF, AE, Y, AX, H, K, L, A, J, AK, N, M...
113 cat114 19 [A, J, E, C, F, L, N, I, R, U, O, B, Q, V, D, ...
114 cat115 23 [O, I, K, P, Q, L, J, R, N, M, H, G, F, A, S, ...
115 cat116 326 [LB, DP, GK, DJ, CK, LO, IE, LY, GS, HK, DC, M...

116 rows × 3 columns

Preparing the Data

In order to apply ML algorithms, we will need to convert the categorical attribute labels to numeric data. We will look at two ways to do this. The first is one-hot encoding with scikit-learn's LabelEncoder and OneHotEncoder.

In [144]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

labels={}
cats = []

for col in categorical_cols:
    #Take the union of the labels seen in the training and test sets
    train_labels = dataset[col].unique()
    test_labels = dataset_test[col].unique()
    labels[col] = list(set(train_labels) | set(test_labels))

for col in categorical_cols:
    label_encoder = LabelEncoder()
    label_encoder.fit(labels[col])
    feature = label_encoder.transform(dataset[col])
    num_rows = dataset.shape[0]
    #Reshape the 1D label array into a column vector of shape (num_rows, 1)
    #e.g. if the feature array is [0,0,1], it is reshaped to [[0],[0],[1]]
    feature = feature.reshape(num_rows, 1)
    onehot_encoder = OneHotEncoder(sparse=False, n_values=len(labels[col]))
    feature = onehot_encoder.fit_transform(feature)
    cats.append(feature)

#make a 2D array
encoded_cats = np.column_stack(cats)
print(encoded_cats.shape)

dataset_encoded = np.concatenate(
    (encoded_cats,dataset_cont),axis=1)

print(dataset_encoded.shape)
(188318, 1176)
(188318, 1191)

The second approach is to use pandas get_dummies to convert the labels to dummy indicator columns.

In [145]:
dataset_encoded_df = pandas.concat([pandas.get_dummies(df_nonnumeric),
                            dataset_cont],axis=1)
dataset_encoded_df.shape
Out[145]:
(188318, 1154)
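One caveat with this approach (not handled in the original cell): get_dummies only builds columns from labels that occur in the frame it is given, so a label appearing only in the test file would have no corresponding column here. A minimal sketch of aligning the training and test dummy frames with pandas align:

#Encode the test categoricals the same way, then align columns with the
#training dummies; any label missing from one frame gets a column of zeros
test_dummies = pandas.get_dummies(dataset_test.select_dtypes(exclude=NUMERICS))
train_dummies = pandas.get_dummies(df_nonnumeric)
train_dummies, test_dummies = train_dummies.align(test_dummies, join='outer',
                                                  axis=1, fill_value=0)
print(train_dummies.shape, test_dummies.shape)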

We have now created a dataset with all numeric values by encoding the categorical attributes. Note that this frame has 1154 columns versus 1191 from the first approach, for the reason just described: get_dummies built its columns only from the training labels, while the LabelEncoder approach used the union of training and test labels. Next, we will split the dataset into training and testing sets.

In [146]:
# Generate a random sample boolean array
msk = np.random.rand(len(dataset_encoded_df)) < 0.8
# Apply the mask to the dataframe so it returns the "True" rows
train = dataset_encoded_df[msk]
# Apply the mask to the dataframe so it returns the "False" rows
test = dataset_encoded_df[~msk]

print ("Total population: ", len(dataset_encoded_df))
print("Records in Training set: ", len(train))
print("Records in Test set: ", len(test))
Total population:  188318
Records in Training set:  150974
Records in Test set:  37344
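The same split could also be done with scikit-learn's train_test_split, which makes it easy to fix a random seed; a minimal sketch (shown only as an alternative, the mask-based split above is what we use from here on):

from sklearn.model_selection import train_test_split

#Equivalent 80/20 split with a fixed seed so the result is reproducible
train_alt, test_alt = train_test_split(dataset_encoded_df, test_size=0.2,
                                       random_state=42)
print("Records in Training set: ", len(train_alt))
print("Records in Test set: ", len(test_alt))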

Building the Model

Linear Regression

We fit an ordinary least squares model on the training split: the cltv column becomes our y, and all other feature columns become our X.

In [152]:
import statsmodels.api as sm

feature_cols = [col for col in dataset_encoded_df.columns 
    if col not in [TARGET]]

#Fit an ordinary least squares model on the training split:
#y is the log-transformed cltv, X is every other encoded feature column
X = sm.add_constant(train[feature_cols])
y = train[TARGET]
res = sm.OLS(y, X).fit()
print(res.summary())
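With the model fit, a rough way to sanity check it (a sketch of the "Making the prediction" step, which this post will cover in more detail later) is to predict on the held-out rows and compare against the actual values on the original cltv scale, remembering to invert the earlier log1p transform with expm1:

#Build the test design matrix the same way as the training one
X_holdout = sm.add_constant(test[feature_cols])

#Predict, invert the log1p transform, and compute mean absolute error
predicted_cltv = np.expm1(res.predict(X_holdout))
actual_cltv = np.expm1(test[TARGET])
print("Mean absolute error: ", np.mean(np.abs(predicted_cltv - actual_cltv)))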