Today, brands collect a large amount of data about their customers: browsing behavior on their websites, in-store transactions, social media interactions, second-party data, ad response data, app behavioral data and so on.
Assuming all of this data is available to a brand in a single datastore, how could it be used to predict the value of a customer?
In this post, we will take a hypothetical training dataset with a fairly large number of data attributes that represent customer activity, interactions and transactions with a brand. This dataset also contains a known customer lifetime value (CLTV) for each record.
Using this dataset we will train a learning model that will be able to predict the value of a new customer based on their collected data.
The whole process can be split into 4 main steps:
1. Load and explore the training data.
2. Analyze and transform the continuous variables.
3. Encode the categorical variables as numeric features.
4. Split the data and train a regression model to predict CLTV.
We will start with importing the required packages and defining some commonly used functions in this example.
#Import libraries
import pandas
from IPython.display import display, HTML
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import plotly.offline as offline
import plotly.graph_objs as go
import common_func as mymodule
offline.init_notebook_mode()
#Initialize your own modules. Comment this line out if you have no modules of your own
display(mymodule.init())
#Define presentation attributes
TEXT_FONT='Open Sans'
CODE_FONT='monospace'
sns.set_style({'font.family': CODE_FONT})
COLOR = '#2980B9'
SECONDARY_COLOR = '#111'
NUMERICS = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
#Define utility functions
def display_as_table(t):
    return pandas.DataFrame(t)
Load the training dataset into a dataframe.
dataset = pandas.read_csv("../input/train.csv")
Load the test dataset into a dataframe, save its id column, and drop id from both datasets since we do not want to include it in our training or prediction.
dataset_test = pandas.read_csv("../input/test.csv")
ID = dataset_test['id']
dataset.drop('id',axis=1,inplace=True)
dataset_test.drop('id',axis=1,inplace=True)
#Set the target column: the customer lifetime value that we will be predicting
TARGET = 'cltv'
Take a look at some sample data by transposing the view so each attribute is shown as a row below.
dataset.head(5).transpose()
Display the number of records and columns in this dataset.
dataset.shape
Looking at the summary statistics for this dataset, we find that the non-null count for each continuous variable is 188318, which matches the number of records seen above. This means no values are missing, which is good.
display_as_table(dataset.describe().loc['count'])
Let us select the continuous variables and store their column names in a list; we will want to examine this data further.
dataset_cont = dataset.select_dtypes(['float64','int64'])
cont_cols = dataset_cont.columns.tolist()
Now let's look at the skewness of the continuous variables. We can see that the cltv column shows the highest skewness, at ~3.8.
display_as_table(dataset_cont.skew())
Next, let us look at the distribution of each continuous variable using violin plots, to see if any patterns emerge.
for i in range(len(cont_cols)-1):
    sns.violinplot(y=cont_cols[i], data=dataset_cont, color=COLOR)
    plt.show()
A few patterns emerge from these plots. In particular, when we plot the target variable 'cltv', we see that its values range from 0 to over 120,000.
sns.violinplot(y=cont_cols[14], data=dataset_cont,color=COLOR)
plt.show()
Since these values are several orders of magnitude larger than those of the other continuous variables, we will need to normalize them so that modeling techniques applied to this dataset work much better. All values of the cltv variable are positive, so we can use a logarithmic transformation.
#Apply log(1+x) to the cltv column, replacing its original values
dataset[TARGET] = np.log1p(dataset[TARGET])
sns.violinplot(data=dataset,y=TARGET,color=COLOR)
plt.show()
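Before moving on, note that log1p is only a change of scale. Assuming we later want predictions expressed as actual CLTV amounts, we can undo the transform with np.expm1, the inverse of np.log1p. A minimal round-trip sketch, for illustration only (the dataset itself keeps the transformed values):
#np.expm1 reverses np.log1p, mapping the transformed target back to the
#original cltv scale; any model predictions can be inverted the same way
cltv_original_scale = np.expm1(dataset[TARGET])
display_as_table(cltv_original_scale.describe())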
From the plot above, the transformed cltv is now of a similar order of magnitude as the other attributes. Next, let us quantify some of the correlations we saw between the attributes.
data_corr = dataset_cont.corr()
This produces a matrix of pairwise correlations between the attributes.
display(data_corr)
In the above matrix we want to find which attribute pairs are highly correlated, i.e. have a correlation coefficient greater than a chosen threshold. The threshold ranges from 0.0 to 1.0; the closer a pair's coefficient is to 1.0, the stronger the correlation.
We identify all attribute pairs above the threshold, and exclude coefficients equal to 1 because those correspond to an attribute paired with itself.
THRESHOLD = 0.85
data_correlated = data_corr[(data_corr > THRESHOLD) & (data_corr < 1) ]
Let's also drop all rows (axis=0) and columns (axis=1) that don't contain any correlated pairs, to make the information more readable. This produces a matrix of attribute pairs that are highly correlated.
data_correlated.dropna(axis=0,how='all').dropna(axis=1,how='all').fillna('')
This matches some of our findings from the violin plots. For example, cont11 and cont12 are highly correlated, with a correlation coefficient of 0.994384, which is very high.
Attributes that are highly correlated give us an opportunity to reduce the number of features in our learning model: correlated features influence the outcome in a similar manner, so we could use one instead of both. Based on the table above, we can safely remove either cont11 or cont12, cont1 or cont9, and cont6 or cont10, as sketched below.
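Here is a minimal sketch of what that pruning could look like. Which member of each pair to keep is an arbitrary choice on my part, and dataset_reduced is a name introduced only for this illustration; the rest of the post continues to use the full set of columns.
#Drop one attribute from each highly correlated pair; keeping cont12, cont9
#and cont10 (and dropping their partners) is an arbitrary choice
redundant_cols = ['cont11', 'cont1', 'cont6']
dataset_reduced = dataset.drop(redundant_cols, axis=1)
dataset_reduced.shape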
Next, let us find the likely categorical columns by plotting each non-numeric column against the number of unique values it contains.
#Define a dataframe that will hold the plot values
df_cat = pandas.DataFrame(columns=['attribute',
'num_unique_values','unique_values'])
df_cat['num_unique_values'] = df_cat['num_unique_values'].astype(int)
#populate the dataframe with the non numeric attributes and
#the number of unique values
row = 0
df_nonnumeric = dataset.select_dtypes(exclude=NUMERICS)
#Get the list of possible categorical columns
categorical_cols = df_nonnumeric.columns
for col in categorical_cols:
    df_cat.loc[row] = [col, len(dataset[col].unique()),
                       dataset[col].unique()]
    row = row + 1
layout = go.Layout(
autosize=False,
height=2000,
font=dict(family=CODE_FONT),
xaxis=dict(
title='Number of unique values',
titlefont=dict(
family=TEXT_FONT,
size=16,
color=COLOR
),
),
yaxis=dict(
title='Attribute',
titlefont=dict(
family=TEXT_FONT,
size=16,
color=COLOR
),
),
annotations=[
dict(x=xi,y=yi,
text=str(xi),
xanchor='left',
yanchor='center',
showarrow=False,
) for xi, yi in zip(df_cat['num_unique_values'],
df_cat['attribute'])]
)
data = [
go.Bar(
y=df_cat['attribute'],
x=df_cat['num_unique_values'],
orientation = 'h',
marker = dict(
color=COLOR,
line = dict(
color = SECONDARY_COLOR,
width = 1)
)
)
]
fig = go.Figure(data=data,layout=layout)
offline.iplot(fig,show_link=False)
We can see here that cat1 through cat88 have under 5 labels each, while cat99 through cat116 have many labels. Here is what these labels look like.
df_cat
In order to apply machine learning algorithms, we will need to convert the categorical attribute labels to numeric data. There are two common ways to do this. The first is one-hot encoding.
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
labels={}
cats = []
for col in categorical_cols:
    train_labels = dataset[col].unique()
    test_labels = dataset_test[col].unique()
    #Take the union to get the distinct labels across both sets
    labels[col] = list(set(train_labels) | set(test_labels))

for col in categorical_cols:
    label_encoder = LabelEncoder()
    label_encoder.fit(labels[col])
    feature = label_encoder.transform(dataset[col])
    num_records = dataset.shape[0]
    #Make the feature array an array of 1-element arrays, e.g. if the
    #feature array is [0,0,1] it is reshaped to [[0],[0],[1]]
    feature = feature.reshape(num_records, 1)
    onehot_encoder = OneHotEncoder(sparse=False, n_values=len(labels[col]))
    feature = onehot_encoder.fit_transform(feature)
    cats.append(feature)
#make a 2D array
encoded_cats = np.column_stack(cats)
print(encoded_cats.shape)
dataset_encoded = np.concatenate(
(encoded_cats,dataset_cont),axis=1)
print(dataset_encoded.shape)
The second approach is to use pandas get_dummies to convert the labels to dummy indicator columns.
dataset_encoded_df = pandas.concat([pandas.get_dummies(df_nonnumeric),
dataset_cont],axis=1)
dataset_encoded_df.shape
We have now created a dataset with all numeric values by encoding the categorical attributes. Next, we will split this dataset into training and testing sets.
# Generate a random sample boolean array
msk = np.random.rand(len(dataset_encoded_df)) < 0.8
# Apply the mask to the dataframe so it returns the "True" rows
train = dataset_encoded_df[msk]
# Apply the mask to the dataframe so it returns the "False" rows
test = dataset_encoded_df[~msk]
print ("Total population: ", len(dataset_encoded_df))
print("Records in Training set: ", len(train))
print("Records in Test set: ", len(test))
The cltv column becomes our y, and all other feature columns become our X.
import statsmodels.api as sm
feature_cols = [col for col in dataset_encoded_df.columns
                if col not in [TARGET]]
#Fit an ordinary least squares regression on the training split
X = sm.add_constant(train[feature_cols])
y = train[TARGET]
res = sm.OLS(y, X).fit()
print(res.summary())
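Finally, a minimal sketch of how the held-out rows could be used to sanity-check the fit. The choice of mean absolute error as the metric is my own assumption, not something established above; the error is shown both on the log scale and after undoing the log1p transform with expm1.
#Predict on the held-out split and compute mean absolute error
X_holdout = sm.add_constant(test[feature_cols], has_constant='add')
y_holdout = test[TARGET]
pred_log = res.predict(X_holdout)
print("MAE (log scale): ", np.mean(np.abs(pred_log - y_holdout)))
print("MAE (original cltv scale): ",
      np.mean(np.abs(np.expm1(pred_log) - np.expm1(y_holdout))))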