Cross-company customer churn prediction in telecommunication: a comparison of data transformation methods
Abstract
Cross-Company Churn Prediction (CCCP) is a research domain in which one
company (the target) lacks sufficient data and can use data from another
company (the source) to predict customer churn. To support
CCCP, the cross-company data is usually transformed to a normal
distribution similar to that of the target company's data before a CCCP
model is built. However, it is still unclear which data transformation method is
most effective in CCCP. Moreover, the impact of data transformation methods
on CCCP model performance across different classifiers has not been
comprehensively explored in the telecommunication sector. In this study,
we devised a CCCP model using data transformation methods (i.e.,
log, z-score, rank and Box-Cox). We presented an extensive
comparison to validate the impact of these transformation methods on
CCCP, and we evaluated the performance of the underlying baseline
classifiers (i.e., Naive Bayes (NB), K-Nearest Neighbour (KNN), Gradient
Boosted Tree (GBT), Single Rule Induction (SRI) and Deep learner Neural
net (DP)) for customer churn prediction in the telecommunication sector
using the above-mentioned data transformation methods. We performed
experiments on publicly available datasets related to the
telecommunication sector. The results demonstrated that most of the
data transformation methods (log, rank, and Box-Cox) improve the
performance of CCCP significantly. However, the z-score data
transformation method did not achieve better results than
the other data transformation methods in this study. Moreover, we
found that the NB-based CCCP model outperformed the other classifiers on
transformed data, that DP, KNN and GBT performed at an average level, and that the SRI
classifier did not show significant results in terms of the commonly
used evaluation measures (i.e., probability of detection, probability of
false alarm, area under the curve and g-mean).
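
For readers unfamiliar with the four transformations and the threshold-based measures named above, the following is a minimal sketch of their textbook definitions in plain Python. It is not the authors' implementation. The function names are hypothetical, and several choices are assumptions made for illustration: the log transform uses log(1 + x) to tolerate zeros, the Box-Cox parameter `lam` is passed in by hand rather than estimated from the data, and g-mean is taken as the square root of pd * (1 - pf), a common definition that may differ from the one used in the study. Area under the curve is omitted because it needs ranked scores rather than hard labels.

```python
import math
from statistics import mean, pstdev


def log_transform(values):
    """Natural log of (1 + x); the +1 shift is an assumption to tolerate zeros."""
    return [math.log1p(v) for v in values]


def z_score(values):
    """Centre on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def rank_transform(values):
    """Replace each value by its 1-based rank; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def box_cox(values, lam):
    """Box-Cox power transform for a fixed lambda (assumed, not estimated)."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def churn_metrics(y_true, y_pred):
    """Probability of detection, probability of false alarm, and g-mean.

    Labels are 1 for churn and 0 for non-churn.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    pod = tp / (tp + fn) if tp + fn else 0.0
    pof = fp / (fp + tn) if fp + tn else 0.0
    return {"pd": pod, "pf": pof, "g_mean": math.sqrt(pod * (1 - pof))}
```

In a cross-company setting, the same transformation would be applied to both the source and the target features before a classifier is trained on the source data and evaluated on the target data. Whether the transformation parameters (for example the mean and standard deviation for z-score) are fitted per company or shared is a design choice that the abstract does not specify.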