A Predictive Analytical Study on Factors Enhancing Customer Acquisition and Retention

Analytics, New Delhi Institute of Management

CRM (Customer Relationship Management) systems have long been used to strengthen relationships with customers, thereby ensuring retention and enhancing business. Data stored in the CRM software can be analyzed to provide deep insights into customer behavior, thus influencing future products and services. Predictive analytics is a branch of business analytics that analyzes current data, with the help of statistical tools, data mining algorithms, modelling tools, AI or machine learning, to make effective predictions for the future. This paper studies the impact of predictive analytics applied to the CRM data of the sample Organization (name withheld for confidentiality), which is among the front runners in the instrumentation industry in India and provides instruments and allied services through leading-edge global technology. The paper uses logistic regression to examine the significant factors that help in winning a deal in the reference Organization. Data are obtained from the Customer Relationship Management software provided by the company. The results presented in this paper confirm that the CRM data can be used to predict the probability of winning a deal. The analysis also helps to find the factors that drive the 'Win' or 'Loss' of an opportunity/deal, so that businesses can take precautionary measures to avoid the potential loss of an opportunity. Such analysis is helpful in creating new sales tactics and improving the win rate. With CRM systems in place, organizations can store customer

to explore this aspect of the subject organization where significant factors have been identified from data stored in the CRM to predict their effect in winning or losing a deal.
The data terms used in the paper are as follows. Assigned Person refers to the Location Sales Head. Principal Code refers to the partners defined by the company. Lead Source refers to the source from which the lead was generated, such as e-mail, cold calling etc. Detailed data have not been presented in this paper due to the security concerns of the organization. The paper has been divided into four sections. In Section 1, the authors have introduced the importance of the study and the choice of variables for analysis. Section 2 focusses on the literature review, presenting the work done so far in this area. Section 3 focusses on the methodology used for the study. It presents the data used and the models created for predicting the customer orientation. The last section presents the results and the discussions.

Literature Review
A lot of research has been undertaken in the field of CRM and on the utility of CRM data for the organization. This section provides an overview of some of the existing literature that served as a backdrop for this study, together with the focal point of each work.
Mueller (2010) characterized CRM as an extremely dynamic aspect of organizations' businesses. He argued that businesses need to focus on their customers and should proactively use diverse approaches and steps for effective CRM to gain a competitive edge in their respective domains and industries. Sinkovics and Ghauri (2009) studied the correlation between the urge to enhance engagement in CRM and variables that could be used to increase sales, such as the cost of direct sales, the intensified competition in the global arena, and the need for information about businesses in general and the behavior of consumers.
Data mining and Knowledge Discovery in Databases (KDD) are required for picking out models and patterns of interest from very large databases. Fayyad and Stolorz (1997) presented an overview of this research area, delineated the basic techniques and briefly described some applications relating to the analysis of scientific data. Out of all data mining techniques, the most popular model for customer churn has been the Logit model, which has been used very effectively for handling customer churn and thus finds use in the analysis of marketing decisions. Logistic regression is a popular model as it combines simplicity with performance, and the estimated parameters are interpretable in terms of odds ratios, which helps in understanding and interpreting the results. Further, the technique is relatively robust and popular, as is evident from its availability in almost all statistical software packages. In this paper logistic regression has been applied in addition to linear regression to predict the customer orientation. Spinello and Hames (1997) found that Wal-Mart collects point-of-sale data from its stores in various countries and sends these data to its warehouse server. Data in the warehouse server are then used for analyzing customer buying patterns, for managing the inventory at the local store and for identifying opportunities.
Piatetsky-Shapiro et al. (1995) surveyed the applications of data mining tools in industry. They examined areas such as fraud detection, manufacturing automation and marketing analytics, and how these tools can be deployed and adopted in businesses. They found that these tools had potential value in industry and were being researched and developed by many researchers. Anand et al. (1996) focused on the need for organizations to invest in data mining solutions and tools owing to the increase in data size. They found that reliance on computer programs to identify patterns with minimal human intervention had become a necessity. They also presented a general structure of data mining based on Evidence Theory, which comprised methods for representing data and knowledge, for data manipulation and for the discovery of knowledge. Sahar F. Sabbeh (2018) compared and analyzed the efficiency of various machine learning algorithms for predicting customer churn. The techniques studied span various categories of learning and include Discriminant Analysis, Decision Trees, instance-based learning (KNN), Support Vector Machines (SVM), Logistic Regression, and ensemble-based learning techniques.
Eva Ascarza et al. (2017) detailed the various metrics used for measuring and monitoring customer retention. They presented a structure for managing retention by exploring the opportunities offered by emerging sources of data and the latest machine learning techniques. Sheetal et al. (2019), in their white paper, provided an overview of how predictive analytics applied to big data can aid organizations in optimizing their campaigns and drives for customer acquisition and retention.

Methodology
Data used for the analysis were collected over two years, i.e. 2015 to 2017, in the Customer Relationship Management software maintained by the Organization. The data contain details of all the customers and information pertaining to the won and lost opportunities. A summary of the deals is in Table 1 (in Appendix).

A study of the data in Table 1 can help point out the significant factors that contribute towards winning a deal and their respective impacts on predicting the success or failure of the deal. This analysis has been done in the current paper through a statistical method, i.e. logistic regression, using the software R, which is widely used for statistical and predictive analytics.

Logistic Regression Analysis
Logistic regression is a well-known and well-researched statistical tool for analyzing datasets in which one or more independent variables determine an outcome. The outcome, i.e. winning or losing an opportunity, is measured with a dichotomous variable that has only two possible values. The analysis is appropriate when the dependent variable is binary (1/0, Yes/No, True/False) and a set of independent variables exists.

Selection of Variables
The model must include all the relevant variables, but it must not begin with more variables than the number of observations justifies. Adding variables generally improves the fit of the model to the data, but excessive variables may affect the coefficients and create an over-fit model. Moreover, a complicated model that includes several insignificant variables may have reduced predictive ability, and its results may be difficult to interpret (Yusuff 2012).
Two highly correlated predictor variables can cause problems in regression analysis, as they may lead to inaccuracy in the analysis; correlation analysis is normally conducted to detect such pairs. Hence, we have used VIF (Variance Inflation Factors) to remove variables that are highly correlated with each other.

Validation
We conducted a validation analysis to check the suitability of logistic regression. For this, the data were first divided into two sets. The first set, containing 80% of the samples, is the main data and is used to find the coefficient values. The second set, which has the remaining 20% of the samples, is used for validating the model fitted on the main data. Once the coefficient values have been obtained from the main data, the probability for each sample in the validation data is calculated. The following formula is used for calculating the probability:

$$P(Y = 1 \mid x_1, \dots, x_p) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}$$

where $\beta_0$ is the value of the intercept coefficient and $\beta_p$ is the value of the coefficient for each factor $x_p$ that contributes to the occurrence.
Finally, the predicted probability of each sample is cross-validated against the observed outcome. With cross-validation, the percentage of correctly classified cases is obtained (Yusuf et al.).
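As an illustration of the probability calculation described above, the following is a minimal Python sketch (the study itself used R, and this is not the authors' code). The function name `win_probability`, the feature names and all coefficient values are hypothetical and chosen only to show the arithmetic.

```python
import math

def win_probability(intercept, coefficients, features):
    """Logistic probability from an intercept and per-factor coefficients."""
    # Linear predictor: intercept plus the sum of coefficient * feature value.
    linear = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    # Inverse logit maps the linear predictor (log-odds) to a probability.
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients and one hypothetical deal (not from the paper).
example_coefficients = {"location_kolkata": 0.8, "lead_source_website": -0.5}
example_deal = {"location_kolkata": 1, "lead_source_website": 0}
p = win_probability(-0.2, example_coefficients, example_deal)
```

With these made-up numbers the linear predictor is 0.6, so the predicted probability lies a little above one half.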
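The validation procedure can be sketched as follows in Python. This is an illustrative outline under stated assumptions, not the authors' R workflow: the helper names `split_80_20` and `classification_accuracy`, the fixed random seed and the 0.5 threshold are all assumptions made for the example.

```python
import random

def split_80_20(rows, seed=0):
    """Shuffle the rows and split them into 80% main and 20% validation data."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def classification_accuracy(probabilities, outcomes, threshold=0.5):
    """Share of samples whose thresholded probability matches the observed outcome."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for y_hat, y in zip(predicted, outcomes) if y_hat == y)
    return correct / len(outcomes)

# Hypothetical usage: ten placeholder rows, and four made-up validation scores.
main_rows, validation_rows = split_80_20(list(range(10)))
accuracy = classification_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])
```

The model would be fitted on `main_rows` only; the accuracy is then computed on probabilities predicted for `validation_rows`.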

Logistic Regression Model
Various tests, such as the model fitting test, parameter estimation and classification, were conducted as part of the logistic regression analysis. The model fitting test helps to check whether all the variables are appropriate for use in the logistic regression. We removed variables whose coefficients are NA and, in each set of variables, those with a very high p-value (> 0.5) as judged from the z-value/p-value. Variance Inflation Factors (VIF) are a method for identifying collinearity amongst the explanatory variables. The higher the VIF value, the higher the collinearity. The VIF for a single explanatory variable is obtained from the R-squared value of the regression of that variable against all the other explanatory variables, as VIF = 1 / (1 - R²).
In our case, we removed correlated variables, leaving an uncorrelated set of variables for further analysis, in order to avoid multicollinearity.
*VIF values >5 are commonly considered as high.
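The relationship between R-squared and VIF can be illustrated with a short Python sketch. It is a simplified example, not the routine used in the study: it covers only the case of two predictors, where the R-squared of regressing one on the other equals their squared Pearson correlation. The function names and the sample values are hypothetical.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def vif_from_r_squared(r_squared):
    """VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others."""
    return 1.0 / (1.0 - r_squared)

def vif_two_predictors(xs, ys):
    """With exactly two predictors, R^2 is the squared correlation between them."""
    return vif_from_r_squared(pearson_r(xs, ys) ** 2)

# Hypothetical predictors: the first pair is nearly collinear, the second uncorrelated.
vif_high = vif_two_predictors([1, 2, 3, 4], [2, 4, 6, 9])
vif_low = vif_two_predictors([1, 2, 3, 4], [1, -1, -1, 1])
```

With more than two predictors, R-squared would have to come from a multiple regression of each predictor on all the others.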

Coefficients:
In the final model, all retained predictors have significant p-values (<0.01). The following observations can be made from the results.
Location-Vadodara, Actual Revenue and Lead Source-CompanyWebsite have a negative impact on winning a deal, whereas the others, i.e. Location-Kolkata and Principal Code-MalvernInstrumentsLimited and Elcometer, have a positive impact on winning a deal.
The ROC curve presenting the results is shown in Figure 1.
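The sign of a logistic regression coefficient translates into an odds ratio, which is one way to read the positive and negative impacts listed above. The Python sketch below is illustrative only: the coefficient values are hypothetical placeholders, since the paper's fitted coefficients are not reproduced here.

```python
import math

def odds_ratio(coefficient):
    """Multiplicative change in the odds of a win for a one-unit increase in the predictor."""
    return math.exp(coefficient)

# Hypothetical coefficient values, used only to show the direction of the effect.
positive_effect = odds_ratio(0.7)   # greater than 1: odds of winning increase
negative_effect = odds_ratio(-0.7)  # less than 1: odds of winning decrease
no_effect = odds_ratio(0.0)         # exactly 1: odds unchanged
```

A negative coefficient therefore does not make a win impossible; it lowers the odds relative to the reference level of that predictor.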
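For readers unfamiliar with how an ROC curve is built, the following Python sketch computes its points and the area under it from predicted probabilities and observed outcomes. It is a generic illustration with made-up scores, not the code or data behind Figure 1, and the function names are assumptions.

```python
def roc_points(probabilities, outcomes):
    """Return (false positive rate, true positive rate) pairs, one per distinct threshold."""
    positives = sum(outcomes)
    negatives = len(outcomes) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step so that more samples are classified as wins.
    for threshold in sorted(set(probabilities), reverse=True):
        tp = sum(1 for p, y in zip(probabilities, outcomes) if p >= threshold and y == 1)
        fp = sum(1 for p, y in zip(probabilities, outcomes) if p >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical scores: one perfectly ranked sample and one with a single misranked pair.
auc_perfect = auc(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
auc_mixed = auc(roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

An area of 1 means every won deal is scored above every lost deal, while 0.5 corresponds to random ranking.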

Insights from Descriptive and Statistical Analysis on data:
After successfully running the logistic regression on the above data, we ran the predict function on the test data to get the predicted probabilities (0 ≤ p ≤ 1). The predict function calculates the predicted outcome of the categorical dependent variable, which has a limited number of categorical values, based on one or more independent variables. By default, the predict function on this model returns the log(odds) of the Y (here, target) variable; the probabilities are obtained by requesting predictions on the response scale (type = "response") or by applying the inverse logit to the log(odds).
The default threshold on the predicted probability is 0.5, a natural choice when the 1's and 0's in the data are roughly balanced. This threshold segregates the results into win/loss. At times, however, tuning the probability cutoff can improve the accuracy in both the development and validation samples. The optimalCutoff function from the InformationValue package provides ways to find the optimal cutoff to improve the prediction.
Below are some insights drawn from the resulting output, along with the graphs to visualize the output.
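The idea of tuning the cutoff can be sketched in Python as a simple search over candidate thresholds. This is an illustrative stand-in that maximizes plain accuracy; it is not the algorithm of the InformationValue package, and the function name, the candidate grid and the sample scores are assumptions.

```python
def best_cutoff(probabilities, outcomes, candidates=None):
    """Return the (threshold, accuracy) pair with the highest classification accuracy."""
    if candidates is None:
        # Assumed grid: 0.01, 0.02, ..., 0.99.
        candidates = [i / 100 for i in range(1, 100)]
    best_threshold, best_accuracy = 0.5, -1.0
    for threshold in candidates:
        correct = sum(
            1 for p, y in zip(probabilities, outcomes)
            if (1 if p >= threshold else 0) == y
        )
        accuracy = correct / len(outcomes)
        # Keep the first threshold that strictly improves on the best accuracy so far.
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy

# Hypothetical scores where the default 0.5 cutoff would label every deal a loss.
cutoff, cutoff_accuracy = best_cutoff([0.2, 0.3, 0.35, 0.4], [0, 0, 1, 1])
```

A cutoff chosen this way should be selected on the development sample and then checked on the validation sample, otherwise the reported accuracy is optimistic.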
a. Impact of Location on Success Rate:
The impact is depicted in Figure 2. Insight: Kolkata has the highest probability of winning a deal, followed by Chennai and Mumbai. Out of all locations, Chandigarh has the least chance of winning a deal.

b. Impact of Principal Code on Success Rate:
The impact is depicted in Figure 3. Insight: Success rate is highest when the Principal Code is Civil (AM), while it is lowest in the case of Particle Measuring Systems.

c. Impact of Lead Source on Success Rate:
The impact is depicted in Figure 4. Insight: Cold call has the greatest impact on success rate, followed by Customer Contact and Email.

Conclusion
As a result, we found that Location (Kolkata, Vadodara), Principal Code (Malvern Instruments Limited, Elcometer), Lead Source (Company Website) and Actual Revenue play a significant role in winning a deal.
Predictive analytics finds application in the context of CRM across various industries such as banking, telecommunication, retail, manufacturing, insurance and healthcare.
CRM systems are therefore offering predictive analytics to keep up with the trend. Infor CRM, Salesforce, and Microsoft have all introduced predictive analytics in their latest releases and another major CRM player, SugarCRM has one in the works.
In spite of the increasing application of predictive analytics in CRM, there is a lack of comprehensive literature assessment and of a classification system for it. This research provided a classification framework to fill this gap in the literature and guide future research. The various dimensions of CRM are customer acquisition, customer attraction, customer retention, customer development and customer equity growth. Predictive analytics is mostly applied in customer retention, to predict customer churn in organizations and to make informed decisions to avoid it. In this research predictive analytics has been applied to find the significant factors that enhance customer retention and customer development.
The most predominant predictive technique, logistic regression, has been applied in this research. The statistical tool R is used for programming. The charts are developed in Microsoft Excel to keep them simple.