{"title": "Churn Reduction in the Wireless Industry", "book": "Advances in Neural Information Processing Systems", "page_first": 935, "page_last": 941, "abstract": null, "full_text": "Churn Reduction in the Wireless Industry \n\nMichael C. Mozer*+, Richard Wolniewicz*, David B. Grimes*+, \n\nEric Johnson*, Howard Kaushansky* \n\n* Athene Software \n\n2060 Broadway, Suite 300 \n\nBoulder, CO 80302 \n\n+ Department of Computer Science \n\nUniversity of Colorado \n\nBoulder, CO 80309-0430 \n\nAbstract \n\nCompetition in the wireless telecommunications industry is rampant. To maintain \nprofitability, wireless carriers must control churn, the loss of subscribers \nwho switch from one carrier to another. We explore statistical techniques for \nchurn prediction and, based on these predictions, an optimal policy for identifying \ncustomers to whom incentives should be offered to increase retention. Our \nexperiments are based on a data base of nearly 47,000 U.S. domestic subscribers, \nwhich includes information about their usage, billing, credit, application, and \ncomplaint history. We show that under a wide variety of assumptions concerning \nthe cost of intervention and the retention rate resulting from intervention, churn \nprediction and remediation can yield significant savings to a carrier. We also \nshow the importance of a data representation crafted by domain experts. \n\nCompetition in the wireless telecommunications industry is rampant. As many as seven \ncompeting carriers operate in each market. The industry is extremely dynamic, with new \nservices, technologies, and carriers constantly altering the landscape. Carriers announce \nnew rates and incentives weekly, hoping to entice new subscribers and to lure subscribers \naway from the competition. The extent of rivalry is reflected in the deluge of advertisements \nfor wireless service in the daily newspaper and other mass media. 
\n\nThe United States had 69 million wireless subscribers in 1998, roughly 25% of the \npopulation. Some markets are further developed; for example, the subscription rate in Finland \nis 53%. Industry forecasts are for a U.S. penetration rate of 48% by 2003. Although \nthere is significant room for growth in most markets, the industry growth rate is declining \nand competition is rising. Consequently, it has become crucial for wireless carriers to control \nchurn, the loss of customers who switch from one carrier to another. At present, \ndomestic monthly churn rates are 2-3% of the customer base. At an average cost of $400 \nto acquire a subscriber, churn cost the industry nearly $6.3 billion in 1998; the total annual \nloss rose to nearly $9.6 billion when lost monthly revenue from subscriber cancellations is \nconsidered (Luna, 1998). It costs roughly five times as much to sign on a new subscriber \nas to retain an existing one. Consequently, for a carrier with 1.5 million subscribers, reducing \nthe monthly churn rate from 2% to 1% would yield an increase in annual earnings of at \nleast $54 million, and an increase in shareholder value of approximately $150 million. \n(Estimates are even higher when lost monthly revenue is considered; see Fowlkes, Madan, \nAndrew, & Jensen, 1999; Luna, 1998.) \n\nThe goal of our research is to evaluate the benefits of predicting churn using techniques \nfrom statistical machine learning. We designed models that predict the probability \nof a subscriber churning within a short time window, and we evaluated how well these predictions \ncould be used for decision making by estimating potential cost savings to the \nwireless carrier under a variety of assumptions concerning subscriber behavior. \n\n1 THE FRAMEWORK \n\nFigure 1 shows a framework for churn prediction and profitability maximization. 
\n\nData from a subscriber (on which we elaborate in the next section) is fed into three components \nwhich estimate: the likelihood that the subscriber will churn, the profitability \n(expected monthly revenue) of the subscriber, and the subscriber's credit risk. Profitability \nand credit risk determine how valuable the subscriber is to the carrier, and hence influence \nhow much the carrier should be willing to spend to retain the subscriber. Based on \nthe predictions of subscriber behavior, a decision making component determines an intervention \nstrategy: whether a subscriber should be contacted, and if so, what incentives \nshould be offered to appease them. We adopt a decision-theoretic approach which aims to \nmaximize the expected profit to the carrier. \n\nIn the present work, we focus on churn prediction and utilize simple measures of \nsubscriber profitability and credit risk. However, current modeling efforts are directed at \nmore intelligent models of profitability and credit risk. \n\n2 DATASET \n\nThe subscriber data used for our experiments was provided by a major wireless carrier. \nThe carrier does not want to be identified, as churn rates are confidential. The carrier \nprovided a data base of 46,744 primarily business subscribers, all of whom had multiple \nservices. (Each service corresponds to a cellular telephone or to some other service, such \nas voice messaging or beeper capability.) All subscribers were from the same region of the \nUnited States, about 20% in major metropolitan areas and 80% more geographically distributed. \nThe total revenue for all subscribers in the data base was $14 million in October \n1998. The average revenue per subscriber was $234. We focused on multi-service subscribers, \nbecause they provide significantly more revenue than do typical single-service \nsubscribers. 
\n\nWhen subscribers are on extended contracts, churn prediction is relatively easy: churn \nseldom occurs during the contract period, and often occurs when the contract comes to an \nend. Consequently, all subscribers in our data base were month-to-month, requiring the \nuse of more subtle features than contract termination date to anticipate churn. \n\nThe subscriber data was extracted from the time interval October through December, \n1998. Based on these data, the task was to predict whether a subscriber would churn in \nJanuary or February 1999. The carrier provided their internal definition of churn, which \nwas based on the closing of all services held by a subscriber. From this definition, 2,876 of \nthe subscribers active in October through December churned, or 6.2% of the data base. \n\n[Figure 1 diagram: subscriber data feeds subscriber churn prediction, subscriber profitability estimation, and subscriber credit risk estimation, which feed decision making, which yields the intervention strategy.] \n\nFIGURE 1. The framework for churn prediction and profitability maximization \n\n2.1 INPUT FEATURES \nUltimately, churn occurs because subscribers are dissatisfied with the price or quality of \nservice, usually as compared to a competing carrier. The main reasons for subscriber dissatisfaction \nvary by region and over time. Table 1 lists important factors that influence \nsubscriber satisfaction, as well as the relative importance of the factors (J. D. Power and \nAssociates, 1998). In the third column, we list the type of information required for determining \nwhether a particular factor is likely to be influencing a subscriber. We categorize \nthe types of information as follows. \n\nNetwork. 
Call detail records (date, time, duration, and location of all calls), dropped \ncalls (calls lost due to lack of coverage or available bandwidth), and quality of service \ndata (interference, poor coverage). \nBilling. Financial information appearing on a subscriber's bill (monthly fee, additional \ncharges for roaming and additional minutes beyond monthly prepaid limit). \nCustomer Service. Calls to the customer service department and their resolutions. \nApplication for Service. Information from the initial application for service, including \ncontract details, rate plan, handset type, and credit report. \nMarket. Details of rate plans offered by carrier and its competitors, recent entry of \ncompetitors into market, advertising campaigns, etc. \nDemographics. Geographic and population data of a given region. \n\nA subset of these information sources was used in the present study. Most notably, we did \nnot utilize market information, because the study was conducted over a fairly short time \ninterval during which the market did not change significantly. More important, the market \nforces were fairly uniform in the various geographic regions from which our subscribers \nwere selected. Also, we were unable to obtain information about the subscriber equipment \n(age and type of handset used). \n\nThe information sources listed above were distributed over three distinct data bases \nmaintained by the carrier. The data bases contained thousands of fields, from which we \nidentified 134 variables associated with each subscriber which we conjectured might be \nlinked to churn. 
The variables included: subscriber location, credit classification, customer \nclassification (e.g., corporate versus retail), number of active services of various types, \nbeginning and termination dates of various services, avenue through which services were \nactivated, monthly charges and usage, number, dates and nature of customer service calls, \nnumber of calls made, and number of abnormally terminated calls. \n\n2.2 DATA REPRESENTATION \nAs all statisticians and artificial intelligence researchers appreciate, representation is key. \nA significant portion of our effort involved working with domain experts in the wireless \ntelecommunications industry to develop a representation of the data that highlights and \nmakes explicit those features which, in the experts' judgement, were highly related to \nchurn. To evaluate the benefit of carefully constructing the representation, we performed \nstudies using both a naive and a sophisticated representation. \n\nTABLE 1. Factors influencing subscriber satisfaction \n\nFactor | Importance | Nature of data required for prediction \ncall quality | 21% | network \npricing options | 18% | market, billing \ncorporate capability | 17% | market, customer service \ncustomer service | 17% | customer service \ncredibility / customer communications | 10% | market, customer service \nroaming / coverage | 7% | network \nhandset | 4% | application \nbilling | 3% | billing \ncost of roaming | 3% | market, billing \n\nThe naive representation mapped the 134 variables to a vector of 148 elements in a \nstraightforward manner. Numerical variables, such as the length of time a subscriber had \nbeen with the carrier, were translated to an element of the representational vector which \nwas linearly related to the variable value. 
We imposed lower and upper limits on the variables, \nso as to suppress irrelevant variation and so as not to mask relevant variation by too \nlarge a dynamic range; vector elements were restricted to lie between -4 and +4 standard \ndeviations of the variable. One-of-n discrete variables, such as credit classification, were \ntranslated into an n-dimensional subvector with one nonzero element. \n\nThe sophisticated representation incorporated the domain knowledge of our experts \nto produce a 73-element vector encoding attributes of the subscriber. This representation \ncollapsed across some of the variables which, in the judgement of the experts, could be \nlumped together (e.g., different types of calls to the customer service department), \nexpanded on others (e.g., translating the scalar length-of-time-with-carrier to a multidimensional \nbasis-function representation, where the receptive-field centers of the basis \nfunctions were suggested by the domain experts), and performed transformations of other \nvariables (e.g., ratios of two variables, or time-series regression parameters). \n\n3 PREDICTORS \nThe task is to predict the probability of churn from the vector encoding attributes of the \nsubscriber. We compared the churn-prediction performance of two classes of models: logit \nregression and a nonlinear neural network with a single hidden layer and weight decay \n(Bishop, 1995). The neural network model class was parameterized by the number of units \nin the hidden layer and the weight decay coefficient. We originally anticipated that we \nwould require some model selection procedure, but it turned out that the results were \nremarkably insensitive to the choice of the two neural network parameters; weight decay \nup to a point seemed to have little effect, and beyond that point it was harmful, and varying \nthe number of hidden units from 5 to 40 yielded nearly identical performance. 
We likely \nwere not in a situation where overfitting was an issue, due to the large quantity of data \navailable; hence increasing the model complexity (either by increasing the number of hidden \nunits or decreasing weight decay) had little cost. \n\nRather than selecting a single neural network model, we averaged the predictions of \nan ensemble of models which varied in the two model parameters. The average was uniformly \nweighted. \n\n4 METHODOLOGY \nWe constructed four predictors by combining each of the two model classes (logit regression \nand neural network) with each of the two subscriber representations (naive and \nsophisticated). For each predictor, we performed a ten-fold cross validation study, utilizing \nthe same splits across predictors. In each split of the data, the ratio of churn to no churn \nexamples in the training and validation sets was the same as in the overall data set. \n\nFor the neural net models, the input variables were centered by subtracting the means \nand scaled by dividing by their standard deviation. Input values were restricted to lie in the \nrange [-4, +4]. Networks were trained until they reached a local minimum in error. \n\n5 RESULTS AND DISCUSSION \n\n5.1 CHURN PREDICTION \nFor each of the four predictors, we obtain a predicted probability of churn for each subscriber \nin the data set by merging the test sets from the ten data splits. Because decision \nmaking ultimately requires a \"churn\" or \"no churn\" prediction, the continuous probability \nmeasure must be thresholded to obtain a discrete predicted outcome. \n\nFor a given threshold, we determine the proportion of churners who are correctly \nidentified as churners (the hit rate), and the proportion of nonchurners who are correctly \nidentified as nonchurners (the rejection rate). 
Plotting the hit rate against the rejection rate \nfor various thresholds, we obtain an ROC curve (Green & Swets, 1966). In Figure 2, the \ncloser a curve comes to the upper right corner of the graph (100% correct prediction of \nchurn and 100% correct prediction of nonchurn), the better the predictor is at discriminating \nchurn from nonchurn. The dotted diagonal line indicates no discriminability: If a predictor \nrandomly classifies x% of cases as churn, it is expected to obtain a hit rate of x% \nand a rejection rate of (100-x)%. \n\nAs the figure indicates, discriminability is clearly higher for the sophisticated representation \nthan for the naive representation. Further, for the sophisticated representation at \nleast, the nonlinear neural net outperforms the logit regression. It appears that the neural \nnet can better exploit nonlinear structure in the sophisticated representation than in the \nnaive representation, perhaps due to the basis-function representation of key variables. \nAlthough the four predictors appear to yield similar curves, they produce large differences \nin estimated cost savings. We describe how we estimate cost savings next. \n\n5.2 DECISION MAKING \nBased on a subscriber's predicted churn probability, we must decide whether to offer the \nsubscriber some incentive to remain with the carrier, which will presumably reduce the \nlikelihood of churn. The incentive will be offered to any subscriber whose churn probability \nis above a certain threshold. The threshold will be selected to maximize the expected \ncost savings to the carrier; we will refer to this as the optimal decision-making policy. 
\n\nThe cost savings will depend not only on the discriminative ability of the predictor, \nbut also on: the cost to the carrier of providing the incentive, denoted C_i (the cost to the \ncarrier may be much lower than the value to the subscriber, e.g., when air time is offered); \nthe time horizon over which the incentive has an effect on the subscriber's behavior; the \nprobability, as reduced by the incentive, that the subscriber will leave within the time horizon, \nP_i; and the lost-revenue cost that results when a subscriber churns, C_l. \n\n[Figure 2 plot: curves for neural net / sophisticated, logit regression / sophisticated, neural net / naive, and logit regression / naive; x axis: % churn identified (hit rate); y axis: % nonchurn identified.] \n\nFIGURE 2. Test-set performance for the four predictors. Each curve shows, for various \nthresholds, the ability of a predictor to correctly identify churn (x axis) and nonchurn (y axis). \nThe more bowed a curve, the better a predictor is at discriminating churn from \nnonchurn. \n\nWe assume a time horizon of six months. We also assume that the lost revenue as a \nresult of churn is the average subscriber bill over the time horizon, along with a fixed cost \nof $500 to acquire a replacement subscriber. (This acquisition cost is higher than the typical \ncost we stated earlier because subscribers in this data base are high valued, and often \nmust be replaced with multiple low-value subscribers to achieve the same revenue.) 
To \nestimate cost savings, the parameters C_i, P_i, and C_l are combined with four statistics \nobtained from a predictor: \n\nN(pL,aL): number of subscribers who are predicted to leave (churn) and who actually \nleave barring intervention \nN(pS,aL): number of subscribers who are predicted to stay (nonchurn) and who \nactually leave barring intervention \nN(pL,aS): number of subscribers who are predicted to leave and who actually stay \nN(pS,aS): number of subscribers who are predicted to stay and who actually stay \n\nGiven these statistics, the net cost to the carrier of performing no intervention is: \n\nnet(no intervention) = [N(pL,aL) + N(pS,aL)] C_l \n\nThis equation says that every subscriber who would actually leave does leave, whether or not \nchurn was predicted, and the cost per subscriber will be C_l. The net cost of providing an \nincentive to all subscribers who are predicted to churn can also be estimated: \n\nnet(incentive) = [N(pL,aL) + N(pL,aS)] C_i + [P_i N(pL,aL) + N(pS,aL)] C_l \n\nThis equation says that the cost of offering the incentive, C_i, is incurred for all subscribers \nwho are predicted to churn, but the lost-revenue cost will decrease to a fraction P_i for \nthe subscribers who are correctly predicted to churn. The savings to the carrier as a result \nof offering incentives based on the churn predictor is then \n\nsavings per churnable subscriber = \n[net(no intervention) - net(incentive)] / [N(pL,aL) + N(pS,aL)] \n\nThe contour plots in Figure 3 show expected savings per churnable subscriber, for a \nrange of values of C_i, P_i, and C_l, based on the optimal policy and the sophisticated neural-net \npredictor. Each plot assumes a different subscriber retention rate (= 1 - P_i) given intervention. \nThe \"25% retention rate\" graph supposes that 25% of the churning subscribers \nwho are offered an incentive will decide to remain with the carrier over the time horizon of \nsix months. 
For each plot, the cost of intervention (C_i) is varied along the x-axis, and the \naverage monthly bill is varied along the y-axis. (The average monthly bill is converted to \nlost revenue, C_l, by computing the total bill within the time horizon and adding the subscriber \nacquisition cost.) The shading of a region in the plot indicates the expected savings \nassuming the specified retention rate is achieved by offering the incentive. The grey-level \nbar to the right of each plot translates the shading into dollar savings per subscriber who \nwill churn barring intervention. Because the cost of the incentive is factored into the savings \nestimate, the estimate is actually the net return to the carrier. \n\nThe white region in the lower right portion of each graph is the region in which no \ncost savings will be obtained. As the graphs clearly show, if the cost of the incentive \nneeded to achieve a certain retention rate is low and the cost of lost revenue is high, significant \nper-subscriber savings can be obtained. \n\nAs one might suspect in examining the plots, what's important for determining per-subscriber \nsavings is the ratio of the incentive cost to the average monthly bill. The plots \nclearly show that for a wide range of assumptions concerning the average monthly bill, \nincentive cost, and retention rate, a significant cost savings is realized. \n\nThe plots assume that all subscribers identified by the predictor can be contacted and \noffered the incentive. If only some fraction F of all subscribers is contacted, then the estimated \nsavings indicated by the plot should be multiplied by F. 
\n\nTo pin down a likely scenario, it is reasonable to assume that 50% of subscribers can \nbe contacted, 35% of whom will be retained by offering an incentive that costs the carrier \n\n[Figure 3 plot residue: contour panels titled 25% retention rate and 35% retention rate; x axis: cost of intervention; y axis: average monthly bill.] \n\n