This essentially translates to about 1,054 features. So now I'm talking analytics, all right, which should interest you guys. I've gone from marketing to analytics right now. So what does this translate to as an analytics problem? Now I have 1,054 features about all these guys, and I need to figure out which of these features drive different conversion behaviors, okay? The features I'm talking about are the demographics, for example, whatever age and region they belong to, sex, ethnicity, and tons of other information. The credit information could be the fact that they've taken a mortgage, that they've probably moved house in the last six months, what their repayment ratio is. Auto information could be the fact that they own X number of cars, or that they own some specific type of car, or that they're in the market looking to buy a new car, et cetera.
I said, all right, these are all relevant information when you're trying to predict whether he will become an auto insurance customer for me. Now, our problem, though, is basically that we need to frame this in terms of a conditional probability. Does it look like a conditional probability problem? Right. So essentially it boils down to a conditional probability, which is Bayes' theorem. But how do I approach this problem? You would first decide a performance window, let's say the last twelve months.
Let's say I already have close to 228 million records. In the last one year, let's say about 600,000 have life: they bought life in this last 12-month time period, and at the start of the 12-month time period they were already holding an auto policy. So that's the cross-sell. Obviously you guys understand what a cross-sell is, right? What we're doing is trying to cross-sell life to an existing auto policy customer, which is why we put in all these filtering conditions. Among the existing 228 million, you want to marry that with the prior transaction history which we have, which tells us, with Allstate as a company, how many folks already held an auto policy at the start of the 12 months and in the last 12 months have also bought life. So that becomes my outcome or target population, on whom I want to build my look-alike profiles, because the rest of the guys would have to look similar to these, right?
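The filtering just described can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names (`id`, `auto_start`, `life_start`), the window dates, and the choice to drop customers who already held life before the window are all hypothetical and not from the talk.

```python
from datetime import date

def label_cross_sell(records, window_start, window_end):
    """Return {customer_id: 0 or 1} for customers eligible for the cross-sell model.

    Eligible = already holding an auto policy when the performance window opens.
    Label 1  = bought life inside the window, 0 = did not.
    """
    labels = {}
    for rec in records:
        auto = rec.get("auto_start")   # hypothetical field: auto policy start date
        life = rec.get("life_start")   # hypothetical field: life policy start date
        # Cross-sell filter: auto policy must predate the window.
        if auto is None or auto > window_start:
            continue
        # Assumption (not stated in the talk): skip people who already had life.
        if life is not None and life < window_start:
            continue
        bought_in_window = life is not None and window_start <= life <= window_end
        labels[rec["id"]] = int(bought_in_window)
    return labels

records = [
    {"id": "a", "auto_start": date(2015, 3, 1), "life_start": date(2017, 6, 15)},
    {"id": "b", "auto_start": date(2016, 1, 10), "life_start": None},
    {"id": "c", "auto_start": date(2017, 5, 1), "life_start": date(2017, 8, 1)},
    {"id": "d", "auto_start": date(2014, 1, 1), "life_start": date(2015, 1, 1)},
]
labels = label_cross_sell(records, date(2017, 1, 1), date(2017, 12, 31))
print(labels)
```

Customer "c" drops out because the auto policy started inside the window, and "d" drops out because of the assumed exclusion for prior life holders.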
So somebody who's bought auto before the twelve months and bought life in the last one year, what does he look like in terms of his demographics, his credit history, his auto market affinity? That is the problem we're trying to solve. And this obviously becomes a classification problem, because what you're trying to model is the event, whether he will buy or not buy, right? So it's a binary outcome classification. We build classification models, and you can obviously start with logistic regression. Have you guys finished your class on the various methodologies for solving classification problems? Discriminant analysis would mostly be for feature reduction. You can probably use that because you have 1,054 features and you want to reduce them to a usable number. You can apply discriminant, but logistic makes sense, right? Essentially any tree would make sense too. Now, what we've done is we've gone beyond trees.
We've looked at higher forms of trees in terms of bagging and boosting: random forests, which you must have heard of, GBM, which is gradient boosting, then there is GBM on steroids, which is called XGBoost, and then there is SVM, and the highest form is neural networks, right? We've not applied neural networks, because I think it's computationally very intensive, but we've applied pretty much everything else to build our models. So what we're doing today is, see, at this point my modeling team and the US business team do not focus so much on an intuitive solution. A tree will give you a set of rules, right? A logistic regression would give you the weights of the coefficients and the direction in which the coefficients influence your final output, right?
Let's say mortgage pops up as one of the significant factors when you build your logistic, and you have a beta, a positive coefficient, and you know that the higher the mortgage, the higher the chances that he will buy insurance. That's how you would interpret it, right? But we don't care about that. All we care about is the fit. There are different model fit parameters, and for any classification problem you would obviously build out your confusion matrices. You understand the confusion matrix, right? And you would look at the AUC, the area under the ROC curve. AUC is a very generalized measure of how well a model predicts the binary outcome.
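To make the two fit measures concrete, here is a minimal pure-Python sketch. The 0.5 threshold, the function names, and the toy labels and scores are illustrative assumptions, and the AUC is computed by the pairwise definition (the share of buyer/non-buyer pairs in which the buyer gets the higher score), which is only practical on small samples.

```python
def confusion_matrix(y_true, scores, threshold=0.5):
    """Count true/false positives and negatives at a given score cutoff."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data, made up for illustration only.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.2, 0.3]
cm = confusion_matrix(y_true, scores)
auc_value = auc(y_true, scores)
print(cm, round(auc_value, 3))
```

An AUC of 0.5 means the ranking is no better than random, and 1.0 means every buyer is scored above every non-buyer.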
From a business point of view, what we do as well is translate that. The outcome of this model is basically a probability score, right, which says this is the probability that he will convert. The model itself is probably built on a smaller sample subset, not the entire 600,000. You can probably take 100,000 and build out the model. You would do your training versus testing to make sure the fit parameters of training versus test are very similar, that we're not overfitting and all that, right? Once we do that, I want to translate this to my entire population, essentially apply the model. But I won't apply it on my entire 228 million.
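The training-versus-testing check can be sketched as below. The 70/30 split, the fixed seed, the 0.02 tolerance, and both helper names are assumptions for illustration. The talk only says the training and test fit parameters should be "very similar".

```python
import random

def split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle row indices reproducibly and split them into train and test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def looks_overfit(train_metric, test_metric, tolerance=0.02):
    """Flag a model whose training fit beats its test fit by more than tolerance."""
    return (train_metric - test_metric) > tolerance

train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))

# Illustrative metric values only, not results from any real model.
print(looks_overfit(0.80, 0.79))
print(looks_overfit(0.95, 0.70))
```

The metric passed to `looks_overfit` could be AUC or any other fit measure, as long as the same one is computed on both splits.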
I will apply it, among the 228 million, to those who've had an auto policy, because that's your cross-sell filter criterion. Say that among the 228 million, 100 million already have auto. This is my target population, because these are the people I want to cross-sell life to. So on the 100 million I run my model, essentially I apply the model, and I get probability scores assigned to each of those 100 million records. Everybody gets a probability score, right? Next, what we do is build out a decile plot, which is nothing but rank ordering these guys based on their probability scores.
Then I divide them into ten groups, because that's what deciles are. Typically it'll look like this. The random line is nothing but the existing probability: of the hundred million, how many in the past have had life, and let's say that's about three percent, right? But my top decile, second decile and third decile are all above that three percent, which is giving you a conversion of, say, six percent. So these three are my best three deciles, the ones I want to target from an operationalizing point of view. That's how the business makes use of the output of the model. In terms of operationalizing, this is how things would happen: we would decide that these are my top three segments. Now the model, obviously, so when you build a GBM, right.
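The decile ranking described here can be sketched as a small lift table: sort by score, cut into ten equal groups, and compare each group's conversion rate with the overall (random) rate. The function name and the toy data are made up for illustration and are not the speaker's numbers.

```python
def decile_table(y_true, scores):
    """Rank by score (highest first), cut into 10 groups, report rate and lift."""
    n = len(scores)
    if n != len(y_true) or n < 10:
        raise ValueError("need matching inputs with at least 10 records")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    baseline = sum(y_true) / n  # the "random" line: overall conversion rate
    table = []
    for k in range(10):
        chunk = ranked[k * n // 10:(k + 1) * n // 10]
        rate = sum(y for _, y in chunk) / len(chunk)
        lift = rate / baseline if baseline > 0 else float("nan")
        table.append({"decile": k + 1, "rate": rate, "lift": lift})
    return table

# Toy data: 100 customers, 10 buyers, most of them with high scores.
buyers = {99, 98, 97, 96, 95, 85, 84, 70, 30, 5}
scores = [i / 100 for i in range(100)]
y_true = [1 if i in buyers else 0 for i in range(100)]

table = decile_table(y_true, scores)
for row in table[:3]:
    print(row)
```

In this toy data the top three deciles hold 8 of the 10 buyers, which is the kind of concentration that justifies targeting only those segments.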