When we build models from a population by taking a sample, we do not take many trials. But to prove the theory we need a sufficiently large number of trials; "large" can be a hundred, five hundred, a thousand, or even ten thousand. In the end, our objective here is to prove the theory. From a practical standpoint, the number of trials is not important for us. Why not? Because in practice we are going to take one sample and build the model from it. The practically important questions are: how many observations should that sample have, and how close should its estimate be to the population estimate?
Depending on how close it should be to the population estimate, my n will be determined. For the central limit theorem I don't think I have put the code in a Python notebook, so I'll use Spyder, and I will ask you people to use Spyder too; if you are not comfortable with Spyder, you can copy-paste the code into a Python notebook. Now just open the file; I'll wait for you to open it, and I would like all of you to execute the code. Believe me, this was taught to us during our school and college days, but unfortunately our professors did not have computers at that time, so they could not simulate it. We have the power of computers today. What Lindeberg and Lévy discovered long back, probably 50 or 60 years ago or more, I don't know exactly, they discovered with whatever sophistication of technology was available then; it deserves huge admiration that they could do it without the power of computers.
Today what we are going to do is simulate it and find out, so I would request all of you to run it; believe me, this will not be taught with this much clarity anywhere. Do all of you have the file open? Good. First, we require three packages, so I am just going to import them. My data file is on the E drive; there is an income-expense data file which I have already shared. Having set my working directory, I am going to import that data file. I am not going to run the head command because it does not show the data nicely; I am just going to double-click and open the data instead. I hope you have all been able to import the file. What I have given is sample data from a survey: 50 observations from a household survey where we asked each household their monthly household income, their monthly household expense, how many family members they have, whether they are staying on rent or paying an EMI, who the highest-qualified member in the family is, and how many earning members the family has. These are the basic details we captured.
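A minimal sketch of this setup in pandas; the file name Income_Expense.csv and its location are assumptions, so substitute your own path:

```python
# Three packages used throughout the demonstration
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the 50-observation household survey (hypothetical file name/path)
data = pd.read_csv("E:/Income_Expense.csv")
print(data.shape)   # expect 50 rows, one column per survey question
```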
The number of observations is only 50, so the index runs from 0 to 49, because the index starts at zero. We can keep moving down; all of the earlier sections are not required. Somewhere around line number 120 of the script I have put a comment, "central limit theorem". Actually, I think this would be much better: all of you put your laptop lids down first, and I'll run the code so your focus is on the screen. Then you all run it yourselves, because when everyone here runs it on their laptops and each of you says "I got the same number", that will be a further validation. Now to the simulation: I am creating an empty data frame, and here I have a loop. What is this loop doing? For i in range(1, 1001): the loop executes a thousand times, with 1001 itself excluded. In each iteration I take a random sample of size 30, so over a thousand iterations I will take thirty random observations from the population of 50 observations each time.
Then I keep appending them one below the other, which means that after the loop has executed I'll have 1,000 × 30, that is, 30,000 observations. Someone asked whether replace was set: the point is that each sample is independent of the others; the second sample is independent of the first, the third is independent of the second, and so on. From the 50 observations I take a sample of 30, then again from the 50 I take a sample of 30, then again, a thousand times. So how many samples have I got? A thousand. And each sample has how many observations? Thirty. So 1,000 × 30 is 30,000, and I have a dataset of 30,000 records ready. This is what my dataset looks like: see, here is the sample number; sample number one, then sample number two, sample number three, and so on, and each sample has 30 observations. The sample number is a repeated value indicating which sample each observation belongs to.
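A sketch of the resampling loop described above, assuming the loaded frame is called data; whether replace=True was used is not clear from the recording, so this keeps the pandas default of sampling without replacement within each sample:

```python
# Empty frame to collect all the samples, stacked one below the other
samples = pd.DataFrame()

for i in range(1, 1001):          # 1001 is excluded, so the loop runs 1,000 times
    s = data.sample(n=30)         # 30 random rows from the 50-row population
    s = s.assign(sample_no=i)     # tag every row with its sample number
    samples = pd.concat([samples, s], ignore_index=True)

print(samples.shape)              # 1,000 samples x 30 rows = 30,000 records
```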
Now I am running a group-by at the sample-number level, grouping five numeric columns. I took only the numeric columns because the non-numeric ones, like highest qualification, cannot be averaged; for those you have to work with proportions. This part of the code does the group-by. How many records will I have here? A thousand. For each numeric variable, at the sample-number level, I now have the mean: the mean of the income, the mean of the expense, the mean of the family members, and so on; basically the mean at the sample level. Then I take all these thousand sample means and again compute their mean, and that becomes my mean of means: for income a mean of means, for expense a mean of means, and so on. So I will have only five values. Why only five? Because these are the only five numeric columns I took. Finally, I am just resetting the index and giving the columns the names S_variable and S_mean.
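A sketch of the group-by step; the five numeric column names here are assumptions standing in for the actual survey fields:

```python
# Assumed names for the five numeric survey columns
num_cols = ["income", "expense", "family_members",
            "emi_rent", "earning_members"]

# One mean per numeric column per sample: a 1,000 x 5 table of sample means
sample_means = samples.groupby("sample_no")[num_cols].mean()

# Mean of the 1,000 sample means, giving one value per variable
s_means = sample_means.mean().reset_index()
s_means.columns = ["S_variable", "S_mean"]
```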
So that it looks slightly better. Now, at the population level, for the entire population of 50 observations, I directly compute the mean for the whole data, reset the index, and give the columns the names P_variable and P_mean. I execute these three lines of code, and there they are: the population variable names, each with its mean at the population level. I am now concatenating the sample-means dataset with the population-means dataset; having concatenated them, I have the sample mean and the population mean side by side.
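A sketch of the population-mean step and the concatenation, reusing the assumed names from above:

```python
# Population mean computed directly over all 50 observations
p_means = data[num_cols].mean().reset_index()
p_means.columns = ["P_variable", "P_mean"]

# Column-wise concatenation: each row pairs one variable's
# mean of sample means with its population mean
sp_means = pd.concat([s_means, p_means], axis=1)
```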
I am dividing it by the population mean. What should this number come to? One, in the ideal case, because what we are saying is that the mean of means is equal to the population mean, so the ratio should come to one. So I have got a new dataset, SP_means, the sample and population means, and I will remove all of the intermediate objects. Here you see the sample variable name, the population variable name, the two means, and the ratio. The objective of keeping both name columns is to ensure that I have merged the columns properly.
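The final check, continuing the sketch; the ratio should be close to 1 for every numeric variable if the mean of sample means matches the population mean:

```python
# Central limit theorem check: mean of sample means / population mean ~ 1
sp_means["ratio"] = sp_means["S_mean"] / sp_means["P_mean"]
print(sp_means)
```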