Inferential Statistics in Data Science

Image
  Experiment  →Uncertain situations, which could have multiple outcomes. A coin toss is an experiment. Outcome  → result of a single trial. So, if "head" lands, the outcome of the coin toss experiment is “Heads” Event  → one or more outcomes from an experiment. “Tails” is one of the possible events for this experiment. Basic Probability Chance of something happening, but in the academic term “likelihood of an event or sequence of events occurring”. for example Tossing a coin Rolling a dice Conditional Probability Probability of an event occurring given that another event has already occurred. for example Picking 3 blue balls from a box has 5 red and 5 blue balls. The probability of picking the first blue ball is 5/10 = 1/2. We’re left with 9 balls in total. So the probability of picking the second blue ball is 4/9. Similarly picking the 3rd blue ball from the box is 3/8. The final probability is 1/2 * 4/9 * 3/8 = 0.08333 or 8.3%. Probability Density function and Prob...

Hypothesis Testing in Data Science


  • Other names are AB testing, Confirmative Analysis, and significance testing.
  • Generally, population parameters (standard deviation, maximum, minimum, and so on) are unknown in real-time.
  • However, we do have hypotheses about what the true values are.
  • Hypothesis testing is a bunch of methods to evaluate the hypothesis about the population parameter based on the available sample parameters.
  • There are 2 terms in the hypothesis, they are null hypothesis and alternate hypothesis.


Null Hypothesis (H0):

  • A general statement about the population parameters which assumed to be true unless strong proof for the opposite statement.
  • The default statement is that there is no difference between the measured phenomenon, there is an association among groups.

Alternate Hypothesis (H1 or Ha):

  • Just the opposite of the null hypothesis.
  • the default statement is that there is a difference between the groups.

Testing types:

T-Test

Also called Student’s T-Test. If the sample size is less than 30 and the population variance is unknown we should use a t-test.

  • One-Sample T-Test
  • Two sample T-Test
  • Paired sample T-Test (Dependant T-Test)

One-Sample T-Test:

 

  • To check whether the mean of the population is significantly different from the sample mean or not.
  • Here the H0 is “there is no difference” and the H1 is “there is a significant difference”.
  • P-value should be less than 0.05 (5%) to accept (failed to reject) the alternate hypothesis.
  • To perform one-sample t-test in python we can use scipy.stats.ttest_1samp()

Note: dataset for the following program will be available in https://drive.google.com/file/d/1SD44pF68zADNS-0OuXCn_J_E2andQUV7/view?usp=sharing

  • By this result, we can come to the conclusion that the sample data is good and there is no difference between the sample mean and population mean.

Two-Sample T-Test:

  • To check if there is any relationship or association between two independent groups or not.
  • Here the H0 is “there is no difference”(association is there) and the H1 is “there is a significant difference” (no association).
  • P-value should be less than 0.05 (5%) to accept (failed to reject) the alternate hypothesis.
  • To perform two sample t-test in python we can use scipy.stats.ttest_ind()

Note: dataset for the following program will be available in https://drive.google.com/file/d/1SD44pF68zADNS-0OuXCn_J_E2andQUV7/view?usp=sharing

  • By this result, we can come to the conclusion that the salesprice of 1st and 2nd floor is not related (no relationship between the tested 2 independent variables).

Paired Sample T-Test:

  • This test is conducted between to dependent variables to check the difference in data.
  • Here the H0 is “there is no difference” and the H1 is “there is a significant difference”.
  • To perform paired sample t-test in python we can use scipy.stats.ttest_rel()


  • by this result, we can tell that this weight loss program is effective without any doubt.

Z-Test

If the sample size is greater than 30 and population variance is known we should use a z-test. And it is the same as the T-test. Data that is chosen for this z test should have data points that are independent of each other.

  • One Sample Z-Test
  • Two sample Z-Test

One Sample Z-Test:


  • Same as one sample T-Test but here we know the population variance and population standard deviation. And the data size should be more that 30.
  • To check whether the mean of the population is significantly different from the sample mean or not.
  • Here the H0 is “there is no difference” and the H1 is “there is a significant difference”.
  • to perform one sample z-test in python we can use statsmodels.stats.weightstats.ztest()


  • By this result, we can come to the conclusion that the sample data is good and there is no difference between the sample mean and population mean.

Two Sample Z-Test:


  • Same as two sample T-Test but here we know the population variance and population standard deviation. And the data size should be more that 30.
  • To check if there is any relationship or association between two independent groups or not.
  • Here the H0 is “there is no difference”(association is there) and the H1 is “there is a significant difference” (no association).
  • P-value should be less than 0.05 (5%) to accept (failed to reject) the alternate hypothesis.
  • To perform two sample z-test in python we can use statsmodels.stats.weightstats.ztest()

  • By this result, we can come to the conclusion that the sales price of the 1st and 2nd floor is not related (no relationship between the tested 2 independent variables).

F-Test

Also called ANOVA test.

  • Two sample t-test or two sample z-test can able to validate the hypothesis containing only 2 groups at a time. But if we want to perform with more than 2 groups, there this ANOVA test will give its hand to do that.
  • ANOVA determines whether the means of 3 or more groups are different or not.
  • It uses F-Test to test the equality of the means.
Note: here the degree of freedom is calculated differently for numerator and the denominator. (df for between groups = no. of groups -1 and df for within groups = Total no. of observations - no. of groups)

One-way ANOVA:

  • This test is conducted when we have only one independent variable and other as dependent variables.
  • This is implemented in scipy by as f_oneway().

Two-way ANOVA:

  • We perform one-way ANOVA between one independent and other dependent variables but in this two-way ANOVA, we will perform between two independent variables and two plus groups.
  • It’s the extension of one-way ANOVA.
  • It will not tell which variable is dominant. if we want to know about that we can perform Post-hoc testing.

Comments

  1. sir,
    What are the advantages of version control?

    ReplyDelete
    Replies
    1. It makes it easier to keep track of your entire project file collection. It enables you to create multiple versions of your project file. and you can access any of your older versions whenever you want.

      Delete
  2. its amazing , thank you for yours effort to let readers to know hypothesis

    ReplyDelete
  3. Nice.... Thanks for your efforts

    ReplyDelete

Post a Comment