Descriptive Statistics in Data Science

  1. Measure of Central Tendency
  2. Measure of Spread
  3. Dependence

Measure of Central Tendency:

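Mean → Arithmetic average of a set of data points (sum of the values divided by the number of values).

Median → Middle element of the data points once they are sorted in ascending order. With an even number of points, it is the average of the two middle elements.

Mode → The data point that appears most often in a set of data points.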
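
To make the three measures concrete, here is a minimal sketch using Python's built-in statistics module. The sample values are made up purely for illustration.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(data)      # sum of values / number of values
median_value = statistics.median(data)  # even count, so the two middle values (3 and 5) are averaged
mode_value = statistics.mode(data)      # the value that occurs most often

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

Note that the median (4) is not itself a member of the data here, because the sample has an even number of points.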


Measure of Spread:

Standard Deviation (SD) → Typical distance between the mean and the data points, calculated as the square root of the variance.

Variance → Average of the squared distances between each value in the data set and the mean (square of SD).

Range → Maximum value minus Minimum value from a set of data points.
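
As a rough sketch, the three spread measures can be computed with the statistics module. The sample is hypothetical, and the population formulas (divide by n) are used here. When the data is a sample from a larger population, statistics.variance and statistics.stdev (divide by n - 1) are the usual choice.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(data)  # mean of squared deviations from the mean
std_dev = statistics.pstdev(data)      # square root of the variance
data_range = max(data) - min(data)     # maximum minus minimum

print("variance:", variance)
print("standard deviation:", std_dev)
print("range:", data_range)
```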

Percentile → Position of a value within a dataset sorted in ascending order. The p-th percentile is the value below which roughly p% of the observations fall.

Quartiles (Q1, Q2, Q3) → Divide a complete data set, sorted in ascending order, into 4 quarters. Q1, Q2, and Q3 are the 25th, 50th, and 75th percentiles of the dataset. Q2 is the median value of the dataset (fig 1).

fig 1: Quartiles and Percentiles
Interquartile Range (IQR) → Q3 - Q1

fig 2: Boxplot
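Outliers → Observations that fall below the MINIMUM or above the MAXIMUM marked on the boxplot (fig 2). These marks are the whisker limits, not necessarily the smallest and largest values in the data. There are many ways to define outliers mathematically. One common method uses the IQR: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as outliers.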
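
The quartiles, IQR, and an IQR-based outlier check can be sketched as follows. The data is hypothetical, and the 1.5 × IQR fences are an assumption: they are a widely used convention, not the only possible definition. Quartile values also depend on the interpolation method, and "inclusive" is just one option offered by statistics.quantiles.

```python
import statistics

# Hypothetical sample with one suspiciously large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: fences sit 1.5 * IQR beyond Q1 and Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```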

Skewness:

Measure of asymmetry of the probability distribution of the dataset.
  • Positive Skewness
  • Negative Skewness
  • Zero (No Skewness)
Positive Skewness → Right skewed. The tail on the right side of the distribution is longer than the one on the left. The mean is typically greater than the median, and the median is typically greater than the mode.
Negative Skewness → Left skewed. The tail on the left side of the distribution is longer than the one on the right. The mean is typically less than the median, and the median is typically less than the mode.
No Skewness → For a symmetric distribution with a single peak, the mean, median, and mode are equal. One side of the curve is a mirror image of the other.
Note: If skewness is less than -1 or greater than 1, the distribution is highly skewed. If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed. If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
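
A minimal sketch of how skewness could be computed and labelled with the thresholds from the note above. It assumes the moment-based (Fisher-Pearson) coefficient, which is one of several skewness definitions. The helper names and sample data are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def describe_skew(value):
    """Label a skewness value using the rule-of-thumb thresholds from the note."""
    if abs(value) > 1:
        return "highly skewed"
    if abs(value) > 0.5:
        return "moderately skewed"
    return "approximately symmetric"


# Hypothetical samples, chosen only for illustration.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]      # long tail on the right
left_skewed = [-x for x in right_skewed]      # mirror image, long tail on the left
symmetric = [1, 2, 3, 4, 5]

for name, sample in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    s = skewness(sample)
    print(name, round(s, 3), describe_skew(s))
```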

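Kurtosis (k) → Measure of how heavy the tails of a probability distribution are compared with a normal distribution. Here k is the excess kurtosis (kurtosis minus 3), so a normal distribution has k = 0.

  • Mesokurtic → Tails similar to a normal distribution (k=0)
  • Leptokurtic → Thick tails (k>0)
  • Platykurtic → Thin tails (k<0)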

Note: Skewness measures the degree of asymmetry of a distribution, whereas kurtosis measures how heavy its tails are, and therefore how prone it is to outliers.
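
As an illustration, excess kurtosis can be computed from the second and fourth central moments. This sketch assumes the simple population formula (fourth moment divided by the squared second moment, minus 3). The function name and both samples are hypothetical.

```python
def excess_kurtosis(values):
    """Fourth central moment / (second central moment ** 2), minus 3 for the normal baseline."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3


# Hypothetical samples, chosen only for illustration.
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # most mass at the centre, two extreme values
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]            # evenly spread, no extreme values

print("heavy-tailed:", excess_kurtosis(heavy_tailed))  # positive: leptokurtic
print("flat:", excess_kurtosis(flat))                  # negative: platykurtic
```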

Dependence:

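Correlation (+1.0 to -1.0) → Statistical relationship between two random variables or bivariate data. The correlation coefficient measures the strength and direction of the linear relationship between them.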

  • Positive correlation
  • Negative correlation
  • No correlation

Positive Correlation (0 to 1) → As one variable gets larger, the other tends to get larger too. A correlation coefficient of 1 means a perfect positive linear relationship: the points fall exactly on an upward-sloping straight line, whatever its slope.

Negative Correlation (-1 to 0) → As one variable gets larger, the other tends to get smaller. A correlation coefficient of -1 means a perfect negative linear relationship: the points fall exactly on a downward-sloping straight line.

No Correlation (0) → There is no linear relationship between the 2 variables: an increase in one is not associated with a consistent increase or decrease in the other. A coefficient of zero does not rule out a non-linear relationship.
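
The three cases can be checked with a small hand-written Pearson correlation function. This is a sketch with made-up data, and the function name is hypothetical. The last example is included to show that a coefficient of 0 only rules out a linear relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y, scaled by the spread of each."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
rising = [2 * v + 1 for v in x]   # straight line with slope 2
falling = [-3 * v for v in x]     # straight line with slope -3

centered = [-2, -1, 0, 1, 2]
squared = [v ** 2 for v in centered]  # fully determined by x, but not linearly

print("rising:", pearson_r(x, rising))
print("falling:", pearson_r(x, falling))
print("squared:", pearson_r(centered, squared))
```

The rising and falling lines give coefficients of 1 and -1 even though their slopes are 2 and -3, because the coefficient reflects how tightly the points follow a line, not how steep it is.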

Thank you
Ram Thiagu

Linkedin https://www.linkedin.com/in/ram-thiagu/ 
