Data Science

5 statistical concepts you need to know before your next data science interview

During my own data science job search, I was lucky enough to interview with many companies.

These interviews were a combination of technical and behavioral rounds, along with a series of take-home assessment tasks.

Through this process, I did extensive research on which questions typically come up in data science interviews. Below are concepts you should not only be familiar with, but also know how to explain.

1. p-value

Image by author

When running a statistical test, you will usually have a null hypothesis (H0) and an alternative hypothesis (H1).

Suppose you are running an experiment to determine the effectiveness of a weight-loss medication. Group A takes a placebo and Group B takes the medication. You then calculate the average pounds lost in each group over six months and want to see whether the weight loss in Group B is statistically significantly higher than in Group A. In this case, the null hypothesis H0 is that there is no statistically significant difference in mean pounds lost between the two groups, meaning the drug has no real effect on weight loss. The alternative hypothesis H1 is that there is a difference: Group B loses more weight because of the medication.

To review:

  • H0: average pounds lost in Group A = average pounds lost in Group B
  • H1: average pounds lost in Group A < average pounds lost in Group B

You then run a t-test to compare the means, which gives you a p-value. This can be done in Python or other statistical software. Before you compute the p-value, however, you first choose an alpha (α) value, also known as the significance level, to compare the p-value against.

The typical α value is 0.05, which means the probability of a Type I error (claiming there is a difference in means when there is none) is 0.05, or 5%.

If your p-value is less than alpha, you reject the null hypothesis. If it is greater than or equal to alpha, you fail to reject the null hypothesis.
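The workflow above can be sketched with a two-sample t-test in Python using `scipy`. The weight-loss numbers below are made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical pounds lost over six months in each group
group_a = rng.normal(loc=5, scale=2, size=50)   # placebo
group_b = rng.normal(loc=8, scale=2, size=50)   # medication

# One-sided two-sample t-test: is the Group B mean greater than Group A's?
t_stat, p_value = stats.ttest_ind(group_b, group_a, alternative="greater")

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```

With simulated means of 8 vs. 5 pounds, the test comfortably rejects H0; with real data, the conclusion depends entirely on what you observe.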

2. z-score (and other outlier detection methods)

The z-score measures how many standard deviations a data point is from the mean, and it is one of the most common outlier detection methods.

To understand the z-score, you need to understand basic statistical concepts such as:

  • Mean – the average of a set of values
  • Standard deviation – a measure of the spread of values in a dataset relative to the mean (it is also the square root of the variance). In other words, it shows how far apart the values in the dataset are from the mean.

A z-score of 2 indicates that a value is two standard deviations above the mean. A z-score of -1.5 indicates that a value is 1.5 standard deviations below the mean.

Usually, data points with a z-score greater than 3 or less than -3 are considered outliers.

Outliers are a common problem in data science, so it is important to know how to identify them and deal with them.
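The rule above takes only a few lines of NumPy. The numbers here are made up, and because the sample is tiny the maximum attainable z-score is capped well below 3, so this demo uses a lower threshold:

```python
import numpy as np

def zscore_outliers(data, threshold=3.0):
    """Return the points more than `threshold` standard deviations from the mean."""
    data = np.asarray(data, dtype=float)
    z = (data - data.mean()) / data.std()
    return data[np.abs(z) > threshold]

values = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier
print(zscore_outliers(values, threshold=2.0))
```

Note that a single extreme value inflates both the mean and the standard deviation it is judged against, which is one motivation for the IQR and modified z-score methods mentioned below.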

To learn more about some other simple outlier detection methods, check out my article on z-score, IQR, and modified Z-score:

3. Linear regression

Image by author

Linear regression is one of the most basic ML and statistical models, and understanding it is essential to succeeding in any data science role.

At a high level, linear regression models the relationship between independent variables and a dependent variable, using the independent variables to predict the value of the dependent variable. It does this by fitting a "best fit line" to the dataset that minimizes the sum of squared differences between the actual and predicted values.

An example is modeling the relationship between temperature and electricity consumption. When measuring a building's electricity consumption, temperature usually affects usage: since electricity is often used for cooling, as the temperature rises, the building uses more energy to cool its space.

Therefore, we can use a regression model to capture this relationship, where the independent variable is temperature and the dependent variable is consumption (because usage depends on temperature, not the other way around).

Linear regression outputs an equation of the form y = mx + b, where m is the slope of the line and b is the y-intercept. To predict y, you plug an x value into the equation.

Linear regression makes four assumptions about the underlying data, which can be remembered with the acronym LINE:

L: Linear relationship between the independent variable x and the dependent variable y.

I: Independence of residuals. The residuals do not affect each other. (A residual is the difference between an actual value and the value predicted by the line.)

N: Normally distributed residuals. The residuals follow a normal distribution.

E: Equal variance of residuals across different x values.

For linear regression, the most common performance metric is R², which tells you the proportion of variance in the dependent variable that can be explained by the independent variables. An R² of 1 indicates a perfect linear relationship, while an R² of 0 indicates the model has no predictive power. A good R² tends to be 0.75 or higher, but it also depends on the type of problem you are trying to solve.
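Fitting y = mx + b and computing R² can be sketched with NumPy. The temperature and consumption figures below are hypothetical, roughly following the cooling example:

```python
import numpy as np

# Hypothetical data: outdoor temperature (°F) vs. building electricity use (kWh)
temperature = np.array([60, 65, 70, 75, 80, 85, 90, 95])
consumption = np.array([200, 215, 235, 260, 290, 310, 335, 360])

# Fit y = m*x + b by least squares
m, b = np.polyfit(temperature, consumption, deg=1)

# R²: proportion of variance in y explained by the fitted line
predicted = m * temperature + b
ss_res = np.sum((consumption - predicted) ** 2)
ss_tot = np.sum((consumption - consumption.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"y = {m:.2f}x + {b:.2f}, R^2 = {r_squared:.3f}")
```

The positive slope m quantifies how many additional kWh each extra degree costs, and plugging any new temperature into the equation gives a predicted consumption.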

Linear regression is different from correlation. Correlation between two variables gives you a number between -1 and 1 that tells you the strength and direction of their relationship. Regression gives you an equation that predicts future values based on a line of best fit through past values.

4. Central limit theorem

The Central Limit Theorem (CLT) is a fundamental concept in statistics. It states that, regardless of the original distribution of the data, the distribution of sample means approaches a normal distribution as the sample size grows.

A normal distribution (also known as a bell curve) is a symmetric, bell-shaped statistical distribution; the standard normal distribution specifically has a mean of 0 and a standard deviation of 1.

CLT is based on these assumptions:

  • The data points are independent
  • The population has a finite variance
  • The sampling is random

A sample size of ≥ 30 is usually considered the minimum for the CLT to hold. The larger the sample size, the more the distribution of sample means resembles a bell curve.

The CLT allows statisticians to use the normal distribution to make inferences about population parameters even when the underlying population is not normally distributed. It forms the basis of many statistical methods, including confidence intervals and hypothesis tests.
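A quick simulation makes the theorem concrete: start from a heavily skewed (exponential) population, which looks nothing like a bell curve, and watch the sample means behave normally anyway. All numbers below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# A heavily skewed population -- nothing like a bell curve
population = rng.exponential(scale=2.0, size=100_000)

# Draw many random samples of size n and record each sample's mean
n, n_samples = 30, 5_000
sample_means = np.array([
    rng.choice(population, size=n).mean() for _ in range(n_samples)
])

# CLT: the sample means cluster around the population mean,
# with spread approximately sigma / sqrt(n)
print(f"population mean:      {population.mean():.3f}")
print(f"mean of sample means: {sample_means.mean():.3f}")
print(f"std of sample means:  {sample_means.std():.3f} "
      f"(theory: {population.std() / np.sqrt(n):.3f})")
```

Plotting a histogram of `sample_means` would show the familiar bell shape, even though a histogram of `population` is sharply skewed.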

5. Overfitting and underfitting

Image by author

When a model underfits, it fails to capture the patterns in the training data. As a result, it performs poorly not only on the training dataset but also on unseen data.

How to know if a model is underfitting:

  • The model has high error on the training, cross-validation, and test sets

When a model overfits, it has learned the training data too closely. Essentially, it has memorized the training data and is very good at predicting it, but it fails to generalize when predicting new, unseen values.

How to know if a model is overfitting:

  • The model has low error on the training set, but high error on the test and cross-validation sets

Also:

Underfitting models have high bias.

Overfitting models have high variance.

Finding a good balance between the two is called the bias-variance tradeoff.
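Both failure modes can be demonstrated by fitting polynomials of increasing degree to hypothetical data generated from a quadratic relationship plus noise (all numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy data from an underlying quadratic relationship
x = np.linspace(-3, 3, 40)
y = x**2 + rng.normal(scale=1.0, size=x.size)

# Hold out every fourth point as a test set
test_mask = np.arange(x.size) % 4 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

def errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 2, 10):
    train_mse, test_mse = errors(degree)
    print(f"degree {degree:2d}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```

Degree 1 underfits (high error everywhere, high bias), degree 2 matches the true pattern, and a high degree like 10 tends to show the overfitting signature: training error keeps dropping while test error no longer improves (high variance).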

in conclusion

This is by no means a comprehensive list. Other important topics for review include:

  • Decision tree
  • Type I and Type II Errors
  • Confusion matrix
  • Regression and classification
  • Random Forest
  • Train/test split
  • Cross-validation
  • ML Lifecycle

Here are some of my other posts covering many basic ML and statistical concepts:

It is normal to feel overwhelmed when reviewing these concepts, especially if you haven't seen many of them since your data science courses in school. What matters most is making sure you're up to date on what's most relevant to your own experience (e.g., the basics of time series modeling, if that's your specialty) and having at least a basic understanding of the other concepts.

Also, remember that the best way to explain these concepts in an interview is through an example, weaving the relevant definitions into your answer as you walk the interviewer through your case. This will also help you remember everything better.

Thank you for reading

  • Contact me on LinkedIn
  • Buy me coffee to support my work!
  • I now offer 1:1 data science tutoring, career coaching/guidance, writing advice, resume reviews and more on Topmate!
