Data science is an emerging field that does not belong to any single subject or discipline. Instead, it draws on several fields of science and research, such as mathematics, statistics, and programming.
Statistics plays an essential role in data science, as most of the analysis is done with statistical methods. In today’s article, we are going to talk about why statistics is required in the data science field. We will discuss 5 major topics of statistics that a data scientist must learn.
The following are the topics an aspiring data scientist must know:
- Data Properties
- Conditional Probability
- Univariate Analysis
- Multivariate Analysis
- MECE [Mutually Exclusive Collectively Exhaustive]
Let’s discuss them one by one briefly:
Data is simply a set of values that can give us information. When dealing with data, we focus on two of its properties:
a. Central Tendency: Central tendency is a single number or point that represents the center, or middle, of a given set of data. It can be expressed as the mean, median, or mode.
b. Dispersion: Dispersion is the extent to which the given data is spread around the chosen central tendency. It can be expressed as the range, variance, or standard deviation.
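To make these two properties concrete, here is a minimal sketch using Python's built-in `statistics` module. The sample values are made up for illustration, and the population versions of variance and standard deviation are used, which is an assumption rather than the only valid choice.

```python
import statistics

# Hypothetical sample, chosen only so the numbers come out round
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Central tendency: three ways to describe the "middle" of the data
mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

# Dispersion: how far the values spread around the center
data_range = max(data) - min(data)     # largest value minus smallest value
variance = statistics.pvariance(data)  # population variance
std_dev = statistics.pstdev(data)      # population standard deviation

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", data_range, "variance:", variance, "std dev:", std_dev)
```

If your data is a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` (which divide by n - 1) are usually the better fit.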
You may remember Bayes’ theorem from your school days, which is built on conditional probability. Conditional probability is the probability of one event occurring given that another event has already occurred. Bayes’ theorem uses it to update the probability of an event as new evidence comes in, which matters because predictions are an important part of the diverse field of data science.
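As a rough illustration of conditional probability at work, the sketch below applies Bayes' theorem to a medical test. The function name `bayes_posterior` and all the numbers (prevalence, test accuracy) are hypothetical and exist only for this example.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A | B) using Bayes' theorem.

    prior               -- P(A), e.g. the chance a person has the disease
    likelihood          -- P(B | A), e.g. the chance of a positive test if sick
    false_positive_rate -- P(B | not A), the chance of a positive test if healthy
    """
    # Total probability of the evidence B, over both cases (A and not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% of people are sick, the test catches 95% of
# sick people, and it wrongly flags 5% of healthy people.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(sick | positive test) = {posterior:.3f}")
```

With these assumed numbers, the probability of being sick after a positive test is only about 0.16, because healthy people far outnumber sick ones. This is the kind of update from prior to posterior that predictions rely on.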
Univariate analysis is pretty simple, as it involves only one dimension and thus only one variable. It is usually done to summarize the results that you have collected. Its results are usually presented as frequency distribution tables, bar charts, histograms, or pie charts.
Suppose you are looking into the precautions people take against a disease, say, the common cold. You now have a count of people who take medicines, use home remedies, drink special herbs, wear woolens, and so on. The dimension here is the same, “cold”; only the count for each precaution is different.
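The cold example can be turned into a small frequency distribution table. This is a minimal sketch: the survey responses are invented for illustration, and `collections.Counter` is just one convenient way to count them.

```python
from collections import Counter

# Hypothetical survey: each entry is one person's precaution against a cold
responses = [
    "medicine", "home remedy", "woolens", "medicine",
    "herbal drink", "medicine", "home remedy", "woolens",
]

# One variable (the precaution), so this is a univariate summary
frequency = Counter(responses)
total = len(responses)

# Print a simple frequency distribution table, most common first
print(f"{'Precaution':<14}{'Count':>6}{'Share':>8}")
for precaution, count in frequency.most_common():
    print(f"{precaution:<14}{count:>6}{count / total:>8.0%}")
```

The same counts could feed a bar chart or pie chart; the table is simply the most direct form of the summary.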
When more than two dimensions are involved in the study, we call it multivariate analysis. It has a wider scope and is mainly measured on three factors, namely:
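In the initial stages, one might confuse covariance with correlation, but there is a significant difference between them. Covariance shows the direction in which two variables move together and depends on their units, while correlation is the covariance scaled by the standard deviations of both variables, so it always lies between -1 and 1.
Taking the example from the previous points, suppose we were studying more than two diseases; that would be a multivariate analysis.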
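The difference between covariance and correlation is easier to see with numbers. The sketch below computes both by hand on made-up data; the helper names `covariance` and `correlation` and the sample values are assumptions for illustration only.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data: hours studied and the score obtained
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]
score_scaled = [s * 10 for s in score]  # same scores on a 10x larger scale

# Covariance changes with the scale of the data...
print("covariance:", covariance(hours, score), covariance(hours, score_scaled))
# ...while correlation stays the same and lies between -1 and 1
print("correlation:", correlation(hours, score), correlation(hours, score_scaled))
```

Rescaling one variable multiplies the covariance by the same factor but leaves the correlation untouched, which is why correlation is the easier of the two to compare across datasets.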
MECE [Mutually Exclusive Collectively Exhaustive]
Mutually exclusive means that only one of the concerned events can happen at a time, and collectively exhaustive means that at least one of the concerned events is guaranteed to happen.
Don’t get confused! See this example: Suppose you roll a die. Now, what are the possible outcomes? 1, 2, 3, 4, 5, and 6. Only one of these outcomes can happen, since no more than one side can show at a time. Hence they are mutually exclusive. Now, aside from these 6 numbers, is there any probability of another number coming up as an outcome? Or is there any probability that once you roll the die, no outcome comes up at all? NO! One of these possible outcomes will happen. Hence they are collectively exhaustive.
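The die example can also be checked in code. This is a minimal sketch in which each event is a set of outcomes; the function name `is_mece` and the example groupings are hypothetical.

```python
def is_mece(events, sample_space):
    """Check whether a list of events (sets of outcomes) is MECE."""
    seen = set()
    for event in events:
        # An outcome shared with an earlier event breaks mutual exclusivity
        if seen & event:
            return False
        seen |= event
    # Collectively exhaustive: together the events cover every outcome
    return seen == sample_space


sample_space = {1, 2, 3, 4, 5, 6}  # all faces of a six-sided die

single_faces = [{face} for face in sample_space]  # each face is its own event
odd_even = [{1, 3, 5}, {2, 4, 6}]                 # another MECE grouping
overlapping = [{1, 2, 3}, {3, 4, 5, 6}]           # 3 appears twice
incomplete = [{1, 2}, {3, 4}]                     # 5 and 6 are never covered

print(is_mece(single_faces, sample_space))  # True
print(is_mece(odd_even, sample_space))      # True
print(is_mece(overlapping, sample_space))   # False
print(is_mece(incomplete, sample_space))    # False
```

The last two groupings fail for different reasons: the first is not mutually exclusive, and the second is not collectively exhaustive.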
To conclude, all these 5 topics have different scopes and uses in data science. As an aspiring data scientist, one must know all of them to form a basic understanding of how to analyze a given set of data.