During exploratory data analysis (EDA), one would often check the descriptive statistics of the data and handle missing values before modelling. Descriptive statistics summarize the characteristics of a dataset, often with a single value such as the mean, standard deviation, or variance. It is very important that descriptive statistics be evaluated during EDA and before modelling.
Descriptive statistics alone, however, are not enough to describe data or to decide which type of machine learning model to use.
Anscombe’s Quartet is a set of four datasets, each consisting of two variables, x and y, and 11 observations. The quartet, created by Francis Anscombe in 1973, is very popular because it shows the importance of visualizing and graphing data before analysis and before building machine learning models. The four datasets have approximately the same simple statistics, but they look surprisingly different when plotted. Let’s examine these data using Python to see the importance of data visualization.
First, we will import the Python libraries we need for this analysis, and then scrape the Anscombe’s quartet data from Wikipedia.
We will then view the four datasets. They are as shown below:
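For readers who want to follow along offline, here is a minimal sketch that builds the same table by typing in the published quartet values instead of scraping them. The swap is deliberate: the layout of the Wikipedia table may change over time, so the scraping step itself is not reproduced here. The column names `x1`, `y1`, and so on are assumptions chosen for this sketch.

```python
import pandas as pd

# Published values of Anscombe's quartet, entered by hand so that the
# sketch runs without network access (the post scrapes them instead).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III

quartet = pd.DataFrame({
    "x1": x123,
    "y1": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "x2": x123,
    "y2": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "x3": x123,
    "y3": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "x4": [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
    "y4": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
})

# View all 11 observations of the four datasets side by side.
print(quartet)
```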
Next, we will check the datatype of the data before going further with our analysis.
Since all the columns have the object datatype and the first row of the data contains the strings ‘x’ and ‘y’, we will drop the first row and then convert the datatype to float. The implementation and output are as shown below:
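The cleaning step can be sketched as follows. The `raw` frame is a hypothetical stand-in for the scraped table, with only three observations of two datasets and made-up column labels; what matters is that every cell is a string and the first row repeats the ‘x’ and ‘y’ labels.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: all cells are strings and
# the first row holds the 'x'/'y' labels rather than observations.
raw = pd.DataFrame(
    [
        ["x", "y", "x", "y"],
        ["10.0", "8.04", "10.0", "9.14"],
        ["8.0", "6.95", "8.0", "8.14"],
        ["13.0", "7.58", "13.0", "8.74"],
    ],
    columns=["I", "I.1", "II", "II.1"],  # assumed labels, not from the post
)
print(raw.dtypes)  # string-typed columns before cleaning

# Drop the label row, renumber the index, and convert every column to float.
clean = raw.drop(index=0).reset_index(drop=True).astype(float)
print(clean.dtypes)
print(clean)
```

Calling `astype(float)` on the whole frame works here because every remaining cell parses as a number; a stray non-numeric cell would raise an error, which is a useful early warning.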
For ease of analysis, we will split the data into two datasets, X and y, where X will contain each x variable as a column and y will contain each y variable as a column. Data X and y are as shown below:
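One way to make this split, assuming the columns are named `x1`, `y1`, `x2`, `y2`, and so on (a naming choice made for this sketch), is to select columns by prefix. Only the first three observations of datasets I and II are used to keep the example short.

```python
import pandas as pd

# Small sample: first three observations of datasets I and II, as floats.
# The column names are assumptions made for this sketch.
df = pd.DataFrame({
    "x1": [10.0, 8.0, 13.0], "y1": [8.04, 6.95, 7.58],
    "x2": [10.0, 8.0, 13.0], "y2": [9.14, 8.14, 8.74],
})

X = df.filter(regex=r"^x")  # every column whose name starts with "x"
y = df.filter(regex=r"^y")  # every column whose name starts with "y"

print(X)
print(y)
```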
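The descriptive statistics usually considered when working with Anscombe’s quartet are the mean and variance of x and of y, the correlation between x and y, and the fitted linear regression line.
We implemented these statistics in Python and they are shown below: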
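A minimal sketch of such a computation is given here, using NumPy and the published quartet values. The choice of statistics (mean, sample variance, Pearson correlation, and a least-squares line) follows the usual presentation of the quartet and is an assumption about what the original implementation covered.

```python
import numpy as np

# Published values of Anscombe's quartet.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
    summary[name] = {
        "mean_x": x.mean(),
        "var_x": x.var(ddof=1),             # sample variance
        "mean_y": y.mean(),
        "var_y": y.var(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],    # Pearson correlation
        "slope": slope,
        "intercept": intercept,
    }

for name, stats in summary.items():
    print(name, {key: round(float(value), 3) for key, value in stats.items()})
```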
From the descriptive statistics, we see that they are nearly identical across the four datasets.
However, when we visualize the data, we see a different story.
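The four scatter plots can be reproduced with a sketch along these lines, assuming matplotlib is installed. The 2x2 layout, the red least-squares line, and the output filename are illustrative choices, not necessarily those of the original figures.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to display interactively
import matplotlib.pyplot as plt
import numpy as np

# Published values of Anscombe's quartet.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
line_x = np.array([3.0, 20.0])  # x-range over which to draw each fitted line

for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
    slope, intercept = np.polyfit(x, y, 1)  # same straight-line fit for each panel
    ax.scatter(x, y)
    ax.plot(line_x, slope * line_x + intercept, color="red")
    ax.set_title(f"Dataset {name}")
    ax.set_xlabel("x")
    ax.set_ylabel("y")

fig.tight_layout()
fig.savefig("anscombe_quartet.png")  # hypothetical output filename
```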
It turns out that though these data have the same descriptive statistics, they are actually distributed differently and have different graphical patterns.
Fitting models on data without visualizing it first can lead to heavy bias, as we can see in the plots above. For example, x2 and y2 would be better modeled with a polynomial curve.
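To illustrate the point about x2 and y2, the sketch below fits a straight line and a degree-2 polynomial to the second dataset and compares their sums of squared residuals. The choice of degree 2 is an assumption made for this example; the residuals of the quadratic fit come out far smaller than those of the line.

```python
import numpy as np

# Second dataset of Anscombe's quartet (published values).
x2 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def sse(degree):
    """Sum of squared residuals of a least-squares polynomial of the given degree."""
    coeffs = np.polyfit(x2, y2, degree)
    residuals = y2 - np.polyval(coeffs, x2)
    return float(np.sum(residuals ** 2))

sse_linear = sse(1)     # straight line
sse_quadratic = sse(2)  # degree-2 polynomial

print(f"linear SSE:    {sse_linear:.4f}")
print(f"quadratic SSE: {sse_quadratic:.6f}")
```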
Descriptive statistics are very important but cannot by themselves tell the whole story about data. Data visualization is very important and should be done before modelling. It may give us a clue about the kind of model to use so that we do not build a biased model, and it may also reveal outliers or influential points in our data.