Exploratory Data Analysis

“If you know the enemy and know yourself, you need not fear the result of a hundred battles” is a popular quote from Sun Tzu, the author of The Art of War. Even though the quote may not sound relevant to everyday life these days, it can still be applied to almost everything, and it applies fully to building regression models. Knowing yourself and the enemy means that people who have the tools and the knowledge about themselves and their goal will achieve that goal. Exploratory Data Analysis is the first step toward knowing both the enemy (the target) and yourself (the data). Regression is an algorithm that finds an optimized line to fit the observed data. Since the line of best fit is based entirely on the data we feed into the machine, the data determines the quality of the regression model. That makes data exploration one of the most important processes in building a regression model. This article discusses the process of Exploratory Data Analysis (EDA).

The Engineering Statistics Handbook defines EDA as an approach/philosophy for data analysis that employs a variety of techniques to maximize insight into a data set, uncover underlying structure, extract important variables, detect outliers and anomalies, test underlying assumptions, develop parsimonious models, and determine optimal factor settings. It is the step that comes before, and prepares for, the data analysis itself. In summary, the first step is cleaning the data, the second is defining the relationships between features and variables, and the third is visualization.

For the Module 1 project, the ultimate goal is to come up with a multiple linear regression model. Since it will be a “linear” model, the EDA should focus on checking the linearity assumptions and finding irregular data among the features. There could be null values, NaN, special-case strings, and so on. It’s very important to find these during the EDA process so that the number-crunching part won’t be a headache later on. It is impossible to come up with a good model without well-cleaned data.

First step.

To achieve the things listed above, what tools does a data scientist have for dealing with a data frame? They can start by loading the data into a DataFrame using the pandas library.

import pandas as pd
df = pd.read_csv('data.csv')  # 'data.csv' is a placeholder for the path to your file

Once the data is loaded, commands such as df.shape and df.info() show how many rows and columns there are, the names of the features, and their data types. This gives a feel for the quantitative information in the raw data. How about nulls, NaN, or messed-up data points?
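As a rough illustration of that first look, here is a minimal sketch. The small DataFrame and its column names are made up for the example and stand in for whatever the CSV file actually contains.

```python
import pandas as pd

# Hypothetical stand-in for the loaded data; the column names are invented.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "bedrooms": [3, 3, 2, 4],
    "condition": ["good", "fair", "good", "good"],
})

n_rows, n_cols = df.shape          # (number of rows, number of columns)
feature_names = list(df.columns)   # names of the features
dtypes = df.dtypes                 # data type of each column

df.info()                          # prints columns, non-null counts, and dtypes
summary = df.describe()            # basic statistics for the numeric columns only
```

df.info() is usually the quickest single call, since it combines the column names, the non-null counts, and the data types in one printout.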

df.isna().sum()

That command counts the NaN (missing) values in each column. Note that odd string values, such as a placeholder like “?”, are not counted as missing, so they have to be looked for separately. Once null or odd values are found, they need to be replaced or removed before the data is analyzed or put through the model. Otherwise, the code is likely to fail and spit out an error. How about duplicated data?
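One possible way to handle such values is sketched below. The column names, the “?” placeholder, and the choice of filling with the median or with zero are all assumptions for illustration; the right replacement depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: one stored as strings with a "?" placeholder,
# one numeric column with real NaN values.
df = pd.DataFrame({
    "basement_area": ["0.0", "?", "400.0", "?", "910.0"],
    "waterfront": [0.0, np.nan, 1.0, 0.0, np.nan],
})

missing_before = df.isna().sum()   # "?" is a string, so it is not counted here

# Convert to numbers; anything that cannot be parsed (like "?") becomes NaN.
df["basement_area"] = pd.to_numeric(df["basement_area"], errors="coerce")
missing_after = df.isna().sum()    # now the former "?" entries are counted

# Replace the missing values (assumed strategy: median and a default of 0).
df["basement_area"] = df["basement_area"].fillna(df["basement_area"].median())
df["waterfront"] = df["waterfront"].fillna(0.0)
```

Comparing missing_before with missing_after shows how many placeholder strings were hiding in a column that looked complete.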

df[df.duplicated(subset='id')]

The line above takes care of that problem. It displays the rows whose id has already appeared earlier in the data, and the length of the result gives the number of duplicates. I was surprised to find 177 duplicated rows; I had not expected any duplicates at all.
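Counting and then dropping the duplicates could look like the following sketch. The data is invented, and keeping the first occurrence of each id is an assumption; keeping the last one may suit other data sets better.

```python
import pandas as pd

# Hypothetical data in which two ids appear more than once.
df = pd.DataFrame({
    "id": [101, 102, 101, 103, 102],
    "price": [300000, 450000, 320000, 500000, 450000],
})

# True for every row whose id was already seen in an earlier row.
dup_mask = df.duplicated(subset="id")
n_duplicates = int(dup_mask.sum())   # number of duplicated rows
duplicate_rows = df[dup_mask]        # the duplicated rows themselves

# Keep only the first row for each id (assumed policy).
deduped = df.drop_duplicates(subset="id", keep="first")
```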

Step two.

Step two is defining and understanding the relationships between the variables or features. These relationships matter in both simple and multiple regression, because without checking them the model could end up meaning nothing after all the effort of examining and cleaning the data. Strong correlation between features (multicollinearity) makes the model unreliable: it essentially counts the same information twice and skews the model toward certain features or variables. So, how do we detect these types of problems? One simple approach is to run an OLS model and check the p-values of the features. Normally, a p-value greater than 0.05 means the feature is statistically insignificant. So, features with high p-values could be dropped to make a better model. The code for running the OLS model is listed below.

The results look like the above. P-values are displayed under P>|t|. No p-value in the list is greater than 0.05.

Step three.

Step three is visualization. Many different types of charts and graphs are available through Python libraries that can be imported, such as matplotlib. There are many ways to check the relationships between features, but generating a heat map of the correlations is an easy one.
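A correlation heat map can be sketched with pandas and matplotlib as follows. The data is synthetic, the column names a, b, and c are hypothetical, and the viridis colormap is an assumption chosen so that lighter cells correspond to stronger correlation.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: b is almost a multiple of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=100),
    "c": rng.normal(size=100),
})

corr = df.corr()  # pairwise Pearson correlation between the columns

# Plot the absolute correlations; with viridis, lighter means stronger.
fig, ax = plt.subplots()
image = ax.imshow(corr.abs().to_numpy(), cmap="viridis", vmin=0, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(list(corr.columns))
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(list(corr.columns))
fig.colorbar(image, ax=ax, label="|correlation|")
```

In an interactive session, calling plt.show() afterwards displays the figure; the cell for a and b should stand out as the lightest one off the diagonal.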

The lighter areas are the ones with high correlation between the features. One thing to pay attention to is that transforming the data can change the correlation between features: a nonlinear transformation such as a log transform generally does, while a linear rescaling such as standardization or min-max scaling leaves the Pearson correlation unchanged. It is therefore better to decide which features to drop after any such transformations have been applied. Scatter plots and histograms can also be used to take a look at the data. Distributions and linearity can be verified through these graphs.
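A quick way to see whether a given transformation changes the correlation is to compute it before and after. The sketch below uses a tiny made-up data set and compares the Pearson correlation on the raw data, after standardization, and after a log transform.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows as the square of x.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]})
df["y"] = df["x"] ** 2

raw_corr = df["x"].corr(df["y"])

# Standardization is a linear rescaling of each column.
standardized = (df - df.mean()) / df.std()
standardized_corr = standardized["x"].corr(standardized["y"])

# A log transform is nonlinear.
log_corr = df["x"].corr(np.log(df["y"]))

print(raw_corr, standardized_corr, log_corr)
```

Here the first two numbers agree, while the third differs, so the order of transforming and dropping features matters mainly for nonlinear transformations.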

After the three steps above, the raw data has been turned into cleaned data. It is ready to be crunched to come up with models.