20. Tips
This chapter contains some general tips for performing data analysis. They are based on my own experience and will evolve over time.
20.1. Exploratory data analysis
- Understand all the variables in your data set.
- Separate out factor variables and numerical variables.
- Distinguish between response variables and independent variables.
- Look at the histogram of numerical variables.
- If a histogram doesn’t look normal, check whether a data transformation (for example, a log transform for right-skewed data) brings it closer to normal.
- Look at the Pearson correlations between numerical variables. Categorize each one as very weak, weak, moderate, strong, or very strong.
- Compute the Spearman rank correlation between a factor variable and other numerical/factor variables. This is meaningful only when the factor's levels have a natural order; for unordered factors, rely on the box plots described next.
- Draw box plots of numerical variables grouped by factor level. Examine them to see whether the distributions for different levels of a factor differ noticeably.
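The correlation step above can be sketched in a few lines. This is a minimal illustration in plain Python, not a prescribed method: the function names, the toy `height_cm` / `weight_kg` data, and especially the cut-offs in `STRENGTH_BANDS` are assumptions made for the example. Different fields use different thresholds for "weak" or "strong", so treat the bands as placeholders.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Assumed cut-offs on |r| (upper bound, label); adjust to your field's conventions.
STRENGTH_BANDS = [
    (0.2, "very weak"),
    (0.4, "weak"),
    (0.6, "moderate"),
    (0.8, "strong"),
]


def correlation_strength(r):
    """Map a correlation coefficient to one of five strength labels."""
    magnitude = abs(r)
    for upper, label in STRENGTH_BANDS:
        if magnitude < upper:
            return label
    return "very strong"


# Hypothetical toy data, for illustration only.
height_cm = [150, 160, 165, 172, 180, 185]
weight_kg = [52, 58, 63, 70, 79, 84]

r = pearson(height_cm, weight_kg)
print(f"r = {r:.3f} ({correlation_strength(r)})")
```

The label depends only on the magnitude of the coefficient, so a strong negative correlation is reported as strong; keep the sign alongside the label when you tabulate results.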
Handling NA data
- Make sure that you look at the raw data and identify the patterns used for entering NA values. They may appear as NA, na, a blank space, *, etc.
- Count the number of NA entries in each column of the data set.
- Identify variables with a very high percentage of NA entries. Consider whether to drop such a variable from further analysis altogether.
- If there are very few NA entries, one approach can be to eliminate the corresponding rows.
- One way of filling NA values is by computing median / mean of the corresponding variable and using that value in all NA slots for that variable.
- Alternatively, one can use the non-NA entries in the variable and fit a linear / non-linear model for that variable from other variables which have good quality data. Then, one can use this model to predict the NA entries.
- Make sure that your data set is cleaned of NA values before any serious modeling is done.
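The first few steps above (recognizing NA markers, counting them per column, and filling with the median) can be sketched as follows. This is an illustrative example in plain Python rather than a recommended pipeline: the `NA_MARKERS` set, the helper names, and the small `RAW` table are all assumptions made for the sketch. Inspect your own raw data before deciding which markers to treat as missing.

```python
import csv
import io
from statistics import median

# Assumed set of markers meaning "missing"; extend it after inspecting the raw data.
NA_MARKERS = {"", "na", "n/a", "*"}


def is_na(value):
    """True if a cell matches one of the assumed NA markers (case-insensitive)."""
    if value is None:
        return True
    return str(value).strip().lower() in NA_MARKERS


def na_counts(rows, columns):
    """Count NA entries in each column."""
    return {col: sum(is_na(row[col]) for row in rows) for col in columns}


def fill_with_median(rows, column):
    """Replace NA cells in a numeric column with the median of the observed values.

    Converts the column to float in place and returns the fill value.
    """
    observed = [float(row[column]) for row in rows if not is_na(row[column])]
    fill = median(observed)
    for row in rows:
        row[column] = fill if is_na(row[column]) else float(row[column])
    return fill


# Hypothetical raw data mixing several NA conventions.
RAW = """age,income,city
34,52000,Pune
NA,48000,Delhi
29,*,Mumbai
41,61000,
na,  ,Pune
"""

rows = list(csv.DictReader(io.StringIO(RAW)))
columns = list(rows[0].keys())

counts = na_counts(rows, columns)
for col in columns:
    share = 100 * counts[col] / len(rows)
    print(f"{col}: {counts[col]} NA ({share:.0f}%)")

fill_with_median(rows, "age")
```

The median is used here because it is less sensitive to outliers than the mean; swapping in `statistics.mean` gives the mean-based variant. Model-based imputation follows the same shape, with the fill value predicted per row instead of shared across the column.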