Exploratory data analysis (EDA) is used to analyze and investigate data sets and summarize their main characteristics numerically as well as visually.

The primary aim of exploratory data analysis is to examine the data's distribution, outliers and anomalies so that you can direct specific tests of your hypothesis. It lets you look at the data before making any assumptions. It can help you identify obvious errors, better understand patterns within the data and find interesting relationships among the variables.

10 Quick Steps in Data Exploration and Preprocessing

  1. Identification of variables and data types: This step helps in understanding whether each variable is numeric or categorical. Within numeric, we can check whether it is discrete (it results from counting, for example the number of employees in a line of business) or continuous (it can take any value within a range, such as the daily expenses of a household). Within categorical, we can check whether the variable is ordinal (the categories can be ordered logically, for example a customer's rating of product satisfaction) or nominal (the levels cannot be ordered, such as gender: male, female).
  2. Analyzing the basic metrics: This includes understanding the data and the possible strategies that could be employed to measure those variables.
  3. Non-graphical univariate analysis: This process explores the data numerically by looking at the summary statistics of each variable. Summary statistics include measures such as the minimum, maximum, mean, 25th percentile, 50th percentile (median), 75th percentile and count.
  4. Graphical univariate analysis: This is a detailed study of each variable used in the analysis. The variables are explored graphically using histograms, box plots, etc. to understand the distribution of the data.
  5. Bivariate analysis: This involves taking two variables at a time and assessing their correlation. It also includes plotting each independent variable against the dependent variable to see whether it influences the dependent variable significantly.
  6. Variable transformations: A transformation is a mathematical operation that changes the measurement scale of a variable. This is usually done to make the data set usable with a particular statistical test or method. Many statistical methods assume that the data follow a particular kind of distribution, usually a normal distribution.
  7. Missing value treatment: Missing data in your data set can reduce the power or fit of a model, or can lead to a biased model and therefore to wrong predictions or classifications. Missing values can be imputed using various methods or algorithms. Basic imputation replaces missing values with the mean, median or mode, depending on the data type and distribution.
  8. Outlier treatment: Outliers are extremely low or high values in your data set. They are usually identified with a box plot: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are commonly considered outliers, where IQR = Q3 - Q1 is the interquartile range. That said, outliers must be handled with care and should be excluded, kept or imputed only after careful consideration. This task becomes easier if you have strong domain knowledge and know the metrics used for the analysis.
  9. Correlation analysis: Correlation analysis quantifies the degree to which two variables are related. It yields a correlation coefficient that tells you how strongly one variable tends to change when the other does. The commonly used Pearson coefficient measures only the linear relationship between two variables.
  10. Dimensionality reduction: Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. There are many ways to reduce the dimension. One of the most popular is Principal Component Analysis (PCA).
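
The summary statistics mentioned under non-graphical univariate analysis can be computed with a few lines of Python. The following is only a minimal sketch using the standard library; the function name `summarize` and the `daily_expenses` values are made up for illustration, and the "inclusive" quartile method is one of several common conventions.

```python
import statistics


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    # method="inclusive" treats the data as the whole population, so the
    # quartiles never fall outside the observed minimum and maximum.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
        "mean": statistics.mean(values),
    }


# Hypothetical data: daily expenses of a household over nine days.
daily_expenses = [12, 15, 11, 14, 13, 16, 15, 12, 18]
summary = summarize(daily_expenses)

for name, value in summary.items():
    print(f"{name:>5}: {value}")
```

If the data are already in a pandas DataFrame, `DataFrame.describe()` reports a similar set of measures per column.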
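
Missing value treatment and the box-plot rule for outliers can be sketched in the same way. This example assumes missing values are encoded as `None` and imputes them with the median; the helper names and the `ages` data are hypothetical, and whether a flagged value should actually be dropped remains a domain decision.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_bounds(values):
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR) for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Return the values that fall outside the IQR bounds."""
    low, high = iqr_bounds(values)
    return [v for v in values if v < low or v > high]


# Hypothetical data with one missing entry and one extreme value.
ages = [23, 25, None, 27, 24, 26, 95, 25]

filled = impute_median(ages)
outliers = find_outliers(filled)

print("imputed: ", filled)
print("bounds:  ", iqr_bounds(filled))
print("outliers:", outliers)
```

The median is used here rather than the mean because it is barely affected by the extreme value, so the imputed entry stays close to the typical observations.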
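
The link between variable transformations and correlation analysis can be illustrated with a small example. The sketch below computes the Pearson coefficient by hand and applies a log transformation to a made-up variable `y` that grows exponentially with `x`; the data and the choice of a log transform are assumptions chosen for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: y doubles every time x increases by one.
x = [1, 2, 3, 4, 5, 6]
y = [2 ** k for k in x]

# The log transformation turns the exponential curve into a straight line.
log_y = [math.log(v) for v in y]

r_raw = pearson(x, y)
r_log = pearson(x, log_y)

print(f"correlation with y:      {r_raw:.3f}")
print(f"correlation with log(y): {r_log:.3f}")
```

The coefficient is noticeably lower for the raw values than for the transformed ones, even though `y` is fully determined by `x` in both cases, because the Pearson coefficient only captures the linear part of a relationship.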
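
For dimensionality reduction, PCA can be sketched as an eigendecomposition of the covariance matrix. The example below assumes NumPy is installed and uses synthetic three-dimensional data that mostly varies along a single direction; the function name `pca` and the data are illustrative, and library implementations may differ in their details.

```python
import numpy as np


def pca(data, n_components):
    """Project the rows of `data` onto its top principal components.

    Returns the projected data and the fraction of the total variance
    explained by each retained component.
    """
    centered = data - data.mean(axis=0)
    # Covariance matrix of the features (each column is one variable).
    cov = np.cov(centered, rowvar=False)
    # eigh is intended for symmetric matrices; eigenvalues are ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    explained = eigenvalues[order] / eigenvalues.sum()
    return centered @ components, explained


# Hypothetical 3-D data that mostly varies along one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
data = np.column_stack([t, 2 * t, -t]) + 0.05 * rng.normal(size=(200, 3))

projected, explained = pca(data, n_components=1)

print("projected shape:", projected.shape)
print("explained variance ratio:", explained)
```

Because the three columns are almost perfectly dependent on one another, a single component retains nearly all of the variance, which is the situation in which reducing the dimension loses little information.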

 
