Know Thy Data: Some Key Steps Before the Data Dive

Posted by HR Analytics on Wednesday, January 16, 2013 Under: Data
A critical but rarely recognized step in data analysis comes before the analysis itself: preparing the dataset.  It is an inherent part of the work, yet it is often overlooked and underestimated.  When we hear about the effort that data analysis requires, we picture rigorous statistical work, but the careful, thorough review of the dataset that should come first is easily skipped.  Before you begin the analysis, it is good practice to devote some time to getting to know your data.  You can end up spending half of your total analysis time preparing the data.

The data preparation step consists of several chores.  First, familiarize yourself with the dataset’s overall structure and contents.  This includes knowing the file format and what tool you are going to use to analyze it.  Is it an Excel, CSV, text, SPSS, or SAS file?  Next, take an inventory of the dataset and assess the quality of the data, identifying which variables are available for sound analysis.  Familiarizing yourself with the dataset can also help with report planning and with developing your hypotheses and research questions.  As you dive into the dataset, you might already have specific questions or hypotheses in mind and simply be seeking to pluck the answers out of the data.  But your probing can turn out to be multifaceted: as you examine the variables, you may develop interesting new queries along the way and discover other relevant and important questions you had not thought of.

Each dataset is unique, so consider looking at your data and doing some data profiling.  Data profiling is a very important step that is also often underestimated and overlooked.  It is essentially the process of assessing the dataset.  It is especially helpful for analysts trying to estimate the level of commitment they need to make, so that they can give their boss a better and more realistic delivery time.

Data profiling usually involves the following steps:

    • Explore the dataset and gather information about the data; this could be as simple as eyeballing the dataset
    • Check the quality of the data (do you have a lot of missing responses, skipped items, etc.?)
    • Run some basic statistics to describe the data (some descriptive stats)
    • Inventory the variables
      • See what would make sense to analyze or what to include/exclude in the analysis
      • See what format the data entries are in (numeric or text characters)
    • Weigh the response rate against the population size
    • Eyeball the overall structure of the data
      • What is in the columns? What is in the rows?
      • Would it be helpful to compute totals or averages for the columns or rows?
    • Check the distribution of the data  
      • In Excel, you can use the filter option or pivot tables; if you have SAS, you can run some quick and easy procedures such as PROC FREQ, PROC MEANS, PROC PLOT, etc.
      • Examine the variability in participant demographics; break it down by various meaningful demographic variables
        • For example, do you have more males than females? (Differences like these might be important factors in the analysis.)
    • Assess how many values are missing and what the impact of that could be.  There is a lot more to this topic, and it merits its own separate discussion, so I’ll save that for another blog post.
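
As a rough illustration of the checklist above, here is a minimal Python sketch that uses only the standard library. The sample rows, the column names (id, gender, age, cholesterol), and the profile helper are all made up for this example and do not come from any real dataset. It simply counts missing values, guesses whether each column is numeric or text, and computes a few descriptive statistics or frequency counts. The same checks could be done with Excel filters or the SAS procedures mentioned above.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical sample data, invented purely for illustration.
SAMPLE_CSV = """id,gender,age,cholesterol
1,F,34,180
2,M,45,
3,F,,210
4,M,52,240
5,F,29,195
"""


def is_number(text):
    """Return True if the text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def profile(rows):
    """Summarize each column: missing count, format, and basic statistics."""
    summary = {}
    for column in rows[0].keys():
        values = [(row[column] or "").strip() for row in rows]
        present = [v for v in values if v != ""]
        info = {"missing": len(values) - len(present)}
        if present and all(is_number(v) for v in present):
            numbers = [float(v) for v in present]
            info["format"] = "numeric"
            info["min"] = min(numbers)
            info["max"] = max(numbers)
            info["mean"] = statistics.mean(numbers)
        else:
            info["format"] = "text"
            # Frequency counts, similar in spirit to a pivot table.
            info["counts"] = dict(Counter(present))
        summary[column] = info
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile(rows)
for column, info in report.items():
    print(column, info)
```

With a real file, you would replace the in-memory sample with something like open("your_file.csv", newline="") and pass that to csv.DictReader. Treating every empty cell as missing is an assumption of this sketch; your data may mark missing responses differently (for example with "NA" or 999).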

These are just some things to be aware of before diving into the data analysis.  These steps are underemphasized because they often seem like common sense.  But these efforts are absolutely critical and should be given due diligence.  Initial checks can reveal a need for further probing or more advanced statistical analysis to make sure the dataset you are working with is clean and reliable.  That is important because the rest of your analytical efforts rely on these initial steps. So take the time and effort to know thy data before diving into it.


The screenshot below is a quick example of profiling a small dataset in SAS. You'll also see how valuable visualization can be (and how easy it is to generate). Here, plotting makes it possible to identify the range, the distribution, and differences in the cholesterol numbers.
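
If you do not have SAS at hand, a quick look at range and distribution can be approximated in a few lines of Python. The sketch below is not the SAS plot from the screenshot. It is a plain-text histogram, and the cholesterol readings and the 20-point bin width are invented for illustration only.

```python
from collections import Counter

# Hypothetical cholesterol readings, invented for illustration only.
cholesterol = [180, 210, 240, 195, 202, 188, 251, 215, 199, 223]


def text_histogram(values, bin_width):
    """Group values into equal-width bins and return {bin_start: count}."""
    bins = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(bins.items()))


bin_width = 20
histogram = text_histogram(cholesterol, bin_width)

print("range:", min(cholesterol), "to", max(cholesterol))
for start, count in histogram.items():
    # One '#' per observation falling in the bin.
    print(f"{start}-{start + bin_width - 1}: {'#' * count}")
```

Even a crude chart like this makes it easy to spot where most values cluster and whether any readings sit far from the rest.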
 


In the comments section below, please feel free to share how much time and effort you devote to prepping the data before the actual analysis. What do these efforts entail?



Tags: data profiling 

Copyright © 2012 Analytics-HR. All Rights Reserved