Clean, clean, clean before you crunch those big data sets

Anyone interested in how data science might transform education should read "The Dirty Little Secret of Big Data Projects." David Dietrich, an impressive data geek and consultant at EMC’s education unit who has been involved with a big data lab at MIT, wrote that 80% of your time on a data project will be spent on the tedious, unsexy task of cleaning up the data. People are often so excited to start crunching numbers that they skip proper cleaning and preparation, and end up with wrong answers.

If you want to try some data cleaning at home, Dietrich suggests that unsophisticated types (such as myself) tinker with two tools: OpenRefine (formerly Google Refine) and Data Wrangler (from Stanford).
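
For readers who would rather see what "cleaning" means before opening either tool, here is a minimal sketch in Python. It is not taken from Dietrich's article; the school names, scores, and the `clean_rows` helper are all made up for illustration. It shows three common chores: trimming stray spaces, making capitalization consistent, and dropping duplicate or unusable rows.

```python
# Hypothetical raw records: the same school typed three different ways.
raw_rows = [
    {"school": " Lincoln High ", "score": "82"},
    {"school": "lincoln high", "score": "82"},
    {"school": "LINCOLN HIGH", "score": "n/a"},
    {"school": "Roosevelt Middle", "score": " 75"},
]


def clean_rows(rows):
    """Normalize names, drop unusable scores, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Collapse stray whitespace and inconsistent capitalization.
        school = " ".join(row["school"].split()).title()
        score_text = row["score"].strip()
        # Skip rows whose score is not a whole number.
        if not score_text.isdigit():
            continue
        record = (school, int(score_text))
        # Keep only the first copy of each identical record.
        if record not in seen:
            seen.add(record)
            cleaned.append({"school": school, "score": record[1]})
    return cleaned


cleaned_rows = clean_rows(raw_rows)
print(cleaned_rows)
```

Four messy rows come out as two clean ones. Averaging the raw rows instead would have counted Lincoln High more than once, which is exactly the kind of wrong answer that skipping the cleanup produces.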


POSTED BY Jill Barshay ON April 22, 2013
