Google Refine is another open source project from Google that deserves attention -- quite a lot of it. This is a great tool for cleaning messy data sets and for performing advanced operations on your data, such as transformation from one format to another or data augmentation.
Google Refine is a new tool from Google that you will love if you need to work with data sets. Sure, there are many programs for working with data sets -- even Microsoft Excel or OpenOffice.org Calc can do lots of tricks for you and make your data crunching life easier. But if you have a really messy set of data, full of all sorts of inconsistencies, then Excel and Calc will cost you a lot of time (days or even weeks, if the data set is huge) before you get your data into usable shape.
Of course, Microsoft Excel and OpenOffice.org Calc aren't meant to be professional data tools, and you can't expect miracles from them in handling your data. But what do you do when you have GBs of data and you need to clean it, preferably as soon as possible? Turn to Google Refine, of course!
Freebase Gridworks Grew Into Google Refine
The concept behind Google Refine isn't new. In fact, much of the code and the implementation aren't new either, because Google Refine is the reincarnation of Freebase Gridworks. Google acquired Metaweb, the company behind Freebase and the open source Freebase Gridworks, in July, and just a few months later the first Google-branded version of the software is out for grabs. Google Refine starts right at version 2.0, and the source, as well as binary downloads for many platforms, is available from the project's download page.
The Sky Is the Limit for Data Operations with Google Refine
For anybody who has dealt with messy raw data, Google Refine is a blessing. The screencasts do a marvelous job of explaining some of the common operations you can perform with Google Refine. Cleaning messy data is just one function you will like; data transformation (from one format into another) and data augmentation are two other groups of features you will appreciate in Google Refine.
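To make the idea of "cleaning messy data" more concrete, here is a minimal Python sketch of one common cleanup step: grouping inconsistent spellings of the same value and replacing them with a single form. This is not Google Refine's own code, and the function names `fingerprint` and `canonicalize` are made up for this illustration. It only shows the kind of inconsistency such a tool helps you get rid of, assuming that values differing only in case, punctuation, accents, or word order are meant to be the same.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a normalized key so near-duplicate spellings collide.

    Hypothetical helper for illustration; not taken from Google Refine.
    """
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Ignore surrounding whitespace and letter case.
    text = text.strip().lower()
    # Split on punctuation/whitespace, drop empties, de-duplicate, and sort.
    tokens = sorted({t for t in re.split(r"\W+", text) if t})
    return " ".join(tokens)


def canonicalize(values):
    """Replace every value with the most frequent spelling in its group."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    # Pick the most common original spelling for each fingerprint.
    best = {key: Counter(group).most_common(1)[0][0]
            for key, group in groups.items()}
    return [best[fingerprint(v)] for v in values]


if __name__ == "__main__":
    messy = ["Acme Inc.", "acme inc", "Acme Inc.", " ACME, Inc ", "Globex"]
    print(canonicalize(messy))
```

In a spreadsheet you would have to hunt down each of these variants by hand; a dedicated tool lets you review such groups and merge them in one step.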
Google Refine is definitely a very advanced tool, but it is easy to use, and even non-techies can use it to perform complex data manipulation tasks. Every operation can be undone, so you can experiment without fear of ruining your data.
Another nice feature in Google Refine 2.0 is that you can extend your current data set with data from external sources, such as Web services, and your original data will still be intact after all these operations.
You can also link records from Google Refine to external databases, for instance Freebase. All these features make Google Refine a very useful tool for anybody who has work with data sets on the agenda.