Open Refine is a tool for working with messy data. It is a free, open source desktop application that offers several ways of cleaning data as well as transforming data from one format to another. This can be useful when working with data from the ’live’ web and when processing data in near real-time.
For instance, in TANT-Lab we work a lot with text data from the web. Such data often contain spelling mistakes and other irregularities that we need to work around. One Open Refine feature that helps us with this task is its set of clustering algorithms, which can collapse similar data points into one.
An example of such a use comes from a TANT-Lab project on visualizing discussions about the Danish schooling system on Facebook (for more on this project click here). It turned out that similar discussions and speaker perspectives were referred to in several ways across the dataset. The picture below illustrates how Open Refine was used to identify such irregularities in near real-time. More specifically, it shows how key collision methods were used to spot that the word ’leader’ was written in multiple ways, and how Refine was used to collapse all these data points into one.
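To make the idea of key collision more concrete, here is a minimal Python sketch of the general technique. It is not the code Open Refine runs, and the function names `fingerprint` and `cluster` are our own hypothetical choices. The assumption is that each value is reduced to a simplified key (trimmed, lowercased, stripped of punctuation, with its words sorted and de-duplicated), and that values sharing the same key are treated as variants of one another.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a simplified key (hypothetical helper)."""
    # Fold accented characters to plain ASCII where a decomposition exists;
    # characters with no ASCII equivalent are dropped in this sketch.
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Trim surrounding whitespace and ignore case.
    text = text.strip().lower()
    # Remove punctuation, keeping only word characters and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Sort and de-duplicate the words so that word order does not matter.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values whose keys collide; return only groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Made-up example values, not taken from the project dataset.
variants = ["Leader", "leader ", "LEADER.", "teacher"]
print(cluster(variants))
```

Each group returned by `cluster` could then be collapsed into a single spelling chosen by the analyst, which is roughly the decision the clustering dialog asks the user to make.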
Another interest that guides our work in TANT-Lab is the possibility of following the ’native’ language of the web. For instance, part of the challenge in the school project was to isolate hashtags from Facebook discussions and visualize them. Since Facebook does not offer a tool for doing that, we used Open Refine to turn a column containing full posts into a column containing only hashtags. As shown in the picture below, we simply wrote a command that did the job for us on all data rows simultaneously.
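The same kind of transformation can be sketched outside Open Refine in a few lines of Python. This is an illustration of the idea only, not the command we wrote in the tool. The function name `extract_hashtags`, the pattern, and the sample posts are all assumptions made for the example, and a hashtag is assumed to be a `#` followed by one or more letters, digits or underscores.

```python
import re

# Assumed definition of a hashtag: '#' followed by word characters.
# In Python 3, \w also matches letters such as æ, ø and å.
HASHTAG_PATTERN = re.compile(r"#\w+")


def extract_hashtags(post):
    """Return all hashtags found in one post, in order of appearance."""
    return HASHTAG_PATTERN.findall(post)


# Made-up posts standing in for a column of full Facebook posts.
posts = [
    "Great debate tonight #skole #folkeskolen",
    "No tags in this one",
]

# Build the new column: one space-separated string of hashtags per row.
hashtag_column = [" ".join(extract_hashtags(post)) for post in posts]
print(hashtag_column)
```

Applying the function to every row at once mirrors what a column transformation does: the original posts stay untouched, and a new column holds only the hashtags.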
These are just two examples of how Open Refine has helped our work in TANT-Lab. However, the tool has many other possibilities, and new ones are probably coming along as we write this.