The Hypertext Corpus Initiative (or Hyphe) is a web crawler developed by the SciencesPo médialab to provide students and researchers with an easy means of mapping the web.

Last modified: 13.04.2015




Documentation and instructions for use



Example: Mapping wind energy controversies on the web

In a project funded by the Danish Council for Strategic Research, we were tasked with mapping controversies around wind turbine projects as they unfold online. One of the strategies we adopted was to build a corpus of websites dedicated to the issue of wind power (protesters and advocates alike). We began with a small number of starting points in different countries: websites dedicated entirely to wind energy that were already known to us. By harvesting the hyperlinks on these starting points, we were pointed to a collection of neighbouring websites, which we then curated qualitatively so as to keep only those new websites that we deemed dedicated to wind power as well.
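The harvest-then-curate loop described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Hyphe's implementation or API: the function `expand_corpus`, the `outlinks` dictionary and the `is_relevant` callback (which stands in for the qualitative, human curation step) are all hypothetical names, and the `.example` domains are toy data.

```python
from typing import Callable, Dict, Iterable, List, Set


def expand_corpus(
    seeds: Iterable[str],
    outlinks: Dict[str, List[str]],
    is_relevant: Callable[[str], bool],
    max_rounds: int = 3,
) -> Set[str]:
    """Grow a corpus from seed sites by following links and curating.

    `outlinks` maps a site to the sites it links to (hypothetical input);
    `is_relevant` stands in for the manual decision to keep a site.
    """
    corpus = set(seeds)
    frontier = set(seeds)
    reviewed = set(seeds)  # sites already judged, so nothing is judged twice

    for _ in range(max_rounds):
        # Collect the neighbours of the sites accepted in the previous round.
        candidates: Set[str] = set()
        for site in frontier:
            candidates.update(outlinks.get(site, []))
        candidates -= reviewed
        reviewed |= candidates

        # Curation step: keep only the candidates judged relevant.
        accepted = {site for site in candidates if is_relevant(site)}
        if not accepted:
            break  # nothing new was kept, so the corpus has stabilised
        corpus |= accepted
        frontier = accepted

    return corpus


# Toy data: one seed, two relevant neighbours, two irrelevant sites.
toy_links = {
    "windseed.example": ["windfarm.example", "news.example"],
    "windfarm.example": ["turbine-protest.example", "windseed.example"],
    "news.example": ["sports.example"],
    "turbine-protest.example": [],
}
toy_relevant = {"windfarm.example", "turbine-protest.example"}

toy_corpus = expand_corpus(
    ["windseed.example"], toy_links, lambda site: site in toy_relevant
)
```

Because only accepted sites are used as the next frontier, links from rejected sites (here `news.example`) are never followed, which keeps the corpus from drifting away from the topic.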

We went through several iterations of this process, eventually ending up with 758 websites from 6 different countries, all dedicated entirely to the issue of wind energy. While curating this collection we also tagged the websites according to a range of criteria that we deemed pertinent to the controversy, one of them being whether a website could be said to be pro, con or neutral towards wind energy. Below we see how the corpus divides neatly into two overall regions that are pro (orange) and con (grey) wind energy respectively. This suggests that both proponents and opponents of wind energy tend to link primarily to their peers rather than to their adversaries.

You will also notice that there are smaller clusters within these overall regions. These are national clusters. It means that German wind proponents, for instance, are far more connected to their Swedish or Danish peers than they are to their local German adversaries.
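One simple way to put a number on the observation that sites link mostly to their peers is to compute the share of hyperlinks that stay within the same stance. The sketch below is an assumption about how this could be done, not the measure used in the project: the function `same_stance_share`, the edge list and the stance tags are hypothetical toy data, and links involving neutral or untagged sites are simply left out.

```python
from typing import Dict, List, Tuple


def same_stance_share(
    edges: List[Tuple[str, str]], stance: Dict[str, str]
) -> float:
    """Share of links whose source and target have the same stance.

    Only links between sites tagged "pro" or "con" are counted.
    """
    partisan = ("pro", "con")
    tagged = [
        (source, target)
        for source, target in edges
        if stance.get(source) in partisan and stance.get(target) in partisan
    ]
    if not tagged:
        return 0.0
    same = sum(1 for source, target in tagged if stance[source] == stance[target])
    return same / len(tagged)


# Toy data: three links between peers, one across the divide, one to a neutral site.
toy_stance = {"a": "pro", "b": "pro", "c": "con", "d": "con", "n": "neutral"}
toy_edges = [("a", "b"), ("b", "a"), ("c", "d"), ("a", "c"), ("a", "n")]

toy_share = same_stance_share(toy_edges, toy_stance)
```

A share close to 1 would correspond to the two clearly separated regions described above, while a share near 0.5 would indicate that stance plays little role in who links to whom (assuming groups of similar size).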

Besides harvesting hyperlinks from the websites, Hyphe also indexes all their textual content as HTML. This allows us to run queries on the web corpus to see where certain issues or actors are being talked about. We built an initial issue dictionary of 47 queries that could subsequently be represented visually on the map. Shown below is the issue of stray voltage, which turns out to be more or less exclusively a concern of the wind protesters, in particular in Canada (sites towards the top).
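To illustrate what running an issue dictionary against the indexed text amounts to, here is a minimal sketch using plain case-insensitive substring matching. It does not reflect how Hyphe indexes or queries content: the function `sites_mentioning`, the two-entry `toy_issues` dictionary and the site texts are hypothetical stand-ins for the 47 queries and the real corpus.

```python
from typing import Dict, List


def sites_mentioning(query: str, site_text: Dict[str, str]) -> List[str]:
    """Return the sites whose text contains the query (case-insensitive)."""
    needle = query.lower()
    return sorted(
        site for site, text in site_text.items() if needle in text.lower()
    )


# Toy data: extracted text per site (hypothetical).
toy_text = {
    "protest-ca.example": "Farmers report stray voltage near the new turbines.",
    "advocate-de.example": "Offshore wind capacity grew again this year.",
    "protest-dk.example": "Residents complain about low-frequency noise.",
}

# A tiny issue dictionary: issue label -> query string.
toy_issues = {
    "stray voltage": "stray voltage",
    "noise": "low-frequency noise",
}

# For each issue, list the sites where it is mentioned.
toy_hits = {
    issue: sites_mentioning(query, toy_text)
    for issue, query in toy_issues.items()
}
```

The resulting mapping from issue to sites is the kind of information that can then be projected onto the network map, for example by highlighting the matching nodes.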

A full report on the methods and datasets used in this project has been published as:

Munk, A. K. (2014). Mapping Wind Energy Controversies Online: Introduction to Methods and Datasets. Available at SSRN 2595287.