Word clouds for text mining?

Updates: July 2014 – word clouds, or maybe even wordclouds, are still with us, and making a little more sense in the big world of data. See Suprageography’s London Words, preparatory work for a Big Data and Urban Informatics workshop in Chicago. Gosh! Sep 2014: calligrams are back! July 2013: review of Textal (word cloud text analysis app).

Word clouds: who needs them? I’ve previously considered them as a sort of cherry on the cake, but in the days of data visualisation are word clouds actually harmful, or simply the ‘mullet of the Internet’?

Can you really use word clouds for serious visualisation (presentation/explanatory) or text mining (exploratory)? The second graphic for critique on the #datavis MOOC was a word cloud from the New York Times, At the National Conventions, the words they used.

For consideration:

  1. Is the graphic really ‘functional’ in the sense of facilitating basic, predictable tasks (comparing, relating variables etc)?
  2. Is it interactive enough? How could we improve its navigation?
  3. How would you improve its design? And what about its content? Should we include something else in the mix, more copy, different headlines (yes!), other related variables etc?

Summary of Alberto’s summary:

  1. What methodology was used to select the words?
  2. Navigation – confusing; does it make sense at all? Eg it is not clear that if you click on a bubble the display will highlight the parts of the text where those words are mentioned (need to scroll to see).
  3. The words are presented out of context. A word can mean different things depending on who says it, and on what other words surround it. Better to visualise the words as networks of relationships.
  4. Are bubbles an inadequate way of representing the data? They work for the big picture (popular words here, less popular words there), are fun and provide a nice looking first layer of information, but are not helpful for ranking the words or making meaningful comparisons. In addition, use a different kind of graphic with a vertical or horizontal scale (bar graphs, slope graphs, scatter plots to see if there is a relationship between Democrat and Republican use of words).
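
As a rough illustration of the last point, the numbers behind a bar graph or scatter plot are just a pair of counts per word. The sketch below is a minimal one in Python, not the New York Times’ method: the function names are made up for this post, the tokeniser is deliberately crude, and the two text snippets are invented, not taken from the convention speeches.

```python
from collections import Counter
import re


def word_counts(text):
    """Lower-case the text and count runs of letters (a deliberately crude tokeniser)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def compare_usage(text_a, text_b):
    """Return (word, count_a, count_b) rows, largest difference in use first."""
    a, b = word_counts(text_a), word_counts(text_b)
    rows = [(w, a[w], b[w]) for w in set(a) | set(b)]
    # Sort by size of the gap, then alphabetically so ties are stable.
    return sorted(rows, key=lambda r: (-abs(r[1] - r[2]), r[0]))


# Invented snippets for illustration only, NOT the real speeches.
dem = "jobs jobs family family family economy"
rep = "jobs economy economy fail fail fail fail"

for word, d, r in compare_usage(dem, rep):
    print(f"{word:<10}{d:>3}{r:>3}")
```

Each row is one point on a scatter plot (one party’s count on each axis) or one pair of bars, which makes ranking and comparison possible in a way bubble sizes do not.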

From participants:

  • more ‘prepared data’ would be a powerful enhancement on a traditional word cloud – eg select different policy areas (health, privacy, education) to see a new set of bubbles, combine with selectable speaker sets (the candidates, their spouses, vice-presidential candidates)
  • what does it really mean if a word is used more often? the most important concepts (eg economy) are used almost equally, leaving you none the wiser; the Democrats use ‘millionaires’ 7 times (Republicans 0), does that mean that the Democrats are working for or against rich people? the Republicans use the word ‘fail’ more often, what does that really tell me?
  • split the graphic into two areas – where the parties are mostly saying the same thing and where they differ significantly
  • word clouds only make sense in showing the amount of something – the Democrats say ‘family’ more times than the Republicans – what does that mean? out of context the words don’t mean much and can leave quite a bit open to misleading conjecture
  • word clouds for attention, summary and discovery – a word cloud gives an overview of the contents of a set of results
  • counting words is not analysis – it is the first step to analysis in qualitative methods
  • who is this for? too detailed for a casual viewer, and as a research tool would work better as a database driven webpage
  • in essence, the idea of displaying the frequency of words is interesting, but doesn’t give enough information to lead to objective insights
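
Several of these comments come down to context: a count says how often ‘fail’ was used, not how. One simple next step after counting is a keyword-in-context listing, which shows each occurrence with a few words either side. A minimal sketch in Python follows; the `kwic` function and the sample sentences are invented for illustration and are not from the speeches or from any particular tool.

```python
import re


def kwic(text, keyword, width=3):
    """Return each occurrence of keyword with `width` words of context either side."""
    words = re.findall(r"[\w']+", text)
    hits = []
    for i, w in enumerate(words):
        if w.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, w, right))
    return hits


# Invented sentences, for illustration only.
sample = ("We will not fail our families. "
          "Their policies fail working people every day.")

for left, word, right in kwic(sample, "fail"):
    print(f"{left:>25} [{word}] {right}")
```

The same word lines up in a column with its neighbours on either side, so the two quite different uses of ‘fail’ are visible at a glance. Note that this crude version ignores sentence boundaries.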

A couple of Wordles popped up in my feedreader last week:

  • ALT-C 2011 wordled – Sarah Horrigan wordled her own tweets from the UK’s learning technology conference and was happy to see a healthy balance between the two, plus terms such as education, listening and students sticking out more than the names of tools or technology specifics
  • Political slogans – in the run-up to Thursday’s general election in Denmark Kaas & Mulvad have wordled candidates’ campaign slogans. In the red corner emerge stem (vote), velfærd (welfare) and ansvar (responsibility), while the blues come out fighting with livet (life) and pengene (money).

Wordle is often used in event wrap-ups as a visualisation of the most tweeted words with a particular hashtag, but does a word cloud really tell us anything? I can’t really decide if the results are useful or just ‘cool’. Any creativity lies mainly in manipulating the results to suit.

Word cloud tools:

  • Textal app | review
  • Infomous – text visualisation tool, see below
  • TagCrowd (show frequencies next to words, group similar words)
  • Tagxedo – “word clouds with styles”, create word clouds from any URL, Twitter ID etc. Update, Jan 2012: I’ve used Tagxedo to make a beagle shaped word cloud for my @beaglechat account
  • Tagul – lets you assign links to words in the cloud and other fancy things. Update, June 2012: puffin word cloud from IslandGovCamp
  • Textal – also finds pairs (words that were generally paired with the word selected) and collocates (words which the word was usually found next to), which can be exported as text files and sent to an email account; review
  • Tweet Cloud and Tweetstats both allow you to make a word cloud from your own tweets. Tweetstats also lets you Wordle your tweets.
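
The collocates idea mentioned for Textal above (words usually found next to a chosen word) can be approximated by counting immediate neighbours. This is only a sketch of the general idea in Python, not how Textal actually does it; the `collocates` function and the sample sentence are made up for this post.

```python
from collections import Counter
import re


def collocates(text, target, top=5):
    """Count the words found immediately before or after `target`."""
    words = re.findall(r"[a-z']+", text.lower())
    neighbours = Counter()
    for i, w in enumerate(words):
        if w != target:
            continue
        if i > 0:
            neighbours[words[i - 1]] += 1  # word just before
        if i + 1 < len(words):
            neighbours[words[i + 1]] += 1  # word just after
    return neighbours.most_common(top)


# Invented sentence, for illustration only.
sample = "word cloud tools make a word cloud from any text, but a word list ranks better"
print(collocates(sample, "word"))
```

In practice you would probably drop very common words such as ‘a’ before counting, otherwise they crowd out the interesting neighbours.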

More word clouds:

Infomous is a text visualisation tool which lets you create word clouds from a URL, RSS feed, Twitter user name or search. It’s used by The Economist for topics most commented on. Unlike most word cloud tools it’s interactive, allowing you to navigate the content on display by clicking on individual words, and offering stabs at topic importance (through word size), relationships (through connecting lines) and concepts (through groups).

Here is a link to the default cloud for this blog – WordPress.com strips out the embed code : ( There are configuration options, which I will take a look at in. due. course.

(HT: Alan Cann.)
