LitLong Edinburgh: exploring the literary city

Edinburgh has just celebrated its 10th anniversary as UNESCO City of Literature (Facebook | Twitter). The original city of literature: here’s Edinburgh’s literary story and details of tours and trails (guided | self-guided | virtual – a bit lacking in the maps department, mind). Edinburgh is also home to the Scottish Poetry Library (Facebook | Twitter), the world’s first purpose-built institution of its kind, it says here, and the Scottish Storytelling Centre (Facebook | Twitter), ditto, adjacent to John Knox House. Not forgetting the Book Festival (Facebook | Twitter), the “largest festival of its kind in the world”.

The UK has one other city of literature, Norwich (see City of stories), and further literary cities include Dublin (great writers museum), and, pleasingly, Dunedin (about). Update: Nottingham has a bid in! If But I know this city! (tweets | David Belbin | report) is anything to go by, it should be successful. And there’s even Literary Dundee (@literarydundee).

I suspect not entirely coincidentally, 30 March saw the launch of LitLong (@litlong), the latest output from the AHRC-funded Palimpsest project (@LitPalimpsest) at the University of Edinburgh (see Nicola Osborne’s liveblog and #litlonglaunch, esp @sixfootdestiny). An “interactive resource of Edinburgh literature” currently based around a website, with an iOS app to come, LitLong grew out of the prototype Palimpsest app developed three years ago, taking a multidisciplinary team 15 months to build – geolocating the literature around a city is no trivial matter! See about LitLong for some of the issues.

550 works set in Edinburgh have been mined for placenames from the Edinburgh Gazetteer, with snippets selected for “interestingness” and added to the database, resulting in more than 47,000 mentions of over 1,600 different places. The data can be searched by keyword, location or author, opening up lots of possibilities: why is Irvine Welsh’s Embra further north than Walter Scott’s Edinburgh? Do memoir writers focus on different areas from crime writers? See too Mapping the Canongate.

Part of the point of Palimpsest is to allow us to explore and compare the cityscapes of individual writers, as well as the way in which literary works cultivate the personality of the city as a whole.

On the downside, while there is a handful of contemporary writers in the mix, the majority of the content necessarily comes from copyright-free material available in a digitised corpus, ie old stuff they made you read at school. Plus search results can be rather overwhelming (339 hits for the Grassmarket) – filters for genre or time period might be an idea. However, the data is to be made available so that interested parties can play around as they wish, with open source code and data resources on GitHub.

I’ve had a look at the data around Muriel Spark, who would surely be delighted to be considered contemporary. The Prime of Miss Jean Brodie (1961) has a section set in Cramond, near where I grew up. Drilling down using the location visualiser quickly brings us to:

“I shouldn’t have thought there was much to explore at Cramond,” said Mr. Lloyd, smiling at her with his golden forelock falling into his eye.

Searching the database brings up three pages of Cramond results to explore, including 17 Brodie snippets. Note that here you can filter by decade or source.

A search for Cammo, even closer to home, brought up a quote from Irvine Welsh’s Skagboys, although the map shown was different depending on which tool I used:

Edinburgh is a city of trees and woods; from the magnificence of the natural woodlands at Corstorphine Hill or Cammo, to the huge variety of splendid specimens in our parks and streets, Alexander argued, a pleasing flourish to his rhetoric. — Trees and woodlands have an inherent biodiversity value, whilst providing opportunities for recreation and environmental education.


location visualiser map – quill in back gardens rather than the “natural woodlands” #picky


database search map – not Cammo!

At the other end of the scale, a search for ‘Bobby’ brings up 72 snippets from Eleanor Atkinson’s book – that’s a lot to handle… TBH I don’t really want them; I want a nice map of locations mentioned in the book, or at least a list, so I can create my own Greyfriars Bobby trail. At the moment it’s not possible to switch between the text and the map from the location visualiser, although you can do this snippet by snippet from the database search.

As things stand, LitLong feels like an academic project rather than a user friendly tool – some use cases might be an idea. Hopefully the same approach will be applied to other cities in due course.

Updates coming thick and fast… the Toronto Poetry Map and, also from Toronto but rather broader, Places of poems and poets, based on Representative Poetry Online; and from the Stanford Literary Lab, Mapping emotions in Victorian London, which maps 167 places named in 4,363 literary passages from 1,402 books by 741 authors (background | paper), using Historypin and Amazon Mechanical Turk.

#smwbigsocialdata: getting social at CBS

On 27 February the boffins at Copenhagen Business School (aka the Computational Social Science Laboratory in the Department of IT Management) opened their doors for Social Media Week with Big social data analytics: modelling, visualization and prediction. This was the second time CSSL had participated in #smwcph, with their 2014 workshop (preso) looking at social media analytics. See also my post on text analysis in Denmark.

Wifi access was not offered, resulting in only 19 tweets, but as many of these were photos of the slides I’m not really complaining. There was also no hands-on this year; all in all, a bit of a lacklustre form of public engagement.

Ravi Vatrapu kicked off the workshop with a couple of definitions:

  • What is social? – involves the other; associations rather than relations, sets rather than networks
  • What is media? – time and place shifting of meanings and actions

The CSSL conceptual model:

CSSL conceptual model diagram

  • social graph analytics – the structure of the relationships emerging from social media use; focusing on identifying the actors involved, the activities they undertake, the actions they perform and the artefacts they create and interact with
  • social text analytics – the substantive nature of the interactions; focusing on the topics discussed and how they are discussed

It’s a different philosophy from social network analysis, using fuzzy set logic instead of graph theory, associations instead of relations and sets instead of social networks.

Abid Hussain then presented the SODATO tool, which offers keyword, sentiment and actor attribute analysis on Twitter and Facebook (public posts only, uses Facebook Graph API). Data from (for example) a company’s wall can be presented in dashboard style, eg post distribution by month.

Next, Raghava Rao Mukkamala explored social set analytics for #Marius and other social media crises. Predictions (emotions, stock market prices, box office revenues, iPhone sales) can be made based on Twitter data.

Benjamin Flesch’s Social Set Visualizer (SoSeVi) is a tool for qualitative analysis. He has built a timeline of factory accidents and a corpus of Facebook walls for 11 companies, resulting in a social set analysis dashboard of 180 million+ data points around the time of the garment factory accidents in Bangladesh.

The dashboard shows an actor’s engagement before, during and after the crisis (time), which can also be analysed over space (how many walls did they post on). Tags are also listed, allowing text analysis to be undertaken.

Niels Buus Lassen and Rene Madsen then outlined some of their work with predictive modelling using Twitter. You have to buy into #some activity being a proxy for real world attention, ie Twitter as a mirror of what’s going on out in the market – a sampling issue like any other. Using a dashboard driven by SODATO they classify tweets using ensemble classifiers, for example predicting iPhone sales from 500 million-plus tweets containing the keyword “iphone” (see CBS news story | article in Science Nordic).

They also used a very cool formula I nearly understood.

Last up, Chris Zimmerman gave an overview of CSSL’s new Facebook Feelings project, a counterpart to all those Twitter happiness studies: a classification of 143 different emotions on Facebook, based on mood mining from 12 million public posts, yikes. “Feeling excited” was the most popular feeling by far. Analysis can be done and correlations made on any number of aspects of the data, with an active | passive axis in addition to the positive | negative axis used in sentiment analysis. Analysis by place runs into the usual issue – only 5% of posts carry location data.

Overview slides currently available from the URL below…

#corpusmooc (and text analysis) linkage

Latest: Journalistic representations of Jeremy Corbyn in the British press

Updates: never forget – Sentiment analysis is opinion turned into code; see the Stanford Named Entity Tagger, which has three English language classification algorithms to try, and a list of 20+ Sentiment Analysis APIs. Next up: Seven ways humanists are using computers to understand text, Semantic maps at DH2015, the Sensei FP7 project: making sense of human-human conversation (Gdn case study), and Stanford Literary Lab uses digital humanities to study why we feel suspense. Donald Trump’s tweets analysed. Pro-Brexit articles dominated newspaper referendum coverage.

Updates: just came across culturomics via a 2011 TEDx talk – no, stay… two researchers who helped create the Google Ngram Viewer analyse the Google Books digital library for cultural patterns in language use over time. See the Culturomics site, Science paper etc. Critique: When physicists do linguistics and Bright lights, big data… EMOTIVE, a sentiment analysis project at Lboro… Laurence Anthony reviews the future of corpus tools… Sentiment and semantic analysis… analysing Twitter sentiment in the urban context, and again… Wisdom of the crowd, a research project from inter alia Demos and Ipsos MORI, launches with a look at Twitter’s reaction to the autumn statement.

Aha, a links post… I’ve got links on text analysis and related topics all over the shop – see the category and tags for text mining and sentiment analysis on this blog for starters, in particular #ivmooc 4: what? and #ivmooc 2: burst detection, plus Word clouds for text mining. Here’s a broadly corpus-related haul.

Projects:

Tools:

Corpora:

There’s no shortage of cases. Here’s a selection with particular appeal, either due to subject matter or methodology:

Blogs, Twitter…The dragonfly’s gaze looks at computational approaches to literary text analysis, with a nice post listing repositories and exploring file formats.

Project #marius and infostorms

(Post copied from Danegeld blog, 4 Feb 2015.)

Update, 28 Feb 2015: I gave #SMWZOOSHITSTORM a wide berth as it would just make me cross, although the CBS team commented at another Social Media Week CPH event that the story keeps on giving. The event did yield up:

Updates: 2 April 2014: story in Berlingske on the research, plus perspectives of the day from Denmark and RoW. The CPH Post, who had their own Marius fool, reported that the Jobindex spoof was pulled at around noon due to complaints, but it still seems to be there…9 April: the Zoo’s comms guy tells his story…a peer-reviewed article on the saga, Marius, the giraffe: a comparative informatics case study of linguistic features of the social media discourse, was presented at the ACM’s CABS 14 conference (abstract)

A team at Copenhagen Business School has taken a look at the use of social media around Copenhagen Zoo’s recent giraffe story:

See also Tableau visualisations and the timeline of events.

Research questions:

  • how did the conversation amplitude evolve?
  • where did negative sentiment originate and how did it evolve/spread?
  • who were the main actors – for some #sna see slides 20-23; Twitter bios showed a lot of vegans, activists etc (slide 19), well organised on #some
  • what types of posts and events instigated the issue online?
  • how did CPH Zoo handle the event on social channels and how did the social media storm affect their presence? – it posted in both English and Danish on its Facebook page, and was very successful in terms of check-ins, likes etc, but commentary was very negative, and mainly in English (slides 24-26)
  • how did other organisations deal with the crisis?

Over 80% of the data came from Twitter. Highest buzz rate: 332 posts/minute, with a second short-lived spike at 20K tweets/hr re the second Marius. 50% of tweets were retweets – a reflection of sentiment?

Twitter offered a more direct reflection of events, in terms of volume and sentiment, and also demonstrated a more drastic reaction to network prestige factors from activists and celebs. Discourse on Facebook was different – a more closed environment, with feelings expressed to family and friends and maybe the Zoo.

95% of the global conversation was in English, with Danish detected in only 2,220 posts. Differences in the Danish subset are particularly interesting (slide 11) – Twitter and Facebook only share 50% of the conversation – does mainstream media play a larger role in Danish society? Fewer RTs – #some used more to express oneself than to share information? But sentiment is also more neutral (slide 17), with more negative sentiment on Facebook (apart from that viral photo in support of the Zoo; ?Twitter penetration in Denmark lower, large subset of politicians, media etc).

Radian6 used for analysis, but came up short – pretty hopeless for the Danish data subset, and its automatic sentiment coding was “either super safe or super crap” (slide 16), neutral heavy, often failing to detect negative sentiment. 50 corporate communications students at CBS hand coded some data with rather different results. Much discussion over what is positive or negative in this case. Now starting to analyse YouTube comments.

Was #marius an infostorm? Infostorms, a new book from two researchers in Denmark (one chairman of the Danish Nudging Network), explores whether #some “amplifies irrational social behaviour and can manipulate minds and markets” (see press release).

Denmark’s utilitarian approach towards animals is out of step with the English speaking world in particular. Some rather less robustly scientific articles have been sighted lately, and this is a topic it will be interesting to track in the future. Here’s my collection of notable #marius stories for the record:

An image in support of the Zoo’s Director went viral on my Facebook timeline at least, and an ill-advised tweet from actor Pilou Asbæk, one of the hosts for Eurovision 2014 in Copenhagen, went viral on Facebook (traces of both now deleted), but it is to be hoped that organisations representing Denmark are sensitive to the issues:

Copenhagen’s visitor card

Text analysis in Denmark

As part of Social Media Week, Copenhagen Business School (CBS) held an event on Social media analytics: concepts, models, methods and tools (aka #smwcbsdata) on 21 February. I couldn’t go, but made it to Computing feelings: Danish approaches to sentiment analysis on 28 February. Both events are also interesting as examples of Danish academics engaging with the public.

Computing feelings kicked off with a 15 minute keynote from Bernardo Huberman (HP Labs) on #some and the attention economy, which I missed, followed by Anders Søgaard (Center for Sprogteknologi, KU; paper).

The panel Q&A shed some light on Infomedia, who do press scanning for companies, aka media monitoring. Takes me back…they employ 40 students as coders, so from a business perspective it would make sense to automate as much as possible. Broad word/phrase searches can bring up as much as 80% noise (recall vs precision), but they also need to monitor for topics, eg market performance. Can sentiment analysis help? Relevant for factual articles, eg stock rise/fall? Topic extraction can look for new topics coming up, as well as picking up topics which aren’t there (negative keywords), which can also help with the don’t-know-what-you-don’t-know issue. Infomedia aim for as high a level of accuracy as possible, but is it ever possible to get better than 85% accuracy?

I also missed Dan Hardt’s intro, but I’m guessing it was similar to his #smwcph preso, with an introduction to the two tools and datasets used by the bulk of the six 10-minute presentations:

  • Project #marius –  presented by Chris Zimmerman (@socialbeit | @zimme2cj); see Project #marius and infostorms
  • Ensemble classifiers – René Madsen (Infomedia) & Niels Buus Lassen (Evalua) on combining classifier algorithms by embedding, running in parallel or as a chain, with the aim of improving recall and precision; if you know your domain you can train algorithms to perform really well; looked at Trustpilot reviews; presented on predictive modelling (preso) at #smwcph, see article
  • From simple to more advanced sentiment lexicon approach – Anders Boje Larsen (@anders_boje) uses a sentiment lexicon approach – human judgement, not machine learning (qv Dan’s slides); words indexed/tagged from +5 to -5 and then run over text; 7K words indexed, with dictionaries now created for decrementers and incrementers, invertors and stop words, which increases accuracy (see the lexicon sketch after this list); see article and Sociallytic tool
  • three presos on domain adaptation, ie using a classifier trained in one domain on another; unigram bag-of-words model, ie all the words, then find the top 10 features per class (ie top 10 words for positive, negative, neutral); 1-gram vs 3-gram, ie n-grams; f-score results; movie reviews (Scope, 60% neutral) differ from company reviews (Trustpilot, 80% positive); create a vector from features, a string of 1s and 0s; logistic regression classifier (see the bag-of-words sketch after this list); domain-specific features, cf a self-driving car and a ditch –> Infomedia needs client-specific algorithms for each domain

The limitations of sentiment analysis:

  • classifying text as positive, negative or neutral is not an exact science, plus doesn’t include sarcasm, emotional state etc (may end up as neutral by default)
  • DK parsers not there yet – POS analysis OK, semantic stuff lacking
  • 90% accuracy possible for both Danish and English – but how do you find and determine accuracy?
  • systems could usefully create a flag for “hooman needs to look at this”
  • linguistic challenges include negation, intensifiers, irony
  • sentences with attitude verbs, modals and helping verbs are difficult to parse
  • modal sentences could be removed as they are not reality anyway…
  • cf ‘not’ – changes meaning!

The events were masterminded by the CBS Competitiveness Platform (@CBSCompete | Facebook), who held a similar ‘open’ event last year (Workshop on big data and language technology), with the Dept of IT Management’s Dan Hardt and Ravi Vatrapu.

Some local cases:

Mapping #some

Update, Feb 2015: Tourists v locals: city heat maps showing geolocated tweets; tourists in CPH can be found in the city centre and at the airport, duh…but interesting concept! Here’s more…

Eric Fisher (Flickr | Twitter):

Cue #SoMe klaxon! Week 4 of #mapmooc looked at social media as spatial data, how social media can be used with maps, advantages and pitfalls…and just how easy it actually is to plot it on a map.

On Twitter few tweets are geotagged. We’re up to a grand total of three in the #mapmooc TAGS archive – two by me plus:

But not:

See the difference in @asudell‘s stream:


#vandymaps are also having issues:

Seems that tweets made with the web client only get geolocation information (coordinates) in TAGS if they are tagged individually, but not if the user has merely added a location in Settings, which TAGS doesn’t collect (how about the vanilla Twitter API?). OTOH mobile apps, with inbuilt GPS, _do_ offer geocoordinates simply when location is turned on. At least I think that’s right – thanks to @derekbruff and @asudell for sorting this out!

(Update: @derekbruff has set up a #vandymaps archive, and is investigating geotagging tweets. Checking the #mapmooc archive reveals that two of my own tweets, where I added location via the Twitter web client, are the only ones with data in the geo_coordinates field. I’ve extracted the data from the user_lang field and will take a closer look PDQ.)

But even a small set of tweets can offer potentially interesting results – see What’s happening in our vicinity from Field Office (an arts project currently going on in CPH), a snapshot of geotagged tweets using the Streamd.in app. The week’s mapping assignment featured the Esri Public Information Map, which shows the real time effects of extreme weather events and other natural disasters, including geotagged social content from Twitter, Flickr and YouTube. As noted in the forums, however, this is a rather blunt instrument with a poor signal to noise ratio.

Tweetmap Alpha is a further tool to filter geotagged tweets. As we know, geotagging and privacy kinda go together. GeoSocial Footprint looks at the location information you divulge on Twitter in the light of potential privacy concerns. A footprint is made up of GPS enabled tweets, social check-ins, natural language location searching (geocoding) and profile harvesting. It states that “14 million tweets per day contain embedded GPS coordinates and up to 35% of all tweets containing additional location information”, which seems rather higher than in my experience.

Geolocating tweets the hard way

Back in lesson 1, it was noted that locations relevant to a particular tweet could include:

  • the locations mentioned in the message itself
  • the user’s location when they created the message
  • the user’s home location
  • the locations implied by the message

What are you plotting when you plot location? Where people live, where they work, where there is free wifi?

And from a thread, the following methods can be used to determine the spatial origin of tweets:

  • geolocation (geotags?)
  • Geo-IP and user designations (haven’t a clue)
  • the location from the user’s profile

So, there’s more to it than geotagging via GPS. See for example Tweak The Tweet, which uses “a hashtag-based syntax to help direct Twitter communications for more efficient data extraction”.

A bunch of maps were presented on the forums, including a lone Facebook example (Mapping the world’s friendships), leading to extensive discussions on sentiment analysis and how it might/not work. Happy days!

For starters, at least three university projects use Twitter to understand [emotions] in the USA, including…

Other projects which may/not be connected to the above: Emography | Tweetfeel | Twittermood | We feel fine | Mappiness (UK). Enough already! Update, June 2014: Five Labs “analyzes your Facebook posts to predict the personalities of you and your friends”.

More clues on sophisticated methods IRT geolocation no doubt to be found in:

I could also do with:

A nice story to finish, in the warm-up to week 5. #mapmoocer Tony Targonski created a map of Seattle on an earlier Coursera MOOC: “Larger circles mean more social activity. Greener colour represents more “positive” than expected; redder is less “positive” than expected. In this case “positive” refers to valence (a commonly used measure of sentiment), and “expected” is the predicted valence score based on the walkability measure of the block (overall more walkable places correlate with more positive sentiment).”

Which is an interesting point IRT Happy Denmark. They’re not happy, they just bike a lot (like I didn’t know).

#mapmooc statistics week 4 (7-13 August):

  • 656 (558, 374, 206) tweets, 202 (181, 117, 82) RTs, 264 (212, 112, 61) links (all +/- due to time zone differences)
  • top tweeters: @MapRevolution, @DougOfNashville, @PublicUniverse
  • n=246 (230, 152, 129); 157 (155, 102, 74) have tweeted only once
  • 61 (54, 40, 30) threads, 9 (9, 11, 12)%
  • top conversationalists: @MapRevolution, @derekbruff, @annindk

Postscript: among its rather nice web apps Esri offers a social media app (hopefully a bit more stable than the gallery app) plus stuff on making a social media map in minutes – come in! See the Horn of Africa Drought Crisis Map for an example.

As a quick test I took a look at Denmark’s most popular hashtag, #dkpol. Danes aren’t big tweeters, but they are big mobile users and #dkpol people are a pretty vociferous bunch; still, the results were rather underwhelming. Putting #SoMe on a map seems to be less about creating a meaningful map and more about simply harvesting the data – see We are on Albert Drive for an example of what can be done. To be revisited.