LitLong Edinburgh: exploring the literary city

Update: LitLong 2.0 launched at the 2017 Embra BookFest; see article

Edinburgh has just celebrated its 10th anniversary as UNESCO City of Literature (Facebook | Twitter). As the original city of literature, Edinburgh has its literary story online plus details of tours and trails (guided | self guided | virtual – a bit lacking in the maps department, mind). Edinburgh is also home to the Scottish Poetry Library (Facebook | Twitter), the world’s first purpose-built institution of its kind, it says here, and the Scottish Storytelling Centre (Facebook | Twitter), ditto, adjacent to John Knox House. Not forgetting the Book Festival (Facebook | Twitter), the “largest festival of its kind in the world”. 

The UK has one other city of literature, Norwich (see City of stories), and further literary cities include Dublin (great writers museum), and, pleasingly, Dunedin (about). Update: Nottingham has a bid in! If But I know this city! (tweets | David Belbin | report) is anything to go by, it should be successful. And there’s even Literary Dundee (@literarydundee). Unexpected update: Literary Odessa.

I suspect not entirely coincidentally, 30 March saw the launch of LitLong (@litlong), the latest output from the AHRC-funded Palimpsest project (@LitPalimpsest) at the University of Edinburgh (see Nicola Osborne’s liveblog and #litlonglaunch, esp @sixfootdestiny). An “interactive resource of Edinburgh literature” currently based around a website, with an app since launched for iOS, LitLong grew out of the prototype Palimpsest app developed three years ago and took a multidisciplinary team 15 months to build – geolocating the literature around a city is no trivial matter! See about LitLong for some of the issues.

550 works set in Edinburgh have been mined for placenames from the Edinburgh Gazetteer, with snippets selected for “interestingness” and added to the database, resulting in more than 47,000 mentions of over 1,600 different places. The data can be searched by keyword, location or author, opening up lots of possibilities: why is Irvine Welsh’s Embra further north than Walter Scott’s Edinburgh? Do memoir writers focus on different areas than crime writers? See too Mapping the Canongate.
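LitLong’s actual pipeline is far more sophisticated, but the core move – matching gazetteer placenames in text and storing a context snippet for each mention – can be sketched roughly as follows. The gazetteer entries and text here are invented stand-ins, not LitLong’s data:

```python
# Toy sketch of gazetteer-based placename spotting (not LitLong's actual
# pipeline): scan text for known Edinburgh placenames and record snippets.
import re
from collections import defaultdict

GAZETTEER = {"Grassmarket", "Cramond", "Canongate", "Cammo"}  # tiny stand-in

def find_mentions(text, window=40):
    """Return {placename: [context snippets]} for each gazetteer hit."""
    mentions = defaultdict(list)
    for place in GAZETTEER:
        for m in re.finditer(re.escape(place), text):
            start = max(0, m.start() - window)
            snippet = text[start:m.end() + window].strip()
            mentions[place].append(snippet)
    return dict(mentions)

text = ("Sandy walked from the Grassmarket towards Cramond, "
        "thinking of the sea at Cramond.")
hits = find_mentions(text)
print(sorted(hits))            # ['Cramond', 'Grassmarket']
print(len(hits["Cramond"]))    # 2
```

Real systems also have to disambiguate (is “Haymarket” Edinburgh or London?) and rank snippets for “interestingness”, which is where the hard work lies.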

Part of the point of Palimpsest is to allow us to explore and compare the cityscapes of individual writers, as well as the way in which literary works cultivate the personality of the city as a whole.

On the down side, while there is a handful of contemporary writers in the mix, the majority of the content necessarily comes from copyright-free material available in a digitised corpus, ie old stuff they made you read at school. Plus search results can be rather overwhelming (339 hits for the Grassmarket) – filters for genre or time period might be an idea. However the data is to be made available, enabling interested parties to play around as they wish, with open source code and data resources on GitHub.

I’ve had a look at the data around Muriel Spark, who would surely be delighted to be considered contemporary. The Prime of Miss Jean Brodie (1961) has a section set in Cramond, near where I grew up. Drilling down using the location visualiser quickly brings us to:

“I shouldn’t have thought there was much to explore at Cramond,” said Mr. Lloyd, smiling at her with his golden forelock falling into his eye.

Searching the database brings up three pages of Cramond results to explore, including 17 Brodie snippets. Note that here you can filter by decade or source.

A search for Cammo, even closer to home, brought up a quote from Irvine Welsh’s Skagboys, although the map shown was different depending on which tool I used:

Edinburgh is a city of trees and woods; from the magnificence of the natural woodlands at Corstorphine Hill or Cammo, to the huge variety of splendid specimens in our parks and streets, Alexander argued, a pleasing flourish to his rhetoric. — Trees and woodlands have an inherent biodiversity value, whilst providing opportunities for recreation and environmental education.

location visualiser map – quill in back gardens rather than the “natural woodlands” #picky

database search map – not Cammo!

At the other end of the scale, a search for ‘Bobby’ brings up 72 snippets from Eleanor Atkinson’s book – that’s a lot to handle… TBH I don’t really want them; I want a nice map of locations mentioned in the book, or at least a list, to create my own Greyfriars Bobby trail. At the moment it’s not possible to switch between the text and the map from the location visualiser, although you can do this snippet by snippet from the database search.

As things stand LitLong feels like an academic project rather than a user-friendly tool – some use cases might be an idea. Hopefully the same approach will be applied to other cities in due course.

#smwbigsocialdata: getting social at CBS

On 27 February the boffins at Copenhagen Business School (aka the Computational Social Science Laboratory in the Department of IT Management) opened their doors for Social Media Week with Big social data analytics: modelling, visualization and prediction. This was the second time CSSL has participated in #smwcph, with their 2014 workshop (preso) looking at social media analytics. See also my post on text analysis in Denmark.

Wifi access was not offered, resulting in only 19 tweets, but as many of these were photos of the slides I’m not really complaining. Also no hands-on this year; all in all a bit of a lacklustre form of public engagement.

Ravi Vatrapu kicked off the workshop with a couple of definitions:

  • What is social? – involves the other; associations rather than relations, sets rather than networks
  • What is media? – time and place shifting of meanings and actions

The CSSL conceptual model:


  • social graph analytics – the structure of the relationships emerging from social media use; focusing on identifying the actors involved, the activities they undertake, the actions they perform and the artefacts they create and interact with
  • social text analytics – the substantive nature of the interactions; focusing on the topics discussed and how they are discussed

It’s a different philosophy from social network analysis, using fuzzy set logic instead of graph theory, associations instead of relations and sets instead of social networks.
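As a crude illustration of the set-based mindset: instead of building an actor-to-actor graph, each wall (or event) is treated as a set of actors, and analysis works with intersections and differences. CSSL actually uses fuzzy sets with graded membership, so the crisp Python sets below – with invented actors and walls – only show the flavour:

```python
# Toy illustration of set-based (rather than graph-based) thinking:
# each Facebook wall is just a set of the actors who posted there.
wall_a = {"ann", "bob", "carol"}
wall_b = {"bob", "carol", "dave"}

both = wall_a & wall_b        # actors active on both walls
only_a = wall_a - wall_b      # actors exclusive to wall A
print(sorted(both))           # ['bob', 'carol']
print(sorted(only_a))         # ['ann']
```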

Abid Hussain then presented the SODATO tool, which offers keyword, sentiment and actor attribute analysis on Twitter and Facebook (public posts only, uses Facebook Graph API). Data from (for example) a company’s wall can be presented in dashboard style, eg post distribution by month.

Next, Raghava Rao Mukkamala explored social set analytics for #Marius and other social media crises. Predictions (emotions, stock market prices, box office revenues, iphone sales) can be made based on Twitter data.

Benjamin Flesch’s Social Set Visualizer (SoSeVi) is a tool for qualitative analysis. He has built a timeline of factory accidents and a corpus of Facebook walls for 11 companies, resulting in a social set analysis dashboard of 180 million+ data points around the time of the garment factory accidents in Bangladesh.

The dashboard shows an actor’s engagement before, during and after the crisis (time), which can also be analysed over space (how many walls did they post on). Tags are also listed, allowing text analysis to be undertaken.

Niels Buus Lassen and Rene Madsen then outlined some of their work with predictive modelling using Twitter. You have to buy into #some activity being a proxy for real world attention, ie Twitter as a mirror of what’s going on out in the market – a sampling issue like any other. Using a dashboard driven by SODATO they classify tweets using ensemble classifiers, for example predicting iPhone sales from 500 million plus tweets containing the keyword “iphone” (see CBS news story | article in Science Nordic).
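Their models are of course far richer, but an ensemble classifier at its simplest is just majority voting over several weak classifiers. A toy sketch, with invented keyword rules standing in for trained models:

```python
# Toy majority-vote ensemble for tweet sentiment (invented rules, not
# the CSSL/SODATO models): each "classifier" votes, and majority wins.
def clf_exclaim(t):  return "pos" if "!" in t else "neg"
def clf_love(t):     return "pos" if "love" in t.lower() else "neg"
def clf_queue(t):    return "neg" if "queue" in t.lower() else "pos"

CLASSIFIERS = [clf_exclaim, clf_love, clf_queue]

def ensemble(tweet):
    votes = [clf(tweet) for clf in CLASSIFIERS]
    return max(set(votes), key=votes.count)  # most common label

print(ensemble("Love my new iPhone!"))        # pos
print(ensemble("Stuck in the iPhone queue"))  # neg
```

The point of an ensemble is that individually weak classifiers, combined, tend to beat any single one of them.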

They also used a very cool formula I nearly understood.

Last up, Chris Zimmerman gave an overview of CSSL’s new Facebook Feelings project, a counterpart to all those Twitter happiness studies. A classification of 143 different emotions on Facebook, based on mood mining from 12 million public posts, yikes. “Feeling excited” was the most popular feeling by far. Analysis can be done and correlations made on any number of aspects of the data, with an active | passive axis in addition to the positive | negative axis used in sentiment analysis. Analysis by place runs into the usual issue – only 5% of posts carry location data.

Overview slides currently available from the URL below…

#corpusmooc (and text analysis) linkage

Latest: Journalistic representations of Jeremy Corbyn in the British press

Updates: never forget – Sentiment analysis is opinion turned into code; see Stanford Named Entity Tagger, which has three English language classification algorithms to try, and a list of 20+ Sentiment Analysis APIs. Next up: Seven ways humanists are using computers to understand text…Semantic maps at DH2015…Sensei FP7 project: making sense of human–human conversation (Gdn case study)…Donald Trump’s tweets analysed…Pro-Brexit articles dominated newspaper referendum coverage…Americanisation of English

Updates: just came across culturomics via a 2011 TEDx talk – no, stay…two researchers who helped create the Google Ngram Viewer analyse the Google Books digital library for cultural patterns in language use over time. See the Culturomics site, Science paper etc. Critique: When physicists do linguistics and Bright lights, big data…EMOTIVE, sentiment analysis project at Lboro…Laurence Anthony reviews the future of corpus tools…Sentiment and semantic analysis…analysing Twitter sentiment in the urban context and again…Wisdom of the crowd, research project from inter alia Demos and Ipsos MORI, launches with a look at Twitter’s reaction to the autumn statement…The six main arcs in storytelling, as identified by an AI

Aha, a links post…I’ve got links on text analysis and related all over the shop – see the category and tags for text mining and sentiment analysis on this blog for starters, in particular #ivmooc 4: what? and #ivmooc 2: burst detection, plus Word clouds for text mining. Here’s a broadly corpus related haul.




There’s no shortage of cases. Here’s a selection with particular appeal, either due to subject matter or methodology:

Blogs, Twitter…The dragonfly’s gaze looks at computational approaches to literary text analysis, with a nice post listing repositories and exploring file formats.

#corpusmooc: review that journal


Updates: why take notes? The Guardian view on knowledge in an information age. What type of note taker are you?

Each week in #corpusmooc, straight after the vids, we’ve been exhorted to “update your journal”. A bit of explanation might have been an idea for those not into Lancaster’s particular form of reflective practice, plus maybe “notes” would have worked better as a catch-all, but hey… As you can see there were 37 comments on this particular page (en passant, I think the comments count is new; maybe it wasn’t just me who queried what the number referred to – my initial thought was page views). But what’s to comment on?

Some people take handwritten notes, some use Wikipad or Evernote, and a couple use mindmapping “to keep the written record of the connections between ideas that come to my mind while learning and reflecting upon what I have learned”. Someone taking pen and paper notes commented that “I think I’m absorbing more and retaining what I learn better”. It’s particularly fun that handwritten notes are called out for being “slow” – for me a bigger problem is that underuse has led to my handwriting being even more appalling than before the advent of computers. Docear, an ‘academic literature suite’ offering electronic PDF highlighting as well as a reference manager and mindmapping, looks interesting.

Hamish Norbrook has a great approach:

Pen and paper transferred to the single file “MOOC notes”: individual units filed by unit number. I try and sift as I’m going into ‘Stuff I really need in my head and not on paper’, ‘Stuff I can come back to or refer to’ and… ‘Stuff I’m unlikely to understand’.

Having never mastered mindmapping I’m a fan of the bullet point. I’ve made the biggest use of screen captures on this MOOC, thanks to Laurence Anthony introducing us to the Windows snipping tool, but in the past I’ve also tried out – video watching and notetaking on one screen. Why take notes? An infographic on notetaking techniques offers some insights into the recording and retaining of information:

  • only 10% of a talk may last in your memory, but if you take and review notes you can recall about 80%
  • notetaking systems (who knew?) include the Cornell System with a cue column and notetaking and summaries areas, the outline system and the flow based system
  • writing vs typing – writing engages your brain while you form and connect letters helping you retain more – typing gives a greater quantity of notes

Here’s an article on student notetaking for recall and understanding.

The final activity on the course is to review your journal, as I suggested in week 4. A number of people have made some progress in analysing their personal or other corpora:

  • on The Waste Land: “‘you’ features as much as ‘I’, which brought home to me how much the fragments in The Waste Land are parts/one side of a conversation, though the actual ‘you’ may not be given a voice”
  • on own notes: “Besides the classic function words such as articles, pronouns, conjunctions we use to see in corpora, I just realized that I use a lot the word ‘so’ in different contexts, especially as an adverb (I have a tendency to write things like ‘this is so interesting’, ‘this subject is so important’, etc), and as a linking word that I seem to use at the beginning of almost every paragraph.”
  • on own tweets, comments on the MOOC but difficult to get data (groan)

Some people have gone the whole nine yards already. Liliana Lanz Vallejo:

I loaded the notes that I took of the course and I added the comments that I wrote in all the forums. This made a total of 9,436 word tokens and 2,338 word types. Something got my attention. While in most of the English corpora that I’ve checked in this course, the pronoun “I” appears close to rank 20, in my notes and comments corpus “I” appears in rank 2, after “the”. This is curious because the same thing happens in the corpus of tweets containing Spanish–English codeswitchings that I gathered some years ago. In it, “I” appears in rank 1 of words in English, while “the” is in rank 3. It seems that my English and the English of Tijuana’s Twitter users in my corpus is highly self-centered. We are focusing on our opinions and our actions. Of course, the new-GSL list, the LOB and Brown corpora and all the others were not made with “net-speech”. So there is a possibility of native English speakers favouring the usage of the pronoun “I” in social media or internet forums…I would need to compare my notes and comments corpus to a corpus made of forum comments, and the tweets corpus to one made of social media posts (or tweets, that would be even better).

Andrew Hardie (CPQweb guru) responds: “May this be a genre effect? Are comments/twitter posts of equivalent genre to the written data you are comparing it to? Use of 1st and 2nd person pronouns is generally considered a marker of interactivity or involvement, which is found in spoken conversation but not in most traditional formal written genres. But then, comments on here are not exactly what you would call traditional formal written genres!”. Kim Witten (mentor): “Also keep in mind that while “I” can be perceived as focused on opinions and actions, it is also often indicative of the act of sharing (e.g., “I think”, “I feel”, “I want”), which as Andrew says is a marker of interactivity or involvement. So perhaps it is inward-facing, but for the intent of being outward-connecting.”
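Checking the rank of “I” in a small corpus of your own takes only a few lines – a minimal sketch with a stand-in text rather than anyone’s actual notes:

```python
# Minimal word-frequency ranking, as used to compare the rank of "I"
# against reference corpora (toy text, invented for illustration).
from collections import Counter

text = "I think the course is great and I think the forums help"
tokens = text.lower().split()     # crude whitespace tokenisation
freqs = Counter(tokens)
ranked = [w for w, _ in freqs.most_common()]
print(ranked.index("i") + 1)      # rank of "i" (1-based)
print(len(tokens), len(freqs))    # word tokens vs word types
```

Real corpus tools tokenise far more carefully (punctuation, clitics like “I’ve”), which is one reason AntConc and friends exist.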

Anita Buzzi:

I generally take notes with pen and paper, so I decided to collect all the answers I gave in the two MOOCs on FutureLearn I attended, creating my own corpus delicti. I generated a word list with AntConc – word types 944, word tokens 2,937. The results: the first token is “the”, freq. 140; the second token reveals that my favourite preposition is “in”, freq. 105; then the list goes on showing “and”, “I”, “to”, “of”. I annotated the corpus in CLAWS – 3,016 words tagged, tagset C7 – and then USAS. I generated a word list in CLAWS C7 – word types 1,032, word tokens 5,910. The results show: NN nouns 812, JJ general adjectives 213, AT articles 201, II prepositions 181. I looked for VM modal verbs. The first modal, 17 hits, is “can”, and the concordance shows it mostly in association with “be”. The second, with 15 hits, is “may”: may share, provide, be, reflect, feel, represent. The third is “would”, 10 hits: would like, would be; followed by “could”, “should” and “will” (4 hits); “need to” just 1 hit. While the modal verbs in the London–Lund Corpus of Spoken English appear in this order: WOULD – CAN – WILL – COULD – MUST – SHOULD – MAY – MIGHT – SHALL, the results I had from my corpus were: CAN – MAY – WOULD – COULD – SHOULD – WILL – MIGHT. Why do I use “may” so much? Probably because I was talking about specific possibility, or making deductions.

Amy Aisha Brown (mentor): “Did you take a look at your concordance lines? What does ‘may’ collocate with? That might give you a hint at why you use it so much. Another thought, I wonder if someone has put Tony’s lectures into a corpus. It could be that he uses ‘may’ often and that you have picked it up from him? Maybe you always use it often?” Tamara Gorozhankina:

I’ve collected a very small corpus of all my comments through the course (4,835 tokens), and saved them in 8 separate text files (one per week). I used POS annotation in CLAWS C5, and the keyword list showed: nouns – 510, verbs – 208, adjectives – 177, adverbs – 100, personal pronouns – 74. Then I divided this tiny corpus into 2 subcorpora: the first for the comments of the first 4 weeks and the second for the comments of the last 4 weeks of the course. The number of tokens was balanced. After getting the results, I realised that there was an interesting shift in the use of personal pronouns: I tend to generalise ideas by using “we” in the comments of the first 4 weeks, while in the last weeks’ comments there’s a tendency to use “I” instead. These results are quite unexpected, I should say.
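The we→I shift Tamara describes is straightforward to check programmatically: compute relative pronoun frequencies per subcorpus and compare. A toy sketch with invented comments standing in for her files:

```python
# Toy comparison of pronoun use across two subcorpora (invented
# comments, not Tamara's corpus): relative frequency per 100 tokens.
from collections import Counter

weeks_1_4 = "we think we should consider what we learned"
weeks_5_8 = "I found that I prefer the tools I tested"

def rel_freq(text, word):
    tokens = text.lower().split()
    return 100 * Counter(tokens)[word] / len(tokens)

print(round(rel_freq(weeks_1_4, "we"), 1))  # 37.5
print(round(rel_freq(weeks_5_8, "i"), 1))   # 33.3
```

Normalising to a rate per 100 (or per million) tokens is what makes subcorpora of different sizes comparable.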

Finally, here’s a list of all the bloggers I’ve found on this MOOC:

See the #corpusMOOC tag for all my posts on this MOOC.

#corpusmooc 8 and wrap-up

See the #corpusMOOC tag for all my posts on this MOOC. One more to come, on notetaking and blogging, plus a little text analysis…

Week 8 was on swearing, focusing on conversational English, with a disclaimer encouraging participants to “discuss and debate the topic of this step in an adult and constructive manner”. The warm up activity asked participants to listen out for examples of bad language, make a brief note of the context, who was speaking to who [sic] and what was said, returned to in the discussion question/s: 

Did the analytical framework presented work for the data you collected? If so, which categories of bad language did you hear? If not, why not? Was the language used an issue perhaps? Were there contextual factors not present in the corpus data that seemed important to interpretation in context? Has linguistic innovation changed the use of bad language since the 1990s?

The vids looked at: what is ‘bad language’; developing a classification scheme for the data; do men swear more than women (no, but they use different words); how do men and women swear at the opposite gender (men swear less at women) and their own (men use stronger words); do different categories of swearing select stronger or weaker words systematically (quite possibly); how do bad language use and age interact (the young swear more, but is it age which is the issue); how do the use of bad language and class interact (tricky); and the desirability – and viability – of looking at multiple factors at the same time, combining gender and age/class in two case studies. It’s all in Tony’s book. See also an article on Rude Britannia, and When Swedes swear, they do so in English: “often in contravention of accepted linguistic norms”. It turns out there’s a network for swearing researchers in the Nordic countries, called SwiSca, and they’ve just published a book.

No quiz, instead the “the opportunity to participate in a rigorous assessment”, similar to that in week 4, with a choice of three essay questions:

  • use the Lancaster Newsbooks Corpus to identify key themes connected with the Glencairn Uprising
  • use the Lancaster-Oslo/Bergen Corpus (LOB) to explore the use of the passive construction in different genres of written English
  • use the VU-Lancaster Advanced Writing Corpus to explore the use of linking adverbials in advanced student writing

Still not for me.

OK, so what of this MOOC as a whole, and the FutureLearn platform?

Looking first at the discussion forum that wasn’t, it felt like hard work just to find comments – click to open the list, endless scroll…and then, once out of the normal ‘workflow’, how do you get back to the comments?

I’ve had to take this screenshot down to a silly size to get everything on, which makes the point itself (click for a clearer version):


The comments link at the bottom of the screen opens the list of comments for that page. You can click on poster names, but I’m not really sure why I would want to do that. To get to a broader list of comments you need to click on the square top left, which opens a window with Activity as an option (alongside To do and Progress), see the pretty graphics below.

Here you get options for everyone, following and replies, all completely out of context obv, although you can go to the thread. To find your own posts you need to look somewhere else entirely – top right, the little grey man (I didn’t add an avatar) offers My profile (plus my courses, settings, sign out), with activity, followers and following as options.

Finally, the grey block of nine in the centre at the top of the screen brings down links to courses, about and partners. It’s all a bit sparse, although the oceans of white space may in part be due to the size of my (pretty bog standard) laptop screen.

In addition I found the tone of the discussion forum off-putting. Every comment was given a pat on the head, and there seemed to be little substantive discussion. Moreover, on occasion the mentors might pose a question, but the commenter may never see it, as there’s no mail alert or sensible way of getting back to your comment. You’d spend more time trying to find stuff of interest than actually digesting the comments. Very disappointing, and that’s without addressing the point that participants were unable to initiate discussions outwith the defined structure of the course.

Whereas in some MOOCs the instructors are completely absent, here they were falling over each other – not a sustainable approach, and I wonder how this affected the discourse. In his final mail Tony comments: “So many of you have said that you have learned a lot from me. As always happens with corpus work, the teacher learns a lot from the students too” – all very binary. And there was no peer review – while acknowledging issues with that, it’s a further reflection of the nature of this beast.

Finally, ain’t it pretty, but what does it all mean?

Under Progress, top left of screen – go me!:


More broadly, while corpus linguistics is not rocket science at this level (and the conclusions often seem surprisingly subjective) it’s a technique I’m glad I know more about. For my needs there was too much on using massive corpora – some examples of smaller projects might be an idea next time out, plus less ‘pure’ linguistics. In terms of presentation it felt more like a ‘course’ aimed at a fairly traditional student cohort than something more innovative, due in part to the absence of community and curation – just a loong stream of stuff. Looking at Tony’s post on Macmillan Education, this is perhaps not altogether surprising:

Are MOOCs the future of education? Well, in my opinion, yes and no. Yes – we must use them…But then also no – MOOCs must live with, and complement, face-to-face teaching, in my view. The responsiveness and immediacy of face-to-face teaching cannot be readily provided via a MOOC. If nothing else, the scale of the enterprise defies any credible and sustained attempt at building a rapport with individual students, which is, in my experience, a key motivator for students and staff alike.

In the light of all the above it’s not really surprising that Twitter didn’t really take off, but here’s the TAGS bits n bobs: viewer | spreadsheet | spreadsheet map version | map:


#corpusmooc and spatial humanities

Update, 2016: not so supplementary now; see Spatial Humanities 2016 (programme: short & full | @spatialhums & #SH_2016), lots of delights.

Amongst the supplementary materials in weeks 6 and 8 of #corpusmooc was Ian Gregory on the potential for using GIS in corpus linguistics, aka spatial humanities.

First up, Mapping the Lakes (and version 2):


Place names were coded in XML and converted to a GIS, allowing mentions to be compared. Other features mapped included emotional response (on a scale of 1–10) and physical characteristics such as altitude. Photos from Flickr were also incorporated. The end result permitted close reading of the text alongside a map of the area described. Next up, a corpus of Lake District writing for the period up to 1900: over a million words from 80 texts.
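A rough sketch of that XML-to-GIS step – place names marked up in XML, then extracted and joined to coordinates for a map layer. The markup scheme, place names and coordinates below are invented for illustration, not the project’s actual schema:

```python
# Toy sketch of the Mapping the Lakes approach: place names marked up
# in XML, then extracted and joined to coordinates for a GIS layer.
# (Invented markup and coordinates, not the project's real data.)
import xml.etree.ElementTree as ET

xml = """<text>
  <s>We climbed <place>Skiddaw</place> before <place>Keswick</place>.</s>
</text>"""

COORDS = {"Skiddaw": (54.65, -3.15), "Keswick": (54.60, -3.13)}  # lat, lon

root = ET.fromstring(xml)
points = [(p.text, COORDS.get(p.text)) for p in root.iter("place")]
print(points)
# [('Skiddaw', (54.65, -3.15)), ('Keswick', (54.6, -3.13))]
```

Once each mention is a (name, coordinate) pair, comparing writers’ cityscapes or lakescapes becomes a mapping exercise.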

Next, geographical text analysis:


Claire Grover (of Trading Consequences) presented her work on georeferencing – identifying all the place names automatically, pulling them out of the text and linking them up to a gazetteer to give them a point location on a map – which found 17,667 instances of places mentioned in the Registrar General’s Reports, 1851–1911 (2 million words; Histpop). Recall: 81%, precision: 82%, and correct with locality: 75%. Mapping the instances and smoothing gave a pretty good reflection of major population centres in England, albeit with Bedford as an outlier cluster:


Analysing ‘London’ found high z-scores relating to water supply/quality, whereas the Liverpool/Manchester cluster was more descriptive of diseases, with no discourse on water supply. Exploring causes of death and mapping collocations with place names led to the following conclusions:


Geographical text analysis can help to understand the geographies within a corpus. At the moment we only have recall and precision statistics of about 80%, but this will get better, and even if it doesn’t you still have most of the place names within a text. Bringing together statistical summaries from corpus linguistics and micro/close readings helps understand what’s going on within a text, to aid in decisions on which parts you perhaps need to close read and which parts you can ignore.
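Those recall and precision figures are simple ratios over the recognised place names – a quick sketch of how they’re computed, with invented counts chosen only to echo the reported rates:

```python
# Precision and recall for place-name recognition (invented toy counts,
# not the actual Histpop evaluation data).
def precision_recall(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)  # of names found, how many correct
    recall = true_pos / (true_pos + false_neg)     # of real names, how many found
    return precision, recall

p, r = precision_recall(true_pos=82, false_pos=18, false_neg=19)
print(round(p, 2), round(r, 2))  # 0.82 0.81
```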

More on georeferencing place names, from Putting big data in its ‘place’: the power and value of amalgamating and querying content by ‘place’ has long been recognised through the use of place name gazetteers, but these have limitations, as they tend to record only modern place names and lack spatial resolution. A number of initiatives aim to extend the scope of modern gazetteers.

Some spatial hums linkage:

See also my post on Telling stories with maps: literary geographies.