#corpusmooc 2: collocations and keywords

See the #corpusMOOC tag for all my posts on this MOOC.

Week 2 has now been and gone. Like one of the other bloggers I’ve failed on the readings, which feel “a bit too dense and full of stats”. Time also didn’t permit a look at any of the supplementary/advanced material. I’m still querying the ‘flow’ created by sticking so much content within a weekly structure – it now feels like that material has been and gone, when surely many people are likely to want to review it later in the course, or even afterwards. There’s a real need for more than one way of accessing the content.

I delved a little into the comments – the ‘inline’ format seems to suit people who like to verbalise their thought processes, but for me it’s at best a distraction and feels like a conversation I’m not part of. Plus having to click on the pink box, scroll down, scroll up etc is not a very user friendly format. However there’s quite a lot of high level stuff going on in there, and Tony is present. Would be nice to do some #sna on the interactions. How many people outside a charmed group are actually interacting?

I also discovered by chance that if I take screen size down to 90% comments appear down the side – wonder how it works on other devices. Maybe that’s where I’m going wrong…

Below are my notes from this week’s lectures.

Collocation and keyword analysis are key tools in corpus linguistics, telling us ‘about’ texts and change over time as well as helping to decode argumentation strategies.


  • the top 10 frequent words in many corpora are pretty much the same (eg function words), but by using collocation we can manipulate and exploit frequency data to get deeper insights
  • definition – the systematic co-occurrence of words in use which tend to occur frequently with one another and as a consequence may start to determine or influence one another’s meanings (what?); eg telephone – operator, back – front (back to front, front and back), tell – story (tell me a story)
  • how close should words be to one another? typically set at +/- 5 words either side of the word we are looking at
  • many people set a minimum threshold of frequency for words to count as collocates, eg 10
  • also option to stop at sentence boundaries
  • to find collocates wrangle frequency data in mathematical ways, eg mutual information score (how often two words occur in the context of one another relative to how frequently they occur without one another)
  • shows how closely words might be associated and which words are more powerfully associated with the word we are interested in (not very accurate if you’re dealing with data which is low in terms of frequency – an alternative is the Dice coefficient)
  • potential observations found via collocates:
    • colligation – a strong affinity between a word and a particular grammatical class, eg ‘he’ colligates with verbs, ‘Mrs’ colligates with proper nouns, determiners colligate with nouns (collocation is associated with meaning)
    • groups of words may also tend to collocate with a particular word – semantic preference, ie the relation between a lemma or word form and a set of semantically related words, eg ‘diamond’ associates itself with a class of words which we call precious stones (all very libship here)
    • discourse prosody – types of words that characterise a speaker’s attitude; eg cause in a general corpus vs in an environmental corpus
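
To make the window and scoring ideas above a bit more concrete, here is a minimal Python sketch (mine, not from the course). It assumes the text is already a list of tokens, ignores sentence boundaries, and uses one common formulation of the MI score: log2 of observed over expected co-occurrences, with the expected count scaled by the window span. The function name `collocates` and its parameters are made up for illustration.

```python
import math
from collections import Counter

def collocates(tokens, node, window=5, min_freq=1):
    """Return {word: (co-occurrence count, MI score, Dice coefficient)}.

    Illustrative only: `window` is the +/- span around the node word and
    `min_freq` is the minimum co-occurrence count for a word to be kept.
    """
    freq = Counter(tokens)      # overall frequency of every word
    n_tokens = len(tokens)
    co = Counter()              # counts of words seen near the node

    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(n_tokens, i + window + 1)
        for j in range(lo, hi):
            if j != i:
                co[tokens[j]] += 1

    results = {}
    for word, observed in co.items():
        if observed < min_freq:
            continue
        # Expected co-occurrences if the two words were unrelated
        # (assumed formulation: scaled by the full span of 2 * window).
        expected = freq[node] * freq[word] * 2 * window / n_tokens
        mi = math.log2(observed / expected)
        # Dice: less sensitive to low-frequency words than MI.
        dice = 2 * observed / (freq[node] + freq[word])
        results[word] = (observed, mi, dice)
    return results
```

For example, `collocates("tell me a story and tell me another story".split(), "tell", window=2, min_freq=2)` keeps only "me", which occurs twice within two words of "tell". On a toy text like this the scores mean very little; the point is only to show where the window, the threshold and the two measures fit together.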

collocates: words you can know by the company they keep

Keyword analysis identifies salient words in a corpus, acting as signposts for a linguistic, cultural or discursive analysis. Explaining why keywords are there and what they do can lead to interesting and unexpected findings. Keywords can often not be predicted in advance as humans have cognitive biases when it comes to noticing frequencies. The statistical method is replicable and unbiased so it has a high reliability/validity from a scientific viewpoint.

  • a keyword list is calculated by comparing two frequency lists – usually a much larger reference corpus against a smaller specialised corpus (but sometimes two equal sized corpora)
  • Chi square or log likelihood tests identify the words that are statistically much more frequent in one list
  • when is a word a keyword? again, choose a cut-off point for statistical significance (top 10/50/100 keywords), look at minimum frequency (must occur eg 20 times), must be fairly well distributed (eg across 20 texts)
  • common types of keywords are proper names, markers of style (often grammatical words like must, betwixt), spelling idiosyncrasies (color/colour), ‘aboutness’ words (politics, a recipe)
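
As a rough illustration of the comparison described above, here is a small Python sketch of a keyword list based on the log likelihood statistic in its usual two-corpus form. It is my own sketch under simplifying assumptions: both corpora are plain token lists, only words that are relatively more frequent in the specialised corpus are kept, and the distribution-across-texts check is left out. The names `log_likelihood` and `keywords` are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log likelihood for a word occurring a times in a corpus of c tokens
    and b times in a corpus of d tokens."""
    e1 = c * (a + b) / (c + d)   # expected count in corpus 1
    e2 = d * (a + b) / (c + d)   # expected count in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, reference_tokens, min_freq=1, top_n=10):
    """Rank words that are unusually frequent in the study corpus."""
    study = Counter(study_tokens)
    ref = Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]            # 0 if the word is absent from the reference
        if a < min_freq:
            continue
        # Skip words that are not relatively more frequent in the study corpus.
        if a / c <= b / d:
            continue
        scored.append((word, log_likelihood(a, b, c, d)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

A higher score means stronger evidence that the difference in frequency is not down to chance; the cut-off (`top_n`) and `min_freq` correspond to the choices mentioned in the notes above.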

Use to look at stable vs emerging keywords, eg change over time. In British English some words have:

  • steadily become less frequent; eg terms of address: Mr, Mrs, sir (a more informal society?), stronger modal verbs: shall, must, should (a more democratic society?), longer forms: cannot, upon, half (densification?)
  • steadily become more frequent; eg contracted forms (it’s, didn’t, I’m, don’t), writing numbers (34 rather than thirty four), social terms (family, children, child, people, social, health, help)
  • stayed pretty much the same (lock words); eg weaker modality (can, could, would), wh- word (who, what, whether, where), body parts (hand, face, head, body, eyes), other nouns (life, world, government, money)

Reflecting changing values, eg in relation to children.
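
A very crude way to sort words into these three groups, sketched in Python. This is my own simplification, not the method used in the lectures: it normalises counts to a rate per million words and compares only the earliest and latest corpus, so it does not check that the change is steady. The 20% tolerance and the function names are arbitrary assumptions.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to a rate per million words."""
    return count * 1_000_000 / corpus_size

def classify_trend(counts, sizes, tolerance=0.2):
    """Label a word 'rising', 'falling' or 'lock'.

    counts: raw frequency of the word in each corpus, oldest first
    sizes:  total number of words in each corpus, same order
    tolerance: hypothetical threshold; relative changes smaller than
               this are treated as 'pretty much the same'
    """
    rates = [per_million(c, n) for c, n in zip(counts, sizes)]
    first, last = rates[0], rates[-1]
    if first == 0:
        return "rising" if last > 0 else "lock"
    change = (last - first) / first
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "falling"
    return "lock"
```

Normalising matters because the corpora being compared are rarely the same size: a word with 10 hits in a million words and 40 hits in two million has doubled in relative terms, not quadrupled.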


  • What is collocation? – systematic co-occurrence relationships between words in use, eg tell-story
  • What is a colligation? – a strong affinity of a word for a grammatical class, eg Mr-proper noun
  • What is a semantic preference? – a relationship between a word and a set of other words that form a semantic category, eg glass of-DRINKABLE LIQUIDS
  • What is discourse prosody? – the way that words in a corpus can collocate with a related set of words or phrases, often revealing (hidden) attitudes, eg happen-UNPLEASANT THINGS
  • What is the keyword method? – a method of identifying words that are statistically significantly more frequent in one corpus as compared with another corpus (ie differences between corpora)
  • What is a lock word? – a word that has pretty much the same frequency over time (ie similarities between corpora)
