This post summarises the AntConc practicals from #corpusMOOC.
The first practicals use Laurence Anthony’s AntConc (videos). I’ve tried out a few text analysis tools in the past but not got very far – let’s see if having a framework – aka being led through by the hand – helps.
Note, on Tuesday morning I seem to be the first person looking at these pages, or maybe the stat is for those who have marked it as complete. Haha turns out they are actually comments.
After several goes I think it will be rather easier to use two devices – lots of window switching is making things rather tedious. So happy I bought a Chromebook! There are subtitles but no transcript, plus the video is pretty poor quality and you can’t speed it up or skip back and forth to review bits, but hey! It’s not that hard, in fact perhaps it’s been over-complicated a bit.
Two corpora are on offer, both snapshot corpora representing a broad range of genres of published, professionally authored English at one moment in time:
- Brown – America in the early 1960s, built at Brown University (available via Sketch Engine)
- LOB – UK in the early 1960s, built by Lancaster-Oslo-Bergen universities
Bits n pieces:
- can use AntConc with any language, yay! just need a text file
- can use File – Save to get output as a text file
- type vs token – a token is any single occurrence of a word in the corpus, a type is a unique word form, so the number of types is the number of distinct word forms present in a corpus
- Search – Start – can increase window size to eg 100 char
- Sort – 1R, 1L etc
- hover over a hit to go to File View
- can search for phrases, wild card * for 0 or more char, then sort on 0 word at Level 1
- details of wild cards in Global Settings
- search for a string within a word by unchecking the Words option
- also search by Case or Regex (Regular Expression, ie formula)
- compare two sets of results via Clone Results
- key is to keep an eye on what you’ve got checked…
- search history – via up/down arrow
- select – double click for a word, triple click for a line, shift + click for row, ctrl + click for non-sequential rows, ctrl + alt + a for everything in results window, then copy n paste
- delete row via delete, delete whole lot bar line you’re on via insert
- context search – eg said within 5 words of report
- more options via Advanced Button and Tool Preferences, can’t be that hard
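To get my head round what the concordance tool is doing, here is a rough Python sketch of the ideas in the list above: counting tokens and types, a keyword-in-context search with * as a wild card, sorting on 1R, and a context search. This is not how AntConc itself is written; the function names, the simple tokeniser and the sample sentence are all my own assumptions.

```python
import re

def tokenize(text):
    """Lower-case the text and keep runs of letters/apostrophes as tokens (an assumed, simple definition)."""
    return re.findall(r"[a-z']+", text.lower())

def wildcard_to_regex(pattern):
    """Turn a search term where * stands for zero or more characters into a regex."""
    return re.compile(".*".join(re.escape(part) for part in pattern.split("*")))

def kwic(tokens, pattern, window=3):
    """Return (left context, hit, right context) for every token matching the pattern."""
    regex = wildcard_to_regex(pattern)
    hits = []
    for i, tok in enumerate(tokens):
        if regex.fullmatch(tok):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits

def sort_1r(hits):
    """Sort concordance lines on the first word to the right of the hit (1R)."""
    return sorted(hits, key=lambda h: h[2][0] if h[2] else "")

def context_search(tokens, term, context_term, horizon=5):
    """Positions of `term` that have `context_term` within `horizon` words on either side."""
    found = []
    for i, tok in enumerate(tokens):
        if tok == term:
            nearby = tokens[max(0, i - horizon):i] + tokens[i + 1:i + 1 + horizon]
            if context_term in nearby:
                found.append(i)
    return found

# Made-up sample text, standing in for a corpus file
sample = "The report said the minister reported a delay. She said nothing more about the reports."
tokens = tokenize(sample)
types = set(tokens)
print(len(tokens), "tokens,", len(types), "types")

for left, hit, right in sort_1r(kwic(tokens, "report*")):
    print(" ".join(left).rjust(25), "|", hit, "|", " ".join(right))

print(context_search(tokens, "said", "report"))
```

The window here is counted in words rather than characters, purely to keep the sketch short.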
Creating and using a word list:
- can sort by word, word end or frequency, plus invert
- forces lower case by default – can change in Tool Preferences
- similar search options to concordance tool
- can load a lemma list and then words will be grouped by lemma (repeat search when loaded)
- to exclude some words from your list choose Word List Range – can also create a list containing only specific words (stop word lists: AntConc | Wikipedia)
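The word list options above can be sketched in a few lines of Python. Again this is only an illustration under my own assumptions: the lemma list, the stop words and the sample text are invented, and the functions are hypothetical rather than anything AntConc exposes.

```python
import re
from collections import Counter

def word_list(text, stop_words=(), lemma_map=None, lower=True):
    """Count word frequencies, optionally dropping stop words and grouping forms by lemma."""
    if lower:
        text = text.lower()  # mirrors the "forces lower case by default" behaviour
    lemma_map = lemma_map or {}
    counts = Counter()
    for tok in re.findall(r"[A-Za-z']+", text):
        if tok in stop_words:
            continue
        counts[lemma_map.get(tok, tok)] += 1
    return counts

def by_frequency(counts, invert=False):
    """Most frequent first (ties alphabetical); invert=True flips the whole order."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]), reverse=invert)

def by_word_end(counts):
    """Sort on the reversed spelling, so words with the same ending sit together."""
    return sorted(counts.items(), key=lambda kv: kv[0][::-1])

# Invented sample text, lemma list and stop word list
sample = "He goes home where she went. They go and we go home; going is gone."
lemmas = {"goes": "go", "went": "go", "going": "go", "gone": "go"}
stop = {"he", "she", "they", "we", "and", "is", "where"}

plain = word_list(sample, stop_words=stop)
grouped = word_list(sample, stop_words=stop, lemma_map=lemmas)

print(by_frequency(plain))
print(by_word_end(plain))
print(by_frequency(grouped))
```

With the lemma list loaded, goes, went, going and gone are all counted under go, which is the grouping the practical describes.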
Collocation – move along, nothing to see here, perfectly straightforward.
Keyword analysis – possibly a few fiddly bits here, but it is all in Help, so not too fussed. Like the idea of finding negative keywords, ie those which appear unusually infrequently.
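For what it’s worth, here is how I picture positive and negative keywords, as a small Python sketch. It compares a target word list with a reference word list using a log-likelihood score, which is one statistic commonly used for keyness; the notes above don’t say which statistic the tool applies, so treat that choice, the function names and the toy counts as my assumptions.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in a target corpus of c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the target corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(target, reference):
    """Return (word, direction, score) rows, strongest first; '-' marks a negative keyword."""
    c, d = sum(target.values()), sum(reference.values())
    rows = []
    for word in set(target) | set(reference):
        a, b = target[word], reference[word]
        if a / c > b / d:
            direction = "+"   # unusually frequent in the target
        elif a / c < b / d:
            direction = "-"   # unusually infrequent in the target
        else:
            direction = "="
        rows.append((word, direction, round(log_likelihood(a, b, c, d), 2)))
    return sorted(rows, key=lambda r: (-r[2], r[0]))

# Toy counts, invented for illustration only
target = Counter({"colour": 8, "the": 92})
reference = Counter({"colour": 1, "color": 9, "the": 90})

for row in keywords(target, reference):
    print(row)
```

In this toy example color comes out as a negative keyword for the target list, because it turns up in the reference counts but never in the target.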
File view tool:
- full text browsing, gives a very broad context
- can search for a term in the same way as using the concordance tool
- CTRL + CLICK to jump to nearest hit
- click on hit to see all hits for the term in the concordance tool
- ie two tools work together
Clusters tool:
- clusters are repeated patterns of two or more words in a corpus
- can also summarise results from concordance tool
- can fiddle with cluster size and minimum frequency, minimum range (how many files it appears in)
- click on a hit to see in concordance tool, sort by frequency, range, word (a-z), word end, probability etc
- probability – how likely it is a second word will occur after the first word
N-grams tool:
- part of clusters tool
- enables you to search the entire corpus for N word clusters, eg 2 word: this is – is a – a pen
- ie don’t need to specify a search term, can find patterns
- can set size, minimum frequency, range…
- other options as for clusters tool
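A quick Python sketch of clusters, N-grams and the probability column, as I understand them from the notes above. The definition of probability used here (how often the first word is immediately followed by the second) is my reading of the description, not a confirmed formula, and range across files is left out because the toy corpus is a single string.

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Every run of n consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_table(tokens, n=2, min_freq=1):
    """N-grams with their frequencies, most frequent first, dropping rare ones."""
    counts = Counter(ngrams(tokens, n))
    return [(gram, c) for gram, c in counts.most_common() if c >= min_freq]

def clusters(tokens, term, n=2, min_freq=1):
    """Like ngram_table, but only n-grams that start with the search term."""
    counts = Counter(g for g in ngrams(tokens, n) if g[0] == term)
    return [(gram, c) for gram, c in counts.most_common() if c >= min_freq]

def transition_probability(tokens, first, second):
    """Share of occurrences of `first` (excluding a final one) that are followed by `second`."""
    firsts = sum(1 for t in tokens[:-1] if t == first)
    pairs = Counter(ngrams(tokens, 2))[(first, second)]
    return pairs / firsts if firsts else 0.0

tokens = "this is a pen this is a book".split()

print(ngrams("this is a pen".split(), 2))
print(ngram_table(tokens, n=2, min_freq=2))
print(clusters(tokens, "a"))
print(transition_probability(tokens, "a", "pen"))
```

The first print reproduces the example from the practical: this is, is a, a pen.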
Working with tagged data:
- see UCREL CLAWS7 Tagset, mnemonics for parts of speech
- word followed by underscore followed by tag, eg the_AT (article)
- tags treated as word strings, ie can search for word + tag, word w/o tag, or just tag
- tags can be shown or hidden via Global Settings – Tags –> two options, one allows tags to be searched but not displayed
- press CTRL + RTN to reveal tags if desired, normal RTN or Start to hide tags
- use Global Settings – Token definition to define words to include eg numbers, underscores if desired
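The word_TAG format is easy to play with outside the tool too. This sketch, with hypothetical function names and an invented sample sentence tagged by hand with CLAWS7-style tags, shows the three kinds of search mentioned above (word plus tag, word without tag, tag only) and hiding the tags for display.

```python
def parse_tagged(text):
    """Split 'word_TAG' items into (word, tag) pairs; untagged items get tag None."""
    pairs = []
    for item in text.split():
        word, sep, tag = item.rpartition("_")
        if sep:
            pairs.append((word, tag))
        else:
            pairs.append((item, None))
    return pairs

def hide_tags(pairs):
    """Display the text with the tags stripped out."""
    return " ".join(word for word, _ in pairs)

def search(pairs, word=None, tag=None):
    """Match on word, on tag, or on both; None means 'anything'."""
    return [(w, t) for w, t in pairs
            if (word is None or w.lower() == word.lower())
            and (tag is None or t == tag)]

# Invented sample sentence, tagged by hand for illustration
tagged = "the_AT report_NN1 said_VVD the_AT minister_NN1 will_VM report_VVI"
pairs = parse_tagged(tagged)

print(hide_tags(pairs))
print(search(pairs, word="report"))             # word without tag
print(search(pairs, tag="AT"))                  # just the tag
print(search(pairs, word="report", tag="VVI"))  # word + tag
```

Searching on word plus tag is what separates report the noun from report the verb.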
OK I’ve done more than in week 2 but the shading is more pink, so once again I’ve misinterpreted the platform – I’m guessing the colour depends on how far through the week it is.
Some posts on AntConc in action:
- A criminologist’s introduction to AntConc and concordance analysis
- Exploring the grammar of Netflix in The Atlantic
- Corpus linguistics for historians
- Was Textmining über das neue Sarrazin-Buch hervorbringt
- see also @heatherfro on Getting started with AntConc
Update, June: Laurence is on sabbatical in Lancaster and has developed some more bits and pieces.