See the #corpusMOOC tag for all my posts on this MOOC.
This week considers the practicalities of building a corpus. The practical introduces two tools to help automatically annotate English language materials, and there’s an assessment, but no quiz.
Issues to consider:
- Corpus design – size depends on genre; a rare feature needs a large corpus; importance of sampling
- Planning a storage system and keeping records
- Obtaining permissions
- Text capture – see handout for examples
- Markup – can be done automatically, although error tagging (e.g. a wrong word form used, something missing, a word/phrase that needs replacing…) must be done by hand
The assessment takes the form of marking essays – I'm not quite sure why – and involves 14 questions. 14 questions? Don't think so. It's the usual one-size-fits-all approach.
How to analyse a corpus
Annotating a corpus using online tools developed at Lancaster University:
- CLAWS – a part-of-speech (POS) tagger; web interface available for a chunk of text, which you can save as a text file and load into AntConc
- USAS – a semantic tagger; web interface; another nice overlapping example with librarianship
See chapters 2 and 4 in Corpus annotation for more on this.
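Once CLAWS gives you a tagged file, you can do quick tallies on it outside AntConc too. Here's a minimal Python sketch that counts POS tags in CLAWS horizontal output, assuming the common `word_TAG` underscore-separated format (adjust the separator if your download differs); the sample sentence and its tags are purely illustrative:

```python
from collections import Counter

def count_c5_tags(tagged_text):
    """Count POS tags in CLAWS horizontal output (word_TAG pairs).

    Assumes an underscore separates word and tag; tokens without an
    underscore are skipped.
    """
    tags = Counter()
    for token in tagged_text.split():
        if "_" in token:
            word, _, tag = token.rpartition("_")
            tags[tag] += 1
    return tags

# Illustrative C5-style sample, not real CLAWS output
sample = "The_AT0 cat_NN1 sat_VVD on_PRP the_AT0 mat_NN1 ._PUN"
print(count_c5_tags(sample).most_common(3))
```

Loading the same file into AntConc should give you the same raw counts, which is a handy sanity check.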
The warm up activity suggests creating a personal corpus, something I looked at back in 2012 and also thought about here:
It might be fun to look at all the #corpusmooc blog posts – would other bloggers be up for creating and sharing text files? Haven't had much time to take this further yet, but hope to this week. When I first looked at my posts I had 3,605 tokens and 977 word types. How to download a stop list? Or are the little words of interest? Need to look at keywords, plus how to interpret results – or to turn it round, what's my research question? In terms of Danish research questions, it might be interesting to explore tropes such as events which løber af stablen ('take place', literally 'run off the stack'), use of vi/danskere ('we'/'Danes'), etc.
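To get a feel for how much a stop list changes those token/type figures, here's a rough Python sketch. The stop list here is a tiny illustrative one (real lists, e.g. those distributed for corpus tools, run to hundreds of words), and the regex tokenisation is simpler than AntConc's, so counts won't match exactly:

```python
import re
from collections import Counter

# Tiny illustrative stop list -- a real one would be much longer
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def type_token_counts(text, stop_words=frozenset()):
    """Return (token count, type count), optionally filtering stop words.

    Tokens are lower-cased runs of word characters -- a cruder
    definition than AntConc's configurable one.
    """
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    tokens = [t for t in tokens if t not in stop_words]
    return len(tokens), len(set(tokens))

text = "The little words are of interest, or are they?"
print(type_token_counts(text))              # all tokens
print(type_token_counts(text, STOP_WORDS))  # stop words removed
```

Comparing the two runs shows directly how much of the corpus the little words account for – which may itself be the interesting finding.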
Suggestions for the practical activity include creating POS and semantically annotated versions of your personal corpus. Necessary token definition settings: Global Settings – Token Definition – make sure the ‘Letter,’ ‘Number,’ ‘Punctuation,’ ‘Symbol,’ and ‘Mark’ checkboxes are activated. Next, click ‘Apply’ and proceed with your analysis.
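Those AntConc checkboxes map onto Unicode general categories (Letter, Number, Punctuation, Symbol, Mark). As a rough illustration of what activating them does, here's a sketch of a tokeniser that treats characters in those five classes as token characters and everything else (whitespace, separators) as boundaries – a simplification of, not a reimplementation of, AntConc's behaviour:

```python
import unicodedata

# General-category prefixes mirroring the five activated checkboxes:
# L = Letter, N = Number, P = Punctuation, S = Symbol, M = Mark
TOKEN_CLASSES = {"L", "N", "P", "S", "M"}

def is_token_char(ch):
    """True if ch falls in one of the activated character classes."""
    return unicodedata.category(ch)[0] in TOKEN_CLASSES

def tokenize(text):
    """Split text into maximal runs of token characters."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(tokenize("CLAWS tags 5,337 words!"))
```

Note that with Punctuation activated, `5,337` and `words!` each survive as single tokens – exactly why these settings matter when you load tagged files like `you_PNP` into AntConc.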
Re the forum that isn’t…
The fact there are no titles to threads means you can’t tell at a glance what might be of interest.
Now, this is tantalising, but how would you locate it in the forum?
Actually, it wasn't that hard: a search for 'lecture' on the practical activity page brought up:
I managed to create a corpus from the lectures by Tony… It is the five lectures from week 2, and I got the following results: 5,337 words tagged with part-of-speech annotation, using tag set C5 in horizontal format.
The AntConc results were: 1,354 word types and 6,547 word tokens. I created a word list and noticed that the word 'you' was high, possibly because this corpus is transcribed speech. The word was ranked 11, with a frequency of 81. The tag code was pnp, which I couldn't find on the list, but it is still a pronoun and should be ppy.
Using the semantic tagging, I managed to tag 5,336 words (I don't know why there is a difference of one) in the horizontal format. In AntConc, the result was: 1,727 word types and 6,312 word tokens. I looked at the word 'you' again and found that it ranked 12 with a frequency of 74; its code was z8mf (Z8 pronouns); the m and f denote male or female.
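That kind of lookup – which tag(s) did 'you' get in each tagged version? – is easy to script. A small sketch, assuming both the CLAWS POS output and the USAS semantic output use the underscore-separated `word_TAG` horizontal format; the two sample strings below (and tags other than PNP/Z8mf, which come from the forum post above) are made up for illustration:

```python
from collections import Counter

def tags_for_word(tagged_text, word):
    """Tally the tags assigned to one word in word_TAG output.

    Works for any underscore-separated horizontal format, whether the
    tags are POS (C5) or semantic (USAS).
    """
    found = Counter()
    for token in tagged_text.split():
        w, _, tag = token.rpartition("_")
        if w.lower() == word.lower():
            found[tag] += 1
    return found

# Illustrative samples, not real tagger output
pos = "you_PNP said_VVD you_PNP would_VM0"
sem = "you_Z8mf said_Q2.1 you_Z8mf would_A7"
print(tags_for_word(pos, "you"))
print(tags_for_word(sem, "you"))
```

Running it over both versions of the same corpus would also surface the one-word count discrepancy the poster noticed, since any token the tagger dropped simply won't appear.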
Looking at another issue:
The team seems to be doing a phenomenal job responding on every possible level. Does that mean that participants are more passive, though?
Couple of blog posts this week: