Text mining workshop at #or2012

A workshop on text mining was held on 9 July at Open Repositories 2012. Below are some key points from the +/- 120 subhashtagged tweets. The session was avidly live tweeted by @criticalsteph, and proceedings will be published in. due. course.


Further sessions covered legal and ethical issues, for example (largely verbatim from tweets):

  • for mailing lists, is harvesting addresses legal, who owns the content?
  • copyright, contracts for resources, TOS, paywall, privacy and data protection law can all be barriers
  • shifting sands – law is dynamic, and changing; many see money in text/data mining, which can be a catalyst to rapid change
  • UK govt says non-commercial research can be an exception, although this must be done on larger scale with EU agreements
  • databases allow private order to be applied – lets publishers opt out of the text mining exception. Publishers want to keep control!
  • data/text mining could maybe be treated as an index? Author needs protection – maybe? There *is* an issue of author rights.
  • Is student author copyright being ignored in plagiarism software e.g. Turnitin? Legal challenge = no in USA. Unclear.
  • Privacy and Data Protection UK – sensible steps to follow, quite clear & can be used in text mining without problems BUT need to do a personalisation data minimisation risk assessment on this to show intent.

Key text mining resources:


  • discipline specific research – bound to be lots of law stuff about
  • techniques – sentiment analysis/subjectivity analysis, opinion mining, affect analysis, metaphor analysis
  • approaches – metadata extraction, categorisation, summarisation
  • text mining over the social web – community detection, timelines
  • legal aspects