Text analysis in Denmark

As part of Social Media Week, Copenhagen Business School (CBS) held an event on Social media analytics: concepts, models, methods and tools (aka #smwcbsdata) on 21 February. I couldn’t go, but made it to Computing feelings: Danish approaches to sentiment analysis on 28 February. Both events are also interesting as examples of Danish academics engaging with the public.

Computing feelings kicked off with a 15 minute keynote from Bernardo Huberman (HP Labs) on #some and the attention economy, which I missed, followed by Anders Søgaard (Center for Sprogteknologi, KU; paper).

The panel Q&A shed some light on Infomedia, who do press scanning for companies, aka media monitoring. Takes me back… They employ 40 students as coders, so from a business perspective it would make sense to automate as much as possible. Broad word/phrase searches can bring up as much as 80% noise (the recall vs precision trade-off), but they also need to monitor for topics, eg market performance. Can sentiment analysis help? Is it relevant for factual articles, eg stock rise/fall? Topic extraction can look for new topics coming up, as well as picking up topics which aren’t there (negative keywords), which can also help with the “don’t know what you don’t know” issue. Infomedia aim for as high a level of accuracy as possible, but is it ever possible to get better than 85% accuracy?
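
To make the recall vs precision point concrete, here is a minimal sketch in Python. The function name and the numbers are my own made-up illustration, not Infomedia's data: a broad search returns 10 articles, of which only 2 are relevant, and those 2 are all the relevant articles there are.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: share of what the search returned that is actually relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of all relevant items that the search managed to find.
    recall = true_positives / len(relevant) if relevant else 0.0
    # F1: harmonic mean of the two, a common form of the f-score.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up example: article ids 0..9 come back from the search, ids 0 and 1 matter.
retrieved = range(10)
relevant = [0, 1]
precision, recall, f1 = precision_recall_f1(retrieved, relevant)
print(precision, recall, f1)
```

Here recall is perfect (nothing relevant is missed) but precision is only 0.2, ie 80% of the hits are noise, which is the situation the panel described.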

I also missed Dan Hardt’s intro, but I’m guessing it was similar to his #smwcph preso, with an introduction to the two tools and datasets used by the bulk of the six 10 minute presentations:

  • Project #marius –  presented by Chris Zimmerman (@socialbeit | @zimme2cj); see Project #marius and infostorms
  • Ensemble classifiers – René Madsen (Infomedia) & Niels Buus Lassen (Evalua) on combining classifier algorithms by embedding, running in parallel or as a chain, with the aim of improving recall and precision; if you know your domain you can train algorithms to perform really well; looked at Trustpilot reviews; presented on predictive modelling (preso) at #smwcph, see article
  • From simple to more advanced sentiment lexicon approach – Anders Boje Larsen (@anders_boje) uses a sentiment lexicon approach – human judgement, not machine learning (qv Dan’s slides); words indexed/tagged from +5 to -5 and then run on text; 7K words indexed so far; he has now created dictionaries for decrementers and incrementers, invertors and stop words, which increases accuracy; see article and Sociallytic tool
  • three presos on domain adaptation, ie using a classifier trained in one domain on another; Unigram bag of words model, ie all the words, then find top 10 features per class (ie top 10 words for positive, negative, neutral); 1-gram vs 3-gram, ie n-grams; f-score results; movie reviews (Scope, 60% neutral) different from company reviews (Trustpilot, 80% positive); create a vector from features, a string of 1s and 0s; logistic regression classifier; domain specific features, cf self driving car and a ditch –> Infomedia needs client specific algorithms for each domain
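
The sentiment lexicon approach in the list above can be sketched roughly as follows. This is my own toy illustration, not the Sociallytic implementation: the word lists, weights and the rule that a modifier only affects the next sentiment word are all assumptions, and a real lexicon would have thousands of entries.

```python
import re

# Hypothetical mini-dictionaries (assumed values, scored from +5 to -5).
LEXICON = {"good": 3, "great": 4, "bad": -3, "terrible": -5}
INCREMENTERS = {"very": 1.5, "really": 1.5}      # strengthen the next sentiment word
DECREMENTERS = {"slightly": 0.5, "somewhat": 0.5}  # weaken the next sentiment word
INVERTORS = {"not", "never"}                     # flip the sign of the next sentiment word
STOP_WORDS = {"the", "a", "is", "was", "it"}


def score(text):
    """Sum lexicon scores over the text, applying modifiers to the next sentiment word."""
    tokens = [t for t in re.findall(r"[a-zæøå']+", text.lower())
              if t not in STOP_WORDS]
    total = 0.0
    multiplier = 1.0
    for token in tokens:
        if token in INVERTORS:
            multiplier *= -1
        elif token in INCREMENTERS:
            multiplier *= INCREMENTERS[token]
        elif token in DECREMENTERS:
            multiplier *= DECREMENTERS[token]
        elif token in LEXICON:
            total += multiplier * LEXICON[token]
            multiplier = 1.0
        else:
            # Any other word breaks the chain of modifiers.
            multiplier = 1.0
    return total


print(score("The service was not good"))
print(score("very good"))
```

Removing stop words before scoring lets "not a good idea" behave like "not good", which is one way the extra dictionaries could improve accuracy over a plain word list.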

The limitations of sentiment analysis:

  • classifying text as positive, negative or neutral is not an exact science, plus doesn’t include sarcasm, emotional state etc (may end up as neutral by default)
  • DK parsers not there yet – POS analysis OK, semantic stuff lacking
  • 90% accuracy possible for both Danish and English – but how do you find and determine accuracy?
  • systems could usefully create a flag for “hooman needs to look at this”
  • linguistic challenges include negation, intensifiers, irony
  • sentences with attitude verbs, modals, helping verbs are difficult to parse
  • modal sentences could be removed as they are not reality anyway…
  • cf not – changes meaning!
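
A "human needs to look at this" flag along the lines suggested above could be as simple as the sketch below. The word lists, the `neutral_band` threshold and the function name are my own assumptions for illustration; nobody at the event presented this rule.

```python
import re

# Hypothetical trigger words (assumed, English only).
NEGATIONS = {"not", "no", "never"}
MODALS = {"could", "would", "should", "might", "may"}


def needs_human_review(text, score, neutral_band=1.0):
    """Flag a text when its score is close to neutral or it contains hard-to-parse words."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    # A score near zero may be genuinely neutral, or sarcasm that ended up neutral by default.
    near_neutral = abs(score) < neutral_band
    # Negation and modals are the constructions the presenters called difficult.
    has_hard_construction = bool(tokens & (NEGATIONS | MODALS))
    return near_neutral or has_hard_construction


print(needs_human_review("This could be great", 4.0))
print(needs_human_review("Great product", 4.0))
```

The `score` argument is whatever the sentiment system produced for the text; the flag only decides whether a human coder should double-check it.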

The events were masterminded by the CBS Competitiveness Platform (@CBSCompete | Facebook), who held a similar ‘open’ event last year (Workshop on big data and language technology), with the Dept of IT Management’s Dan Hardt and Ravi Vatrapu.

Some local cases:
