Update: see coverage of the CCC Symposium in Copenhagen on 16 October for lots more on the perils of big data and quant methods.
The second round of #nsmnss activities took the form of a tweetchat (24 September) followed by a knowledge exchange event (26 September). Topics covered: data visualisation, populations and sampling, and big data. Notes below mainly paraphrased from Twitter. Here are some videos too (posted 25 Jan 2013), and there’s a webinar on 8 April which I am sitting out.
Are we doing more than data mining when we analyse social media data? Which research questions is it best able to answer? What are the biggest methodological challenges when working with quantitative social media data?
- Compared to surveys, social media data are conversations rather than responses to standardised questions.
- Overall quality of data – social media is an expressive medium.
Social media data can often be analysed using visual methods. How can we visualise data collected by social media? How does visualisation relate to statistical analysis? What are the payoffs from using visualisations?
- We can see patterns not visible in a table; for example, we can compare categories of data by visualising the size of each block of data – see Information is Beautiful.
- Tools can explore data interactively – see Interactive Visualizations, Gapminder and Hans Rosling’s TED Talk – and track how users explore data.
- Visualization for storytelling and illustration, or visualization for prediction? Visualization to answer research questions.
- “Often beautiful, sometimes helpful.” Dataviz may be adding impact and getting statistics to a wider audience, but is it adding anything methodologically? Visualisations can be misleading – remember the real purpose of the data.
What is the ‘population’ on social media platforms? How do platforms differ in population characteristics? How can we select cases or samples on social media? Is it possible to get a statistically representative sample from social media platforms? Does it matter?
- Sampling goal: to make statements about a certain group of people or objects (webpages, tweets etc)
- Sampling frame: list of every object in the population. The Internet raises special issues: it is hard to find a list of all blogs, pages etc, so bias is common.
- Huge problem with sampling – lack of demographic info on most people’s profiles makes this even harder.
- Advantages of sampling with the Internet – cheap, fast turnaround, no interviewer effects.
- Disadvantages – many have no or limited Internet access, so cannot generalise findings. Population characteristics unknown, meaning of behaviour may be unknown.
- Social media users are only representative of social media users, not of any larger group – see Sampling and social media and Tortoise or the hare: social media sampling.
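To make the coverage problem in the points above concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the population is synthetic, the function name `simulate_coverage_bias` is hypothetical, and the rule that the chance of being a social media user falls with age is invented, not taken from any of the studies mentioned. It simply shows that a sample drawn only from users can give a biased estimate of a whole-population figure, however large the sample.

```python
import random


def simulate_coverage_bias(pop_size=100_000, sample_size=1_000, seed=42):
    """Compare a mean estimated from a full-population sample with one
    estimated from a sample of social media users only (synthetic data)."""
    rng = random.Random(seed)

    population = []
    for _ in range(pop_size):
        age = rng.randint(18, 90)
        # Assumption for illustration: usage probability declines with age.
        p_user = max(0.05, 0.9 - 0.012 * (age - 18))
        is_user = rng.random() < p_user
        population.append((age, is_user))

    all_ages = [age for age, _ in population]
    user_ages = [age for age, is_user in population if is_user]

    true_mean = sum(all_ages) / len(all_ages)

    # Sampling frame 1: everyone in the population.
    full_sample = rng.sample(all_ages, sample_size)
    # Sampling frame 2: social media users only.
    user_sample = rng.sample(user_ages, sample_size)

    full_mean = sum(full_sample) / sample_size
    user_mean = sum(user_sample) / sample_size
    return true_mean, full_mean, user_mean


true_mean, full_mean, user_mean = simulate_coverage_bias()
print(f"True mean age:            {true_mean:.1f}")
print(f"Full-frame sample mean:   {full_mean:.1f}")
print(f"Users-only sample mean:   {user_mean:.1f}")
```

Under these assumptions the users-only estimate comes out noticeably younger than the true mean, while the full-frame sample lands close to it. Increasing `sample_size` narrows the random error but does not remove the gap, which is the sense in which social media users are only representative of social media users.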
Social media research can involve very large datasets. What do we gain and lose with big data? How is big data changing the way we do research?
- Are we analysing big data because it’s available rather than because it is suitable to answer our research questions? Bigger does not necessarily mean more useful.
- Availability of big data precedes availability of suitable quant methods for analysing complex data structures – we need to understand the structure before we can analyse it.
- Cases: Waller’s study of Australian Google users | text analysis using Google books, studies by Michel et al and Heuser and Le-Khac
- Twitter as social network or news/broadcast medium? Example: Kwak’s study of 1.47 billion social relations.
- Is retweeting a social relationship? Claims based on big data network graphs, eg w influences x,y & z because w retweets them. Does an RT mean you’ve influenced someone?
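The retweet question can be made concrete with a small sketch. The records below are made up, and the `(retweeter, original_author)` layout and the function name `retweet_counts` are assumptions for illustration, not the format of any real Twitter dataset or of the studies cited above. The point is that a retweet graph only records who retweeted whom; reading those counts as "influence" is an interpretation added by the analyst.

```python
from collections import Counter

# Hypothetical retweet records: (retweeter, original_author).
retweets = [
    ("w", "x"),
    ("w", "y"),
    ("w", "z"),
    ("x", "y"),
    ("z", "y"),
]


def retweet_counts(edges):
    """Count, per account, retweets received and retweets made."""
    received = Counter(author for _, author in edges)
    made = Counter(retweeter for retweeter, _ in edges)
    return received, made


received, made = retweet_counts(retweets)
print("Retweets received:", dict(received))
print("Retweets made:    ", dict(made))
```

Here `y` is the most retweeted account and `w` the most active retweeter. Whether either count measures influence, agreement, or just visibility cannot be decided from the edge list alone.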