Corpora

I would love to be able to share my Twitter corpora with you; however, Twitter won't let me! See why here. Feel free to complain to Twitter that researchers are getting caught in a net presumably meant for commercial developers!

Some of the corpora I use in my work include:

HERMES - A 100 million word randomised corpus of tweets originally collected in 2009. I compiled a new version of this corpus in 2013. This corpus is used in Discourse of Twitter and Social Media.

Obama Win Corpus (OWC) - A corpus of 45,000 tweets containing the lexical item 'Obama', collected over the 24 hours after the announcement of Barack Obama’s victory in the 2008 US presidential election. This corpus is used in Ambient affiliation: A linguistic perspective on Twitter.

MORPHEUS - A 100 million word corpus of tweets about sleep!

LUCIA - The entire Twitter stream of a single user who writes about her experiences of motherhood.

Other corpora


Tweets2011 corpus - Distributed as tweet IDs only; you need to reconstruct the corpus yourself once you have the IDs.


FSD corpus of tweets - This page includes code that enables you to download the FSD corpus of tweets.

Twitter Stratified Random Sample (SRS) - A time-stratified, random sample of tweets. The compilers sample at 10-minute intervals to build "a set of month-based corpora, each containing at least one million English tweets".
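
As a rough illustration of what time-stratified sampling means, here is a minimal Python sketch. It is not the SRS compilers' code and makes no claim about their actual pipeline. The function name `stratified_sample`, the `(timestamp, text)` tuple format and the `per_window` parameter are all assumptions invented for this example. The idea is simply to group tweets into fixed 10-minute windows and draw a random sample from each window, so that busy periods do not swamp quiet ones.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def stratified_sample(tweets, per_window, window_minutes=10, seed=0):
    """Draw up to `per_window` random tweets from each time window.

    `tweets` is an iterable of (timestamp, text) pairs, where each
    timestamp is a timezone-aware datetime (a hypothetical format
    chosen for this sketch).
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    window_seconds = window_minutes * 60

    # Group tweets by the index of the window their timestamp falls in.
    buckets = defaultdict(list)
    for ts, text in tweets:
        key = int(ts.timestamp()) // window_seconds
        buckets[key].append((ts, text))

    # Sample each window independently, in chronological order.
    sample = []
    for key in sorted(buckets):
        bucket = buckets[key]
        k = min(per_window, len(bucket))  # sparse windows give what they have
        sample.extend(rng.sample(bucket, k))
    return sample


if __name__ == "__main__":
    # Toy data: one made-up "tweet" per minute for half an hour.
    start = datetime(2013, 1, 1, tzinfo=timezone.utc)
    toy = [(start + timedelta(minutes=i), f"tweet {i}") for i in range(30)]
    for ts, text in stratified_sample(toy, per_window=2):
        print(ts.isoformat(), text)
```

With the toy data, the 30 tweets fall into three 10-minute windows, and two are drawn from each. A real corpus would also need language filtering and a collection step, neither of which is sketched here.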


Comments:

  1. Hi, may I ask if HERMES is publicly available? Thank you :)

  2. Somehow I missed this comment - sorry. Unfortunately Twitter's Terms of Service doesn't allow me to share the corpus :(
