
This site provides access to a corpus of over 900 text samples gathered from test subjects at Loyola College, Baltimore, Maryland, in 2006 and 2007. Twenty-one subjects provide a completely correlated corpus, in which each subject gave an opinion on each of six predetermined topics in each of six genres: blog, chat, discussion, email, essay, and interview.
We hope this corpus will be useful to researchers in the fields of natural language processing and computational linguistics.
For information on the details of experimental design, data collection, and analysis already undertaken, refer to the papers below.
To access the corpus, choose “login” and then register for its use. You will receive a password that lets you download a zipped file containing all text samples, plus information on filename encoding. For additional information or assistance, contact Roberta E. Sabin, Computer Science Department, Loyola College, res@loyola.edu.
Goldstein-Stewart, J. et al., “Creating and Using a Correlated Corpus to Glean Communicative Commonalities,” Proceedings of the Language Resources and Evaluation Conference (LREC), Marrakesh, Morocco, 2008.
Goldstein-Stewart, J. et al., “Person Identification from Text and Speech Genre Samples,” Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Athens, Greece, March, 2009.
Sabin, R. E. et al., “Gender Differences across Correlated Corpora: Preliminary Results,” Proceedings of the Florida Artificial Intelligence Research Society (FLAIRS), Cocoa Beach, FL, May, 2008.