Here’s a selection of useful lists and data sets I’ve found around the web. Feel free to email me cool stuff to include here and I’ll consider it.
Google Books Ngrams raw data set. Words and phrases paired with the number of times they appear in books scanned by Google. Handy for word-frequency lists, spell checkers, language models, and plenty more.
Majestic Million. CSV of the internet’s top one million domains according to Majestic’s own crawler. Example data:
GlobalRank,TldRank,Domain,TLD,RefSubNets,RefIPs,IDN_Domain,IDN_TLD,PrevGlobalRank,PrevTldRank,PrevRefSubNets,PrevRefIPs
1,1,google.com,com,409161,2616588,google.com,com,1,1,409290,2619442
2,2,facebook.com,com,401995,2766766,facebook.com,com,2,2,402090,2769586
3,3,youtube.com,com,367512,2230880,youtube.com,com,3,3,367579,2233774
4,4,twitter.com,com,362833,2328402,twitter.com,com,4,4,362893,2331046
5,5,microsoft.com,com,264085,843077,microsoft.com,com,5,5,264168,843481
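If you want to poke at the Majestic file from a script, here is a minimal Python sketch. It assumes the header names shown in the example data; `SAMPLE` and the helper name `top_domains` are made up for illustration, and in practice you would pass an open file handle for the CSV you downloaded.

```python
import csv
import io
from itertools import islice

# A few rows copied from the example data above, standing in for the real file.
SAMPLE = """GlobalRank,TldRank,Domain,TLD,RefSubNets,RefIPs,IDN_Domain,IDN_TLD,PrevGlobalRank,PrevTldRank,PrevRefSubNets,PrevRefIPs
1,1,google.com,com,409161,2616588,google.com,com,1,1,409290,2619442
2,2,facebook.com,com,401995,2766766,facebook.com,com,2,2,402090,2769586
3,3,youtube.com,com,367512,2230880,youtube.com,com,3,3,367579,2233774
"""


def top_domains(csv_file, limit=3):
    """Return (global rank, domain, referring subnets) for the first `limit` rows."""
    reader = csv.DictReader(csv_file)  # uses the header row for field names
    return [
        (int(row["GlobalRank"]), row["Domain"], int(row["RefSubNets"]))
        for row in islice(reader, limit)  # stop early instead of reading a million rows
    ]


if __name__ == "__main__":
    for rank, domain, subnets in top_domains(io.StringIO(SAMPLE), limit=2):
        print(rank, domain, subnets)
```

Reading lazily with `islice` means you can grab the top few thousand domains without loading the whole file into memory.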
Hu and Liu’s opinion lexicon, a list of about 6,800 positive and negative sentiment words.
SentiWordNet, Princeton WordNet data marked with (positivity, negativity, objectivity) sentiment scores.
General Inquirer, lists of words categorized by association with various ideas, emotions, and topics.
AOL Search data leak from 2006. 20 million searches in 10 CSVs. Example data:
AnonID	Query	QueryTime	ItemRank	ClickURL
214	jeopardy	2006-03-01 19:22:35	1	http://www.sonypictures.com
214	food network	2006-03-09 16:10:41
214	food network	2006-03-09 16:18:49	3	http://www.foodtv.ca
214	www.foodnetwork.con	2006-03-10 12:52:42
214	free find this person	2006-03-13 18:30:05
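A rough Python sketch for reading the AOL rows follows. It assumes the fields are tab-separated (queries contain spaces, so splitting on whitespace would not work) and that the five column names match the example data. `SAMPLE` and `parse_searches` are hypothetical names used only for this illustration. Rows where the user did not click a result have no ItemRank or ClickURL, so those come back as `None`.

```python
import csv
import io
from datetime import datetime

# A few rows from the example data above, written with explicit tabs.
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "214\tjeopardy\t2006-03-01 19:22:35\t1\thttp://www.sonypictures.com\n"
    "214\tfood network\t2006-03-09 16:10:41\n"
    "214\tfood network\t2006-03-09 16:18:49\t3\thttp://www.foodtv.ca\n"
)


def parse_searches(tsv_file):
    """Yield one dict per search, with typed fields and None for missing clicks."""
    # QUOTE_NONE: treat quote characters inside queries as ordinary text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        rank = row.get("ItemRank")  # None or "" when there was no click
        yield {
            "anon_id": int(row["AnonID"]),
            "query": row["Query"],
            "time": datetime.strptime(row["QueryTime"], "%Y-%m-%d %H:%M:%S"),
            "rank": int(rank) if rank else None,
            "click_url": row.get("ClickURL") or None,
        }


if __name__ == "__main__":
    for search in parse_searches(io.StringIO(SAMPLE)):
        print(search["time"], search["query"], search["click_url"])
```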