Here’s a selection of useful lists and data sets from around the web. Feel free to email us cool stuff to include.


Google Books Ngrams raw data set. Each word or phrase, paired with the number of times it has been encountered in books scanned by Google. Handy for anything that needs word frequencies, such as spell checking or language modeling.


Majestic Million. CSV of the web’s top one million domains, ranked by Majestic’s crawler according to how many subnets link to each one. Example data:

GlobalRank,TldRank,Domain,TLD,RefSubNets,RefIPs,IDN_Domain,IDN_TLD,PrevGlobalRank,PrevTldRank,PrevRefSubNets,PrevRefIPs
1,1,google.com,com,409161,2616588,google.com,com,1,1,409290,2619442
2,2,facebook.com,com,401995,2766766,facebook.com,com,2,2,402090,2769586
3,3,youtube.com,com,367512,2230880,youtube.com,com,3,3,367579,2233774
4,4,twitter.com,com,362833,2328402,twitter.com,com,4,4,362893,2331046
5,5,microsoft.com,com,264085,843077,microsoft.com,com,5,5,264168,843481
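
A minimal sketch of reading this CSV with Python's standard library, assuming the column names shown in the sample above. The helper name load_majestic is made up for illustration, and the rows are inlined here rather than read from the downloaded file.

```python
import csv
import io

# Rows copied from the sample above; a real run would open the downloaded CSV instead.
SAMPLE = """GlobalRank,TldRank,Domain,TLD,RefSubNets,RefIPs,IDN_Domain,IDN_TLD,PrevGlobalRank,PrevTldRank,PrevRefSubNets,PrevRefIPs
1,1,google.com,com,409161,2616588,google.com,com,1,1,409290,2619442
2,2,facebook.com,com,401995,2766766,facebook.com,com,2,2,402090,2769586
3,3,youtube.com,com,367512,2230880,youtube.com,com,3,3,367579,2233774
"""


def load_majestic(fileobj):
    """Yield (rank, domain, change in referring subnets) for each row.

    Hypothetical helper: the change is RefSubNets minus PrevRefSubNets.
    """
    for row in csv.DictReader(fileobj):
        delta = int(row["RefSubNets"]) - int(row["PrevRefSubNets"])
        yield int(row["GlobalRank"]), row["Domain"], delta


rows = list(load_majestic(io.StringIO(SAMPLE)))
for rank, domain, delta in rows:
    print(rank, domain, delta)
```

Because the function yields rows one at a time, the same loop should cope with the full million-line file without loading it all into memory.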

Sentiment analysis

Hu and Liu’s opinion lexicon, a list of about 6,800 positive and negative sentiment words.

SentiWordNet, Princeton WordNet data marked with (positivity, negativity, objectivity) sentiment scores.

General Inquirer, lists of words categorized by association with various ideas, emotions, and topics.
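
Word lists like these are often used for a naive "count the positive words, subtract the negative words" score. Here is a rough sketch of that idea. The two tiny word sets are hypothetical stand-ins, not entries copied from any of the lexicons above, and sentiment_score is an illustrative name.

```python
import re

# Hypothetical stand-ins for a real lexicon; swap in the word lists you downloaded.
POSITIVE = {"good", "great", "useful"}
NEGATIVE = {"bad", "broken", "useless"}


def sentiment_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return pos - neg


print(sentiment_score("Great list, really useful data."))
print(sentiment_score("The link is broken."))
```

This ignores negation ("not good") and word sense, so treat it as a baseline at best.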


AOL search data leak from 2006. 20 million searches in 10 tab-separated files. Rows for searches with no click leave ItemRank and ClickURL empty. Example data:

AnonID	Query	QueryTime	ItemRank	ClickURL
214	jeopardy	2006-03-01 19:22:35	1	http://www.sonypictures.com
214	food network	2006-03-09 16:10:41
214	food network	2006-03-09 16:18:49	3	http://www.foodtv.ca
214	www.foodnetwork.con	2006-03-10 12:52:42
214	free find this person	2006-03-13 18:30:05
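
A small sketch of parsing these rows in Python, assuming tab-separated fields and the header shown in the sample. The function name read_searches and the output field names are made up for illustration. Searches that did not lead to a click have no ItemRank or ClickURL, so the code treats a missing or empty value as "no click".

```python
import csv
import io
from collections import Counter
from datetime import datetime

# A few rows from the sample above, tab-separated; a real run would open one of the files.
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "214\tjeopardy\t2006-03-01 19:22:35\t1\thttp://www.sonypictures.com\n"
    "214\tfood network\t2006-03-09 16:10:41\n"
    "214\tfood network\t2006-03-09 16:18:49\t3\thttp://www.foodtv.ca\n"
)


def read_searches(fileobj):
    """Yield one dict per search; 'clicked' is None when no result was clicked."""
    reader = csv.DictReader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {
            "user": row["AnonID"],
            "query": row["Query"],
            "time": datetime.strptime(row["QueryTime"], "%Y-%m-%d %H:%M:%S"),
            # Short rows give None here, rows with trailing tabs give "".
            "clicked": row["ClickURL"] or None,
        }


searches = list(read_searches(io.StringIO(SAMPLE)))
query_counts = Counter(s["query"] for s in searches)
print(query_counts.most_common(1))
```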