Recently a question was posed on Quora: “What are some good ‘toy problems’ in data science?” It came from someone who is studying machine learning and statistics and is looking for some simple problems to play with in the real world, e.g. something as socially relevant as OkCupid but based on publicly available datasets/APIs. Ideally the challenge is a specific question to be answered; pointers to interesting data sources are also welcome. Here is the long list of answers he or she received:
From: Neil Kodner
- The Twitter Streaming API will enable you to capture a lot of data in a relatively short period of time. Use the statuses/sample method to capture the (limited) public feed. You can also use the statuses/track method, which retrieves tweets that mention a given keyword or list of keywords.
From there you can perform nearly an unlimited number of analyses. For inspiration, here is a small sample of the many experiments I’ve run:
- An analysis of tens of thousands of Canabalt scores posted to Twitter, including comparisons of method of death (wall, fell, squashed) vs. device type (iPod touch, iPhone, iPad)
- At what times people most frequently tweet about Seinfeld
- What percentage of tweets contain URLs
- How people spell “goal” during the World Cup (e.g. goooooooal vs. goooooalllllll vs. gooooooaaaaallll). For example, I found 1158 mentions of GOOOOOOOOL and 2981 mentions of ggol! I also found a lot of instances where people used their entire 140 characters to celebrate goals, most often a g followed by 138 Os and then an L. Sometimes 137 Os and an AL.
- A market basket analysis of hashtags used in conjunction with other hashtags
- Using classification and NLP to tell if tweets mentioning #homebrew are talking about beer-brewing or homebrew software
- Learning about graph theory and community detection using friends/followers lists.
- Other people, far smarter than myself, have written their own programs to detect what’s currently trending.
- With twitter data, the possibilities are truly endless. All you need to get started is some curiosity. The rest will fall into place. Although these seem like ‘toy projects’, they’ve enabled me to learn a great deal in the process. I’ve also been able to leverage what I’ve learned into my own
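Two of the experiments above (the share of tweets containing URLs and the many spellings of "goal") are easy to try once you have tweet text in hand. The following is only a rough sketch, assuming the tweets have already been collected into a list of strings; the regular expressions, the function names, and the sample tweets are illustrative assumptions, not the code used for the original experiments.

```python
import re
from collections import Counter

# Assumed pattern: one or more g, o, optional a, one or more l ("gol", "ggol", "goooaaal").
GOAL_RE = re.compile(r"\bg+o+a*l+\b", re.IGNORECASE)
# Deliberately crude URL detector; real tweets would need something more robust.
URL_RE = re.compile(r"https?://\S+")


def goal_spellings(tweets):
    """Count each distinct lower-cased spelling of 'goal' across the tweets."""
    counts = Counter()
    for text in tweets:
        for match in GOAL_RE.findall(text):
            counts[match.lower()] += 1
    return counts


def url_share(tweets):
    """Return the percentage of tweets that contain at least one URL."""
    if not tweets:
        return 0.0
    with_url = sum(1 for text in tweets if URL_RE.search(text))
    return 100.0 * with_url / len(tweets)


if __name__ == "__main__":
    # Made-up sample tweets, for illustration only.
    sample = [
        "G" + "O" * 8 + "L!!!",
        "ggol! what a match",
        "goooaaal http://example.com/replay",
        "Watching Seinfeld reruns",
    ]
    print(goal_spellings(sample).most_common())
    print(url_share(sample))
```

On real data the interesting part is sorting the resulting counter and looking at the long tail of spellings.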
From: Drew Conway, PhD student in Politics at NYU
- I find that people learn new methods best when they have a specific task/question they are interested in exploring, and are then forced to use the appropriate tools to solve it. As such, I think the idea of having some “toy problem” to focus on is a good idea, but as others have suggested, I also think that problem has to come from you.
- That said, there are lots of interesting data sets out there that fall into the “socially relevant” category, which could inspire some toy exploration. Some ideas…
- World Bank – http://data.worldbank.org/ there are too many data sets here to count, but given the mission of the WB most of them are focused on growth and development. A small project that did some basic time-series or correlation analysis of this data could be interesting. Comparing post-earthquake metrics for the relief efforts in Haiti vs. Pakistan might be a cool place to start.
- U.S. Census – http://www.census.gov/main/www/a… also Infochimps has a great set of APIs focused on census data (http://api.infochimps.com/), and if you are an R hacker you could use my wrapper to access it (http://cran.r-project.org/web/pa…). Census data is great for doing spatial analysis, e.g., compare the average level of education to mean household income for all US zipcodes and stick it on a map.
- ICPSR – http://www.icpsr.umich.edu/icpsr… the Inter-university Consortium for Political and Social Research is a treasure trove of socially relevant data, and includes current and past waves of the American National Election Study. This would be a good place to consider doing a mash-up, perhaps voting patterns in a given census tract controlling for income and education.
- Yelp – http://www.yelp.com/developers/d… people love to eat and be entertained, and the Yelp API has a decent set of tools for extracting these preferences. Recently, I have tried to play around with this API as part of a project involving health code violation data from NYC (http://www.nyc.gov/html/datamine…) and found it to be a bit unruly to work with. But if you had a smaller project in mind, it certainly fits your description.
- Local data – speaking of NYC Data Mine, some of the most useful toy data apps I have seen involve local open data. Check to see if your city, or one nearby, maintains an open data repository and start hacking. Hint: people love to know where buses and taxis are.
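For the "basic correlation" idea mentioned above (for example, education level vs. household income by ZIP code), a first pass needs nothing more than a Pearson correlation coefficient. This is a minimal sketch in plain Python; the function name and the numbers in the example are placeholders made up for illustration, not real Census or World Bank figures.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values only: one entry per (hypothetical) ZIP code.
    years_of_education = [11.5, 12.0, 13.2, 14.8, 16.1]
    household_income = [38.0, 41.5, 52.0, 60.3, 75.9]
    print(pearson(years_of_education, household_income))
```

A coefficient close to 1 or -1 suggests a strong linear relationship, but it says nothing about causation, which is exactly the kind of question a toy project can then dig into.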
From: Josh Wills, Ex-Statistician, Data Scientist at Cl…
- I recommend the Enron email dataset, available from CMU here: http://www.cs.cmu.edu/~enron/.
It is a collection of emails between Enron employees, mostly senior executives, and has been used for experimenting with graph clustering algorithms, text classification, and social network analysis. I’m not sure that it’s fair to call it a toy data set, but it is very manageable on a single computer, as it only has about half a million messages among 150 people. A Google Scholar search can help you find papers related to the dataset on just about any social network analysis or machine learning topic you can imagine. I’d say start with the descriptive report of the dataset and go from there.
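As a starting point for the social network analysis mentioned above, you can turn the messages into a weighted "who emails whom" edge list. The sketch below uses only Python's standard library and assumes each message is available as raw text with ordinary From/To/Cc headers; the function name and the example addresses are placeholders, not taken from the dataset.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def sender_recipient_counts(raw_messages):
    """Count messages per (sender, recipient) pair, using the From, To and Cc headers."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = getaddresses(msg.get_all("From", []))
        recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        for _, sender in senders:
            for _, recipient in recipients:
                if sender and recipient:
                    edges[(sender.lower(), recipient.lower())] += 1
    return edges


if __name__ == "__main__":
    # Two made-up messages with placeholder addresses.
    sample = [
        "From: alice@example.com\nTo: bob@example.com, carol@example.com\n"
        "Subject: hi\n\nbody",
        "From: alice@example.com\nTo: bob@example.com\nSubject: again\n\nbody",
    ]
    for (sender, recipient), n in sender_recipient_counts(sample).most_common():
        print(sender, "->", recipient, n)
```

The resulting counter is a weighted, directed edge list that can be fed into whatever graph clustering or community detection tool you prefer.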
From: Anonymous User
- Here is a good toy problem: organize Twitter users into groups based on the similarity of their tweets. To get started you can use simple metrics such as the number of words in the tweet, average word length, standard deviation of word length, etc. Use a simple classifier/clustering algorithm of your choice (e.g. see the chapter on Naive Bayes text classification here: http://nlp.stanford.edu/IR-book/)
- You can use Twitter Streaming API as suggested by Neil Kodner to extract users’ status updates and Enron email classification methods suggested by Josh Wills. Run this on at least 1GB worth of tweets (you can extract it in less than a day unless you’re using a dial-up connection), see if your algorithm scales well. Extract more features with standard NLP methods (see How does one determine similarity between people online?) and try to improve your classifier performance. It would be interesting to see how your groupings compare to Twitter’s ‘Similar Users’ suggestions or TunkRank.
- Update from Data 2.0 Conference: You can have full Firehose access now (10,000 keyword filters for 30 cents/hr): http://www.readwriteweb.com/arch…
- Find similar users on Delicious (website) as suggested by Andreas Stuhlmüller: http://www.aiplayground.org/arti…
- Explore Data: Where can I get large datasets open to the public? and APIs: What data APIs or sources should be in my O’Reilly guide? , http://www.reddit.com/r/datasets/
- FAQ extraction from mailing lists, see http://mail-archives.apache.org/…
- Find similar Quora Users by Interests & Segments: see Quora Usage & Statistics: What interesting statistics could be computed from user statistics on Quora?
- Run some stats on Facebook or Google Profiles. See Pete Warden’s and Paul Butler’s exercises: http://petewarden.typepad.com/se… , http://petewarden.typepad.com/se… , http://paulbutler.org/archives/v…
- Coupons: http://paulbutler.org/archives/g…
- Machine Learning: What are some good learning projects to teach oneself about machine learning?
- Kinect: Are there any cool hacks for Kinect?
- A better spelling corrector: http://norvig.com/spell-correct….
- Linear A: See Kim Raymoure’s answer: What are some computational methods used in Linear A decipherment?
- Linear B: Quollaboration: Toy Data Analysis for Linear B
- A murder mystery: http://www.networkworld.com/comm…
- Michael E Driscoll’s answer to What are some good summer programs for PhD students interested in data science?
- Object tracking: http://info.ee.surrey.ac.uk/Pers…
- List the directors that have directed at least 20 movies and acted in all of them, using IMDb data: http://www.imdb.com/interfaces , http://imdbpy.sourceforge.net/
- Mashups: http://www.housingmaps.com/ , APIs: What data APIs or sources should be in my O’Reilly guide?
- Machine Learning: What are some good class projects for machine learning using MapReduce?
- Videolectures.net recommendations: http://www.r-bloggers.com/videol…
- Materials identification: http://tunedit.org/challenge/mat…
- http://www.executablepapers.com/ Also What kind of collaboration tools would reduce duplication of R&D effort in data analysis and sharing?
- Data mining competitions: http://www.kaggle.com/ and http://www.kdnuggets.com/dataset…
- IEEE Vast: http://hcil.cs.umd.edu/localphp/…
- The Mendeley API: http://dev.mendeley.com/ , http://dev.mendeley.com/datachal…
- HIV Progression: http://www.kaggle.com/c/hivprogr…
- Data.gov apps: What are the best apps built on top of open gov data?
- HN search API: http://news.ycombinator.com/item…
- Optimizing FX Trading Strategies: http://gociop.de/gecco-2011-indu…
- Yahoo KDD cup: http://kddcup.yahoo.com
- Analysis of Financial Data with Perl: http://perlmonks.org/index.pl?no…
- Wide Finder: http://www.tbray.org/ongoing/Whe…
- Internet Search: http://himmele.blogspot.com/2011…
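The simple tweet metrics suggested at the start of this answer (number of words, average word length, standard deviation of word length) can be computed in a few lines before any classifier or clustering algorithm gets involved. This is a minimal sketch, assuming whitespace tokenization is good enough for a first pass; the function names are illustrative and the example text is made up.

```python
import statistics


def tweet_features(text):
    """Return (word count, mean word length, std. dev. of word length) for one tweet."""
    lengths = [len(word) for word in text.split()]
    if not lengths:
        return (0, 0.0, 0.0)
    # pstdev is the population standard deviation, so one-word tweets give 0.0.
    return (len(lengths), statistics.mean(lengths), statistics.pstdev(lengths))


def user_profile(tweets):
    """Average the per-tweet features over all of one user's tweets."""
    rows = [tweet_features(text) for text in tweets]
    if not rows:
        return (0.0, 0.0, 0.0)
    return tuple(statistics.mean(column) for column in zip(*rows))


if __name__ == "__main__":
    print(tweet_features("just setting up my toy problem"))
    print(user_profile(["short one", "a somewhat longer example tweet"]))
```

Each user then becomes a small numeric vector, which is the input format most clustering and classification routines expect.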
From: Sameer Gupta
- You can try kaggle.com
Kaggle is a platform for data prediction competitions. Companies, organisations and researchers post their data and have it scrutinised by the world’s best data scientists.
From: Daniel Tunkelang, HCIR guy. Data Scientist.
- How about playing with the Google Books Ngram data?
- Via Alex Kamil.
From: Joseph Misiti, I specialize in machine-learning
- There are endless amounts of interesting projects you could do with this data:
- Also worth checking out are these data sets:
- Update: The government has recently cut funding to data.gov, so this resource will not be around for much longer.
From: Ludi Rehak, Data Miner
- WikiLeaks Data – Visualize Afghan or Iraq events (enemy actions, explosive hazards, detainee operations, etc.) over time http://mirror.wikileaks.info/
- Personal Weather Station Data – Organize, visualize, and model local weather data, particularly to compare data at nearby sites to show microclimate patterns. The basic source for the data is the network of personal weather stations (PWS) maintained by the Wunderground project http://www.wunderground.com/
- Wikipedia Editing History – Visualize Wikipedia editing history data to spot trends in user activity or correlations to world events http://wikidashboard.appspot.com/
From: Kim Raymoure (asks a lot of questions)
- Prompted by Alex Kamil’s answer, I put together: Quollaboration: Toy Data Analysis for Linear B as an exercise in Data Analysis and Computational Linguistics.
- Some additional Linear B data available for the grabbing and parsing:
- The KN Am series has two great examples of how identical data can be represented very differently from one source to another. Google Refine can be used to help collapse identical data that is expressed differently, as shown in this series (Chadwick & Ventris vs. Killen & Olivier). See also my answer to Google Refine: What are some cool things folks have done with Google Refine? for some guidance on what it can do. The KN Ap series is also a lot of fun, as it has a lot of repetitive language elements that would allow a data scientist to make some assertions about the percentages of different language elements in use in this series. The MUL, for instance, is an ideogram which means “woman”. KO-WO and KO-WA are sign groups (words) which mean, respectively, “boy” and “girl”. Symbols within sign groups can also be analyzed for the weight with which they appear in a certain position of a word. For instance, even a non-programmatic glance at this data will reveal that JA tends to appear at the end of sign groups.
- And, the Hagia Triada transliterations from Linear A:
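Returning to the positional analysis of Linear B sign groups described above (for instance, the observation that JA tends to appear at the end of sign groups), a first programmatic pass might look like the sketch below. It assumes sign groups are transliterated as hyphen-separated syllables, as in KO-WO and KO-WA; the function name is illustrative, and the two-word input is just the pair of examples quoted in the answer, not a real tablet series.

```python
from collections import Counter


def final_position_share(sign_groups):
    """For each sign, return the fraction of its occurrences that end a sign group."""
    total = Counter()
    final = Counter()
    for group in sign_groups:
        signs = group.split("-")
        total.update(signs)
        final[signs[-1]] += 1
    return {sign: final[sign] / total[sign] for sign in total}


if __name__ == "__main__":
    # Only the two sign groups quoted above; a real run would use a full series.
    shares = final_position_share(["KO-WO", "KO-WA"])
    for sign, share in sorted(shares.items()):
        print(sign, share)
```

Run over a whole series, a table like this would make it easy to check which signs lean toward the start or the end of words.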
From: Paul Amerigo Pajo, APLIMAT prof for GDD at DLS-CSB
- You can start with these 10 datasets
From: Ran Avnimelech, Thesis and work in machine-learning
- You can check tunedit.org/data-competitions in the student-challenge category.
- Maybe infochimps.com/datasets has some simple ones too. You can also check if there are datasets shared from grad classes on data mining.
From: Pattabhi Nanduri
- How about predictive and sentiment analysis on movies from IMDb?
From: Lucian Sasu, Machine learning, data mining
- See the datasets from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/. There are a lot of problems for classification, regression, and clustering, most of which can be run on an average computer.
Thanks to all of you who responded on Quora!
Joshua is working to become a Data Scientist with a focus on Analytics, Big Data, Machine Learning, and Statistics. His passion for Data and Information is second to none. He is a certified IBM Cognos Expert with more than 10 years of experience in Business Intelligence & Data Warehousing, Analytics, IT Management, Software Engineering and Supply Chain Performance Management with Fortune 500 companies. He has specializations in Analytics, Mobile Reporting, Performance Management, and Business Analysis.