Recently a question was posed on Quora: “What are some good “toy problems” in data science?” The asker is studying machine learning and statistics and is looking for some simple problems to play with on real-world data, e.g. something as socially relevant as OkCupid but built on publicly available datasets/APIs. Ideally the challenge is a specific question to be answered; pointers to interesting data sources are also welcome. Here is the long list of answers he/she received:

From: Neil Kodner

  • The Twitter Streaming API will enable you to capture a lot of data in a relatively short period of time. Use the statuses/sample method to capture the (limited) public feed. You can also use the statuses/track method, which retrieves tweets that mention a given keyword or list of keywords.
    From there you can perform a nearly unlimited number of analyses. For inspiration, here is a small sample of the many experiments I’ve run:
  • An analysis of tens of thousands of Canabalt scores posted to Twitter, including comparisons of method of death (wall, fell, squashed) vs. device type (iPod touch, iPhone, iPad)
  • At what times do people most frequently tweet about Seinfeld
  • What percentage of tweets contain URLs
  • How people spell Goal during the World Cup (e.g. goooooooal vs goooooalllllll vs gooooooaaaaallll). For example, I found 1158 mentions of GOOOOOOOOL and 2981 mentions of ggol! I also found a lot of instances where people used their entire 140 characters to celebrate goals, most often a G followed by 138 O’s and then an L. Sometimes 137 O’s and an AL.
  • A market basket analysis of hashtags used in conjunction with other hashtags
  • Using classification and NLP to tell if tweets mentioning #homebrew are talking about beer-brewing or homebrew software
  • Learning about graph theory and community detection using friends/followers lists.
  • Other people, far smarter than myself, have written their own programs to detect what’s currently trending.
  • With Twitter data, the possibilities are truly endless. All you need to get started is some curiosity. The rest will fall into place. Although these seem like ‘toy projects’, they’ve enabled me to learn a great deal in the process. I’ve also been able to leverage what I’ve learned into my own
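Two of the experiments above (the share of tweets containing URLs and the many spellings of "goal") need nothing more than regular expressions once the tweets are collected. The following is a minimal sketch, assuming the tweet texts are already available as a list of plain strings; it does not call the Twitter API, and the function names and regular expressions are illustrative choices rather than anything from the original answer.

```python
import re
from collections import Counter

# Assumption: a URL is anything starting with http:// or https://
URL_RE = re.compile(r"https?://\S+")
# Assumption: a "goal" spelling is g's, then o's, optional a's, then l's
GOAL_RE = re.compile(r"\bg+o+a*l+\b", re.IGNORECASE)


def url_share(tweets):
    """Return the percentage of tweets that contain at least one URL."""
    if not tweets:
        return 0.0
    with_url = sum(1 for text in tweets if URL_RE.search(text))
    return 100.0 * with_url / len(tweets)


def goal_spellings(tweets):
    """Count each distinct (lowercased) spelling of goal/gol."""
    counts = Counter()
    for text in tweets:
        for match in GOAL_RE.findall(text):
            counts[match.lower()] += 1
    return counts


def collapse_repeats(word):
    """Collapse runs of a repeated character: 'goooal' -> 'goal'."""
    return re.sub(r"(.)\1+", r"\1", word.lower())


if __name__ == "__main__":
    # Hand-written sample input, not real captured tweets
    sample = [
        "GOOOOOOOOL!!! http://t.co/x",
        "what a goooooal",
        "playing golf today",
    ]
    print(url_share(sample))
    print(goal_spellings(sample))
```

Grouping the raw spellings with `collapse_repeats` gives a quick way to compare, say, "gol" variants against "goal" variants.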

From: Drew Conway, PhD student in Politics at NYU

  • I find that people learn new methods best when they have a specific task/question they are interested in exploring, and are then forced to use the appropriate tools to solve it. As such, I think the idea of having some “toy problem” to focus on is a good idea, but as others have suggested, I also think that problem has to come from you.
  • That said, there are lots of interesting data sets out there that fall into the “socially relevant” category, which could inspire some toy exploration. Some ideas…
  • World Bank – http://data.worldbank.org/ There are too many data sets here to count, but given the mission of the WB, most of them are focused on growth and development. A small project that did some basic time-series or correlation analysis of this data could be interesting. Comparing post-earthquake metrics for the relief efforts in Haiti vs. Pakistan might be a cool place to start.
  • U.S. Census – http://www.census.gov/main/www/a… Infochimps also has a great set of APIs focused on census data (http://api.infochimps.com/), and if you are an R hacker you could use my wrapper to access it (http://cran.r-project.org/web/pa…). Census data is great for doing spatial analysis, e.g., compare the average level of education to mean household income for all US ZIP codes and stick it on a map.
  • ICPSR – http://www.icpsr.umich.edu/icpsr… The Inter-university Consortium for Political and Social Research is a treasure trove of socially relevant data, and includes current and past waves of the American National Election Study. This would be a good place to consider doing a mash-up, perhaps voting patterns in a given census tract controlling for income and education.
  • Yelp – http://www.yelp.com/developers/d… People love to eat and be entertained, and the Yelp API has a decent set of tools for extracting these preferences. Recently, I have tried to play around with this API as part of a project involving health code violation data from NYC (http://www.nyc.gov/html/datamine…) and found it to be a bit unruly to work with. But if you had a smaller project in mind, it certainly fits your description.
  • Local data – speaking of NYC Data Mine, some of the most useful toy data apps I have seen involve local open data. Check to see if your city, or one nearby, maintains an open data repository and start hacking. Hint: people love to know where buses and taxis are.

From: Josh Wills, Ex-Statistician, Data Scientist at Cl…

  • I recommend the Enron email dataset, available from CMU here: http://www.cs.cmu.edu/~enron/.
    It is a collection of emails between Enron employees, mostly senior executives, and has been used for experimenting with graph clustering algorithms, text classification, and social network analysis. I’m not sure that it’s fair to call it a toy data set, but it is very manageable on a single computer, as it only has about half a million messages between 150 people. A Google Scholar search [1] can help you find papers related to the dataset on just about any social network analysis or machine learning topic you can imagine; I’d say start with the descriptive report of the dataset [2] and go from there.
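A first step toward the social network analysis mentioned above is turning messages into a weighted sender-to-recipient edge list. The following is a minimal sketch, assuming each message is available as raw text with standard From/To/Cc headers; the `count_edges` helper and the example addresses are made up for illustration and are not part of the dataset or its documentation.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def count_edges(raw_messages):
    """Count how many messages each sender sent to each recipient.

    Returns a Counter keyed by (sender_address, recipient_address).
    """
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # getaddresses returns (display_name, address) pairs
        senders = getaddresses(msg.get_all("From", []))
        recipients = getaddresses(
            msg.get_all("To", []) + msg.get_all("Cc", [])
        )
        for _, sender in senders:
            for _, recipient in recipients:
                if sender and recipient:
                    edges[(sender.lower(), recipient.lower())] += 1
    return edges


if __name__ == "__main__":
    # Invented message for illustration only
    raw = (
        "From: alice@example.com\n"
        "To: bob@example.com, Carol <carol@example.com>\n"
        "Subject: hello\n"
        "\n"
        "Body text.\n"
    )
    for (sender, recipient), n in count_edges([raw]).items():
        print(sender, "->", recipient, n)
```

The resulting edge counts can be fed into whichever graph library you prefer for clustering or centrality measures.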

From: Anonymous User

From: Sameer Gupta

  • You can try kaggle.com. Kaggle is a platform for data prediction competitions. Companies, organisations and researchers post their data and have it scrutinised by the world’s best data scientists.

From: Daniel Tunkelang, HCIR guy. Data Scientist.

From: Joseph Misiti, I specialize in machine-learning

From: Ludi Rehak, Data Miner

  • WikiLeaks Data – Visualize Afghan or Iraq events (enemy actions, explosive hazards, detainee operations, etc.) over time http://mirror.wikileaks.info/
  • Personal Weather Station Data – Organize, visualize, and model local weather data, particularly to compare data at nearby sites to show microclimate patterns. The basic source for the data is the network of personal weather stations (PWS) maintained by the Wunderground project http://www.wunderground.com/
  • Wikipedia Editing History – Visualize Wikipedia editing history data to spot trends in user activity or correlations to world events http://wikidashboard.appspot.com/

From: Kim Raymoure, ????????? (asks a lot of questions)

  • Prompted by Alex Kamil’s answer, I put together: Quollaboration: Toy Data Analysis for Linear B as an exercise in Data Analysis and Computational Linguistics.
  • Some additional Linear B data available for the grabbing and parsing:

    http://minoan.deaditerranean.com…

  • The KN Am series has two great examples of how identical data can be represented very differently from one source to another. Google Refine can be used to help collapse identical data that is expressed differently, as shown in this series (Chadwick & Ventris vs. Killen & Olivier). See also my answer to Google Refine: What are some cool things folks have done with Google Refine? for some guidance on what it can do. The KN Ap series is also a lot of fun, as it has a lot of repetitive language elements that would allow a data scientist to make some assertions about the percentages of different language elements in use in this series. The MUL, for instance, is an ideogram which means “woman”. KO-WO and KO-WA are sign groups (words) which mean, respectively, “boy” and “girl”. Symbols within sign groups can also be analyzed for the weight with which they appear in a certain position of a word. For instance, even a non-programmatic glance at this data will reveal that JA tends to appear at the end of sign groups.
  • And, the Hagia Triada transliterations from Linear A:
    http://people.ku.edu/~jyounger/L…
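The positional analysis described above (how often a sign occurs at the start, middle, or end of a sign group) can be sketched in a few lines. This is a minimal illustration, assuming sign groups are written as hyphen-separated transliterations such as KO-WO; the function names are hypothetical, and the sample list in the usage section is hand-typed for illustration (only KO-WO and KO-WA come from the answer above), not drawn from the linked data.

```python
from collections import Counter, defaultdict


def position_counts(sign_groups):
    """Tally, for every sign, how often it is initial, medial, final,
    or the only sign in its sign group."""
    counts = defaultdict(Counter)
    for group in sign_groups:
        signs = group.upper().split("-")
        last = len(signs) - 1
        for i, sign in enumerate(signs):
            if last == 0:
                position = "only"
            elif i == 0:
                position = "initial"
            elif i == last:
                position = "final"
            else:
                position = "medial"
            counts[sign][position] += 1
    return counts


def final_share(counts, sign):
    """Fraction of a sign's occurrences that are in final position."""
    total = sum(counts[sign].values())
    return counts[sign]["final"] / total if total else 0.0


if __name__ == "__main__":
    # Illustrative input only; replace with parsed tablet data
    groups = ["KO-WO", "KO-WA", "I-JE-RE-JA", "KO-NO-SI-JA"]
    counts = position_counts(groups)
    print(dict(counts["JA"]))
    print(final_share(counts, "JA"))
```

Run over a full series, the same tally would let you check claims like "JA tends to appear at the end of sign groups" against actual frequencies.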

From: Paul Amerigo Pajo, APLIMAT prof for GDD at DLS-CSB

From: Ran Avnimelech, Thesis and work in machine-learning

  • You can check tunedit.org/data-competitions in the student-challenge category.
  • Maybe infochimps.com/datasets has some simple ones too. You can also check if there are datasets shared from grad classes on data mining.

From: Pattabhi Nanduri

  • How about predictive + sentiment analysis on movies from IMDb?

From: Lucian Sasu, Machine learning, data mining

  • See the datasets from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/. There are a lot of problems for classification, regression, and clustering, most of which can be run on an average computer.

Thanks to all of you who responded on Quora!

About The Author

Joshua Burkhow

Joshua is working to become a Data Scientist with a focus on Analytics, Big Data, Machine Learning, and Statistics. His passion for Data and Information is second to none. He is a certified IBM Cognos Expert with more than 10 years of experience in Business Intelligence & Data Warehousing, Analytics, IT Management, Software Engineering and Supply Chain Performance Management with Fortune 500 companies. He has specializations in Analytics, Mobile Reporting, Performance Management, and Business Analysis.
