I could easily study and write solely on the topic of Big Data. I could dive deep into every single Apache project and all the other software offerings, white papers, and technologies around big data, and I'd have a lot to write about. The challenge with this is that we are not robots; we can't know all things. Although I hate to admit it, we can't master every tool, technology, or language. Sorry folks, it ain't gonna happen. When I am in learning mode, I choose to see what floats to the top first, simply going after the tools with the most bang for the buck: those that have proven their value among as many users as possible. So now, when people come to me and ask the simple (but rabbit-hole-ish) question, "I want to work with Big Data, what do I need to know?"


Here are my top 5 that, in my humble opinion, satisfy the 80/20 rule for Big Data:


I honestly think I would be laughed out of the big data stratosphere if I did not include Hadoop, let alone put it at #1. It is the base of Big Data. I would love to do an analysis of all titles that contain "Big Data"; I hypothesize that "Hadoop" would rank very highly in those titles as well. Especially among those looking at big data from the outside in, Hadoop is simply synonymous with Big Data.

Where to learn Hadoop:


When Hadoop first came out and developers realized the only way to get data out of it was by programming MapReduce jobs in Java, they soon realized they had better come up with something far more usable by the great majority of data analysts. The de facto data language was SQL, so Hive was built to let people with SQL skills adapt quickly (using HiveQL) and run SQL-like queries against Hadoop. Hive takes these queries and translates them into MapReduce jobs. Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 file system. To accelerate queries, it provides indexes, including bitmap indexes (Wikipedia).
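To make the idea concrete, here is a minimal sketch of the kind of aggregation a HiveQL GROUP BY expresses. The table and column names are invented for illustration; Hive would compile the query into MapReduce jobs, but the same logic in plain Python looks like this:

```python
from collections import defaultdict

# Hypothetical HiveQL (table and column names made up for illustration):
#   SELECT country, COUNT(*) FROM visits GROUP BY country;
# The equivalent aggregation, written directly in Python:

def group_count(rows, key):
    """Count rows per value of `key`, like COUNT(*) ... GROUP BY key."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)

visits = [
    {"country": "US"}, {"country": "DE"}, {"country": "US"},
]
print(group_count(visits, "country"))  # {'US': 2, 'DE': 1}
```

The point is that HiveQL lets an analyst express this familiar GROUP BY pattern without ever touching the MapReduce plumbing underneath.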

Hive Apachecon 2008 (Slideshare)

Where to learn Hive:


You have a place to store and retrieve data. You have a tool to extract it using SQL-like code. Now let me ask you a simple, very basic question: uhh, how do you get the data there in the first place? For that we have Pig, and Pig is a great tool. Pig was designed for performing a long series of data operations, making it ideal for three categories of Big Data operations: standard extract-transform-load (ETL) data pipelines, research on raw data, and iterative processing of data. Pig runs on Hadoop and makes use of MapReduce and the Hadoop Distributed File System (HDFS). The language for the platform is called Pig Latin, which abstracts the Java MapReduce idiom into a form similar to SQL. Pig Latin is a data flow language, whereas SQL is a declarative language (Hortonworks).
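To show what "a long series of data operations" means, here is a hypothetical Pig Latin pipeline (field and file names are invented), with the same flow of operations sketched in plain Python underneath:

```python
from collections import defaultdict

# Hypothetical Pig Latin ETL pipeline (names invented for illustration):
#   raw    = LOAD 'logs.tsv' AS (user, bytes:int);
#   big    = FILTER raw BY bytes > 100;
#   grp    = GROUP big BY user;
#   totals = FOREACH grp GENERATE group, SUM(big.bytes);
# The same flow in Python: filter, then group-and-sum.

raw = [("alice", 50), ("bob", 200), ("alice", 300), ("bob", 150)]

totals = defaultdict(int)
for user, nbytes in raw:
    if nbytes > 100:            # FILTER raw BY bytes > 100
        totals[user] += nbytes  # GROUP BY user + SUM(bytes), in one pass

print(dict(totals))  # {'bob': 350, 'alice': 300}
```

In Pig, each named step is a relation flowing into the next, which is exactly what makes it a data flow language rather than a declarative one.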

Where to learn Pig:

Java / Python

The difference between the first three and the next two tools is that the first three are software that has been built and is ready to use to process big data. The next two (really three), Java/Python/R, are programming languages. The easy way to view Java and Python in the context of big data is as tools you use to extend and simplify the work you are doing. For example, knowing Java will not only help you understand the entire MapReduce paradigm, it will also help you understand many Apache projects, as many of them are written in Java. It will let you extend Pig and Hive by writing User Defined Functions (UDFs). Yeah, plenty of programmers, when asked to compare languages and Java gets brought up, will be the first to say, "Oh, Java is sooo hard and sooo difficult and takes forever to learn." Hmmmm, no. Remember this: you aren't programming Windows 8 from scratch. You need to think 80/20: what is the 20% of the language that will get you 80% of the way? Python is the same way, though I will agree right away that it's a lot easier to pick up than Java. Python can be used in so many scenarios it's just silly. You can write scripts to process data, move data, do analysis, apply algorithms to your data, connect tools, and even bake a cake... I think. Just choose a language, find the 80/20, put your nose down, and work your magic.

Where to learn Java:

Where to learn Python:


The reason I personally included R in this mix is its use in statistical analysis. Really, the question you should naturally ask is: why capture all this data if you aren't going to figure out what it's telling you? Using R for your analysis is one way to do that. R certainly isn't the only tool; there are SPSS, SAS, and many others, but R is open source and free. It is very easy to learn and integrates well with many other tools like Tableau, Python, etc. I really believe R is where the value of Data Science gets visualized. In the big data world, the data can be moved anywhere and made available to any tool, but only once someone starts analyzing it does the value of Big Data really get proven.
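The flavor of summary statistics R makes trivial is also a one-liner in Python's standard library, which this post covers too. The numbers below are invented for illustration:

```python
import statistics

# Toy sample of response times in milliseconds (values invented for illustration).
response_ms = [120, 135, 128, 300, 125, 131]

print(statistics.mean(response_ms))    # average of the sample
print(statistics.median(response_ms))  # middle value, robust to the 300 ms outlier
print(statistics.stdev(response_ms))   # sample standard deviation
```

In R the equivalents are `mean()`, `median()`, and `sd()`; the value comes from asking these questions of your data at all, whichever tool you pick.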

Where to learn R:


You could spend a lifetime learning every tool in use today for Big Data. Focus on the 80/20 and go after the biggest bang for the buck (the buck being your valuable time 🙂). Go through the resources I provided, submit more via the comments below, and as you learn something new, share it so others can learn as well. I sure would love to hear your thoughts below!

  • A really straightforward potted guide for an amateur such as myself. Thanks Joshua!

  • CM

    Thanks Josh! I am definitely going to research all of these. Just because everyone is talking about "big data," there is an expectation that we'll know what to do with it and how.

  • Imbo

    On the dot!

  • I was having a hard time figuring out how to start with big data... and here you answered everything I need to know... and I will always keep in mind the 80/20 rule, it makes a lot of sense... Thanks a lot

  • Nath

    The big data post is really informative for all who have questions on where to start. Thank you. I am into data warehousing and ETL using Informatica; can you suggest a big data tool that could be a potential candidate to learn and get a job with? Since I am from a DWH/Informatica background, I would like to move into a similar platform for big data.

  • Josh! It's refreshing to see a fellow big data enthusiast's take on the learning path, and I am happy that I have the exact same thoughts... that's how I have taken up the challenge to learn this technology, and I loved your 80/20 rule about Python/Java; it makes me a little less nervous!

  • Peter Wolf

    Very nice article... for 2014. Pretty much all these tools are obsolete and have been superseded by something better. How about Spark, Spark SQL, Cassandra, Kafka, and Scala as a more state-of-the-art list?

  • Enna Tewari

    very cool and helpful

  • Richard Kozicki

    I agree that the tools listed in this article are very useful with big data. Another tool that adds value to big data is Redis. The Redis in memory database is a great front end cache to Hadoop, to speed up management of the realtime data component. Output of Redis operations can be pushed to Hadoop for persistence, analytics, etc.

  • Robin White

    Great examples! Personally, I understood those by taking classes http://www.thedevmasters.com but I recommend that before taking classes, you better understand those concepts by going through website like here. Anyway, thanks for sharing those information. It is really helpful to me!

About The Author

Joshua Burkhow

Joshua is an experienced analytics professional with a focus on areas such as Analytics, Big Data, Business Intelligence, Data Science, and Statistics. He has more than 13 years of experience in Business Intelligence & Data Warehousing, Analytics, IT Management, Software Engineering, and Supply Chain Performance Management with Fortune 500 companies. He specializes in building Analytics organizations, Mobile Reporting, Performance Management, and Business Analysis.
