I could easily study and write solely on the topic of Big Data. I could dive deep into every single Apache project and all the other software offerings, white papers, and technologies around big data, and I’d have a lot to write about. The challenge is that we are not robots. We can’t know all things and, although I hate to admit it, we can’t master every tool, technology, or language. Sorry folks, it ain’t gonna happen. So when I am in learning mode, I choose to see what floats to the top first. I simply go after the tools that offer the most bang for the buck: the ones that have proven their value among as many users as possible. That way I have an answer ready when people come to me and ask the simple (but rabbit-hole-ish) question: “I want to work with Big Data, what do I need to know?”
Here are my top 5 that, in my humble opinion, satisfy the 80/20 rule for Big Data:
Hadoop
I honestly think I would be laughed out of the big data stratosphere if I did not include Hadoop, and put it at #1 at that. It is the base of Big Data. I would love to do an analysis of all titles that have “Big Data” in them, and I hypothesize that “Hadoop” would rank very highly as a term in those titles as well. Especially among those looking at big data from the outside in, Hadoop is simply synonymous with Big Data.
Where to learn Hadoop:
Hive
When Hadoop first came out, the only way to get data out of it was to program MapReduce jobs in Java, and developers soon realized they had better come up with something far more usable by the great majority of data analysts. The de facto data language was SQL, so Hive was built to let people with SQL skills adapt quickly and run SQL-like queries (written in HiveQL) against Hadoop. Hive takes these queries and translates them into MapReduce jobs. Apache Hive supports analysis of large datasets stored in Hadoop’s HDFS and in compatible file systems such as the Amazon S3 file system. To accelerate queries, it provides indexes, including bitmap indexes (Wikipedia).
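To make the "query becomes a MapReduce job" idea a little more concrete, here is a rough sketch in plain Python of what a simple GROUP BY query boils down to. This is not how Hive actually compiles queries; the `page_views` table, its columns, and the function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a table stored in HDFS.
page_views = [
    {"user": "ann", "page": "/home"},
    {"user": "bob", "page": "/home"},
    {"user": "ann", "page": "/cart"},
]

# The HiveQL-style query this sketch mimics (illustrative only):
#   SELECT page, COUNT(*) FROM page_views GROUP BY page;


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    """Collect emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to produce the COUNT(*) result."""
    return {key: sum(values) for key, values in grouped.items()}


result = reduce_phase(shuffle(map_phase(page_views)))
print(result)
```

The point is simply that one line of SQL stands in for the map, shuffle, and reduce steps you would otherwise have to write by hand.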
Where to learn Hive:
Pig
You have a place to store and retrieve data. You have a tool to extract it using SQL-like code. Now let me ask you a very basic question: how do you get the data there in the first place? For that we have Pig, and Pig is a great tool. Pig was designed for performing a long series of data operations, making it ideal for three categories of Big Data operations: standard extract-transform-load (ETL) data pipelines, research on raw data, and iterative processing of data. Pig runs on Hadoop and makes use of MapReduce and the Hadoop Distributed File System (HDFS). The language for the platform is called Pig Latin, which abstracts the Java MapReduce idiom into a form similar to SQL. Pig Latin is a data flow language, whereas SQL is a declarative language (Hortonworks).
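If the "flow language" idea feels abstract, here is a small Python sketch of the same style of thinking: you describe the data pipeline one step at a time, and each step feeds the next. It is only an analogy for how a Pig Latin script reads, not real Pig code, and the log format and variable names are invented for the example.

```python
from collections import defaultdict

# Hypothetical raw log lines in the form "user,action,amount".
raw_lines = [
    "ann,purchase,30",
    "bob,view,0",
    "ann,purchase,12",
    "cat,purchase,5",
]

# Step 1 (think LOAD): parse each raw line into fields.
records = [line.split(",") for line in raw_lines]

# Step 2 (think FILTER): keep only the purchase events.
purchases = [record for record in records if record[1] == "purchase"]

# Step 3 (think GROUP): collect purchase amounts per user.
amounts_by_user = defaultdict(list)
for user, _action, amount in purchases:
    amounts_by_user[user].append(int(amount))

# Step 4 (think FOREACH ... GENERATE with SUM): total per user.
totals = {user: sum(amounts) for user, amounts in amounts_by_user.items()}
print(totals)
```

In SQL you would state the result you want in a single query and let the engine decide how to get there. In a flow language you spell out the steps in order, which is why it suits long ETL pipelines.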
Where to learn Pig:
Java / Python
The difference between the first three tools and the next two is that the first three are software that has been built and is ready to use for processing big data. The next two (really three), Java, Python, and R, are programming languages. The easy way to view Java and Python in the eyes of big data is as tools that extend or simplify the work you are doing. For example, knowing Java will not only help you understand the entire MapReduce paradigm, it will also help you understand many of Apache’s projects, since many of them are written in Java. It will also let you extend Pig and Hive by writing User Defined Functions (UDFs). Yes, whenever languages get compared and Java is brought up, plenty of programmers will be the first to say “Oh, Java is sooo hard and sooo difficult and takes forever to learn.” Hmmmm, no. Remember this: you aren’t programming Windows 8 from scratch. You need to think 80/20. What is the 20% of the programming language that will get you 80% of the way?
Python is the same way, though I will agree right away that it’s a lot easier to pick up than Java. Python can be used in so many scenarios it’s just silly. You can write scripts to process data, to move data, to do analysis, to apply algorithms to your data, to connect tools, and even to bake a cake… I think. Just choose a language, find the 80/20, put your nose down, and work your magic.
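As a taste of what "extending" a tool with your own code can look like, here is a small Python sketch of a row-cleaning function written in the streaming style, where rows arrive as tab-separated text on standard input and cleaned rows go to standard output. Treat it as an illustration only: the two-column layout (name, city) and the function names are assumptions, and how you would wire a script like this into Hive or Pig depends on your setup.

```python
import sys


def normalize_row(line):
    """Clean one tab-separated row of (name, city).

    The two-column layout is a made-up example, not a fixed format.
    """
    name, city = line.rstrip("\n").split("\t")
    return "\t".join([name.strip().title(), city.strip().upper()])


def main(stream=sys.stdin, out=sys.stdout):
    """Read rows from a stream and write one cleaned row per input row."""
    for line in stream:
        if line.strip():  # skip blank lines
            out.write(normalize_row(line) + "\n")


# Quick local check with hypothetical rows, no cluster required.
sample = ["ann lee\tchicago\n", " bob ray \t new york\n"]
cleaned = [normalize_row(row) for row in sample]
print(cleaned)
```

This is the 80/20 in action: a dozen lines of a language you already know, rather than mastering the whole language first.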
Where to learn Java:
Where to learn Python:
R
The reason I personally included R in this mix of things to know is its use in statistical analysis. Really, the natural question to ask is: why capture all this data if you aren’t going to figure out what it’s telling you? Using R is one way to do that analysis. R isn’t the only tool; there are SPSS, SAS, and many others, but R is open source and free. It is very easy to learn and has great integration with many other tools like Tableau, Python, etc. I really believe R is where the value of Data Science gets visualized. In the big data world, data can be moved anywhere and made available to any tool, but it is only once someone starts analyzing the data that the value of Big Data really gets proven.
Where to learn R:
You can spend a lifetime learning every tool in use today for Big Data. Focus on the 80/20 and go after the biggest bang for the buck (the buck being your valuable time 🙂). Go through the resources I provided, submit more via the comments below, and as you learn something new, share it so others can learn as well. I sure would love to hear your thoughts below!
Joshua is an experienced analytics professional with a focus on areas such as Analytics, Big Data, Business Intelligence, Data Science and Statistics. He has more than 13 years of experience in Business Intelligence & Data Warehousing, Analytics, IT Management, Software Engineering and Supply Chain Performance Management with Fortune 500 companies. He has specializations in building Analytics organizations, Mobile Reporting, Performance Management, and Business Analysis.