After a lot of reading on “Data” oriented themes, one discussion that always comes around is ‘Structured’ data vs. ‘Unstructured’ data. It comes up so often that it is a staple topic for most computer science undergrads.
Unstructured Data and the 80 Percent Rule
According to Seth Grimes, an analytics strategist with Washington, DC-based Alta Plana Corporation, “It’s a truism that 80 percent of business-relevant information originates in unstructured form, primarily text. The figure is very widely cited by analysts, vendors, and users alike, all seeking to make the case for text analytics. There are variations; Anant Jhingran of IBM Research, among others, cites an 85% figure… It does seem obvious that a very high proportion of data is unstructured: How much of your workday is spent reading or writing e-mails, reports, or articles and the like, in conversations, or listening to live or recorded audio? And in making the case for tapping unstructured sources, a very important asset in fields ranging from customer experience management to counter-terrorism, it’s helpful to be able to quantify the proportion, to put a number on it.” (Source: Clarabridge)
“Eighty-five percent of data is unstructured, and you need text analysis and text abstraction along with a relational database to arrive at an integrated view,” says Jerry Hill, vice president of manufacturing for Teradata. (Source: Forbes)
Data companies are constantly working to find better ways to organize and use unstructured data. A recent article described EMC as “stepping up its pursuit of big-data analysis” as the company announced that it will release its own distributions of the open-source Apache Hadoop distributed processing software, along with a related appliance that will analyze both structured and unstructured data on a single platform.
Unstructured data can’t be analyzed in conventional relational databases, so organizations swamped with tens or hundreds of terabytes or more rely on Hadoop, which can spread processing across tens, hundreds, or thousands of compute nodes on commodity servers, depending on the scale of the deployment. Hadoop also provides a MapReduce engine, which helps split up workloads when handling particularly large sets of unstructured data. (Source: InformationWeek)
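To make the MapReduce idea a little more concrete, here is a minimal sketch of the pattern in plain Python, counting words across a handful of text documents. It does not use Hadoop or any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are made up for illustration, and everything runs in a single process rather than across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        # Strip common punctuation so "big." and "big" count as the same word.
        yield word.strip(".,;:!?\"'()"), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        if key:  # skip tokens that were only punctuation
            grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample documents standing in for unstructured text.
    docs = ["Big data is big.", "Data is unstructured"]
    print(word_count(docs))
```

In a real cluster the map calls would run in parallel on different nodes, each working on its own slice of the documents, and the grouped keys would be distributed across reducers. The sketch only shows the shape of the computation.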
What are some examples of Unstructured Data?
- Excel Files
- Word Documents
- PDF Documents
- Images (e.g., .jpg or .gif)
- Media (e.g., .mp3, .wma, or .wmv)
- Text Files
- PowerPoint Presentations
Unstructured Data Solutions
It should come as no surprise that IBM is one of the many companies that have gone after the unstructured data problem, and it has done so with a product called Content Analytics. Here is a short video that describes what Content Analytics does.
Much more to come…
There is so much that can be written and discussed about unstructured data, especially since it is a huge focus of many Business Intelligence companies. So please stay tuned for more in-depth coverage of this topic.
Joshua is working to become a Data Scientist with a focus on Analytics, Big Data, Machine Learning, and Statistics. His passion for Data and Information is second to none. He is a certified IBM Cognos Expert with more than 10 years of experience in Business Intelligence & Data Warehousing, Analytics, IT Management, Software Engineering, and Supply Chain Performance Management with Fortune 500 companies. He has specializations in Analytics, Mobile Reporting, Performance Management, and Business Analysis.