Data Reduction
What is Data Reduction?
Data reduction refers to the techniques used to transform massive datasets into a smaller or more suitable form for analysis, while preserving the intrinsic characteristics of the data and minimizing the loss of accuracy. Data reduction is used in scientific computation, statistical analysis, control of industrial processes, and business applications, as well as in data mining. Many data reduction techniques focus on obtaining an approximate representation of the data; others focus on reducing the size of the original data.
Data reduction includes both parametric and nonparametric techniques. Parametric techniques assume a model for the data and attempt to estimate the model parameters that produce the best fit to the data, while nonparametric techniques represent, or categorize, the data without making any assumptions about the data model. (Source: Julia Couto)
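To make the distinction concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the data set is synthetic, and the helper names `fit_line` and `histogram` are hypothetical, not taken from the cited source. The parametric reduction assumes a straight-line model and keeps just its two parameters; the nonparametric reduction assumes no model and keeps only counts per bin.

```python
import random
from collections import Counter

def fit_line(xs, ys):
    """Parametric reduction: keep only the slope and intercept of a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def histogram(values, bin_width):
    """Nonparametric reduction: keep only the number of values in each fixed-width bin."""
    counts = Counter(int(v // bin_width) for v in values)
    return dict(sorted(counts.items()))

# Synthetic data (an assumption for illustration): 1,000 noisy points around y = 2x + 1.
rng = random.Random(0)
xs = [i / 10 for i in range(1000)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

slope, intercept = fit_line(xs, ys)    # 1,000 points reduced to 2 numbers
bins = histogram(ys, bin_width=20.0)   # 1,000 points reduced to a handful of bin counts
```

The line fit is only a good summary if the straight-line assumption actually holds; the histogram makes no such assumption but also says nothing about how y depends on x.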
Framing the Analysis
According to Emily Namey, Greg Guest, Lucy Thairu, and Laura Johnson in their paper “Data Reduction Techniques for Large Qualitative Data Sets”:
“Large qualitative data sets generally encompass multiple research questions. Hence, very few, if any, analyses of such data sets simultaneously involve all of the data that have been collected. From the outset, researchers need to delineate the boundaries of a given analysis with a comprehensive analysis plan. This plan can include guidelines for data set reduction, including whether all the data will first be coded in an exploratory analysis, whether they will be partitioned in a way appropriate for theoretical analysis and hypothesis testing, or whether some data will simply not be included in specific analyses. Eliminating data not relevant to the analysis at hand—or extracting the data that are relevant—is usually the first, and arguably the simplest, form of data reduction.
As Miles and Huberman (1994) explain, Data reduction is not something separate from analysis. It is part of analysis. The researcher’s decisions—which data chunks to code and which to pull out, which evolving story to tell—are all analytic choices. Data reduction is a form of analysis that sharpens, sorts, focuses, discards, and organizes data in such a way that “final” conclusions can be drawn and verified. In cases where the larger data set was compiled from more than one type of data collection instrument (e.g., semi-structured in-depth interviews, structured focus groups, and pile-sorting activities), the researcher needs to make a decision about the type of data she or he will select from the larger data set. She or he may choose to analyze data from only one type of instrument, or from several different instruments. As described by Patton (1990), this form of “triangulation” across different data collection strategies during data analysis can be particularly helpful when dealing with large data sets. The researcher may also need to frame the analysis in terms of the sources of data or the categories of participants from whom the data were collected (MacQueen and Milstein 1999). This may require limiting the analysis to one or two sites of a multisite project, or limiting the number of subgroups included in the analysis (e.g., including only data relevant in terms of select participant characteristics regardless of site).”
Data Explosion
So far we have talked about data reduction from a researcher’s point of view. Now consider it from a business perspective. If we are unable to handle the amount of data we have, no analysis will be worthwhile. According to IBM, the continuing explosion of data is one of the biggest challenges that IT organizations of all sizes face today. IBM says there are four ways to approach this problem.
First, eliminate the number one source of data growth by backing up only the data that has changed since the last backup. Almost all backup products on the market force you to perform periodic full backups, which create enormous amounts of duplicate data.
Second, determine what types of data you have and categorize them so that you can manage them effectively: move less frequently accessed data to lower-cost tiers of storage, and delete the data that you no longer need or want. By cleaning out your production storage, you will shorten your backup cycles, improve application performance, and delay the acquisition of additional capacity.
Third, put automated processes in place, based on policies that meet business requirements and/or service level agreements, to keep your production systems as clean as possible.
Finally, compress and de-duplicate the data you end up putting into your data protection and retention systems.
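The first and last of these steps can be sketched in a few lines of Python. This is a toy illustration, not how IBM's or any other backup product works: files are modeled as an in-memory dictionary, and the function names `changed_files`, `store_deduplicated`, and `restore` are hypothetical. Change detection and de-duplication both rely on SHA-256 content hashes from the standard `hashlib` module, and compression uses the standard `zlib` module.

```python
import hashlib
import zlib

def changed_files(current, previous_manifest):
    """Return (names whose content changed since the last backup, new manifest).

    `current` maps a file name to its bytes; `previous_manifest` maps a file
    name to the SHA-256 digest recorded at the last backup.
    """
    manifest = {name: hashlib.sha256(data).hexdigest()
                for name, data in current.items()}
    changed = [name for name, digest in manifest.items()
               if previous_manifest.get(name) != digest]
    return changed, manifest

def store_deduplicated(blobs, store):
    """Compress each blob and store it once, keyed by its content hash."""
    keys = []
    for data in blobs:
        key = hashlib.sha256(data).hexdigest()
        if key not in store:            # identical content is stored only once
            store[key] = zlib.compress(data)
        keys.append(key)
    return keys

def restore(key, store):
    """Recover the original bytes for a stored key."""
    return zlib.decompress(store[key])

# Hypothetical example data: two identical files and one small report.
files = {"a.txt": b"hello " * 100, "b.txt": b"hello " * 100, "c.txt": b"report"}
_, manifest = changed_files(files, {})               # first run: everything is new
files["c.txt"] = b"report v2"
changed, manifest = changed_files(files, manifest)   # second run: only c.txt differs

store = {}
keys = store_deduplicated(files.values(), store)     # a.txt and b.txt share one entry
```

Real systems usually de-duplicate at the block or chunk level rather than per file, and persist the manifest between runs; the sketch only shows the underlying idea.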
Sources:
Julia Couto, “Data reduction,” in AccessScience, ©McGraw-Hill Companies, 2008, http://www.accessscience.com
Emily Namey, Greg Guest, Lucy Thairu, and Laura Johnson, “Data Reduction Techniques for Large Qualitative Data Sets,” http://www.stanford.edu/~thairu/07_184.Guest.1sts.pdf
Joshua Burkhow
Joshua is working to become a Data Scientist with a focus on Analytics, Big Data, Machine Learning, and Statistics. His passion for data and information is second to none. He is a certified IBM Cognos Expert with more than 10 years of experience in Business Intelligence & Data Warehousing, Analytics, IT Management, Software Engineering, and Supply Chain Performance Management with Fortune 500 companies. He has specializations in Analytics, Mobile Reporting, Performance Management, and Business Analysis.