27 Jan 2010
Data mining is the process of analyzing data from different angles and perspectives (Data Mining: What is Data Mining?, reviewed on 16 March 2009, http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm). The analysis yields useful information that can be categorized and used to predict future patterns. Consumer-focused companies use data mining to determine how internal and external factors affect the sales and profits of their products. It gives a company information with which to predict sales patterns from customers' past purchases, and it helps in planning future promotional activities. A major goal of data mining is therefore the prediction of business patterns.
For example, a shopkeeper collects data on the purchase patterns of some customers based on features such as the age and gender of the buyer, the frequency of purchase, and the occasions of purchase. He can then use this knowledge to innovate in his promotional strategies and increase sales (a reflective practice, i.e. the outcome is changed by changing the response to the stimulus).
Many companies have used data mining for several decades, relying on a tacit knowledge of the process. At that time it was known as knowledge management, and the benefit those companies derived from it was huge. When they started using statistical methods, the accuracy of their predictions increased, because relationships between certain inputs and certain outcomes could be derived statistically. The advent of computers, followed by major software developments, helped companies turn to electronic data mining, which further increased the speed and accuracy of the predictions. Today, custom-made software is available for analyzing the data that companies collect. This software also helps companies warehouse the data.
Today, data mining has become an important knowledge management (KM) model in the corporate structure, and it plays a major role in the decision-making process of company leadership. Most data mining processes use the basic statistical principle of Exploratory Data Analysis (EDA). Data mining also, to some extent, makes use of Artificial Intelligence (AI) and database research technologies in processing the information.
The data mining process
In earlier days, the whole process was done manually and the scope for pattern prediction was very limited. Now, with the help of computers, the raw data is retrieved from a centralised data storage unit (the warehouse) and categorized using custom-made software. The categorized information is then summarized to yield knowledge about a particular feature, which is used to predict future patterns.
The process described above may seem to make data mining just another knowledge management process, but data mining involves more complex activities. The initial stage is preparing the data for further classification and analysis. The stages involved are listed below.
(a) Data preparation: The term denotes the process of omitting unwanted or unrealistic data (such as out of range values, like the negative value for age). This is an important process as the accuracy of the prediction hinges on the accuracy of the raw data.
(b) Feature selection: The process of identifying and selecting the prediction-related feature from the total available features.
(c) Data reduction: Large volumes of data are tabulated and aggregated to reduce their size for easier manipulation.
(d) Deployment: The process of applying a suitable statistical model to the information for classification, codification and pattern prediction.
(Data mining techniques, reviewed 16 Mar 09, http://www.statsoft.nl/uk/textbook/stdatmin.html)
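The first three steps above can be illustrated with a minimal Python sketch. The records, field names, age limits and helper functions are all hypothetical and chosen only for illustration; a real project would read the data from a warehouse and would probably use a dedicated data-analysis library.

```python
from collections import defaultdict

# Hypothetical raw purchase records; the field names are illustrative only.
raw_records = [
    {"age": 34, "gender": "F", "occasion": "festival", "amount": 120.0},
    {"age": -5, "gender": "M", "occasion": "weekday", "amount": 40.0},   # negative age
    {"age": 51, "gender": "M", "occasion": "festival", "amount": 80.0},
    {"age": 29, "gender": "F", "occasion": "weekday", "amount": 60.0},
    {"age": 240, "gender": "F", "occasion": "weekday", "amount": 15.0},  # out of range
]


def prepare(records, min_age=0, max_age=120):
    """(a) Data preparation: drop records whose age is out of range."""
    return [r for r in records if min_age <= r["age"] <= max_age]


def select_features(records, features):
    """(b) Feature selection: keep only the fields relevant to the prediction."""
    return [{f: r[f] for f in features} for r in records]


def reduce_by(records, key, value):
    """(c) Data reduction: aggregate to a count and a mean of `value` per `key`."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [count, running sum]
    for r in records:
        totals[r[key]][0] += 1
        totals[r[key]][1] += r[value]
    return {k: {"count": n, "mean": s / n} for k, (n, s) in totals.items()}


clean = prepare(raw_records)
selected = select_features(clean, ["occasion", "amount"])
summary = reduce_by(selected, "occasion", "amount")
print(summary)
```

The aggregated summary, rather than the raw records, is what a statistical model would then be applied to in the deployment step.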
Types of relationships analyzed in data mining
As part of the process, data mining seeks to ascertain four types of relationship among the data being mined. These relationships are listed below.
(a) Class: refers to classification of different groups for different features
(b) Cluster: grouping of two or more features which have certain relationship between them
(c) Association: determines association like the combined purchase of two different types of products
(d) Sequential pattern: determines if sale of one product is related to sale of another product.
(Data Mining: What is Data Mining?, reviewed on 16 March 2009, http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm)
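As an illustration of the association relationship, the following minimal Python sketch counts how often two products are bought together and derives the usual support and confidence measures. The baskets and function names are hypothetical, and the brute-force counting shown here is an assumption made for brevity, not a method prescribed by the sources cited above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of products per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]


def pair_counts(baskets):
    """Count how many baskets contain each pair of products."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def support(pair, baskets):
    """Fraction of all baskets that contain both products of the pair."""
    return sum(1 for b in baskets if set(pair) <= b) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction also containing `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(1 for b in with_antecedent if consequent in b) / len(with_antecedent)


print(pair_counts(baskets).most_common(2))
print(support(("bread", "butter"), baskets))
print(confidence("bread", "butter", baskets))
```

A pair with high support and high confidence is a candidate for joint promotion; in practice thresholds for both measures would have to be chosen for the data at hand.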
General analyzing methods used in data mining
Various innovative methods are used in data mining these days. Still, as mentioned earlier, Exploratory Data Analysis (EDA) is the major statistical principle employed. The following are some of the EDA methods used in data mining.
(a) Artificial neural network: resembles the neural network of the human body and is non-linear in nature (this can be compared to the mental models practiced in psychology for behavior analysis)
(b) Genetic algorithms: based on the method of natural selection and natural evolution
(Doug Alexander, Data Mining, reviewed on 16 March 2009, http://www.eco.utexas.edu/~norman/BUS.FOR/course.mat/Alex/).
(c) Decision tree models such as CART (Classification and Regression Trees) and CHAID (Chi Square Automatic Interaction Detection). CART creates two-way splits, whereas CHAID creates multi-way splits. These are comparatively newer methods.
(d) Nearest neighbor method: classifies each record of a warehoused data set by reference to the classes of the most similar records in a historical data set. For records of a similar class, the prediction recorded in the historical data can be reused.
(e) If-then rule induction model: this method uses the principles of cause and effect for making pattern predictions.
(f) Data visualization: one of the most effective methods. The data is interpreted with graphic tools using computer software. It has much potential for the future of data mining.
(Data Mining: What is Data Mining?, reviewed on 16 March 2009).
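The nearest neighbor method in (d) can be sketched in a few lines of Python. The historical records, the two features (age and purchases per year), the labels and the choice of k = 3 are all hypothetical assumptions made for illustration; the features are also left unscaled, which a real analysis would normally correct.

```python
import math
from collections import Counter

# Hypothetical historical data: (age, purchases per year) -> known spending class.
historical = [
    ((25, 2), "low"),
    ((30, 3), "low"),
    ((45, 10), "high"),
    ((50, 12), "high"),
    ((48, 9), "high"),
]


def knn_predict(record, history, k=3):
    """Predict the class of `record` by majority vote among its k nearest neighbours."""
    # Sort historical records by Euclidean distance to the new record.
    nearest = sorted(history, key=lambda item: math.dist(record, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict((47, 11), historical))
print(knn_predict((27, 2), historical))
```

A new record simply inherits the class that is most common among the historical records closest to it, so the quality of the prediction depends directly on how representative the historical data is.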
The facts mentioned above are general in nature. In recent decades, there have been rapid changes in many aspects of data mining, such as (a) the type of data being mined (e.g. relational, transactional, multimedia and ontology data), (b) the knowledge derived by mining (e.g. relationships such as association, class, trend and discrimination, and multiple and integrated functions), (c) the techniques developed (e.g. OLAP analysis, scalable data mining and spatio-temporal data mining), and (d) the applications of data mining (e.g. the retail market, banking, credit cards and fraud detection) (Longbing Cao & Chengqi Zhang 2008 Domain Driven Data Mining, Data Mining and Knowledge Discovery Technologies, D. Taniar, IGI Global).
Utility of data mining
Data mining has tremendous potential in business today, as every company depends heavily on its knowledge management system to stay competitive. Companies like Wal-Mart have achieved mammoth success because they could provide their suppliers with up-to-date data from a large, networked data warehouse, which the suppliers analyze with their own custom-made software. In this way the suppliers can identify the purchasing trend at any given time at any given outlet and plan promotional activities accordingly (Data Mining: What is Data Mining?, reviewed on 16 March 2009, http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm). This type of real-time data mining is also known as On-Line Analytic Processing (OLAP). OLAP has created a win-win situation for the company as well as for its suppliers. As an offshoot, even some NGOs now use data mining to compile and analyze data on social capital, complex adaptive systems of society, communities of practice and so on, and use such information to blow the whistle on the HR malpractices of big companies.
Intra disciplinary utility
Data mining also has some intra-disciplinary utilities, which in turn help statisticians develop more accurate methods for data mining. Some of the intra-disciplinary models are:
(a) The mining programme GridMiner Assistant, which uses grid computing for an ontology-based framework and is automated. It is used in scientific discovery, optimal treatment of patients, cutting costs etc. (Peter Brezany, Ivan Janciak & A Min Tjoa (2008) Ontology-Based Construction of Grid Data Mining Workflows. IGI Global).
(b) Intra-disciplinary methods named multiple criteria optimization methods (e.g. multiple criteria linear, quadratic and fuzzy-linear programming), which are used for statistical analysis in the pharmaceutical industry (e.g. classification of HIV-1 associated dementia), the finance industry (e.g. scoring management), and banks (e.g. credit cards, fraud detection) (Yong Shi, Yi Peng, Gang Kou & Zhengxin Chen (2008) Introduction to Data Mining Techniques via Multiple Criteria Optimization Approaches and Applications. IGI Global).
Modern data mining techniques exploit the computing capabilities of the company. The processing capacity required is determined by the number of queries, the complexity of the analysis and the volume of data to be synthesized. That is, the more queries there are, or the greater the complexity or the volume of data, the more processing capacity is required. The level of data security needed is a deciding factor in choosing the type of hardware. Companies also use client-server networking to warehouse the data as they build their capacity.
Though the business community the world over has benefitted immensely from data mining, the practice has raised some doubts about culture and corporate ethics. Some individuals oppose the use of explicit knowledge of their purchase behavior, and the codification or classification of such knowledge, for a company's profit making. Still, the majority of the population does not consider it harmful, and the companies seem to be concerned only with cost and profit.
Data mining is the process of deriving meaning and patterns from collected data in order to predict a future pattern. Thus we can safely assume that, in a broader sense, data mining is a form of knowledge management. A similar process is followed in both: the raw data is converted into information by analyzing the relationships among the data; this information is used to derive knowledge by analyzing the patterns between various pieces of information; and finally, this knowledge creates wisdom (knowledge management) or predicts future patterns (data mining).
Longbing Cao & Chengqi Zhang (2008) Domain Driven Data Mining. IGI Global.
Peter Brezany, Ivan Janciak & A Min Tjoa (2008) Ontology-Based Construction of Grid Data Mining Workflows. IGI Global.
Yong Shi, Yi Peng, Gang Kou & Zhengxin Chen (2008) Introduction to Data Mining Techniques via Multiple Criteria Optimization Approaches and Applications. IGI Global.
UCLA Anderson School of Management website. Data Mining: What is Data Mining? http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
The University of Texas at Austin website. Doug Alexander, Data Mining. http://www.eco.utexas.edu/~norman/BUS.FOR/course.mat/Alex
Statsoft.nl website. Data mining techniques. http://www.statsoft.nl/uk/textbook/stdatmin.html