Definition of Data Mining
Data mining is the search for valuable information in large volumes of data: the automatic or semi-automatic exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.
KDD Process
Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
Knowledge Discovery in Databases
CYCLE OF DATA MINING
Reasons to use Data Mining
Data is collected and stored at a very high rate (gigabytes per hour):
- Remote sensors on satellites.
- Telescopes scanning the skies.
- Microarrays generating gene expression data.
- Scientific simulations generating terabytes of data.
Traditional techniques are no longer feasible at this scale. Data mining is used for:
- Data reduction: cataloging, classifying, and segmenting data.
- Assisting scientists in forming hypotheses.
Data Mining Origins
Data mining draws on ideas from artificial intelligence and machine learning, pattern recognition, statistics, database systems, and data visualization. Traditional techniques may be unsuitable because of:
- The sheer amount of data.
- The high dimensionality of the data.
- The variety of data types.
Data mining tasks fall into two groups of methods:
- Prediction methods: use some variables to estimate the unknown value of other variables.
- Description methods: find human-interpretable patterns that describe the data.
Types of Data Mining Tasks
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]
Definition of Classification
Given a collection of records (the training set):
- Each record contains a set of attributes, one of which is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- The goal is to assign previously unseen records to a class as accurately as possible.
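The definition above can be illustrated with a minimal sketch. It uses a 1-nearest-neighbour rule as the model, which is only one possible choice and is not prescribed by these notes; the training records, attribute meanings, and class labels are hypothetical.

```python
import math

# Hypothetical training set: each record is ((income in thousands, age), class).
TRAINING_SET = [
    ((25.0, 22.0), "no"),
    ((30.0, 25.0), "no"),
    ((80.0, 45.0), "yes"),
    ((95.0, 50.0), "yes"),
]


def euclidean(a, b):
    """Euclidean distance between two equal-length attribute tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(record, training_set=TRAINING_SET):
    """Assign `record` the class of its closest training record (1-NN)."""
    _, label = min(training_set, key=lambda item: euclidean(item[0], record))
    return label


# Previously unseen records are assigned a class by the learned rule.
print(classify((28.0, 24.0)))
print(classify((90.0, 48.0)))
```

In practice the model would be evaluated on a held-out test set to estimate how accurately unseen records are classified.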
Classification in Application 1:
Direct Marketing
Goal:
Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
The approach is
- Use the data for a similar product that was introduced earlier.
- Record which customers decided to buy and which did not. This decision (buy, don't buy) forms the class attribute.
- Collect demographic, lifestyle, and company-interaction information about these customers, for example where they live and how much they earn.
- Use this information as input attributes to learn a classifier model.
Classification in Application 2:
Fraud Detection
The aim is to predict fraudulent cases in credit card transactions.
The approach is
- Use the transaction information and information about the card account as attributes.
- For example: when a customer buys, what they buy, and how often they pay on time.
- Label past transactions as fraudulent or legitimate. This forms the class attribute.
- Learn a model for the class of the transactions.
- Use this model to detect fraud by observing the credit card transactions on an account.
Classification in Application 3:
Customer Attrition / Churn
The purpose is
To predict whether a customer is likely to be lost to a competitor.
The approach is
- Use the detailed transaction records of each customer to derive attributes.
- Label the customers as loyal or disloyal.
- Find a model for loyalty.
Classification in Application 4:
Sky Survey Cataloging
The purpose is
To predict the class (star or galaxy) of celestial objects, especially faint ones, based on images taken with a telescope (from Palomar Observatory).
The approach is
- Segment the image.
- Measure the image attributes.
- Model the class based on these attributes.
Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
- Data points in one cluster are more similar to one another.
- Data points in different clusters are less similar to one another.
Similarity Measures
- Euclidean distance if attributes are continuous.
- Other problem-specific measures.
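As a rough illustration of clustering with Euclidean distance, the sketch below runs a small k-means loop. k-means is one common clustering algorithm chosen here for illustration, not one named in these notes; the sample points and the starting centroids are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def kmeans(points, centroids, iterations=10):
    """Repeatedly assign points to the nearest centroid and update centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids


# Hypothetical data: two visibly separated groups of 2-D points.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = kmeans(POINTS, centroids=[POINTS[0], POINTS[3]])
print(clusters)
```

Points within each resulting cluster are close to one another and far from the points in the other cluster, which is the property the definition asks for.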
ILLUSTRATING CLUSTERING
Clustering in Application 1:
Market segmentation:
The purpose is
To subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a target market to be reached with a distinct marketing mix.
The approach is
- Collect different attributes of customers based on their geographical and lifestyle-related information.
- Find clusters of similar customers.
- Measure the quality of the clustering by comparing the buying patterns of customers in the same cluster with those of customers in different clusters.
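One simple way to make the quality check concrete is to compare the average distance between customers in the same cluster with the average distance between customers in different clusters. The sketch below assumes buying patterns are encoded as numeric tuples; the function name `mean_pairwise_distances` and the sample spending figures are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two buying-pattern tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pairwise_distances(clusters):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes at least one pair of customers in the same cluster and at
    least one pair in different clusters.
    """
    labelled = [(i, p) for i, cluster in enumerate(clusters) for p in cluster]
    within, between = [], []
    for (label_a, a), (label_b, b) in combinations(labelled, 2):
        target = within if label_a == label_b else between
        target.append(euclidean(a, b))
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical buying patterns: (monthly grocery spend, monthly electronics spend).
CLUSTERS = [
    [(200.0, 10.0), (220.0, 15.0), (210.0, 5.0)],
    [(50.0, 300.0), (60.0, 320.0)],
]
within, between = mean_pairwise_distances(CLUSTERS)
print(within, between)
```

A good segmentation should give a within-cluster average that is clearly smaller than the between-cluster average.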
Clustering in Application 2:
Document Clustering
The purpose is
To find groups of documents that are similar to each other based on the terms appearing in them.
The approach is
- Identify the terms occurring in each document.
- Form a similarity measure based on the frequencies of the different terms.
- Use this measure to cluster the documents.
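A term-frequency similarity measure can be sketched as follows. Cosine similarity over raw term counts is assumed here as one reasonable choice; the notes do not specify which measure to use, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each whitespace-separated, lower-cased term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    shared = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: two about finance, one about sport.
DOCS = {
    "d1": "stock market shares rise as market rallies",
    "d2": "shares fall as stock market slides",
    "d3": "team wins football match in final minute",
}
tf = {name: term_frequencies(text) for name, text in DOCS.items()}

print(cosine_similarity(tf["d1"], tf["d2"]))
print(cosine_similarity(tf["d1"], tf["d3"]))
```

Documents with a high pairwise similarity would then be grouped into the same cluster by whichever clustering algorithm is applied on top of this measure.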