Definition of Data Mining

Data mining means looking for valuable information in a large amount of data. More precisely, it is the automatic or semi-automatic exploration and analysis of large quantities of data in order to find meaningful patterns and rules.

KDD Process

Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

[Figure: Knowledge Discovery in Databases]

[Figure: Data Mining Cycle]

Reasons to use Data Mining

Data is being collected and stored at a very high rate (gigabytes per hour). Examples:

  • Remote sensors on satellites.
  • Telescopes scanning the skies.
  • Microarrays generating gene expression data.
  • Scientific simulations generating terabytes of data.

Traditional techniques are no longer feasible for raw data at this scale. Data mining is used to reduce or segment the data, and it can help scientists:

  • Catalog, classify and segment data.
  • Form hypotheses.

Data Mining Origins:

Data mining draws on ideas from machine learning and artificial intelligence, pattern recognition, statistics, database systems and data visualization.

Traditional techniques may be unsuitable because of:

  • The sheer amount of data.
  • The high dimensionality of the data.
  • The variety of data types.

Data mining tasks are divided into two groups of methods:

  • Prediction methods: use some variables to estimate unknown values of other variables.
  • Description methods: look for patterns that humans can interpret and that describe the data.

Types of Data Mining Tasks

  • Classification [Predictive]
  • Clustering [Descriptive]
  • Association Rule Discovery [Descriptive]
  • Regression [Predictive]
  • Deviation Detection [Predictive]

Definition of Classification

Given a collection of records (the training set), each record contains a set of attributes, and one of the attributes is the class.

  • Find a model for the class attribute as a function of the values of the other attributes.
  • The goal is for previously unseen records to be assigned a class as accurately as possible.
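The definition above can be sketched in a few lines of Python. This is only an illustration: the article does not name a specific classifier, so k-nearest-neighbors is used here as an assumed stand-in for "a model", and the records, attribute meanings and the names `training_set`, `classify` and `accuracy` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical training set: each record is (attributes, class).
# The attributes are (income in thousands, age); the class is "yes" or "no".
training_set = [
    ((25.0, 22.0), "no"),
    ((30.0, 25.0), "no"),
    ((28.0, 30.0), "no"),
    ((70.0, 45.0), "yes"),
    ((80.0, 50.0), "yes"),
    ((75.0, 38.0), "yes"),
]


def classify(record, training, k=3):
    """Assign a class to an unseen record by majority vote of its k nearest training records."""
    nearest = sorted(training, key=lambda row: math.dist(row[0], record))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(test_set, training, k=3):
    """Fraction of held-out records whose predicted class matches the true class."""
    correct = sum(classify(attrs, training, k) == label for attrs, label in test_set)
    return correct / len(test_set)


# Previously unseen records, kept apart from the training set (also made up).
test_set = [((27.0, 28.0), "no"), ((78.0, 41.0), "yes")]

predicted = classify((72.0, 40.0), training_set)
score = accuracy(test_set, training_set)
print(predicted, score)
```

The held-out `test_set` is what lets us check the goal stated above: how accurately records the model has never seen are assigned a class.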

Classification in Application 1:

Direct Marketing

Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.

The approach is

  • Use the data for a similar product that was introduced earlier.
  • We know which customers decided to buy and which did not. This decision (buy, don’t buy) forms the class attribute.
  • Collect various demographic, lifestyle and company-interaction information about these customers, for example where they live and how much they earn.
  • Use this information as input attributes to learn a classifier model.

Classification in Application 2:

Fraud Detection

The aim is to predict fraudulent cases in credit card transactions.

The approach is

  • Use the transaction information and the information about the card holder as attributes, for example when a customer buys, what he buys and how often he pays on time.
  • Label past transactions as fraudulent or legitimate. This forms the class attribute.
  • Learn a model for the class of the transactions.
  • Use this model to detect fraud by observing the credit card transactions on an account.

Classification in Application 3:

Customer Attrition / Churn

The purpose is

To predict whether a customer is likely to be lost to a competitor.

The approach is

  • Use detailed records of each customer's transactions to find attributes.
  • Label the customers as loyal or non-loyal.
  • Find a model for loyalty.
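As a rough sketch of the first two steps, the code below turns raw transaction records into one attribute row per customer and joins it with a loyalty label. Everything in it is an assumption made for illustration: the record layout, the chosen attributes (`n_purchases`, `total_spent`, `days_since_last`), the reference date and the labels in `known_labels`.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction records: (customer id, purchase date, amount).
transactions = [
    ("c1", date(2023, 1, 5), 40.0),
    ("c1", date(2023, 6, 9), 55.0),
    ("c1", date(2023, 11, 20), 35.0),
    ("c2", date(2023, 2, 14), 120.0),
    ("c3", date(2023, 3, 1), 15.0),
    ("c3", date(2023, 12, 2), 25.0),
]

# Hypothetical labels, assumed to come from already known outcomes.
known_labels = {"c1": "loyal", "c2": "non-loyal", "c3": "loyal"}


def customer_attributes(records, as_of):
    """Summarize each customer's transactions into a small set of attributes."""
    grouped = defaultdict(list)
    for customer, day, amount in records:
        grouped[customer].append((day, amount))
    attributes = {}
    for customer, rows in grouped.items():
        last_purchase = max(day for day, _ in rows)
        attributes[customer] = {
            "n_purchases": len(rows),
            "total_spent": sum(amount for _, amount in rows),
            "days_since_last": (as_of - last_purchase).days,
        }
    return attributes


attributes = customer_attributes(transactions, as_of=date(2023, 12, 31))

# Training set for a loyalty model: (attributes, class) per labeled customer.
training_set = [(attributes[c], known_labels[c]) for c in sorted(attributes)]
for row in training_set:
    print(row)
```

The resulting `training_set` has the shape a classifier expects, so the third step (finding a model for loyalty) can start from it.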

Classification in Application 4:

Sky Survey Cataloging

The purpose is

To predict the class (star or galaxy) of celestial objects, especially faint ones, based on the pictures taken using a telescope (from Palomar Observatory).

The approach is

  • Segment the pictures.
  • Measure the image attributes of each object.
  • Build a model of the class based on these attributes.

Clustering Definition

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:

  • Data points in one cluster are more similar to one another.
  • Data points in different clusters are less similar to one another.

Similarity Measures

  • Euclidean distance if attributes are continuous.
  • Other problem-specific measures.
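To make the idea concrete, here is a minimal sketch of clustering with Euclidean distance as the similarity measure. The article does not say which clustering algorithm it has in mind, so k-means is an assumption here, and the data points, the starting centroids and the function names `mean_point` and `k_means` are made up for the example.

```python
import math


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centroids, max_iter=100):
    """Group points around centroids, using Euclidean distance as the similarity measure."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # No centroid moved, so the assignment is stable.
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data points with two continuous attributes each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Starting centroids are picked by hand here to keep the run deterministic.
centroids, clusters = k_means(points, [points[0], points[3]])
print(centroids)
print(clusters)
```

Points that end up in the same cluster are close to each other, while points in different clusters are far apart, which is exactly the property asked for in the definition.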


[Figure: Data Mining Clustering]

Clustering in Application 1:

Market segmentation:

The purpose is to subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a target market to be reached with a distinct marketing mix.

The approach is

  • Collect different attributes of customers based on their geographical and lifestyle-related information.
  • Search for clusters of similar customers.
  • Measure the quality of the clustering by comparing the buying patterns of customers in the same cluster with those of customers in different clusters.
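The last step can be sketched as follows. This is one possible way to do it, not a method given in the article: it assumes buying patterns are stored as purchase counts per product category and compares the average distance between customers in the same cluster with the average distance between customers in different clusters. The customers, counts and cluster assignment are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical buying patterns: purchase counts in three product categories.
purchases = {
    "ann": (9, 1, 0),
    "bob": (8, 2, 1),
    "cid": (0, 7, 9),
    "dee": (1, 8, 8),
}

# Hypothetical result of an earlier clustering step: customer -> cluster id.
cluster_of = {"ann": 0, "bob": 0, "cid": 1, "dee": 1}


def mean_pair_distances(patterns, assignment):
    """Return (mean distance within clusters, mean distance between clusters)."""
    within, between = [], []
    for a, b in combinations(patterns, 2):
        d = math.dist(patterns[a], patterns[b])
        if assignment[a] == assignment[b]:
            within.append(d)
        else:
            between.append(d)
    # Assumes at least one pair of each kind exists, as in the data above.
    return sum(within) / len(within), sum(between) / len(between)


within, between = mean_pair_distances(purchases, cluster_of)
print(within, between)
```

A clustering looks reasonable when the first number is clearly smaller than the second, meaning customers in the same cluster buy in a similar way.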

Clustering in Application 2:

Document Clustering

The purpose is to find groups of documents that are similar to each other based on the terms that appear in them.

The approach is

  • Identify the frequently occurring terms in each document.
  • Form a similarity measure based on the frequencies of the different terms.
  • Use this measure to group the documents into clusters.
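A small sketch of the first two steps is shown below. The article does not say which similarity measure to use, so cosine similarity over raw term counts is an assumption, and the sample documents and the names `term_frequencies` and `cosine_similarity` are invented for the example.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase word occurs in a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    """Similarity of two term-frequency vectors: 1.0 for the same direction, 0.0 for no shared terms."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document is treated as similar to nothing.
    return dot / (norm_a * norm_b)


# Hypothetical documents.
docs = {
    "d1": "Stars and galaxies fill the night sky",
    "d2": "The telescope shows stars and galaxies in the sky",
    "d3": "Credit card fraud costs banks money",
}

tf = {name: term_frequencies(text) for name, text in docs.items()}
sim_same_topic = cosine_similarity(tf["d1"], tf["d2"])
sim_other_topic = cosine_similarity(tf["d1"], tf["d3"])
print(sim_same_topic, sim_other_topic)
```

Documents about the same topic share many terms and get a higher score, so a clustering algorithm that uses this measure would tend to put them in the same group.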


