Visualizing and Exploring Data

February 1st, 2010 by admin

In data mining, visualization is an important tool for discovering patterns in data. Visual methods are well suited to sifting through data to find unexpected relationships.

Summarizing Data
Some simple tools for summarizing data are:

  1. Mean
  2. Median
  3. Mode
  4. Standard deviation
  5. Variance
  6. Interquartile range
  7. Skewness
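
As a minimal sketch, the summary measures listed above can be computed with Python's standard `statistics` module. The sample values and the function name `summarize` are made up for illustration, and the sketch assumes the population forms of variance, standard deviation, and skewness (dividing by n rather than n - 1).

```python
import statistics

def summarize(values):
    """Return the summary statistics listed above for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    # Population standard deviation (divides by n, not n - 1).
    std = statistics.pstdev(values)
    # Quartile cut points; the interquartile range is Q3 - Q1.
    q1, _, q3 = statistics.quantiles(values, n=4)
    # Skewness as the third standardized moment (population form).
    skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std": std,
        "variance": statistics.pvariance(values),
        "iqr": q3 - q1,
        "skewness": skew,
    }

# Hypothetical sample with one large value, so the skewness is positive.
data = [2, 3, 3, 4, 5, 5, 5, 9]
summary = summarize(data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The single large value (9) pulls the mean toward the right tail, which is why the skewness comes out positive for this sample.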

Tools for Displaying Single Variables

Histogram, kernel density estimate, box-and-whisker plot
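
To illustrate the idea behind a histogram, the following sketch counts how many values fall into equal-width bins and prints a text bar for each bin. The data, the bin width, and the function name `histogram_counts` are assumptions chosen for the example; a plotting library would normally draw the bars.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [k * width, (k + 1) * width)."""
    counts = Counter(int(v // bin_width) for v in values)
    # Map each bin's lower edge to its count, in increasing order.
    return {k * bin_width: counts[k] for k in sorted(counts)}

# Hypothetical single variable: ages of ten customers.
ages = [21, 22, 25, 31, 33, 34, 38, 45, 47, 62]
bins = histogram_counts(ages, 10)
for start, count in bins.items():
    print(f"{start:>3}+ {'#' * count}")
```

Bins with no values (here, 50 to 59) are simply omitted from the result, which keeps the sketch short but hides gaps that a plotted histogram would show.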

Tools for Displaying Relationships Between Two Variables

Scatter plot, Contour plot

Tools for Displaying More Than Two Variables

Scatterplot matrix, trellis plots, icons, Chernoff faces

Components of Data Mining algorithms

February 1st, 2010 by admin

Data mining algorithms have the ability to find patterns based on the following components:

1. Model or Pattern Structure
Determining the underlying structure or functional form that we seek from the data.

2. Score Function
Judging the quality of a fitted model.

3. Optimization and Search Method
Optimizing the score function and searching over different model and pattern structures.

4. Data Management Strategy
Handling data access efficiently during the search and optimization.
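
The four components above can be sketched in a few lines. This is only an illustration under stated assumptions: the model is a line through the origin, the score function is the sum of squared errors, the search is an exhaustive scan over a small grid of slopes, and the data is small enough to sit in memory. All names and values are hypothetical.

```python
def predict(w, x):
    # Model structure: a line through the origin, y = w * x (assumed for illustration).
    return w * x

def score(w, data):
    # Score function: sum of squared errors of the model on the data; lower is better.
    return sum((y - predict(w, x)) ** 2 for x, y in data)

def search(data, candidates):
    # Search method: exhaustive search over a fixed grid of candidate slopes.
    return min(candidates, key=lambda w: score(w, data))

# Data management: here the data simply fits in memory as a list of (x, y) pairs.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
candidates = [i / 10 for i in range(0, 41)]  # 0.0, 0.1, ..., 4.0
best_w = search(data, candidates)
print("best slope on the grid:", best_w)
```

Swapping any one component (a different model form, a different score, a smarter search, or data streamed from disk) leaves the other three unchanged, which is the point of separating them.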

Data Mining Functionalities

February 1st, 2010 by admin

There is a wide variety of data mining functionalities, or tasks. Some of them are highlighted below:

1. Concept/Class Description: Characterization and Discrimination

Data characterization is a summarization of the general characteristics or features of
a target class of data. The data corresponding to the user-specified class are typically collected
by a database query.

Data discrimination is a comparison of the general features of target class data objects
with the general features of objects from one or a set of contrasting classes. The target
and contrasting classes can be specified by the user, and the corresponding data objects
retrieved through database queries. For example, the user may want to compare the general
features of software products whose sales increased by 10% in the last year with those
whose sales decreased by at least 30% during the same period. The methods used for data
discrimination are similar to those used for data characterization.

2. Mining Frequent Patterns, Associations, and Correlations

Frequent patterns, as the name suggests, are patterns that occur frequently in data. There
are many kinds of frequent patterns, including itemsets, subsequences, and substructures.
A frequent itemset typically refers to a set of items that frequently appear together
in a transactional data set, such as milk and bread.

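Frequent itemsets give rise to association rules such as "customers who buy milk also buy bread".
The support of a rule is the fraction of transactions that contain all of its items, and its
confidence is the fraction of transactions containing the left-hand side that also contain the
right-hand side. Typically, association rules are discarded as uninteresting if they do not
satisfy both a minimum support threshold and a minimum confidence threshold.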

3. Classification and Prediction

Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts, for the purpose of being able to use the model to predict
the class of objects whose class label is unknown. The derived model is based on the analysis
of a set of training data (i.e., data objects whose class label is known).

A decision tree is a flow-chart-like tree structure, where
each node denotes a test on an attribute value, each branch represents an outcome of the
test, and tree leaves represent classes or class distributions. Decision trees can easily be
converted to classification rules.
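
The conversion from a tree to rules can be sketched as follows: each path from the root to a leaf becomes one rule whose conditions are the tests along the path. The tree, its attributes, and the class labels here are hypothetical, and the tuple/dict representation is just one convenient encoding.

```python
# Hypothetical tree: an internal node is (attribute, {outcome: subtree});
# a leaf is a class label stored as a string.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})

def classify(node, record):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[record[attribute]]
    return node

def to_rules(node, conditions=()):
    """Turn every root-to-leaf path into a (conditions, class_label) rule."""
    if isinstance(node, str):
        return [(conditions, node)]
    attribute, branches = node
    rules = []
    for outcome, subtree in branches.items():
        rules.extend(to_rules(subtree, conditions + ((attribute, outcome),)))
    return rules

rules = to_rules(tree)
for conditions, label in rules:
    tests = " AND ".join(f"{a} = {v}" for a, v in conditions)
    print(f"IF {tests} THEN class = {label}")
```

The tree has five leaves, so the conversion yields five rules, one per leaf.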

A neural network, when used for classification, is typically
a collection of neuron-like processing units with weighted connections between the
units. There are many other methods for constructing classification models, such as naïve
Bayesian classification, support vector machines, and k-nearest neighbor classification.

Regression analysis is a statistical methodology that is
most often used for numeric prediction, although other methods exist as well. Prediction
also encompasses the identification of distribution trends based on the available data.
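
A minimal sketch of numeric prediction with regression, assuming a simple straight-line model fitted by ordinary least squares. The yearly sales figures are invented and exactly linear so the result is easy to check by hand; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: sales observed in years 1 through 5.
years = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(years, sales)
forecast = slope * 6 + intercept  # numeric prediction for year 6
print(slope, intercept, forecast)
```

Unlike classification, the output here is a number rather than a class label.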

4. Cluster Analysis
“What is cluster analysis?” Unlike classification and prediction, which analyze class-labeled
data objects, clustering analyzes data objects without consulting a known class label.
Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.
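
As an illustration of grouping objects without class labels, here is a minimal one-dimensional k-means sketch. k-means is only one of many clustering methods and is used here as an assumed example; the values, the starting centers, and the fixed iteration count are made up.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and recomputing centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assignment step: index of the closest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled measurements forming two obvious groups.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

No class labels are supplied at any point; the two groups emerge only from the distances between the values.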

5. Outlier Analysis

A database may contain data objects that do not comply with the general behavior or
model of the data. These data objects are outliers.
Outlier analysis may uncover fraudulent usage of credit cards by detecting
purchases of extremely large amounts for a given account number in comparison to
regular charges incurred by the same account. Outlier values may also be detected with
respect to the location and type of purchase, or the purchase frequency.
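
The credit card example can be sketched as follows. This is one possible approach, not the only one: it assumes a robust rule that flags a charge when its distance from the account's median exceeds a chosen multiple of the median absolute deviation (MAD). The charges, the threshold of 5, and the function name `flag_outliers` are all hypothetical.

```python
import statistics

def flag_outliers(charges, threshold=5.0):
    """Return charges whose distance from the median exceeds threshold * MAD."""
    median = statistics.median(charges)
    # Median absolute deviation: a spread measure that one huge charge cannot inflate.
    mad = statistics.median([abs(c - median) for c in charges])
    if mad == 0:
        return []  # all charges (nearly) identical; nothing to compare against
    return [c for c in charges if abs(c - median) / mad > threshold]

# Hypothetical charges for one account, including one extremely large purchase.
charges = [25.0, 40.0, 31.5, 28.0, 35.0, 2400.0, 30.0]
suspicious = flag_outliers(charges)
print(suspicious)
```

The median and MAD are used instead of the mean and standard deviation because a single very large charge would distort the latter two and could mask itself.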

6. Evolution Analysis

Data evolution analysis describes and models regularities or trends for objects whose
behavior changes over time. Although this may include characterization, discrimination,
association and correlation analysis, classification, prediction, or clustering of time-related
data, distinct features of such an analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data analysis.
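
One very simple form of time-series analysis is smoothing with a moving average to expose the underlying trend. The sketch below assumes a trailing window of three observations and an invented series; it is meant only to show the idea, not a full evolution analysis.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive observations."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical observations recorded at equal time intervals.
series = [10, 12, 11, 13, 15, 14, 16]
trend = moving_average(series, window=3)
print(trend)
```

The raw series moves up and down from one step to the next, while the smoothed values rise steadily, which makes the upward trend easier to see.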

Statistical Aspects of Data Mining 1

February 1st, 2010 by admin