There is a wide variety of data mining functionalities, or tasks. Some of them are highlighted below:
1. Concept/Class Description: Characterization and Discrimination
Data characterization is a summarization of the general characteristics or features of
a target class of data. The data corresponding to the user-specified class are typically collected
by a database query.
Data discrimination is a comparison of the general features of target class data objects
with the general features of objects from one or a set of contrasting classes. The target
and contrasting classes can be specified by the user, and the corresponding data objects
retrieved through database queries. For example, the user may want to compare the general
features of software products whose sales increased by 10% in the last year with those
whose sales decreased by at least 30% during the same period. The methods used for data
discrimination are similar to those used for data characterization.
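The two ideas above can be sketched in a few lines of Python. This is only an illustration: the product records, the field layout, the `summarize` helper, and the choice of mean price as the summary feature are all hypothetical, and the list filters stand in for the database queries that would collect each class.

```python
from statistics import mean

# Hypothetical product records: (name, sales change over the last year, price).
products = [
    ("editor", 0.12, 40.0),
    ("compiler", 0.10, 60.0),
    ("game", -0.35, 20.0),
    ("toolbar", -0.40, 10.0),
    ("suite", 0.02, 90.0),
]


def summarize(rows):
    """Characterization: summarize the general features of one class."""
    return {"count": len(rows), "mean_price": mean(r[2] for r in rows)}


# Stand-ins for the database queries that retrieve each class.
# Assumption: "increased by 10%" is read here as "at least 10%".
target = [p for p in products if p[1] >= 0.10]
contrast = [p for p in products if p[1] <= -0.30]

target_summary = summarize(target)
contrast_summary = summarize(contrast)

# Discrimination: compare the same summary feature across the two classes.
price_gap = target_summary["mean_price"] - contrast_summary["mean_price"]
print(target_summary, contrast_summary, price_gap)
```

The same `summarize` function serves both tasks, which mirrors the point that discrimination reuses the methods of characterization and then compares the results.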
2. Mining Frequent Patterns, Associations, and Correlations
Frequent patterns, as the name suggests, are patterns that occur frequently in data. There
are many kinds of frequent patterns, including itemsets, subsequences, and substructures.
A frequent itemset typically refers to a set of items that frequently appear together
in a transactional data set, such as milk and bread. Mining frequent itemsets leads to
association rules, which state that transactions containing some items tend to contain
others as well. Typically, association rules are discarded as uninteresting if they do not
satisfy both a minimum support threshold and a minimum confidence threshold.
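As a minimal sketch of the two thresholds, take support to be the fraction of transactions that contain every item in the rule, and confidence to be the fraction of transactions containing the left-hand side that also contain the right-hand side. The transactions, function names, and threshold values below are made up for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of antecedent transactions that also contain the consequent."""
    base = count(antecedent, transactions)
    if base == 0:
        return 0.0
    return count(set(antecedent) | set(consequent), transactions) / base


def is_interesting(antecedent, consequent, transactions, min_support, min_confidence):
    """Keep a rule only if it meets both thresholds."""
    rule_items = set(antecedent) | set(consequent)
    return (
        support(rule_items, transactions) >= min_support
        and confidence(antecedent, consequent, transactions) >= min_confidence
    )


# Rule "milk => bread": support 3/5, confidence 3/4.
print(support({"milk", "bread"}, transactions))
print(confidence({"milk"}, {"bread"}, transactions))
print(is_interesting({"milk"}, {"bread"}, transactions, 0.5, 0.7))
```

Raising the hypothetical minimum confidence from 0.7 to 0.8 would discard the same rule, since its confidence is 0.75.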
3. Classification and Prediction
Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts, for the purpose of being able to use the model to predict
the class of objects whose class label is unknown. The derived model is based on the analysis
of a set of training data (i.e., data objects whose class label is known).
A decision tree is a flow-chart-like tree structure, where
each node denotes a test on an attribute value, each branch represents an outcome of the
test, and tree leaves represent classes or class distributions. Decision trees can easily be
converted to classification rules.
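The conversion to rules can be illustrated with a small hand-built tree. The attributes, values, and class labels below are hypothetical, and the nested-dictionary layout is just one convenient representation: each internal node tests one attribute, each key under "branches" is an outcome of the test, and each string leaf is a class.

```python
# Hypothetical decision tree for whether a customer buys a product.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"yes": "buys", "no": "does_not_buy"},
        },
        "middle_aged": "buys",
        "senior": "does_not_buy",
    },
}


def classify(node, record):
    """Follow the branch matching each test until a leaf (class label) is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node


def to_rules(node, conditions=()):
    """Turn every root-to-leaf path into one (conditions, class) rule."""
    if not isinstance(node, dict):
        return [(list(conditions), node)]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(to_rules(child, conditions + ((node["attribute"], value),)))
    return rules


rules = to_rules(tree)
for conditions, label in rules:
    tests = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    print(f"IF {tests} THEN class = {label}")
```

Each leaf yields exactly one rule, so the rule set has as many rules as the tree has leaves.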
A neural network, when used for classification, is typically
a collection of neuron-like processing units with weighted connections between the
units. There are many other methods for constructing classification models, such as naïve
Bayesian classification, support vector machines, and k-nearest neighbor classification.
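Of the methods listed, k-nearest neighbor classification is the simplest to sketch: an unlabeled object takes the majority class among the k training objects closest to it. The training points, labels, and the use of Euclidean distance below are assumptions made for illustration.

```python
from collections import Counter
from math import dist

# Hypothetical training data: (feature vector, known class label).
training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"),
    ((7.0, 5.5), "B"),
    ((6.5, 7.0), "B"),
]


def knn_classify(point, training, k=3):
    """Predict the majority label among the k training objects nearest to point."""
    nearest = sorted(training, key=lambda item: dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify((1.2, 1.5), training))
print(knn_classify((6.2, 6.1), training))
```

No explicit model is built here; the training data itself plays the role of the model, which is one way this method differs from decision trees and neural networks.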
Whereas classification predicts categorical labels, prediction models continuous-valued
functions. Regression analysis is a statistical methodology that is
most often used for numeric prediction, although other methods exist as well. Prediction
also encompasses the identification of distribution trends based on the available data.
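A minimal sketch of numeric prediction by regression, assuming a single input variable and a straight-line model fitted by ordinary least squares. The data points and the `fit_line` helper are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that happen to lie on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)

# Numeric prediction for an unseen input value.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```

Unlike a classifier, the fitted model returns a number rather than a class label.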
4. Cluster Analysis
“What is cluster analysis?” Unlike classification and prediction, which analyze class-labeled
data objects, clustering analyzes data objects without consulting a known class label.
Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.
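The text does not name a clustering algorithm. As one common choice, the sketch below uses a simplified one-dimensional k-means: objects are grouped around the nearest center, and each center is then moved to the mean of its group. The values, the starting centers, and the fixed iteration count are assumptions for illustration.

```python
def k_means_1d(values, initial_centers, iterations=10):
    """Simplified 1-D k-means: assign to nearest center, then recompute centers."""
    centers = list(initial_centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical unlabeled observations.
values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = k_means_1d(values, initial_centers=[1.0, 11.0])
print(centers, clusters)
```

No class labels appear anywhere in the input; the groups emerge only from how close the values are to one another.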
5. Outlier Analysis
A database may contain data objects that do not comply with the general behavior or
model of the data. These data objects are outliers.
Outlier analysis may uncover fraudulent usage of credit cards by detecting
purchases of extremely large amounts for a given account number in comparison to
regular charges incurred by the same account. Outlier values may also be detected with
respect to the location and type of purchase, or the purchase frequency.
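The credit card example can be sketched by comparing each charge with the regular charges on the same account. The rule used here, distance from the account's median measured in units of the median absolute deviation, is only one possible test and is not prescribed by the text. The account identifiers, amounts, and threshold are hypothetical.

```python
from statistics import median


def flag_outliers(amounts, threshold=5.0):
    """Flag amounts far from the account's median, in units of the MAD."""
    center = median(amounts)
    mad = median([abs(a - center) for a in amounts])
    if mad == 0:
        # All typical charges are identical; this simple test cannot scale deviations.
        return []
    return [a for a in amounts if abs(a - center) / mad > threshold]


# Hypothetical charges per account number.
charges = {
    "acct-1": [20.0, 25.0, 22.0, 30.0, 18.0, 24.0, 950.0],
    "acct-2": [900.0, 950.0, 1000.0, 980.0, 940.0],
}

flagged = {account: flag_outliers(amounts) for account, amounts in charges.items()}
print(flagged)
```

The same charge of 950 is flagged on the first account but not on the second, because the comparison is made against each account's own regular charges.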
6. Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose
behavior changes over time. Although this may include characterization, discrimination,
association and correlation analysis, classification, prediction, or clustering of time-related
data, distinct features of such an analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data analysis.
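As a small illustration of time-series trend analysis, the sketch below smooths a series with a moving average and checks whether the smoothed values keep rising. The monthly figures, the window size, and the helper name are hypothetical, and a moving average is only one of many ways to expose a trend.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive observations."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical monthly measurements for one object.
monthly = [10, 12, 11, 13, 15, 14, 16, 18]

smoothed = moving_average(monthly, 3)

# A simple trend check: does every smoothed value exceed the previous one?
is_rising = all(later > earlier for earlier, later in zip(smoothed, smoothed[1:]))
print(smoothed, is_rising)
```

The raw series dips twice, but the smoothed series rises steadily, which is the kind of regularity evolution analysis aims to describe.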