- You start with a dataset to analyse.
  - Purchase / Social / Medical / Travel
- Many variables are typically collected.
  - Categorical / Continuous / Geo
- The majority of them may be irrelevant and only add noise.
- Data Mining is Statistics at Scale and Speed.
- Applications in Intelligence / Genetics / Natural Sciences / Business.
- Data Mining has its origins in Categorical data, whereas Statistics deals with Continuous data.
- A large model can overfit the training dataset, which may lead to higher prediction error in new situations.
- Consider whether each predictor variable will be available, and whether its relationship with the response will still hold, in future data.
- Cluster analysis is an example of **Unsupervised learning**
  - Dimension Reduction
  - Association Rules
- Classification is an example of **Supervised learning**
  - Regression, Regression Trees, Nearest Neighbour - Continuous response.
  - Logistic Regression, Classification Trees, Nearest Neighbour, Discriminant analysis and Naive Bayes methods are well suited for Categorical response.
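
As a rough illustration of the overfitting point above, the sketch below uses a k-nearest-neighbour regressor written from scratch. The data (`y = 2x` plus noise), the helper names `make_data`, `knn_predict` and `mse`, and the choices `k = 1` and `k = 10` are all assumptions made for the example. With `k = 1` the model memorises the training set, so its training error is zero while its error on new data is not.

```python
import random

def make_data(rng, n):
    """Hypothetical dataset: y = 2x plus Gaussian noise (an assumption)."""
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, 2 * x + rng.gauss(0, 0.5)))
    return data

def knn_predict(train, x, k):
    """Average the responses of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared prediction error of the k-NN model on `data`."""
    errors = [(y - knn_predict(train, x, k)) ** 2 for x, y in data]
    return sum(errors) / len(errors)

rng = random.Random(0)
train = make_data(rng, 40)    # data the model is fitted on
test = make_data(rng, 200)    # new situations the model has not seen

for k in (1, 10):
    print(f"k={k:2d}  training MSE={mse(train, train, k):.3f}  "
          f"test MSE={mse(train, test, k):.3f}")
```

On data like this, the smoother `k = 10` model typically shows the higher training error but the lower test error, which is the trade-off the overfitting bullet describes.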

**Data Mining** should be viewed as a process:
- Data Storage & PreProcessing
- Identify variables for investigation
- Screen the outliers and missing values from data
- Data need to be partitioned into *training*, *test* and *evaluation* sets.
- Use *Sampling* for Large datasets.
- Visualize your data - Line, Bar, Scatter, Box, Histogram, Map, Geo
- Summary of data - Mean, Median, Mode, Standard Deviation, Correlation, Principal Components
- Apply appropriate model - Linear, Logistic, Trees, K-means ...
- Verify findings against the *evaluation* set.
- Get the insights, Apply the findings! Plan - do - check - act !!
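
The steps above can be strung together in a small sketch. It uses only the Python standard library; the synthetic dataset (`y = 3 + 2x` plus noise, with some missing responses), the 60/20/20 split and the helper names `partition`, `fit_line` and `line_mse` are assumptions for illustration, not part of the original notes. Here the *test* set is used to check the model while working and the *evaluation* set is held back for the final verification, which is one common reading of a three-way split.

```python
import random
import statistics

def partition(rows, rng, fractions=(0.6, 0.2)):
    """Shuffle rows and split them into training, test and evaluation sets.

    `fractions` gives the training and test shares; the rest is evaluation.
    """
    rows = list(rows)
    rng.shuffle(rows)
    n_train = round(fractions[0] * len(rows))
    n_test = round(fractions[1] * len(rows))
    return rows[:n_train], rows[n_train:n_train + n_test], rows[n_train + n_test:]

def fit_line(rows):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

def line_mse(model, rows):
    """Mean squared error of a fitted line on the given rows."""
    intercept, slope = model
    return statistics.mean([(y - (intercept + slope * x)) ** 2 for x, y in rows])

rng = random.Random(42)

# Hypothetical raw data: y = 3 + 2x + noise, every 50th response missing.
raw = []
for i in range(1000):
    x = rng.uniform(0, 10)
    y = None if i % 50 == 0 else 3 + 2 * x + rng.gauss(0, 1)
    raw.append((x, y))

clean = [(x, y) for x, y in raw if y is not None]   # screen missing values
sample = rng.sample(clean, 300)                      # sampling for a large dataset
train, test, evaluation = partition(sample, rng)     # training / test / evaluation

# Summary of the training responses.
train_y = [y for _, y in train]
print("mean:", round(statistics.mean(train_y), 2),
      "median:", round(statistics.median(train_y), 2),
      "stdev:", round(statistics.stdev(train_y), 2))

model = fit_line(train)                              # apply a (linear) model
intercept, slope = model
print("fitted line: y =", round(intercept, 2), "+", round(slope, 2), "* x")
print("test MSE:", round(line_mse(model, test), 3))
print("evaluation MSE:", round(line_mse(model, evaluation), 3))  # final check
```

Visualisation and outlier screening are left out to keep the sketch short; in practice they would sit between the cleaning and modelling steps.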

https://www.linkedin.com/in/alokawi (*Data Engineer, Analytics Engineer, Data Science*)