- You start with a dataset to analyse, e.g. Purchase / Social / Medical / Travel data.
- Many variables are typically collected: Categorical / Continuous / Geo.
- Many of them may be irrelevant and only add noise.
- Data Mining is Statistics at Scale and Speed.
- Applications in Intelligence / Genetics / Natural Sciences / Business.
- Data Mining has its origins in Categorical data, whereas Statistics has traditionally dealt with Continuous data.
- A large model can overfit the training dataset, which may lead to higher prediction error on new data.
- Consider whether each predictor variable will be available, and whether its relationship with the response will still hold, in future data.
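The overfitting point above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the synthetic data (`y = 2x + noise`), the helper names, and the choice of a 1-nearest-neighbour memoriser as a stand-in for the "large" model. It compares the memoriser with a simple least-squares line on training data and on unseen data.

```python
import random

def fit_line(xs, ys):
    """Simple model: ordinary least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_memoriser(xs, ys):
    """'Large' model stand-in: returns the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared prediction error."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(n, rng):
    """Hypothetical data: a linear trend plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)

line = fit_line(train_x, train_y)
memo = fit_memoriser(train_x, train_y)

for name, model in [("line", line), ("memoriser", memo)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 3),
          "test MSE:", round(mse(model, test_x, test_y), 3))
```

The memoriser reproduces the training set perfectly, noise included, so its training error is zero while its error on new data is typically worse than that of the simpler line.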
- Cluster analysis is an example of Unsupervised learning.
- Dimension Reduction
- Association Rules
- Classification is an example of Supervised learning.
- Regression, Regression Trees and Nearest Neighbour methods are suited to a Continuous response.
- Logistic Regression, Classification Trees, Nearest Neighbour, Discriminant Analysis and Naive Bayes methods are well suited to a Categorical response.
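A minimal sketch of the supervised / unsupervised distinction, using only the Python standard library. The toy points, the labels, the function names and the deterministic centroid initialisation are all assumptions for illustration: nearest-neighbour classification uses the known labels, while k-means clustering sees only the points.

```python
import math
from collections import Counter

# Hypothetical toy data: two well-separated groups in 2D.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
labels = ["low", "low", "low", "high", "high", "high"]

def knn_predict(query, points, labels, k=3):
    """Supervised: majority label among the k nearest labelled points."""
    nearest = sorted(zip(points, labels),
                     key=lambda pl: math.dist(query, pl[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def kmeans(points, k, iters=10):
    """Unsupervised: group points around k centroids, no labels used.

    Assumption: centroids start at evenly spaced input points so the
    result is deterministic; real implementations usually randomise.
    """
    step = (len(points) - 1) // (k - 1) if k > 1 else 1
    centroids = [points[i * step] for i in range(k)]
    assignment = []
    for _ in range(iters):
        # Assign every point to its closest centroid.
        assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return assignment

print(knn_predict((2, 1), points, labels))  # label learned from examples
print(kmeans(points, 2))                    # cluster index per point
```

The classifier can name its answer ("low" / "high") because it was given labels; the clustering only returns group indices, and interpreting those groups is left to the analyst.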
- Data Mining should be viewed as a process:
- Data Storage & PreProcessing
- Identify variables for investigation
- Screen the data for outliers and missing values.
- Data needs to be partitioned into training, test and evaluation sets.
- Use Sampling for Large datasets.
- Visualize your data - Line, Bar, Scatter, Box, Histogram, Map, Geo
- Summary of data - Mean, Median, Mode, Standard Deviation, Correlation, Principal Components
- Apply appropriate model - Linear, Logistic, Trees, K-means ...
- Verify the findings against the evaluation set.
- Get the insights, Apply the findings! Plan - do - check - act !!
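The pre-processing steps of the process above (screening, partitioning, summarising) can be sketched with the Python standard library. The z-score cutoff, the 60/20/20 split, the function names and the sample values are assumptions chosen for illustration, not recommendations from these notes.

```python
import random
import statistics

def screen(values, z_cutoff=3.0):
    """Drop missing values (None), then drop points lying more than
    z_cutoff standard deviations from the mean (a simple outlier rule)."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sd = statistics.stdev(present)
    return [v for v in present if sd == 0 or abs(v - mean) <= z_cutoff * sd]

def partition(rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and split rows into training, test and evaluation sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * fractions[0])
    n_test = round(len(rows) * fractions[1])
    return (rows[:n_train],
            rows[n_train:n_train + n_test],
            rows[n_train + n_test:])

# Hypothetical raw measurements: two missing values and one extreme outlier.
raw = [float(v) for v in range(1, 51)] + [None, 10_000.0, None]

clean = screen(raw)
train, test, evaluation = partition(clean)

print("kept", len(clean), "of", len(raw), "values")
print("sizes:", len(train), len(test), len(evaluation))
print("training mean:", statistics.mean(train),
      "median:", statistics.median(train))
```

Models would be fitted on `train`, compared on `test`, and the final finding verified once against `evaluation`, which is kept aside until the end.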
https://www.linkedin.com/in/alokawi ( Data Engineer, Analytics Engineer, Data Science )