SharePoint Orange: 2015

Tuesday, December 29, 2015

Road to Machine Learning

You start with a dataset to analyse. - Purchase / Social / Medical / Travel
Many variables are typically collected. - Categorical / Continuous / Geo
The majority of them can be irrelevant and only add noise.
Data Mining is Statistics at Scale and Speed.
Applications in Intelligence / Genetics / Natural Sciences / Business.
Data Mining has its origins in categorical data, whereas Statistics has traditionally dealt with continuous data.
A large model can overfit the training dataset, which may lead to higher prediction error in new situations.
Consider whether each predictor variable will be available in future data, and whether its relationship with the response will still hold.
Cluster analysis is an example of unsupervised learning.
Dimension Reduction
Association Rules
Classification is an example of supervised learning.
Regression, Regression Trees, Nearest Neighbour - suited to a continuous response.
Logistic Regression, Classification Trees, Nearest Neighbour, Discriminant Analysis and Naive Bayes methods are well suited to a categorical response.
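
Nearest Neighbour appears in both lists above, so it makes a handy illustration of the two response types. The snippet below is only a minimal sketch in plain Python: the function names and the toy data are made up for this example and do not come from any particular library.

```python
from collections import Counter
from math import dist


def nearest_neighbours(train, query, k):
    """Return the responses of the k training rows closest to `query`."""
    ranked = sorted(train, key=lambda row: dist(row[0], query))
    return [response for _, response in ranked[:k]]


def knn_regress(train, query, k=3):
    """Continuous response: average the neighbours' values."""
    values = nearest_neighbours(train, query, k)
    return sum(values) / len(values)


def knn_classify(train, query, k=3):
    """Categorical response: take a majority vote among the neighbours."""
    labels = nearest_neighbours(train, query, k)
    return Counter(labels).most_common(1)[0][0]


# Hypothetical toy data: (features, response) pairs.
prices = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
segments = [((1.0, 1.0), "low"), ((1.5, 1.2), "low"),
            ((8.0, 9.0), "high"), ((9.0, 8.5), "high"), ((8.5, 9.5), "high")]

print(knn_regress(prices, (2.0,), k=3))         # mean of the three nearest prices
print(knn_classify(segments, (8.2, 9.1), k=3))  # majority label near the query
```

The same neighbour search serves both cases; only the last step differs (an average for a continuous response, a vote for a categorical one).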
Data Mining should be viewed as a process:

Data Storage & PreProcessing
Identify variables for investigation
Screen the data for outliers and missing values
The data need to be partitioned into training, test and evaluation sets.
Use sampling for large datasets.
Visualize your data - Line, Bar, Scatter, Box, Histogram, Map, Geo
Summary of data - Mean, Median, Mode, Standard Deviation, Correlation, Principal Components
Apply appropriate model - Linear, Logistic, Trees, K-means ...
Verify the findings against the evaluation set.
Get the insights, Apply the findings! Plan - do - check - act !!
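
The steps above can be sketched end to end in a few lines of plain Python. Treat this as a rough illustration under stated assumptions, not a recipe: the data are simulated, the 60/20/20 split is an arbitrary choice, and the helper names (screen, partition, fit_line, mean_squared_error) are invented for this example.

```python
import random
from statistics import mean


def screen(rows):
    """Drop rows with a missing value (None) in either column."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def partition(rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle, then split into training, test and evaluation sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * fractions[0])
    n_test = round(len(rows) * fractions[1])
    return rows[:n_train], rows[n_train:n_train + n_test], rows[n_train + n_test:]


def fit_line(rows):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    sxx = sum((x - x_bar) ** 2 for x, _ in rows)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def mean_squared_error(model, rows):
    """Average squared gap between observed y and the fitted line."""
    a, b = model
    return mean((y - (a + b * x)) ** 2 for x, y in rows)


# Simulated data (an assumption): y is roughly 2x + 1 plus noise,
# followed by two incomplete rows that screening should remove.
rng = random.Random(42)
raw = [(float(x), 2.0 * x + 1.0 + rng.gauss(0, 1)) for x in range(50)]
raw += [(None, 3.0), (7.0, None)]

clean = screen(raw)
train, test, evaluation = partition(clean)
model = fit_line(train)
print("test MSE:", mean_squared_error(model, test))
print("evaluation MSE:", mean_squared_error(model, evaluation))
```

The model only ever sees the training rows; the error on the evaluation rows is the honest check of how it behaves on data it has not seen.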

More on the next blog! Thank you!!
https://www.linkedin.com/in/alokawi ( Data Engineer, Analytics Engineer, Data Science )

Friday, October 16, 2015

Correlation, Regression and Causation ?

What is correlation in statistics?

Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people. The relationship isn't perfect.
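
A common way to put a number on "how strongly" is the Pearson correlation coefficient, which runs from -1 to +1. Here is a small sketch of it in plain Python. The heights and weights are invented for illustration, and pearson_r is just a name chosen for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical heights (cm) and weights (kg), made up for this example.
heights = [150, 160, 165, 170, 180, 185]
weights = [52, 58, 66, 64, 80, 79]

r = pearson_r(heights, weights)
print(f"r = {r:.2f}")  # positive but below 1: related, not perfectly
```

A value near +1 or -1 means the points lie close to a straight line; a value near 0 means there is little linear relationship.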

What is regression in statistics?

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression.
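
For the simple case, the model and its usual least-squares estimates can be written out as below. The notation is an assumption made for this note: n observations (x_i, y_i), intercept \beta_0, slope \beta_1, error term \varepsilon_i, and bars for sample means.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

The slope estimate says how much y changes, on average, for a one-unit change in x; the intercept makes the fitted line pass through the point of means.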

What is causation in statistics?

When an article says that causation was found, this means that the researchers found that changes in one variable they measured directly caused changes in the other.

Monday, January 12, 2015

What Does a Data Scientist Do? - http://datasciencelondon.org/

Big Data [sorry] & Data Science: What Does a Data Scientist Do? from Data Science London
