Sunday, October 21, 2018

[Repost] Top 5 Mistakes to Avoid When Writing Apache Spark Applications


Spark Summit East 2016 presentation by Mark Grover and Ted Malaska

Saturday, January 2, 2016

Algorithms: Types, Classification and Definition


  1. Simple recursive algorithms
  2. Backtracking algorithms
  3. Divide and conquer algorithms
  4. Dynamic programming algorithms
  5. Greedy algorithms
  6. Branch and bound algorithms
  7. Brute force algorithms
  8. Randomized algorithms
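One of these types can be illustrated concretely. Below is a minimal sketch of a divide and conquer algorithm (type 3 above): merge sort splits the list in half, sorts each half recursively, and merges the two sorted halves. The function name and sample data are just for illustration.

```python
def merge_sort(items):
    """Sort a list using divide and conquer."""
    if len(items) <= 1:              # base case: already sorted
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])   # divide: sort each half recursively
    right = merge_sort(items[mid:])
    merged = []                      # conquer: merge the two sorted halves
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])          # append whatever remains
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```

The same divide-and-merge pattern is what gives merge sort its O(N log N) running time: log N levels of splitting, each doing O(N) work in the merges.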

Complexity of Algorithms:
  1. Constant     - O(1)
  2. Logarithmic  - O(log N)
  3. Linear       - O(N)
  4. Quadratic    - O(N^2)
  5. Cubic        - O(N^3)
  6. ...
  7. Exponential  - O(2^N), O(N!), or worse. (Note that O(N^K) for a fixed K is polynomial, not exponential.)
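The difference between these classes shows up even in simple search. As a sketch (the function names and sample data are my own), linear search is O(N) because it may scan every element, while binary search is O(log N) because it halves the candidate range at every step, at the cost of requiring sorted input:

```python
def linear_search(items, target):
    """O(N): scan every element until the target is found."""
    for i, value in enumerate(items):
        if value == target:
            return i
    return -1

def binary_search(items, target):
    """O(log N): halve the search range each step; items must be sorted."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

data = list(range(0, 100, 2))   # sorted list: 0, 2, ..., 98
print(linear_search(data, 40))  # 20
print(binary_search(data, 40))  # 20
```

For 50 elements the difference is negligible, but for a billion sorted elements binary search needs about 30 comparisons where linear search may need a billion.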

Tuesday, December 29, 2015

Road to Machine Learning

  1. You start with a dataset to analyse - Purchase / Social / Medical / Travel.
  2. Many variables are typically collected - Categorical / Continuous / Geo.
  3. The majority of them can be irrelevant and cause noise.
  4. Data Mining is Statistics at Scale and Speed.
  5. Applications in Intelligence / Genetics / Natural Sciences / Business.
  6. Data Mining has its origins in Categorical data, whereas Statistics deals with Continuous data.
  7. A large model overfits the training dataset and may lead to higher prediction error in new situations.
  8. Consider whether the predictor variables will be available and the relationship will hold in future data.
  9. Cluster analysis is an example of Unsupervised learning.
  10. Dimension Reduction
  11. Association Rules
  12. Classification is an example of Supervised learning.
  13. Regression, Regression Trees, Nearest Neighbour - suited for a Continuous response.
  14. Logistic Regression, Classification Trees, Nearest Neighbour, Discriminant Analysis and Naive Bayes methods are well suited for a Categorical response.
  15. Data Mining should be viewed as a process:
    1. Data Storage & Preprocessing
    2. Identify variables for investigation
    3. Screen the data for outliers and missing values
    4. Partition the data into training, test and evaluation sets
    5. Use sampling for large datasets
    6. Visualize your data - Line, Bar, Scatter, Box, Histogram, Map, Geo
    7. Summarize the data - Mean, Median, Mode, Standard Deviation, Correlation, Principal Components
    8. Apply an appropriate model - Linear, Logistic, Trees, K-means ...
    9. Verify findings against the evaluation set
    10. Get the insights, apply the findings! Plan - Do - Check - Act!!
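The partition-fit-verify steps of this process can be sketched end to end in pure Python, using the nearest-neighbour classifier mentioned in point 14. The two-cluster toy dataset below is synthetic and only for illustration; a real workflow would start from one of the datasets in point 1.

```python
import math
import random

random.seed(42)
# Synthetic toy data: class 0 clustered around (0, 0), class 1 around (5, 5)
data = [((random.gauss(0, 1), random.gauss(0, 1)), 0) for _ in range(50)] + \
       [((random.gauss(5, 1), random.gauss(5, 1)), 1) for _ in range(50)]

# Step 4: partition into a training set and a held-out evaluation set
random.shuffle(data)
train, held_out = data[:70], data[70:]

def predict(point, train):
    """1-NN: classify by the label of the single nearest training point."""
    nearest = min(train, key=lambda t: math.dist(point, t[0]))
    return nearest[1]

# Step 9: verify findings against the evaluation set
correct = sum(predict(p, train) == label for p, label in held_out)
print("held-out accuracy:", correct / len(held_out))
```

Because the two clusters are well separated, the held-out accuracy here is close to 1.0; on noisier real data this verification step is exactly where overfitting (point 7) would show up as a gap between training and evaluation accuracy.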
More in the next blog! Thank you!!
https://www.linkedin.com/in/alokawi ( Data Engineer, Analytics Engineer, Data Science )