SharePoint Orange
Sunday, October 21, 2018
[Repost] Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Spark Summit East 2016 presentation by Mark Grover and Ted Malaska
Saturday, January 2, 2016
Algorithm: Types, Classification and Definition
- Simple recursive algorithms
- Backtracking algorithms
- Divide and conquer algorithms
- Dynamic programming algorithms (contrasted with simple recursion in the sketch after this list)
- Greedy algorithms
- Branch and bound algorithms
- Brute force algorithms
- Randomized algorithms
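To make two of these categories concrete, here is a minimal Python sketch of my own (the Fibonacci example is an illustration, not something from the original post): a simple recursive algorithm recomputes the same subproblems repeatedly, while the dynamic-programming version caches them.

# Simple recursive algorithm: overlapping subproblems are recomputed,
# so the running time grows exponentially with n.
def fib_recursive(n):
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)

# Dynamic programming: each subproblem is solved once and cached,
# bringing the running time down to linear in n.
def fib_dynamic(n, cache=None):
    if cache is None:
        cache = {0: 0, 1: 1}
    if n not in cache:
        cache[n] = fib_dynamic(n - 1, cache) + fib_dynamic(n - 2, cache)
    return cache[n]

if __name__ == "__main__":
    print(fib_recursive(10), fib_dynamic(10))  # both print 55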
Complexity of Algorithms (illustrated by the sketch after this list):
- Constant - O(1)
- Logarithmic - O(log(N))
- Linear - O(N)
- Quadratic - O(N*N)
- Cubic - O(N*N*N)
- ...
- Exponential - O(2^N), O(K^N) for constant K > 1, O(N!), and many others (note that O(N^K) with a fixed K is polynomial, not exponential).
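As a rough illustration, and purely as a sketch of my own (none of these functions come from the original post), here are small Python routines whose running times fall into several of the classes above:

# O(1) constant: indexing a list does not depend on its length.
def first_element(items):
    return items[0]

# O(log N) logarithmic: binary search halves the search range each step.
def binary_search(sorted_items, target):
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

# O(N) linear: a single pass over the input.
def total(items):
    s = 0
    for x in items:
        s += x
    return s

# O(N*N) quadratic: every ordered pair of elements is examined.
def count_inversions(items):
    count = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] > items[j]:
                count += 1
    return count

# O(2^N) exponential: every subset of the input is enumerated.
def all_subsets(items):
    subsets = [[]]
    for x in items:
        subsets += [s + [x] for s in subsets]
    return subsets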
Tuesday, December 29, 2015
Road to Machine Learning
- You start with a dataset to analyse - Purchase / Social / Medical / Travel.
- Many variables are typically collected - Categorical / Continuous / Geo.
- A majority of them can be irrelevant and merely add noise.
- Data Mining is Statistics at Scale and Speed.
- Applications in Intelligence / Genetics / Natural Sciences / Business.
- Data Mining has its origins in Categorical data, whereas Statistics traditionally deals with Continuous data.
- A large model overfits the training dataset and may lead to higher prediction error in new situations.
- Consider whether the predictor variables will still be available, and whether the relationship will still hold, in future data.
- Cluster analysis is an example of Unsupervised learning (see the sketch after this list)
- Dimension Reduction
- Association Rules
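A minimal unsupervised-learning sketch, assuming scikit-learn and synthetic data (the post itself names no library or dataset), covering both cluster analysis and dimension reduction:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic data: 300 four-dimensional points drawn around three hypothetical centres.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=centre, scale=0.5, size=(100, 4))
    for centre in ([0, 0, 0, 0], [5, 5, 0, 0], [0, 5, 5, 5])
])

# Cluster analysis: group the points without any labels.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimension reduction: project the points onto 2 principal components.
X_2d = PCA(n_components=2).fit_transform(X)

print(np.bincount(labels))   # roughly 100 points per cluster
print(X_2d.shape)            # (300, 2)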
- Classification is an example of Supervised learning (see the sketch after this list)
- Regression, Regression Trees and Nearest Neighbour methods are well suited for a Continuous response.
- Logistic Regression, Classification Trees, Nearest Neighbour, Discriminant Analysis and Naive Bayes methods are well suited for a Categorical response.
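A minimal supervised-learning sketch under the same assumptions (scikit-learn, synthetic data), fitting a logistic regression for a categorical response and scoring it on held-out records:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out part of the data so the model is judged on records it has not seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))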
- Data Mining should be viewed as a process (an end-to-end sketch follows this list):
- Data Storage & PreProcessing
- Identify variables for investigation
- Screen the data for outliers and missing values
- Partition the data into training, test and evaluation sets.
- Use Sampling for Large datasets.
- Visualize your data - Line, Bar, Scatter, Box, Histogram, Map, Geo
- Summary of data - Mean, Median, Mode, Standard Deviation, Correlation, Principal Components
- Apply appropriate model - Linear, Logistic, Trees, K-means ...
- Verify findings against the evaluation set.
- Get the insights, apply the findings! Plan - Do - Check - Act!
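Tying the process steps together, here is a compact end-to-end sketch of my own; the column names, the synthetic purchase data, and the use of pandas and scikit-learn are all assumptions, since the post describes the process only in general terms:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Obtain data (a synthetic stand-in for a real purchase dataset).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=1000),
    "visits": rng.poisson(3, size=1000),
    "spend": rng.gamma(2.0, 50.0, size=1000),
})
df["purchased"] = (df["spend"] + 5 * df["visits"] + rng.normal(0, 40, 1000) > 150).astype(int)

# 2. Summarise and screen: means, spread, correlations, obvious outliers.
print(df.describe())
print(df.corr())

# 3. Partition into training and held-out evaluation sets.
features = ["age", "visits", "spend"]
X_train, X_eval, y_train, y_eval = train_test_split(
    df[features], df["purchased"], test_size=0.3, random_state=1)

# 4. Apply an appropriate model (logistic regression for a yes/no response).
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 5. Verify the finding against the evaluation set before acting on it.
print("evaluation accuracy:", model.score(X_eval, y_eval))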
https://www.linkedin.com/in/alokawi ( Data Engineer, Analytics Engineer, Data Science )