Become a Data Scientist
from Scratch

Get Ready to Take Action

This guide explains what you need to learn to become a data scientist and how you can acquire the relevant skills.

The training program takes about 2-4 months, depending on your background and motivation. The first part is theoretical; the second and final part is gaining practical experience on real projects.

Our main book is “Data Science from Scratch” by Joel Grus (2nd edition). We refer to this book in some sections as an introduction to the theoretical background.

The bracketed tags below give chapter numbers (c) and page numbers (p) from this book. In each part, we specify how to study and practice the material. Whenever the suggested material isn’t clear enough, open the book to that specific subject.

Other recommended books that we will use are:

Book 1: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron

Book 2: Pattern Recognition and Machine Learning by Christopher M. Bishop

When we refer to them, we first mention the book number (b1 for the first book, b2 for the second).

Python

Option 1 - learnpython

(English)

Covers: Variables and Types, Lists, Operators, Strings, Conditions, Loops, Functions, Classes, Dicts, Modules and Packages, Generators, Comprehensions, RegEx, Exceptions, Sets, Serialization, Closures, Decorators.

Depth level: 4/5

Option 2 - Kaggle

(English)

Covers: Variables and Types, Lists, Operators, Strings, Conditions, Loops, Functions, Dicts, Modules and Packages, Comprehensions.

You will also need to cover: Classes

Depth level: 2/5

Option 3 - PythonFreeCourse

(Hebrew)

Covers: Variables and Types, Lists, Operators, Strings, Conditions, Loops, Functions, Classes and Inheritance, Dicts, Modules and Packages, Generators, Comprehensions, Exceptions, Sets, Serialization.

Depth level: 5/5

[c.2] In the book you will find a crash course in Python that highlights the parts of the language most important to data scientists.

Depth level: 1/5
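
As a rough taste of a few of the topics listed above (comprehensions, dicts, generators, classes), here is a minimal sketch. The names `words`, `countdown`, and `Point` are made up for illustration and do not come from any of the courses or from the book.

```python
# List and dict comprehensions: build new collections from an existing one.
words = ["data", "science", "from", "scratch"]
lengths = [len(w) for w in words]
word_lengths = {w: len(w) for w in words}


def countdown(n):
    """Generator: lazily yields n, n-1, ..., 1."""
    while n > 0:
        yield n
        n -= 1


class Point:
    """A minimal class holding two coordinates and one method."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def distance_to_origin(self):
        return (self.x ** 2 + self.y ** 2) ** 0.5


print(lengths)
print(word_lengths["science"])
print(list(countdown(3)))
print(Point(3, 4).distance_to_origin())
```

If every line of this sketch reads naturally to you, you are probably ready to move on to the next section.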

Visualizing Data

A fundamental part of the data scientist’s skills is data visualization.

Before you can start visualizing data, you will need to get familiar with the relevant Python packages. Start with this short NumPy tutorial, then get familiar with pandas, and finally go through the data visualization part.

Linear Algebra

[c.4] Linear algebra is a subject in its own right. Vectors, matrices, and their manipulations are widely used by data scientists. The basics are covered in the Coursera ML course, which also covers many of the other subjects below. Make sure to practice in Python with the corresponding git repo, which translates all of the course exercises to Python.

If you want to dig deeper into it you can use these free books:

Linear Algebra by Jim Hefferon
Linear Algebra by David Cherney, Tom Denton, Rohit Thomas, and Andrew Waldron
And a more advanced intro: Linear Algebra Done Wrong by Sergei Treil
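
To make the vector and matrix manipulations mentioned above concrete, here is a minimal from-scratch sketch using plain Python lists. The function names (`add`, `dot`, `mat_vec`, `transpose`) are chosen for illustration only; in real projects you would normally use NumPy instead.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # a list of rows


def add(v: Vector, w: Vector) -> Vector:
    """Element-wise sum of two vectors of equal length."""
    assert len(v) == len(w), "vectors must have the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]


def dot(v: Vector, w: Vector) -> float:
    """Dot product: v_1 * w_1 + ... + v_n * w_n."""
    assert len(v) == len(w), "vectors must have the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))


def mat_vec(a: Matrix, v: Vector) -> Vector:
    """Matrix-vector product: one dot product per row of the matrix."""
    return [dot(row, v) for row in a]


def transpose(a: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(column) for column in zip(*a)]


print(add([1, 2], [3, 4]))
print(dot([1, 2, 3], [4, 5, 6]))
print(mat_vec([[1, 0], [0, 2]], [3, 4]))
print(transpose([[1, 2, 3], [4, 5, 6]]))
```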

Statistics

[c.5] Mean, median, correlations...

As a data scientist, you will usually use the statistical methods implemented in the common libraries, but it is important to understand what is behind them. The essential basics are covered in the suggested ML course.
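
To see what is behind a library call, here is a small sketch that takes the mean and median from Python's standard `statistics` module and computes the Pearson correlation by hand. The `hours` and `scores` lists are invented sample data, and `correlation` is an illustrative helper that does not handle the zero-variance case.

```python
import math
from statistics import mean, median


def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)


# Made-up data: hours studied and the resulting test scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

print(mean(hours), median(scores))
print(round(correlation(hours, scores), 3))
```

A value close to 1 means the two lists rise together almost linearly; a value close to 0 means no linear relationship.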

If you want to dig deeper into it you can use these free books:

Probability

[c.6] Bayes's theorem, continuous distributions, and more. To be a good data scientist you will need a working understanding of probability and its mathematics.

Good free online books for this are:

Introduction to Probability, by Charles M. Grinstead and J. Laurie Snell
[b2.c2] Chapter 2 of Book 2 from above
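
As a worked illustration of Bayes's theorem, the sketch below computes the probability of having a condition given a positive test. The function name `bayes_posterior` and all of the numbers (1% prevalence, 99% sensitivity, 5% false-positive rate) are assumptions made up for this example.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(H | E) using Bayes's theorem.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    # Total probability of seeing the evidence at all.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical medical test: rare condition, fairly accurate test.
posterior = bayes_posterior(prior=0.01, likelihood=0.99, false_positive_rate=0.05)
print(round(posterior, 4))
```

Even with an accurate test, the posterior stays well below one half here because the condition is rare, which is the kind of intuition the probability chapters aim to build.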

Hypothesis and Inference

[c.7] A large part of being a data scientist is testing whether a certain hypothesis is likely to be true. Most of the relevant terms are discussed in the suggested ML course.

The suggested statistics books cover this subject in depth. If you would like to go deeper with an online course, try the Coursera course Data Analysis and Statistical Inference.
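
A classic first example of hypothesis testing is checking whether a coin is fair. The sketch below uses the normal approximation to the binomial distribution, without a continuity correction, to get a two-sided p-value. The helper names `normal_cdf` and `two_sided_p_value` and the flip counts are illustrative assumptions.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of a normal distribution."""
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2


def two_sided_p_value(heads, flips, p=0.5):
    """Probability of a result at least this extreme if the coin has P(heads) = p.

    Approximates the binomial count of heads with a normal distribution.
    """
    mu = flips * p
    sigma = math.sqrt(flips * p * (1 - p))
    z = abs(heads - mu) / sigma
    return 2 * (1 - normal_cdf(z))


# Null hypothesis: the coin is fair. We reject it if the p-value is below 0.05.
for heads in (500, 530, 550):
    p_value = two_sided_p_value(heads, flips=1000)
    print(heads, round(p_value, 4), "reject" if p_value < 0.05 else "do not reject")
```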

Gradient Descent

[c.8] Part of building ML models is using minimization techniques. Why? In short, training a model usually means finding the parameters that minimize an error function. You will find out all about it in the suggested ML course.
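
Here is a minimal sketch of the idea in one dimension: repeatedly step against the gradient until the value settles near a minimum. The function `gradient_descent`, the learning rate, the step count, and the example function f(x) = (x - 3)^2 are all assumptions chosen for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Follow the negative gradient from `start` for a fixed number of steps."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x


# Minimize f(x) = (x - 3)^2, whose derivative is f'(x) = 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 6))
```

With too large a learning rate the steps overshoot and can diverge; with too small a rate, convergence is slow. Tuning that trade-off is a recurring theme in ML practice.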

Machine Learning

Most of this part will be covered by the suggested ML course and its Python practice.

Supervised

Fundamentals [c.11]
Algorithms (kNN, Naive Bayes, Linear Regression, Logistic Regression, Decision Trees) [c.12-17]
Deep Learning (NN and the layers) [c.18-19]
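
To give a feel for the simplest supervised algorithm in the list above, here is a from-scratch sketch of k-nearest neighbors classification. The function `knn_predict` and the tiny `training` set are made up for illustration; ties between labels are resolved by whichever label is encountered first among the nearest points.

```python
import math
from collections import Counter


def knn_predict(k, labeled_points, new_point):
    """Predict a label by majority vote among the k closest labeled points.

    labeled_points is a list of (point, label) pairs, where each point
    is a tuple of numbers with the same length as new_point.
    """
    by_distance = sorted(
        labeled_points,
        key=lambda pair: math.dist(pair[0], new_point),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two small clusters labeled "a" and "b".
training = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
    ((6.0, 6.0), "b"), ((7.0, 6.5), "b"), ((6.5, 7.0), "b"),
]

print(knn_predict(3, training, (1.2, 1.5)))
print(knn_predict(3, training, (6.4, 6.4)))
```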

Unsupervised

Clustering [c.20]
Density Estimation
Anomaly Detection

If you decide to practice more and want to follow the [b1] exercises, you can use this GitHub repo.

Working with Data

[b1.c2] When working with data, you will usually go through the same steps:

Exploratory Data Analysis
Data Visualization
Cleaning the data

Data Modeling

Data pre-processing and Feature Engineering
Feature Selection
General Model Selection
Progress
Basic Models and their usage
Ensemble Methods [b1.p120 Ensemble Methods]
Model Evaluation
Hyperparameter Tuning [cross validation, b1.p60 Hyperparameter Tuning and Model Selection]
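
The cross validation mentioned in the last step can be sketched from scratch as follows: split the sample indices into k folds, and let each fold serve once as the validation set while the rest is used for training. The generator name `k_fold_splits` is hypothetical, and this version does not shuffle the data, which you would normally do first. Libraries such as scikit-learn provide ready-made versions.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_splits(n_samples=10, k=3):
    print("train:", train, "validation:", validation)
```

For hyperparameter tuning, you would train the model on each `train` part for every candidate setting, score it on the matching `validation` part, and keep the setting with the best average score.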

You will experience most of these steps by going through these Kaggle courses:

NLP

[c.21] NLP is a huge field. For an introduction, try this Kaggle course.

Computer Vision

[b1.c14] CV is another huge field. For an introduction, try this Kaggle course.

Working Skills

Databases and SQL [c.24]

Git

Linux

Relevant post

Data Engineering

Model deployment
Monitoring and logs
Data model architecture and patterns

After you finish all of the above you should start working on some data projects.

A good place to start is the Kaggle micro-challenges.

Then you can move on to other projects on Kaggle or any other data website (DataCamp, for example).

If you have any questions or notes, don't hesitate to reach out!
