Damien Gonot

Data Science

Probabilities

https://probability4datascience.com/

Markov Chains

https://github.com/czekster/markov

Data/CSV/JSON tools

jq

miller

q

http://harelba.github.io/q/ Run SQL directly on CSV or TSV files
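The idea behind q, querying CSV data with SQL, can be sketched in Python with only the standard library. This is not how q is implemented or invoked; it is a minimal illustration that copies rows into an in-memory SQLite table. The sample rows (with prices written without the "$" sign), the table name `routes` and the helper `load_csv_into_sqlite` are assumptions made for this sketch.

```python
import csv
import io
import sqlite3

# Sample data assumed for illustration; a real file would be opened
# with open(path, newline="") instead of io.StringIO.
CSV_TEXT = """route_id,origin,destination,price
montreal-toronto,Montreal,Toronto,45
montreal-ottawa,Montreal,Ottawa,35
ottawa-toronto,Ottawa,Toronto,30
"""


def load_csv_into_sqlite(csv_file, table):
    """Hypothetical helper: copy a CSV with a header row into an in-memory SQLite table."""
    reader = csv.reader(csv_file)
    header = next(reader)
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    return conn


conn = load_csv_into_sqlite(io.StringIO(CSV_TEXT), "routes")

# Columns have no declared type, so values are stored as text;
# CAST makes the price comparison numeric.
cheap_routes = conn.execute(
    "SELECT route_id FROM routes WHERE CAST(price AS INTEGER) < 40 ORDER BY route_id"
).fetchall()
print(cheap_routes)
```

For one-off queries a dedicated tool such as q is likely more convenient; the sketch only shows that "SQL on CSV" amounts to loading rows into a table and querying it.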

xsv

https://github.com/BurntSushi/xsv A fast CSV command line toolkit written in Rust

visidata

https://www.visidata.org/

tv

https://github.com/alexhallam/tv 📺(tv) Tidy Viewer is a cross-platform CLI csv pretty printer that uses column styling to maximize viewer enjoyment.

jless

https://pauljuliusmartinez.github.io/ new command-line JSON viewer

Serving Data

ROAPI

https://roapi.github.io/docs/index.html ROAPI automatically spins up read-only APIs for static datasets without requiring you to write a single line of code.

Datasette

https://datasette.io/ Datasette is a tool for exploring and publishing data. It helps people take data of any shape or size, analyze and explore it, and publish it as an interactive website and accompanying API.

https://docs.datasette.io/en/stable/ecosystem.html

https://architecturenotes.co/datasette-simon-willison/

Forecasting

Prophet

https://facebook.github.io/prophet/ Prophet is a forecasting procedure implemented in R and Python. It is fast and provides completely automated forecasts that can be tuned by hand by data scientists and analysts.

ETL / ELT

Meltano

https://meltano.com/ ELT for the DataOps era. Meltano is open source, self-hosted, CLI-first, debuggable, and extensible.

Airflow

https://airflow.apache.org/

dbt

Storing Data

CSV

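Plain-text file with one record per line and columns separated by commas. Values carry no type information: everything is read as text unless the reader infers or is told otherwise.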
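A minimal sketch of reading CSV with Python's standard `csv` module. The two sample rows are taken from the train routes data used in the Exploration section and are embedded as a string so the example runs without a file.

```python
import csv
import io

# Two rows from the train routes example, embedded for illustration.
CSV_TEXT = """route_id,origin,destination,price
paris-marseille,Paris,Marseille,$65
montreal-toronto,Montreal,Toronto,$45
"""

# DictReader uses the first line as column names.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

first = rows[0]
print(first["route_id"], first["price"])
print(type(first["price"]))  # every value read from a CSV is a str
```

Any numeric conversion (for example turning "$65" into a number) has to be done by the reader, since the format itself stores only text.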

Parquet

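"Columnar" binary file: values are stored column by column rather than row by row, each column has a declared data type, and files are usually much smaller than the equivalent CSV.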
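To make "columnar" concrete, here is a small pure-Python sketch contrasting a row-oriented layout with a column-oriented one. It is only an illustration of the layout idea, not of Parquet's actual on-disk format, which adds its own encoding, compression and metadata. The sample values and variable names are assumptions.

```python
from array import array

# Row-oriented: one record per route, like lines in a CSV (sample values assumed).
rows = [
    ("montreal-toronto", 45),
    ("montreal-ottawa", 35),
    ("ottawa-toronto", 30),
]

# Column-oriented: one sequence per column, each holding a single type.
columns = {
    "route_id": [route_id for route_id, _ in rows],
    "price": array("i", (price for _, price in rows)),  # "i" = signed int
}

# Computing on one column does not require reading the others.
average_price = sum(columns["price"]) / len(columns["price"])
print(average_price)
```

Keeping same-typed values next to each other is one reason columnar files tend to compress well and to be quick to scan when only a few columns are needed.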

Exploration

import pandas as pd

# Read the CSV into a DataFrame and display it
df = pd.read_csv("~/tmp/train_routes.csv")
print(df)

           route_id     origin destination price
0   paris-marseille      Paris   Marseille   $65
1   marseille-paris  Marseille       Paris   $65
2  montreal-toronto   Montreal     Toronto   $45
3  toronto-montreal    Toronto    Montreal   $45
4   montreal-ottawa   Montreal      Ottawa   $35
5   ottawa-montreal     Ottawa    Montreal   $35
6    ottawa-toronto     Ottawa     Toronto   $30
7    toronto-ottawa    Toronto      Ottawa   $30

# Convert the CSV to Parquet (requires pyarrow or fastparquet to be installed)
df.to_parquet("~/tmp/train_routes.parquet")

# Read the Parquet file back: the contents are the same
print(pd.read_parquet("~/tmp/train_routes.parquet"))

           route_id     origin destination price
0   paris-marseille      Paris   Marseille   $65
1   marseille-paris  Marseille       Paris   $65
2  montreal-toronto   Montreal     Toronto   $45
3  toronto-montreal    Toronto    Montreal   $45
4   montreal-ottawa   Montreal      Ottawa   $35
5   ottawa-montreal     Ottawa    Montreal   $35
6    ottawa-toronto     Ottawa     Toronto   $30
7    toronto-ottawa    Toronto      Ottawa   $30
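
The price column in the output above holds strings such as "$65", so it cannot be aggregated directly. A possible next exploration step, sketched here on a subset of the rows built inline so it runs without the file, is to derive a numeric column. The column name `price_amount` is made up for this example.

```python
import pandas as pd

# Subset of the rows shown above, built inline for illustration.
df = pd.DataFrame(
    {
        "route_id": ["paris-marseille", "montreal-toronto", "ottawa-toronto"],
        "price": ["$65", "$45", "$30"],
    }
)

# Hypothetical column name: strip the leading "$" and convert to integers.
df["price_amount"] = df["price"].str.lstrip("$").astype(int)

# Numeric columns support aggregations such as mean, min and max.
print(df["price_amount"].mean())
```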

Visualizing Data

Gnuplot

f(x) = sin(x)
plot f(x)

YouPlot

https://github.com/red-data-tools/YouPlot

A command-line tool that draws plots in the terminal.

Resources

https://github.com/microsoft/Data-Science-For-Beginners

Harvard CS109a: Introduction to Data Science

https://harvard-iacs.github.io/2021-CS109A/

https://harvard-iacs.github.io/2021-CS109A/pages/materials.html

Designing Data-Intensive Applications

by Martin Kleppmann

Exploratory Data Analysis

by Roger D. Peng, and others

Practical Time Series Analysis

by Tural Sadigov, William Thistleton

Computational and Inferential Thinking: The Foundations of Data Science

2nd Edition by Ani Adhikari, John DeNero, David Wagner

https://inferentialthinking.com/chapters/intro.html