Pythonic “Data Science” Specialization

Why The "Data Science" Specialization

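  1. Review statistics in preparation for deeper study
    In his GTC 2015 talk, Andrew Ng said that deep learning is black magic: we understand 50% of it, but do not know how the other 50% works. Sitting in the audience, I thought that even of the 50% that can be understood, I seem to grasp only 5%.
  2. Learn from a "standard and efficient" workflow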
    mine: emacs org mode + emacs magit + bitbucket + python. There must be some room for improvement.


The course uses R. I do not want to learn yet another similar language, so I will look for the corresponding numpy/scipy solutions.

Getting and Cleaning Data

Sources of raw data

  • Website APIs
  • Databases
  • Json
  • Raw texts

Data analysis workflow

  • Raw data --> Processing scripts --> tidy data (often ignored in the classes but really important)
    • Record the metadata
    • Record the recipes
  • --> data analysis (covered in machine learning classes)
  • --> data communication


  1. Each variable you measure should be in one column.
  2. There should be one table for each "kind" of variable; generally, data should be saved in one file per table. But why? Wouldn't that be a hassle to manage?
  3. If you have multiple tables, they should include a column in the table that allows them to be linked. See DataFrame.merge and DataFrame.join in pandas.
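
To make rule 3 concrete for myself, here is a minimal sketch with two made-up tables linked through a shared column. The table names, column names and values are all assumptions for illustration; only `DataFrame.merge` comes from pandas.

```python
import pandas as pd

# Two hypothetical tidy tables: one row per patient, one row per visit.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [34, 51, 29],
})
visits = pd.DataFrame({
    "patient_id": [1, 1, 3],
    "systolic_mmHg": [120, 118, 131],
})

# patient_id is the linking column; an inner join keeps only patients with visits.
linked = visits.merge(patients, on="patient_id", how="inner")
print(linked)
```

Keeping the two tables separate means a patient's age is stored once, and the merge recreates the wide view only when it is needed.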

The code book

A code book? (⊙o⊙)…

  • Info about the variables (including units!)
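    Units matter! A measurement without units has no physical meaning!
    However, the course never mentions significant figures, which must be considered when taking measurements. Perhaps that is because python/R handle significant figures well? So there is no need to think about float versus double as in C? In some extreme cases a library like sympy would probably still be needed.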
  • Info about the summary choice you made
  • Info about the experimental study design you used

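The code book plays a role similar to the lab notebook in a wet lab. I am glad I learned about emacs org mode early on; it fits well here. However, I had overlooked the importance of "Info about the variables".

If there are many features, and the features themselves carry deep meaning, they need to be selected carefully. I remember attending a talk where a financial company used a decision tree to build portfolios; the algorithm itself was quite ordinary, but the lecturer kept tight-lipped about which features were actually used.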

"There are many stages to the design and analysis of a successful study. The last of these steps is the calculation of an inferential statistic such as a P value, and the application of a 'decision rule' to it (for example, P < 0.05). In practice, decisions that are made earlier in data analysis have a much greater impact on results — from experimental design to batch effects, lack of adjustment for confounding factors, or simple measurement error. Arbitrary levels of statistical significance can be achieved by changing the ways in which data are cleaned, summarized or modelled."

Leek, Jeffrey T., and Roger D. Peng. "Statistics: P values are just the tip of the iceberg." Nature 520.7549 (2015): 612-612.

Downloading Files

I usually just use wget directly, but that is not easy to integrate into a script. A few python functions that are likely to be useful when downloading:

# set up the env

# download

# to tag your downloaded files

# an example
import shutil
import ssl
import urllib.request as ur

def download(myurl):
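    """Download myurl to the current directory and return the local file name."""
    fn = myurl.split('/')[-1]
    # NOTE: this disables TLS certificate verification; only use it for trusted hosts
    context = ssl._create_unverified_context()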
    with ur.urlopen(myurl, context=context) as response, open(fn, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)

    return fn
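
The "set up the env" and "tag your downloaded files" steps listed in the comments can be sketched with the standard library alone. The helper names `ensure_dir` and `tagged_name`, the default `data` directory and the timestamp format are my own assumptions for this illustration.

```python
import os
from datetime import datetime, timezone


def ensure_dir(path="data"):
    """Create the download directory if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path


def tagged_name(fn, when=None):
    """Prefix a file name with a UTC timestamp so the download date is recorded."""
    when = when or datetime.now(timezone.utc)
    return when.strftime("%Y%m%dT%H%M%SZ") + "_" + fn
```

The result of `tagged_name` can be joined with the directory returned by `ensure_dir` and used as the target path when saving a response, so that every downloaded file carries its retrieval time.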

Loading flat files


Reading XML

Here is a very good introduction

Below are my summaries:

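The python standard library ships with xml.etree.ElementTree for parsing XML. In it, ElementTree represents the whole XML document and Element represents a single node.

The first element in every XML document is called the root element. An XML document can have only one root, so the following does not conform to the XML specification: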
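
A minimal illustration, with made-up element names: a string that contains two top-level elements is rejected by the parser, and wrapping them in a single root fixes it.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet with two top-level elements, i.e. two roots.
two_roots = "<row><zipcode>21231</zipcode></row><row><zipcode>21206</zipcode></row>"

try:
    ET.fromstring(two_roots)
    parsed = True
except ET.ParseError as err:
    parsed = False
    print("not well-formed:", err)

# Wrapping both elements in a single root makes the document valid.
root = ET.fromstring("<response>" + two_roots + "</response>")
print(root.tag, len(root))
```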


Traversing recursively

# an exercise
# find all elements with zipcode equals 21231
import xml.etree.ElementTree as ET

xml_fn = download("")
tree = ET.parse(xml_fn)
for child in tree.iter():
    if child.tag == 'zipcode' and child.text == '21231':
        print(child.tag, child.text)  # report each matching element
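
The same traversal can be tried without downloading anything. This is a sketch on a small inline document; the element names and values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up data for illustration; a real file would come from ET.parse(path).
xml_text = """
<response>
  <row><name>A</name><zipcode>21231</zipcode></row>
  <row><name>B</name><zipcode>21206</zipcode></row>
  <row><name>C</name><zipcode>21231</zipcode></row>
</response>
"""

root = ET.fromstring(xml_text)

# iter('zipcode') walks the whole tree recursively but yields only <zipcode> elements.
hits = [el for el in root.iter("zipcode") if el.text == "21231"]
print(len(hits))

# To get the enclosing rows instead, test each row's zipcode child.
names = [row.findtext("name") for row in root.iter("row")
         if row.findtext("zipcode") == "21231"]
print(names)
```

Passing the tag name to `iter` moves the tag comparison into the library, so the loop body only has to check the text.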


  • JSON stands for JavaScript Object Notation
  • lightweight data storage

To the naked eye, JSON looks just like a nested python dict. The json module that ships with python is used much like pickle.
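
A short round trip shows the similarity to pickle and one place where the "nested dict" picture breaks down. The record below is made up for illustration.

```python
import json

# A nested Python structure (made-up values) and its JSON text form.
record = {"name": "example", "tags": ["a", "b"], "owner": {"id": 1, "admin": False}}

text = json.dumps(record, indent=2)   # like pickle.dumps, but yields readable text
restored = json.loads(text)           # like pickle.loads

print(text)
print(restored == record)

# Unlike pickle, JSON only knows a few types: dict keys always come back as strings.
print(json.loads(json.dumps({1: "one"})))
```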

Pattern Matching

Python makes a distinction between matching and searching. Matching looks only at the start of the target string, whereas searching looks for the pattern anywhere in the target.
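
A minimal sketch of the difference, using an arbitrary target string:

```python
import re

target = "data science with python"

m_match = re.match(r"python", target)    # anchored at the start: no match here
m_search = re.search(r"python", target)  # scans the whole string
m_start = re.match(r"data", target)      # succeeds because the string starts with it

print(m_match)
print(m_search.span())
print(m_start.group())
```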

Always use raw strings for regex patterns.

Character sets
Something like r'[A-Za-z_]' would match an underscore or any uppercase or lowercase ASCII letter.

Most characters that have special meanings in other regular expression contexts lose them within square brackets. The exceptions are ^, which negates the set but only if it is the first character after the left (opening) bracket; -, which denotes a range when placed between two characters; and the backslash, which still escapes.
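
A few quick checks of how characters behave inside square brackets (the input strings are arbitrary):

```python
import re

# Letters and underscore, as in the example above.
letters = re.findall(r"[A-Za-z_]", "a_1-B")

# '.' and '*' are literal inside a set.
literals = re.findall(r"[.*]", "a.b*c")

# A leading '^' negates the set; elsewhere it is a literal caret.
negated = re.findall(r"[^0-9]", "a1^2")
caret = re.findall(r"[0-9^]", "a1^2")

print(letters, literals, negated, caret)
```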

Summarizing Data

import pandas as pd

# placeholder data; substitute your own DataFrame
df = pd.DataFrame({'a': [1.0, 2.0, 4.0], 'b': [2.0, 3.0, 7.0]})

# Look at a bit of the data
df.head()

# summary
df.describe()

# cov and corr
# DataFrame's corr and cov methods return a full correlation or covariance matrix as a DataFrame, respectively
df.corr()
df.cov()

# to calculate pairwise correlation between a DataFrame's columns or rows and a Series
df.corrwith(df['a'])  # replace 'a' with the column of interest

# you can write your own analysis function and apply it to the dataframe, for example:
f = lambda x: x.max() - x.min()
df.apply(f, axis=1)  # axis=1 applies f to each row

Check for missing values

# to modify inplace
df.fillna(0, inplace=True)

# fill the nan with the mean, e.g. df.fillna(df.mean())
# or with the prediction of a naive Bayes model
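
Putting the checking and the filling together, here is a small sketch on a made-up frame. The column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with a couple of missing entries.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})

# Count missing values per column.
print(df.isna().sum())

# Fill with a constant, returning a new frame.
zero_filled = df.fillna(0)

# Fill each column with that column's mean (NaN is skipped when averaging).
mean_filled = df.fillna(df.mean())
print(mean_filled)
```

Returning a new frame instead of using `inplace=True` keeps the original around, which makes it easier to compare filling strategies.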

Exploratory Data Analysis

Analytic graphics

Principles of Analytic Graphics

  1. Show comparisons
    If you build a predictive model, report its performance alongside that of random guessing.

  2. Show causality, mechanism, explanation, systematic structure

  3. Show multivariate data
    The world is inherently multivariate

  4. Integration of evidence

  5. Describe and document the evidence with appropriate labels, scales, sources, etc.

Simple Summaries of Data

Two dimensions

  • scatterplots
  • smooth scatterplots

> 2 dimensions

  • Overlaid/multiple 2-D plots; coplots
  • Use color, size, shape to add dimensions
  • Spinning plots
  • Actual 3-D plots (not very useful)

Graphics File Devices

  • pdf: useful for line-type graphics, resizes well, not efficient if a plot has many objects/points
  • svg: XML-based scalable vector graphics; supports animation and interactivity, potentially useful for web-based plots
  • png: bitmapped format, good for line drawings or images with solid colors, uses lossless compression, most web browsers can read this format natively, does not resize well
  • jpeg: good for photographs or natural scenes, uses lossy compression, does not resize well
  • tiff: bitmapped format, supports lossless compression
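
The list above describes R's graphics devices. On the Python side, a rough counterpart is matplotlib's `savefig`, which picks the output format from the file extension. This sketch assumes matplotlib is installed; the plotted data and file names are made up, and only three of the formats are written.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render to files only, no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
ax.set_xlabel("x")
ax.set_ylabel("x squared")

out_dir = tempfile.mkdtemp()
paths = {}
for ext in ("pdf", "svg", "png"):
    paths[ext] = os.path.join(out_dir, "plot." + ext)
    # The output format is inferred from the file extension; dpi only matters for bitmaps.
    fig.savefig(paths[ext], dpi=150)
plt.close(fig)
```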

Simulation in R

  • rnorm: generate random Normal variates with a given mean and standard deviation
  • dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points)
  • pnorm: evaluate the cumulative distribution function for a Normal distribution

The prefix of each function name indicates what it computes:

  • d for density
  • r for random number generation
  • p for cumulative distribution
  • q for quantile function
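
Since I want numpy/scipy counterparts, here is a sketch of how the four prefixes could map onto Python, assuming numpy and scipy are installed. The mapping in the comments is my own reading, and a seeded generator plays the role of R's seed; the same seed value does not reproduce R's numbers.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)                      # seeded generator for reproducibility
samples = rng.normal(loc=0.0, scale=1.0, size=5)    # r: random variates, like rnorm(5)

density = norm.pdf(0.0, loc=0.0, scale=1.0)         # d: density, like dnorm(0)
cumulative = norm.cdf(1.96)                         # p: cumulative distribution, like pnorm(1.96)
quantile = norm.ppf(0.975)                          # q: quantile function, like qnorm(0.975)

print(samples)
print(density, cumulative, quantile)
```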

Setting the random number seed with set.seed ensures reproducibility

> set.seed(1)
> rnorm(5)