DC Academy (DC学院): Databases

Date: 2019-11-13

1. Text files versus databases

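Text files: simple, and can be read and processed directly. The drawback is that processing requires loading the file into memory.
Databases: structured data storage with easy indexing.
Common databases: (1) SQL databases: Oracle, MS SQL Server, MySQL, SQLite; (2) NoSQL databases, common in distributed systems: MongoDB, Cassandra.

Database modes: client-server (MySQL) and file-based (SQLite). The Firefox browser, for example, uses a file-based database.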
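
To make the file-based mode concrete, here is a minimal sketch using Python's built-in sqlite3 module. The file name demo.sqlite and the notes table are made up for illustration. The point is only that the whole database lives in one ordinary file and no server process is involved.

```python
import os
import sqlite3
import tempfile

# Hypothetical file name, chosen only for this illustration.
db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")

# Connecting creates the file if it does not exist; no server is needed.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))
conn.commit()
conn.close()

# Reopening the same file shows that the data was persisted on disk.
conn = sqlite3.connect(db_path)
rows = conn.execute("SELECT id, body FROM notes").fetchall()
conn.close()
print(rows)
```

A client-server database such as MySQL would instead need a host, a user, and a password to connect.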

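2. Database operations with HeidiSQL

Operations covered: creating a database, importing data, querying data, inserting data, updating data, and deleting data.

Example dataset: Iris, downloaded from the UCI Machine Learning Repository.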

Statements for querying, inserting, updating, and deleting:

select column_x from table where condition order by column_i [desc,asc], column
insert into table_name (cname1, cname2) values (v1, v2)
update table_name set column1=value1, column2=value2, ... where condition
delete from table where condition
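
The four statements can be tried without a MySQL server. The sketch below uses Python's built-in sqlite3 module with an in-memory database instead of HeidiSQL/MySQL. The three rows are made-up values, not the real Iris data, and sqlite3 uses ? as its parameter placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iris (id INTEGER PRIMARY KEY, sepal_length REAL, species TEXT)"
)

# insert: add three made-up rows
conn.executemany(
    "INSERT INTO iris (sepal_length, species) VALUES (?, ?)",
    [(5.1, "Iris-setosa"), (7.0, "Iris-versicolor"), (6.3, "Iris-virginica")],
)

# update: change a value in the row whose id is 1
conn.execute("UPDATE iris SET sepal_length = ? WHERE id = ?", (5.2, 1))

# delete: remove every row of one species
conn.execute("DELETE FROM iris WHERE species = ?", ("Iris-virginica",))

# select: filter with where, sort with order by ... desc
rows = conn.execute(
    "SELECT id, sepal_length, species FROM iris "
    "WHERE sepal_length > 5 ORDER BY sepal_length DESC"
).fetchall()
conn.commit()
conn.close()
print(rows)
```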

Advanced operations:

(1) distinct: returns each combination of the selected columns only once, so every row of the result is unique
select distinct column1, column2 from table;
select distinct sepal_length, species from iris;
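(2) Comparison operators: like, in, between
Used inside where to constrain the values of a field or column.
where columnN like pattern -- pattern may contain wildcards: an underscore (_) matches exactly one character, a percent sign (%) matches zero or more characters
select * from iris where species like '%';
where column_name in (value1, value2, value3...) -- equivalent to several conditions joined with or
where column_name between value1 and value2 -- both the start and end values are included
select * from iris where sepal_length between 5 and 6 order by sepal_length;
select * from iris where sepal_length in (2, 3, 6);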
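
As a rough illustration of distinct, like, in, and between, the sketch below runs them against a tiny in-memory SQLite table. SQLite stands in for MySQL here, and the six rows are invented for the example rather than taken from the Iris dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (sepal_length REAL, species TEXT)")
conn.executemany(
    "INSERT INTO iris VALUES (?, ?)",
    [
        (5.1, "Iris-setosa"),
        (4.9, "Iris-setosa"),
        (5.1, "Iris-setosa"),  # duplicate of the first row
        (6.0, "Iris-versicolor"),
        (6.3, "Iris-virginica"),
        (5.8, "Iris-virginica"),
    ],
)

# distinct: the duplicated (5.1, Iris-setosa) pair is returned once
distinct_rows = conn.execute(
    "SELECT DISTINCT sepal_length, species FROM iris"
).fetchall()

# like: % matches any run of characters, _ matches exactly one
v_rows = conn.execute(
    "SELECT species FROM iris WHERE species LIKE 'Iris-v%'"
).fetchall()
one_char_rows = conn.execute(
    "SELECT species FROM iris WHERE species LIKE '_ris-setosa'"
).fetchall()

# between: both endpoints are included, so 6.0 is part of the result
between_values = [
    r[0]
    for r in conn.execute(
        "SELECT sepal_length FROM iris "
        "WHERE sepal_length BETWEEN 5 AND 6 ORDER BY sepal_length"
    )
]

# in: shorthand for sepal_length = 4.9 OR sepal_length = 6.3
in_rows = conn.execute(
    "SELECT sepal_length FROM iris WHERE sepal_length IN (?, ?)", (4.9, 6.3)
).fetchall()
conn.close()

print(len(distinct_rows), len(v_rows), len(one_char_rows), between_values, len(in_rows))
```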

(3) Aggregate functions: max, min, count, avg, sum
select MIN(column_name) from table_name;
select min(sepal_length), max(sepal_length), count(sepal_length), avg(sepal_length), sum(sepal_length) from iris;

(4) group by
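Splits the rows into groups according to the value of a column. Combined with aggregate functions, it computes statistics separately for each group.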
select
min(sepal_length), max(sepal_length), count(sepal_length), avg(sepal_length),
sum(sepal_length),count(species)
from iris
group by(species);
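
A small sketch of aggregation per group, again using an in-memory SQLite table with made-up measurements instead of the real Iris data. Unlike the query above, it also selects the species column so that each result row shows which group it belongs to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (sepal_length REAL, species TEXT)")
conn.executemany(
    "INSERT INTO iris VALUES (?, ?)",
    [
        (5.0, "Iris-setosa"),
        (5.5, "Iris-setosa"),
        (6.0, "Iris-versicolor"),
        (7.0, "Iris-versicolor"),
        (6.5, "Iris-virginica"),
    ],
)

# One result row per species, each with its own min, max, count, and average.
stats = conn.execute(
    "SELECT species, MIN(sepal_length), MAX(sepal_length), "
    "COUNT(sepal_length), AVG(sepal_length) "
    "FROM iris GROUP BY species ORDER BY species"
).fetchall()
conn.close()

for row in stats:
    print(row)
```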

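(5) Primary keys and indexes
Primary key: uniquely identifies one record in a table, for example a student ID or a national ID number. It can also be generated separately as an extra column.
Index: built on one or more fields of a table. It speeds up queries whose where clause has a condition on the indexed field(s).
The primary key is indexed by default. Conceptually, an index keeps the values sorted so that a lookup can use binary search instead of scanning every row.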
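
One way to see an index at work is to ask the database for its query plan before and after creating the index. The sketch below does this with SQLite's EXPLAIN QUERY PLAN, not MySQL. The index name idx_iris_sepal_length and the generated rows are made up, and the exact plan wording varies between SQLite versions, so the code only prints the plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iris (id INTEGER PRIMARY KEY, sepal_length REAL, species TEXT)"
)
# Sixty made-up rows with sepal lengths cycling through 4.5, 5.0, ..., 7.0.
conn.executemany(
    "INSERT INTO iris (sepal_length, species) VALUES (?, ?)",
    [(4.5 + 0.5 * (i % 6), "Iris-setosa") for i in range(60)],
)


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


query = "SELECT * FROM iris WHERE sepal_length = 5.0"
plan_before = plan(query)  # no index yet: the table is scanned
conn.execute("CREATE INDEX idx_iris_sepal_length ON iris (sepal_length)")
plan_after = plan(query)  # the new index is used for the lookup

# A lookup on the primary key uses its built-in index without extra setup.
pk_plan = plan("SELECT * FROM iris WHERE id = 3")
conn.close()

print(plan_before)
print(plan_after)
print(pk_plan)
```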
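(6) Joining tables: join
select column_name(s)
from table1
inner join table2 on table1.column_name=table2.column_name;
inner join: returns only the rows that match in both tables (the intersection)
left join: returns every row of the left table, plus the matching rows of the right table
right join: returns every row of the right table, plus the matching rows of the left table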
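
The difference between the join types shows up as soon as some rows have no match. In the sketch below the species_info table, its group_name column, and all rows are hypothetical. It uses SQLite, and because older SQLite versions lack RIGHT JOIN, the right join is written as a left join with the two tables swapped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (id INTEGER PRIMARY KEY, species TEXT)")
conn.execute("CREATE TABLE species_info (species TEXT, group_name TEXT)")
conn.executemany(
    "INSERT INTO iris (id, species) VALUES (?, ?)",
    [(1, "Iris-setosa"), (2, "Iris-versicolor"), (3, "Iris-unknown")],
)
conn.executemany(
    "INSERT INTO species_info VALUES (?, ?)",
    [("Iris-setosa", "group A"), ("Iris-virginica", "group B")],
)

# inner join: only species present in both tables
inner_rows = conn.execute(
    "SELECT iris.id, iris.species, species_info.group_name "
    "FROM iris INNER JOIN species_info ON iris.species = species_info.species "
    "ORDER BY iris.id"
).fetchall()

# left join: every iris row; unmatched rows get NULL (None in Python)
left_rows = conn.execute(
    "SELECT iris.id, iris.species, species_info.group_name "
    "FROM iris LEFT JOIN species_info ON iris.species = species_info.species "
    "ORDER BY iris.id"
).fetchall()

# "right join" of iris with species_info, written as a swapped left join
right_rows = conn.execute(
    "SELECT species_info.species, species_info.group_name, iris.id "
    "FROM species_info LEFT JOIN iris ON iris.species = species_info.species "
    "ORDER BY species_info.species"
).fetchall()
conn.close()

print(inner_rows)
print(left_rows)
print(right_rows)
```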

3. Connecting to a database from Python

Install the package: pip install pymysql. Official documentation: https://pymysql.readthedocs.io/en/latest/user/examples.html

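Steps:

1. Open a connection to the database.

2. Run the SQL statements (insert, delete, update, select), wrapping database field names in backticks. For operations that modify the database, such as inserts and deletes, call commit once after the changes have been made; only then are they actually written to the database.

3. Close the database connection.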

import pymysql.cursors

# Connect to the database; DictCursor returns each row as a dict.
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='123456root',
                             database='iris',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

try:
    with connection.cursor() as cursor:
        # Read a single record; the id is passed as a query parameter.
        sql = "SELECT * FROM `iris_with_id` WHERE `id` = %s"
        cursor.execute(sql, ('3',))
        result = cursor.fetchone()
        print(result)
        print(result['id'])
finally:
    # Always close the connection, even if the query fails.
    connection.close()

Output (from a Python 2 run, hence the u prefix on strings):

{u'Petal_width': 0.2, u'sepal_length': 4.7, u'id': 3, u'sepal_width': 3.2, u'petal_length': 1.3, u'species': u'Iris-setosa\r'}
3

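cursor.fetchone() returns only the next row of the result set (the first row when called right after execute), while cursor.fetchall() returns all remaining rows. cursorclass=pymysql.cursors.DictCursor makes each row come back as a dictionary mapping column names to values.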
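
The example above only reads data. For the write path mentioned in step 2, a sketch could look like the following. The helper names insert_flower and fetch_longer_than are hypothetical. The column list is taken from the output shown above, and the code assumes that id is filled in automatically by the database, which the original does not state. Both functions expect an already open pymysql connection created with DictCursor.

```python
def insert_flower(connection, row):
    """Insert one row and commit it.

    `row` is a 5-tuple: sepal_length, sepal_width, petal_length,
    petal_width, species. Assumes `id` is generated by the database.
    """
    sql = (
        "INSERT INTO `iris_with_id` "
        "(`sepal_length`, `sepal_width`, `petal_length`, `petal_width`, `species`) "
        "VALUES (%s, %s, %s, %s, %s)"
    )
    try:
        with connection.cursor() as cursor:
            cursor.execute(sql, row)
        # Nothing is written permanently until commit is called.
        connection.commit()
    except Exception:
        # Undo the partial change and let the caller see the error.
        connection.rollback()
        raise


def fetch_longer_than(connection, min_sepal_length):
    """Return every row whose sepal_length exceeds the given value."""
    sql = "SELECT * FROM `iris_with_id` WHERE `sepal_length` > %s"
    with connection.cursor() as cursor:
        cursor.execute(sql, (min_sepal_length,))
        return cursor.fetchall()  # all matching rows, not just the first
```

With the connection from the example above, a call might look like insert_flower(connection, (5.0, 3.4, 1.5, 0.2, 'Iris-setosa')), made before connection.close().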

4. Data cleaning with pandas and data visualization with seaborn

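Data cleaning has four parts:

Format conversion
- The raw storage format is not necessarily suited to data processing in Python
- For example, the raw data stores times as strings, which are converted to Python date and time objects
Missing data
- Any record may lack values for some attributes
- Strategies
  - Ignore records with missing data
  - Mark the value as "unknown"
  - Fill it with the mean, the most frequent value, or similar
Abnormal data
- Values that contradict common sense
- For example, a number in the gender field, or an age greater than 100
Standardization
- In attributes that users type in freely, values that are really the same may be entered in different ways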
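
The three strategies for missing data, and the standardization step, can be sketched with pandas on a tiny made-up table. The column names and values below are invented for illustration and are not taken from any dataset in this post.

```python
import pandas as pd

# Made-up records: one missing age, one missing gender, and the same
# gender typed in three different ways.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "gender": ["Male", "male ", "FEMALE", None],
})

# Strategy 1: ignore (drop) records that have any missing value.
dropped = df.dropna()

# Strategy 2: mark the missing value as "unknown".
gender_marked = df["gender"].fillna("unknown")

# Strategy 3: fill the missing value with the mean of the column.
age_filled = df["age"].fillna(df["age"].mean())

# Standardization: trim whitespace and lowercase so that "Male" and
# "male " become the same value.
gender_std = df["gender"].str.strip().str.lower()

print(len(dropped))
print(gender_marked.tolist())
print(age_filled.tolist())
print(gender_std.dropna().tolist())
```

Which strategy fits depends on the attribute: dropping rows loses data, while filling with the mean changes the distribution of the column.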

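Data cleaning in practice: the home-sharing service Airbnb

Python packages to install:

Pandas: its core data structure is the DataFrame.

Seaborn: provides visualization.

Example: the data source is https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data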

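# Read the data
import pandas
users = pandas.read_csv("train_users_2.csv")
# Start with a basic look at the data: summary statistics of the numeric columns
users.describe()
# The first line of the file holds the column names and the index starts at 0.
# head(3) shows the first 3 rows; with no argument, head() shows the first 5.
users.head(3)
# The opposite of head: shows the last few rows of the dataset
users.tail()
users.shape  # (number of rows, number of columns)
users.loc[1:5, "age"]  # the age values for index labels 1 through 5 (inclusive)
# Format conversion; the format argument specifies the input format

users['date_account_created'] = pandas.to_datetime(users['date_account_created'])  # unify the date format
users["timestamp_first_active"] = pandas.to_datetime(users["timestamp_first_active"], format="%Y%m%d%H%M%S")


import seaborn
# Jupyter/IPython magic: draw the plots inline in the notebook
%matplotlib inline
# histplot(kde=True) replaces distplot, which is deprecated since seaborn 0.11
seaborn.histplot(users['age'].dropna(), kde=True)
# Keep only plausible ages: greater than 10 and less than 90
users_with_true_age = users[users["age"] < 90]
users_with_true_age = users_with_true_age[users_with_true_age["age"] > 10]
seaborn.histplot(users_with_true_age["age"], kde=True)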
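
The two cleaning steps used above, parsing a compact timestamp and discarding implausible ages, can be checked on a small stand-in table without downloading the Kaggle file. The two rows below are invented. The sketch masks bad ages as missing values instead of dropping the rows, which is a variation on the filtering above, not the same operation.

```python
import pandas as pd

# Two made-up rows that mimic the shape of the relevant columns.
sample = pd.DataFrame({
    "timestamp_first_active": [20090319043255, 20140630235636],
    "age": [34.0, 2014.0],  # 2014 looks like a year typed into the age field
})

# Format conversion: the integer is read as YYYYmmddHHMMSS.
parsed = pd.to_datetime(
    sample["timestamp_first_active"].astype(str), format="%Y%m%d%H%M%S"
)

# Abnormal data: keep ages strictly between 10 and 90, mark the rest as NaN.
valid = (sample["age"] > 10) & (sample["age"] < 90)
age_clean = sample["age"].where(valid)

print(parsed.tolist())
print(age_clean.tolist())
```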