R----data.table包介绍学习

相比dplyr包,data.table包可以更大程度地提升数据的处理速度,这里就简单介绍一下data.tale包的使用方法。php

data.table:用于快速处理大数据集的哦css

数据的读取

data.table包中数据读取的函数:fread()git

data.table的建立

library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9) DT # x y v # 1: a 1 1 # 2: a 3 2 # 3: a 6 3 # 4: b 1 4 # 5: b 3 5 # 6: b 6 6 # 7: c 1 7 # 8: c 3 8 # 9: c 6 9 

基础操做

行提取

行提取分为单行提取和多行提取。github

单行提取

DT[2] # 2nd row # x y v # 1: a 3 2 DT[2,] # same # x y v # 1: a 3 2 

这里DT [2]和DT [2]是彻底相同的,这里的「,」只是说明还有其余参数可设置,而其余参数按默认值进行计算。下文全部这样的最后一个「,」都再也不写出来。ruby

多行提取

  • 数字提取
DT[1:2] # x y v # 1: a 1 1 # 2: a 3 2 DT[c(2,5)] # x y v #1: a 3 2 #2: b 3 5 
  • 逻辑提取
DT[c(FALSE,TRUE)] # even rows (usual recycling) # x y v # 1: a 3 2 # 2: b 1 4 # 3: b 6 6 # 4: c 3 8 

此时,C(FALSE,TRUE)会本身重复匹配成与DT的行数相同的向量bash

列提取

与行提取相同,列的提取也包含单列提取和多列提取。函数

单列提取

  • 数字提取

数字提取时,必定要把心安理得参数设置为FALSE。大数据

DT[,2,with=FALSE] # 2nd column # y # 1: 1 # 2: 3 # 3: 6 # 4: 1 # 5: 3 # 6: 6 # 7: 1 # 8: 3 # 9: 6 
  • 列名提取
DT[,list(v)] # v column (as data.table # v # 1: 1 # 2: 2 # 3: 3 # 4: 4 # 5: 5 # 6: 6 # 7: 7 # 8: 8 # 9: 9 

列名的修改

列名的修改能够使用setnames()函数,这个函数好像比对data.frame类型数据名更改的名称()和colnames()函数也要快一些。spa

dt = data.table(a=1:2,b=3:4,c=5:6) # compare to data.table try(tracemem(dt)) # by reference, no deep or shallow copies setnames(dt,"b","B") # by name, no match() needed (warning if "b" is missing) setnames(dt,3,"C") # by position with warning if 3 > ncol(dt) setnames(dt,2:3,c("D","E")) # multiple setnames(dt,c("a","E"),c("A","F")) # multiple by name (warning if either "a" or "E" is missing) setnames(dt,c("X","Y","Z")) # replace all (length of names must be == ncol(DT)) 

多列提取

  • 数字提取

如同上面对按数字对单列的提取,对多列提取也要设置与参数为FALSE。code

DT[,2:3,with=FALSE] # y v # 1: 1 1 # 2: 3 2 # 3: 6 3 # 4: 1 4 # 5: 3 5 # 6: 6 6 # 7: 1 7 # 8: 3 8 # 9: 6 9 DT[,c(1,3),with=FALSE] # x v # 1: a 1 # 2: a 2 # 3: a 3 # 4: b 4 # 5: b 5 # 6: b 6 # 7: c 7 # 8: c 8 # 9: c 9 
  • 按列名提取
DT[,list(y, v)] # y v # 1: 1 1 # 2: 3 2 # 3: 6 3 # 4: 1 4 # 5: 3 5 # 6: 6 6 # 7: 1 7 # 8: 3 8 # 9: 6 9 

若是按列名提取时,不使用列表,仍然能对列进行提取,只是结果以向量的形式输出。

DT[,v] # v column (as vector) # [1] 1 2 3 4 5 6 7 8 9 DT[,c(v)] # same # [1] 1 2 3 4 5 6 7 8 9 DT[, c(y, v)] # [1] 1 3 6 1 3 6 1 3 6 1 2 3 4 5 6 7 8 9 

列的添加与删除

列的添加

  • 单列添加
DT
    # x y v # 1: a 1 1 # 2: a 3 2 # 3: a 6 3 # 4: b 1 4 # 5: b 3 5 # 6: b 6 6 # 7: c 1 7 # 8: c 3 8 # 9: c 6 9 DT[, a := 'k'] DT # x y v a # 1: a 1 1 k # 2: a 3 2 k # 3: a 6 3 k # 4: b 1 4 k # 5: b 3 5 k # 6: b 6 6 k # 7: c 1 7 k # 8: c 3 8 k # 9: c 6 9 k DT[,c:=8] # add a numeric column, 8 for all rows DT # x y v a c # 1: a 1 1 k 8 # 2: a 3 2 k 8 # 3: a 6 3 k 8 # 4: b 1 4 k 8 # 5: b 3 5 k 8 # 6: b 6 6 k 8 # 7: c 1 7 k 8 # 8: c 3 8 k 8 # 9: c 6 9 k 8 DT[,d:=9L] # add an integer column, 9L for all rows DT[2,d:=10L] # subassign by reference to column d DT # x y v a c d # 1: a 1 1 k 8 9 # 2: a 3 2 k 8 10 # 3: a 6 3 k 8 9 # 4: b 1 4 k 8 9 # 5: b 3 5 k 8 9 # 6: b 6 6 k 8 9 # 7: c 1 7 k 8 9 # 8: c 3 8 k 8 9 # 9: c 6 9 k 8 9 DT[, e := d + 2] DT # x y v a c d e # 1: a 1 1 k 8 9 11 # 2: a 3 2 k 8 10 12 # 3: a 6 3 k 8 9 11 # 4: b 1 4 k 8 9 11 # 5: b 3 5 k 8 9 11 # 6: b 6 6 k 8 9 11 # 7: c 1 7 k 8 9 11 # 8: c 3 8 k 8 9 11 # 9: c 6 9 k 8 9 11 

若是添加的列名,数据中已经包含则是对这一列数据的修改。

  • 多列的添加
DT[, c('f', 'g') := list( d + 1, c)] DT[, ':='( f = d + 1, g = c)] # same DT # x y v a c d e f g # 1: a 1 1 k 8 9 11 10 8 # 2: a 3 2 k 8 10 12 11 8 # 3: a 6 3 k 8 9 11 10 8 # 4: b 1 4 k 8 9 11 10 8 # 5: b 3 5 k 8 9 11 10 8 # 6: b 6 6 k 8 9 11 10 8 # 7: c 1 7 k 8 9 11 10 8 # 8: c 3 8 k 8 9 11 10 8 # 9: c 6 9 k 8 9 11 10 8 

此处,须要注意的是新建立的列只能依照原有数据列,而不能依照新建立的列。例如这个例子中,G = C是能够运行,而摹= F则会提示错误。

列的删除

DT[,c:=NULL] # remove column c DT # x y v a d e f g # 1: a 1 1 k 9 11 10 8 # 2: a 3 2 k 10 12 11 8 # 3: a 6 3 k 9 11 10 8 # 4: b 1 4 k 9 11 10 8 # 5: b 3 5 k 9 11 10 8 # 6: b 6 6 k 9 11 10 8 # 7: c 1 7 k 9 11 10 8 # 8: c 3 8 k 9 11 10 8 # 9: c 6 9 k 9 11 10 8 DT[, c('d', 'e', 'f', 'g'):=NULL] DT # x y v a # 1: a 1 1 k # 2: a 3 2 k # 3: a 6 3 k # 4: b 1 4 k # 5: b 3 5 k # 6: b 6 6 k # 7: c 1 7 k # 8: c 3 8 k # 9: c 6 9 k 

列指标的简单操做

简单操做主要包括求和,平均值,方差和标准差等。

DT[2:3,sum(v)] # sum(v) over rows 2 and 3 # [1] 5 DT[2:3,mean(v)] # sum(v) over rows 2 and 3 # [1] 2.5 

索引键

查看和建立索引

索引是对列而言的,索引建立后,数据将自动按索引值进行从新排序,因此每一个数据最多只能有一个索引,可是索引能够由多列组成,这些列能够是数字,因子,字符串或其余格式。

单列索引的建立

## methdod first key(DT) # key # NULL setkey(DT,x) # set a 1-column key. No quotes, for convenience. key(DT) [1] "x" DT # x y v a # 1: a 1 1 k # 2: a 3 2 k # 3: a 6 3 k # 4: b 1 4 k # 5: b 3 5 k # 6: b 6 6 k # 7: c 1 7 k # 8: c 3 8 k # 9: c 6 9 k ## method second setkeyv(DT,"y") # same (v in setkeyv stands for vector) key(DT) # [1] "y" 

一旦对数据进行新的索引,原有的索引将消失。

多列索引的建立

## methdod first # key setkey(DT,x,v) # set a 1-column key. No quotes, for convenience. key(DT) # [1] "x" "v" DT # x y v a # 1: a 1 1 k # 2: a 3 2 k # 3: a 6 3 k # 4: b 1 4 k # 5: b 3 5 k # 6: b 6 6 k # 7: c 1 7 k # 8: c 3 8 k # 9: c 6 9 k ## method second setkeyv(DT,c("x", "y")) # same (v in setkeyv stands for vector) key(DT) # [1] "x" "v" DT # x y v a # 1: a 1 1 k # 2: a 3 2 k # 3: a 6 3 k # 4: b 1 4 k # 5: b 3 5 k # 6: b 6 6 k # 7: c 1 7 k # 8: c 3 8 k # 9: c 6 9 k 

经过索引进行数据的提取

按照索引对数据提取,能够加快提取数据的速度。

单索引

正向提取

setkey(DT, x)
DT["a"] # binary search (fast) # x y v a # 1: a 1 1 k # 2: a 3 2 k # 3: a 6 3 k DT[.(x=="a")] # same; i.e. binary search (fast) # x y v a # 1: a 1 1 k # 2: a 3 2 k # 3: a 6 3 k DT[x=="a"] # same; i.e. binary search (fast) # x y v a # 1: a 1 1 k # 2: a 3 2 k # 3: a 6 3 k 
  • 反向提取
DT[!.("a")] # not join # x y v a # 1: b 1 4 k # 2: b 3 5 k # 3: b 6 6 k # 4: c 1 7 k # 5: c 3 8 k # 6: c 6 9 k DT[!"a"] # same # x y v a # 1: b 1 4 k # 2: b 3 5 k # 3: b 6 6 k # 4: c 1 7 k # 5: c 3 8 k # 6: c 6 9 k DT[!2:4] # all rows other than 2:4 # x y v a # 1: a 1 1 k # 2: b 3 5 k # 3: b 6 6 k # 4: c 1 7 k # 5: c 3 8 k # 6: c 6 9 k 

多索引

  • 正向提取
setkey(DT, x, y)
# Mehtod First DT["a"] # join to 1st column of key # x y v a # 1: a 1 1 k # 2: a 3 2 k # 3: a 6 3 k DT[.("a")] # same, .() is an alias for list() # x y v a # 1: a 1 1 k # 2: a 3 2 k # 3: a 6 3 k DT[.("a",3)] # join to 2 columns # x y v a # 1: a 3 2 k DT[.("a",3:6)] # join 4 rows (2 missing) # x y v a # 1: a 3 2 k # 2: a 4 NA NA # 3: a 5 NA NA # 4: a 6 3 k DT[.("a",3:6),nomatch=0] # remove missing # x y v a # 1: a 3 2 k # 2: a 6 3 k DT[.("a",3:6),roll=TRUE] # rolling join (locf) # x y v a # 1: a 3 2 k # 2: a 4 2 k # 3: a 5 2 k # 4: a 6 3 k ## Method Second DT[J('a')] # x y v a # 1: a 1 1 k # 2: a 3 2 k # 3: a 6 3 k DT[J("a",3)] # binary search (fast) # x y v a # 1: a 3 2 k DT[J("a",3:6)] # same; i.e. binary search (fast) # x y v a # 1: a 3 2 k # 2: a 4 NA NA # 3: a 5 NA NA # 4: a 6 3 k DT[J("a",3:6), nomatch = 0] # x y v a # 1: a 3 2 k # 2: a 6 3 k DT[J("a",3:6), roll = T] # x y v a # 1: a 3 2 k # 2: a 4 2 k # 3: a 5 2 k # 4: a 6 3 k ## Method Third DT[list("a")] # x y v a # 1: a 1 1 k # 2: a 3 2 k # 3: a 6 3 k DT[list("a",3)] # x y v a # 1: a 3 2 k DT[list("a", 3:6)] # x y v a # 1: a 3 2 k # 2: a 4 NA NA # 3: a 5 NA NA # 4: a 6 3 k DT[list("a", 3:6), nomatch = 0] # x y v a # 1: a 3 2 k # 2: a 6 3 k DT[list("a", 3:6), roll = T] # x y v a # 1: a 3 2 k # 2: a 4 2 k # 3: a 5 2 k # 4: a 6 3 k 
  • 反向提取
DT[x!="b" | y!=3] # not yet optimized, currently vector scans # x y v a # 1: a 1 1 k # 2: a 3 2 k # 3: a 6 3 k # 4: b 1 4 k # 5: b 6 6 k # 6: c 1 7 k # 7: c 3 8 k # 8: c 6 9 k DT[!.("b",3)] # same result but much faster # x y v a # 1: a 1 1 k # 2: a 3 2 k # 3: a 6 3 k # 4: b 1 4 k # 5: b 6 6 k # 6: c 1 7 k # 7: c 3 8 k # 8: c 6 9 k 

分类汇总

分类汇老是指按某列的分类指标进行简单操做,借助由参数实现。此外,经过参数与索引相互没有影响这里。

单指标的分类汇总

  • 默认汇总名称
DT[,sum(v),by=x] # x V1 # 1: a 6 # 2: b 15 # 3: c 24 DT[,sum(v),by=y] # y V1 # 1: 1 12 # 2: 3 15 # 3: 6 18 
  • 自定义汇总名称
DT[,list(sum.v.x = sum(v)),by=x] # x sum.v.x # 1: a 6 # 2: b 15 # 3: c 24 DT[,list(sum.v.y = sum(v)),by=y] # y sum.v.y # 1: 1 12 # 2: 3 15 # 3: 6 18 DT[,sum.v.y := sum(v) ,by=y] # x y v a sum.v.y # 1: a 1 1 k 12 # 2: a 3 2 k 15 # 3: a 6 3 k 18 # 4: b 1 4 k 12 # 5: b 3 5 k 15 # 6: b 6 6 k 18 # 7: c 1 7 k 12 # 8: c 3 8 k 15 # 9: c 6 9 k 18 
  • 汇总结果与原始数据进行匹配
DT[,sum.v.y := sum(v) ,by=y] # x y v a sum.v.y # 1: a 1 1 k 12 # 2: a 3 2 k 15 # 3: a 6 3 k 18 # 4: b 1 4 k 12 # 5: b 3 5 k 15 # 6: b 6 6 k 18 # 7: c 1 7 k 12 # 8: c 3 8 k 15 # 9: c 6 9 k 18 

多指标的多个分类汇总

  • 默认汇总名称
DT[,list(mean(v),sum(v)),by=list(x,y)] # keyed by # x y V1 V2 # 1: a 1 1 1 # 2: a 3 2 2 # 3: a 6 3 3 # 4: b 1 4 4 # 5: b 3 5 5 # 6: b 6 6 6 # 7: c 1 7 7 # 8: c 3 8 8 # 9: c 6 9 9 
  • 自定义汇总名称
DT[,list(mean.v = mean(v),sum.v = sum(v)),by=list(x,y)] # keyed by # x y mean.v sum.v #1: a 1 1 1 #2: a 3 2 2 #3: a 6 3 3 #4: b 1 4 4 #5: b 3 5 5 #6: b 6 6 6 #7: c 1 7 7 #8: c 3 8 8 #9: c 6 9 9 
  • 汇总结果与原始数据进行匹配
DT[,c("mean.v", "sum.v.y") := list(mean(v),sum(v)) ,by=list(x,y)] # x y v a sum.v.y mean.v # 1: a 1 1 k 1 1 # 2: a 3 2 k 2 2 # 3: a 6 3 k 3 3 # 4: b 1 4 k 4 4 # 5: b 3 5 k 5 5 # 6: b 6 6 k 6 6 # 7: c 1 7 k 7 7 # 8: c 3 8 k 8 8 # 9: c 6 9 k 9 9 

data.table与data.frame的转化

data.table格式加快了处理速度,而data.frame则更为基础。二者的转化能够经过data.table(),setDT()和setDT()来实现,其中data.table()和setDT()函数能够将数据从data.frame转化为data.table,setDF()函数能够将数据从data.table转化为data.frame。注意使用data.table(),setDT()和setDT()时,参数自己的数据类型也会发生变化。

class(DT) # [1] "data.table" "data.frame" class(setDF(DT)) # [1] "data.frame" class(DT) # [1] "data.frame" 

此外,data.table包还能够与基础包中的重复的(),惟一的(),子()函数结合使用。不只如此,data.table包还有一些基础包的替代函数.rbind()升级版的rbindlist(),能够合并列数不一样和列位置不一样的数据。比dplyr包中安排()函数更快的setorder()排序函数。

来源于:http://xukuang.github.io/blog/2016/04/data-table-in-R/

相关文章
相关标签/搜索