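There are currently several main ways to insert data (adapted from the article "如何将大规模数据导入Neo4j", How to import large-scale data into Neo4j):
Cypher CREATE statements: write one CREATE per record.
Cypher LOAD CSV statements: convert the data to CSV and read it with LOAD CSV.
The official Java API: Batch Inserter.
The community-written Batch Import tool.
The official neo4j-import tool.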
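This post focuses on the fastest official option, neo4j-import.
Prerequisites for using it:
Best suited to:
A first-time import; it cannot apply incremental updates.
Now for the official example: Use the Import tool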
bin\neo4j start
bin\neo4j stop
bin\neo4j restart
bin\neo4j status
neo4j-admin parameters: memory control
Source: 10.5. Memory recommendations
neo4j-admin memrec [--memory=<memory dedicated to Neo4j>] [--database=<name>]
Option | Default | Description |
---|---|---|
--memory | The memory capacity of the machine | The amount of memory to allocate to Neo4j. Valid units are: k, K, m, M, g, G. |
--database | graph.db | The name of the database. This option will generate numbers for Lucene indexes, and for data volume and native indexes in the database. These can be used as an input into more detailed memory analysis. |
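For reference, there is also --pagecache, which sets the cache for a single command.
That is, the cache setting applies only while that particular import command runs.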
neo4j-admin parameters: Dump and load databases (offline backup)
Both steps require the database to be shut down. Reference: 10.7. Dump and load databases
Dumping graph.db to a .dump file requires the database to be stopped:
$neo4j-home> bin/neo4j-admin dump --database=graph.db --to=/backups/graph.db/2016-10-02.dump
$neo4j-home> ls /backups/graph.db
$neo4j-home> 2016-10-02.dump
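Loading a .dump back in: it might seem the database need not be stopped for this, but the official example below stops Neo4j before running load, in line with the note above that both steps require a shutdown.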
$neo4j-home> bin/neo4j stop
Stopping Neo4j.. stopped
$neo4j-home> bin/neo4j-admin load --from=/backups/graph.db/2016-10-02.dump --database=graph.db --force
With --force, load replaces whatever database already exists under that name (any existing database gets overwritten).
neo4j-admin parameters: backup and restore (online backup)
$neo4j-home> export HEAP_SIZE=2G
$neo4j-home> mkdir /mnt/backup
$neo4j-home> bin/neo4j-admin backup --from=192.168.1.34 --backup-dir=/mnt/backup --name=graph.db-backup --pagecache=4G
The backup is written into the temporary folder.
$neo4j-home> export HEAP_SIZE=2G
$neo4j-home> bin/neo4j-admin backup --from=192.168.1.34 --backup-dir=/mnt/backup --name=graph.db-backup --fallback-to-full=true --check-consistency=true --pagecache=4G
movies.csv.
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
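Here title is a property; double quotes are only strictly needed when a value contains the delimiter, but quoting strings is a safe habit. year:int is also a property, typed as an integer.
:ID marks the column that uniquely identifies each node, and :LABEL assigns one or more labels to that same node, with several labels separated by ; (for example Movie;Sequel). A label does not create a second node; one row always produces one node.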
actors.csv.
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
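Here :LABEL is worth a closer look: it is attached to the node rather than identifying it. personId:ID must be unique, while :LABEL values may repeat across rows.
After loading, each distinct label appears only once in the database (labels are deduplicated) and groups all the nodes that carry it; it does not become a node of its own.
roles.csv.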
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
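Here role has no : type annotation, so it is stored as a string property of the relationship; its values may be double-quoted or not. It is better to state the type explicitly, for example :int for integers, or roles:string[] for an array of strings.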
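To make the header conventions above concrete, here is a minimal Python sketch that splits header fields such as year:int or :START_ID(Actor-ID) into a name, a type or role, and an optional ID space. The names HeaderField, parse_header_field and parse_header are hypothetical helpers invented for illustration; this is not how the import tool itself parses headers, and it assumes header names contain no colons or quoted delimiters.

```python
import re
from typing import List, NamedTuple, Optional


class HeaderField(NamedTuple):
    name: str                # property name; empty for fields such as :LABEL
    kind: str                # text after the colon; "string" when omitted
    id_space: Optional[str]  # text in parentheses, e.g. "Movie-ID"


# A name, then optionally ":kind", then optionally "(id space)".
_FIELD = re.compile(
    r"^(?P<name>[^:]*)(?::(?P<kind>[^()]+)(?:\((?P<space>[^)]*)\))?)?$"
)


def parse_header_field(field: str) -> HeaderField:
    """Split one header field into its name, kind and ID space."""
    match = _FIELD.match(field.strip())
    if match is None:
        raise ValueError(f"unrecognised header field: {field!r}")
    return HeaderField(match["name"], match["kind"] or "string", match["space"])


def parse_header(line: str, delimiter: str = ",") -> List[HeaderField]:
    """Parse a whole header line (assumes no quoted delimiters in it)."""
    return [parse_header_field(f) for f in line.split(delimiter)]


if __name__ == "__main__":
    for field in parse_header("movieId:ID,title,year:int,:LABEL"):
        print(field)
```

Running it on the movies.csv header shows that title falls back to the string type, while :LABEL has an empty property name because it describes the node itself rather than one of its properties.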
Run on Linux:
neo4j_home$ bin/neo4j-admin import --nodes import/movies.csv --nodes import/actors.csv --relationships import/roles.csv
Older versions used neo4j-import for bulk import; current versions use neo4j-admin import.
Run on Windows:
neo4j-import.bat --into ../data/databases/graph.db --id-type string --nodes:attribute ../import/node_attribute.csv --relationships ../import/product_SecondLeaf.csv --relationships ../import/scene_isDemond.csv
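--into names the database directory to write to; the name can be changed between attempts.
--nodes:attribute: the part after nodes: is the label given to every node in that file.
--id-type string: "The --id-type string is indicating that all :ID columns contain alphanumeric values (there is an optimization for numeric-only id's)." With this setting, node IDs may mix letters and digits instead of being digits only.
Finally, start the server on Linux: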
./bin/neo4j start
Finally, start the server on Windows:
neo4j.bat console
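1 Error messages are saved in bad.log:
\data\databases\graph.db\bad.log
An error mentioning global id space means that a node is undefined or duplicated.
2 If node IDs are not unique, the import fails immediately with a global id space error and the rest of the import is aborted. Delete data/databases/graph.db and run the whole import again.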
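Mainly based on: B.2. Use the Import tool
If the data to import uses other delimiters and quote characters, for example:
:START_ID;role;:END_ID;:TYPE
keanu;'Neo';tt0133093;ACTED_IN
keanu;'Neo';tt0234215;ACTED_IN
then they can be specified with --delimiter, together with --array-delimiter and --quote: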
neo4j_home$ bin/neo4j-admin import --nodes import/movies2.csv --nodes import/actors2.csv --relationships import/roles2.csv --delimiter ";" --array-delimiter "|" --quote "'"
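As a rough illustration of what the three options mean, the sketch below reads the same kind of data with Python's standard csv module, using ; as the field delimiter and ' as the quote character, and splits an array value on |. ROLES2, read_rows and split_array are hypothetical names made up for this example, and the sample rows are the two shown above rather than the real roles2.csv.

```python
import csv
import io
from typing import List

# Two sample rows in the semicolon-delimited, single-quoted style.
ROLES2 = """:START_ID;role;:END_ID;:TYPE
keanu;'Neo';tt0133093;ACTED_IN
keanu;'Neo';tt0234215;ACTED_IN
"""


def read_rows(text: str, delimiter: str = ";", quote: str = "'") -> List[List[str]]:
    """Parse CSV text with a custom field delimiter and quote character."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quote)
    return [row for row in reader if row]


def split_array(value: str, array_delimiter: str = "|") -> List[str]:
    """Split one array-typed value into its elements."""
    return value.split(array_delimiter)


if __name__ == "__main__":
    for row in read_rows(ROLES2):
        print(row)
    print(split_array("Movie|Sequel"))
```

Checking a sample of the file this way before a long import is a cheap way to confirm that the delimiter and quote settings passed on the command line match the data.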
movies5a.csv.
movieId:ID,title,year:int
tt0133093,"The Matrix",1999
sequels5a.csv.
movieId:ID,title,year:int
tt0234215,"The Matrix Reloaded",2003
tt0242653,"The Matrix Revolutions",2003
actors5a.csv.
personId:ID,name
keanu,"Keanu Reeves"
laurence,"Laurence Fishburne"
carrieanne,"Carrie-Anne Moss"
Command:
neo4j_home$ bin/neo4j-admin import --nodes:Movie import/movies5a.csv --nodes:Movie:Sequel import/sequels5a.csv --nodes:Actor import/actors5a.csv
At run time, --nodes:Movie gives every node in movies5a.csv the label Movie; --nodes:Movie:Sequel gives every node in sequels5a.csv two labels, Movie and Sequel.
roles5b.csv.
:START_ID,role,:END_ID
keanu,"Neo",tt0133093
keanu,"Neo",tt0234215
keanu,"Neo",tt0242653
laurence,"Morpheus",tt0133093
laurence,"Morpheus",tt0234215
laurence,"Morpheus",tt0242653
carrieanne,"Trinity",tt0133093
Command:
neo4j_home$ bin/neo4j-admin import --relationships:ACTED_IN import/roles5b.csv
Here :ACTED_IN sets the relationship type to ACTED_IN for every row in the file, and role is stored as a property of the relationship.
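When labels and relationship types are supplied on the command line like this, the argument list gets long quickly. The following is a small sketch, under the assumption that only the --nodes[:Label...] and --relationships[:TYPE] forms shown above are needed, of how such a command could be assembled in Python. import_command is a hypothetical helper; it only builds the argument list and does not run anything.

```python
from typing import List, Sequence, Tuple

# (labels or relationship type, CSV files loaded under those names)
Source = Tuple[Sequence[str], Sequence[str]]


def import_command(
    nodes: Sequence[Source],
    relationships: Sequence[Source] = (),
    executable: str = "bin/neo4j-admin",
) -> List[str]:
    """Build the argument list for a neo4j-admin import call."""
    args = [executable, "import"]
    for flag, sources in (("--nodes", nodes), ("--relationships", relationships)):
        for names, files in sources:
            # ["Movie", "Sequel"] becomes the suffix ":Movie:Sequel"
            suffix = "".join(":" + name for name in names)
            # Several files for one source are joined with commas.
            args += [flag + suffix, ",".join(files)]
    return args


if __name__ == "__main__":
    command = import_command(
        nodes=[
            (["Movie"], ["import/movies5a.csv"]),
            (["Movie", "Sequel"], ["import/sequels5a.csv"]),
            (["Actor"], ["import/actors5a.csv"]),
        ]
    )
    print(" ".join(command))
```

The resulting list can be handed to subprocess.run from the Neo4j home directory; keeping it as a list rather than a single string avoids shell quoting problems with file names.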
Node data, header file: movies4-header.csv.
movieId:ID,title,year:int,:LABEL
Node data, content part 1: movies4-part1.csv.
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
Node data, content part 2: movies4-part2.csv.
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
Relationship data, header file: roles4-header.csv.
:START_ID,role,:END_ID,:TYPE
Relationship data, content part 1: roles4-part1.csv.
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
Relationship data, content part 2: roles4-part2.csv.
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
Run:
neo4j_home$ bin/neo4j-admin import --nodes "import/movies4-header.csv,import/movies4-part1.csv,import/movies4-part2.csv" --relationships "import/roles4-header.csv,import/roles4-part1.csv,import/roles4-part2.csv"
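The header and the content are kept in separate files and imported in chunks, listed in the order: header, content part 1, content part 2.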
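The layout above can be produced from a single CSV with a few lines of Python. This is a minimal sketch, assuming no quoted field contains a line break; split_csv and multi_file_argument are hypothetical helper names, and MOVIES simply repeats the movies.csv sample from earlier.

```python
from typing import List, Sequence, Tuple

MOVIES = """movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
"""


def split_csv(text: str, rows_per_part: int) -> Tuple[str, List[str]]:
    """Return the header line and the data rows cut into fixed-size parts."""
    if rows_per_part < 1:
        raise ValueError("rows_per_part must be at least 1")
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty CSV")
    header = lines[0]
    rows = [line for line in lines[1:] if line.strip()]
    parts = [
        "\n".join(rows[i:i + rows_per_part]) + "\n"
        for i in range(0, len(rows), rows_per_part)
    ]
    return header + "\n", parts


def multi_file_argument(header_path: str, part_paths: Sequence[str]) -> str:
    """Join the header file and its parts into one comma-separated argument."""
    return ",".join([header_path, *part_paths])


if __name__ == "__main__":
    header, parts = split_csv(MOVIES, rows_per_part=2)
    print(header, end="")
    for number, part in enumerate(parts, start=1):
        print(f"part {number}:")
        print(part, end="")
```

With rows_per_part=2 the sample splits into the same two parts as movies4-part1.csv and movies4-part2.csv, and multi_file_argument yields the quoted value passed to --nodes in the command above.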
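This situation comes up fairly often: two node sets use the same ID values. Unless each set is given its own ID space, the import reports an error.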
movies7.csv.
movieId:ID(Movie-ID),title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel
Here (Movie-ID) names the ID space that the IDs in this file belong to.
actors7.csv.
personId:ID(Actor-ID),name,:LABEL
1,"Keanu Reeves",Actor
2,"Laurence Fishburne",Actor
3,"Carrie-Anne Moss",Actor
roles7.csv.
:START_ID(Actor-ID),role,:END_ID(Movie-ID)
1,"Neo",1
1,"Neo",2
1,"Neo",3
2,"Morpheus",1
2,"Morpheus",2
2,"Morpheus",3
3,"Trinity",1
3,"Trinity",2
3,"Trinity",3
Run:
neo4j_home$ bin/neo4j-admin import --nodes import/movies7.csv --nodes import/actors7.csv --relationships:ACTED_IN import/roles7.csv
In the relationship file, :START_ID(Actor-ID) and :END_ID(Movie-ID) state which ID space each end of the relationship is looked up in.
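The effect of ID spaces can be checked before running the import. The sketch below, written as an illustration rather than as a description of the tool's internals, keys every node by the pair (ID space, ID) and reports pairs that occur more than once. ids_with_space and collisions are hypothetical helpers, and the label "global" for files without an explicit ID space is borrowed from the wording of the bad.log messages.

```python
import csv
import io
import re
from typing import List, Tuple

MOVIES7 = """movieId:ID(Movie-ID),title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel
"""

ACTORS7 = """personId:ID(Actor-ID),name,:LABEL
1,"Keanu Reeves",Actor
2,"Laurence Fishburne",Actor
3,"Carrie-Anne Moss",Actor
"""

# Matches "name:ID" or "name:ID(space)" and captures the space if present.
_ID_FIELD = re.compile(r"[^:]*:ID(?:\(([^)]*)\))?")


def ids_with_space(csv_text: str) -> List[Tuple[str, str]]:
    """Return (id space, id) for every row of one node file."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    for column, field in enumerate(header):
        match = _ID_FIELD.fullmatch(field)
        if match:
            space = match.group(1) or "global"
            break
    else:
        raise ValueError("no :ID column in header")
    return [(space, row[column]) for row in reader if row]


def collisions(*csv_texts: str) -> List[Tuple[str, str]]:
    """List every (id space, id) pair seen more than once across files."""
    seen = set()
    repeated = []
    for text in csv_texts:
        for key in ids_with_space(text):
            if key in seen:
                repeated.append(key)
            seen.add(key)
    return repeated


if __name__ == "__main__":
    print(collisions(MOVIES7, ACTORS7))
    print(collisions(MOVIES7.replace("(Movie-ID)", ""),
                     ACTORS7.replace("(Actor-ID)", "")))
```

With the ID spaces in place the two files do not clash; once the (Movie-ID) and (Actor-ID) suffixes are removed, IDs 1, 2 and 3 all collide in the single global space.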
When a bad relationship appears:
roles8a.csv.
:START_ID,role,:END_ID,:TYPE
carrieanne,"Trinity",tt0242653,ACTED_IN
emil,"Emil",tt0133093,ACTED_IN
For example, the last row refers to a node, emil, that is not defined in any node file. In that case run:
neo4j_home$ bin/neo4j-admin import --nodes import/movies8a.csv --nodes import/actors8a.csv --relationships import/roles8a.csv --ignore-missing-nodes
--ignore-missing-nodes skips relationships that refer to missing nodes instead of aborting; the details are recorded in bad.log:
InputRelationship:
source: roles8a.csv:11
properties: [role, Emil]
startNode: emil (global id space)
endNode: tt0133093 (global id space)
type: ACTED_IN
referring to missing node emil
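The same problem can be caught before the import starts. Below is a minimal sketch that collects the node IDs from the node files and lists relationship rows whose start or end ID is unknown. It assumes a single global ID space, headers of the form name:ID, :START_ID and :END_ID, and uses hypothetical helper names (node_ids, dangling_relationships); the sample data is the shortened excerpt from this post, so its line numbers differ from those in the official bad.log output.

```python
import csv
import io
from typing import List, Set, Tuple

MOVIES = """movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
"""

ACTORS = """personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
"""

ROLES8A = """:START_ID,role,:END_ID,:TYPE
carrieanne,"Trinity",tt0242653,ACTED_IN
emil,"Emil",tt0133093,ACTED_IN
"""


def node_ids(csv_text: str) -> Set[str]:
    """Collect the values of the :ID column of one node file."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    id_col = next(i for i, field in enumerate(header) if field.endswith(":ID"))
    return {row[id_col] for row in reader if row}


def dangling_relationships(
    rel_csv_text: str, known_ids: Set[str]
) -> List[Tuple[int, List[str]]]:
    """Return (line number, missing ids) for rows pointing at unknown nodes."""
    reader = csv.reader(io.StringIO(rel_csv_text))
    header = next(reader)
    start, end = header.index(":START_ID"), header.index(":END_ID")
    problems = []
    for row in reader:
        if not row:
            continue
        missing = [value for value in (row[start], row[end]) if value not in known_ids]
        if missing:
            # reader.line_num is the 1-based line of the row just read
            problems.append((reader.line_num, missing))
    return problems


if __name__ == "__main__":
    known = node_ids(MOVIES) | node_ids(ACTORS)
    print(dangling_relationships(ROLES8A, known))
```

Rows reported here are the ones that --ignore-missing-nodes would skip, so the list is a quick way to decide whether to fix the data or accept the loss.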
actors8b.csv.
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
laurence,"Laurence Harvey",Actor
The node file actors8b.csv contains a duplicated node ID, laurence. In that case run:
neo4j_home$ bin/neo4j-admin import --nodes import/actors8b.csv --ignore-duplicate-nodes
--ignore-duplicate-nodes makes the import skip duplicated nodes instead of failing; the duplicate is still reported in bad.log:
Id 'laurence' is defined more than once in global id space, at least at actors8b.csv:3 and actors8b.csv:5
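Duplicates like this can also be found ahead of time. The following sketch groups the rows of one node file by ID and keeps the IDs that occur on more than one line. duplicate_ids is a hypothetical helper written for this post, and it assumes the ID column header ends in :ID, optionally followed by an ID space in parentheses.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

ACTORS8B = """personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
laurence,"Laurence Harvey",Actor
"""


def duplicate_ids(csv_text: str) -> Dict[str, List[int]]:
    """Map each repeated ID to the file lines on which it appears."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    id_col = next(
        i for i, field in enumerate(header)
        if field.split("(")[0].endswith(":ID")
    )
    lines = defaultdict(list)
    for row in reader:
        if row:
            # reader.line_num counts the header as line 1
            lines[row[id_col]].append(reader.line_num)
    return {node_id: where for node_id, where in lines.items() if len(where) > 1}


if __name__ == "__main__":
    print(duplicate_ids(ACTORS8B))
```

For the sample file this points at lines 3 and 5, the same two locations named in the bad.log message above.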