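Abstract: This article was contributed by community user xrfinbj. It walks through importing data from a Hive data warehouse into Nebula Graph with the Exchange tool, along with the caveats encountered on the way.
My company has use cases for a graph database, and after an internal technology evaluation we settled on Nebula Graph. We still needed to verify its query performance in real business scenarios, so getting data into Nebula Graph was urgent. Along the way I found that the documentation for importing from Hive into Nebula Graph with Exchange was incomplete, so I am recording the pitfalls I hit in this process to give back to the community and save others the detours.
This article is mainly based on two posts I previously published on the forum:
Compilation produces the jar package.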
CREATE SPACE test_hive(partition_num=10, replica_factor=1); -- create the graph space; this example assumes only one replica is needed
USE test_hive; -- select the graph space test_hive
CREATE TAG tagA(idInt int, idString string, tboolean bool, tdouble double); -- create tag tagA
CREATE TAG tagB(idInt int, idString string, tboolean bool, tdouble double); -- create tag tagB
CREATE EDGE edgeAB(idInt int, idString string, tboolean bool, tdouble double); -- create edge type edgeAB
CREATE TABLE `tagA`(
  `id` bigint,
  `idInt` int,
  `idString` string,
  `tboolean` boolean,
  `tdouble` double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n';
insert into tagA select 1,1,'str1',true,11.11;
insert into tagA select 2,2,"str2",false,22.22;

CREATE TABLE `tagB`(
  `id` bigint,
  `idInt` int,
  `idString` string,
  `tboolean` boolean,
  `tdouble` double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n';
insert into tagB select 3,3,"str 3",true,33.33;
insert into tagB select 4,4,"str 4",false,44.44;

CREATE TABLE `edgeAB`(
  `id_source` bigint,
  `id_dst` bigint,
  `idInt` int,
  `idString` string,
  `tboolean` boolean,
  `tdouble` double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n';
insert into edgeAB select 1,3,5,"edge 1",true,55.55;
insert into edgeAB select 2,4,6,"edge 2",false,66.66;
Pay attention to how the exec, fields, nebula.fields, vertex, source, and target fields are mapped.
{
  # Spark relation config
  spark: {
    app: {
      name: Spark Writer
    }
    driver: {
      cores: 1
      maxResultSize: 1G
    }
    cores {
      max: 4
    }
  }

  # Nebula Graph relation config
  nebula: {
    address: {
      graph: ["192.168.1.110:3699"]
      meta: ["192.168.1.110:45500"]
    }
    user: user
    pswd: password
    space: test_hive
    connection {
      timeout: 3000
      retry: 3
    }
    execution {
      retry: 3
    }
    error: {
      max: 32
      output: /tmp/error
    }
    rate: {
      limit: 1024
      timeout: 1000
    }
  }

  # Processing tags
  tags: [
    # Loading from Hive
    {
      name: tagA
      type: {
        source: hive
        sink: client
      }
      exec: "select id,idint,idstring,tboolean,tdouble from nebula.taga"
      fields: [id,idstring,tboolean,tdouble]
      nebula.fields: [idInt,idString,tboolean,tdouble]
      vertex: id
      batch: 256
      partition: 10
    }
    {
      name: tagB
      type: {
        source: hive
        sink: client
      }
      exec: "select id,idint,idstring,tboolean,tdouble from nebula.tagb"
      fields: [id,idstring,tboolean,tdouble]
      nebula.fields: [idInt,idString,tboolean,tdouble]
      vertex: id
      batch: 256
      partition: 10
    }
  ]

  # Processing edges
  edges: [
    # Loading from Hive
    {
      name: edgeAB
      type: {
        source: hive
        sink: client
      }
      exec: "select id_source,id_dst,idint,idstring,tboolean,tdouble from nebula.edgeab"
      fields: [id_source,idstring,tboolean,tdouble]
      nebula.fields: [idInt,idString,tboolean,tdouble]
      source: id_source
      target: id_dst
      batch: 256
      partition: 10
    }
  ]
}
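To make the field mapping in the configuration above easier to follow, here is a minimal Python sketch of how one Hive row could be turned into an nGQL insert statement. It assumes that `fields` and `nebula.fields` are paired by position and that `vertex`, `source`, and `target` name the columns used as IDs. The helper names (`ngql_literal`, `vertex_statement`, `edge_statement`) are hypothetical and written only for this illustration; this is not Exchange's actual implementation.

```python
def ngql_literal(value):
    """Render a Python value as an nGQL literal (illustrative helper)."""
    # bool must be checked before anything numeric: bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    return str(value)


def _props_and_values(row, fields, nebula_fields):
    """Pair Hive columns with Nebula properties by position (an assumption)."""
    if len(fields) != len(nebula_fields):
        raise ValueError("fields and nebula.fields must have the same length")
    props = ", ".join(nebula_fields)
    values = ", ".join(ngql_literal(row[column]) for column in fields)
    return props, values


def vertex_statement(tag, row, fields, nebula_fields, vertex):
    """Build an INSERT VERTEX statement; `vertex` names the ID column."""
    props, values = _props_and_values(row, fields, nebula_fields)
    return f"INSERT VERTEX {tag}({props}) VALUES {row[vertex]}:({values});"


def edge_statement(edge, row, fields, nebula_fields, source, target):
    """Build an INSERT EDGE statement; `source`/`target` name the ID columns."""
    props, values = _props_and_values(row, fields, nebula_fields)
    return (
        f"INSERT EDGE {edge}({props}) "
        f"VALUES {row[source]}->{row[target]}:({values});"
    )


# First tagA row as returned by the `exec` query (Hive lowercases column names).
tag_row = {"id": 1, "idint": 1, "idstring": "str1", "tboolean": True, "tdouble": 11.11}
tag_sql = vertex_statement(
    "tagA", tag_row,
    fields=["id", "idstring", "tboolean", "tdouble"],
    nebula_fields=["idInt", "idString", "tboolean", "tdouble"],
    vertex="id",
)

# First edgeAB row as returned by the `exec` query.
edge_row = {"id_source": 1, "id_dst": 3, "idint": 5,
            "idstring": "edge 1", "tboolean": True, "tdouble": 55.55}
edge_sql = edge_statement(
    "edgeAB", edge_row,
    fields=["id_source", "idstring", "tboolean", "tdouble"],
    nebula_fields=["idInt", "idString", "tboolean", "tdouble"],
    source="id_source", target="id_dst",
)

print(tag_sql)
print(edge_sql)
```

Under this positional reading, the configuration as written fills the `idInt` property from the `id` column (or `id_source` for edges), because `idint` is selected by `exec` but not listed in `fields`. If the intent is to store the Hive `idint` column, listing `idint` as the first entry of `fields` would be the change to try.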
spark-submit --class com.vesoft.nebula.tools.importer.Exchange --master "local[4]" /xxx/exchange-1.0.1.jar -c /xxx/nebula_application.conf -h
./db_dump --mode=stat --space=xxx --db_path=/home/xxx/data/storage0/nebula --limit 20000000
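Note: Exchange does not yet support Spark 3. It fails at runtime after compilation, so I could not verify it in a Spark 3 environment.
The Spark debugging part follows this blog post: https://dzone.com/articles/how-to-attach-a-debugger-to-apache-spark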
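Studying and debugging the Exchange source code deepens your understanding of how Exchange works, and it also exposes places where the documentation is unclear. For example, for importing SST files and for Download and Ingest, only reading the source alongside the docs reveals where the documentation is vague or its logic is loose.
Debugging the source also helps catch simple parameter configuration problems.
On to the main topic: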
Step 1:
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000
Step 2:
spark-submit --class com.vesoft.nebula.tools.importer.Exchange --master "local" /xxx/exchange-1.1.0.jar -c /xxx/nebula_application.conf -h
Listening for transport dt_socket at address: 4000
Step 3: Configure IDEA
Step 4: Click Debug in IDEA
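The `SPARK_SUBMIT_OPTS` value exported in Step 1 packs several standard JDWP agent options into one string. As a small sketch, the helper below (a hypothetical function named `jdwp_opts`, not part of Spark or Exchange) rebuilds that string from named parameters so each option's role is visible.

```python
def jdwp_opts(port=4000, suspend=True, server=True):
    """Compose a JDWP agent string for SPARK_SUBMIT_OPTS (illustrative helper)."""
    options = {
        # dt_socket: the debugger talks to the JVM over a TCP socket.
        "transport": "dt_socket",
        # server=y: the JVM listens and waits for the IDE to attach.
        "server": "y" if server else "n",
        # suspend=y: the JVM pauses at startup until a debugger attaches,
        # so breakpoints early in the Exchange code are not missed.
        "suspend": "y" if suspend else "n",
        # address: the port the JVM listens on; the IDE must use the same one.
        "address": str(port),
    }
    pairs = ",".join(f"{key}={value}" for key, value in options.items())
    return "-agentlib:jdwp=" + pairs


def debug_env(base_env, port=4000):
    """Return a copy of an environment dict with SPARK_SUBMIT_OPTS set."""
    env = dict(base_env)
    env["SPARK_SUBMIT_OPTS"] = jdwp_opts(port=port)
    return env


print(jdwp_opts())
```

With the defaults, the result matches the value exported in Step 1. Because of `suspend=y`, `spark-submit` prints the "Listening for transport dt_socket" line and then waits, which is why Step 4 has to happen before the job makes progress. The port in the IDEA remote debug configuration from Step 3 has to match `address`.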
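Thanks to vesoft for providing Nebula Graph, the most performant graph database in the universe, which solves many real business problems. The pain along the way was nothing (judging by earlier community talks, what the 360 DigiTech team went through was the real pain). Every problem I ran into got timely feedback and answers from the community, so thanks again.
I am looking forward to Exchange supporting Nebula Graph 2.0.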