Implementing a Hive UDF in Python

Date: 2019-11-16

Workflow

The process has two parts: a Python script that implements the desired functionality, and an HQL part that calls the Python script to process the data.

The Python part

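When HQL calls a UDF implemented in Python, what actually happens is a redirection: the selected columns of the table are redirected to the Python script's standard input, and the script processes them line by line. First split each line on the delimiter, usually '\t'. Then do whatever processing is needed and print the required columns, separated by '\t'.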
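As a rough sketch of the row-by-row pattern described above, the script below reads tab-separated rows, processes each one, and emits tab-separated rows. The names `process_row`, `transform_rows`, and `main`, as well as the upper-casing step, are illustrative assumptions and not part of Hive.

```python
import sys


def process_row(fields):
    """Hypothetical processing step: upper-case the second column."""
    return [fields[0], fields[1].upper()]


def transform_rows(lines, sep="\t"):
    """Split each input row on sep, process it, and yield the output row."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines
        fields = line.split(sep)
        yield sep.join(process_row(fields))


def main():
    # The selected columns arrive on stdin; each printed line is one output row.
    for row in transform_rows(sys.stdin):
        print(row)
```

Saved as a script, it would end with a call to `main()`. Keeping the row logic in `transform_rows` means it can be checked on a plain list of strings without running Hive.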

Example:

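import sys

# Maps each shop id (first column) to the list of its second-column values.
ans = {}

for line in sys.stdin:
    # Split on the tab delimiter; a bare split() would also break on spaces.
    fields = line.rstrip('\n').split('\t')
    shopid = fields[0]
    ans.setdefault(shopid, []).append(fields[1])

# Emit one row per shop: the shop id, a tab, then the collected list.
for shop in ans:
    print('%s\t%s' % (shop, ans[shop]))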

The HQL part

This part is mainly just the call:

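-- First, add the Python file
add file pythonfile_location;
-- Then pass the columns to be processed to transform(...)
select transform(columns_to_process)
using "python filename"
as (newname)
-- newname is the alias for the output column(s)
-- table_name is a placeholder for the source table
from table_name;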

Note: when using transform, you cannot select other columns alongside it.
For example:

select a,transform(b,c)
using "python udf.py"
as(d,e)
from table1
where hp_statdate='2016-05-10'

This is wrong: a cannot be selected next to transform. If you need a, pass it into transform as well and have the script output it unchanged.
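A minimal sketch of that workaround, assuming the query is changed to pass all three columns, for example transform(a,b,c) ... as (a,d,e). The script copies a through untouched and derives d and e from b and c. The `derive` function is a hypothetical stand-in for the real logic, and `passthrough_rows` and `main` are illustrative names.

```python
import sys


def derive(b, c):
    """Hypothetical stand-in for the real logic that turns b, c into d, e."""
    d = b.strip().lower()
    e = str(len(c))
    return d, e


def passthrough_rows(lines, sep="\t"):
    """Yield rows a, d, e: a is copied unchanged, d and e are derived."""
    for line in lines:
        # Assumes exactly three input columns per row: a, b, c.
        a, b, c = line.rstrip("\n").split(sep)
        d, e = derive(b, c)
        yield sep.join([a, d, e])


def main():
    # Input rows arrive on stdin; each printed line is one output row.
    for row in passthrough_rows(sys.stdin):
        print(row)
```

Saved as udf.py, the script would end with a call to `main()`.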