Amazon Athena Study Notes

Date: 2021-01-19

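Amazon Athena Overview

What is Athena, in brief? Keywords:

Interactive query service
Ad-hoc queries
Supports standard SQL
Tables are defined over data stored in S3 (similar to Hive)
Fast responses (on the order of seconds)
Serverless
Supports connections through JDBC and the Java API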

Amazon Athena is an interactive query service that lets you use standard SQL to analyze data directly in Amazon S3. You can point Athena at your data in Amazon S3 and run ad-hoc queries and get results in seconds. Athena is serverless, so there is no infrastructure to set up or manage. You pay only for the queries you run. Athena scales automatically, executing queries in parallel, so results are fast, even with large datasets and complex queries.

If you connect to Athena using the JDBC driver, use version 1.1.0 of the driver or later with the Amazon Athena API. Earlier version drivers do not support the API. For more information and to download the driver, see Accessing Amazon Athena with JDBC.

For code samples using the AWS SDK for Java, see Examples and Code Samples.
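The notes only mention JDBC and the Java SDK. As a rough illustration of the same API flow (start a query, poll its state, read the results), here is a minimal sketch in Python that assumes a boto3 Athena client is passed in. The function name run_query and its parameters are my own, and the bucket in the usage note is hypothetical.

```python
import time


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait for it to finish, and return rows as lists of strings."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"query {query_id} ended in state {state}")

    # Only the first page of results is read here; for SELECT queries
    # the first row holds the column names.
    result = client.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

With real credentials, the client would come from boto3.client("athena"), and a call might look like run_query(client, "SELECT 1", "default", "s3://example-bucket/athena-results/"), where the output location is a placeholder.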

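Athena naming rules for databases, tables, and columns

Database names, table names, and column names must be lowercase
The underscore "_" is the only special character supported
If a name starts with "_", it must be enclosed in backticks (``)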
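The three rules above can be checked mechanically. The following is a small sketch, not an official validator: the helper names are made up, and allowing digits is my assumption, since the notes only talk about lowercase letters and "_".

```python
import re

# Lowercase letters and "_" per the rules above; digits are an assumption.
_VALID_NAME = re.compile(r"[a-z0-9_]+")


def is_valid_athena_name(name: str) -> bool:
    """Return True if the name uses only lowercase letters, digits, and "_"."""
    return _VALID_NAME.fullmatch(name) is not None


def quote_if_needed(name: str) -> str:
    """Wrap the name in backticks when it starts with an underscore."""
    if not is_valid_athena_name(name):
        raise ValueError(f"unsupported name: {name!r}")
    return f"`{name}`" if name.startswith("_") else name
```

For example, quote_if_needed("_tmp") gives the backtick-quoted form, while "self_learning_old" passes through unchanged and "Self_Learning" is rejected.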

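Creating Athena tables and loading data

1. The data is already in S3: create the Athena table and use the LOCATION clause to point it at the data on S3

NOTE: it looks like the table has to be created as an external table for this to work; to be verified later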

CREATE EXTERNAL TABLE IF NOT EXISTS default.self_learning_old(rowkey STRING,windspd INT,directh INT,directv INT,func STRING,value INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://com.kong.bp.cn.test/test_folder/'

2. Create a partitioned table from an existing table (demo)

CREATE table self_learning
WITH (format='PARQUET',
parquet_compression='SNAPPY',
partitioned_by=array['year'],
external_location = 's3://com.kong.bp.cn.test/test_folder/self_learning_old/')
AS
SELECT
       windspd,
       directh,
       directv,
       func,
       value,
       cast(substr(split(rowkey,':')[2],1,4) AS bigint) as year
FROM default.self_learning_old
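The year expression in the query above is easy to misread because SQL array subscripts and substr positions are 1-based. A Python sketch of the same steps follows. The rowkey layout in the usage note is hypothetical, since the notes do not show what a real rowkey looks like.

```python
def year_from_rowkey(rowkey: str) -> int:
    """Mirror cast(substr(split(rowkey, ':')[2], 1, 4) AS bigint)."""
    parts = rowkey.split(":")
    second = parts[1]       # SQL's [2] is the second element, i.e. Python's [1]
    return int(second[:4])  # substr(x, 1, 4) keeps the first four characters
```

With a made-up rowkey such as "station01:20210119093000", the second field is "20210119093000" and the function returns 2021, which becomes the partition value.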

Querying JSON data with Athena

For loading JSON data in Athena, see "Querying JSON" in the documentation

Sample JSON data:

{
 "name": "Bob Smith",
 "org": "engineering",
 "projects": [{
  "name": "project1",
  "completed": false
 }, {
  "name": "project2",
  "completed": true
 }]
}

1. Parse the data with the json_extract function:

WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},
{"name":"project2", "completed":true}]}'
AS blob
)
SELECT
json_extract(blob, '$.name') AS name,
json_extract(blob, '$.projects') AS projects
FROM dataset

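Result: name comes back as the JSON string "Susan Smith" (quotes included), and projects as the JSON array [{"name":"project1","completed":false},{"name":"project2","completed":true}].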

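2. Using the json_extract_scalar function

json_extract_scalar is similar to json_extract, but it returns only scalar values (Boolean, number, or string).

NOTE: this function does not work on arrays, maps, or structs. As I understand it, "scalar" here refers to those basic data types.

For example, using json_extract_scalar to pull out the corresponding values: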

WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT
json_extract_scalar(blob, '$.name') AS name,
json_extract_scalar(blob, '$.projects') AS projects
FROM dataset

Query result:

+---------------------------+
| name        | projects    |
+---------------------------+
| Susan Smith |             |
+---------------------------+

Because projects in the JSON is an array, json_extract_scalar cannot handle it, so that column comes back empty
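To make the scalar-versus-array behaviour concrete, here is a rough Python analogue. It is only a sketch: extract_scalar is a made-up helper that looks up a single top-level key rather than evaluating a JSONPath, and it returns None where the query above shows an empty cell.

```python
import json

BLOB = (
    '{"name": "Susan Smith", "org": "engineering", '
    '"projects": [{"name":"project1", "completed":false},'
    '{"name":"project2", "completed":true}]}'
)


def extract_scalar(blob: str, key: str):
    """Return the value as text if it is a scalar, otherwise None."""
    value = json.loads(blob).get(key)
    if value is None or isinstance(value, (dict, list)):
        return None  # objects and arrays are not scalars
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


name = extract_scalar(BLOB, "name")          # a string, so it is returned
projects = extract_scalar(BLOB, "projects")  # an array, so nothing is returned
```

Here name holds "Susan Smith" without the surrounding quotes and projects is None, matching the table above.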

3. Using the json_array_get function

For array values like this one, you can use the json_array_get function, for example:

WITH dataset AS (
SELECT '{"name": "Bob Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT json_array_get(json_extract(blob, '$.projects'), 0) AS item
FROM dataset

First use json_extract to get the projects entry, which is an array, then use json_array_get to fetch an element by its index. Result:

+---------------------------------------+
| item                                  |
+---------------------------------------+
| {"name":"project1","completed":false} |
+---------------------------------------+
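The two-step pattern (extract the array, then index into it) can be mimicked in Python as follows. This is an illustrative sketch with a made-up helper name, array_get. It returns None for an out-of-range index and reuses Python's negative indexing to count from the end of the array.

```python
import json

BLOB = (
    '{"name": "Bob Smith", "org": "engineering", '
    '"projects": [{"name":"project1", "completed":false},'
    '{"name":"project2", "completed":true}]}'
)


def array_get(json_array_text: str, index: int):
    """Return the element at index as compact JSON text, or None if absent."""
    items = json.loads(json_array_text)
    if not isinstance(items, list) or not -len(items) <= index < len(items):
        return None
    return json.dumps(items[index], separators=(",", ":"))


# Step 1: pull out the projects array; step 2: take its first element.
projects_text = json.dumps(json.loads(BLOB)["projects"])
item = array_get(projects_text, 0)
```

The value of item is the same compact JSON text for the first project that the result table above shows.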