Data Analysis with EMR
Video demo: Run Spark Application (Scala) on Amazon EMR (Elastic MapReduce) cluster [EMR 5.3.1]
We live in an era where hands-on practice is king.
Create an Amazon S3 bucket
Create an Amazon EC2 key pair (a boto3 sketch of both steps follows below)
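For reference, a minimal boto3 sketch of these two setup steps; the bucket and key names are hypothetical, and buckets outside us-east-1 additionally need a LocationConstraint:

```python
import boto3

# Create an S3 bucket (name is hypothetical and must be globally unique).
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="my-emr-demo-bucket")
# Outside us-east-1, also pass CreateBucketConfiguration={"LocationConstraint": region}.

# Create an EC2 key pair and save the private key for SSH access later.
ec2 = boto3.client("ec2", region_name="us-east-1")
key = ec2.create_key_pair(KeyName="my-emr-key")
with open("my-emr-key.pem", "w") as f:
    f.write(key["KeyMaterial"])
```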
Go to: Create Cluster - Quick Options
For choosing the instance type, see: [AWS] EC2 & GPU
Spark 2.4.3 on Hadoop 2.8.5
YARN with Ganglia 3.7.2 and Zeppelin 0.8.1
/* TODO: find a tutorial and study this setup thoroughly */
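The same Quick Options cluster can also be created from code. A minimal boto3 sketch, where the cluster name, instance types, count, and release label are my assumptions (pick the EMR release that ships Spark 2.4.3 / Hadoop 2.8.5):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Roughly equivalent to the console's Quick Options (values are assumptions).
resp = emr.run_job_flow(
    Name="spark-demo-cluster",
    ReleaseLabel="emr-5.25.0",  # assumption: a release with Spark 2.4.3 / Hadoop 2.8.5
    Applications=[{"Name": "Spark"}, {"Name": "Ganglia"}, {"Name": "Zeppelin"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,  # 1 master + 2 core
        "Ec2KeyName": "my-emr-key",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
cluster_id = resp["JobFlowId"]
print(cluster_id)
```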
Under Network and hardware, find the status of the Master and Core instances.
While the cluster is being created, the status progresses through three stages: Provisioning, then Bootstrapping, then Waiting.
Once you can see the links for Security groups for Master and Security Groups for Core & Task you can move on to the next step, but you will likely need to wait until the cluster has launched successfully and is in the Waiting state; a polling sketch follows below.
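Continuing from the run_job_flow sketch above, the same progression can be watched programmatically (the console's "Provisioning" roughly corresponds to the API's STARTING state):

```python
import time
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Poll until the cluster reaches WAITING (the console's "Waiting" state);
# it passes through STARTING and BOOTSTRAPPING first.
while True:
    state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
    print(state)
    if state in ("WAITING", "TERMINATED", "TERMINATED_WITH_ERRORS"):
        break
    time.sleep(30)
```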
Go to the master's EC2 host and log in over SSH, typically something like `ssh -i ~/my-emr-key.pem hadoop@<master-public-DNS>` (EMR's default login user is hadoop).
Remember to adjust the Security Group configuration to allow inbound SSH.
For preparing the dataset, see: [AWS] S3 Bucket
When finished, terminate the cluster and delete the Amazon S3 bucket to avoid incurring extra charges (see the cleanup sketch below).
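A boto3 cleanup sketch, reusing the hypothetical names from the earlier sketches:

```python
import boto3

# Terminate the cluster...
boto3.client("emr", region_name="us-east-1").terminate_job_flows(
    JobFlowIds=[cluster_id]
)

# ...then empty and delete the bucket (a bucket must be empty before deletion).
bucket = boto3.resource("s3").Bucket("my-emr-demo-bucket")
bucket.objects.all().delete()
bucket.delete()
```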
The data is fairly large: how should big data be preprocessed?
Ref: Preprocessing data with Scalding and Amazon EMR
The author explores the possibility of building a model to predict the probability of chronic disease given the claim codes for a patient. Pandas over IPython was okay for doing analysis with a subset (single file) of data, but got a bit irritating with the full dataset because of frequent hangs and subsequent IPython server restarts.
In other words, a model that predicts whether a patient has a given disease.
Pandas is suited to taking a single table as input; loading too much data at once easily exhausts memory and crashes the session. A chunked-reading workaround is sketched below.
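One common workaround when a single pandas load is too big: stream the file in chunks and aggregate incrementally. A sketch, with hypothetical file and column names:

```python
import pandas as pd
from collections import Counter

# Aggregate diagnosis-code counts without loading the whole file at once
# (file and column names are hypothetical).
code_counts = Counter()
for chunk in pd.read_csv("inpatient_claims.csv", chunksize=100_000):
    code_counts.update(chunk["ICD9_DGNS_CD_1"].value_counts().to_dict())

print(code_counts.most_common(10))
```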
代码展现:Medicaid Dataset - Basic Data Analysis of Benefit Summary Data
It gives a rough sense of what the dataset contains.
This post describes a mixture of Python and Scala/Scalding code that I hooked up to convert the raw Benefits and Inpatient Claims data from the Medicare/Medicaid dataset into data for an X matrix and multiple y vectors, each y corresponding to a single chronic condition. Scalding purists would probably find this somewhat inelegant and prefer a complete Scalding end-to-end solution, but my Scalding-fu extends only so far - hopefully it will improve with practice.
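A rough pandas equivalent of that conversion (my sketch, not the post's Scalding code): one-hot encode claim codes per patient into an X matrix, and take each chronic-condition flag from the benefits file as a separate y. The column names follow the CMS DE-SynPUF files but should be treated as assumptions:

```python
import pandas as pd

claims = pd.read_csv("inpatient_claims.csv")    # hypothetical file names
benefits = pd.read_csv("benefit_summary.csv")

# X: one row per patient, one binary column per diagnosis (claim) code.
X = (claims.assign(flag=1)
           .pivot_table(index="DESYNPUF_ID", columns="ICD9_DGNS_CD_1",
                        values="flag", aggfunc="max", fill_value=0))

# One y vector per chronic condition (the SP_* flags in the benefits file).
chronic = [c for c in benefits.columns if c.startswith("SP_")]
ys = benefits.set_index("DESYNPUF_ID")[chronic].reindex(X.index)
```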
From: Data Processing and Text Mining Technologies on Electronic Medical Records: A Review
Abstract
Potential for applications: Currently, medical institutes generally use EMR to record a patient's condition, including diagnostic information, procedures performed, and treatment results. EMR has been recognized as a valuable resource for large-scale analysis.
Existing problems: However, EMR has the characteristics of diversity, incompleteness, redundancy, and privacy, which make it difficult to carry out data mining and analysis directly.
The power of preprocessing: Therefore, it is necessary to preprocess the source data in order to improve data quality and thereby the data mining results.
Different types of data require different processing technologies.
Structured data: Most structured data commonly needs classic preprocessing technologies, including data cleansing, data integration, data transformation, and data reduction.
Unstructured data: Semistructured or unstructured data, such as medical text, contains richer health information but requires more complex and challenging processing methods.
The task of information extraction for medical texts mainly includes NER (named-entity recognition) and RE (relation extraction).
This paper focuses on the process of EMR processing and emphatically analyzes the key techniques. In addition, we make an in-depth study on the applications developed based on text mining together with the open challenges and research issues for future work.
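As a taste of the NER half of that task, a minimal sketch with spaCy (my example, not from the paper; it uses a general-purpose English model, whereas real medical text would call for a domain model such as the scispaCy ones):

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Patient was prescribed 20 mg of atorvastatin for hyperlipidemia.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # prints each recognized entity and its type
```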
A very good article: it takes one case as a sample and walks through every stage of big-data processing.
Worth reading carefully when time permits.
Write-ups on some newer techniques: AWS Big Data Blog
Good code samples: https://github.com/aws-samples/aws-big-data-blog
A classic case study: Nasdaq's Architecture using Amazon EMR and Amazon S3 for Ad Hoc Access to a Massive Data Set
The Nasdaq Group has been a user of Amazon Redshift since it was released and we are extremely happy with it. We’ve discussed our usage of that system at re:Invent several times, the most recent of which was FIN401 Seismic Shift: Nasdaq’s Migration to Amazon Redshift. Currently, our system is moving an average of 5.5 billion rows into Amazon Redshift every day (14 billion on a peak day in October of 2014).
Clearly the data volume is enormous and arrives at high frequency.
We can avoid these problems by using Amazon S3 and Amazon EMR, allowing us to separate compute and storage for our data warehouse and scale each independently.
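That separation shows up directly in code: the cluster holds only compute, the data stays in S3, and the same bucket can be read from any newly created cluster. A minimal PySpark sketch with hypothetical path and column names:

```python
from pyspark.sql import SparkSession

# On EMR, Spark reads S3 directly; the cluster can be terminated at any time
# and the data survives in the bucket (path and "symbol" column are made up).
spark = SparkSession.builder.appName("s3-adhoc").getOrCreate()
df = spark.read.parquet("s3://my-warehouse-bucket/trades/")
df.groupBy("symbol").count().show()
```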
/* Skipping the rest; nothing much of interest */
From: Serverless unsupervised machine learning with AWS Glue and Amazon Athena
Use AWS Glue to extract a dataset of taxi trips stored on Amazon S3,
and use K-means to split the data into 100 distinct clusters based on the ride coordinates.
Then use Amazon Athena to query the number of rides and the approximate area of each cluster.
Finally, use Amazon Athena to compute the coordinates of the four areas with the most rides.
Both AWS Glue and Amazon Athena let you perform these tasks without provisioning or managing servers.
The script uses Spark ML's K-means clustering library to partition the dataset by coordinates.
The job loads the green taxi data and adds a column indicating which cluster each row is assigned to.
The script saves the table to an Amazon S3 bucket (the destination file) in Parquet format.
The bucket can then be queried with Amazon Athena; a sketch of such a job follows below.
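A condensed PySpark sketch of that job (paths and column names are my assumptions; the actual Glue script in the post also goes through Glue's DynamicFrame APIs):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("taxi-kmeans").getOrCreate()

# Load the green taxi trips (path and column names are assumptions).
df = spark.read.csv("s3://my-bucket/green-taxi/", header=True, inferSchema=True)

# Cluster the pickup coordinates into 100 groups.
features = VectorAssembler(
    inputCols=["pickup_longitude", "pickup_latitude"], outputCol="features"
).transform(df)
model = KMeans(k=100, seed=1).fit(features)

# transform() appends a 'prediction' column: the cluster id for each row.
clustered = model.transform(features).withColumnRenamed("prediction", "cluster")

# Save as Parquet to S3 so Athena can query it.
clustered.drop("features").write.mode("overwrite").parquet("s3://my-bucket/clustered/")
```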
In short: Glue ingests the data, runs K-means, and stores the result in S3;
then Athena is used for some simple queries (see the boto3 sketch below).
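The Athena side, sketched with boto3 (database, table, and output location are assumptions; the table would first be registered via a Glue crawler or a DDL statement):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Rides per cluster, largest first (database/table names are assumptions).
resp = athena.start_query_execution(
    QueryString="""
        SELECT cluster, COUNT(*) AS rides
        FROM taxi.clustered
        GROUP BY cluster
        ORDER BY rides DESC
        LIMIT 4
    """,
    QueryExecutionContext={"Database": "taxi"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(resp["QueryExecutionId"])
```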
Which raises a question: does the PySpark K-means used here run on EMR under the hood?
Conclusion:
Get fluent with Glue first and accumulate experience with simple pipelines; then move up to EMR for complex pipelines, that is, building complex custom pipelines yourself.
End.