Azure HDInsight and Spark Big Data in Practice (Part 2)

Date: 2019-12-01

HDInsight cluster on Linux

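Log in to the Azure portal (https://manage.windowsazure.com).

Click the NEW button in the lower-left corner, then click DATA SERVICES, click HDINSIGHT, and select HADOOP ON LINUX, as shown in the figure below.

Enter the cluster name, choose the cluster size and account, and set the cluster password and storage account. The table below describes each parameter and how to configure it.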

Name	Value
Cluster Name	Name of the cluster.
Cluster Size	Number of data nodes you want to deploy. The default value is 4. But the option to use 1 or 2 data nodes is also available from the drop-down. Any number of cluster nodes can be specified by using the Custom Create option. Pricing details on the billing rates for various cluster sizes are available. Click the ? symbol just above the drop-down box and follow the link on the pop-up.
Password	The password for the HTTP account (default user name: admin) and SSH account (default user name: hdiuser). Note that these are NOT the administrator accounts for the virtual machines on which the clusters are provisioned.
Storage Account	Select the Storage account you created from the drop-down box. Once a Storage account is chosen, it cannot be changed. If the Storage account is removed, the cluster will no longer be available for use. The HDInsight cluster is co-located in the same datacenter as the Storage account.

Click CREATE HDINSIGHT CLUSTER to create a Hadoop cluster running on Azure.

The steps above quickly create a Linux cluster running Hadoop, with the default SSH user name hdiuser and the default HTTP account name admin. To use custom options, such as creating the cluster with SSH key authentication or using additional storage, see Provision Hadoop Linux clusters in HDInsight using custom options (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-provision-linux-clusters/).

Installing Spark

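In HDInsight, click the Hadoop cluster you created (named Hadooponlinux in this example) to open its dashboard, as shown in the figure below.

In quick glance, copy the value of Cluster Connection String. This is the address of Ambari, the configuration console for Hadoop on Linux. Paste the value into a browser; you are then prompted for a user name and password. The user name is admin, the default HTTP user name from the quick-create step above, and the password is the one you set when creating the cluster.

After you enter the correct user name and password, Ambari's own login prompt appears. Enter the user name admin and the password hadoop to open the Ambari management console.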

The following diagram shows the Spark installation process using Ambari.

Select the Ambari "Services" tab.

In the Ambari "Actions" drop-down menu, select "Add Service." This starts the Add Service wizard.

Select "Spark", then click "Next".

(For HDP 2.2.4, Ambari will install Spark version 1.2.1, not 1.2.0.2.2.)

Ambari displays a warning message. Confirm that the cluster is running HDP 2.2.4 or later, then click "Proceed".

	Note
	You can reconfirm component versions in Step 6 before finalizing the upgrade.

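Select the node for the Spark History Server, then click "Next" to continue.

Specify the Spark slaves, then click "Next" to continue.
On the Customize Services screen, we recommend keeping the default values for your initial configuration; then click "Next" to continue.
Ambari displays a confirmation screen; click "Deploy" to continue.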

	Important
	On the Review screen, make sure all HDP components are version 2.2.4 or later.

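Ambari displays the Install, Start and Test screen; its status bar and messages indicate progress.
When Ambari finishes the installation, click "Complete" to finish the Spark installation.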
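
As a hedged aside, the result of the installation can also be checked without the web UI, assuming the Ambari REST API is reachable on your cluster. The sketch below only builds the request and parses a response body; it does not contact any server. The base URL, the password, and the helper names build_service_request and service_state are hypothetical, and the resource path and the ServiceInfo.state field follow the generally documented Ambari REST API rather than anything shown in this walkthrough.

```python
import base64
import json
import urllib.request


def build_service_request(base_url, cluster, service, user, password):
    """Build (but do not send) a GET request for one Ambari service resource."""
    url = f"{base_url.rstrip('/')}/api/v1/clusters/{cluster}/services/{service}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def service_state(response_body):
    """Return ServiceInfo.state from a JSON response body, or None if absent."""
    return json.loads(response_body).get("ServiceInfo", {}).get("state")


# Hypothetical values for illustration only; substitute your own cluster details.
req = build_service_request(
    "https://example.invalid", "Hadooponlinux", "SPARK", "admin", "secret"
)
print(req.full_url)
# To actually send it: body = urllib.request.urlopen(req).read()
```

If the service is up, the state value would be expected to read STARTED; treat that as something to verify against your own cluster's response.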

Run Spark

Log in to the Hadoop Linux cluster over SSH and run the following Linux command to download a document that the Spark program will use later.

wget http://en.wikipedia.org/wiki/Hortonworks

Copy the data into HDFS on the Hadoop cluster:

hadoop fs -put ~/Hortonworks /user/guest/Hortonworks

Many Spark examples are demonstrated with Scala and Java applications; this example uses PySpark to show how to use Spark from the Python language.

pyspark

The first step is to create an RDD using the Spark Context, sc:

myLines = sc.textFile('hdfs://sandbox.hortonworks.com/user/guest/Hortonworks')
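
To make the URI passed to sc.textFile easier to read, here is a small sketch that splits it with Python's standard library. It runs in plain Python and does not touch Spark or HDFS. The path component matches the destination of the hadoop fs -put command above. The host name looks like the Hortonworks sandbox host, so it is an assumption that it resolves on your own cluster; you may need to substitute your cluster's file system address.

```python
from urllib.parse import urlparse

# The same URI that is passed to sc.textFile in this walkthrough.
uri = "hdfs://sandbox.hortonworks.com/user/guest/Hortonworks"
parts = urlparse(uri)

print(parts.scheme)  # hdfs: the file system type
print(parts.netloc)  # sandbox.hortonworks.com: the host serving that file system
print(parts.path)    # /user/guest/Hortonworks: the file copied in with hadoop fs -put
```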

Now that we have instantiated the RDD, we apply a transformation to it. To do so, we use a Python lambda expression as a filter.

myLines_filtered = myLines.filter( lambda x: len(x) > 0 )

Note that the Python statement above does not trigger any RDD execution; only an action such as count() in the following code triggers the actual RDD computation.

myLines_filtered.count()

The final result of the Spark job is shown below.

341.
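
The difference between a transformation (filter) and an action (count) can be illustrated with a plain-Python analogy. This is a toy sketch, not how Spark is implemented: the LazyLines class, the non_empty function, and the sample lines are all hypothetical, and the code runs without Spark. It records each predicate evaluation so that the deferred execution is observable.

```python
class LazyLines:
    """Toy stand-in for an RDD: records filters, runs them only on count()."""

    def __init__(self, lines, predicates=()):
        self._lines = lines
        self._predicates = tuple(predicates)

    def filter(self, predicate):
        # Transformation: return a new object describing the work; nothing runs yet.
        return LazyLines(self._lines, self._predicates + (predicate,))

    def count(self):
        # Action: only now is each line checked against every recorded predicate.
        return sum(
            1 for line in self._lines if all(p(line) for p in self._predicates)
        )


calls = []


def non_empty(line):
    calls.append(line)  # record each evaluation so laziness is observable
    return len(line) > 0


my_lines = LazyLines(["Hortonworks", "", "Hadoop", ""])
my_lines_filtered = my_lines.filter(non_empty)
print(len(calls))                 # 0: filter() alone evaluated nothing
print(my_lines_filtered.count())  # 2: the two non-empty lines
print(len(calls))                 # 4: count() visited every line
```

In the real PySpark session the same pattern holds: myLines.filter(...) returns immediately, and the job only runs when count() is called.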

Data Science with Spark

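Spark is a highly effective data processing tool for data scientists. Data scientists often use notebook-style tools (such as iPython, http://ipython.org/notebook.html) to prototype quickly and share their work. Many data scientists prefer the R language, and happily SparkR, the integration of Spark with R, is an emerging Spark capability. Apache Zeppelin (https://zeppelin.incubator.apache.org/) is an emerging tool that provides Spark-based notebook functionality; shown here is a view of the easy-to-use interface that Apache Zeppelin provides for Spark.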

Author: 雪松

Microsoft MVP -- Windows Platform Development,

Hortonworks Certified Apache Hadoop 2.0 Developer