【原创】大数据基础之Alluxio（1）简介、安装、使用

时间 2019-11-21

标签原创数据基础 alluxio 简介安装使用繁體版

原文原文链接

Alluxio 1.8.1html

官方：http://www.alluxio.org/node

一简介

Open Source Memory Speed Virtual Distributed Storage
Alluxio, formerly Tachyon, enables any application to interact with any data from any storage system at memory speed.shell

alluxio是一个开源的拥有内存访问速度的虚拟分布式存储；以前叫Tachyon，可使应用像访问内存数据同样访问任何存储系统中的数据。apache

1 优点

Storage Unification and Abstraction

Alluxio unifies data access to different systems, and seamlessly bridges computation frameworks and underlying storage.json

Remote Data Acceleration

Decouple compute and storage without any loss in performance.服务器

将计算和存储分离，而且不会损失性能；app

2 部署结构

Alluxio can be divided into three components: masters, workers, and clients. A typical setup consists of a single leading master, multiple standby masters, and multiple workers. The master and worker processes constitute the Alluxio servers, which are the components a system administrator would maintain. The clients are used to communicate with the Alluxio servers by applications such as Spark or MapReduce jobs, Alluxio command-line, or the FUSE layer.less

alluxio由master、worker组成，其中master若是有多个，只有一个是leading master，其余为standby master；ssh

3 角色

Master

The Alluxio master service can be deployed as one leading master and several standby masters for fault tolerance. When the leading master goes down, a standby master is elected to become the new leading master.curl

1）Leading Master

Only one master process can be the leading master in an Alluxio cluster. The leading master is responsible for managing the global metadata of the system. This includes file system metadata (e.g. the file system inode tree), block metadata (e.g. block locations), and worker capacity metadata (free and used space). Alluxio clients interact with the leading master to read or modify this metadata. All workers periodically send heartbeat information to the leading master to maintain their participation in the cluster. The leading master does not initiate communication with other components; it only responds to requests via RPC services. The leading master records all file system transactions to a distributed persistent storage to allow for recovery of master state information; the set of records is referred to as the journal.

alluxio中只有一个leading master，leading master负责管理全部的元数据，包括文件系统元数据、block元数据和worker元数据；worker会按期向leading master发送心跳；leading master会记录全部的文件操做到日志中；

2）Standby Masters

Standby masters read journals written by the leading master to keep their own copies of the master state up-to-date. They also write journal checkpoints for faster recovery in the future. They do not process any requests from other Alluxio components.

standby master会及时同步读取leader master的日志；

Worker

Alluxio workers are responsible for managing user-configurable local resources allocated to Alluxio (e.g. memory, SSDs, HDDs). Alluxio workers store data as blocks and serve client requests that read or write data by reading or creating new blocks within their local resources. Workers are only responsible for managing blocks; the actual mapping from files to blocks is only stored by the master.

worker负责管理资源，好比内存、ssd等；worker负责将数据存储为block同时响应client的读写请求；实际的file和block的映射关系保存在master中；

Because RAM usually offers limited capacity, blocks in a worker can be evicted when space is full. Workers employ eviction policies to decide which data to keep in the Alluxio space.

Client

The Alluxio client provides users a gateway to interact with the Alluxio servers. It initiates communication with the leading master to carry out metadata operations and with workers to read and write data that is stored in Alluxio.

client先向leading master请求元数据信息，而后向worker发送读写请求；

二安装

1 下载

$ wget http://downloads.alluxio.org/downloads/files//1.8.1/alluxio-1.8.1-hadoop-2.6-bin.tar.gz
$ tar xvf alluxio-1.8.1-hadoop-2.6-bin.tar.gz
$ cd alluxio-1.8.1-hadoop-2.6

2 配置本机ssh登陆

便可以 ssh localhost
详见：http://www.javashuo.com/article/p-rndbchju-bd.html

3 配置

$ cp conf/alluxio-site.properties.template conf/alluxio-site.properties
$ vi conf/alluxio-site.properties
alluxio.master.hostname=localhost

4 初始化

$ ./bin/alluxio validateEnv local
$ ./bin/alluxio format

5 启动

$ ./bin/alluxio-start.sh local SudoMount

若是报错：

Formatting RamFS: /mnt/ramdisk (44849277610)
ERROR: mkdir /mnt/ramdisk failed

须要添加sudo权限

# visudo -f /etc/sudoers
$user ALL=(ALL) NOPASSWD: /bin/mount * /mnt/ramdisk, /bin/umount * /mnt/ramdisk, /bin/mkdir * /mnt/ramdisk, /bin/chmod * /mnt/ramdisk

三使用

1 命令行

文件系统操做

$ ./bin/alluxio fs
$ ./bin/alluxio fs ls /
$ ./bin/alluxio fs copyFromLocal LICENSE /LICENSE
$ ./bin/alluxio fs cat /LICENSE

看起来和hdfs命令很像

admin操做

$ bin/alluxio fsadmin report

Alluxio cluster summary:

    Master Address: localhost/127.0.0.1:19998

    Web Port: 19999

    Rpc Port: 19998

    Started: 01-24-2019 10:28:59:433

    Uptime: 0 day(s), 1 hour(s), 24 minute(s), and 42 second(s)

    Version: 1.8.1

    Safe Mode: false

    Zookeeper Enabled: false

    Live Workers: 1

    Lost Workers: 0

    Total Capacity: 10.00GB

        Tier: MEM Size: 10.00GB

    Used Capacity: 9.36GB

        Tier: MEM Size: 9.36GB

    Free Capacity: 651.55MB

查看统计信息

$ curl http://$master:19999/metrics/json

2 UFS（Under File Storage）

UFS=LocalFileSystem

1 默认配置

$ cat conf/alluxio-site.properties
alluxio.underfs.address=${alluxio.work.dir}/underFSStorage

2 命令示例

$ ls ./underFSStorage/
$ ./bin/alluxio fs persist /LICENSE
$ ls ./underFSStorage
LICENSE

With the default configuration, Alluxio uses the local file system as its under file storage (UFS). The default path for the UFS is ./underFSStorage.

Alluxio is currently writing data only into Alluxio space, not to the UFS.Configure Alluxio to persist the file from Alluxio space to the UFS by using the persist command.

Alluxio默认用的是本地文件系统做为UFS，只有执行persist命令以后，文件才会持久化到UFS中；

UFS=HDFS

1 配置

$ cat conf/alluxio-site.properties
alluxio.underfs.address=hdfs://<NAMENODE>:<PORT>/alluxio/data

若是你想对hdfs上所有数据进行加速而且路径不变，能够配置为hdfs的根目录

2 配置hadoop

1）连接

$ ln -s $HADOOP_CONF_DIR/core-site.xml conf/core-site.xml
$ ln -s $HADOOP_CONF_DIR/hdfs-site.xml conf/hdfs-site.xml

Copy or make symbolic links from hdfs-site.xml and core-site.xml from your Hadoop installation into ${ALLUXIO_HOME}/conf

2）直接配置路径

alluxio.underfs.hdfs.configuration=/path/to/hdfs/conf/core-site.xml:/path/to/hdfs/conf/hdfs-site.xml

3 命令

$ bin/alluxio fs ls /

能够看到hdfs上全部的目录了

4 文件映射

这时能够经过访问

alluxio://$alluxio_server:19998/test.log

来访问底层存储

hdfs://$namenode_server/alluxio/data/test.log

注意：这里须要指定$alluxio_server和端口，存在单点问题，后续ha方式部署以后能够解决这个问题。

3 Spark访问

1 准备：（二选一）

1）配置

spark.driver.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar
spark.executor.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar

This Alluxio client jar file can be found at /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar

2）拷贝jar

$ cp client/alluxio-1.8.1-client.jar $SPARK_HOME/jars/

2 访问

$ spark-shell
scala> val s = sc.textFile("alluxio://localhost:19998/derby.log")
s: org.apache.spark.rdd.RDD[String] = alluxio://localhost:19998/derby.log MapPartitionsRDD[1] at textFile at <console>:24

scala> s.foreach(println)
----------------------------------------------------------------
Thu Jan 10 11:05:45 CST 2019:

参考：http://www.alluxio.org/docs/1.8/en/compute/Spark.html

4 hive访问

拷贝jar

$ cp client/alluxio-1.8.1-client.jar $HIVE_HOME/lib/

$ cp client/alluxio-1.8.1-client.jar $HADOOP_HOME/share/hadoop/common/lib/

重启metastore和hiveserver2

5 部署方式

1 集群ha部署

即多worker+多master+zookeeper

1 配置集群服务器间ssh可达

同上

2 配置

$ cat conf/alluxio-site.properties
#alluxio.master.hostname=<MASTER_HOSTNAME>
alluxio.zookeeper.enabled=true
alluxio.zookeeper.address=<ZOOKEEPER_ADDRESS>
alluxio.master.journal.folder=hdfs://$namenode_server/alluxio/journal/
alluxio.worker.memory.size=20GB

将配置同步到集群全部服务器

3 配置masters和workers

$ conf/masters
$master1
$master2

$ conf/workers
$worker1
$worker2
$worker3

4 启动

$ ./bin/alluxio-start.sh all SudoMount

5 访问方式

alluxio://zkHost1:2181;zkHost2:2181;zkHost3:2181/path

若是client启动时增长环境变量

-Dalluxio.zookeeper.address=zkHost1:2181,zkHost2:2181,zkHost3:2181 -Dalluxio.zookeeper.enabled=true

则能够直接这样访问

alluxio:///path

6 与hdfs互通

拷贝jar

$ cp client/alluxio-1.8.1-client.jar $HADOOP_HOME/share/hadoop/common/lib/

将如下配置添加到 $HADOOP_CONF_DIR/core-site.xml

alluxio.zookeeper.enabled
alluxio.zookeeper.address

和

<property>

<name>fs.alluxio.impl</name>

<value>alluxio.hadoop.FileSystem</value>

</property>

则能够经过hdfs客户端访问alluxio

$ hadoop fs -ls alluxio:///directory

参考：http://www.alluxio.org/docs/1.8/en/deploy/Running-Alluxio-On-a-Cluster.html

2 Alluxio on Yarn部署

Alluxio还有不少种部署方式，其中一种是Alluxio on Yarn，对于相似Spark on Yarn的用户来讲，很是容易使用Alluxio来加速Spark。

详见：

http://www.alluxio.org/docs/1.8/en/deploy/Running-Alluxio-On-Yarn.html