Reading Data Using the FileSystem API

Date: 2019-12-04

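As explained in the previous section, it is sometimes impossible to set a URLStreamHandlerFactory for an application. In that case, we need to use the FileSystem API to open an input stream for a file.

A file in a Hadoop filesystem is represented by a Hadoop Path object (not a java.io.File object, whose semantics are too closely tied to the local filesystem). We can think of a Path as a Hadoop filesystem URI, such as hdfs://localhost/user/tom/quangle.txt.

FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use, which here is HDFS. There are two static factory methods for getting a FileSystem instance: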

    public static FileSystem get(Configuration conf) throws IOException

    public static FileSystem get(URI uri, Configuration conf) throws IOException

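A Configuration object encapsulates the configuration of a client or server, which is set from configuration files read from the classpath, such as conf/core-site.xml. The first method returns the default filesystem (as specified in conf/core-site.xml, or the default local filesystem if nothing is specified there). The second method uses the scheme and authority of the given URI to determine which filesystem to use, and falls back to the default filesystem if the given URI specifies no scheme.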
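To make the terms scheme and authority concrete, the sketch below uses only java.net.URI from the JDK, not Hadoop itself, to pull those parts out of a location string. The class name UriParts and the describe() helper are hypothetical and exist only for illustration; what FileSystem.get() does with these parts is as described above.

```java
import java.net.URI;

class UriParts {
    /** Reports the scheme and authority of a location, or notes that no scheme is present. */
    public static String describe(String location) {
        URI uri = URI.create(location);
        if (uri.getScheme() == null) {
            // With no scheme, FileSystem.get(uri, conf) is said to fall back to the default filesystem.
            return "no scheme: default filesystem";
        }
        return "scheme=" + uri.getScheme() + ", authority=" + uri.getAuthority();
    }

    public static void main(String[] args) {
        System.out.println(describe("hdfs://localhost/user/tom/quangle.txt"));
        System.out.println(describe("/user/tom/quangle.txt"));
    }
}
```

Run as is, this prints "scheme=hdfs, authority=localhost" for the first location and "no scheme: default filesystem" for the second.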

With a FileSystem instance in hand, we call open() to get an input stream for a file:

    public FSDataInputStream open(Path f) throws IOException

    public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException

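The first method uses a default buffer size of 4 KB.

Putting this together, we can rewrite Example 3-1 as shown in Example 3-2.

Example 3-2. Displaying a file from a Hadoop filesystem on standard output by using the FileSystem directly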

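    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileSystemCat {
      public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // Select the filesystem from the URI's scheme and authority
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
          in = fs.open(new Path(uri));
          // Copy to standard output with a 4 KB buffer; false leaves the streams open
          IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }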

Running the program produces the following output:

    % hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.
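The call IOUtils.copyBytes(in, System.out, 4096, false) in Example 3-2 copies the input to the output through a 4096-byte buffer and, because the last argument is false, leaves both streams open. The following is a minimal JDK-only sketch of what such a buffered copy loop looks like. It is not Hadoop's actual implementation, and the class name BufferedCopy and its copy() method are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class BufferedCopy {
    /** Copies all bytes from in to out through a fixed-size buffer; closes neither stream. */
    public static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {   // -1 signals end of stream
            out.write(buffer, 0, n);            // write only the bytes actually read
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "On the top of the Crumpetty Tree\n".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink, 4096);
        System.out.print(new String(sink.toByteArray(), StandardCharsets.UTF_8));
        System.out.println(copied + " bytes copied");
    }
}
```

Since the loop never closes its streams, the caller stays responsible for closing them, which is the role the finally block plays in Example 3-2.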

FSDataInputStream

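The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. This class is a subclass of java.io.DataInputStream that supports random access, so data can be read from any position in the stream.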

    package org.apache.hadoop.fs;

    public class FSDataInputStream extends DataInputStream
        implements Seekable, PositionedReadable {
      // implementation elided
    }

The Seekable interface permits seeking to a position in the file and provides a query method, getPos(), for the current offset from the start of the file:

    public interface Seekable {
      void seek(long pos) throws IOException;
      long getPos() throws IOException;
      boolean seekToNewSource(long targetPos) throws IOException;
    }

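Calling seek() with a position greater than the length of the file results in an IOException. Unlike the skip() method of java.io.InputStream, which can only move the stream to a point after its current position, seek() can move to any absolute position in the file.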
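The difference between a relative skip and an absolute seek can be tried out with the JDK's java.io.RandomAccessFile, used here purely as a stand-in for a seekable stream. This is an analogy rather than Hadoop code: the class name SeekVersusSkip is hypothetical, and unlike the seek() described above, RandomAccessFile.seek() does not throw when the offset lies past the end of the file.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class SeekVersusSkip {
    /** Returns the offsets observed after a relative skip and after an absolute seek. */
    public static long[] offsets(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(6);                          // absolute: offset is now 6
            raf.skipBytes(2);                     // relative: moves forward from 6 to 8
            long afterSkip = raf.getFilePointer();
            raf.seek(1);                          // absolute: can also move backward, to 1
            long afterSeek = raf.getFilePointer();
            return new long[] { afterSkip, afterSeek };
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("seek-demo", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "0123456789".getBytes(StandardCharsets.US_ASCII));
        long[] result = offsets(tmp);
        System.out.println("after skip: " + result[0] + ", after seek: " + result[1]);
    }
}
```

On the ten-byte demo file this prints "after skip: 8, after seek: 1".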

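Application writers do not normally use the seekToNewSource() method. It attempts to switch to another copy of the data and seek to the offset targetPos in the new copy. HDFS uses it internally to give the client a reliable input stream when a datanode fails.

Example 3-3 is a simple extension of Example 3-2 that writes a file to standard output twice: after writing it once, it seeks to the start of the file and streams through it again.

Example 3-3. Displaying a file from a Hadoop filesystem on standard output twice, by using seek()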

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileSystemDoubleCat {

      public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // Declared as FSDataInputStream (not InputStream) so that seek() is available
        FSDataInputStream in = null;
        try {
          in = fs.open(new Path(uri));
          IOUtils.copyBytes(in, System.out, 4096, false);
          in.seek(0); // go back to the start of the file
          IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }

Running it on a small file gives the following result:

    % hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.

FSDataInputStream also implements the PositionedReadable interface, which reads part of a file starting at a given offset:

    public interface PositionedReadable {

      public int read(long position, byte[] buffer, int offset, int length)
          throws IOException;

      public void readFully(long position, byte[] buffer, int offset, int length)
          throws IOException;

      public void readFully(long position, byte[] buffer) throws IOException;
    }

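The read() method reads up to length bytes from the given position in the file into buffer, starting at the given offset within the buffer. The return value is the number of bytes actually read; callers should check it, because it may be less than length. The readFully() methods read length bytes into buffer, or buffer.length bytes in the version that takes only a byte array (the third method), and throw an EOFException if the end of the file is reached first.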
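The relationship between a read() that may come up short and a readFully() that either fills the requested range or fails can be sketched with the JDK's FileChannel, whose read(ByteBuffer, long) is likewise a positioned read that leaves the channel's own position untouched. This is a JDK-only analogy under that assumption, not Hadoop's implementation. The class name PositionedReadSketch is hypothetical, and its readFully() only mirrors the parameter order shown above.

```java
import java.io.EOFException;
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardOpenOption;

class PositionedReadSketch {
    /** Reads exactly length bytes starting at position, or throws EOFException. */
    public static void readFully(FileChannel channel, long position,
                                 byte[] buffer, int offset, int length) throws IOException {
        ByteBuffer target = ByteBuffer.wrap(buffer, offset, length);
        while (target.hasRemaining()) {
            // A single positioned read may return fewer bytes than requested.
            int n = channel.read(target, position);
            if (n < 0) {
                throw new EOFException("end of file before " + length + " bytes were read");
            }
            position += n;
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("pread-demo", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "0123456789".getBytes(StandardCharsets.US_ASCII));
        try (FileChannel channel = FileChannel.open(tmp.toPath(), StandardOpenOption.READ)) {
            byte[] buffer = new byte[4];
            readFully(channel, 3, buffer, 0, buffer.length);
            System.out.println(new String(buffer, StandardCharsets.US_ASCII));
            System.out.println(channel.position());
        }
    }
}
```

With the ten-byte demo file, the program prints "3456" followed by "0": the four bytes at offset 3 are returned and the channel's current position has not moved.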

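All of these methods preserve the current offset in the file and are thread-safe, so they provide a convenient way to access another part of the file (metadata, perhaps) while reading the main body of the file. In fact, they are simply implemented with the Seekable interface, following this pattern: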

    long oldPos = getPos();
    try {
      seek(position);
      // read data
    } finally {
      seek(oldPos);
    }
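The save, seek, read, restore pattern above can be made runnable with the JDK's java.io.RandomAccessFile standing in for the seekable stream. This is a sketch by analogy, not Hadoop's code: the class name SeekAndRestore and the readAt() helper are hypothetical. The synchronized block is an assumption added here so that the three steps cannot interleave across threads; the text does not say how the real implementation achieves thread safety.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class SeekAndRestore {
    /** Fills buffer from the given absolute position, then restores the previous offset. */
    public static void readAt(RandomAccessFile file, long position, byte[] buffer)
            throws IOException {
        synchronized (file) {                     // assumption: guard the seek/read/restore sequence
            long oldPos = file.getFilePointer();
            try {
                file.seek(position);
                file.readFully(buffer);
            } finally {
                file.seek(oldPos);                // restore even if the read failed
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("restore-demo", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "0123456789".getBytes(StandardCharsets.US_ASCII));
        try (RandomAccessFile file = new RandomAccessFile(tmp, "r")) {
            file.seek(2);                         // pretend the main read is at offset 2
            byte[] buffer = new byte[3];
            readAt(file, 7, buffer);
            System.out.println(new String(buffer, StandardCharsets.US_ASCII));
            System.out.println(file.getFilePointer());
        }
    }
}
```

The demo prints "789" and then "2": the bytes at offset 7 are read, and the offset of the main read is back where it was.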

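Finally, bear in mind that seek() is a relatively expensive operation and should be used sparingly. Applications should structure their access patterns around streaming data (by using MapReduce, for example) rather than performing a large number of seeks.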
