Consider an example: using MapReduce to process the log files for one or two months, where the number of log files may be very large. How can we gather all of those file paths quickly?
In HDFS this can be solved with wildcards (globbing), which work the same way as wildcards in the Linux shell.
For example:
| Glob pattern | Matches |
|---|---|
| 2016/* | 2016/05 2016/04 |
| 2016/0[45] | 2016/05 2016/04 |
| 2016/0[4-5] | 2016/05 2016/04 |
Code:
```java
public static void globFiles(String pattern) {
    try {
        // "configuration" is assumed to be a Hadoop Configuration field initialized elsewhere
        FileSystem fileSystem = FileSystem.get(configuration);
        // Expand the glob pattern into all matching FileStatus entries
        FileStatus[] statuses = fileSystem.globStatus(new Path(pattern));
        // Convert the FileStatus array into plain Paths
        Path[] listPaths = FileUtil.stat2Paths(statuses);
        for (Path path : listPaths) {
            System.out.println(path);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
```
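As a quick usage sketch (the /hadoop/2016 base directory is an assumed layout matching the table above, not something fixed by HDFS), the method could be driven with any of the patterns listed earlier:

```java
// Hypothetical driver code: each call prints whatever paths the glob expands to,
// e.g. the 2016/04 and 2016/05 directories from the table above.
globFiles("/hadoop/2016/*");
globFiles("/hadoop/2016/0[45]");
globFiles("/hadoop/2016/0[4-5]");
```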
HDFS also provides a PathFilter interface for filtering the paths we retrieve, similar to java.io.FileFilter:
```java
/**
 * Return an array of FileStatus objects whose path names match pathPattern
 * and is accepted by the user-supplied path filter. Results are sorted by
 * their path names.
 * Return null if pathPattern has no glob and the path does not exist.
 * Return an empty array if pathPattern has a glob and no path matches it.
 *
 * @param pathPattern
 *          a regular expression specifying the path pattern
 * @param filter
 *          a user-supplied path filter
 * @return an array of FileStatus objects
 * @throws IOException if any I/O error occurs when fetching file status
 */
public FileStatus[] globStatus(Path pathPattern, PathFilter filter)
    throws IOException {
  return new Globber(this, pathPattern, filter).glob();
}
```
HDFS itself ships with a number of filters, and Hadoop: The Definitive Guide provides an implementation of a regular-expression exclusion filter:
```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexExcludePathFilter implements PathFilter {

    private String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    @Override
    public boolean accept(Path path) {
        // Exclude any path whose string form matches the given regex
        return !path.toString().matches(regex);
    }
}
```
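To show how this filter combines with the globStatus(Path, PathFilter) overload quoted above, here is a minimal sketch; the glob pattern and the regex that excludes the 04 directory are illustrative assumptions, not taken from the original example:

```java
// Minimal sketch: expand a glob, then let the PathFilter drop unwanted matches.
// The pattern and the exclusion regex below are illustrative assumptions.
FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] statuses = fs.globStatus(
        new Path("/hadoop/2016/*"),                  // candidate paths from the glob
        new RegexExcludePathFilter("^.*/2016/04$")); // drop anything ending in /2016/04
for (Path p : FileUtil.stat2Paths(statuses)) {
    System.out.println(p);                           // expected to leave only .../2016/05
}
```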
Using a regular expression to refine the results:
```java
fileSystem.listStatus(new Path(uri), new RegexExcludePathFilter("^.*/2016/0$"));
```
The output is as follows:
```
hdfs://hadoop:9000/hadoop/2016/04
hdfs://hadoop:9000/hadoop/2016/05
```
The filter is given only a Path, so it can act only on file names and paths; it cannot use other file properties as the basis for filtering.
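Since accept(Path) never sees file metadata, filtering on properties such as modification time has to be done after listing, directly on the FileStatus objects. A rough sketch of that workaround follows (the directory and the 7-day cutoff are assumptions for illustration):

```java
// Workaround sketch: PathFilter cannot see metadata, so inspect FileStatus instead.
// The directory and the 7-day cutoff are illustrative assumptions.
long cutoff = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;
for (FileStatus status : fileSystem.listStatus(new Path("/hadoop/2016/05"))) {
    if (status.getModificationTime() >= cutoff) { // keep only recently modified entries
        System.out.println(status.getPath());
    }
}
```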