Writing Hadoop MapReduce Programs in PHP

Date: 2019-11-06


Hadoop Streaming

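Although Hadoop is written in Java, it provides Hadoop Streaming, an API that lets users write the map and reduce functions in any language.
The key to Hadoop Streaming is that it uses UNIX standard streams as the interface between your program and Hadoop. Any program that can read data from standard input and write data to standard output can therefore serve as the map or reduce function of a MapReduce job, whatever language it is written in.
For example: bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out4
The jar used by Hadoop Streaming here is hadoop-streaming-0.20.203.0.jar. There is no hadoop-streaming.jar in the Hadoop root directory: streaming is a contrib module, so you have to look under contrib. Taking hadoop-0.20.2 as an example, it is located here: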
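-input: the path of the input files on HDFS
-output: the path of the output on HDFS
-mapper: the program that implements the map function
-reducer: the program that implements the reduce function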
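The standard-stream contract described above can be sketched as a small function that reads records line by line from one stream and writes tab-separated key/value pairs to another. This is only an illustrative sketch: the helper name emit_lines and the choice of emitting a constant value of 1 are assumptions made for the example, not part of Hadoop Streaming itself.

```php
<?php
// Hypothetical helper (not part of Hadoop): reads lines from $in and
// writes one "key<TAB>1" record per non-blank line to $out.
// Returns the number of records written.
function emit_lines($in, $out): int
{
    $emitted = 0;
    while (($line = fgets($in)) !== false) {
        $key = trim($line);
        if ($key === '') {
            continue; // skip blank lines
        }
        // Key and value are separated by a tab, one record per line.
        fwrite($out, $key . "\t" . 1 . PHP_EOL);
        $emitted++;
    }
    return $emitted;
}
```

In a script launched through -mapper, the call would presumably be emit_lines(STDIN, STDOUT); passing the streams as arguments also makes the function easy to exercise with in-memory streams.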

The mapper function

Create the file mapper.php with the following code:

[php]

#!/usr/local/php/bin/php
<?php
$word2count = array();
// input comes from STDIN (standard input)
// alternatively, open it yourself: $stdin = fopen('php://stdin', 'r');
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace and lowercase
    $line = strtolower(trim($line));
    // split the line into words while removing any empty string
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // increase counters
    foreach ($words as $word) {
        // initialize on first sight to avoid an undefined-index notice
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += 1;
    }
}
// write the results to STDOUT (standard output)
// what we output here will be the input for the
// Reduce step, i.e. the input for reducer.php
foreach ($word2count as $word => $count) {
    // tab-delimited
    echo $word, chr(9), $count, PHP_EOL;
}
?>

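Roughly speaking, this code extracts the words from each line of input, counts how often each word occurs, and prints one word and its count per line (the separator is a tab character), like this:

hello 1
world 1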
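To see this behaviour on a concrete input without Hadoop, the counting and formatting steps can be pulled into two small functions and run on sample lines. This is a sketch under assumptions: the helper names count_words and format_counts are made up for illustration, and it uses the ?? operator, so it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: adds the words of one line to a running count.
function count_words(string $line, array $counts = []): array
{
    $line = strtolower(trim($line));
    // Split on runs of non-word characters and drop empty pieces.
    $words = preg_split('/\W+/', $line, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        // Start from 0 the first time a word is seen.
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }
    return $counts;
}

// Hypothetical helper: renders counts as "word<TAB>count" lines.
function format_counts(array $counts): string
{
    $out = '';
    foreach ($counts as $word => $count) {
        $out .= $word . "\t" . $count . PHP_EOL;
    }
    return $out;
}

$counts = count_words('Hello world');
$counts = count_words('hello Hadoop', $counts);
echo format_counts($counts);
// prints (with <TAB> standing for a tab character):
// hello<TAB>2
// world<TAB>1
// hadoop<TAB>1
```

Because the input is lowercased before splitting, "Hello" and "hello" are counted as the same word.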

This is hardly different from the PHP you have written before, right? Two things may look slightly unfamiliar:

PHP as an executable program

The first line,

[php]

#!/usr/local/php/bin/php

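tells Linux to use the program /usr/local/php/bin/php as the interpreter for the code that follows. Anyone who has written Linux shell scripts will recognize this convention: scripts start with a line such as #!/bin/bash or #!/usr/bin/python.
With this line in place, once the file is saved and marked executable (chmod +x mapper.php), you can run mapper.php directly as a command, just like cat or grep: ./mapper.php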