Setting up a local Spark environment and writing the first demo program
Go to http://spark.apache.org/downloads.html and download Spark. I downloaded spark-1.6.1-bin-hadoop2.6 (Spark version 1.6.1), and also downloaded hadoop-2.6.0.tar.gz.
Spark builds on top of Hadoop and calls the relevant Hadoop libraries while running; if the Hadoop runtime environment is not configured, errors will occur.
At this point, type spark-shell at a cmd prompt; if it starts up with normal output, the setup succeeded.
POM:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.credo</groupId>
    <artifactId>spark-test</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.6.1</version>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.5.1</version>
                    <configuration>
                        <source>1.8</source>
                        <target>1.8</target>
                        <encoding>UTF-8</encoding>
                        <compilerArgument>-proc:none</compilerArgument>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>
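With this POM in place (the 1.8 source/target settings are what allow the lambda syntax used in the demo below), a quick way to confirm that the dependency resolves and a local Spark context can start is a tiny sanity-check program. This is only a minimal sketch; the class name LocalSparkCheck and the hadoop.home.dir path are illustrative, not part of the original post:

package org.credo;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalSparkCheck {
    public static void main(String[] args) {
        // On Windows, point Spark/Hadoop at the extracted Hadoop directory
        // (see the winutils note at the end of this post). Path is illustrative.
        System.setProperty("hadoop.home.dir", "D:\\software\\bigdata\\hadoop-2.6.0");

        // "local" runs everything inside this JVM with a single thread, so no cluster is needed.
        SparkConf conf = new SparkConf().setAppName("localCheck").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Count a tiny in-memory collection; if this prints 5, the local environment works.
        long n = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).count();
        System.out.println("count = " + n);

        sc.stop();
    }
}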
The main method:
package org.credo;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.UUID;

/**
 * Created by ZhaoQian on 2016/6/12.
 */
public class spark {

    public static void main(String[] args) {
        System.out.println("================spark begin==============================");
        System.setProperty("hadoop.home.dir", "D:\\software\\bigdata\\hadoop-2.6.0");

        // Create a Java Spark context.
        SparkConf sparkConf = new SparkConf().setAppName("wordCount");
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);

        // Read a file.
        JavaRDD<String> input = javaSparkContext.textFile("D:\\logger\\server.log2");

        /* The plain anonymous-inner-class version: */
//        JavaRDD<String> words = input.flatMap(
//                new FlatMapFunction<String, String>() {
//                    @Override
//                    public Iterable<String> call(String s) throws Exception {
//                        return Arrays.asList(s.split(" "));
//                    }
//                }
//        );
//        // Map to key/value pairs and count.
//        JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
//            @Override
//            public Tuple2<String, Integer> call(String s) throws Exception {
//                return new Tuple2<String, Integer>(s, 1);
//            }
//        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
//            @Override
//            public Integer call(Integer v1, Integer v2) throws Exception {
//                return v1 + v2;
//            }
//        });

        // Split into words: the commented-out code above uses anonymous classes,
        // the code below uses lambda expressions.
        JavaRDD<String> words = input
                .flatMap((FlatMapFunction<String, String>) s -> Arrays.asList(s.split(" ")));
        JavaPairRDD<String, Integer> counts = words
                .mapToPair((PairFunction<String, String, Integer>) s -> new Tuple2<>(s, 1))
                .reduceByKey((Function2<Integer, Integer, Integer>) (v1, v2) -> v1 + v2);

        // Write the word counts to files, one ("word", count) pair per line.
        counts.saveAsTextFile("D://logger//" + UUID.randomUUID().toString());
        System.out.println("================spark end==============================");
    }
}
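A note on the output: saveAsTextFile refuses to write into a directory that already exists, which is presumably why the demo appends a random UUID to the output path. Each run therefore produces a fresh directory containing part-NNNNN text files, with one (word,count) tuple per line.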
Resolving the error "A master URL must be set in your configuration"
When running the Spark test program SparkPi and clicking Run, the error "A master URL must be set in your configuration" appeared.
The message shows that the program cannot find a master to run against, so the master needs to be configured. The master URL passed to Spark can be one of the following: local, local[K], local[*], spark://HOST:PORT, mesos://HOST:PORT, or yarn-client / yarn-cluster.
In the run configuration's VM options field, enter "-Dspark.master=local", which tells the program to run locally with a single thread; run it again and it succeeds.
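Alternatively, the master can be set directly on the SparkConf instead of being passed as a VM option. A minimal sketch (not taken from the original post), reusing the variable names from the demo above; hard-coding the master like this is only convenient for local debugging:

// Instead of -Dspark.master=local in VM options, set the master on the SparkConf directly.
// "local" = one worker thread; "local[*]" = one thread per available core.
SparkConf sparkConf = new SparkConf()
        .setAppName("wordCount")
        .setMaster("local[*]");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);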
Failed to locate the winutils binary in the hadoop binary path java.io.IOException [caused by missing files or permissions, or by a Hadoop environment that is not configured correctly]: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html
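In short, per the linked page, the usual fix on Windows is to obtain a winutils.exe that matches your Hadoop version, place it in the bin directory of the Hadoop folder, and make sure HADOOP_HOME (or the hadoop.home.dir system property set at the top of the demo) points at that folder.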