Apache Spark Installation
As you already know, Spark is a separate framework that doesn't need Hadoop to work.
But yes, we generally see it as part of the Big Data ecosystem. Why?
Because it is built to support distributed processing in a very efficient manner. You can use Spark on a single
machine without Hadoop, but its true power is unleashed when we bring it into the Big Data world.
Anyhow, we want to explore Spark first without worrying about setting up Hadoop or a separate resource/cluster
manager, as Spark also provides its own standalone resource manager, which is fine for your own learning
but not for production work.
Let's start the installation of Spark on Windows 10, since that is the OS I used for this installation.
You can use your own OS & look up the corresponding steps on the internet.
First download Spark from - https://spark.apache.org/downloads.html
I selected the options below first; you will then see the corresponding file at “Download Spark:” as shown below –
Click on the shown file & download it. Then extract its contents to a folder whose name has no spaces. I extracted it to “spark2.4.6”, so that the “bin” folder is directly under it.
Now open cmd from the “bin” folder & type – spark-shell
You will see the below error before you get the scala> shell –
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
So what is it & how do we resolve this issue?
Though Spark doesn’t need Hadoop, it still looks for this file during start-up. I found the solution at - https://stackoverflow.com/questions/35652665/java-io-ioexception-could-not-locate-executable-null-bin-winutils-exe-in-the-ha
- Download winutils.exe
- Create folder, say C:\winutils\bin
- Copy winutils.exe inside C:\winutils\bin
- Set environment variable HADOOP_HOME to C:\winutils
You don’t have to create the above folder separately – I downloaded winutils.exe & kept it inside the extracted Spark folder itself, & the Spark folder looks like –
Restart CMD after this change & execute spark-shell again; now it starts cleanly.
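As a quick optional sanity check (plain Scala at the scala> prompt, nothing Spark-specific), you can confirm the shell sees HADOOP_HOME –
sys.env.getOrElse("HADOOP_HOME", "HADOOP_HOME is not set")  // shows the configured path, or the fallback text if the variable isn't visible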
So let’s verify if Spark is working –
At the scala> prompt, let’s create some data: a simple sequence of numbers from 1 to 50,000.
val data = 1 to 50000
Now, let’s place these 50,000 numbers into a Resilient Distributed Dataset
(RDD) which we’ll call sparkSample. It is this RDD upon which Spark can perform
analysis.
val sparkSample = sc.parallelize(data)
Now we can filter the data in the RDD to find any values less than 10, as shown below.
Spark should report the result as an array containing the values less than 10.
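A minimal way to do this with the standard RDD API (filter plus collect; the variable name filtered is just for illustration) –
// keep only the values below 10 and bring them back to the driver as an array
val filtered = sparkSample.filter(_ < 10)
filtered.collect()
With the 1 to 50,000 sequence above, the returned array should simply be the values 1 through 9.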
But what if you get the error - unsupported class file major version 55?
For this, check which Java version you are using on your machine. Class file major version 55 corresponds to Java 11,
while Spark 2.4.6 works with JDK 1.8.
So switch the Java version on your machine, restart CMD & try the above steps again.
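If you’re not sure which JVM the shell actually picked up, one optional way to check (besides running java -version in CMD) is to ask from the scala> prompt –
// shows the running JVM's version string, e.g. 1.8.0_xxx for JDK 8 or 11.x.x for JDK 11
System.getProperty("java.version")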
Exception while deleting Spark temp dir in Windows
Now you might see exceptions on the console when you exit from spark-shell using Ctrl+D, :q or :quit –
17/01/24 15:37:53 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\415387\AppData\Local\Temp\spark-b1672cf6-989f-4890-93a0-c945ff147554
java.io.IOException: Failed to delete: C:\Users\415387\AppData\Local\Temp\spark-b1672cf6-989f-4890-93a0-c945ff147554
        at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:929)
        at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:65)
        at .....
Another issue, arggggg...
On top of the noisy exit, the temp folder keeps growing this way & we have to clean it up manually.
That’s a real pain for many, so we can make a couple of modifications in our Spark setup to avoid this.
After finding the solution at https://stackoverflow.com/questions/41825871/exception-while-deleting-spark-temp-dir-in-windows-7-64-bit, I followed the steps below & created a temp folder inside the extracted Spark folder -
Steps taken -
Update conf/spark-defaults.conf (or make a copy of conf/spark-defaults.conf.template & rename it to spark-defaults.conf)
Add the following line -
spark.local.dir=E:\\spark2.4.6\\tempDir
The above line sets the temp folder for Spark to use instead of the default location under AppData\Local\Temp.
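To confirm the setting got picked up, you can read it back from a fresh spark-shell (an optional check; the second argument to get is just a fallback string in case the property wasn’t loaded) –
// returns the configured scratch directory from spark-defaults.conf, or the fallback value
sc.getConf.get("spark.local.dir", "<not set>")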
Similarly, update conf/log4j.properties in your Spark setup (create it from log4j.properties.template if it doesn’t exist) with the below lines -
log4j.logger.org.apache.spark.util.ShutdownHookManager=OFF
log4j.logger.org.apache.spark.SparkEnv=ERROR
With this, the ShutdownHookManager logger is switched off, so those error lines no longer show up on the console during exit.
Now how do we clean the temp folder then?
For that, add the below lines to the bin/spark-shell.cmd file -
rmdir /q /s "E:\spark2.4.6\tempDir"
del C:\Users\nitin\AppData\Local\Temp\jansi*.*
With the above updates in place, I get a clean exit & the temp folders get cleaned up as well.