Here I won't be covering Spark's history, its internals, its architecture, or the algorithms it uses.
You can check the link below for a quick read on Spark; I am still going through it myself and it is quite helpful -
https://data-flair.training/blogs/fault-tolerance-in-apache-spark/
Here I will be putting down the commands which you may find helpful during your initial encounters with Spark.
I will also try to add solutions for the issues I faced along the way.
Note :- Before you try your hands on RDDs, DataFrames or Datasets, first read up on -
a) Transformations
b) Actions
Also keep in mind that Spark evaluates lazily: a transformation is only recorded, and the actual computation happens when an action is called.
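To make that lazy-evaluation point concrete, here is a minimal spark-shell sketch (nothing in it is tied to my files) -
val nums = sc.parallelize(1 to 10)      // build an RDD from a plain Scala range
val doubled = nums.map(n => n * 2)      // transformation - only recorded in the lineage, nothing runs yet
doubled.count()                         // action - this is the point where the map above actually executes
doubled.collect()                       // another action - returns Array(2, 4, 6, ..., 20) to the driver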
So to begin with, below are some commands I tried in spark-shell (i.e. using Scala); I will keep adding to this list as I pick up new ones (a short sketch tying a few of them together follows the list) -
> val myData = sc.textFile("file:///home/cloudera/Public/emp.txt") ---------------to read the text file from the local machine & not from HDFS
> val read = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/Data_Files/emp.txt") ----------------to read the data from hdfs while using Cloudera
> val myData = sc.textFile("file:///home/cloudera/Public/emp.txt", 5) ----------- to read the file in 6 partitions & now it will be having 6 partitions
> myData.getNumPartitions ------------------------------to get the number of partitions in given RDD
> val newD = myData.coalesce(3) ---------------- merges the 5 partitions of myData down to 3 and returns a new RDD, referred to here as newD
> myData.take(2) -----------returns the first 2 records
> myData.map(l=>l.toUpperCase()).filter(l=>l.startsWith("1")||l.startsWith("3")) ----------converts each line to upper case & keeps only the lines starting with "1" or "3"
> myData.collect --------------------returns all the elements of the RDD as an array to the driver
> myData.toDebugString ----------To print the lineage of RDD created
> data=["Nitin","is","great"] ---------- the Python way to create the collection
> val data = List("Nitin","is","here") ---- the Scala way to create the collection
> val rdd = sc.parallelize(data) ------ to parallelize the data present in the collection created above
> val myData = sc.textFile("Data_Files/emp.txt").map(l=>l.split(',')).map(f=>(f(0),f(0),f(1),f(2),f(3))).take(1) ---------gives the same output as below statement
> val myData = sc.textFile("Data_Files/emp.txt").map(l=>l.split(',')).take(1)
> val no = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) ----------to create an array in scala
> val dt = no.map(d=>(d*2)) ---------to multiply each element of no by 2
> val myData = sc.textFile("file:///home/cloudera/Public/emp.txt").map(d=>d.contains("M")) -------------will create RDD containing true or false for each item
> myData.partitions.length -------------------------to count the number of partitions in RDD
> myData.cache() --------------------------caches the RDD in memory so that future computations work on the in-memory data, saving disk reads and improving performance.
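And here is the short sketch promised above, tying a few of these commands together; the file path is the same placeholder used earlier -
val emp = sc.textFile("file:///home/cloudera/Public/emp.txt", 5)           // RDD with a minimum of 5 partitions
val upper = emp.map(l => l.toUpperCase())                                  // transformation - nothing runs yet
val filtered = upper.filter(l => l.startsWith("1") || l.startsWith("3"))   // another transformation, still lazy
println(filtered.partitions.length)                                        // number of partitions carried through the chain
filtered.collect().foreach(println)                                        // action - this triggers the whole lineage
println(filtered.toDebugString)                                            // prints the lineage of the chained RDDs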
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
The cache() command above marks the RDD for caching, but nothing is actually stored at that point - caching is itself lazy, and 'myData' is only materialized in memory the first time an action runs on it. Caching really pays off for RDDs that are reused, for example an intermediate RDD built by a transformation on myData that feeds several later computations.
While using caching, also keep the available memory in mind and cache only those RDDs which are reused across multiple computations.
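A minimal sketch of this idea, again using the same placeholder file path -
val emp = sc.textFile("file:///home/cloudera/Public/emp.txt")
val upper = emp.map(l => l.toUpperCase()).cache()        // marks the RDD for caching; nothing is stored yet
upper.count()                                            // first action - computes the RDD and keeps its partitions in memory
upper.filter(l => l.contains("M")).count()               // second action - reuses the cached data instead of re-reading the file
upper.unpersist()                                        // release the memory once the RDD is no longer needed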
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
One interesting issue I faced was how to read a CSV file and fetch its data. Obviously, I had to dig through a lot of Google results.
So I am sharing one way here, for when you have a proper CSV file with a known delimiter (the example below uses ";"); I will share other approaches with examples soon.
Before starting the Spark shell, use the command below (taken from https://spark-packages.org/package/databricks/spark-csv) to launch it with the CSV package when you want to read a CSV file -
> spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
Once the Spark shell has started, use the command below to read the CSV file; my CSV has a header, otherwise set the 'header' option to 'false' -
scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("mode","DROPMALFORMED").option("inferSchema","true").option("delimiter",";").load("/user/nitin/Project1/banking/bank.csv")
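To quickly verify the load, a couple of standard DataFrame calls (nothing here is specific to this dataset) -
scala> df.printSchema()     // shows the column names & the types inferred from the header
scala> df.show(5)           // prints the first 5 rows in tabular form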
Once the DataFrame is created, you can register its data as a temporary table so that you can query it via SQL. The registration is done like this -
scala> df.registerTempTable("Bank")
Now I have a table named "Bank" which I can query as shown below (show() prints the result) -
scala> sqlContext.sql("select * from Bank").show()
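A couple of hypothetical follow-up queries on the same table (the column name 'job' is only an assumption about this CSV, adjust it to your own header) -
scala> sqlContext.sql("select count(*) from Bank").show()                      // total number of rows
scala> sqlContext.sql("select job, count(*) from Bank group by job").show()    // assumes a 'job' column exists
scala> df.groupBy("job").count().show()                                        // same aggregation via the DataFrame API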
A very good example & solution for k-means in Spark/Scala, well worth a look to understand how it works, is at -
gist.github.com/umbertogriffo/b599aa9b9a156bb1e8775c8cdbfb688a
Hope you get a good idea about k-means; next time I will put up the sample question for which I used the above solution.
Keep watching this space.
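In the meantime, here is a minimal k-means sketch using the standard MLlib RDD API; the input path and the values for k and the iteration count are just placeholders, not the ones from the gist -
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val raw = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/Data_Files/kmeans_data.txt")   // placeholder path
val points = raw.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()                      // space-separated numeric features

val clusters = KMeans.train(points, 2, 20)           // k = 2 clusters, 20 iterations (placeholders)
clusters.clusterCenters.foreach(println)             // the learned cluster centres
val wssse = clusters.computeCost(points)             // within-set sum of squared errors, a rough quality measure
println(s"WSSSE = $wssse")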