In this post I will walk through some work with RDDs by giving a few examples; I hope these help.
Example 1 : We have the following 3 files, which you can create yourself with the schema given below -
a) employees : comma separated rows of employee id & name, e.g. E01,Nitin
b) salary : comma separated rows of employee id & salary, e.g. E01,10000
c) managers : comma separated rows of employee id & manager name, e.g. E01,ABC
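For instance, to reproduce the sample output shown further below, the first few rows of each file could look like this (the remaining rows follow the same pattern) -
employees :
E01,Anil
E02,Karthik
salary :
E01,50000
E02,50000
managers :
E01,Vivek
E02,Dhanashekar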
Question : Combine the data from all 3 files above to find the salary & manager of each employee.
Solution : Below I use RDDs & join them on the employee id to combine the data of all the files.
First, let us read & parse the data in the above files with the following statements -
val empData = sc.textFile("file:///home/cloudera/Public/employees").map(l=>(l.split(",")(0), l.split(",")(1))) // pair RDD of (employee id, name)
val sal = sc.textFile("file:///home/cloudera/Public/salary").map(l=>(l.split(",")(0), l.split(",")(1))) // pair RDD of (employee id, salary)
val man = sc.textFile("file:///home/cloudera/Public/managers").map(l=>(l.split(",")(0), l.split(",")(1))) // pair RDD of (employee id, manager name)
Now we have 3 pair RDDs, one per file, each keyed by employee id. join() matches the records of two pair RDDs by key, and here the key in every RDD is the employee id. After joining, I map the result to format the output as shown below -
empData.join(sal).join(man).map{case(id, ((name, salary), manager))=>(id.toString, name, salary.toInt, manager)}.foreach(println)
The output will look like the rows shown below; the exact contents depend on your files -
(E06,Rakesh,45000,Shreyas)
(E02,Karthik,50000,Dhanashekar)
(E08,Amer,10000,Shekhar)
(E04,Kavitha,45000,Ashish)
(E09,Santhosh,10000,Vinod)
(E03,Deshdeep,45000,Roopesh)
(E07,Priyamwadha,50000,Tanvir)
(E05,Vikram,50000,Mayank)
(E10,Pradeep,10000,Jitendra)
(E01,Anil,50000,Vivek)
Note :- I have squeezed every operation into a single line, just to make a simple solution look complex, which we often see people doing ;)
But I would suggest breaking the above statement up by the related operations to make it more readable & understandable, for example as sketched below.
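A minimal sketch of the broken-up version (the intermediate names empSal, empSalMan & result are my own) -
val empSal = empData.join(sal) // (id, (name, salary))
val empSalMan = empSal.join(man) // (id, ((name, salary), manager))
val result = empSalMan.map{ case (id, ((name, salary), manager)) => (id, name, salary.toInt, manager) }
result.foreach(println)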
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Example 2 : In this section on RDDs, I will also show a sample or two of creating data using Scala features & building RDDs from it.
Tip :- With new technologies & new APIs/methods being added every day, it is difficult to remember all their names. While using spark-shell, which gives a Scala prompt, suppose we have an RDD named rdd & want to know the methods available on it. Do -
scala> rdd.<press TAB>
It will present the list of methods available on the given RDD.
Now one example :
Suppose we want a collection of Person type data having certain attributes, filled with random values in those attributes.
So let us follow the steps below to achieve the target.
scala> import scala.util.Random._ //This will help to generate the Random values
scala> case class Person(name:String, age:Int, salary:Double) //Creating a class to store the data of Person type
scala> val population = Array.fill(1000)(Person(nextString(3), nextInt(90), 50000*nextDouble())) // This will create an array of 1000 Person objects filled with random data
scala> val pop = sc.parallelize(population,4) // parallelizing the array to have the data in RDD and have that data in 4 partitions
scala> pop.partitions.length // to know the partitions for the given RDD
scala> pop.count // to know the number of records in the given RDD
scala> val persons = pop.filter(_.age < 25) //now applying filter on the above RDD to get the only records where age < 25
scala> persons.foreach(println) // this will print each record from the persons RDD on a new line; in spark-shell's local mode it appears on the console (on a cluster the output goes to the executors)
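If you are running against a cluster rather than locally, a safer way to look at a few records is to bring them to the driver first & print from there, for example -
scala> persons.take(10).foreach(println) // fetches 10 records to the driver & prints them
scala> persons.collect().foreach(println) // brings all matching records to the driver (fine for small results, risky for huge RDDs)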
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
So far we have created an RDD, populated it from an Array & even done some filtering. But what if we want to create a DataFrame from that & query the data via SQL-like queries? We can create DataFrames & Datasets from RDDs in a few ways, let us see those at -
http://www.nitinagrawal.com/dataframes.html
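As a quick preview, one common way is shown below (a sketch, assuming Spark 2.x where spark-shell provides a SparkSession named spark) -
scala> import spark.implicits._ // enables toDF()/toDS() conversions for RDDs of case classes
scala> val popDF = pop.toDF() // DataFrame with columns name, age, salary
scala> popDF.createOrReplaceTempView("population") // register it so it can be queried with SQL
scala> spark.sql("SELECT name, salary FROM population WHERE age < 25").show()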