Here I will walk through some work with RDDs using a few examples; I hope these will help.

Example 1 : We have the following 3 files, which you can create yourself as per the given schema -
                    a) employees : comma-separated data with many rows like E01,Nitin
                    b) salary    : comma-separated data with many rows like E01,10000
                    c) managers  : comma-separated data with many rows like E01,ABC
         Question : Combine the data from all 3 files above to find the salary & the manager of each employee.
         Solution : Below I have used RDDs & their joins to combine the data of all the above files on the basis of the employee id.
                First, let us read & parse the data in the above files with the following statements -

val empData = sc.textFile("file:///home/cloudera/Public/employees").map(l => (l.split(",")(0), l.split(",")(1)))
val sal = sc.textFile("file:///home/cloudera/Public/salary").map(l => (l.split(",")(0), l.split(",")(1)))
val man = sc.textFile("file:///home/cloudera/Public/managers").map(l => (l.split(",")(0), l.split(",")(1)))

Now we have the above 3 RDDs for all 3 files. Next we will use join(), which joins the records of two pair RDDs by their common key; here the employee id is the first field in all the RDDs. Then I need to format the result to get the output in the order shown below -
empData.join(sal).join(man).map{case(id, ((name, salary), manager)) => (id.toString, name, salary.toInt, manager)}.foreach(println)

Its output will look like the lines below; the exact contents will depend on your files -
(E06,Rakesh,45000,Shreyas)
(E02,Karthik,50000,Dhanashekar)
(E08,Amer,10000,Shekhar)
(E04,Kavitha,45000,Ashish)
(E09,Santhosh,10000,Vinod)
(E03,Deshdeep,45000,Roopesh)
(E07,Priyamwadha,50000,Tanvir)
(E05,Vikram,50000,Mayank)
(E10,Pradeep,10000,Jitendra)
(E01,Anil,50000,Vivek)


Note :- I have put every operation on a single line, just to make a simple solution look complex, which we generally see people doing ;)
            But I would suggest breaking the above statement up by related operations, to make it more readable & understandable, as shown below -
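A minimal sketch of the same pipeline split up step by step (the intermediate variable names here are my own):

val empSal = empData.join(sal)              // (id, (name, salary))
val empSalMan = empSal.join(man)            // (id, ((name, salary), manager))
val result = empSalMan.map { case (id, ((name, salary), manager)) =>
  (id.toString, name, salary.toInt, manager)
}
result.foreach(println)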
​------------------------------------------------------------------------------------------------​-----------------------------------------------------------------------
Example 2 : In this section on RDDs, I will also show a sample or two that create data using Scala features & then build RDDs from it.

Tip :- In this world of new technologies, with new APIs/methods being added every day, it is difficult to remember all their names. So while using spark-shell, which gives a Scala prompt, suppose we have an RDD named rdd & want to know the methods available on it. Do -
scala> rdd.   <press the TAB key>
It will present the list of methods available on the given RDD.
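For instance, a trimmed illustration of what the completion shows (the real list is much longer):

scala> rdd.   <TAB>
cache      collect      count      distinct     filter     first
flatMap    foreach      map        persist      reduce     take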
Now one example :
Suppose we want a collection of Person type data having certain attributes, filled with random values in those attributes. So let's follow the steps below to achieve that.
scala> import scala.util.Random._                               // helps to generate random values
scala> case class Person(name:String, age:Int, salary:Double)   // a class to hold data of Person type
scala> val population = Array.fill(1000)(Person(nextString(3), nextInt(90), 50000*nextDouble()))   // creates an array of 1000 Person objects populated with random data
scala> val pop = sc.parallelize(population, 4)   // parallelizing the array to get the data into an RDD split across 4 partitions
scala> pop.partitions.length                     // to know the number of partitions of the given RDD
scala> pop.count                                 // to know the number of records in the given RDD
scala> val persons = pop.filter(_.age < 25)      // applying a filter on the above RDD to keep only the records where age < 25
scala> persons.foreach(println)                  // prints each record from the persons RDD on a new line on the console

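One small caveat: foreach(println) prints on your console here because spark-shell runs locally; on a real cluster the output goes to the executors' stdout instead of the driver console. A sketch of bringing a few records back to the driver before printing:

scala> persons.take(5).foreach(println)          // take(5) collects 5 records to the driver, then prints them locally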
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Now, just as we created an RDD, populated it from an Array & even did some filtering, what if we want to create a DataFrame from it & query the data via SQL-like queries? We can create DataFrames & Datasets from the available RDDs in a few ways; let's see those at -

http://www.nitinagrawal.com/dataframes.html
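As a quick preview, a minimal sketch (assuming a Spark 2.x spark-shell, which provides a SparkSession named spark with its implicits pre-imported):

scala> import spark.implicits._                      // brings toDF & the Encoders into scope (pre-imported in the shell)
scala> val popDF = pop.toDF()                        // RDD[Person] -> DataFrame
scala> popDF.createOrReplaceTempView("population")   // register it so we can run SQL-like queries on it
scala> spark.sql("select name, salary from population where age < 25").show()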