Here, I will not be talking about Spark's history, internals, architecture or the algorithms it uses.
You can check the link below for a quick read on Spark; I am still going through it myself and it is quite helpful -
https://data-flair.training/blogs/fault-tolerance-in-apache-spark/
Instead, I will be putting up the commands which you may find helpful during your initial encounters with Spark.
I will also try to share solutions for the issues I faced along my journey on the road of Spark.

Note :- Before you try your hands on RDDs, DataFrames or Datasets, first please read about -
a) Transformations
b) Actions
Also note that evaluation here is lazy: transformations are only recorded, and the actual computation takes place when an action is called.
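For example, a minimal sketch of this lazy behaviour in spark-shell (the file path here is only an assumption) -

>   val lines = sc.textFile("file:///home/cloudera/Public/emp.txt") ----------transformation, nothing is read yet
>   val upper = lines.map(l=>l.toUpperCase()) ----------another transformation, still nothing executed
>   upper.count() ----------action, only now is the file actually read & the map applied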


So to start with, below are some commands which I tried in spark-shell (i.e. using Scala), and I will keep adding more here as I come across them -

>   val myData = sc.textFile("file:///home/cloudera/Public/emp.txt") ---------------to read the text file from the local machine & not from HDFS
>   val read = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/Data_Files/emp.txt") ----------------to read the data from hdfs while using Cloudera
>   val myData = sc.textFile("file:///home/cloudera/Public/emp.txt", 5) ----------- to read the file in 6 partitions & now it will be having 6 partitions
>   myData.getNumPartitions ------------------------------to get the number of partitions in given RDD
>   val newD = myData.coalesce(3)  ---------------- it will merge the 6 partitions of myData to create new RDD being referred by newD here
>   myData.take(2) -----------returns the first 2 records
>   myData.map(l=>l.toUpperCase()).filter(l=>l.startsWith("1")||l.startsWith("3")) ----------chained transformations; nothing executes until an action is called on the result
>   myData.collect --------------------to show the array of data
>   myData.toDebugString ----------To print the lineage of RDD created
>   data=["Nitin","is","great"] ---------- Python (PySpark) syntax for creating the collection
>   val data = List("Nitin","is","here") ---- it is applicable for scala to create the collection
>   val rdd = sc.parallelize(data) ------ to parallelize the data present in the collection created above
>   val myData = sc.textFile("Data_Files/emp.txt").map(l=>l.split(',')).map(f=>(f(0),f(0),f(1),f(2),f(3))).take(1) ---------gives the same output as below statement
>   val myData = sc.textFile("Data_Files/emp.txt").map(l=>l.split(',')).take(1)
>   val no = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) ----------to create an array in scala
>   val dt = no.map(d=>(d*2)) ---------to multiply each element of no by 2
>   val myData = sc.textFile("file:///home/cloudera/Public/emp.txt").map(d=>d.contains("M")) -------------will create RDD containing true or false for each item
>   myData.partitions.length -------------------------to count the number of partitions in RDD
>   myData.cache() --------------------------cache the RDD in the memory & all future computation will work on the in-memory data, which saves disk seeks and improves the performance.
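
Putting a few of the above together, here is a small sketch of how the partition commands behave (the path and the partition counts are just assumptions) -

>   val raw = sc.textFile("file:///home/cloudera/Public/emp.txt", 5) ----------ask for a minimum of 5 partitions
>   raw.getNumPartitions ----------typically 5 here
>   val fewer = raw.coalesce(2) ----------narrow the existing partitions down to 2 without a full shuffle
>   fewer.getNumPartitions ----------now 2
>   fewer.count() ----------action, only now is the file actually read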

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
The command above shows how to cache an RDD, but I found that in this case 'myData' does not really gain from caching, since it is the RDD reading the contents straight from the file and not some intermediate RDD, which as per the Spark docs is the usual candidate for caching. I can, however, cache another RDD created by a transformation on myData above.
While using caching, one must consider the memory available and which RDDs are reused for multiple transformations, so cache only those RDDs which are used many times in transformations.
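
A minimal sketch of caching only the reused intermediate RDD (the file path and the column position used for gender are assumptions) -

>   val raw = sc.textFile("file:///home/cloudera/Public/emp.txt")
>   val parsed = raw.map(l=>l.split(',')).cache() ----------parsed is reused below, so it is worth caching
>   parsed.filter(f=>f(3)=="M").count() ----------first action, computes parsed & caches it in memory
>   parsed.filter(f=>f(3)=="F").count() ----------second action, reuses the cached data instead of re-reading the file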

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
One interesting issue I faced was how to read a CSV file and fetch its data. Obviously, I had to dig through a lot of Google results.
So I am sharing one way here for when you have a proper CSV file to read; the delimiter can be passed as an option (the example below uses ";"). I came across other ways too, which I will share soon with examples.

When you want to read a CSV file, start the spark shell with the below command (taken from https://spark-packages.org/package/databricks/spark-csv) -

> spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
Once the spark shell has started, use the below command to read the CSV file. My CSV has a header row; if yours does not, set the 'header' option to 'false'.
scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("mode","DROPMALFORMED").option("inferSchema","true").option("delimiter",";").load("/user/nitin/Project1/banking/bank.csv")
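
Once the load succeeds, a quick way to sanity-check what was read (just a sketch; the actual columns depend on your file) -

scala> df.printSchema() ----------shows the column names & the types inferred because of the 'inferSchema' option
scala> df.show(5) ----------prints the first 5 rows in tabular form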

Once the dataframe is created, you can register it as a temporary table so that its data can be accessed via SQL queries. This registration is done like -
scala> df.registerTempTable("Bank")

Now I have a table named "Bank" which I can query as shown below -
scala> sqlContext.sql("select * from Bank")
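
For example, a couple of queries against that temporary table (the 'age' and 'marital' column names are my assumptions about the bank.csv header) -

scala> sqlContext.sql("select * from Bank where age > 40").show(10)
scala> sqlContext.sql("select marital, count(*) as cnt from Bank group by marital").show()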
A very good example and solution for k-means in Spark/Scala, which everyone should see to understand how it works, is at -
gist.github.com/umbertogriffo/b599aa9b9a156bb1e8775c8cdbfb688a
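
To give a rough idea of what such a solution looks like, below is a minimal k-means sketch using Spark MLlib (the input path, number of clusters and iterations are my assumptions, not the gist's actual code) -

scala> import org.apache.spark.mllib.clustering.KMeans
scala> import org.apache.spark.mllib.linalg.Vectors
scala> val points = sc.textFile("Data_Files/points.txt") ----------each line like: 1.0,2.0,3.0
scala> val vectors = points.map(l=>Vectors.dense(l.split(',').map(_.toDouble))).cache()
scala> val model = KMeans.train(vectors, 3, 20) ----------k = 3 clusters, 20 iterations
scala> model.clusterCenters.foreach(println) ----------the learned cluster centres
scala> model.predict(Vectors.dense(1.0, 2.0, 3.0)) ----------cluster assigned to a new point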

Hope you get a good idea about k-means; next time I will put up the sample question for which I used the gist's solution above.
Keep watching this space.