
Here I will not be talking about Spark's history, its architecture, how it works internally, or the algorithms it uses.
You can check the link below for a quick read on Spark; I am still going through it myself and it is quite helpful:
https://data-flair.training/blogs/fault-tolerance-in-apache-spark/
Instead, I will be putting up the commands which you may find helpful during your initial encounters with Spark.
I will also try to post solutions for the issues I faced along the way.

Note :- Before you try your hands on RDDs, DataFrames or Datasets, first please read up on -
a) Transformations
b) Actions
Also note that Spark evaluates lazily: transformations are only recorded, and the actual computation takes place when an action is called, as the small sketch below shows.
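A minimal sketch of this lazy behaviour in spark-shell, reusing the emp.txt path from the commands further down (the path is just an example and assumes the file exists there) -

>   val lines = sc.textFile("file:///home/cloudera/Public/emp.txt") ----------nothing is read yet, Spark only records the lineage
>   val upper = lines.map(l=>l.toUpperCase()) ----------still no work is done, map is a transformation
>   upper.count() ----------count is an action, so only now is the file actually read and mapped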


So to start with, below are some commands I tried my hands on in spark-shell (i.e. using Scala), and I will keep adding to this list as I come across new ones -

>   val myData = sc.textFile("file:///home/cloudera/Public/emp.txt") ---------------to read the text file from the local machine & not from HDFS
>   val read = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/Data_Files/emp.txt") ----------------to read the data from hdfs while using Cloudera
>   val myData = sc.textFile("file:///home/cloudera/Public/emp.txt", 5) ----------- to read the file in 6 partitions & now it will be having 6 partitions
>   myData.getNumPartitions ------------------------------to get the number of partitions in given RDD
>   val newD = myData.coalesce(3)  ---------------- merges the partitions of myData down to 3, creating a new RDD referred to by newD here
>   myData.take(2) -----------prints first 2 records
>   myData.map(l=>l.toUpperCase()).filter(l=>l.startsWith("1")||l.startsWith("3"))
>   myData.collect --------------------to show the array of data
>   myData.toDebugString ----------To print the lineage of RDD created
>   data=["Nitin","is","great"] ---------- it is applicable for python only to create the collection
>   val data = List("Nitin","is","here") ---- it is applicable for scala to create the collection
>   val rdd = sc.parallelize(data) ------ to parallelize the data present in the collection created above
>   val myData = sc.textFile("Data_Files/emp.txt").map(l=>l.split(',')).map(f=>(f(0),f(0),f(1),f(2),f(3))).take(1) ---------gives the same output as below statement
>   val myData = sc.textFile("Data_Files/emp.txt").map(l=>l.split(',')).take(1)
>   val no = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) ----------to create an array in scala
>   val dt = no.map(d=>(d*2)) ---------to multiply each element of no by 2
>   val myData = sc.textFile("file:///home/cloudera/Public/emp.txt").map(d=>d.contains("M")) -------------will create RDD containing true or false for each item
>   myData.partitions.length -------------------------to count the number of partitions in RDD
>   myData.cache() --------------------------cache the RDD in the memory & all future computation will work on the in-memory data, which saves disk seeks and improves the performance.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
The command above shows how to cache an RDD, but in my case I found that 'myData' was not actually cached, since it is the RDD reading the contents straight from the file and not some intermediate RDD, which as per the Spark docs is the kind worth caching. I can, however, cache another RDD created by a transformation on myData.
While using caching, one must consider the memory available and which RDDs are reused across multiple transformations; cache only those RDDs which are used many times, as in the sketch below.
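A small sketch of that idea, again using the emp.txt path from the commands above: the intermediate RDD is cached because two different actions reuse it -

>   val lines = sc.textFile("file:///home/cloudera/Public/emp.txt")
>   val upper = lines.map(l=>l.toUpperCase()).cache() ----------intermediate RDD, cached because it is reused below
>   upper.filter(l=>l.startsWith("1")).count() ----------first action, computes 'upper' and keeps it in memory
>   upper.filter(l=>l.startsWith("3")).count() ----------second action, works on the cached data instead of re-reading the file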

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
One interesting issue I faced was how to read a CSV file and fetch its data. Obviously, I had to dig through a lot of Google results.
So I am sharing one way here, for when you have a proper CSV file to read (the delimiter is configurable; the example below uses ";"). I came across other ways as well, which I will share soon with some examples.

When you want to read a CSV file, start the spark shell with the command below (taken from https://spark-packages.org/package/databricks/spark-csv) -

> spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
Once the spark shell has started, use the command below to read the CSV file. My CSV has a header row; if yours does not, set the 'header' option to 'false'.
scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("mode","DROPMALFORMED").option("inferSchema","true").option("delimiter",";").load("/user/nitin/Project1/banking/bank.csv")

Once the dataframe is created, you can register it as a temporary table so its data can be accessed via SQL queries. The registration is done like this -
scala> df.registerTempTable("Bank")

​Now I have one table named "Bank" which I can query like shown below -
​scala>sqlContext.sql("select * from Bank")
There is a very good example and solution for k-means in Spark/Scala; everyone should take a look to understand how it works -
gist.github.com/umbertogriffo/b599aa9b9a156bb1e8775c8cdbfb688a
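That gist is the place to go for the full solution; below is only a minimal sketch of the RDD-based MLlib KMeans API to show its general shape, with a placeholder input path and an arbitrary choice of k=3 and 20 iterations -

scala> import org.apache.spark.mllib.clustering.KMeans
scala> import org.apache.spark.mllib.linalg.Vectors
scala> val raw = sc.textFile("/user/nitin/Project1/kmeans/points.csv") ----------placeholder path, each line holds comma-separated numeric features
scala> val vectors = raw.map(l=>Vectors.dense(l.split(',').map(_.toDouble))).cache() ----------cached because KMeans iterates over it many times
scala> val model = KMeans.train(vectors, 3, 20) ----------k = 3 clusters, 20 iterations
scala> model.clusterCenters.foreach(println) ----------the learned cluster centres
scala> model.computeCost(vectors) ----------within-cluster sum of squared errors, lower is better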

Hope this gives you a good idea about k-means; next time I will post the sample question for which I used the above solution.
Keep watching this space.