Here I will be putting some practical questions & their answers which may help you. Please also suggest better questions or answers that could be added here.
I will also be including questions from various YouTube or other videos, and I will give the links here so you can get more details.
Now let's start...
Example 1 : Suppose you are given a list of numbers - 2,3,1,4,2,5,8,6,34,21,45
and you are required to produce the below output - how will you do that?
Array[Array[Int]] = Array(Array(2, 3, 1), Array(4, 2, 5, 8), Array(6, 34, 21, 45))
Solution : Here, we can see an array of arrays, so can we consider that we are partitioning the given list into 3 partitions?
If yes, then we have resolved the 1st part: we can break the given input into 3 partitions. And we can simply break it into 3 partitions because
we don't need to change the order of the input; just breaking it into 3 partitions will do the job here.
Step 1 : Create an RDD with 3 partitions & the above data -
scala> val a = sc.parallelize(List(2,3,1,4,2,5,8,6,34,21,45), 3)
If you check the content of the RDD with a.collect, it will be -
Array[Int] = Array(2, 3, 1, 4, 2, 5, 8, 6, 34, 21, 45)
Step 2 : So far we are getting a simple array of the above values. But the output does not show any information about the number of partitions in
the RDD, & we can get this information in the following 2 ways -
a) scala> a.partitions.size
b) scala> a.getNumPartitions
In both cases we will get the below result -
Int = 3
Now comes the final part, creating the given array of arrays. Here we can use glom, which returns the data of each partition as an array.
So, we know that the above RDD was created with 3 partitions & we just need to represent the data of each partition as an Array, which we
can do as shown below, with its result on the next line -
scala> a.glom.collect
res24: Array[Array[Int]] = Array(Array(2, 3, 1), Array(4, 2, 5, 8), Array(6, 34, 21, 45))
You can read more about glom in the blog below -
http://blog.madhukaraphatak.com/glom-in-spark/
In the same way as in the above blog, we can find the max or min, or apply any other operation where we can utilize the partition behaviour.
For example, we can find the max as shown below, with the result -
scala> a.glom.map((v:Array[Int])=>v.max).reduce(_ max _)
res33: Int = 45
This is one way we can work on partitions; the other way is to use mapPartitions, available on RDDs, as shown below -
First define the below method, which returns the maximum element from each partition -
scala> def maxNum(nums: Iterator[Int]) : Iterator[Int] = {
return List(nums.toList.max).iterator
}
Then we can call this function from mapPartitions as shown below to get the result -
scala> a.mapPartitions(num => maxNum(num)).max
res44: Int = 45
You can use either of these approaches when the data is huge & partitioning can improve the performance; both should give about the same performance.
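By the way, for this particular question the built-in max action gives the same answer directly, without any explicit partition handling -
scala> a.max
Int = 45
The partition-based versions above are mainly to show how glom & mapPartitions let you control the per-partition work yourself.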
Finding this interesting?
Let me take one more flavour: get the maximum number from each partition & also verify that the defined function is called once per partition
rather than once per element, as happens when you use map. Calling the function for every single element adds overhead, which can be bad when you have
a lot of numbers, & in such cases working per partition can enhance the performance. The sample below confirms it; the printed output follows the command -
scala> def maxNum(index: Int, nums: Iterator[Int]) : Iterator[Int] = {
val a = nums.toList.max
println("Called in Partition -> " + index + " : having max number as :" + a)
return List(a).iterator
}
maxNum: (index: Int, nums: Iterator[Int])Iterator[Int]
scala> a.mapPartitionsWithIndex((index,num) => maxNum(index,num)).max
Called in Partition -> 0 : having max number as :3
Called in Partition -> 1 : having max number as :8
Called in Partition -> 2 : having max number as :45
res53: Int = 45
Hope it helps.
If you find another way to get this output, please do let me know, as I was not able to come up with a different way to structure the solution.
===============================================================================================
Example 2 : How to get the below output -
Array((even,CompactBuffer(2, 4, 6, 8, 10, 12, 14, 16, 18)), (odd,CompactBuffer(1, 3, 5, 7, 9, 11, 13, 15, 17)))
Solution : val nums = sc.parallelize(1 to 18, 3)
nums.groupBy(x => if (x % 2 == 0) "even" else "odd").collect
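The same grouping can also be expressed with a pair RDD - a small sketch of the equivalent form, keying each number by its parity & grouping by key -
scala> nums.map(x => (if (x % 2 == 0) "even" else "odd", x)).groupByKey.collect
This should give the same even/odd CompactBuffer output as groupBy above.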
===============================================================================================
Following are the ways you can upload data files to HDFS -
a) Use HUE; it is easier as it gives a UI & there is no need to write commands to transfer the files.
b) Use the command line, i.e. a terminal window, with any of the given commands -
1) to copy file from local to HDFS
hadoop fs -copyFromLocal <source> <destination>
or
hadoop fs -put <source> <destination>
2) to copy from HDFS to local
hadoop fs -copyToLocal <source> <destination>
or
hadoop fs -get <source> <destination>
While getting files, the source will be your HDFS path & file name & the destination will be your local path.
Instead of hadoop fs, you can also use hdfs dfs. Try this as well.
===============================================================================================
What are the main Hadoop configuration files?
hadoop-env.sh - environment settings, e.g. JAVA_HOME
core-site.xml - common settings such as fs.defaultFS, the URI of the NameNode
hdfs-site.xml - HDFS settings such as the replication factor & the NameNode/DataNode storage directories
yarn-site.xml - YARN settings such as the ResourceManager address & the resources available to each NodeManager
mapred-site.xml - MapReduce settings, e.g. mapreduce.framework.name (usually yarn)
masters - host that runs the Secondary NameNode
slaves - hosts that run the DataNodes/NodeManagers (named 'workers' in newer Hadoop versions)
===============================================================================================
If there are too many small files in HDFS, they take up too much NameNode metadata & too many blocks.
If this overhead is not required, we can archive these files into a Hadoop archive as shown below -
> hadoop archive -archiveName sample.har -p <parent path> <input location(s)> <output location>
.har stands for Hadoop archive.
How do we copy a file into HDFS with a block size different from the configured default?
hadoop fs -D dfs.blocksize=<size in bytes> -copyFromLocal <source> <destination>
The block size is given in bytes & must be a multiple of 512 (e.g. 134217728 for 128 MB).
How do we check the block size of the copied file?
hadoop fs -stat %o <path to file in HDFS>
===============================================================================================
What is the process of spilling in MapReduce?
The output of a map task is first written to an in-memory buffer. The default buffer size is 100 MB, as specified in mapreduce.task.io.sort.mb.
A threshold is defined on this buffer, because some space has to be kept free so the map can keep writing its output while the buffer is being flushed.
Once this threshold is reached, the buffered map output is copied to the local disk, and this process is called spilling.
So, spilling is the process of copying data from the memory buffer to the local disk once a certain threshold is reached.
The default threshold is 0.8 (80%), as specified in mapreduce.map.sort.spill.percent.
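As a rough sketch, these two properties can be tuned on the job configuration (the values below are just examples, not recommendations) -
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.setInt("mapreduce.task.io.sort.mb", 256)            // size of the in-memory sort buffer, in MB
conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f) // start spilling when the buffer is 90% full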
===============================================================================================
What is BlockScanner in HDFS?
BlockScanner is a background service that runs on each DataNode & periodically verifies the checksums of the data blocks stored there, so it maintains the integrity of the data blocks; corrupt blocks are reported to the NameNode.
You can look it up on Google for further details.
===============================================================================================
Ques : How will you prevent a file from splitting in case you want the whole file to be processed by the same mapper?
Solution :
Method 1 : In the driver code, increase the minimum split size so that it is larger than the largest input file -
a) conf.set("mapreduce.input.fileinputformat.split.minsize", "size_larger_than_file_size");  (the older property name is mapred.min.split.size)
b) Input split size formula - max(minimumSize, min(maximumSize, blockSize))
Method 2 : Modify the InputFormat class that you want to use :
a) Subclass the concrete subclass of FileInputFormat you are using and override the isSplitable() method to return false, as sketched below.
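A minimal sketch of Method 2, assuming the new MapReduce API (org.apache.hadoop.mapreduce) with TextInputFormat as the base class -
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// An input format that never splits its files, so each file is read by a single mapper.
class NonSplittableTextInputFormat extends TextInputFormat {
  override def isSplitable(context: JobContext, file: Path): Boolean = false
}
Then set it on the job with job.setInputFormatClass(classOf[NonSplittableTextInputFormat]).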
===============================================================================================
Ques : What do you mean by a MapReduce task running in uber mode?
Solution : If a job is small enough, the ApplicationMaster can choose to run its tasks in its own JVM instead of requesting new containers; tasks run this way are called uber tasks.
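A rough sketch of the properties that control this behaviour (the values shown are only illustrative) -
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.setBoolean("mapreduce.job.ubertask.enable", true) // allow small jobs to run inside the ApplicationMaster's JVM
conf.setInt("mapreduce.job.ubertask.maxmaps", 9)       // job qualifies only if it has at most this many map tasks
conf.setInt("mapreduce.job.ubertask.maxreduces", 1)    // and at most this many reduce tasks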
===============================================================================================
Ques : How will you enhance the performance of MapReduce job when dealing with too many small files?
Solution : CombineFileInputFormat can be used to solve this problem; it packs many small files into each input split, so a single mapper processes several files instead of one tiny file each. It also considers node & rack locality when deciding which blocks to place in the same split.
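A minimal sketch, assuming the new MapReduce API & its concrete CombineTextInputFormat subclass (the input path & split size are just placeholders) -
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{CombineTextInputFormat, FileInputFormat}

val job = Job.getInstance()
job.setInputFormatClass(classOf[CombineTextInputFormat])         // pack many small files into each split
FileInputFormat.addInputPath(job, new Path("/data/small-files")) // placeholder input directory
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024)    // cap each combined split at 128 MB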
===============================================================================================
Ques : Where does the data of a Hive table get stored?
Solution : By default, Hive table data is stored in the HDFS directory /user/hive/warehouse.
It is specified by the hive.metastore.warehouse.dir configuration parameter in hive-site.xml.
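A small sketch (from the spark-shell with Hive support; 'some_table' is a made-up name) to check where a given table actually lives - look for the 'Location' row in the output of -
scala> spark.sql("DESCRIBE FORMATTED some_table").show(100, false)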
===============================================================================================
Ques : What is the difference between external table and managed table?
Solution : For a managed table, Hive owns both the metadata & the data, so if the table is dropped, its metadata & data are both deleted. For an external table, Hive manages only the metadata, so if the table is dropped, only the metadata is deleted & the table data stays safe.
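A small sketch (run from the spark-shell with Hive support; the table name & location are made up) showing the difference in practice -
scala> spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS dept_ext (id STRING, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/data/dept'")
scala> spark.sql("DROP TABLE dept_ext")   // drops only the metadata; the files under /user/data/dept remain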
===============================================================================================
Ques : When should we use SORT BY instead of ORDER BY?
Solution : Use SORT BY when the dataset is huge; it sorts within each reducer, so multiple reducers can work in parallel for better performance (the result is not globally ordered). ORDER BY guarantees a total order but uses only one reducer, which becomes a bottleneck on large data.
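For illustration (the table & column names are made up; the same syntax works in Hive or in Spark SQL) -
scala> spark.sql("SELECT id, salary FROM emp ORDER BY salary DESC")  // total order, single reducer for the final sort
scala> spark.sql("SELECT id, salary FROM emp SORT BY salary DESC")   // sorted within each reducer/partition only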
===============================================================================================
Ques : Display the contents of a DataFrame without truncating the values.
Solution : As shown below, by default the data gets truncated -
scala> deptDF.show
+-----+--------------------+
| id| name|
+-----+--------------------+
|SALES| SALES DEPARTMENT|
| IT|INFORMATION TECHN...|
| HR| HUMAN RESOURCE|
+-----+--------------------+
But if we don't want any truncation, then execute the above command like this -
scala> deptDF.show(false)
+-----+----------------------+
|id |name |
+-----+----------------------+
|SALES|SALES DEPARTMENT |
|IT |INFORMATION TECHNOLOGY|
|HR |HUMAN RESOURCE |
+-----+----------------------+
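If you also want to control how many rows are printed, show takes the row count as the first argument & the truncate flag as the second (20 here is just an example) -
scala> deptDF.show(20, false)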
===============================================================================================