Partition interview questions
Top frequently asked partition interview questions
I have launched my cluster this way:
/usr/lib/spark/bin/spark-submit --class MyClass --master yarn-cluster --num-executors 3 --driver-memory 10g --executor-memory 10g --executor-cores 4 /path/to/jar.jar
The first thing I do is read a big text file, and count it:
val file = sc.textFile("/path/to/file.txt.gz")
println(file.count())
When doing this, I see that only one of my nodes is actually reading the file and executing the count (because I only see one task). Is that expected? Should I repartition my RDD afterwards, or when I use map reduce functions, will Spark do it for me?
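A note on the likely cause: gzip is not a splittable compression format, so Spark reads a .gz file as a single partition, and the count runs as one task. A minimal PySpark sketch of the usual workaround (the target of 12 partitions is an arbitrary example value):
file = sc.textFile("/path/to/file.txt.gz")  # gzip is not splittable: arrives as one partition
print(file.getNumPartitions())              # expect 1 for a .gz file
repartitioned = file.repartition(12)        # redistribute for later stages
print(repartitioned.count())
The initial read is still a single task either way; for heavy pipelines it is often better to store the data uncompressed or in a splittable format such as bzip2.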
Source: (StackOverflow)
I'm having a hard time trying to achieve the following:
I have a list (say, [a,b,c,d]) and I need to partition it into pairs and unique elements in every possible way (order is not important), i.e.:
[a,b,c,d], [(a,b), c,d], [(a,b), (c,d)], [a, (b,c), d], [(a,d), (b, c)]...
and so on. This thread solves the problem when only pairs are used, but I also need the unique elements, and I cannot get it to work.
Any ideas will be much appreciated.
Thanks!
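A minimal recursive sketch of one way to enumerate these groupings, in Python (the function name is mine): fix the first element, then either keep it as a singleton or pair it with each remaining element, and recurse on the rest.
def pair_and_single_partitions(items):
    # yields every partition of items into pairs and singletons
    if not items:
        yield []
        return
    head, rest = items[0], items[1:]
    for tail in pair_and_single_partitions(rest):
        yield [head] + tail                       # head stays a singleton
    for i, other in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pair_and_single_partitions(remaining):
            yield [(head, other)] + tail          # head paired with other

for p in pair_and_single_partitions(['a', 'b', 'c', 'd']):
    print(p)
For 4 elements this yields 10 groupings, matching the telephone-number sequence 1, 1, 2, 4, 10, ...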
Source: (StackOverflow)
Suppose we have a vector of pairs:
std::vector<std::pair<A,B>> v;
where for type A only equality is defined:
bool operator==(A const & lhs, A const & rhs) { ... }
How would you sort it so that all pairs with the same first element end up adjacent? To be clear, the output I hope to achieve should be the same as that produced by something like this:
std::unordered_multimap<A,B> m(v.begin(),v.end());
std::copy(m.begin(),m.end(),v.begin());
However I would like, if possible, to:
- Do the sorting in place.
- Avoid the need to define a hash function for equality.
Edit: additional concrete information.
In my case the number of elements isn't particularly big (I expect N = 10~1000), though I have to repeat this sorting many times (~400) as part of a bigger algorithm, and the datatype A is pretty big (it contains among other things an unordered_map with ~20 std::pair<uint32_t,uint32_t> in it), which is the structure that prevents me from defining an ordering and makes it hard to build a hash function.
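One idea that fits these constraints: repeatedly take the first not-yet-grouped element and move everything equal to it next to it. In C++ this can be written as a loop of std::stable_partition calls over the remaining tail, which works in place and needs only operator==. A rough Python sketch of the same grouping idea (not in place, for clarity; the function name is mine):
def group_equal_firsts(pairs):
    # pairs: list of (a, b); only equality on the first component is assumed
    out, remaining = [], list(pairs)
    while remaining:
        key = remaining[0][0]
        out.extend(p for p in remaining if p[0] == key)      # pull the group forward
        remaining = [p for p in remaining if p[0] != key]
    return out
This costs O(N * K) comparisons for K distinct keys, which for N = 10~1000 repeated ~400 times should be well within budget.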
Source: (StackOverflow)
(I've read A Big Data Modeling Methodology for Apache Cassandra for the data modeling of my project database, which uses Cassandra, so I use the query-driven methodology.)
I will have a customer search page, as below. (This is just an example; the real page has more search parameters, and none of the search parameters are required.)

The sample Customers table in my Cassandra keyspace (the primary key is selected according to the mentioned article):
//---------Create Customers Table
USE testKeySpace;
CREATE TABLE IF NOT EXISTS customers(
id varint,
name text,
birthday date,
gender text,
education text,
PRIMARY KEY ((id,name,gender,education),birthday)
);
Questions are:
- What's the best indexing model for this table?
- How can I write a query to support optional search parameters?
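For the second question, the Cassandra-idiomatic answer is usually one denormalized table per query pattern rather than one table with optional filters; secondary indexes and ALLOW FILTERING both carry real performance costs. Purely as an illustration, a hypothetical Python sketch that assembles a CQL statement from whichever parameters were supplied (all names are mine):
def build_customer_query(filters):
    # filters: dict mapping column name -> value for the parameters the user filled in
    cql = "SELECT * FROM customers"
    if filters:
        cql += " WHERE " + " AND ".join("%s = ?" % col for col in filters)
        cql += " ALLOW FILTERING"  # expensive: forces a scan; avoid on large tables
    return cql, list(filters.values())

print(build_customer_query({"gender": "male", "education": "BS"})[0])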
Source: (StackOverflow)
How do I write a Java program to divide a set of numbers into two sets such that the difference between their sums is minimal?
For example, I have an array containing the integers [5,4,8,2]. I can divide it into two arrays, [8,2] and [5,4]. Assuming that the given set of numbers has a unique solution like the above example, how do I write a Java program to find it? It would be fine even if the program only finds the minimum possible difference.
Let's say my method receives an array as a parameter. The method has to first divide the array into two arrays, then add up the integers contained in each, and finally return the difference between the two sums, such that the difference is the minimum possible.
P.S. I have had a look around here but couldn't find any specific solution to this. The most promising lead seemed to be given here: divide an array into two sets with minimal difference. But I couldn't gather from that thread how I could write a Java program that produces a definite solution to the problem.
EDIT:
After looking at the comment of @Alexandru Severin, I tried a Java program. It works for one set of numbers, [1,3,5,9], but doesn't work for another, [4,3,5,9,11]. The program is below; please suggest changes:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FindMinimumDifference {

    public static void main(String[] args) {
        int[] arr = new int[]{4, 3, 5, 9, 11};
        FindMinimumDifference obj = new FindMinimumDifference();
        obj.returnMinDiff(arr);
    }

    private int returnMinDiff(int[] array) {
        int diff = -1;
        Arrays.sort(array);
        List<Integer> list1 = new ArrayList<>();
        List<Integer> list2 = new ArrayList<>();
        int sumOfList1 = 0;
        int sumOfList2 = 0;
        for (int a : array) {
            for (Integer i : list1) {
                sumOfList1 += i;
            }
            for (Integer i : list2) {
                sumOfList2 += i;
            }
            if (sumOfList1 <= sumOfList2) {
                list1.add(a);
            } else {
                list2.add(a);
            }
        }
        List<Integer> list3 = new ArrayList<>(list1);
        List<Integer> list4 = new ArrayList<>(list2);
        Map<Integer, List<Integer>> mapOfProbables = new HashMap<Integer, List<Integer>>();
        int probableValueCount = 0;
        for (int i = 0; i < list1.size(); i++) {
            for (int j = 0; j < list2.size(); j++) {
                if (abs(list1.get(i) - list2.get(j)) <
                        abs(getSumOfEntries(list1) - getSumOfEntries(list2))) {
                    List<Integer> list = new ArrayList<>();
                    list.add(list1.get(i));
                    list.add(list2.get(j));
                    mapOfProbables.put(probableValueCount++, list);
                }
            }
        }
        int minimumDiff = abs(getSumOfEntries(list1) - getSumOfEntries(list2));
        List resultList = new ArrayList<>();
        for (List probableList : mapOfProbables.values()) {
            list3.remove(probableList.get(0));
            list4.remove(probableList.get(1));
            list3.add((Integer) probableList.get(1));
            list4.add((Integer) probableList.get(0));
            if (minimumDiff > abs(getSumOfEntries(list3) - getSumOfEntries(list4))) {
                // valid exchange
                minimumDiff = abs(getSumOfEntries(list3) - getSumOfEntries(list4));
                resultList = probableList;
            }
        }
        System.out.println(minimumDiff);
        if (resultList.size() > 0) {
            list1.remove(resultList.get(0));
            list2.remove(resultList.get(1));
            list1.add((Integer) resultList.get(1));
            list2.add((Integer) resultList.get(0));
        }
        System.out.println(list1 + "" + list2); // the two resulting sets of
                                                // numbers giving the expected result
        return minimumDiff;
    }

    private static int getSumOfEntries(List<Integer> list) {
        int sum = 0;
        for (Integer i : list) {
            sum += i;
        }
        return sum;
    }

    private static int abs(int i) {
        if (i <= 0)
            i = -i;
        return i;
    }
}
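Not a patch for the code above, but for reference: greedy placement plus swaps can miss the optimum, which is why [4,3,5,9,11] fails (the answer is 0, via {4,3,9} and {5,11}). The standard approach is a subset-sum dynamic program; a compact Python sketch of the idea, which ports directly to Java with a boolean[] indexed by sum:
def min_partition_diff(nums):
    total = sum(nums)
    reachable = {0}                        # all achievable subset sums
    for n in nums:
        reachable |= {s + n for s in reachable}
    best = min(reachable, key=lambda s: abs(total - 2 * s))
    return abs(total - 2 * best)

print(min_partition_diff([5, 4, 8, 2]))      # 1, e.g. {5,4} vs {8,2}
print(min_partition_diff([4, 3, 5, 9, 11]))  # 0, e.g. {4,3,9} vs {5,11}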
Source: (StackOverflow)
I have a list:
ip_info = ['10.0.0.2/10.10.10.1', '10.0.111.1/10.10.121.4', '10.0.145.15/10.99.10.1', '10.99.0.1/10.44.155.4', '10.0.10.1/10.10.110.1']
I want to be able to strip all characters after the / character for each item in the list,
for an output of:
ip_info = ['10.0.0.2/', '10.0.111.1/', '10.0.145.15/', '10.99.0.1/', '10.0.10.1/']
From there I will be able to remove the / without issue, as they are all static and can be removed easily.
I have attempted:
for x in ip_info:
    ''.join(ip_info.partition('/')[0:2])
I don't think this is correct, as it needs to happen for each item in the list. Help?
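A minimal sketch of the fix: partition has to be called on each string x (not on the list), and its result has to be kept, which a list comprehension does in one step:
stripped = [''.join(ip.partition('/')[:2]) for ip in ip_info]  # keeps the trailing '/'
hosts = [ip.partition('/')[0] for ip in ip_info]               # or drop the '/' as well
print(stripped)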
Source: (StackOverflow)
I was looking at how to split a set in two based on the contents of a third set. Accidentally I stumbled upon this solution:
val s = Set(1,2,3)
val s2 = Set(4,5,6)
val s3 = s ++ s2
s3.partition(s)
res0: (scala.collection.immutable.Set[Int],scala.collection.immutable.Set[Int]) = (Set(1, 2, 3),Set(5, 6, 4))
The signature of partition is as follows:
def partition(p: A => Boolean): (Repr, Repr)
Can someone explain to me how providing a set instead of a function works?
Thanks in advance
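The short answer: in Scala, Set[A] extends the function type A => Boolean, and its apply method is membership, so passing s where a predicate is expected splits s3 into (elements of s, everything else). A loose Python analogue, using the set's own membership test as the predicate (the helper is mine; Python has no built-in partition on collections):
s = {1, 2, 3}
s3 = s | {4, 5, 6}

def partition(pred, xs):
    return [x for x in xs if pred(x)], [x for x in xs if not pred(x)]

in_s, not_in_s = partition(s.__contains__, s3)  # the set itself is the test
print(in_s, not_in_s)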
Source: (StackOverflow)
Can somebody help me?
I want to split a string of two words, like "word1 word2", using split and partition, and print (using a for loop) the words separately, like:
Partition:
word1
word2
Split:
word1
word2
But it's not working. Can somebody please help me?
This is my code:
print("Hello World")
name = raw_input("Type your name: ")
train = 1,2
train1 = 1,2
print("Separation with partition: ")
for i in train1:
print name.partition(" ")
print("Separation with split: ")
for i in train1:
print name.split(" ")
This is happening:
Separation with partition:
('word1', ' ', 'word2')
('word1', ' ', 'word2')
Separation with split:
['word1', 'word2']
['word1', 'word2']
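A minimal sketch of the fix (Python 2, as in the question): partition returns a (head, separator, tail) tuple and split returns a list, so iterate over those results instead of over an unrelated tuple like train1:
name = "word1 word2"
print "Separation with partition: "
head, sep, tail = name.partition(" ")
for word in (head, tail):   # skip the separator itself
    print word
print "Separation with split: "
for word in name.split(" "):
    print word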
Source: (StackOverflow)
I have a table structure like this:
CREATE TABLE `cdr` (`id` bigint(20) NOT NULL AUTO_INCREMENT,
`dataPacketDownLink` bigint(20) DEFAULT NULL,
`dataPacketUpLink` bigint(20) DEFAULT NULL,
`dataPlanEndTime` datetime DEFAULT NULL,
`dataPlanStartTime` datetime DEFAULT NULL,
`dataVolumeDownLink` bigint(20) DEFAULT NULL,
`dataVolumeUpLink` bigint(20) DEFAULT NULL,
`dataplan` varchar(255) DEFAULT NULL,
`dataplanType` varchar(255) DEFAULT NULL,
`createdOn` datetime DEFAULT NULL,
`deviceName` varchar(500) DEFAULT NULL,
`duration` int(11) NOT NULL,
`effectiveDuration` int(11) NOT NULL,
`hour` int(11) DEFAULT NULL,
`eventDate` datetime DEFAULT NULL,
`msisdn` bigint(20) DEFAULT NULL,
`quarter` int(11) DEFAULT NULL,
`validDays` int(11) DEFAULT NULL,
`dataLeft` bigint(20) DEFAULT NULL,
`completedOn` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `msisdn_index` (`msisdn`),
KEY `eventdate_index` (`eventDate`)
) ENGINE=MyISAM AUTO_INCREMENT=55925171 DEFAULT CHARSET=latin1
and when I am creating the partition:
ALTER TABLE cdr PARTITION BY RANGE (TO_DAYS(eventdate)) (
PARTITION p01 VALUES LESS THAN (TO_DAYS('2013-09-01')),
PARTITION p02 VALUES LESS THAN (TO_DAYS('2013-09-15')),
PARTITION p03 VALUES LESS THAN (TO_DAYS('2013-09-30')),
PARTITION p04 VALUES LESS THAN (MAXVALUE));
I am getting error 1503: A primary key must include all columns in the table's partitioning function.
I have read about this everywhere but haven't found a solution, so please let me know how to partition this table. I have 20+ million records in it.
Thank you.
Source: (StackOverflow)
While analyzing the HashJoin part in the source code of PostgreSQL, I got confused with the meaning of the word "batch".
What is the meaning of "batch"?
This image is a screenshot of the ExecHashJoinNewBatch part of nodeHashjoin.c in PostgreSQL 9.4.1.
I also have a question about how the hash join progresses in PostgreSQL; I have studied this part, but it is too hard to understand.
Please help...
Thank you

Source: (StackOverflow)
I already have a working example which does exactly what I need.
Now the problem is that I'm not really a fan of subqueries, and I think there could be a better solution to this problem.
So here is my (already) working example:
with t as
(
select 'Group1' as maingroup,'Name 1' as subgroup, 'random' as random, 500 as subgroupbudget from dual
union all
select 'Group1','Name 1','random2',500 from dual
union all
select 'Group1','Name 2','random3', 500 from dual
union all
select 'Group2','Name 3','random4', 500 from dual
union all
select 'Group2','Name 4','random5',500 from dual
union all
select 'Group2','Name 5', 'random6',500 from dual
)
select
maingroup,
subgroup,
random,
(select distinct sum(subgroupbudget) over(partition by maingroup) from t b where a.maingroup=b.maingroup group by maingroup,subgroup,subgroupbudget) groupbudget
from t a
group by maingroup, subgroup ,subgroupbudget, random
order by maingroup, subgroup
As you can see, the with clause shows a simplified table with data. Now the problem is that the last column is the budget of the subgroup, while in the result I need the budget of the maingroup. That means I have to sum all values within the maingroup, but only once per distinct subgroup (here I need some kind of distinct).
Unfortunately, a simple
sum(distinct subgroupbudget) over(partition by maingroup)
won't work, because the numbers (subgroupbudget) can be the same (like in the example).
I hope my question is understandable.
Thanks!
Source: (StackOverflow)
I have a huge table that stores many tracked events, such as a user click.
The table already holds tens of millions of rows, and it's growing larger every day.
The queries are starting to get slower when I try to fetch events from a large timeframe, and after reading quite a bit on the subject I understand that partitioning the table may boost the performance.
What I want to do is partition the table on a per-month basis.
I have only found guides that show how to partition each month manually. Is there a way to just tell MySQL to partition by month so that it will do so automatically?
If not, what is the command to do it manually, considering that the column I partition by is a datetime?
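For reference: MySQL has no built-in auto-partitioning by month; RANGE partitions on TO_DAYS(your_datetime_column) have to be added ahead of time, typically from a scheduled job or a MySQL event. A hypothetical Python sketch that emits the monthly DDL, assuming the table is already PARTITION BY RANGE (TO_DAYS(...)) with no MAXVALUE partition (the table and partition names are mine):
from datetime import date

def monthly_partition_ddl(table, month_start):
    # VALUES LESS THAN is exclusive, so the boundary is the first day of the next month
    nxt = date(month_start.year + month_start.month // 12,
               month_start.month % 12 + 1, 1)
    pname = "p%04d%02d" % (month_start.year, month_start.month)
    return ("ALTER TABLE %s ADD PARTITION "
            "(PARTITION %s VALUES LESS THAN (TO_DAYS('%s')));"
            % (table, pname, nxt))

print(monthly_partition_ddl("events", date(2015, 7, 1)))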
Source: (StackOverflow)
I have data organized in directories in a particular format (shown below) and want to add it to a Hive table. I want to add all data under the 2012 directory.
All the names below are directory names, and the innermost directories (3rd level) hold the actual data files.
Is there any way to pick up the data directly, without having to change this directory structure?
Any pointers are appreciated.
/2012/
|
|---------2012-01
|---------2012-01-01
|---------2012-01-02
|...
|...
|---------2012-01-31
|
|---------2012-02
|---------2012-02-01
|---------2012-02-02
|...
|...
|---------2012-02-28
|
|---------2012-03
|...
|...
|---------2012-12
Queries tried so far without luck:
CREATE EXTERNAL TABLE sampledata
(datestr string, id string, locations string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/path/to/data/2012/*/*';
CREATE EXTERNAL TABLE sampledata
(datestr string, id string, locations string)
partitioned by (ystr string, ymstr string, ymdstr string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
ALTER TABLE sampledata
ADD
PARTITION (ystr ='2012')
LOCATION '/path/to/data/2012/';
SOLUTION:
This small parameter fixed my issue; I am adding it to the question as it might be beneficial for others:
SET mapred.input.dir.recursive=true;
Source: (StackOverflow)
I'm a computer science student (just started), and I was working on writing a randomized-pivot version of Quicksort from pseudocode. I've written and tested it, and it all works perfectly, however...
The partition part looks a bit too complicated, as I feel I have missed something or overthought it. I can't tell whether it's OK or whether I made some avoidable mistakes.
So, long story short: it works, but how can I do better?
Thanks in advance for all the help
void partition(int a[], int start, int end)
{
    srand(time(NULL));
    int pivotpos = 3; // start + rand() % (end-start);
    int i = start;    // index 1
    int j = end;      // index 2
    int flag = 1;
    int pivot = a[pivotpos]; // sets the pivot's value
    while (i < j && flag)    // main loop
    {
        flag = 0;
        while (a[i] < pivot)
        {
            i++;
        }
        while (a[j] > pivot)
        {
            j--;
        }
        if (a[i] > a[j]) // swap && sets new pivot, and restores the flag
        {
            swap(&a[i], &a[j]);
            if (pivotpos == i)
                pivotpos = j;
            else if (pivotpos == j)
                pivotpos = i;
            flag++;
        }
        else if (a[i] == a[j]) // avoids getting stuck on a mirror of values
                               // (e.g. pivot at pos 3 of: 1-0-0-1-1)
        {
            if (pivotpos == i)
                j--;
            else if (pivotpos == j)
                i++;
            else
            {
                i++;
                j--;
            }
            flag++;
        }
    }
}
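One common simplification is Lomuto's scheme: move the randomly chosen pivot out of the way first, which removes both the pivotpos bookkeeping and the equal-elements special case (also, srand should be called once per program, not on every partition call). A Python sketch for comparison, not a drop-in replacement for the C code above:
import random

def partition(a, start, end):
    p = random.randint(start, end)     # random pivot, parked at the end
    a[p], a[end] = a[end], a[p]
    pivot = a[end]
    i = start                          # boundary of the "< pivot" prefix
    for j in range(start, end):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[end] = a[end], a[i]        # pivot into its final slot
    return i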
Source: (StackOverflow)
I'm trying to find which partition is used for what, e.g. /boot, /recovery, /system, from adb shell. While this is trivial for partitions currently mounted (using the mount or df commands; see e.g. how to identify names of the partitions), it appears to be tricky when it comes to partitions not currently mounted (like /recovery when booted in "user mode").
There's a tutorial at XDA, but it didn't work out for any of the devices I've tried:
- cat /proc/mtd: this is empty or non-existing
- cat /proc/emmc: this is empty or non-existing
- cat /proc/dumchar_info: non-existing (MTK/MediaTek)
- ls -al /dev/block/platform/*/by-name: either non-existing, or not having the wanted details
- parted just yielded an Error: Can't have a partition outside the disk! on /dev/block/mmcblk1 (while simply missing the "name" column for /dev/block/mmcblk0)
So I'm at a loss. I know there are apps like DiskInfo which can show those details, so the information must be stored somewhere on the device. However, modifying the device (by installing an app) is not an option in my case.
So basically my question boils down to:
Where on the Android device is this information stored?
If possible, a generic approach is preferred. If not, a "try-and-err" of several approaches (if..elseif..fi) would do as well.
For background: an example use would be "I want to retrieve the /boot partition only" (get an image of it via dd). It wouldn't do to first grab all partitions and evaluate later – too time consuming, and too much data produced ;) – This already describes the intention: writing a little tool to retrieve a particular disk image.
Source: (StackOverflow)