EzDevInfo.com

partition interview questions

Spark: Repartition strategy after reading text file

I have launched my cluster this way:

/usr/lib/spark/bin/spark-submit --class MyClass --master yarn-cluster --num-executors 3 --driver-memory 10g --executor-memory 10g --executor-cores 4 /path/to/jar.jar

The first thing I do is read a big text file, and count it:

val file = sc.textFile("/path/to/file.txt.gz")
println(file.count())

When doing this, I see that only one of my nodes is actually reading the file and executing the count (because I only see one task). Is that expected? Should I repartition my RDD afterwards, or when I use map reduce functions, will Spark do it for me?


Source: (StackOverflow)

python - split a list in pairs and unique elements

I'm having a hard time trying to achieve the following: I have a list (say, [a,b,c,d]) and I need to partition it into pairs and unique elements in every possible way (order is not important), i.e.:

[a,b,c,d], [(a,b), c,d], [(a,b), (c,d)], [a, (b,c), d], [(a,d), (b, c)]...

and so on. This thread solves the problem when only pairs are used, but I also need the unique elements, and I cannot get it to work. Any ideas would be much appreciated. Thanks!
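
One possible approach, given here as a minimal sketch rather than a definitive answer: take the first element and either leave it on its own or pair it with each later element, then recurse on what is left. The function name `pair_partitions` is made up for illustration. Because the first remaining element is always handled first, each arrangement should be produced exactly once.

```python
def pair_partitions(items):
    """Yield every way to split items into unordered pairs and single elements."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    # Case 1: the first element stays on its own.
    for tail in pair_partitions(rest):
        yield [first] + tail
    # Case 2: the first element is paired with each later element in turn.
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pair_partitions(remaining):
            yield [(first, partner)] + tail


for p in pair_partitions(["a", "b", "c", "d"]):
    print(p)
```

For a four-element list this yields 10 arrangements: one with no pairs, six with a single pair, and three with two pairs.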


Source: (StackOverflow)

Sort when only equality is available

Suppose we have a vector of pairs:

std::vector<std::pair<A,B>> v;

where for type A only equality is defined:

bool operator==(A const & lhs, A const & rhs) { ... }

How would you sort it so that all pairs with the same first element end up adjacent? To be clear, the output I hope to achieve should be the same as what something like this produces:

std::unordered_multimap<A,B> m(v.begin(),v.end());
std::copy(m.begin(),m.end(),v.begin());

However I would like, if possible, to:

  • Do the sorting in place.
  • Avoid the need to define a hash function for equality.

Edit: additional concrete information.

In my case the number of elements isn't particularly big (I expect N = 10~1000), though I have to repeat this sorting many times (~400) as part of a bigger algorithm. The datatype A is pretty big: among other things it contains an unordered_map with ~20 std::pair<uint32_t,uint32_t> in it, which is what prevents me from inventing an ordering and makes it hard to build a hash function.
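
A minimal sketch of one way to do this with equality alone, written in Python for brevity: walk the sequence, and for each new key swap every later pair with an equal key up next to it. The names `Key` and `group_in_place` are hypothetical; `Key` only stands in for a type that supports `==` and nothing else. The cost is quadratic in the worst case, which may well be acceptable for N up to about 1000.

```python
class Key:
    """Stand-in for type A: supports == only (no ordering, not hashable)."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return self.value == other.value


def group_in_place(pairs):
    """Reorder pairs in place so entries with equal first elements are adjacent.

    Only == is applied to the keys; no ordering or hash function is needed.
    """
    n = len(pairs)
    i = 0
    while i < n:
        key = pairs[i][0]
        boundary = i + 1  # pairs[i:boundary] all have a first element == key
        for j in range(i + 1, n):
            if pairs[j][0] == key:
                pairs[boundary], pairs[j] = pairs[j], pairs[boundary]
                boundary += 1
        i = boundary  # continue with the first pair that has a different key
    return pairs


v = [(Key("x"), 1), (Key("y"), 2), (Key("x"), 3), (Key("z"), 4), (Key("y"), 5)]
group_in_place(v)
print([(k.value, b) for k, b in v])
```

The same swap-based loop translates directly to a `std::vector` using `std::swap`, since it needs nothing beyond `operator==`.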


Source: (StackOverflow)

Best Indexing model in Cassandra table

(I've read "A Big Data Modeling Methodology for Apache Cassandra" to model the database for my project, which uses Cassandra, so I am following the query-driven methodology.)

I will have a customer search page as shown below. (This is just an example; the real page has more search parameters, and none of the search parameters is required.)

Sample Search Customers

The sample Customers table in my Cassandra keyspace (the primary key was chosen according to the article mentioned above):

//---------Create Customers Table
USE testKeySpace;
CREATE TABLE IF NOT EXISTS customers(
id varint,
name text,
birthday date,
gender text,
education text,
PRIMARY KEY ((id,name,gender,education),birthday)
);

Questions are:

  • What's the best indexing model for this table?
  • How can I write a query to support optional search parameters?

Source: (StackOverflow)

How to divide a set of numbers into two sets such that the difference of their sum is minimum

How can I write a Java program to divide a set of numbers into two sets such that the difference between the sums of their elements is minimal?

For example, I have an array containing the integers [5,4,8,2]. I can divide it into two arrays, [8,2] and [5,4]. Assuming that the given set of numbers can have a unique solution, as in the example above, how can I write a Java program to find it? It would be fine even if I could only find the minimum possible difference. Let's say my method receives an array as a parameter. The method first has to divide the received array into two arrays and then add up the integers in each. It then has to return the difference between the two sums, such that this difference is the minimum possible.

P.S. I have had a look around here but couldn't find any specific solution to this. The most likely solution seemed to be the one given in "divide an array into two sets with minimal difference", but I couldn't gather from that thread how to write a Java program that gives a definite solution to the problem.

EDIT:

After looking at the comment by @Alexandru Severin, I tried a Java program. It works for one set of numbers, [1,3,5,9], but doesn't work for another set, [4,3,5,9,11]. Below is the program. Please suggest changes:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FindMinimumDifference {
public static void main(String[] args) {
    int[] arr= new int[]{4,3,5,9, 11};  
    FindMinimumDifference obj= new FindMinimumDifference();
    obj.returnMinDiff(arr);
}

private int  returnMinDiff(int[] array){


    int diff=-1;
    Arrays.sort(array);
    List<Integer> list1= new ArrayList<>();
    List<Integer> list2= new ArrayList<>();
    int sumOfList1=0;
    int sumOfList2=0;
    for(int a:array){
        for(Integer i:list1){
            sumOfList1+=i;
        }
        for(Integer i:list2){
            sumOfList2+=i;
        }
        if(sumOfList1<=sumOfList2){
        list1.add(a);
        }else{
            list2.add(a);
        }
    }

    List<Integer> list3=new ArrayList<>(list1);   
    List<Integer> list4= new ArrayList<>(list2);   
    Map<Integer, List<Integer>> mapOfProbables= new HashMap<Integer, List<Integer>>();
    int probableValueCount=0;
    for(int i=0; i<list1.size();i++){  
        for(int j=0; j<list2.size();j++){
            if(abs(list1.get(i)-list2.get(j))<
abs(getSumOfEntries(list1)-getSumOfEntries(list2))){
                List<Integer> list= new ArrayList<>();
                list.add(list1.get(i));
                list.add(list2.get(j));    
                mapOfProbables.put(probableValueCount++, list);
            }
        }
    }
    int minimumDiff=abs(getSumOfEntries(list1)-getSumOfEntries(list2));
    List resultList= new ArrayList<>();
    for(List probableList:mapOfProbables.values()){  
        list3.remove(probableList.get(0));
        list4.remove(probableList.get(1));
        list3.add((Integer)probableList.get(1));
        list4.add((Integer)probableList.get(0));
        if(minimumDiff>abs(getSumOfEntries(list3)-getSumOfEntries(list4))){ 
// valid exchange 
                minimumDiff=abs(getSumOfEntries(list3)-getSumOfEntries(list4));
                resultList=probableList;
        }

    }

    System.out.println(minimumDiff);

    if(resultList.size()>0){
        list1.remove(resultList.get(0));
        list2.remove(resultList.get(1));
        list1.add((Integer)resultList.get(1));
        list2.add((Integer)resultList.get(0));
    }

    System.out.println(list1+""+list2);  // the two resulting set of 
// numbers with modified data giving expected result

    return minimumDiff;
}

private static int getSumOfEntries(List<Integer> list){
    int sum=0;
    for(Integer i:list){
        sum+=i;
    }
    return sum;
}
private static int abs(int i){
    if(i<=0) 
        i=-i;
    return i;
}
}
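
For comparison, a minimal sketch of a different technique from the greedy-plus-exchange approach above: track every subset sum that can be reached, then pick the one closest to half the total. It is written in Python for brevity and assumes integer inputs; the function name `min_partition_difference` is made up for illustration. It returns only the minimum difference, which the question says would be acceptable.

```python
def min_partition_difference(numbers):
    """Return the smallest possible |sum(A) - sum(B)| over all splits of numbers.

    Keeps the set of all reachable subset sums, so the work grows with
    len(numbers) times the number of distinct sums.
    """
    total = sum(numbers)
    reachable = {0}  # subset sums achievable with the numbers seen so far
    for n in numbers:
        reachable |= {s + n for s in reachable}
    # If one side sums to s, the other sums to total - s.
    return min(abs(total - 2 * s) for s in reachable)


print(min_partition_difference([5, 4, 8, 2]))
print(min_partition_difference([4, 3, 5, 9, 11]))
```

For [5,4,8,2] the result is 1 (10 versus 9), and for [4,3,5,9,11] it is 0, since 5 + 11 = 4 + 3 + 9 = 16.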

Source: (StackOverflow)

Strip Text in all List Items after Character in each list Item Python

I have a list:

ip_info = ['10.0.0.2/10.10.10.1', '10.0.111.1/10.10.121.4', '10.0.145.15/10.99.10.1', '10.99.0.1/10.44.155.4', '10.0.10.1/10.10.110.1']

I want to be able to strip all characters after the / character for each item in the list.

For an output of:

ip_info = ['10.0.0.2/', '10.0.111.1/', '10.0.145.15/', '10.99.0.1/', '10.0.10.1/']

From there I will be able to remove the / without issue as they are all static and can be removed easily.

I have attempted:

for x  in ip_info:
    ''.join(ip_info.partition('/')[0:2])

I don't think this is correct, as it needs to happen for each item in the list. Help?
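
A minimal sketch of what the attempt seems to be aiming for: `partition` is a string method, so it has to be called on each item rather than on the list, and a list comprehension collects the results. Taking index 0 drops the `/` in the same step.

```python
ip_info = ['10.0.0.2/10.10.10.1', '10.0.111.1/10.10.121.4', '10.0.145.15/10.99.10.1',
           '10.99.0.1/10.44.155.4', '10.0.10.1/10.10.110.1']

# partition() returns (head, separator, tail); index 0 is the text
# before the first '/'.
ip_info = [item.partition('/')[0] for item in ip_info]
print(ip_info)
# ['10.0.0.2', '10.0.111.1', '10.0.145.15', '10.99.0.1', '10.0.10.1']
```

To keep the trailing slash instead, `''.join(item.partition('/')[:2])` applied per item gives the intermediate form shown in the question.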


Source: (StackOverflow)

Scala partition a set

I was looking at how to split a set in two based on the contents of a third set. Accidentally I stumbled upon this solution:

val s = Set(1,2,3)
val s2 = Set(4,5,6)
val s3 = s ++ s2

s3.partition(s)
res0: (scala.collection.immutable.Set[Int],scala.collection.immutable.Set[Int]) = (Set(1, 2, 3),Set(5, 6, 4))

The signature of partition is as follows:

def partition(p: A => Boolean): (Repr, Repr)

Can someone explain to me how providing a set instead of a function works?

Thanks in advance
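
For context: in Scala, Set[A] extends A => Boolean, and applying a set to a value tests membership, so a set can be passed wherever a predicate is expected. The following is only a rough Python analogy, not Scala's implementation; the helper `partition` is hypothetical, and the bound method `s.__contains__` plays the role the set itself plays in `s3.partition(s)`.

```python
def partition(predicate, items):
    """Split items into (matching, non_matching), like Scala's partition."""
    matching, rest = set(), set()
    for item in items:
        (matching if predicate(item) else rest).add(item)
    return matching, rest


s = {1, 2, 3}
s2 = {4, 5, 6}
s3 = s | s2

# s.__contains__ is the membership test as a callable: s.__contains__(x)
# is the same as `x in s`.
inside, outside = partition(s.__contains__, s3)
print(inside, outside)
```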


Source: (StackOverflow)

Python partition and split

Can somebody help me? I want to split a string of two words, like "word1 word2", using both split and partition, and print the words separately (using a for loop), like this:

Partition:
word1
word2

Split:
word1
word2

But it's not working, can somebody please help me? This is my code:

print("Hello World")
name = raw_input("Type your name: ")

train = 1,2
train1 = 1,2
print("Separation with partition: ")
for i in train1:
    print name.partition(" ")

print("Separation with split: ")
for i in train1:
    print name.split(" ")

This is happening:

Separation with partition:

('word1', ' ', 'word2')
('word1', ' ', 'word2')

Separation with split:

['word1', 'word2']
['word1', 'word2']
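
A minimal sketch of one way to get the desired output, assuming Python 3 and a fixed string in place of the `raw_input()` value: loop over the pieces that `partition` and `split` return, rather than over an unrelated tuple such as `train1`.

```python
name = "word1 word2"  # stands in for the value typed by the user

print("Partition:")
# partition() returns (before, separator, after); skip the separator.
first, _, second = name.partition(" ")
for word in (first, second):
    print(word)

print("Split:")
# split() returns a list of the words, so it can be iterated directly.
for word in name.split(" "):
    print(word)
```

The original loop printed the whole result twice because it ran once per element of `train1` and printed the full tuple or list each time.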

Source: (StackOverflow)

getting error 1503: A primary key must include all columns in the table's partitioning function

I have a table structure like-

CREATE TABLE `cdr` (`id` bigint(20) NOT NULL AUTO_INCREMENT,
                    `dataPacketDownLink` bigint(20) DEFAULT NULL,
                    `dataPacketUpLink` bigint(20) DEFAULT NULL,
                    `dataPlanEndTime` datetime DEFAULT NULL,
                    `dataPlanStartTime` datetime DEFAULT NULL,
                    `dataVolumeDownLink` bigint(20) DEFAULT NULL,
                    `dataVolumeUpLink` bigint(20) DEFAULT NULL,  
                    `dataplan` varchar(255) DEFAULT NULL,  
                    `dataplanType` varchar(255) DEFAULT NULL,  
                    `createdOn` datetime DEFAULT NULL,  
                    `deviceName` varchar(500) DEFAULT NULL,  
                    `duration` int(11) NOT NULL,  
                    `effectiveDuration` int(11) NOT NULL,  
                    `hour` int(11) DEFAULT NULL,  
                    `eventDate` datetime DEFAULT NULL,  
                    `msisdn` bigint(20) DEFAULT NULL,  
                    `quarter` int(11) DEFAULT NULL,  
                    `validDays` int(11) DEFAULT NULL,  
                    `dataLeft` bigint(20) DEFAULT NULL,  
                    `completedOn` datetime DEFAULT NULL,   
                PRIMARY KEY (`id`),   
                KEY `msisdn_index` (`msisdn`),   
                KEY `eventdate_index` (`eventDate`)   
            ) ENGINE=MyISAM AUTO_INCREMENT=55925171 DEFAULT CHARSET=latin1

When I try to create the partitions:

ALTER TABLE cdr PARTITION BY RANGE (TO_DAYS(eventdate))  (
    PARTITION p01 VALUES LESS THAN (TO_DAYS('2013-09-01')),  
    PARTITION p02 VALUES LESS THAN (TO_DAYS('2013-09-15')),  
    PARTITION p03 VALUES LESS THAN (TO_DAYS('2013-09-30')),   
    PARTITION p04 VALUES LESS THAN (MAXVALUE));

I get:

error 1503: A primary key must include all columns in the table's partitioning function

I have read everything I could find about this without getting anywhere, so please let me know how to partition this table. It has 20+ million records in it.

Thank you.


Source: (StackOverflow)

What is the meaning of "batch" in PostgreSQL HashJoin

While analyzing the HashJoin part of the PostgreSQL source code, I got confused about the meaning of the word "batch". What is the meaning of "batch"? The part I am looking at is ExecHashJoinNewBatch in nodeHashjoin.c (PostgreSQL 9.4.1).

I also have a question about how the hash join progresses in PostgreSQL. I have studied this part, but it is too hard to understand. Please help. Thank you.


Source: (StackOverflow)

Oracle Get sum of distinct group without subquery

I already have a working example which does exactly what I need. Now the problem is, that I'm not really a fan of subqueries and I think there could be a better solution to this problem.

So here is my (already) working example:

with t as
(
select  'Group1' as maingroup,'Name 1' as subgroup, 'random' as random, 500 as subgroupbudget from dual
union all
select 'Group1','Name 1','random2',500 from dual
union all
select 'Group1','Name 2','random3', 500 from dual
union all
select 'Group2','Name 3','random4', 500 from dual
union all
select 'Group2','Name 4','random5',500 from dual
union all
select 'Group2','Name 5', 'random6',500 from dual
)
select
maingroup,
subgroup,
random,
(select distinct sum(subgroupbudget) over(partition by maingroup) from t b where a.maingroup=b.maingroup group by maingroup,subgroup,subgroupbudget) groupbudget
from t a
group by  maingroup, subgroup ,subgroupbudget, random
order by maingroup, subgroup

As you can see, the with-clause shows a simplified table with data. Now the problem is that the last column is the budget of the subgroup. In the result I need the budget of the maingroup. That means I have to sum all values within the maingroup, but only if the subgroups are different (Here I need some kind of distinct).

Unfortunately a simple

sum(distinct subgroupbudget) over(partition by maingroup)

won't work because the numbers (subgroupbudget) can be the same (like in the example)

I hope my question is understandable.

Thanks!


Source: (StackOverflow)

MySQL table partition by month

I have a huge table that stores many tracked events, such as a user click.

The table already has tens of millions of rows, and it's growing larger every day. The queries are starting to get slower when I try to fetch events from a large timeframe, and after reading quite a bit on the subject I understand that partitioning the table may boost performance.

What I want to do is partition the table on a per-month basis.

I have only found guides that show how to partition each month manually. Is there a way to just tell MySQL to partition by month and have it do that automatically?

If not, what is the command to do it manually, considering that the column I am partitioning by is a datetime?
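
As far as I know, MySQL does not create new RANGE partitions on its own, so the month boundaries have to be listed explicitly. A minimal sketch that generates such a statement, using the same `PARTITION BY RANGE (TO_DAYS(...))` form that appears elsewhere on this page: the function `monthly_partition_ddl` and the names `events` and `created_at` are hypothetical placeholders, not taken from the question.

```python
from datetime import date


def monthly_partition_ddl(table, column, start, months):
    """Build an ALTER TABLE statement with one RANGE partition per month.

    table and column are caller-supplied (hypothetical) names; start is any
    date in the first month to cover.
    """
    clauses = []
    year, month = start.year, start.month
    for _ in range(months):
        name = f"p{year:04d}{month:02d}"
        # Each partition holds rows dated before the first day of the next month.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        upper = date(year, month, 1).isoformat()
        clauses.append(f"    PARTITION {name} VALUES LESS THAN (TO_DAYS('{upper}'))")
    # Catch-all partition for everything after the last listed month.
    clauses.append("    PARTITION pmax VALUES LESS THAN (MAXVALUE)")
    return (f"ALTER TABLE {table} PARTITION BY RANGE (TO_DAYS({column})) (\n"
            + ",\n".join(clauses) + "\n);")


print(monthly_partition_ddl("events", "created_at", date(2013, 11, 1), 3))
```

Note that MySQL requires the partitioning column to be part of every unique key on the table, including the primary key, so the key definition may need adjusting before such a statement is accepted.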


Source: (StackOverflow)

How to pick up all data into hive from subdirectories

I have data organized in directories in a particular format (shown below) and want to add it to a Hive table. I want to add all the data under the 2012 directory. All the names below are directory names, and the innermost directories (third level) hold the actual data files. Is there any way to pick up the data directly without having to change this directory structure? Any pointers are appreciated.

/2012/
|
|---------2012-01
            |---------2012-01-01
            |---------2012-01-02
            |...
            |...
            |---------2012-01-31
|
|---------2012-02
            |---------2012-02-01
            |---------2012-02-02
            |...
            |...
            |---------2012-02-28
|
|---------2012-03
|...
|...
|---------2012-12

Queries tried so far without luck:

CREATE EXTERNAL TABLE sampledata
(datestr string, id string, locations string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/path/to/data/2012/*/*'; 

CREATE EXTERNAL TABLE sampledata
(datestr string, id string, locations string)
partitioned by (ystr string, ymstr string, ymdstr string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

ALTER TABLE sampledata
ADD 
PARTITION (ystr ='2012') 
LOCATION '/path/to/data/2012/';

SOLUTION: This small parameter fixes my issue. Adding to the question where it might be beneficial for others:

SET mapred.input.dir.recursive=true;

Source: (StackOverflow)

C randomized pivot quicksort (improving the partition function)

I'm a computer science student (just started), and I was working on writing a randomized-pivot version of Quicksort from pseudocode. I've written and tested it, and it all works perfectly; however...

The partition part looks a bit too complicated, as it feels I have missed something or overthought it. I can't understand if it's ok or if I made some avoidable mistakes.

So long story short: it works, but how to do better?

Thanks in advance for all the help

void partition(int a[],int start,int end)
{
    srand (time(NULL));
    int pivotpos = 3;   //start + rand() % (end-start);
    int i = start;    // index 1
    int j = end;      // index 2
    int flag = 1;
    int pivot = a[pivotpos];   // sets the pivot's value
    while(i<j && flag)      // main loop
    {
        flag = 0;
        while (a[i]<pivot)
        {
            i++;
        }
        while (a[j]>pivot)
        {
            j--;
        }
        if(a[i]>a[j]) // swap && sets new pivot, and restores the flag
        {
            swap(&a[i],&a[j]);
            if(pivotpos == i)
                pivotpos = j;
            else if(pivotpos == j)
                pivotpos = i;
            flag++;
        }
        else if(a[i] == a[j])       // avoids getting stuck on a mirror of values (e.g. pivot on pos 3 of: 1-0-0-1-1)
        {
            if(pivotpos == i) 
                j--;
            else if(pivotpos == j)
                i++;
            else
            {
                i++;
                j--;
            }
            flag++;
        }
    }
}
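
For comparison, a minimal sketch of a simpler partition step, written in Python for brevity. It uses the Lomuto scheme (a single scanning index) rather than the two-index scheme in the code above: the random pivot is swapped to the end first, so its position never has to be tracked during the scan and duplicate values need no special case. The function names are illustrative only.

```python
import random


def partition(a, start, end):
    """Partition a[start..end] (inclusive) around a random pivot, Lomuto style.

    Returns the pivot's final index: everything left of it is <= pivot,
    everything right of it is > pivot.
    """
    pivot_pos = random.randint(start, end)  # both bounds inclusive
    a[pivot_pos], a[end] = a[end], a[pivot_pos]  # park the pivot at the end
    pivot = a[end]
    store = start  # next slot for an element <= pivot
    for i in range(start, end):
        if a[i] <= pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[end] = a[end], a[store]  # move the pivot into its final place
    return store


def quicksort(a, start=0, end=None):
    """Sort a in place by recursing on either side of the pivot."""
    if end is None:
        end = len(a) - 1
    if start < end:
        p = partition(a, start, end)
        quicksort(a, start, p - 1)
        quicksort(a, p + 1, end)


data = [1, 0, 0, 1, 1, 5, 3, 9, 2]
quicksort(data)
print(data)
```

In a C version of the same idea, seeding the random generator once at program start, rather than inside the partition function, avoids reseeding with the same value on every call.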

Source: (StackOverflow)

How can I identify partitions of an Android device from the shell?

I'm trying to find which partition is used for what, e.g. /boot, /recovery, /system, from adb shell. While this is trivial for partitions currently mounted (using the mount or df commands, see e.g. how to identify names of the partitions), this appears to be tricky when it comes to partitions not currently mounted (like /recovery when booted in "user mode").

There's a tutorial at XDA, but it didn't work out for any of the devices I've tried:

  • cat /proc/mtd: this is empty or non-existing
  • cat /proc/emmc: this is empty or non-existing
  • cat /proc/dumchar_info: non existing (MTK/MediaTek)
  • ls -al /dev/block/platform/*/by-name: either non-existing, or not having the wanted details
  • parted just yielded an Error: Can't have a partition outside the disk! on /dev/block/mmcblk1 (while simply missing the "name" column for /dev/block/mmcblk0).

So I'm at a loss. I know there are apps like DiskInfo which can show those details, so the information must be stored somewhere on the device. However, modifying the device (by installing an app) is not an option in my case.

So basically my question boils down to:

Where on the Android device is this information stored?

If possible, a generic approach is preferred. If not, a trial-and-error sequence of several approaches (if..elif..fi) would do as well.

For background: an example use would be "I want to retrieve the /boot partition only" (get an image of it via dd). It wouldn't do to first grab all partitions, and evaluate later – too time consuming, and too much data produced ;) – This already describes the intention: writing a little tool to retrieve a particular disk image.


Source: (StackOverflow)