
warehouse

Warehouse: Next Generation Python Package Repository (see "Welcome to Warehouse's documentation!", Warehouse 15.0.dev0 documentation)

Data Warehouse - business hours

I'm working on a data warehouse which, in the end, will require me to create reports based on business hours. Currently, my time dimension is granular to the hour. I'm wondering whether I should modify my Time dimension to include a bit field for "business hour", or whether I should create some sort of calculated measure for it on the analysis end. Any examples would be super magnificent!
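
For what it's worth, a bit/flag column on the hour-grain time dimension is easy to populate during ETL; below is a minimal Python sketch of that approach. The 09:00-17:00 weekday window and the column names are assumptions for illustration, not anything from the question.

from datetime import datetime, timedelta

# Assumed business-hour rule: Monday-Friday, 09:00-16:59.
BUSINESS_START, BUSINESS_END = 9, 17

def is_business_hour(ts: datetime) -> bool:
    """Return True if this hour-grain timestamp falls inside business hours."""
    return ts.weekday() < 5 and BUSINESS_START <= ts.hour < BUSINESS_END

# Build one day of hour-grain Time dimension rows carrying the bit field.
day = datetime(2024, 1, 2)  # a Tuesday
rows = [
    {"time_key": int(ts.strftime("%Y%m%d%H")),
     "hour_start": ts,
     "is_business_hour": int(is_business_hour(ts))}  # stored as 0/1
    for ts in (day + timedelta(hours=h) for h in range(24))
]

for r in rows[7:11]:  # show the 07:00-10:00 boundary
    print(r["hour_start"], r["is_business_hour"])

With the flag stored in the dimension, "business-hour revenue" becomes a simple filter or group-by on that column instead of logic re-implemented in every report.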


Source: (StackOverflow)

Warehouse: Store (and count) non-fact records?

How do you store records that don't contain any fact? For example, let's say that a shop wants to count how many people have entered the store (and that they take info on every person that goes inside the shop). In the warehouse, I guess there would be a dimension table "Person" with different attributes, but what would the fact table look like? Would it contain only foreign keys?
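
This is the classic "factless fact table" pattern: the fact row really is nothing but foreign keys, and the measure is produced by counting rows. A minimal sqlite3 sketch, with table and column names made up for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimensions: who entered, and when (hour grain).
cur.execute("CREATE TABLE dim_person (person_key INTEGER PRIMARY KEY, name TEXT, age_band TEXT)")
cur.execute("CREATE TABLE dim_time   (time_key   INTEGER PRIMARY KEY, day TEXT, hour INTEGER)")

# Factless fact table: nothing but foreign keys, one row per store entry.
cur.execute("""CREATE TABLE fact_store_entry (
                   person_key INTEGER REFERENCES dim_person(person_key),
                   time_key   INTEGER REFERENCES dim_time(time_key))""")

cur.executemany("INSERT INTO dim_person VALUES (?, ?, ?)",
                [(1, "Alice", "18-25"), (2, "Bob", "26-35")])
cur.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                [(2024010210, "2024-01-02", 10)])
cur.executemany("INSERT INTO fact_store_entry VALUES (?, ?)",
                [(1, 2024010210), (2, 2024010210), (1, 2024010210)])

# The "measure" is simply COUNT(*) over the keys.
cur.execute("""SELECT t.day, t.hour, COUNT(*) AS entries
               FROM fact_store_entry f JOIN dim_time t USING (time_key)
               GROUP BY t.day, t.hour""")
print(cur.fetchall())   # [('2024-01-02', 10, 3)]

Counting visits per hour, per demographic, and so on then falls out of GROUP BY over the joined dimensions.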


Source: (StackOverflow)


What do you use as a good alternative to Team System?

I would like to gauge what solutions other people have put in place to get Team System functionality. We all know that Team System can be pricey for some of us. I know they offer a small-team edition with five licenses with an MSDN subscription, but what if your team is bigger than five, or you don't want to use Team System?


Source: (StackOverflow)

Multiple Warehouse - WooCommerce & Table Rate Shipping Plugin

I currently have two warehouses (one on the east coast and west coast of USA). The problem I am trying to solve is finding the optimal method of shipping based on the user's shipping address and our two warehouses. It is unwise for us to ship a product from the west coast all the way over to a consumer on the east coast and vice versa.

We currently run WooCommerce and have the Table Rate Shipping plugin installed. I've created two zones (one for the west and one for the east) to divide our two shipping areas. I understand you can create a shipping class for each WooCommerce product, but you can't assign more than one to a product. If that were possible, I was thinking of creating two shipping classes under each product and finding an optimal shipping method that way.

I know there is TradeGecko, but it is a costly service that provides much more functionality than I technically need. Does anyone here know of an ideal solution for optimally shipping our products from two warehouses? Help or insight would be appreciated, thank you.
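
Not a WooCommerce-specific answer, but for illustration, the routing decision itself (send the order to whichever warehouse is nearer the shipping address) can be a plain lookup; a hypothetical Python sketch with made-up state lists:

# Hypothetical mapping of US states to the nearer warehouse.
WEST_STATES = {"CA", "OR", "WA", "NV", "AZ", "UT", "ID", "CO"}
EAST_STATES = {"NY", "NJ", "PA", "MA", "FL", "GA", "NC", "VA"}

def pick_warehouse(shipping_state: str) -> str:
    """Return which warehouse should fulfil an order for the given state."""
    state = shipping_state.upper()
    if state in WEST_STATES:
        return "west"
    if state in EAST_STATES:
        return "east"
    return "east"  # fallback for states not listed on either side

print(pick_warehouse("CA"))   # west
print(pick_warehouse("NY"))   # east

Whatever plugin or service ends up doing this, it is essentially applying a rule like the above per order before shipping rates are calculated.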


Source: (StackOverflow)

Storing and reloading large multidimensional data sets in Python

I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:

My data will basically take the form of a multidimensional array where each entry will look something like this:

data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]

Each argument has roughly the following numbers of potential values:

  • stringArg1: 50
  • stringArg2: 20
  • stringArg3: 6
  • stringArg4: 24
  • intArg1: 10,000

Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.

So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.

I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:

1.) a relational database

2.) PyTables

3.) Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)

There's one issue I run into with all three approaches: I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments, they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200], [300, 400]], you would retrieve values by asking for the value at an address, array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.
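
As a concrete version of approach 3, and of the "address" idea above, one sketch is to map each string argument to a small integer code once, keep only the combinations that actually occur in a dictionary keyed by the resulting integer tuple, and pickle the whole thing. The argument values and file name below are placeholders:

import pickle

# One lookup table per string argument; codes are assigned on first sight.
encoders = {"stringArg1": {}, "stringArg2": {}, "stringArg3": {}, "stringArg4": {}}

def encode(arg_name: str, value: str) -> int:
    """Return a stable integer code for a string argument value."""
    table = encoders[arg_name]
    return table.setdefault(value, len(table))

# Sparse store: only combinations that actually occur are ever stored.
results = {}

def store(s1, s2, s3, s4, i1, result_list):
    key = (encode("stringArg1", s1), encode("stringArg2", s2),
           encode("stringArg3", s3), encode("stringArg4", s4), i1)
    results[key] = result_list          # the 12 floats for this entry

store("modelA", "meshFine", "solverX", "bc7", 42, [0.1] * 12)

# Persist and reload; the encoders must be saved too, or the codes are meaningless.
with open("simulation_results.pkl", "wb") as f:
    pickle.dump({"encoders": encoders, "results": results}, f)

with open("simulation_results.pkl", "rb") as f:
    reloaded = pickle.load(f)
print(len(reloaded["results"]))   # 1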

What I would love to be able to do is define a table in PyTables where cells in this first table contain other tables. For example, the top-level table would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...

That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.

I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.
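
That repetition is exactly the "long"/fact-table layout, and it is usually cheaper than it looks because the key columns index and compress well. For completeness, here is a hedged sketch of a fourth option with pandas (column names are assumed); a MultiIndex gives array-style addressing while only the rows that exist are stored:

import pandas as pd

# One row per (stringArg1..4, intArg1) combination that actually exists.
df = pd.DataFrame(
    [("modelA", "meshFine", "solverX", "bc7", 42, *[0.1] * 12),
     ("modelA", "meshFine", "solverX", "bc7", 43, *[0.2] * 12)],
    columns=["arg1", "arg2", "arg3", "arg4", "iarg"]
            + [f"result{n:02d}" for n in range(1, 13)],
)

# A MultiIndex gives array-style addressing without storing unused combinations.
df = df.set_index(["arg1", "arg2", "arg3", "arg4", "iarg"]).sort_index()
print(df.loc[("modelA", "meshFine", "solverX", "bc7", 42)])

# Save/reload; the HDF5 'table' format (requires the PyTables package) also
# supports appending rows from later simulation runs.
df.to_hdf("simulation_results.h5", key="results", format="table")
df2 = pd.read_hdf("simulation_results.h5", "results")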

Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts. Thanks in advance to anyone and everyone who replies.


Source: (StackOverflow)

Data mart vs cubes

I've gotten confused with the warehousing process... I'm in the process of building a data mart, but the part I don't really understand relates to cubes. I've read some tutorials about SSAS, but I don't see how I can use this data in other applications. What I need is the following:

  • A warehouse (data mart) that contains all the data needed for analysis (drill-down and aggregated data, like daily revenue and YTD revenue)
  • A .NET web service that can take this data so that many different apps can use it

The part I don't understand is cubes. I see that many people use SSAS to build cubes. What are these cubes in SSAS? Are they objects? Are they tables where data is stored? How can my web service access the data from the cubes?

Are there any alternatives to SSAS? Would it be practical to just build cubes in the data mart and load them during the ETL process?
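
On the last point: pre-computing the aggregates during the ETL (aggregate/summary tables, a lightweight stand-in for a cube) is certainly workable when the set of rollups is known in advance. A small illustrative sketch with pandas; the column and table names are made up:

import pandas as pd

# Grain-level fact rows as they might come out of the ETL.
fact = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02", "2024-02-01"]),
    "revenue":    [100.0, 50.0, 75.0, 20.0],
})

# Aggregate table 1: daily revenue.
daily = fact.groupby("order_date", as_index=False)["revenue"].sum()

# Aggregate table 2: year-to-date revenue, a running total per day within each year.
daily["ytd_revenue"] = daily.groupby(daily["order_date"].dt.year)["revenue"].cumsum()

print(daily)
# These frames would then be loaded into e.g. an agg_daily_revenue table in the
# data mart, and the web service can serve them with plain SQL instead of a cube.

The trade-off versus SSAS is that every new drill-down path means either another aggregate table or falling back to the detail grain.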

Thanks in advance for the answer!


Source: (StackOverflow)

Converting a task into linear programming

I have a problem with organizing a non-automated warehouse (with forklifts). At the start of the day there are some pallets in the pallet racks, and during the day a specific number of lorries import/export pallets to/from the warehouse. I want to minimize the travel distance of the forklifts during the day and/or minimize the waiting time of lorries that are processing outgoing deliveries (they are waiting for their lorry to be filled with pallets).

I have tried some quite intuitive algorithms, but they are not producing good results when I compare them to the most intuitive method: putting imported pallets into the nearest free rack in the warehouse. I tried to convert this problem to linear programming, but I didn't succeed. I know how to find minimized forklift paths for an individual lorry, but I don't know how to put it all together, because each time a lorry exports/imports some pallets the warehouse state changes (a different pallet layout in the warehouse). I also tried a brute-force method of finding the best result by systematically checking every possibility, but this doesn't produce results in a reasonable time...
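
For what it's worth, the static subproblem (place the pallets from one lorry into the currently free racks with minimum forklift travel) is a classic assignment problem and is linear; below is a minimal SciPy sketch, under the simplifying assumption that each arrival is solved in isolation, which deliberately ignores the changing warehouse state that makes the full problem hard. The distances are made up.

import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[p, r] = forklift travel distance to put pallet p into free rack r.
cost = np.array([
    [12.0,  7.0, 30.0, 18.0],
    [ 9.0, 14.0, 22.0,  5.0],
    [25.0, 11.0,  8.0, 16.0],
])

pallets, racks = linear_sum_assignment(cost)   # Hungarian-style optimal assignment
for p, r in zip(pallets, racks):
    print(f"pallet {p} -> rack {r} (distance {cost[p, r]})")
print("total distance:", cost[pallets, racks].sum())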

Does anyone have any ideas about converting this problem to linear programming? Thanks for any help.


Source: (StackOverflow)

TFS 2010 warehouse job never leaves the running state

I am at my wit's end here after spending most of this week tearing my hair out and swearing at TFS, so I hope someone can help :)

We have recently migrated to TFS 2010 using the MSF for Agile process template, and we make use of reports such as the Burndown, User Stories Progress, etc. Up until 13/10/10, our warehousing worked perfectly and all our reports displayed up-to-date data. However, after this date the reports started displaying old data, and on looking at the status of the warehousing jobs using the GetProcessingStatus() method on the WarehouseControlWebService, we can see that the Work Item Tracking Sync job seems to be stuck in the 'Running' state.

Indeed, when you put a profiler on the database, you can see the same stored procs being called again and again, with the same parameters, as if it is stuck in a loop. While this is happening, the CPU usage is 50% and above. It stayed in this state for over 24 hours before I decided to kill it.

There is nothing particularly crazy about our setup - we did a clean TFS install and imported work items from TFS 2008 using Excel. We also have a custom work item template 'Support Ticket' which our support team use to log calls from customers. All importing was done with the proper TFS command line tools or Excel.

Has anyone experienced anything like this before? I have seen a couple of posts where people have had similar issues but not seen an answer.

Thanks for your time

Dan


Source: (StackOverflow)

Can 2 Cubes in a Data Warehouse be directly compared against each other?

Is there a way to compare all information (aggregates, down to the detail level) between two OLAP cubes? For example, say I wanted to compare a cube created to work with SQL Server 2000 to that same cube migrated to run on SQL Server 2005/2008. Technically they should both return the same information for all dimension/measure combinations, but I need a way to verify that.

I am definitely NOT a developer, but I do have access to Enterprise Manager, and potentially SAS tools, etc., and I know a bit of SQL but not much else. I know that you can compare two-dimensional (i.e. table) data sets with SQL queries, and also with SAS, but I have never heard of a way to compare three-dimensional cubes.

Am I out of luck on this one? The last thing I want to have to do is view both cubes and compare all possible results side by side in Excel or something; I hope it can be automated somehow.
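
One low-tech way to automate it: run the same detail-grain extract (every dimension member plus the measures) against both cubes, dump each to CSV, and diff the files programmatically. A hedged Python sketch with made-up file and column names:

import pandas as pd

KEYS = ["product", "region", "month"]          # the dimension columns of the extract
MEASURE = "sales_amount"

old = pd.read_csv("cube_sql2000_extract.csv")   # hypothetical export from the old cube
new = pd.read_csv("cube_sql2008_extract.csv")   # hypothetical export from the migrated cube

merged = old.merge(new, on=KEYS, how="outer",
                   suffixes=("_old", "_new"), indicator=True)

# Rows present in only one cube, or where the measure differs.
missing = merged[merged["_merge"] != "both"]
diffs = merged[(merged["_merge"] == "both") &
               (merged[f"{MEASURE}_old"] != merged[f"{MEASURE}_new"])]

print(f"{len(missing)} rows exist in only one cube")
print(f"{len(diffs)} rows have different {MEASURE} values")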


Source: (StackOverflow)

Hive can't import the Flume tweet data into the warehouse (HDFS)

I'm using Cloudera CDH 5.0.2 and want to import the Flume data into the Hive metastore/warehouse on HDFS, but it's not working.

I used this following JSON SerDe: http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar

I'm using this code to create the table in the Hive editor:

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING,name:STRING>,
    retweet_count:INT>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
    ) 
    PARTITIONED BY (datehour INT)
    ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
    LOCATION '/user/flume/tweets';

And when I execute the query using the Hive editor, I get the following log:

14/06/29 01:30:54 INFO log.PerfLogger: <PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:54 INFO log.PerfLogger: <PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:54 INFO parse.ParseDriver: Parsing command: CREATE EXTERNAL TABLE tweets3 (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING,name:STRING>,
    retweet_count:INT>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
) 
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets'
14/06/29 01:30:54 INFO parse.ParseDriver: Parse Completed
14/06/29 01:30:54 INFO log.PerfLogger: </PERFLOG method=parse start=1404030654781 end=1404030654788 duration=7 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:54 INFO log.PerfLogger: <PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:54 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
14/06/29 01:30:54 INFO parse.SemanticAnalyzer: Creating table tweets3 position=22
14/06/29 01:30:54 INFO ql.Driver: Semantic Analysis Completed
14/06/29 01:30:54 INFO log.PerfLogger: </PERFLOG method=semanticAnalyze start=1404030654788 end=1404030654791 duration=3 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:54 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
14/06/29 01:30:54 INFO log.PerfLogger: </PERFLOG method=compile start=1404030654781 end=1404030654791 duration=10 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:54 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/06/29 01:30:54 INFO log.PerfLogger: <PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:54 INFO log.PerfLogger: <PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:54 INFO ql.Driver: Creating lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
14/06/29 01:30:54 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost.localdomain:2181 sessionTimeout=600000 watcher=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager$DummyWatcher@66b05d6
14/06/29 01:30:54 WARN ZooKeeperHiveLockManager: Unexpected ZK exception when creating parent node /hive_zookeeper_namespace_hive
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hive_zookeeper_namespace_hive
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager.setContext(ZooKeeperHiveLockManager.java:121)
    at org.apache.hadoop.hive.ql.Driver.createLockManager(Driver.java:174)
    at org.apache.hadoop.hive.ql.Driver.checkConcurrency(Driver.java:154)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1047)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:926)
    at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:144)
    at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:64)
    at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:177)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
14/06/29 01:30:54 INFO log.PerfLogger: <PERFLOG method=acquireReadWriteLocks from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:54 INFO log.PerfLogger: </PERFLOG method=acquireReadWriteLocks start=1404030654913 end=1404030654914 duration=1 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:54 INFO log.PerfLogger: <PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:54 INFO ql.Driver: Starting command: CREATE EXTERNAL TABLE tweets3 (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING,name:STRING>,
    retweet_count:INT>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
    ) 
    PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets'
14/06/29 01:30:54 INFO log.PerfLogger: </PERFLOG method=TimeToSubmit start=1404030654810 end=1404030654914 duration=104 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:54 INFO log.PerfLogger: <PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:54 INFO log.PerfLogger: <PERFLOG method=task.DDL.Stage-0 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:54 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
14/06/29 01:30:54 WARN security.ShellBasedUnixGroupsMapping: got exception trying to get groups for user HDFS
org.apache.hadoop.util.Shell$ExitCodeException: id: HDFS: No such user
at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:83)
at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:52)
at org.apache.hadoop.security.Groups.getGroups(Groups.java:139)
at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1409)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:312)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:169)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1161)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2407)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2418)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:598)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3697)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:253)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1485)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1263)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1091)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:926)
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:144)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:64)
at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:177)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
14/06/29 01:30:54 WARN security.UserGroupInformation: No groups available for user HDFS
14/06/29 01:30:54 INFO hive.metastore: Connected to metastore.
14/06/29 01:30:54 INFO cli.CLIService: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=9e8711ee-2e1d-474d-a2cc-082bd92b9ce7]: getOperationStatus()
14/06/29 01:30:55 INFO cli.CLIService: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=9e8711ee-2e1d-474d-a2cc-082bd92b9ce7]: getOperationStatus()
14/06/29 01:30:55 INFO cli.CLIService: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=9e8711ee-2e1d-474d-a2cc-082bd92b9ce7]: getOperationStatus()
14/06/29 01:30:56 INFO cli.CLIService: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=9e8711ee-2e1d-474d-a2cc-082bd92b9ce7]: getOperationStatus()
14/06/29 01:30:56 INFO log.PerfLogger: </PERFLOG method=task.DDL.Stage-0 start=1404030654914 end=1404030656188 duration=1274 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:56 INFO log.PerfLogger: </PERFLOG method=runTasks start=1404030654914 end=1404030656188 duration=1274 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:56 INFO log.PerfLogger: </PERFLOG method=Driver.execute start=1404030654914 end=1404030656188 duration=1274 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:56 INFO ql.Driver: OK
14/06/29 01:30:56 INFO log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:56 INFO log.PerfLogger: </PERFLOG method=releaseLocks start=1404030656189 end=1404030656189 duration=0 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:56 INFO log.PerfLogger: </PERFLOG method=Driver.run start=1404030654810 end=1404030656189 duration=1379 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 01:30:56 INFO cli.CLIService: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=9e8711ee-2e1d-474d-a2cc-082bd92b9ce7]: getOperationStatus()

When I go to HDFS and browse the warehouse, I can't see any files. It seems that no data is imported into the warehouse.

I'm using PostgreSQL for the metastore.

And when I try to import the data using this query:

LOAD DATA INPATH '/user/flume/tweets/FlumeData.1404026375345' INTO TABLE `default.tweets` PARTITION (datehour='1404026375345')

I get the following error message:

14/06/29 00:31:09 INFO log.PerfLogger: <PERFLOG method=compile     from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:09 INFO log.PerfLogger: <PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:09 INFO parse.ParseDriver: Parsing command: LOAD DATA INPATH '/user/flume/tweets/FlumeData.1404026375345' INTO TABLE `default.tweets` PARTITION (datehour='1404026375345')
14/06/29 00:31:09 INFO parse.ParseDriver: Parse Completed
14/06/29 00:31:09 INFO log.PerfLogger: </PERFLOG method=parse start=1404027069010 end=1404027069030 duration=20 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:09 INFO log.PerfLogger: <PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:09 INFO ql.Driver: Semantic Analysis Completed
14/06/29 00:31:09 INFO log.PerfLogger: </PERFLOG method=semanticAnalyze start=1404027069030 end=1404027069464 duration=434 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:09 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
14/06/29 00:31:09 INFO log.PerfLogger: </PERFLOG method=compile start=1404027069010 end=1404027069464 duration=454 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:09 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/06/29 00:31:09 INFO log.PerfLogger: <PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:09 INFO log.PerfLogger: <PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:09 INFO ql.Driver: Creating lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
14/06/29 00:31:09 INFO zookeeper.ZooKeeper: Initiating client connection,     connectString=localhost.localdomain:2181 sessionTimeout=600000 watcher=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager$DummyWatcher@78513cb2
14/06/29 00:31:09 INFO log.PerfLogger: <PERFLOG method=acquireReadWriteLocks from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:09 INFO log.PerfLogger: </PERFLOG method=acquireReadWriteLocks start=1404027069488 end=1404027069503 duration=15 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:09 INFO log.PerfLogger: <PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:09 INFO ql.Driver: Starting command: LOAD DATA INPATH '/user/flume/tweets/FlumeData.1404026375345' INTO TABLE `default.tweets` PARTITION (datehour='1404026375345')
14/06/29 00:31:09 INFO log.PerfLogger: </PERFLOG method=TimeToSubmit start=1404027069472 end=1404027069504 duration=32 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:09 INFO log.PerfLogger: <PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:09 INFO log.PerfLogger: <PERFLOG method=task.MOVE.Stage-0 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:09 INFO exec.Task: Loading data to table default.tweets partition (datehour=1404026375345) from hdfs://localhost.localdomain:8020/user/flume/tweets/FlumeData.1404026375345
14/06/29 00:31:09 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
14/06/29 00:31:09 INFO hive.metastore: Connected to metastore.
14/06/29 00:31:09 INFO cli.CLIService: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=e0530ca4-dadf-4f95-8f8c-ae91b1027cc6]: getOperationStatus()
14/06/29 00:31:09 INFO exec.MoveTask: Partition is: {datehour=1404026375345}
14/06/29 00:31:10 ERROR exec.Task: Failed with exception copyFiles: error while checking/creating destination directory!!!
org.apache.hadoop.hive.ql.metadata.HiveException: copyFiles: error while checking/creating destination directory!!!
    at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2235)
    at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1227)
    at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:407)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1485)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1263)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1091)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:926)
    at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:144)
    at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:64)
    at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:177)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=hive, access=WRITE, inode="/user/flume/tweets":flume:flume:drwxr-xr-x
    at
    org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:265)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:251)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:232)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:176)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5461)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5443)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:5417)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3571)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3541)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3515)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:739)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:558)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
    at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2549)
    at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2518)
    at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:827)
    at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:823)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:823)
    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:816)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1815)
    at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2229)
    ... 17 more
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=hive, access=WRITE, inode="/user/flume/tweets":flume:flume:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:265)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:251)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:232)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:176)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5461)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5443)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:5417)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3571)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3541)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3515)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:739)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:558)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)

    at org.apache.hadoop.ipc.Client.call(Client.java:1409)
    at org.apache.hadoop.ipc.Client.call(Client.java:1362)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy10.mkdirs(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy10.mkdirs(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:500)
    at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2547)
    ... 25 more
14/06/29 00:31:10 INFO log.PerfLogger: </PERFLOG method=task.MOVE.Stage-0 start=1404027069504 end=1404027070056 duration=552 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:10 ERROR ql.Driver: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
14/06/29 00:31:10 INFO log.PerfLogger: </PERFLOG method=Driver.execute start=1404027069503 end=1404027070058 duration=555 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:10 INFO log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:10 INFO ZooKeeperHiveLockManager:  about to release lock for default/tweets
14/06/29 00:31:10 INFO ZooKeeperHiveLockManager:  about to release lock for default
14/06/29 00:31:10 INFO log.PerfLogger: </PERFLOG method=releaseLocks start=1404027070058 end=1404027070067 duration=9 from=org.apache.hadoop.hive.ql.Driver>
14/06/29 00:31:10 ERROR operation.Operation: Error:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
    at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:146)
    at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:64)
    at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:177)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
14/06/29 00:31:10 INFO cli.CLIService: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=e0530ca4-dadf-4f95-8f8c-ae91b1027cc6]: getOperationStatus()
14/06/29 00:31:10 INFO cli.CLIService: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=e0530ca4-dadf-4f95-8f8c-ae91b1027cc6]: getOperationStatus()

Flume is working correctly and I can see all the tweets and data in my HDFS under flume/tweets. But why doesn't Hive copy any data to the metastore/warehouse in HDFS?

Thanks in advance for any help!


Source: (StackOverflow)

Multiple data marts separated by legal requirements need to be consolidated

I have a high-level business intelligence question.

I have 12 identical relational databases populated with typical dimensions and facts you would expect from the AdventureWorks sample database or any other transaction-driven data model (customers, sales, regions, people, etc.). A nightly ETL populates them with the data from 12 instances of the web app, where most of the fact data is generated by the users. The reason there are 12 of them is that Legal enforces contracts stating that data must be walled off to ensure security, prevent bugs, etc., but mostly because Legal says so. The databases are PostgreSQL and the app is PHP. We'll be using MicroStrategy for reporting. Some reports need to be common across the data sources, some reports belong to only one data source, and many belong to several. E.g., the report "Total Number of Patients" is needed by users from every instance, but the report "Total Global Sales All Instances" can only be seen by our leadership. Similarly, the report "Custom Web Conversion" is only needed by users from instances 4, 8, and 11.

The problem is developing a strategy for reporting across all 12 instances down to the lowest grain, e.g. a timestamped transaction for a specific customer. I can't legally merge the data via an ETL with 12 connections, but I also don't want to maintain 12 ETLs or 12 sets of reports.

I need to provide reports for both internal leadership (e.g. show me the amount of Sales across all 12 instances) and external users (e.g. show me the amount of Sales in instance #7 only).

My best idea right now is to create a single "smart" ETL that iterates through a list of connections and populates 12 data warehouses. Then I would create a single report template in MicroStrategy and do connection switching depending on who logged in. For managerial reports across instances, I would create some abstract rollups and merge them in a package layer.
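
For what it's worth, the "single smart ETL" idea really can be as simple as parameterizing one ETL by connection and looping, so the code base is shared while the data never crosses instances; a minimal Python-flavoured sketch with hypothetical connection strings and placeholder steps:

# Hypothetical list of the 12 legally separated sources and their matching marts.
INSTANCES = [
    {"name": "instance_01", "source_dsn": "dbname=app01", "mart_dsn": "dbname=mart01"},
    {"name": "instance_02", "source_dsn": "dbname=app02", "mart_dsn": "dbname=mart02"},
    # ... instances 03 to 12 ...
]

def extract(source_dsn):
    """Placeholder: pull the nightly delta from one app database."""
    return [{"source": source_dsn, "amount": 1.0}]  # stand-in rows

def transform(rows):
    """Placeholder: apply the shared dimensional transformations."""
    return rows

def load(mart_dsn, rows):
    """Placeholder: write only into that instance's own data mart."""
    print(f"loaded {len(rows)} rows into {mart_dsn}")

def run_nightly_etl():
    # One code base, one schedule; each iteration reads from and writes to a
    # single instance's source/mart pair, so data is never merged across instances.
    for inst in INSTANCES:
        load(inst["mart_dsn"], transform(extract(inst["source_dsn"])))

run_nightly_etl()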

Anyone have any suggestions for best practices re: how to pull this off?

Thanks.


Source: (StackOverflow)

Scaling with Kettle: concurrency issues

I intend to scale my app, and part of the process includes running Kettle jobs simultaneously from several processing clients. At some point the transformations need to perform a combination lookup on a shared table (let's call it "clients_table"). This table grows quickly because not all possible clients are known in advance, so they are inserted as they show up. When jobs execute simultaneously against this table (let's say 2, but it could be more than that, one per client), the combination lookup runs into concurrency issues, namely "duplicate entry key xxx" errors, I suppose while inserting non-existing clients (when the combination lookup didn't find them in the table to retrieve their id). It's a fact that every time the processing clients execute Kettle jobs they will bring in new clients, so the above problem is very common. I wonder if I'm breaking Kettle's philosophy or if I'm missing something. I've read about making Kettle transformations transactional; could that be my solution? I think what's going on is that the combination lookup step isn't transactional. Please give me some ideas. PS: I'm using Kettle 4.2 and MySQL 5.2.
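
Not a Kettle-specific fix, but the race described above (two jobs both miss the lookup, both insert, one hits the duplicate key) is usually handled by making the "insert if missing" step idempotent, e.g. MySQL's INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE, and then re-reading the id. A minimal sketch of the pattern using sqlite3, where INSERT OR IGNORE is the equivalent spelling; the table and column names are made up:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE clients_table (
                    id   INTEGER PRIMARY KEY AUTOINCREMENT,
                    code TEXT UNIQUE NOT NULL)""")

def get_or_create_client_id(conn, code: str) -> int:
    """Idempotent lookup/insert: safe even if another job inserts the same code first."""
    # In MySQL this would be: INSERT IGNORE INTO clients_table (code) VALUES (%s)
    conn.execute("INSERT OR IGNORE INTO clients_table (code) VALUES (?)", (code,))
    conn.commit()
    row = conn.execute("SELECT id FROM clients_table WHERE code = ?", (code,)).fetchone()
    return row[0]

print(get_or_create_client_id(conn, "client-123"))   # inserts, returns new id
print(get_or_create_client_id(conn, "client-123"))   # no duplicate error, same id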


Source: (StackOverflow)

Difference between RM and MP3 formats

RM files are comparatively much smaller in size. How do they compare quality-wise?

For a songs warehouse application, is it advisable to convert all MP3s to RM before archiving, to save storage space?


Source: (StackOverflow)

Is there somewhere I can search for available web services?

I'm wondering if there is a website that collects (and hopefully updates) information on available web services.

Edit: Thanks for all the info; many good answers. I can only accept 1 as the "accepted answer" at this time, so I picked my favorite one.


Source: (StackOverflow)

Prestashop stock management - multiple warehouses

I'm testing the shop on localhost, PrestaShop version 1.5.4.1, using the basic theme. It's a clothing store that has branches in several cities, so every store should have its own stock. I created a warehouse for every store and set the available quantities for products and their combinations based on warehouse stock.

My question is: when a client buys a product from the online store, how do I know which warehouse the product gets subtracted from? How does this system work? Can I configure it? Is there a possibility to select the warehouse the product is sourced from upon delivery?

Thanks!


Source: (StackOverflow)