EzDevInfo.com

MongoDB interview questions

Top frequently asked MongoDB interview questions

MongoDB: how to get the last N records?

I can't find this documented anywhere. By default, the find() operation gets records from the beginning. How can I get the last N records in MongoDB?

Edit: I also want the returned results ordered from least recent to most recent, not the reverse.
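A minimal sketch of one common approach, assuming the default ObjectId _id values reflect insertion order (so sorting on _id approximates "most recent") and a hypothetical collection name of mycollection:

// Take the 5 newest documents by sorting _id descending, then
// reverse the array client-side so they read oldest-to-newest.
db.mycollection.find().sort({ _id: -1 }).limit(5).toArray().reverse()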


Source: (StackOverflow)

How to list all collections in the mongo shell?

In the MongoDB shell, how do I list all collections for the current database that I'm using?
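For reference, the shell has built-in helpers for this; a quick sketch:

// Shell helper that lists the collections in the current database
show collections

// Equivalent JavaScript call, which returns an array of names
db.getCollectionNames()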


Source: (StackOverflow)

"Large data" work flows using pandas

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work, and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.

One day I hope to replace my use of SAS with Python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory yet small enough to fit on a hard drive.

My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier-to-use alternative. My question is this:

What are some best-practice workflows for accomplishing the following:

  1. Loading flat files into a permanent, on-disk database structure
  2. Querying that database to retrieve data to feed into a pandas data structure
  3. Updating the database after manipulating pieces in pandas

Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".

Edit -- an example of how I would like this to work:

  1. Iteratively import a large flat-file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
  2. In order to use Pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
  3. I would create new columns by performing various operations on the selected columns.
  4. I would then have to append these new columns into the database structure.

I am trying to find a best-practice way of performing these steps. From reading links about pandas and PyTables, it seems that appending a new column could be a problem.

Edit -- Responding to Jeff's questions specifically:

  1. I am building consumer credit risk models. The kinds of data include phone, SSN, and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc. The datasets I use every day have roughly 1,000 to 2,000 fields on average, of mixed data types: continuous, nominal, and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.
  2. Typical operations involve combining several columns using conditional logic into a new, compound column. For example, if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'. The result of these operations is a new column for every record in my dataset.
  3. Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics trying to find interesting, intuitive relationships to model.
  4. A typical project file is usually about 1 GB. Files are organized in such a manner that a row consists of a record of consumer data. Each row has the same number of columns for every record. This will always be the case.
  5. It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.
  6. The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. The columns that I explore are usually done in small sets. For example, I will focus on a set of say 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.

It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).


Source: (StackOverflow)

When to Redis? When to MongoDB?

What I want is not a comparison between Redis and MongoDB. I know they are different; the performance and the API are totally different.

Redis is very fast, but the API is very 'atomic'. MongoDB will eat more resources, but the API is very very easy to use, and I am very happy with it.

They're both awesome, and I want to use Redis in deployment as much as I can, but it is hard to code. I want to use MongoDB in development as much as I can, but it needs an expensive machine.

So what do you think about the use of both of them? When to pick Redis? When to pick MongoDB?


Source: (StackOverflow)

How do I query MongoDB with "like"?

I want to query something like SQL's like:

select * from users where name like '%m%'

How do I do the same in MongoDB? I can't find an operator for like in the documentation.
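A sketch of the usual regex-based equivalent, assuming a users collection with a name field:

// Roughly equivalent to: select * from users where name like '%m%'
db.users.find({ name: /m/ })

// The same condition written with the $regex operator
db.users.find({ name: { $regex: "m" } })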


Source: (StackOverflow)

How do I drop a MongoDB database from the command line?

What's the easiest way to do this from my bash prompt?
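One possibility (a sketch, assuming a database named mydb and a mongod reachable on localhost) is to have bash pass an --eval expression to the mongo shell:

mongo mydb --eval "db.dropDatabase()"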


Source: (StackOverflow)

When to use CouchDB over MongoDB and vice versa

I am stuck between these two NoSQL databases. In my project I will be creating a database within a database. For example, I need a solution for creating dynamic tables, so that users can create tables with columns and rows. I think either MongoDB or CouchDB would be good for this, but I am not sure which one. I will also need efficient paging.


Source: (StackOverflow)

Random record from MongoDB

I am looking to get a random record from a huge (100 million record) MongoDB collection. What is the fastest and most efficient way to do so? The data is already there, and there is no field in which I can generate a random number and obtain a random row. Any suggestions?
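For illustration only, two commonly suggested shell approaches; the skip-based one can be slow on a collection this large, and $sample assumes a reasonably recent MongoDB version (3.2+). The collection name items is hypothetical:

// Skip to a random offset (simple, but the skip itself is O(n))
var offset = Math.floor(Math.random() * db.items.count());
db.items.find().skip(offset).limit(1);

// MongoDB 3.2+ can sample server-side with the $sample stage
db.items.aggregate([ { $sample: { size: 1 } } ]);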


Source: (StackOverflow)

MongoDB: mongod complains that there is no /data/db folder

I am using my new Mac for the first time today. I am following the getting started guide on mongodb.org, up until the step where one creates the /data/db directory. By the way, I used the Homebrew route.

So I open a terminal, and I think I am in what is called the home directory, because when I do "ls" I see the folders Desktop, Applications, Movies, Music, Pictures, Documents and Library.

So I did a

mkdir -p /data/db

At first, it said permission denied. I kept trying different things for half an hour and finally:

mkdir -p data/db

worked, and when I "ls", a data directory with a db folder nested inside does exist.

Then I fire up mongod and it complains about not finding data/db.

Have I done something wrong?

Now I have done the

sudo mkdir -p /data/db

and when I do an "ls" I do see the data dir and the db dir. Inside the db dir, though, there is absolutely nothing, and when I now run mongod I get:

Sun Oct 30 19:35:19 [initandlisten] exception in initAndListen: 10309 Unable to create/open lock file: /data/db/mongod.lock errno:13 Permission denied Is a mongod instance already running?, terminating
Sun Oct 30 19:35:19 dbexit: 
Sun Oct 30 19:35:19 [initandlisten] shutdown: going to close listening sockets...
Sun Oct 30 19:35:19 [initandlisten] shutdown: going to flush diaglog...
Sun Oct 30 19:35:19 [initandlisten] shutdown: going to close sockets...
Sun Oct 30 19:35:19 [initandlisten] shutdown: waiting for fs preallocator...
Sun Oct 30 19:35:19 [initandlisten] shutdown: lock for final commit...
Sun Oct 30 19:35:19 [initandlisten] shutdown: final commit...
Sun Oct 30 19:35:19 [initandlisten] shutdown: closing all files...
Sun Oct 30 19:35:19 [initandlisten] closeAllFiles() finished
Sun Oct 30 19:35:19 [initandlisten] shutdown: removing fs lock...
Sun Oct 30 19:35:19 [initandlisten] couldn't remove fs lock errno:9 Bad file descriptor
Sun Oct 30 19:35:19 dbexit: really exiting now

EDIT: I'm getting an error message for

sudo chown mongod:mongod /data/db

chown: mongod: Invalid argument

Thanks, everyone!
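For what it's worth, a Homebrew install on a Mac normally has no mongod user, which would explain the chown error above; a common fix (a sketch, assuming you run mongod under your own login account) is to give that account ownership of the directory:

sudo mkdir -p /data/db
sudo chown $(whoami) /data/db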


Source: (StackOverflow)

How to print out more than 20 items (documents) in MongoDB's shell?

db.foo.find().limit(300)

won't do it. It still prints out only 20 documents.

db.foo.find().toArray()
db.foo.find().forEach(printjson)

will both print out a very expanded view of each document instead of the one-line version from find().
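For reference, the shell's per-batch display count is configurable; a minimal sketch:

// Raise the number of documents the shell prints before pausing
DBQuery.shellBatchSize = 300;
db.foo.find().limit(300);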


Source: (StackOverflow)

What does MongoDB not being ACID compliant really mean?

I am not a database expert and have no formal computer science background, so bear with me. I want to know the kinds of real-world negative things that can happen if you use MongoDB, which is not ACID compliant. This applies to any non-ACID-compliant database.

I understand that MongoDB can perform atomic operations, but that it doesn't "support traditional locking and complex transactions", mostly for performance reasons. I also understand the importance of database transactions, for example when your database is for a bank and you're updating several records that all need to be in sync: you want the transaction to revert to the initial state if there's a power outage, so that credit equals purchase, etc.

But when I get into conversations about MongoDB, those of us who don't know the technical details of how databases are actually implemented start throwing around statements like:

MongoDB is way faster than MySQL and Postgres, but there's a tiny chance, like 1 in a million, that it "won't save correctly".

That "won't save correctly" part is referring to this understanding: If there's a power outage right at the instant you're writing to MongoDB, there's a chance for a particular record (say you're tracking pageviews in documents with 10 attributes each), that one of the documents only saved 5 of the attributes… which means over time your pageview counters are going to be "slightly" off. You'll never know by how much, you know they'll be 99.999% correct, but not 100%. This is because, unless you specifically made this a mongodb atomic operation, the operation is not guaranteed to have been atomic.

So my question is, what is the correct interpretation of when and why MongoDB may not "save correctly"? What parts of ACID does it not satisfy, and under what circumstances, and how do you know when that 0.001% of your data is off? Can't this be fixed somehow? If not, this seems to mean that you shouldn't store things like your users table in MongoDB, because a record might not save. But then again, that 1/1,000,000 user might just need to "try signing up again", no?

I am just looking for a list of when/why negative things happen with an ACID-noncompliant database like MongoDB, and ideally whether there's a standard workaround (like running a background job to clean up data, only using SQL for this kind of data, etc.).
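As one concrete illustration of the trade-off being discussed (a sketch, not a full answer; it assumes a hypothetical pageviews collection and a shell/driver recent enough to accept a write concern option): you can ask MongoDB to acknowledge and journal a write before returning, trading some speed for durability.

// Request that the write be acknowledged (w: 1) and journaled (j: true)
// before the call returns, reducing the window for a "lost" write.
db.pageviews.insert(
    { page: "/home", views: 1 },
    { writeConcern: { w: 1, j: true } }
);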


Source: (StackOverflow)

MongoDB and "joins"

I'm sure MongoDB doesn't officially support "joins". What does this mean?

Does this mean "we cannot connect two collections (tables) together"?

I think that if we put the value of _id from collection A into other_id in collection B, we can simply connect the two collections, can't we?

If my understanding is correct, MongoDB can connect two tables together, say, when we run a query. This is done with "references", as described in http://www.mongodb.org/display/DOCS/Schema+Design.

Then what does "joins" really mean?

I'd love to know the answer because this is essential to learn MongoDB schema design. http://www.mongodb.org/display/DOCS/Schema+Design
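To make the distinction concrete: with references, the "join" happens in your application or shell code as a second query rather than inside the server. A minimal sketch, assuming collections A and B where B.other_id stores an A document's _id (someId is a placeholder for an existing _id value):

// First query: fetch the parent document from collection A
var a = db.A.findOne({ _id: someId });

// Second query: "join" by hand using the stored reference
var related = db.B.find({ other_id: a._id }).toArray();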


Source: (StackOverflow)

Retrieve only the queried element in an object array in MongoDB collection

Suppose you have the following documents in your collection:

{  
   "_id":ObjectId("562e7c594c12942f08fe4192"),
   "shapes":[  
      {  
         "shape":"square",
         "color":"blue"
      },
      {  
         "shape":"circle",
         "color":"red"
      }
   ]
},
{  
   "_id":ObjectId("562e7c594c12942f08fe4193"),
   "shapes":[  
      {  
         "shape":"square",
         "color":"black"
      },
      {  
         "shape":"circle",
         "color":"green"
      }
   ]
}

Doing the query:

db.test.find({"shapes.color": "red"}, {"shapes.color": 1})

Or

db.test.find({shapes: {"$elemMatch": {color: "red"}}}, {"shapes.color": 1})

returns the matched document (Document 1), but always with ALL of the array items in shapes:

{ "shapes": 
  [
    {"shape": "square", "color": "blue"},
    {"shape": "circle", "color": "red"}
  ] 
}

However, I'd like to get the document (Document 1) with only the array element that contains color=red:

{ "shapes": 
  [
    {"shape": "circle", "color": "red"}
  ] 
}

How can I do this?
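For illustration, the positional $ projection operator returns only the first array element that matched the query condition; a sketch against the documents above:

// Project just the first matching element of the shapes array
db.test.find(
    { "shapes.color": "red" },
    { _id: 0, "shapes.$": 1 }
);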


Source: (StackOverflow)

MySQL vs MongoDB 1000 reads

I have been very excited about MongoDB and have been testing it lately. I had a table called posts in MySQL with about 20 million records indexed only on a field called 'id'.

I wanted to compare speed with MongoDB, so I ran a test that would get and print 15 records randomly from our huge databases. I ran the query about 1,000 times each for MySQL and MongoDB, and I am surprised that I do not notice a lot of difference in speed. Maybe MongoDB is 1.1 times faster. That's very disappointing. Is there something I am doing wrong? I know that my tests are not perfect, but is MySQL on par with MongoDB when it comes to read-intensive chores?


Note:

  • I have a dual-core (+ 2 threads) i7 CPU and 4 GB of RAM
  • I have 20 partitions in MySQL, each with 1 million records

Sample Code Used For Testing MongoDB

<?php
function microtime_float()
{
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}
$time_taken = 0;
$tries = 100;
// connect
$time_start = microtime_float();

for($i=1;$i<=$tries;$i++)
{
    $m = new Mongo();
    $db = $m->swalif;
    $cursor = $db->posts->find(array('id' => array('$in' => get_15_random_numbers())));
    foreach ($cursor as $obj)
    {
        //echo $obj["thread_title"] . "<br><Br>";
    }
}

$time_end = microtime_float();
$time_taken = $time_taken + ($time_end - $time_start);
echo $time_taken;

function get_15_random_numbers()
{
    $numbers = array();
    for($i=1;$i<=15;$i++)
    {
        $numbers[] = mt_rand(1, 20000000);
    }
    return $numbers;
}

?>


Sample Code For Testing MySQL

<?php
function microtime_float()
{
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}
$BASE_PATH = "../src/";
include_once($BASE_PATH  . "classes/forumdb.php");

$time_taken = 0;
$tries = 100;
$time_start = microtime_float();
for($i=1;$i<=$tries;$i++)
{
    $db = new AQLDatabase();
    $sql = "select * from posts_really_big where id in (".implode(',',get_15_random_numbers()).")";
    $result = $db->executeSQL($sql);
    while ($row = mysql_fetch_array($result) )
    {
        //echo $row["thread_title"] . "<br><Br>";
    }
}
$time_end = microtime_float();
$time_taken = $time_taken + ($time_end - $time_start);
echo $time_taken;

function get_15_random_numbers()
{
    $numbers = array();
    for($i=1;$i<=15;$i++)
    {
        $numbers[] = mt_rand(1, 20000000);
    }
    return $numbers;
}
?>

Source: (StackOverflow)

MongoDB: Is it possible to make a case-insensitive query?

Example:

> db.stuff.save({"foo":"bar"});

> db.stuff.find({"foo":"bar"}).count();
1
> db.stuff.find({"foo":"BAR"}).count();
0
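A common workaround is a case-insensitive regular expression (a sketch; note that such a query generally cannot make efficient use of an index):

// Case-insensitive exact match on the foo field
db.stuff.find({ "foo": /^bar$/i }).count();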

Source: (StackOverflow)