EzDevInfo.com

elasticsearch interview questions

Top elasticsearch frequently asked interview questions

How to stop/shut down an elasticsearch node?

I want to restart an elasticsearch node with a new configuration. What is the best way to gracefully shut down a node?

Is killing the process the best way of shutting the server down, or is there some magic URL I can use to shut the node down?
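
For illustration, older Elasticsearch releases (before 2.0) exposed a dedicated shutdown API; it was removed in 2.x, where stopping the process (or the service) is the supported way. A sketch, assuming such an older version:

# Shut down only the local node (Elasticsearch < 2.0)
curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'

# Shut down the whole cluster
curl -XPOST 'http://localhost:9200/_shutdown'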


Source: (StackOverflow)

How to retrieve unique count of a field using Kibana + Elastic Search

Is it possible to query for a distinct/unique count of a field using Kibana? I am using Elasticsearch as the backend for Kibana.

If so, what is the syntax of the query? Here's a link to the Kibana interface where I would like to make my query: http://demo.kibana.org/#/dashboard

I am parsing nginx access logs with logstash and storing the data into elastic search. Then, I use Kibana to run queries and visualize my data in charts. Specifically, I want to know the count of unique IP addresses for a specific time frame using Kibana.
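
If the backing Elasticsearch is 1.1 or later, the cardinality aggregation gives an (approximate) distinct count, and it is what Kibana's "Unique Count" metric builds on. A sketch only; the index pattern and the clientip field name are assumptions based on a typical logstash/nginx setup:

curl -XGET 'localhost:9200/logstash-*/_search?pretty' -d '{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-24h" } }
  },
  "aggs": {
    "unique_ips": {
      "cardinality": { "field": "clientip" }
    }
  }
}'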


Source: (StackOverflow)


How to use Elasticsearch with MongoDB?

I have gone through many blogs and sites about configuring Elasticsearch to index collections in MongoDB, but none of them were straightforward.

Please explain a step-by-step process for installing Elasticsearch, which should include:

  • configuration
  • run in the browser

I am using Node.js with express.js, so please help accordingly.
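
There are several ways to wire the two together (historically a sync tool such as the elasticsearch-river-mongodb plugin or mongo-connector; alternatively, indexing documents from the application itself). As a minimal sketch of the manual approach over plain HTTP, with a made-up index, type and fields:

# Create an index to hold the searchable copy of a MongoDB collection
curl -XPUT 'http://localhost:9200/myapp'

# Index one document, reusing the MongoDB _id as the Elasticsearch document id
curl -XPUT 'http://localhost:9200/myapp/articles/507f1f77bcf86cd799439011' -d '{
  "title": "Hello world",
  "body": "First post"
}'

# Verify it can be searched from the browser or curl
curl -XGET 'http://localhost:9200/myapp/articles/_search?q=title:hello&pretty'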


Source: (StackOverflow)

Make elasticsearch only return certain fields?

I'm using elasticsearch to index my documents.

Is it possible to instruct it to return only particular fields instead of the entire JSON document it has stored?
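
Yes; the exact mechanism depends on the version, so treat this as a sketch (field names are made up). Elasticsearch 1.0 and later can filter the _source, while older releases used the fields parameter:

# Source filtering (1.0 and later): return only the listed fields
curl -XGET 'localhost:9200/my_index/_search?pretty' -d '{
  "_source": ["title", "date"],
  "query": { "match_all": {} }
}'

# Older releases: the "fields" parameter
curl -XGET 'localhost:9200/my_index/_search?pretty' -d '{
  "fields": ["title", "date"],
  "query": { "match_all": {} }
}'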


Source: (StackOverflow)

Elasticsearch: multiple indexes vs. one index with types for different data sets?

I have an application developed using the MVC pattern, and I would now like to index multiple models of it; each model has a different data structure.

  • Is it better to use multiple indexes (one for each model), or to have a type within the same index for each model? Either way would also require a different search query, I think. I have just started on this.

  • Are there performance differences between the two approaches when the data set is small or huge?

I would test the second question myself if somebody could recommend some good sample data for that purpose.
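
One detail worth noting for the query side (a sketch with made-up index names): a single search request can span several indices, so keeping the models in separate indices does not force completely separate queries:

# Search two indices in one request
curl -XGET 'localhost:9200/products,customers/_search?q=smith&pretty'

# Or search every index on the cluster
curl -XGET 'localhost:9200/_all/_search?q=smith&pretty'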


Source: (StackOverflow)

List all indexes on an ElasticSearch server?

I would like to list all indexes present on an ElasticSearch server. I tried this:

curl -XGET localhost:9200/

but it just gives me this:

{
  "ok" : true,
  "status" : 200,
  "name" : "El Aguila",
  "version" : {
    "number" : "0.19.3",
    "snapshot_build" : false
  },
  "tagline" : "You Know, for Search"
}

I want a list of all indexes.
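
A couple of endpoints return that, depending on the version; a sketch:

# Recent releases: a human-readable table of all indices
curl -XGET 'localhost:9200/_cat/indices?v'

# Older releases (such as 0.19.x): the index names are the top-level keys of the response
curl -XGET 'localhost:9200/_aliases?pretty'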


Source: (StackOverflow)

Solr vs. ElasticSearch

What are the core architectural differences between these technologies?

Also, what use cases are generally more appropriate for each?


Source: (StackOverflow)

ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage? [closed]

I'm currently looking at search methods other than a huge SQL query. I saw elasticsearch recently and have played with whoosh (a Python implementation of a search engine).

Can you give reasons for your choice(s)?


Source: (StackOverflow)

elasticsearch vs. MongoDB for a filtering application [closed]

This question is about making an architectural choice prior to delving into the details of experimentation and implementation. It's about the suitability, in terms of scalability and performance, of elasticsearch vs. MongoDB for a somewhat specific purpose.

Hypothetically, both store data objects that have fields and values, and allow querying that body of objects. So presumably, filtering out subsets of the objects according to ad-hoc selected fields is something both are fit for.

My application will revolve around selecting objects according to criteria. It would select objects by filtering on more than one field simultaneously; put differently, its query filtering criteria would typically comprise anywhere between 1 and 5 fields, maybe more in some cases, and the fields chosen as filters would be a subset of a much larger set of fields. Picture some 20 field names existing, with each query filtering the objects by a few of those 20 fields (there could be fewer or more than 20 field names overall; I just used this number to illustrate the ratio of available fields to fields used as filters in each discrete query). The filtering can be by the existence of the chosen fields as well as by their values, e.g. selecting objects that have field A, whose field B is between x and y, and whose field C is equal to w.
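
(For concreteness, that last filter would look roughly like the following in Elasticsearch's query DSL; this is only a sketch, the syntax varies by version and the range bounds are made up.)

curl -XGET 'localhost:9200/objects/_search?pretty' -d '{
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "A" } },
        { "range":  { "B": { "gte": 10, "lte": 20 } } },
        { "term":   { "C": "w" } }
      ]
    }
  }
}'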

My application will be doing this sort of filtering continuously, and there would be little or nothing constant about which fields are used for the filtering at any given moment. Perhaps in elasticsearch indexes need to be defined, but maybe even without indexes its speed is on par with that of MongoDB.

As for the data getting into the store, there is nothing special about it: the objects would almost never be changed after having been inserted. Old objects might need to be dropped; I'd like to assume both data stores support expiring data internally or via an application-made query. (Less frequently, objects that fit a certain query would need to be dropped as well.)

What do you think? And have you experimented with this aspect?

I am interested in the performance and scalability of each of the two data stores for this kind of task. This is the sort of architectural design question where details of store-specific options, or of the query cornerstones that would make it well architected, are welcome as a demonstration of a fully thought-out suggestion.

Thanks!


Source: (StackOverflow)

How to search for a part of a word with ElasticSearch

I've recently started using ElasticSearch and I can't seem to make it search for a part of a word.

Example: I have three documents from my couchdb indexed in ElasticSearch:

{
  "_id" : "1",
  "name" : "John Doeman",
  "function" : "Janitor"
}
{
  "_id" : "2",
  "name" : "Jane Doewoman",
  "function" : "Teacher"
}
{
  "_id" : "3",
  "name" : "Jimmy Jackal",
  "function" : "Student"
} 

So now I want to search for all documents containing "Doe":

curl http://localhost:9200/my_idx/my_type/_search?q=Doe

That doesn't return any hits. But if I search for

curl http://localhost:9200/my_idx/my_type/_search?q=Doeman

It does return one document (John Doeman).

I've tried setting different analyzers and different filters as properties of my index. I've also tried using a full-blown query, for example:

{
    "query" : {
        "term" : {
            "name" : "Doe"
        }
    }
}

But nothing seems to work.

How can I make ElasticSearch find both John Doeman and Jane Doewoman when I search for "Doe" ?

UPDATE

I tried using the nGram tokenizer and filter, as Igor proposed, like this:

{
    "index" : {
        "index" : "my_idx",
        "type" : "my_type",
        "bulk_size" : "100",
        "bulk_timeout" : "10ms",
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "my_ngram_tokenizer",
                    "filter" : ["my_ngram_filter"]
                }
            },
            "filter" : {
                "my_ngram_filter" : {
                    "type" : "nGram",
                    "min_gram" : 1,
                    "max_gram" : 1
                }
            },
            "tokenizer" : {
                "my_ngram_tokenizer" : {
                    "type" : "nGram",
                    "min_gram" : 1,
                    "max_gram" : 1
                }
            }
        }
    }
}

The problem I'm having now is that each and every query returns ALL documents. Any pointers? ElasticSearch's documentation on using nGrams isn't great...
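
For what it's worth, with min_gram and max_gram both set to 1 everything is indexed as single characters, so almost any query term overlaps with almost every document. A settings sketch that usually behaves better for partial matching on pre-5.x Elasticsearch, with wider grams and a plain analyzer at search time (the gram sizes are arbitrary and the field name follows the example above):

curl -XPUT 'localhost:9200/my_idx' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_ngram_filter"]
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "nGram",
          "min_gram": 3,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_ngram_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}'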


Source: (StackOverflow)

Removing Data From ElasticSearch

I'm new to ElasticSearch and trying to figure out how to remove data from it. I have deleted my indexes. However, that doesn't seem to actually remove the data itself. The other material I've seen points to the Delete by Query feature, but I'm not even sure what to query on. I know my indexes. Essentially, I'd like to figure out how to do a

DELETE FROM [Index]

from Postman in Chrome. However, I'm not having any luck. It seems like no matter what I do, the data hangs around. Thus far, I've successfully deleted the indexes by using the DELETE HTTP verb in Postman with a URL like:

   http://localhost:9200/[indexName]

However, that doesn't seem to actually remove the data (i.e. the docs) itself.

Thank you for any assistance you can provide.
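
For reference, deleting the index does delete its documents (if they reappear, something such as logstash is usually re-creating the index); to empty an index while keeping its mappings, delete-by-query is the usual tool, and its endpoint depends on the version. A sketch:

# Drop an entire index, documents included
curl -XDELETE 'http://localhost:9200/indexName'

# Delete all documents but keep the index and its mappings (Elasticsearch 5.x and later)
curl -XPOST 'http://localhost:9200/indexName/_delete_by_query' -d '{
  "query": { "match_all": {} }
}'

# Very old releases (pre-2.0) exposed delete-by-query directly
curl -XDELETE 'http://localhost:9200/indexName/_query' -d '{
  "query": { "match_all": {} }
}'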


Source: (StackOverflow)

When do you start additional Elasticsearch nodes?

I'm in the middle of attempting to replace a Solr setup with Elasticsearch. This is a new setup, which has not yet seen production, so I have lots of room to fiddle with things and get them working well.

I have very, very large amounts of data. I'm indexing some live data and holding onto it for 7 days (by using the _ttl field). I do not store any data in the index (and disabled the _source field). I expect my index to stabilize around 20 billion rows. I will be putting this data into 2-3 named indexes. Search performance so far with up to a few billion rows is totally acceptable, but indexing performance is an issue.

I am a bit confused about how ES uses shards internally. I have created two ES nodes, each with a separate data directory, each with 8 indexes and 1 replica. When I look at the cluster status, I only see one shard and one replica for each node. Doesn't each node keep multiple indexes running internally? (Checking the on-disk storage location shows that there is definitely only one Lucene index present). -- Resolved, as my index setting was not picked up properly from the config. Creating the index using the API and specifying the number of shards and replicas has now produced exactly what I would've expected to see.

Also, I tried running multiple copies of the same ES node (from the same configuration), and it recognizes that there is already a copy running and creates its own working area. These new instances of nodes also seem to only have one index on-disk. -- Now that each node is actually using multiple indices, a single node with many indices is more than sufficient to throttle the entire system, so this is a non-issue.

When do you start additional Elasticsearch nodes, for maximum indexing performance? Should I have many nodes, each running with 1 index and 1 replica, or fewer nodes with tons of indexes? Is there something I'm missing in my configuration that would let single nodes do more work?

Also: Is there any metric for knowing when an HTTP-only node is overloaded? Right now I have one node devoted to HTTP only, but aside from CPU usage, I can't tell if it's doing OK or not. When is it time to start additional HTTP nodes and split up your indexing software to point to the various nodes?
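
For reference, the two knobs touched on above can be set and inspected through the API; a sketch (the numbers are examples, not recommendations):

# Create an index with an explicit shard/replica layout instead of relying on the node config
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "number_of_shards": 8,
    "number_of_replicas": 1
  }
}'

# Per-node statistics (heap, thread pools, indexing counters) are one way to gauge node load
curl -XGET 'localhost:9200/_nodes/stats?pretty'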


Source: (StackOverflow)

Queries vs. Filters

I can't see any description of when I should use a query or a filter or some combination of the two. Can anyone please explain or point me to an explanation?
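
Roughly: query clauses compute a relevance score and rank results, while filter clauses only include or exclude documents and can be cached, so exact yes/no conditions usually belong in filters. A sketch combining both (field names are made up; on 2.x+ this lives inside a bool query, while 1.x used the filtered query for the same purpose):

curl -XGET 'localhost:9200/my_index/_search?pretty' -d '{
  "query": {
    "bool": {
      "must":   { "match": { "title": "elasticsearch" } },
      "filter": { "term":  { "status": "published" } }
    }
  }
}'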


Source: (StackOverflow)

ElasticSearch -- boosting relevance based on field value

I need to find a way in ElasticSearch to boost the relevance of a document based on a particular value of a field. Specifically, there is a special field in all my documents where the higher the field value, the more relevant the doc that contains it should be, regardless of the search.

Consider the following document structure:

{
    "_all" : {"enabled" : "true"},
    "properties" : {
        "_id":            {"type" : "string",  "store" : "yes", "index" : "not_analyzed"},
        "first_name":     {"type" : "string",  "store" : "yes", "index" : "yes"},
        "last_name":      {"type" : "string",  "store" : "yes", "index" : "yes"},
        "boosting_field": {"type" : "integer", "store" : "yes", "index" : "yes"}
    }
}

I'd like documents with a higher boosting_field value to be inherently more relevant than those with a lower boosting_field value. This is just a starting point -- the matching between the query and the other fields will also be taken into account in determining the final relevance score of each doc in the search. But, all else being equal, the higher the boosting field, the more relevant the document.

Anyone have an idea on how to do this?

Thanks a lot!
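
One way to sketch this (roughly Elasticsearch 1.x and later; the index name is made up and the exact syntax varies by version) is a function_score query with field_value_factor, which scales the normal text-match score by a function of boosting_field:

curl -XGET 'localhost:9200/people/_search?pretty' -d '{
  "query": {
    "function_score": {
      "query": { "match": { "last_name": "doe" } },
      "field_value_factor": {
        "field": "boosting_field",
        "modifier": "log1p"
      },
      "boost_mode": "multiply"
    }
  }
}'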


Source: (StackOverflow)

ElasticSearch: Unassigned Shards, how to fix?

I have an ES cluster with 4 nodes:

number_of_replicas: 1
search01 - master: false, data: false
search02 - master: true, data: true
search03 - master: false, data: true
search04 - master: false, data: true

I had to restart search03, and when it came back up, it rejoined the cluster without a problem, but left 7 unassigned shards lying around.

{
  "cluster_name" : "tweedle",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 15,
  "active_shards" : 23,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 7
}

Now my cluster is in yellow state. What is the best way to resolve this issue?

  • Delete (cancel) the shards?
  • Move the shards to another node?
  • Allocate the shards to the node (see the reroute sketch below)?
  • Update 'number_of_replicas' to 2?
  • Something else entirely?

Interestingly, when a new index was added, that node started working on it and played nicely with the rest of the cluster; it just left the unassigned shards lying around.

Follow on question: am I doing something wrong to cause this to happen in the first place? I don't have much confidence in a cluster that behaves this way when a node is restarted.
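
For reference, the "allocate the shards to the node" option corresponds to the cluster reroute API; a sketch only (the index name and shard number are made up, and newer releases use allocate_replica / allocate_empty_primary instead of plain allocate):

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "allocate": {
        "index": "my_index",
        "shard": 0,
        "node": "search03",
        "allow_primary": false
      }
    }
  ]
}'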


Source: (StackOverflow)