EzDevInfo.com

full-text-search interview questions

Top full-text-search frequently asked interview questions

Tools to search for strings inside files without indexing [closed]

I have to change some connection strings in an incredibly old legacy application, and the programmers who made it thought it would be a great idea to plaster the entire app with connection strings all over the place.

VS "current project" search is INCREDIBLY slow, and I don't trust Windows Search.

So what's the best free, non-indexed text search tool out there? All it should do is return a list with files that contain the wanted string inside a folder and its subfolders.

Oh, yeah, this is on Windows 2003 Server.
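
The requirement above (list every file under a folder and its subfolders that contains a given string, with no index) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `find_files_containing` is hypothetical, files are compared as raw bytes, and each file is read fully into memory, which is fine for source and config files but not for very large ones.

```python
import os


def find_files_containing(root, needle, encoding="utf-8"):
    """Return a sorted list of paths under `root` whose contents include `needle`."""
    needle_bytes = needle.encode(encoding)
    matches = []
    # os.walk visits `root` and every subfolder beneath it.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Read as bytes so binary files and odd encodings do not raise.
                with open(path, "rb") as handle:
                    if needle_bytes in handle.read():
                        matches.append(path)
            except OSError:
                # Skip files that cannot be opened (locked, permissions, etc.).
                continue
    return sorted(matches)
```

Calling `find_files_containing(some_folder, "Data Source=")` would return the matching paths; both arguments here are placeholders, not values from the question.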


Source: (StackOverflow)

Full Text Searching with Rails

I've been looking into search plugins/gems for Rails. Most of the articles compare Ferret (Lucene) to Ultrasphinx or possibly Thinking Sphinx, but none talk about SearchLogic. Does anyone have any clues as to how that one compares? What do you use, and how does it perform?


Source: (StackOverflow)

What is Full Text Search vs LIKE

I just read a post mentioning "full text search" in SQL.

I was just wondering what the difference between FTS and LIKE is. I did read a couple of articles but couldn't find anything that explained it well.


Source: (StackOverflow)

What is faceted search?

What exactly is faceted search in the context of full-text search?

I even read about it on Wikipedia, but I couldn't completely understand its use or benefit. I hope the community can answer, expand and explain with some good examples.

NOTE: We're in the process of evaluating/researching different open-source full-text search engines, and I mostly see faceted search listed as one of the features. So I'm trying to assess whether it would be helpful for our application's requirements.
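
As a rough illustration of the idea being asked about: a faceted search returns, alongside the matching documents, counts of how those matches break down by attribute, so a user can narrow the results. The sketch below assumes a hypothetical result set and hypothetical field names (`brand`, `price_band`); real engines compute these counts inside the index rather than in application code.

```python
from collections import Counter


def facet_counts(results, fields):
    """Count how many matching documents carry each value of each facet field."""
    counts = {field: Counter() for field in fields}
    for doc in results:
        for field in fields:
            value = doc.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts


# Hypothetical documents that already matched a text query such as "camera".
results = [
    {"title": "Compact camera", "brand": "Acme", "price_band": "under 200"},
    {"title": "DSLR camera", "brand": "Acme", "price_band": "over 500"},
    {"title": "Action camera", "brand": "Globex", "price_band": "under 200"},
]

# Each facet maps a value to the number of matching documents that have it.
facets = facet_counts(results, ["brand", "price_band"])
```

A search page would typically render `facets` as clickable filters, for example "Acme (2)" and "Globex (1)" under a brand heading.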


Source: (StackOverflow)

Comparison of full text search engines - Lucene, Sphinx, Postgresql, MySQL?

I'm building a Django site and I am looking for a search engine.

A few candidates:

  • Lucene/Lucene with Compass/Solr

  • Sphinx

  • Postgresql built-in full text search

  • MySQL built-in full text search

Selection criteria:

  • result relevance and ranking
  • searching and indexing speed
  • ease of use and ease of integration with Django
  • resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU
  • scalability
  • extra features such as "did you mean?", related searches, etc

Anyone who has had experience with the search engines above, or other engines not in the list -- I would love to hear your opinions.

EDIT: As for indexing needs, users keep entering data into the site, so that data would need to be indexed continuously. It doesn't have to be real time, but ideally new data would show up in the index with a delay of no more than 15-30 minutes.


Source: (StackOverflow)

When should you use full-text indexing?

We have a whole bunch of queries that "search" for clients, customers, etc. You can search by first name, email, etc. We're using LIKE statements in the following manner:

SELECT * FROM customer WHERE fname LIKE '%someName%'

Does full-text indexing help in this scenario? We're using MSSQL 2005.


Source: (StackOverflow)

Performance of like '%Query%' vs full text search CONTAINS query

I have a situation where I would like to search a single word.

For that scenario, which query would be good from a performance point of view?

Select Col1, Col2 from Table Where Col1 Like '%Search%'

or

Select Col1, Col2 from Table Where CONTAINS(Col1,'Search')

?


Source: (StackOverflow)

Fulltext Search with InnoDB

I'm developing a high-volume web application, where part of it is a MySQL database of discussion posts that will need to grow to 20M+ rows, smoothly.

I was originally planning on using MyISAM for the tables (for the built-in fulltext search capabilities), but the thought of the entire table being locked due to a single write operation makes me shudder. Row-level locks make so much more sense (not to mention InnoDB's other speed advantages when dealing with huge tables). So, for this reason, I'm pretty determined to use InnoDB.

The problem is... InnoDB doesn't have built-in fulltext search capabilities.

Should I go with a third-party search system? Like Lucene(c++) / Sphinx? Do any of you database ninjas have any suggestions/guidance? LinkedIn's zoie (based off Lucene) looks like the best option at the moment... having been built around realtime capabilities (which is pretty critical for my application.) I'm a little hesitant to commit yet without some insight...

(FYI: going to be on EC2 with high-memory rigs, using PHP to serve the frontend)


Source: (StackOverflow)

What is the use of "multiValued" field type in Solr?

I'm new to Apache Solr. Even after reading the documentation part, I'm finding it difficult to clearly understand the functionality and use of the multiValued field type property.

How does Solr internally treat/handle a field that is marked as multiValued?

What is the difference in indexing in Solr between a field that is multiValued and one that is not?

Can somebody explain with a good example?

Doc says:

multiValued=true|false

True if this field may contain multiple values per document, i.e. if it can appear multiple times in a document


Source: (StackOverflow)

What is the meaning of O( polylog(n) )? In particular, how is polylog(n) defined?

Brief:
When academic (computer science) papers say "O(polylog(n))", what do they mean? I'm not confused by the "Big-Oh" notation, which I'm very familiar with, but rather by the function polylog(n). They're not talking about the complex analysis function Li_s(z), I think. Or are they? Something totally different maybe?

More detail:
Mostly for personal interest, I've recently been looking over various papers on Compressed Suffix Arrays, e.g. Advantages of Backward Searching -- Efficient Secondary Memory and Distributed Implementation of Compressed Suffix Arrays. The computational complexity estimates stated sometimes involve polylog(n), which is a function I'm not familiar with.

Wikipedia gives a definition of polylog_s(z) which appears to mainly be about complex analysis and analytic number theory. My suspicion is that it's not related to the polylog(n) in the compression papers, though I'd love to hear otherwise from someone more knowledgeable. If this is the case, why exactly is it thought reasonable to omit the subscript?

My only other guess is that maybe O(polylog(n)) is supposed to mean "Asymptotic to a polynomial function of log(n)." But that's only a guess: I have no evidence of this, and it would be an abuse of notation to boot.

In any case, a link to a reasonably authoritative definition would be greatly appreciated!
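
For reference, the convention generally used in algorithms papers matches the asker's second guess: polylog(n) stands for some fixed power of log n and is unrelated to the polylogarithm Li_s(z). Stated as a formula (assuming the papers in question follow this usual convention, which the question does not quote):

```latex
f(n) \in O(\operatorname{polylog}(n))
\quad\Longleftrightarrow\quad
\exists\, k \in \mathbb{N} :\; f(n) \in O\!\left((\log n)^{k}\right)
```

For example, a bound of O(log^3 n) is O(polylog(n)) with k = 3, whereas O(sqrt(n)) is not, since sqrt(n) grows faster than every fixed power of log n.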


Source: (StackOverflow)

Any reason not to use PostgreSQL's built-in full text search on Heroku?

I'm preparing to deploy a Rails app on Heroku that requires full text search. Up to now I've been running it on a VPS using MySQL with Sphinx.

However, if I want to use Sphinx or Solr on Heroku, I'd need to pay for an add-on.

I notice that PostgreSQL (the DB used on Heroku) has built-in full text search capability.

Is there a reason I couldn't use Postgres's full-text search? Is it slower than Sphinx or is there some other major limitation?


Source: (StackOverflow)

Search engine solution for Django that actually works?

The story so far:

Decided to go with Xapian as search backend because it has all search-engine features I was looking for, knows about Unicode, stemming, has few dependencies and requires no bloated app-server installation on top of it.

Tried Django and Haystack (plus xapian-haystack, the backend glue code to tie Haystack to Xapian) because it was advertised on quite some blogs as "working". Did not work. Neither django-haystack nor the xapian-haystack project provide a version combination that actually works together. MASTER from both projects yields an error from Xapian, so it's not stable at all. Haystack 1.0.1 and xapian-haystack 1.0.x/1.1.0 are not API-compatible. Plus, in a minimally working installation of Haystack 1.0.1 and xapian-haystack MASTER, any complex query yields zero results due to errors in either django-haystack or xapian-haystack (I double-verified this), maybe because the unit-tests actually test very simple cases, and no edge-cases at all.

Tried Djapian. The source-code is riddled with spelling errors (mind you, in variable names, not comments), documentation is also riddled with ambiguities and outdated information that will never lead to a working installation. Not surprisingly, users rarely ask for features but how to get it working in the first place.

Next on the plate: exploring Solr (installing a Java environment plus Tomcat gives me headaches, and the machine is RAM- and CPU-constrained), or Lucene (slightly fewer headaches, but still).

Before I proceed spending more time with a solution that might or might not work as advertised, I'd like to know: Did anyone ever get an actual, real-world search solution working in Django? I'm serious. I find it really frustrating reading about "large problems mostly solved", and then realizing that you will never get a working installation from the source-code because, actually, all bloggers dealing with those "mostly solved problems" never went past basic installation and copy-pasting the official tutorials.

So here are the requirements:

  • must be able to search for 10-100 terms in one query
  • must handle + (term must be present) and - (term must not be present), AND/OR
  • must handle arbitrary grouping (i.e. parentheses around AND/OR)
  • must allow for Django-ORM filtering before or after fulltext-search (i.e. pre-/post-processing of results with the full set of filters that Django knows about)
  • alternatively, there must be a facility to bulk-fetch the result set and transform it into a QuerySet
  • should be light on the machine, so preferably no humongous JVM and Java-based app-server installation

Is there anything out there that does this? I'm not interested in anecdotal evidence, or references to some blog posts that claim it should be working. I'd like to hear from someone who actually has a fully-functional setup working in the real world, under real conditions, with real queries.

EDIT:

Let me repeat again that I'm not so much interested in anecdotal evidence that someone, somewhere has a somewhat running installation working with unspecified properties. I already went there, I read all the blog posts, mailing lists, I contacted the authors, but when it came to actual implementation of real-world scenarios, nothing ever worked as advertised.

Also, and a user below brought that point up as well, considering the TCO of any project, I'm definitely not interested in hearing that someone, somewhere was able to pull it off once a vendor parachuted in an unknown number of specialists to monkey-patch the whole installation with specific domain-knowledge that's documented nowhere.

So, please, if you claim you have a working installation that actually satisfies minimum requirements for a full-fledged search (see requirements above), please provide the following so that we can all benefit from a search solution for Django that actually solves the problem:

  • exact Linux distribution, release version,
  • exact release version of Haystack (or equivalent) and release version of search backend,
  • exact release version of the search engine
  • publicly (!) available documentation how to set up all components exactly in the way that your installation was set up such that the minimal requirements above are met.

Thank you.


Source: (StackOverflow)

Cannot use a CONTAINS or FREETEXT predicate on table or indexed view because it is not full-text indexed

I am getting the following error in my SQL Server 2008 R2 database:

Cannot use a CONTAINS or FREETEXT predicate on table or indexed view 'tblArmy' because it is not full-text indexed.


Source: (StackOverflow)

SQL Server 2008 Full Text Search (FTS) versus Lucene.NET

I know there have been questions in the past about SQL 2005 versus Lucene.NET, but since 2008 came out with a lot of changes to full-text search, I was wondering if anyone can give me pros/cons (or a link to an article).


Source: (StackOverflow)

Search for "whole word match" in MySQL

I would like to write an SQL query that searches for a keyword in a text field, but only if it is a "whole word match" (e.g. when I search for "rid", it should not match "arid", but it should match "a rid").

I am using MySQL.

Fortunately, performance is not critical in this application, and the database size and string size are both comfortably small, but I would prefer to do it in the SQL than in the PHP driving it.
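
The matching rule described above can be expressed as a regular expression with word boundaries on both sides of the keyword, evaluated inside the SQL query. The sketch below is an assumption-laden illustration, not the MySQL query itself: it uses Python's built-in `sqlite3` module as a stand-in database, registers a helper named `regexp` (hypothetical) to back SQLite's `REGEXP` operator, and uses a made-up `notes` table. MySQL has its own `REGEXP` operator, whose word-boundary syntax differs by version, so the pattern would need adapting there.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Back SQLite's `value REGEXP pattern` operator; returns 1 on a match, else 0."""
    if value is None:
        return 0
    return 1 if re.search(pattern, value) else 0


conn = sqlite3.connect(":memory:")
# SQLite has no built-in REGEXP implementation; it calls a user function named REGEXP.
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE notes (body TEXT)")
conn.executemany(
    "INSERT INTO notes (body) VALUES (?)",
    [("arid climate",), ("get a rid of it",), ("rid",), ("hybrid",)],
)

keyword = "rid"
# \b is a word boundary, so "rid" must not be glued to other word characters.
pattern = r"\b" + re.escape(keyword) + r"\b"
rows = conn.execute(
    "SELECT body FROM notes WHERE body REGEXP ? ORDER BY body", (pattern,)
).fetchall()
matches = [body for (body,) in rows]
```

Here `matches` holds the rows where "rid" stands alone as a word; "arid climate" and "hybrid" are excluded because the keyword is embedded in a longer word.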


Source: (StackOverflow)