
Lucene interview questions

Top Lucene frequently asked interview questions

Indexing .PDF, .XLS, .DOC, .PPT using Lucene.NET

I've heard of Lucene.Net and I've heard of Apache Tika. The question is - how do I index these documents from C# rather than Java? I think the issue is that there is no .Net equivalent of Tika for extracting relevant text from these document types.

UPDATE - Feb 05 2011

Based on the responses given, it seems that there is not currently a native .Net equivalent of Tika. Two projects were mentioned, each interesting in its own right:

  1. Xapian Project (http://xapian.org/) - An alternative to Lucene written in unmanaged code. The project claims to support SWIG, which allows for C# bindings. Within the Xapian Project there is an out-of-the-box search engine called Omega. Omega uses a variety of open source components to extract text from various document types.
  2. IKVM.NET (http://www.ikvm.net/) - Allows Java to be run from .Net. An example of using IKVM to run Tika can be found here.

Given the above two projects, I see a couple of options. To extract the text, I could either a) use the same components that Omega is using or b) use IKVM to run Tika. To me, option b) seems cleaner as there are only two dependencies; a sketch of it follows.
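
To make option b) concrete, here is a minimal sketch of what calling Tika through IKVM could look like. This assumes the Tika jar (and its dependencies) has already been converted to a .NET assembly with ikvmc; the file names are illustrative:

using java.io;          // IKVM exposes Java packages as .NET namespaces
using org.apache.tika;  // from e.g.: ikvmc -target:library tika-app.jar

class TikaViaIkvm
{
    static void Main(string[] args)
    {
        // The Tika facade auto-detects the document type (.pdf, .doc,
        // .xls, .ppt, ...) and returns the extracted plain text.
        Tika tika = new Tika();
        string text = tika.parseToString(new File(args[0]));
        System.Console.WriteLine(text);
    }
}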

The interesting part is that now there are several search engines that could probably be used from .Net. There is Xapian, Lucene.Net or even Lucene (using IKVM).

UPDATE - Feb 07 2011

Another answer came in recommending that I check out IFilters. As it turns out, this is what Microsoft uses for Windows Search, so Office IFilters are readily available. Also, there are some PDF IFilters out there. The downside is that they are implemented in unmanaged code, so COM interop is necessary to use them. I found the below code snippet in a DotLucene.NET archive (no longer an active project):

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Text;

namespace IFilter
{
    [Flags]
    public enum IFILTER_INIT : uint
    {
        NONE = 0,
        CANON_PARAGRAPHS = 1,
        HARD_LINE_BREAKS = 2,
        CANON_HYPHENS = 4,
        CANON_SPACES = 8,
        APPLY_INDEX_ATTRIBUTES = 16,
        APPLY_CRAWL_ATTRIBUTES = 256,
        APPLY_OTHER_ATTRIBUTES = 32,
        INDEXING_ONLY = 64,
        SEARCH_LINKS = 128,
        FILTER_OWNED_VALUE_OK = 512
    }

    public enum CHUNK_BREAKTYPE
    {
        CHUNK_NO_BREAK = 0,
        CHUNK_EOW = 1,
        CHUNK_EOS = 2,
        CHUNK_EOP = 3,
        CHUNK_EOC = 4
    }

    [Flags]
    public enum CHUNKSTATE
    {
        CHUNK_TEXT = 0x1,
        CHUNK_VALUE = 0x2,
        CHUNK_FILTER_OWNED_VALUE = 0x4
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct PROPSPEC
    {
        public uint ulKind;
        public uint propid;
        public IntPtr lpwstr;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FULLPROPSPEC
    {
        public Guid guidPropSet;
        public PROPSPEC psProperty;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct STAT_CHUNK
    {
        public uint idChunk;
        [MarshalAs(UnmanagedType.U4)] public CHUNK_BREAKTYPE breakType;
        [MarshalAs(UnmanagedType.U4)] public CHUNKSTATE flags;
        public uint locale;
        [MarshalAs(UnmanagedType.Struct)] public FULLPROPSPEC attribute;
        public uint idChunkSource;
        public uint cwcStartSource;
        public uint cwcLenSource;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FILTERREGION
    {
        public uint idChunk;
        public uint cwcStart;
        public uint cwcExtent;
    }

    [ComImport]
    [Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    public interface IFilter
    {
        [PreserveSig]
        int Init([MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags, uint cAttributes, [MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] FULLPROPSPEC[] aAttributes, ref uint pdwFlags);

        [PreserveSig]
        int GetChunk(out STAT_CHUNK pStat);

        [PreserveSig]
        int GetText(ref uint pcwcBuffer, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder buffer);

        void GetValue(ref UIntPtr ppPropValue);
        void BindRegion([MarshalAs(UnmanagedType.Struct)] FILTERREGION origPos, ref Guid riid, ref UIntPtr ppunk);
    }

    [ComImport]
    [Guid("f07f3920-7b8c-11cf-9be8-00aa004b9986")]
    public class CFilter
    {
    }

    public class IFilterConstants
    {
        public const uint PID_STG_DIRECTORY = 0x00000002;
        public const uint PID_STG_CLASSID = 0x00000003;
        public const uint PID_STG_STORAGETYPE = 0x00000004;
        public const uint PID_STG_VOLUME_ID = 0x00000005;
        public const uint PID_STG_PARENT_WORKID = 0x00000006;
        public const uint PID_STG_SECONDARYSTORE = 0x00000007;
        public const uint PID_STG_FILEINDEX = 0x00000008;
        public const uint PID_STG_LASTCHANGEUSN = 0x00000009;
        public const uint PID_STG_NAME = 0x0000000a;
        public const uint PID_STG_PATH = 0x0000000b;
        public const uint PID_STG_SIZE = 0x0000000c;
        public const uint PID_STG_ATTRIBUTES = 0x0000000d;
        public const uint PID_STG_WRITETIME = 0x0000000e;
        public const uint PID_STG_CREATETIME = 0x0000000f;
        public const uint PID_STG_ACCESSTIME = 0x00000010;
        public const uint PID_STG_CHANGETIME = 0x00000011;
        public const uint PID_STG_CONTENTS = 0x00000013;
        public const uint PID_STG_SHORTNAME = 0x00000014;
        public const int FILTER_E_END_OF_CHUNKS = (unchecked((int) 0x80041700));
        public const int FILTER_E_NO_MORE_TEXT = (unchecked((int) 0x80041701));
        public const int FILTER_E_NO_MORE_VALUES = (unchecked((int) 0x80041702));
        public const int FILTER_E_NO_TEXT = (unchecked((int) 0x80041705));
        public const int FILTER_E_NO_VALUES = (unchecked((int) 0x80041706));
        public const int FILTER_S_LAST_TEXT = (unchecked((int) 0x00041709));
    }

    /// <summary>
    /// IFilter return codes
    /// </summary>
    public enum IFilterReturnCodes : uint
    {
        /// <summary>
        /// Success
        /// </summary>
        S_OK = 0,
        /// <summary>
        /// The function was denied access to the filter file.
        /// </summary>
        E_ACCESSDENIED = 0x80070005,
        /// <summary>
        /// The function encountered an invalid handle, probably due to a low-memory situation.
        /// </summary>
        E_HANDLE = 0x80070006,
        /// <summary>
        /// The function received an invalid parameter.
        /// </summary>
        E_INVALIDARG = 0x80070057,
        /// <summary>
        /// Out of memory
        /// </summary>
        E_OUTOFMEMORY = 0x8007000E,
        /// <summary>
        /// Not implemented
        /// </summary>
        E_NOTIMPL = 0x80004001,
        /// <summary>
        /// Unknown error
        /// </summary>
        E_FAIL = 0x80004005,
        /// <summary>
        /// File not filtered due to password protection
        /// </summary>
        FILTER_E_PASSWORD = 0x8004170B,
        /// <summary>
        /// The document format is not recognised by the filter
        /// </summary>
        FILTER_E_UNKNOWNFORMAT = 0x8004170C,
        /// <summary>
        /// No text in current chunk
        /// </summary>
        FILTER_E_NO_TEXT = 0x80041705,
        /// <summary>
        /// No more chunks of text available in object
        /// </summary>
        FILTER_E_END_OF_CHUNKS = 0x80041700,
        /// <summary>
        /// No more text available in chunk
        /// </summary>
        FILTER_E_NO_MORE_TEXT = 0x80041701,
        /// <summary>
        /// No more property values available in chunk
        /// </summary>
        FILTER_E_NO_MORE_VALUES = 0x80041702,
        /// <summary>
        /// Unable to access object
        /// </summary>
        FILTER_E_ACCESS = 0x80041703,
        /// <summary>
        /// Moniker doesn't cover entire region
        /// </summary>
        FILTER_W_MONIKER_CLIPPED = 0x00041704,
        /// <summary>
        /// Unable to bind IFilter for embedded object
        /// </summary>
        FILTER_E_EMBEDDING_UNAVAILABLE = 0x80041707,
        /// <summary>
        /// Unable to bind IFilter for linked object
        /// </summary>
        FILTER_E_LINK_UNAVAILABLE = 0x80041708,
        /// <summary>
        /// This is the last text in the current chunk
        /// </summary>
        FILTER_S_LAST_TEXT = 0x00041709,
        /// <summary>
        /// This is the last value in the current chunk
        /// </summary>
        FILTER_S_LAST_VALUES = 0x0004170A
    }

    /// <summary>
    /// Convenience class which provides static methods to extract text from files using installed IFilters
    /// </summary>
    public class DefaultParser
    {
        public DefaultParser()
        {
        }

        [DllImport("query.dll", CharSet = CharSet.Unicode)]
        private extern static int LoadIFilter(string pwcsPath, [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter, ref IFilter ppIUnk);

        private static IFilter loadIFilter(string filename)
        {
            object outer = null;
            IFilter filter = null;

            // Try to load the corresponding IFilter
            int resultLoad = LoadIFilter(filename,  outer, ref filter);
            if (resultLoad != (int) IFilterReturnCodes.S_OK)
            {
                return null;
            }
            return filter;
        }

        public static bool IsParseable(string filename)
        {
            return loadIFilter(filename) != null;
        }

        public static string Extract(string path)
        {
            StringBuilder sb = new StringBuilder();
            IFilter filter = null;

            try
            {
                filter = loadIFilter(path);

                if (filter == null)
                    return String.Empty;

                uint i = 0;
                STAT_CHUNK ps = new STAT_CHUNK();

                IFILTER_INIT iflags =
                    IFILTER_INIT.CANON_HYPHENS |
                    IFILTER_INIT.CANON_PARAGRAPHS |
                    IFILTER_INIT.CANON_SPACES |
                    IFILTER_INIT.APPLY_CRAWL_ATTRIBUTES |
                    IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
                    IFILTER_INIT.APPLY_OTHER_ATTRIBUTES |
                    IFILTER_INIT.HARD_LINE_BREAKS |
                    IFILTER_INIT.SEARCH_LINKS |
                    IFILTER_INIT.FILTER_OWNED_VALUE_OK;

                if (filter.Init(iflags, 0, null, ref i) != (int) IFilterReturnCodes.S_OK)
                    throw new Exception("Problem initializing an IFilter for:\n" + path + " \n\n");

                while (filter.GetChunk(out ps) == (int) (IFilterReturnCodes.S_OK))
                {
                    if (ps.flags == CHUNKSTATE.CHUNK_TEXT)
                    {
                        IFilterReturnCodes scode = 0;
                        while (scode == IFilterReturnCodes.S_OK || scode == IFilterReturnCodes.FILTER_S_LAST_TEXT)
                        {
                            uint pcwcBuffer = 65536;
                            System.Text.StringBuilder sbBuffer = new System.Text.StringBuilder((int)pcwcBuffer);

                            scode = (IFilterReturnCodes) filter.GetText(ref pcwcBuffer, sbBuffer);

                            if (pcwcBuffer > 0 && sbBuffer.Length > 0)
                            {
                                if (sbBuffer.Length < pcwcBuffer) // Should never happen, but it happens !
                                    pcwcBuffer = (uint)sbBuffer.Length;

                                sb.Append(sbBuffer.ToString(0, (int) pcwcBuffer));
                                sb.Append(" "); // "\r\n"
                            }

                        }
                    }

                }
            }
            finally
            {
                if (filter != null) {
                    Marshal.ReleaseComObject (filter);
                    System.GC.Collect();
                    System.GC.WaitForPendingFinalizers();
                }
            }

            return sb.ToString();
        }
    }
}

At the moment, this seems like the best way to extract text from documents using the .NET platform on a Windows server. Thanks everybody for your help.

UPDATE - Mar 08 2011

While I still think that IFilters are a good way to go, I think that if you are looking to index documents using Lucene from .NET, a very good alternative would be to use Solr. When I first started researching this topic, I had never heard of Solr. So, for those of you who have not either: Solr is a stand-alone search service, written in Java on top of Lucene. The idea is that you can fire up Solr on a firewalled machine and communicate with it via HTTP from your .NET application. Solr is truly written like a service and can do everything Lucene can do (including using Tika to extract text from .PDF, .XLS, .DOC, .PPT, etc.), and then some. Solr also seems to have a very active community, which is one thing I am not too sure of with regard to Lucene.NET.
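
As a rough illustration of that setup, pushing a binary document to Solr's extracting handler (Solr Cell, which wraps Tika) needs nothing more than an HTTP upload from .NET; the URL and id below are placeholders for your own setup:

using System;
using System.Net;
using System.Text;

class SolrExtractUpload
{
    static void Main()
    {
        using (WebClient client = new WebClient())
        {
            // literal.id supplies the unique key for the new document;
            // commit=true makes it immediately visible to searches.
            byte[] response = client.UploadFile(
                "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true",
                @"C:\docs\report.pdf");
            Console.WriteLine(Encoding.UTF8.GetString(response));
        }
    }
}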


Source: (StackOverflow)

How does Lucene work

I would like to find out how Lucene search works so fast. I can't find any useful docs on the web. If you have anything (short of the Lucene source code) to read, let me know.

A text search query using MySQL 5 text search with an index takes about 18 minutes in my case. A Lucene search for the same query takes less than a second.
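
The short answer is the inverted index: instead of scanning rows the way a LIKE '%term%' query does, Lucene resolves each query term with a single lookup into a term dictionary and then walks a precomputed postings list of matching documents. A toy sketch of the idea (an illustration only, not Lucene's actual implementation):

using System;
using System.Collections.Generic;

class InvertedIndexToy
{
    static void Main()
    {
        string[] docs = { "lucene is fast", "mysql full text", "lucene index" };

        // term -> ids of the documents containing that term
        var index = new Dictionary<string, List<int>>();
        for (int docId = 0; docId < docs.Length; docId++)
        {
            foreach (string term in docs[docId].Split(' '))
            {
                List<int> postings;
                if (!index.TryGetValue(term, out postings))
                    index[term] = postings = new List<int>();
                postings.Add(docId);
            }
        }

        // Which documents contain "lucene"? One dictionary probe: prints 0, 2
        Console.WriteLine(string.Join(", ", index["lucene"]));
    }
}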


Source: (StackOverflow)


How to evaluate hosted full text search solutions?

What are the options when it comes to SaaS/hosted full text search? How should I evaluate the different options available?

I'm looking for something that uses Lucene, Solr, or Sphinx on the backend, and provides a REST API for submitting documents to index and running searches.

I could build my own EC2 AMI, but I'd have to configure EBS and other stuff, monitor it, etc.


Source: (StackOverflow)

How to incorporate multiple fields in QueryParser?

Dim qp1 As New QueryParser("filename", New StandardAnalyzer())
Dim qp2 As New QueryParser("filetext", New StandardAnalyzer())
.
.

Instead of creating 2 separate QueryParser objects and using them to obtain 2 Hits objects, is it possible to perform a search on both fields using a single QueryParser object, so that I have only one Hits object which gives me the overall score of each Document?
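
For what it's worth, Lucene ships a MultiFieldQueryParser for exactly this case: it parses one query string against several fields and OR-combines the clauses, so a single search yields one Hits object with a combined score per document. A C# sketch, assuming the Lucene(.Net) 2.x-era API implied by the Hits object in the question:

using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

class MultiFieldSearch
{
    static Hits Search(IndexSearcher searcher, string queryText)
    {
        MultiFieldQueryParser parser = new MultiFieldQueryParser(
            new string[] { "filename", "filetext" },
            new StandardAnalyzer());

        Query query = parser.Parse(queryText);
        return searcher.Search(query); // one Hits object, overall score per doc
    }
}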


Source: (StackOverflow)

Solr vs. ElasticSearch

What are the core architectural differences between these technologies?

Also, what use cases are generally more appropriate for each?


Source: (StackOverflow)

What is best and most active open source .Net search technology?

I'm trying to decide on an open source search/indexing technology for a .Net project. It seems like the standard out there for Java projects is Lucene, but as far as .Net is concerned, the Lucene.Net project seems to be pretty inactive. Is this still the best option out there? Or are there other viable alternatives?


Source: (StackOverflow)

"Did you mean?" feature in Lucene.net

Can someone please let me know how to implement a "Did you mean?" feature in Lucene.net?

Thanks!


Source: (StackOverflow)

How to query SOLR for empty fields?

I have a large Solr index, and I have noticed some fields are not updated correctly (the index is dynamic).

This has resulted in some documents having an empty "id" field.

I have tried these queries, but they didn't work:

 id:''
 id:NULL
 id:null
 id:""
 id:
 id:['' TO *]

Is there a way to query empty fields?

Thanks
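
For reference, the usual Solr idiom for this is to negate a match-anything range query on the field:

 q=*:* -id:[* TO *]

The -id:[* TO *] clause matches documents with no indexed value for id (a purely negative query needs the *:* base clause to subtract from).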


Source: (StackOverflow)

ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage? [closed]

I'm currently looking at other search methods rather than having a huge SQL query. I saw elasticsearch recently and played with whoosh (a Python implementation of a search engine).

Can you give reasons for your choice(s)?


Source: (StackOverflow)

Comparison of Lucene Analyzers

Can someone please explain the difference between the different analyzers within Lucene? I am getting a maxClauseCount exception and I understand that I can avoid this by using a KeywordAnalyzer, but I don't want to change from the StandardAnalyzer without understanding the issues surrounding analyzers. Thanks very much.
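
To see the difference concretely, you can feed the same string through both analyzers and dump the tokens. A sketch assuming the Lucene.Net 3.0.3 attribute-based API: StandardAnalyzer lower-cases, drops stop words, and splits the input into several tokens (each of which can end up as a query clause, which is how expanded queries can blow past maxClauseCount), while KeywordAnalyzer emits the whole input as a single token:

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;

class AnalyzerComparison
{
    static void Dump(Analyzer analyzer, string text)
    {
        TokenStream ts = analyzer.TokenStream("field", new StringReader(text));
        ITermAttribute term = ts.AddAttribute<ITermAttribute>();
        while (ts.IncrementToken())
            Console.WriteLine("  [" + term.Term + "]");
    }

    static void Main()
    {
        string text = "The quick brown fox";

        Console.WriteLine("StandardAnalyzer:"); // [quick] [brown] [fox]
        Dump(new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), text);

        Console.WriteLine("KeywordAnalyzer:");  // [The quick brown fox]
        Dump(new KeywordAnalyzer(), text);
    }
}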


Source: (StackOverflow)

Difference between solr and lucene

I know that Lucene and Solr are two different Apache projects that are made to work together, but I don't understand the aim of each project.

From what I have understood until now, Lucene is used to create a search index and Solr uses this index to perform searches. Am I right, or is it a totally different approach?


Source: (StackOverflow)

How would one use Lucene.NET to help implement search on a site like Stack Overflow?

I've asked a similar question on Meta Stack Overflow, but that deals specifically with whether or not Lucene.NET is used on Stack Overflow.

The purpose of the question here is more of a hypothetical, as to what approach one would take if using Lucene.NET as a basis for in-site search and other factors in a site like Stack Overflow [SO].

As per the entry on the Stack Overflow blog titled "SQL 2008 Full-Text Search Problems" there was a strong indication that Lucene.NET was being considered at some point, but it appears that is definitely not the case, as per the comment by Geoff Dalgas on February 19th 2010:

Lucene.NET is not being used for Stack Overflow - we are using SQL Server Full Text indexing. Search is an area where we continue to make minor tweaks.

So my question is, how would one utilize Lucene.NET into a site which has the same semantics of Stack Overflow?

Here is some background and what I've done/thought about so far (yes, I've been implementing most of this and search is the last aspect I have to complete):

Technologies:

And of course, the star of the show, Lucene.NET.

The intention is also to move to .NET/C# 4.0 ASAP. While I don't think it's a game-changer, it should be noted.

Before getting into aspects of Lucene.NET, it's important to point out the SQL Server 2008 aspects of it, as well as the models involved.

Models

This system has more than one primary model type in comparison to Stack Overflow. Some examples of these models are:

  • Questions: These are questions that people can ask. People can reply to questions, just like on Stack Overflow.
  • Notes: These are one-way projections, so as opposed to a question, you are making a statement about content. People can't post replies to this.
  • Events: This is data about a real-time event. It has location information, date/time information.

The important thing to note about these models:

  • They all have a Name/Title (text) property and a Body (HTML) property (the formats are irrelevant, as the content will be parsed appropriately for analysis).
  • Every instance of a model has a unique URL on the site

Then there are the things that Stack Overflow provides which, IMO, are decorators to the models. These decorators can have different cardinalities, either being one-to-one or one-to-many:

  • Votes: Keyed on the user
  • Replies: Optional, as an example, see the Notes case above
  • Favorited: Is the model listed as a favorite of a user?
  • Comments: (optional)
  • Tag Associations: Tags are in a separate table, so as not to replicate the tag for each model. There is a link between the model and the tag associations table, and then from the tag associations table to the tags table.

And there are supporting tallies which in themselves are one-to-one decorators to the models that are keyed to them in the same way (usually by a model id type and the model id):

  • Vote tallies: Total positive and negative votes, Wilson Score interval (this is important; it's going to determine the confidence level based on votes for an entry; for the most part, assume the lower bound of the Wilson interval).

Replies (answers) are models that have most of the decorators that most models have; they just don't have a title or URL, and whether or not a model has a reply is optional. If replies are allowed, it is of course a one-to-many relationship.

SQL Server 2008

The tables pretty much follow the layout of the models above, with separate tables for the decorators, as well as some supporting tables and views, stored procedures, etc.

It should be noted that the decision to not use full-text search is based primarily on the fact that it doesn't normalize scores like Lucene.NET. I'm open to suggestions on how to utilize text-based search, but I will have to perform searches across multiple model types, so keep in mind I'm going to need to normalize the score somehow.

Lucene.NET

This is where the big question mark is. Here are my thoughts so far on Stack Overflow functionality as well as how and what I've already done.

Indexing

Questions/Models

I believe each model should have an index of its own containing a unique id to quickly look it up based on a Term instance of that id (indexed, not analyzed).
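
A minimal sketch of that arrangement, assuming Lucene.Net 2.9-era field flags (field and method names are illustrative):

using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

class ModelIdLookup
{
    // Index the id so it is searchable as a single exact term, and store
    // it so it can be read back from the matching document.
    static void AddIdField(Document doc, string modelId)
    {
        doc.Add(new Field("id", modelId, Field.Store.YES, Field.Index.NOT_ANALYZED));
    }

    // Look a model up again with a single exact-term query.
    static Query ById(string modelId)
    {
        return new TermQuery(new Term("id", modelId));
    }
}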

In this area, I've considered having Lucene.NET analyze each question/model and each reply individually. So if there were one question and five answers, the question and each of the answers would be indexed as separate units.

The idea here is that the relevance score that Lucene.NET returns would be easier to compare between models that project in different ways (say, something without replies).

As an example, a question sets the subject, and then the answer elaborates on the subject.

For a note, which doesn't have replies, the body itself presents the subject and then elaborates on it.

I believe that this will help with making the relevance scores more relevant to each other.

Tags

Initially, I thought that these should be kept in a separate index with multiple fields which have the ids to the documents in the appropriate model index. Or, if that's too large, there is an index with just the tags and another index which maintains the relationship between the tags index and the questions they are applied to. This way, when you click on a tag (or use the URL structure), the lookup is progressive, with each step paid for only when the previous one succeeds:

  • If the tag exists
  • Which questions the tags are associated with
  • The questions themselves

However, in practice, doing a query of all items based on tags (like clicking on a tag in Stack Overflow) is extremely easy with SQL Server 2008. Based on the model above, it simply requires a query such as:

select
     m.Name, m.Body
from
    Models as m
        left outer join TagAssociations as ta on
            ta.ModelTypeId = <fixed model type id> and
            ta.ModelId = m.Id
        left outer join Tags as t on t.Id = ta.TagId
where
    t.Name = <tag>

And since certain properties are shared across all models, it's easy enough to do a UNION between different model types/tables and produce a consistent set of results.

This would be analogous to a TermQuery in Lucene.NET (I'm referencing the Java documentation since it's comprehensive, and Lucene.NET is meant to be a line-by-line translation of Lucene, so all the documentation is the same).

The issue that comes up with using Lucene.NET here is that of sort order. The relevance score for a TermQuery when it comes to tags is irrelevant. It's either 1 or 0 (it either has it or it doesn't).

At this point, the confidence score (Wilson score interval) comes into play for ordering the results.

This score could be stored in Lucene.NET, but in order to sort the results on this field, it would rely on the values being stored in the field cache, which is something I really, really want to avoid. For a large number of documents, the field cache can grow very large (the Wilson score is a double, and you would need one double for every document; that can be one large array).

Given that I can change the SQL statement to order based on the Wilson score interval like this:

select
     m.Name, m.Body
from
    Models as m
        left outer join TagAssociations as ta on
            ta.ModelTypeId = <fixed model type id> and
            ta.ModelId = m.Id
        left outer join Tags as t on t.Id = ta.TagId
        left outer join VoteTallyStatistics as s on
            s.ModelTypeId = ta.ModelTypeId and
            s.ModelId = ta.ModelId
where
    t.Name = <tag>
order by
    --- Use Id to break ties.
    s.WilsonIntervalLowerBound desc, m.Id

It seems like an easy choice to use this to handle the piece of Stack Overflow functionality "get all items tagged with <tag>".

Replies

Originally, I thought this would be in a separate index of its own, with a key back into the Questions index.

I think that there should be a combination of each model and each reply (if there is one) so that relevance scores across different models are more "equal" when compared to each other.

This would of course bloat the index. I'm somewhat comfortable with that right now.

Or, is there a way to store, say, the models and replies as individual documents in Lucene.NET and then take both and be able to get the relevance score for a query treating both documents as one? If so, then this would be ideal.

There is of course the question of which fields would be stored, indexed, and analyzed (these can be separate operations, or mixed and matched). Just how much would one index?

What about using special stemmers/porters for spelling mistakes (using Metaphone) as well as synonyms (the community I will serve has its own slang/terminology for certain things, with multiple representations)?

Boost

This is related to indexing of course, but I think it merits its own section.

Are you boosting fields and/or documents? If so, how do you boost them? Is the boost constant for certain fields? Or is it recalculated for fields where vote/view/favorite/external data is applicable?

For example, in the document, does the title get a boost over the body? If so, what boost factors do you think work well? What about tags?

The thinking here is along the same lines as Stack Overflow. Terms in the document have relevance, but if a document is tagged with the term, or it is in the title, then it should be boosted.
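
For illustration, a constant index-time boost on the title field might look like this (Lucene.Net 2.9-era API, where the boost is set with SetBoost; the factor of 2.0 is an arbitrary placeholder, not a recommendation):

using Lucene.Net.Documents;

class BoostedDocument
{
    static Document Build(string title, string body)
    {
        Document doc = new Document();

        Field titleField = new Field("title", title, Field.Store.YES, Field.Index.ANALYZED);
        titleField.SetBoost(2.0f); // title matches score higher than body matches

        doc.Add(titleField);
        doc.Add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }
}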

Shashikant Kore suggests a document structure like this:

  • Title
  • Question
  • Accepted Answer (Or highly voted answer if there is no accepted answer)
  • All answers combined

And then using boost but not based on the raw vote value. I believe I have that covered with the Wilson Score interval.

The question is, should the boost be applied to the entire document? I'm leaning towards no on this one, because it would mean I'd have to reindex the document each time a user voted on the model.

Search for Items Tagged

I originally thought that when querying for a tag (by specifically clicking on one or using the URL structure for looking up tagged content), it's a simple TermQuery against the tag index for the tag, then a lookup in the associations index (if necessary), then back to the questions; Lucene.NET handles this really quickly.

However, given the notes above regarding how easy it is to do this in SQL Server, I've opted for that route when it comes to searching tagged items.

General Search

So now, the most outstanding question is: when doing a general phrase or term search against content, what other information (such as votes) do you integrate, and how, in order to return the results in the proper order? For example, when performing this search on ASP.NET MVC on Stack Overflow, these are the tallies for the top five results (when using the relevance tab):

    q votes answers accepted answer votes asp.net highlights mvc highlights
    ------- ------- --------------------- ------------------ --------------
         21      26                    51                  2              2
         58      23                    70                  2              5
         29      24                    40                  3              4
         37      15                    25                  1              2
         59      23                    47                  2              2

Note that the highlights are only in the title and abstract on the results page and are only minor indicators as to what the true term frequency is in the document, title, tag, reply (however they are applied, which is another good question).

How is all of this brought together?

At this point, I know that Lucene.NET will return a normalized relevance score, and the vote data will give me a Wilson score interval which I can use to determine the confidence score.

How should I look at combining these two scores to indicate the sort order of the result set based on relevance and confidence?

It is obvious to me that there should be some relationship between the two, but what that relationship should be evades me at this point. I know I have to refine it as time goes on, but I'm really lost on this part.

My initial thoughts are if the relevance score is between 0 and 1 and the confidence score is between 0 and 1, then I could do something like this:

1 / ((e ^ cs) * (e ^ rs))

This way, one gets a normalized value (equivalently, e ^ -(cs + rs)) that gets smaller the more relevant and confident the result is, and it can be sorted on that.

The main issue with that is that if boosting is performed on the tag and/or title field, then the relevance score is outside the bounds of 0 to 1 (the upper end becomes unbounded then, and I don't know how to deal with that).

Also, I believe I will have to adjust the confidence score to account for vote tallies that are completely negative. Since vote tallies that are completely negative result in a Wilson score interval with a lower bound of 0, something with -500 votes has the same confidence score as something with -1 vote, or 0 votes.

Fortunately, the upper bound decreases from 1 to 0 as negative vote tallies go up. I could change the confidence score to be a range from -1 to 1, like so:

confidence score = votetally < 0 ? 
    -(1 - wilson score interval upper bound) :
    wilson score interval lower bound

The problem with this is that plugging 0 into the equation will rank all of the items with zero votes below those with negative vote tallies.

To that end, I'm thinking if the confidence score is going to be used in a reciprocal equation like above (I'm concerned about overflow obviously), then it needs to be reworked to always be positive. One way of achieving this is:

confidence score = 0.5 + 
    (votetally < 0 ? 
        -(1 - wilson score interval upper bound) :
        wilson score interval lower bound) / 2
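
Putting both adjustments together, a direct translation of the pseudocode above (wilsonLower and wilsonUpper are hypothetical names for the already-computed bounds of the Wilson score interval):

class ConfidenceScoring
{
    static double ConfidenceScore(int voteTally, double wilsonLower, double wilsonUpper)
    {
        // -1..1 range: fully negative tallies use the negated distance of
        // the upper bound from 1, so -500 votes ranks below -1 vote.
        double signedScore = voteTally < 0
            ? -(1 - wilsonUpper)
            : wilsonLower;

        // Remap to 0..1 so the reciprocal ranking formula stays positive.
        return 0.5 + signedScore / 2;
    }
}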

My other concerns are how to actually perform the calculation given Lucene.NET and SQL Server. I'm hesitant to put the confidence score in the Lucene index because it requires use of the field cache, which can have a huge impact on memory consumption (as mentioned before).

An idea I had was to get the relevance score from Lucene.NET and then use a table-valued parameter to stream the scores to SQL Server (along with the ids of the items to select), at which point I'd perform the calculation with the confidence score and then return the data properly ordered.
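
A sketch of that hand-off using plain ADO.NET table-valued parameters (supported by SQL Server 2008; the table type and stored procedure names are hypothetical):

using System.Data;
using System.Data.SqlClient;

class ScoreStreaming
{
    // Assumes something like:
    //   CREATE TYPE dbo.SearchScore AS TABLE (ModelId int, Relevance float)
    // and a procedure dbo.RankSearchResults that joins @Scores against the
    // vote tallies and returns the rows properly ordered.
    static SqlCommand BuildCommand(SqlConnection conn, DataTable scores)
    {
        SqlCommand cmd = new SqlCommand("dbo.RankSearchResults", conn);
        cmd.CommandType = CommandType.StoredProcedure;

        SqlParameter p = cmd.Parameters.AddWithValue("@Scores", scores);
        p.SqlDbType = SqlDbType.Structured;
        p.TypeName = "dbo.SearchScore";
        return cmd;
    }
}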

As stated before, there are a lot of other questions I have about this; the answers have started to frame things, and will continue to expand upon things as the question and answers evolve.


Source: (StackOverflow)

Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?

I'm building a Django site and I am looking for a search engine.

A few candidates:

  • Lucene/Lucene with Compass/Solr

  • Sphinx

  • Postgresql built-in full text search

  • MySQL built-in full text search

Selection criteria:

  • result relevance and ranking
  • searching and indexing speed
  • ease of use and ease of integration with Django
  • resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU
  • scalability
  • extra features such as "did you mean?", related searches, etc

Anyone who has had experience with the search engines above, or other engines not in the list -- I would love to hear your opinions.

EDIT: As for indexing needs, as users keep entering data into the site, that data would need to be indexed continuously. It doesn't have to be real time, but ideally new data would show up in the index with no more than a 15 - 30 minute delay.


Source: (StackOverflow)

Is Solr available for .Net?

I want to learn Solr. Can anyone recommend some good tutorials/links for it?

Also, is Solr available for .NET?


Source: (StackOverflow)

NoSQL (MongoDB) vs Lucene (or Solr) as your database

With the NoSQL movement growing based on document-based databases, I've looked at MongoDB lately. I have noticed a striking similarity in how items are treated as "Documents", just like in Lucene (and for users of Solr).

So, the question: Why would you want to use NoSQL (MongoDB, Cassandra, CouchDB, etc) over Lucene (or Solr) as your "database"?

What I am (and I am sure others are) looking for in an answer is some deep-dive comparisons of them. Let's skip over relational database discussions altogether, as they serve a different purpose.

Lucene gives some serious advantages, such as powerful searching and weight systems. Not to mention facets in Solr (and Solr is being integrated into Lucene soon, yay!). You can use Lucene documents to store IDs and access the documents as such, just like MongoDB. Mix it with Solr, and you now get a WebService-based, load-balanced solution.

You can even throw in a comparison of out-of-proc cache providers such as Velocity or MemCached when talking about similar data storing and scalability of MongoDB.

The restrictions around MongoDB remind me of using MemCached, but I can use Microsoft's Velocity and get more grouping and list-collection power than MongoDB (I think). You can't get any faster or more scalable than caching data in memory. Even Lucene has a memory provider.

MongoDB (and others) do have some advantages, such as the ease of use of their API. New up a document, create an id, and store it. Done. Nice and easy.


Source: (StackOverflow)