Bring2mind Forums

Last Post 12/12/2012 7:08 PM by david@designmind.com. 16 Replies.
Darryl Jenkins
New Member
Posts:16


--
10/03/2012 6:03 PM
I'm having problems getting accurate PDF content search results using DMX v6.03 Lucene Search Provider. I have the latest IFilters installed and my site is running in Full trust.

Using LUKE to examine the index, I see some problems. For example, committe is indexed but committee is not. Pilot is indexed but not pilots. There are many cases where words ending in s are not indexed.

Using a stand alone third party DNN search engine (Search Boost) and examining its index shows committee (not committe) and both pilots and pilot and overall, returns more accurate results.

Is there anything I can do to improve the accuracy of the results returned by DMX?
Peter Donker
Veteran Member
Posts:4536


--
10/04/2012 3:01 PM
We use the default Lucene search engine without any modifications. It is version 2.9.2.2. I'm not aware of word-ending issues like the ones you mention. The actual word storage is handled by Lucene, not by DMX, so the problem would have to show up in other applications using Lucene 2.9 as well. What version of Lucene does Search Boost use?
Darryl Jenkins
New Member
Posts:16


--
10/04/2012 3:12 PM
Peter,

Thanks for the response. When I examined the DMX index using Luke, it showed as Lucene version 3.1.

The Search Boost version is 2.9. I also examined an old DMX index built using version 5.x and it appears correct (I never had a problem with the search until this new version). It appears that the search engine is dropping the endings of words ending in s (e.g. pilots) and es (committe, includ, provid).

The search functionality is extremely important to us so I appreciate your help.
Peter Donker
Veteran Member
Posts:4536


--
10/04/2012 4:16 PM
Hi Darryl,

OK. That is somewhat confusing. Here Luke tells me the DMX index is 2.9, which would make sense since that is the version DMX came with. Can you verify the version number on the Bring2mind.Lucene.Net.dll?

Peter
Darryl Jenkins
New Member
Posts:16


--
10/04/2012 4:41 PM
Peter,

I updated to 6.04 and re-ran the index. Luke now shows version 2.9 (also shown in the file properties) but the index hasn't really changed (i.e., it is still dropping s, es, etc.).

Here's a view of the top-ranking terms in the DMX index:
1400 contents the
1392 contents to
1390 contents in
1387 contents a
1383 contents and
1374 contents of
1358 contents for
1354 contents on
1330 contents is
1319 contents will
1309 contents pilot
1297 contents with
1290 contents that
1287 contents be
1260 contents at
1252 contents this
1247 contents by
1239 contents are
1238 contents delta
1234 contents as
1209 contents s
1198 contents or
1162 contents an
1160 contents from
1117 contents not
1115 contents mec
1114 contents have
1097 contents all
1074 contents if
1063 contents provid
1043 contents has
1034 contents committe
1029 contents it
1025 contents may
989 contents alpa
989 contents time
973 contents you
956 contents one
936 contents ani
936 contents follow
934 contents your
918 contents 1
881 contents can
877 contents two
875 contents other
875 contents includ
862 contents been
836 contents line
835 contents which
831 contents inform

Notice pilot (and not pilots), includ, provid, and committe.

Here's the printout from the Search Boost index (also very similar to the DMX 5.x index):

1493 Content file
1493 Content 0
1279 Content delta
1271 Content s
1243 Content pilots
1162 Content pilot
1128 Content mec
1115 Content have
1110 Content alpa
1097 Content all
1057 Content 1
1047 Content has
1025 Content committee
1024 Content may
991 Content you
959 Content one
951 Content your
938 Content any
886 Content can
879 Content time
877 Content two
861 Content been
857 Content other
848 Content 2
834 Content which
800 Content following
798 Content new
795 Content more
793 Content also
792 Content provide
788 Content during
787 Content 3
779 Content available
771 Content under
757 Content first
746 Content who
746 Content information
745 Content only
741 Content after
741 Content 10
739 Content 11
732 Content line
728 Content 12
723 Content please
721 Content 5
718 Content through
716 Content we
716 Content number
714 Content than
699 Content 7

It includes both pilot and pilots as well as committee, etc.

Darryl
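
One way to make the comparison above repeatable is to diff the two term lists. This is a minimal sketch in Python, assuming each list is saved as plain text with one "count field term" entry per line, as pasted above. The function names and the short sample strings are illustrative only and are not part of Luke, DMX, or Search Boost.

```python
def parse_terms(dump):
    """Return the set of terms from lines shaped like '1309 contents pilot'."""
    terms = set()
    for line in dump.splitlines():
        parts = line.split()
        # Keep only well-formed "count field term" lines.
        if len(parts) == 3 and parts[0].isdigit():
            terms.add(parts[2])
    return terms


def missing_terms(reference_dump, suspect_dump):
    """Terms present in the reference index but absent from the suspect one."""
    return sorted(parse_terms(reference_dump) - parse_terms(suspect_dump))


# Short excerpts of the two lists above.
dmx_dump = """1309 contents pilot
1063 contents provid
1034 contents committe"""

search_boost_dump = """1243 Content pilots
1162 Content pilot
1025 Content committee
792 Content provide"""

print(missing_terms(search_boost_dump, dmx_dump))
# ['committee', 'pilots', 'provide']
```

Terms that appear only in the reference list are the ones worth checking for stemming in the other index.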
Darryl Jenkins
New Member
Posts:16


--
10/04/2012 8:48 PM
Peter,

While the index is dropping the s and es from words, I found that the Lucene.analysis.en.EnglishAnalyzer in Luke will return documents when I search for committees though the other analyzers will not. Don't know if this is helpful but I thought I would let you know.

Darryl
Peter Donker
Veteran Member
Posts:4536


--
10/05/2012 4:54 PM
Hi Darryl,

OK, I think we're getting somewhere. So it's not the version of Lucene, but the algorithm used when it parses the text coming in. I know that I pass the text in there and have to tell it what language it should expect. This is done based on Threading.Thread.CurrentThread.CurrentCulture. I.e. the language that DNN is currently running under. It then uses the SnowballAnalyzer of Lucene to parse the text. I then stumbled on this:

http://stackoverflow.com/...analyzer-vs-snowball

This looks very much like what is happening. I'm not sure about the best way forward but for now it looks like it will need a review. If you have the partial source version you could potentially already tweak this yourself to your own liking. For the main release I wonder if this should be revised and how. I.e. should there be another mechanism when retrieving the search or is it really the stemmer that is too aggressive? Can you give me an example of a search that is going wrong as a result of this?

Peter
Darryl Jenkins
New Member
Posts:16


--
10/05/2012 5:14 PM
Peter,

Virtually any search for a word ending in s or es returns nothing at all. So a search for a proper name (Roberts) will return 0 results, while a search for Robert will include all results for both Robert and Roberts. Searching for "Committees" returns nothing, but a search for "Committe" returns the results I expect.

This seems to be the opposite of what the Snowball link above describes. The thread states:

For example, Snowball will stem "organization" into "organ", so a search for "organization" will return results with "organ", without any scoring penalty.

It may be stemming committees into 'committe' but a search for 'committees' returns nothing.

Thanks again for your attention to this matter.

Darryl
Peter Donker
Veteran Member
Posts:4536


--
10/05/2012 5:29 PM
From what I've read, the stemmer causes the original word to be lost and replaced with the "stem" of the word. So Roberts becomes Robert. What I don't get is how, when retrieving search results, Roberts could find Robert in the data (which is what the link appears to suggest). There is virtually no documentation on this, so it's pretty tough to get to the bottom of it. But I'll give it a shot.
Note that I'm out of the office for a couple of weeks, though.

Peter
Darryl Jenkins
New Member
Posts:16


--
10/05/2012 5:38 PM
Peter,

Is the Analyzer part of Lucene or part of DMX? Would loading the DMX 5.x version of Lucene (2.0) get me the results I'm currently getting with 5.x? I'd take anything that gets me up and running in the short term while you look into a solution.

Thanks,
Darryl
Peter Donker
Veteran Member
Posts:4536


--
10/05/2012 5:38 PM
NB. During the search retrieval there is no definition of a stemmer. I forgot to add that. So "Roberts" goes in "as is" without being reduced to "Robert".
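
The mismatch described above can be sketched with a toy example, written in Python rather than Lucene.Net. The `toy_stem` function is a hypothetical stand-in that only strips a trailing "s". It is not the Snowball algorithm, and none of these names come from DMX or Lucene. It only illustrates why a term that is stemmed at index time cannot be matched by a query that goes in unstemmed.

```python
def toy_stem(word):
    """Hypothetical stand-in for a stemmer: lowercase and drop one trailing 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def build_index(docs, analyze):
    """Map each analyzed term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(analyze(token), set()).add(doc_id)
    return index


def search(index, query, analyze):
    """Look up a single-word query after running it through `analyze`."""
    return index.get(analyze(query), set())


docs = {1: "Robert called the pilots", 2: "Roberts chaired the committees"}

# Index time: every term is stemmed, so "roberts" is stored as "robert".
index = build_index(docs, toy_stem)

# Query time without stemming: "roberts" is not a term in the index.
unstemmed_hits = search(index, "Roberts", str.lower)

# Query time with the same analysis as indexing: both documents match.
stemmed_hits = search(index, "Roberts", toy_stem)

print(unstemmed_hits, stemmed_hits)
```

In this sketch the unstemmed query finds nothing while the stemmed one finds both documents, which matches the behaviour reported for "Roberts" and "Robert" earlier in the thread. Running the query through the same analysis as the index is one possible remedy, though whether that is the right fix for DMX is not settled here.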
Peter Donker
Veteran Member
Posts:4536


--
10/05/2012 5:57 PM
The analyzer is wrapped into Lucene. But the code that tells Lucene which analyzer to use is in DMX and is part of the partial source package. Switching DLLs won't work, I'm afraid. You'll probably end up with a mess, as the versions are all registered in the other DLLs and .NET will have a fit if you switch them.
Darryl Jenkins
New Member
Posts:16


--
10/09/2012 9:33 PM
Peter,

In the meantime I'm trying to go back to DMX 5.x, but after uninstalling DMX 6.x and installing 5.x, I get the following SQL error:

Failure
SQL Execution resulted in following Exceptions: System.Data.SqlClient.SqlException (0x80131904): There is already an object named 'PK_DMX_EntryPermissions' in the database. Could not create constraint. See previous errors.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning()
   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at System.Data.SqlClient.SqlCommand.RunExecuteNonQueryTds(String methodName, Boolean async)
   at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(DbAsyncResult result, String methodName, Boolean sendToPipe)
   at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
   at DotNetNuke.Data.SqlDataProvider.ExecuteADOScript(String SQL)
   at DotNetNuke.Data.SqlDataProvider.ExecuteScript(String Script, Boolean UseTransactions)
ALTER TABLE dbo.[DMX_EntryPermissions] ADD CONSTRAINT [PK_DMX_EntryPermissions] PRIMARY KEY CLUSTERED ([EntryId], [PermissionId], [RoleId], [UserId])

Now I'm completely stuck as neither edition will install (or uninstall). I've manually removed all the DMX tables, stored procedures, views, and functions in the database as well as the DesktopModules/DMX folder but I continue to get this error.

Hope you can help.

Darryl
Darryl Jenkins
New Member
Posts:16


--
10/12/2012 2:56 PM
Peter,

I've got version 5.3.9 loaded on my site, so I'll wait to upgrade until after you've had a chance to look into the Lucene search issue. Can you add an activation to my account so I can activate the 5.x module?

Thanks for all of your help.

Darryl
Peter Donker
Veteran Member
Posts:4536


--
10/24/2012 5:55 PM
Darryl,

No problem. Please contact me by email with the invoice code for that.

Peter
david@designmind.com
New Member
Posts:12


--
12/11/2012 8:14 PM
We seem to be experiencing this problem as well. According to our DLLs we are using version 6.0.3 (Bring2mind.Lucene.Net.dll is v2.9.2.2).

Is this resolved in the latest version?
david@designmind.com
New Member
Posts:12


--
12/12/2012 7:08 PM
Apologies. I missed the BUGNET box at the top of the thread.

The answer to my own question is:

DMX - 476
Lucene stemmer issues
Fixed In Version: 06.01.00