Binary file search index creation debugging (3.7.4)

Author Message

kracker (the)

Monday 23 October 2006 10:41:12 am

I have seen an error attempting to follow the article:
<i>http://ez.no/layout/set/printarticle/community/articles/indexing_multiple_binary_file_types</i>

<b>Settings: Binary File : binaryfile.ini.append.php</b>

# cat settings/override/binaryfile.ini.append.php

# Here you can add handlers for new datatypes.

[HandlerSettings]

MetaDataExtractor[text/plain]=plaintext

MetaDataExtractor[application/pdf]=binaryfile

MetaDataExtractor[application/msword]=binaryfile

MetaDataExtractor[application/vnd.ms-excel]=binaryfile

MetaDataExtractor[application/vnd.ms-powerpoint]=binaryfile

# The full path to your log file (used for debugging/testing)

[BinaryFileHandlerSettings]

LogFile=var/log/index.log

Using eZ publish 3.7.4

Any suggestion or insight into following search index update <i>fatal</i> errors?

//kracker
<i>tv : theme : scrubs</i>

<b>Attempt #1</b>

# /usr/local/php4.4.4/bin/php -d memory_limit=456M -C update/common/scripts/updatesearchindex.php -s example_com-dev --logfiles --verbose --sql --db-user=db --db-password=db --db-database=example_com_dev_376

Using siteaccess example_com-dev for search index update
Starting object re-indexing
Number of objects to index: 5467
...................................................................... 1.280409731114%
...................................................................... 2.5608194622279%
...................................................................... 3.8412291933419%
...................................................................... 5.1216389244558%
...................................................................... 6.4020486555698%
...................................................................... 7.6824583866837%
...................................................................... 8.9628681177977%
...................................................................... 10.243277848912%
...................................................................... 11.523687580026%
...................................................................... 12.80409731114%
...................................................................... 14.084507042254%
...................................................................... 15.364916773367%
...................................................................... 16.645326504481%
...................................................................... 17.925736235595%
...................................................................... 19.206145966709%
...................................................................... 20.486555697823%
...................................................................... 21.766965428937%
...................................................................... 23.047375160051%
...................................................................... 24.327784891165%
...................................................................... 25.608194622279%
...................................................................... 26.888604353393%
...................................................................... 28.169014084507%
...................................................................... 29.449423815621%
...................................................................... 30.729833546735%
...................................................................... 32.010243277849%
...................................................................... 33.290653008963%
...................................................................... 34.571062740077%
...................................................................... 35.851472471191%
...................................................................... 37.131882202305%
...................................................................... 38.412291933419%
...................................................................... 39.692701664533%
...................................................................... 40.973111395647%
...................................................................... 42.253521126761%
...................................................................... 43.533930857875%
...................................................................... 44.814340588988%
...................................................................... 46.094750320102%
...................................................................... 47.375160051216%
...................................................................... 48.65556978233%
...................................................................... 49.935979513444%
...................................................................... 51.216389244558%
...................................................................... 52.496798975672%
...................................................................... 53.777208706786%
...................................................................... 55.0576184379%
...................................................................... 56.338028169014%
...................................................................... 57.618437900128%
...................................................................... 58.898847631242%
...................................................................... 60.179257362356%
...................................................................... 61.45966709347%
...................................................................... 62.740076824584%
...................................................................... 64.020486555698%
...................................................................... 65.300896286812%
...................................................................... 66.581306017926%
...................................................................... 67.86171574904%
...................................................................... 69.142125480154%
...................................................................... 70.422535211268%
...................................................................... 71.702944942382%
...................................................................... 72.983354673496%
...................................................................... 74.263764404609%
...................................................................... 75.544174135723%
...................................................................... 76.824583866837%
...................................................................... 78.104993597951%
...................................................................... 79.385403329065%
...................................................................... 80.665813060179%
...................................................................... 81.946222791293%
...................................................................... 83.226632522407%
...................................................................... 84.507042253521%
...................................................................... 85.787451984635%
...................................................................... 87.067861715749%
...............PHP Fatal error:  Call to a member function on a non-object in /web/dev/example.com/ezpublish-3.7.6/kernel/search/plugins/ezsearchengine/ezsearchengine.php on line 73

Fatal error: Call to a member function on a non-object in /web/dev/example.com/ezpublish-3.7.6/kernel/search/plugins/ezsearchengine/ezsearchengine.php on line 73

Fatal error: eZ publish did not finish its request
The execution of eZ publish was abruptly ended, the debug output is present below.

# date
Mon Oct 23 10:05:02 CDT 2006


<b>Attempt #2</b>

# /usr/local/php4.4.4/bin/php -d memory_limit=456M -C update/common/scripts/updatesearchindex.php -s example_com-dev --logfiles --verbose --sql --db-user=db --db-password=db --db-database=example_com_dev_376
Using siteaccess example_com-dev for search index update
Starting object re-indexing
Number of objects to index: 5467
...................................................................... 1.280409731114%
...................................................................... 2.5608194622279%
...................................................................... 3.8412291933419%
...................................................................... 5.1216389244558%
...................................................................... 6.4020486555698%
...................................................................... 7.6824583866837%
...................................................................... 8.9628681177977%
...................................................................... 10.243277848912%
...................................................................... 11.523687580026%
...................................................................... 12.80409731114%
...................................................................... 14.084507042254%
...................................................................... 15.364916773367%
...................................................................... 16.645326504481%
...................................................................... 17.925736235595%
...................................................................... 19.206145966709%
...................................................................... 20.486555697823%
...................................................................... 21.766965428937%
...................................................................... 23.047375160051%
...................................................................... 24.327784891165%
...................................................................... 25.608194622279%
...................................................................... 26.888604353393%
...................................................................... 28.169014084507%
...................................................................... 29.449423815621%
...................................................................... 30.729833546735%
...................................................................... 32.010243277849%
...................................................................... 33.290653008963%
...................................................................... 34.571062740077%
...................................................................... 35.851472471191%
...................................................................... 37.131882202305%
...................................................................... 38.412291933419%
...................................................................... 39.692701664533%
...................................................................... 40.973111395647%
...................................................................... 42.253521126761%
...................................................................... 43.533930857875%
...................................................................... 44.814340588988%
...................................................................... 46.094750320102%
...................................................................... 47.375160051216%
...................................................................... 48.65556978233%
...................................................................... 49.935979513444%
.....................................................................Segmentation fault

# date
Mon Oct 23 09:05:02 CDT 2006

Member since: 2001.07.13 || http://ezpedia.se7enx.com/

kracker (the)

Monday 23 October 2006 10:41:51 pm

Might anyone have a suggestion on where I might need to change setting to enable further debug information to deduce the cause of the error?

The end result .. crash (of unclear cause) seems very similar to the one described in the article.

<i>"In addition, we found that both pstotext and xpdf caused the same problem with very large PDF files - the SQL INSERT statement got too large (this is related to the eZ publish indexer, not the parsing tools) and would crash the indexer, resulting in no further content being indexed. Addressing that, in addition to handling multiple binary types in one file, led us to write our own custom parsing plugin."</i>

//kracker
<i>tv : robot.chicken</i>

Member since: 2001.07.13 || http://ezpedia.se7enx.com/

kracker (the)

Tuesday 24 October 2006 10:06:42 am

<b>Questions</b>

1. <i>The eZ publish 'var/' size question</i> - Does the process segfault because because of the 3500 MB of PDF, XLS, PPT, DOC files in this installation's 'var/' directory?

2. <i>The eZ publish 'version' question</i> - Does the process segfault because I'm trying to use 3.7.4 instead the very latest 3.8.x ?

3. <i>The eZ publish 'var/' file character contents question</i> - Does the process segfault because of the character content?

//kracker
<i>KMK - Koast II Koast - I've Had It</i>

Member since: 2001.07.13 || http://ezpedia.se7enx.com/

Paul Borgermans

Tuesday 24 October 2006 2:19:51 pm

Hi Cracker

Now you have a a nut to crack ;-)

(referring to your last post)
1: no
2: no
3: maybe, but unlikely

Try to adapt the updatesearch index script as I've done for the lucene plugin: this will give you a clue where the crash happened (in terms of object to index):

Dowload the file

http://pubsvn.ez.no/community/trunk/extension/lucene/trunk/bin/php/updatesearchindexlucene.php

Look for the line

print ("Indexing " . $innerNode->attribute( 'name' ) . " with node id " . $innerNode->attribute( 'node_id' ) . " and url_alias " . $innerNode->attribute( 'url_alias' ) . $endl );

And add this before the eZSearch::addObject( $object ); line of your update script or a copy of it).

I actually identified even corrupt objects/nodes in particular sites with this update script when using the lucene plugin (which calls every node placement and attribute of objects: it is also a unique debugging tool for corrupted databases indeed :-))

hth

Paul

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans

kracker (the)

Wednesday 25 October 2006 8:45:48 pm

<i>@Paul</i>

Heh (nice quote), as always I thank you for the very helpful direction.

<i>@Everyone</i>

After taking a closer look into the eZ publish error log, <i>var/log/error.log</i> I found the following error which gave a description and the final query.

<i>"Query error: MySQL server has gone away. Query: INSERT INTO ......."</i>

<i>http://www.phpmyvisites.us/faq/message-error-mysql-should-change-mysql-configuration-51.html</i>
<i>http://www.oooforum.org/forum/viewtopic.phtml?p=180203</i>
<i>http://dev.mysql.com/doc/refman/4.1/en/gone-away.html</i>
<i>http://dev.mysql.com/doc/refman/4.1/en/server-system-variables.html</i>

Yet seemingly even after changing a popular mysql configuration setting the problem persists. I added the setting, max_allowed_packet = 750M.

Cheers,
//kracker

<i>http://www.imdb.com/title/tt0105977/</i>

Member since: 2001.07.13 || http://ezpedia.se7enx.com/

Felipe Jaramillo

Wednesday 01 November 2006 7:13:10 pm

Hi kracker,

What a coincidence, just today Ive been doing some tests and it seems i'm having the same problem.

Im trying to upload a trimmed text version of the Php manual which is ~750K to a test installation of eZ 3.7.4.

1. If i try to upload without DelayedIndexing, i get a blank page and nothing gets indexed.
2. After enabling delayed indexing, the file uploads correctly.
3. If i run 'php runcronjobs.php' it goes to the new object, but stays there forever.

Running cronjobs/unpublish.php

Running cronjobs/rssimport.php

Running cronjobs/indexcontent.php
Starting processing pending search engine modifications
        Indexing object ID #94

I tried using the longer 'update/common/scripts/updatesearchindex.php' and i get:

Using siteaccess shop_admin for search index update
Starting object re-indexing
Number of objects to index: 34
..................Segmentation fault

Yet, when I check, it DOES manage to index at least some part of the file (i can search succsessfully for words and phrases that appear even at the end of the document)

I'm sure we'll find the solution somehow. By the way, the article on indexing_multiple_binary_file_types is very good.

Regards,

Felipe

I tried changing the mysql conf:
set-variable=max_allowed_packet=150M

But it seems to make no difference.

Felipe Jaramillo
eZ Certified Extension Developer
http://www.aplyca.com | Bogotá, Colombia

kracker (the)

Wednesday 01 November 2006 10:40:18 pm

<i>@Felipe</i>

Hey Felipe! Thanks for testing and reporting back your findings on this subject.

I really do appreciate it. I'm certain others have had these kinds of problems in the past and certainly in the future. Now's the time to weigh in and talk about it.

I've narrowed the problem a bit between eZ -> DB (MySQL) and most of my test results were similar to the ones you report.

I am working now on taking a different, new look at this very subject again here in the near future. I will have to report back my findings and end-game solution.

The answers are here right now, looking at us clearly in the eye, waiting to be seen ... so naturally I'm looking towards the solution in front of me.

<i>Cheers,
//kracker

/me says on irc://freenode.net #ezpublish, `the documentation defines the solution ....`</i>

KMK - In God We Trust (feat. Rude Boy)</i>

Member since: 2001.07.13 || http://ezpedia.se7enx.com/

Felipe Jaramillo

Thursday 02 November 2006 7:46:20 am

Hi kracker,

After further tests I can confirm the same behaviour in 3.7.5 on Windows XP.

I don't really see the answer that close. Seems that we'd have to look into the indexing code more carefully.

I did manage to get the html file indexing working using a modified copy of the word indexer and using "lynx --dump" as the command to execute.

Regards,

Felipe

Felipe Jaramillo
eZ Certified Extension Developer
http://www.aplyca.com | Bogotá, Colombia

kracker (the)

Monday 06 November 2006 4:30:22 pm

<i>It's a bug ... from a certain point of view ...</i>
I'll update again once I get just a little closer.

I found today, randomly browsing an issue entry which read to me surprisingly similar. <i>http://issues.ez.no/IssueView.php?Id=9300&activeItem=2</i>
I need to test this further first to see what if any results it yields.

A question passed my way today. Would I rather have a slower more reliable search (avoid bug via code) or fix the real reason for the bug (in php). What would you chose to solve?

//kracker
<i>intristicly - phone calls from the past unannounced ...</i>
<i>ps. damn, references are money makers!</i>

Member since: 2001.07.13 || http://ezpedia.se7enx.com/

kracker (the)

Monday 06 November 2006 4:47:22 pm

Well...

The results were more or less the same, I expected it to die before 90% and it last failed around 83%, this time it completed 85%.

The following is the history of the attempt with <i>xdebug</i> enabled.

# /web/ini/bin/build_ez_search_index_for_example_com.sh
Using siteaccess example-dev for search index update
Starting object re-indexing
{eZSearchEngine: Cleaning up search data}
Number of objects to index: 5465
...................................................................... 1.2808783165599%
...................................................................... 2.5617566331199%
...................................................................... 3.8426349496798%

<snip />
...................................................................... 80.695333943275%
...................................................................... 81.976212259835%
...................................................................... 83.257090576395%
...................................................................... 84.537968892955%
...................................................................... 85.818847209515%
............................
Notice: Uninitialized string offset:  0 in /web/dev/example_com/ezpublish-3.8.5/lib/ezxml/classes/ezxml.php on line 178

Call Stack:
    0.0009      55024   1. {main}() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:0
 1256.9776   12000000   2. ezsearch::addobject() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:165
 1256.9778   12000048   3. ezsearchengine->addobject() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/ezsearch.php:87
 1256.9960   12025624   4. ezcontentobjectattribute->metadata() /web/dev/example_com/ezpublish-3.8.5/kernel/search/plugins/ezsearchengine/ezsearchengine.php:85
 1256.9961   12025720   5. ezxmltexttype->metadata() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/ezcontentobjectattribute.php:1134
 1256.9963   12026520   6. ezxml->domtree() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/datatypes/ezxmltext/ezxmltexttype.php:344


Notice: Uninitialized string offset:  -1 in /web/dev/example_com/ezpublish-3.8.5/lib/ezxml/classes/ezxml.php on line 256

Call Stack:
    0.0009      55024   1. {main}() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:0
 1256.9776   12000000   2. ezsearch::addobject() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:165
 1256.9778   12000048   3. ezsearchengine->addobject() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/ezsearch.php:87
 1256.9960   12025624   4. ezcontentobjectattribute->metadata() /web/dev/example_com/ezpublish-3.8.5/kernel/search/plugins/ezsearchengine/ezsearchengine.php:85
 1256.9961   12025720   5. ezxmltexttype->metadata() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/ezcontentobjectattribute.php:1134
 1256.9963   12026520   6. ezxml->domtree() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/datatypes/ezxmltext/ezxmltexttype.php:344


Notice: Uninitialized string offset:  -1 in /web/dev/example_com/ezpublish-3.8.5/lib/ezxml/classes/ezxml.php on line 342

Call Stack:
    0.0009      55024   1. {main}() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:0
 1256.9776   12000000   2. ezsearch::addobject() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:165
 1256.9778   12000048   3. ezsearchengine->addobject() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/ezsearch.php:87
 1256.9960   12025624   4. ezcontentobjectattribute->metadata() /web/dev/example_com/ezpublish-3.8.5/kernel/search/plugins/ezsearchengine/ezsearchengine.php:85
 1256.9961   12025720   5. ezxmltexttype->metadata() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/ezcontentobjectattribute.php:1134
 1256.9963   12026520   6. ezxml->domtree() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/datatypes/ezxmltext/ezxmltexttype.php:344

.......................................... 87.099725526075%
.............
Fatal error: Call to a member function on a non-object in /web/dev/example_com/ezpublish-3.8.5/kernel/search/plugins/ezsearchengine/ezsearchengine.php on line 74

Call Stack:
    0.0009      55024   1. {main}() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:0
 1266.4872   12292752   2. ezsearch::addobject() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:165
 1266.4873   12292800   3. ezsearchengine->addobject() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/ezsearch.php:87


Fatal error: eZ publish did not finish its request
The execution of eZ publish was abruptly ended, the debug output is present below.

//kracker
<i>if you don't use xdebug...
then you don't know ...</i>

Member since: 2001.07.13 || http://ezpedia.se7enx.com/

Felipe Jaramillo

Tuesday 07 November 2006 2:51:11 pm

Ive looked into the search engine indexing code and found two things which may cause the problem:

\kernel\search\plugins\ezsearchengine

LINE 105:                $wordArray = split( " ", $text );

This splits the text all at once. In my experience this takes a long, long time. I guess maybe it would be better to do it in different parts.

The place where it breaks is when words are indexed in 1000 word groups. i think it goes up to 127.000 words before breaking:

LINE 129:

 $db =& eZDB::instance();
        $db->begin();
        for( $arrayCount = 0; $arrayCount < $wordCount; $arrayCount += 1000 )
        {
            $placement = $this->indexWords( $contentObject, array_slice( $indexArray, $arrayCount, 1000 ), $wordIDArray, $placement );
        }
        $db->commit();

I did try putting the db->begin() and db->commit() inside the for loop, to favour several transactions instead of a huge one, but it didn't help.

On windows, PHP crashes after reaching the 127.000th word.

I upload the complete PHP manual in html which is about 12 MB and then do indexing through the shell.

It's interesting to see that the text gets stripped off the html tags:

Line 94:
$text = eZSearchEngine::normalizeText( strip_tags(  $metaDataPart['text'] ), true );

So this way, you don't need to do a 'lynx --dump' to index html files as they will get stripped of html tags anyway. Therefore to index html, one could use the plaintext plugin:

settings/override/binaryfile.ini.append.php

[HandlerSettings]
MetaDataExtractor[text/html]=plaintext

Hope this helps a bit.

Regards,

Felipe

Felipe Jaramillo
eZ Certified Extension Developer
http://www.aplyca.com | Bogotá, Colombia

Kristian Hole

Tuesday 07 November 2006 3:59:07 pm

Felipe:

I did a test with the PHP-manual like you did. I get a segmentation fault in PHP as you did. This is caused by a reference count overflow in PHP. You can see the description here: http://ilia.ws/archives/5-Top-10-ways-to-crash-PHP.html

I have been provided with a patch for PHP which tells me exactly which variable that is causing the reference count overflow and results in PHP segfaulting. I am currently looking into getting help from PHP-core-developers to interpret the result, and possibly making a modification to the script that avoids the problem.

Kristian

http://ez.no/ez_publish/documenta...tricks/show_which_templates_are_used
http://ez.no/doc/ez_publish/techn...te_operators/miscellaneous/attribute

Felipe Jaramillo

Thursday 09 November 2006 3:08:02 pm

Hi Kristian,

Thanks for your input. I look forward to the results of your research as this is an essential feature in eZ.

Felipe

Felipe Jaramillo
eZ Certified Extension Developer
http://www.aplyca.com | Bogotá, Colombia

Kristian Hole

Wednesday 15 November 2006 8:33:46 am

Update on the subject:
http://lists.ez.no/pipermail/sdk-public/2006-November/002463.html

Kristian

http://ez.no/ez_publish/documenta...tricks/show_which_templates_are_used
http://ez.no/doc/ez_publish/techn...te_operators/miscellaneous/attribute

Brookins Consulting

Saturday 14 April 2007 9:27:08 am

Greetings Felipe!

Brookins Consulting has recently released a new eZ Publish extension which provides scripts which will successfully index your eZ publish content objects without errors via an alternative replacement for the stock eZ publish search index update shell scripts.

Project, <i>http://projects.ez.no/updatesearchindex</i>
Release News, <i>http://projects.ez.no/updatesearchindex/news/updatesearchindex_extension_0_0_1_released</i>
Download extension, <i>http://projects.ez.no/content/download/757/2989/version/2/file/updatesearchindex-0_0_1.tar.gz</i>

Feel free to ask further questions about this solution within the project forum, <i>http://projects.ez.no/updatesearchindex/forum/general/updatesearchindex_forum#msg451</i>

eZ Partner | North American Experience
http://brookinsconsulting.com/experience

Powered by eZ Publish™ CMS Open Source Web Content Management. Copyright © 1999-2014 eZ Systems AS (except where otherwise noted). All rights reserved.