
PDF files not indexed


Jeroen Sangers

Thursday 29 June 2006 1:54:44 am

I am trying to include the contents of PDF files in the search index, but cannot get it to work.

I installed pstotext on my server, and tested it with a PDF file. I followed the steps as laid out in http://ez.no/products/ez_publish/documentation/configuration/configuration/search_engine/configuring_binary_file_indexing, and uploaded a PDF file to my site. However, when I search for some words in that file, no results show up.

Is there any way I can turn on logging/auditing to see what is happening when I upload a PDF file?

Siniša Šehović

Thursday 29 June 2006 11:20:51 pm

Hi Jeroen

I have the same problem on eZ 3.8.2.

Can anyone help us here? :-)

Best regards,
S.

---
If at first you don't succeed, look in the trash for the instructions.

Jeroen Sangers

Friday 30 June 2006 8:17:04 am

I still can't get it to work. I have tried moving pstotext around all over my server, I switched to pdftotext, and I specified the full path to pstotext in my binaryfile.ini.append.php, but I always receive the same error:

Plugin for application/pdf was not found

Does anybody have a clue on how I can solve this?

Siniša Šehović

Saturday 01 July 2006 4:18:09 am

Hi Jeroen

What happens if you try to execute pstotext from the Linux shell?

Do you get any errors?
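
A quick way to run that check is sketched below, assuming a POSIX shell. The helper name check_tool is made up for illustration and is not part of eZ Publish; the tool names are the two mentioned in this thread.

```shell
# check_tool is a hypothetical helper, not part of eZ Publish.
# It prints "found" if the given command is on the PATH, otherwise
# prints "missing" and returns a non-zero status.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found"
    else
        echo "missing"
        return 1
    fi
}

# Tool names taken from this thread; adjust for your server.
for tool in pstotext pdftotext; do
    echo "$tool: $(check_tool "$tool")"
done
```

Keep in mind that the web server user may have a different PATH than your login shell, so a tool that is "found" here can still be invisible to PHP.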

Did you try this approach?
http://ez.no/community/forum/setup_design/indexing_binary_files_excel_and_powerpoint

S.


Jeroen Sangers

Monday 03 July 2006 1:33:13 am

I managed to solve it this weekend. There were two problems, and in the various configurations I have tried, always one of them appeared, until I tried the right combination!

The first problem is a mistake in the documentation. http://ez.no/products/ez_publish/documentation/configuration/configuration/search_engine/configuring_binary_file_indexing mentioned the following code:

[HandlerSettings]
MetaDataExtractor[application/pdf]=pdf

I copied that setting to my binaryfile.ini file, effectively destroying PDF parsing. Of course, I should have left it at the default value:

[HandlerSettings]
MetaDataExtractor[application/pdf]=ezpdf

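The second problem I had was related to pdftotext. I found that the command used by eZ Publish (pdftotext example.pdf) does not print anything to standard output, because by default pdftotext writes the text to a .txt file instead. To get this to work, I had to modify kernel/classes/datatypes/ezbinaryfile/plugins/ezpdfparser.php so that the command ends with a - as the output file: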

passthru( "$textExtractionTool $fileName -" );
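
The same command can be tried from the shell before touching the kernel file. This is only a sketch under assumptions: extract_text is a hypothetical wrapper name, and example.pdf stands for any PDF you have at hand.

```shell
# extract_text is a hypothetical wrapper mirroring the modified passthru() call.
# $1 = extraction tool (for example pdftotext), $2 = path to the PDF file.
# The trailing "-" is passed as the output file, which pdftotext treats
# as "write to standard output".
extract_text() {
    "$1" "$2" -
}

# Example usage (needs pdftotext and a real PDF, so it is left commented out):
#   extract_text pdftotext example.pdf | head
```

If the usage line prints the text of the PDF, the modified passthru() call should hand the same text to the indexer.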

Siniša Šehović

Tuesday 04 July 2006 2:57:18 am

Hi Jeroen

Thanks for the tip!

Now I can index my PDFs.

Best regards,
S.
