Search engine: search inside a .pdf, .doc...? Is it possible?
laurent le cadet
Sunday 14 December 2003 3:30:23 am
Hi,
Can someone tell me whether it's possible to search inside all the files which have been uploaded to the db?
The files can be .doc, .pdf, .xls...
Thanks in advance.
Laurent.
Marco Zinn
Sunday 14 December 2003 5:47:02 am
Hi Laurent, yes. Since 3.2, you can define search plugins (external indexing programs) to "fetch" the words out of your binary files, depending on their MIME type. See http://ez.no/developer/ez_publish_3/documentation/incoming/configuring_binary_file_indexing for more info. We successfully did this with PDFs and partially with .DOCs, and we are working on PPT and XLS.
Marco http://www.hyperroad-design.com
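To illustrate the idea of picking an external indexing program by MIME type, here is a minimal Python sketch. It is not eZ publish's own configuration or code. The `EXTRACTORS` table, the function names and the choice of `pdftotext` and `catdoc` as command-line tools are assumptions made for the example; any tool that prints plain text to standard output could be substituted.

```python
import subprocess
from typing import Dict, List, Optional

# Hypothetical mapping from MIME type to an external text extractor.
# "{path}" is a placeholder that is replaced by the file to index.
# Assumes pdftotext and catdoc are installed; both print text to stdout
# when called this way.
EXTRACTORS: Dict[str, List[str]] = {
    "application/pdf": ["pdftotext", "{path}", "-"],
    "application/msword": ["catdoc", "{path}"],
}


def build_command(mime_type: str, path: str) -> Optional[List[str]]:
    """Return the argument list for the extractor, or None if unsupported."""
    template = EXTRACTORS.get(mime_type)
    if template is None:
        return None
    return [part.replace("{path}", path) for part in template]


def extract_text(mime_type: str, path: str, encoding: str = "utf-8") -> str:
    """Run the extractor and return its output; empty string if unsupported."""
    command = build_command(mime_type, path)
    if command is None:
        return ""
    result = subprocess.run(command, capture_output=True, check=True)
    # Undecodable bytes are replaced rather than aborting the whole index run.
    return result.stdout.decode(encoding, errors="replace")
```

A file with an unknown MIME type is simply skipped, so adding PPT or XLS support later would only mean adding another entry to the table.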
Sunday 14 December 2003 6:00:53 am
Hi Marco.
Great, great news! I'm working on a proposal for an intranet and they absolutely need this functionality. They didn't ask me if eZ can make coffee, but I'm sure it'll be possible with the next release ;-)
Thanks for replying, and hello to all Sunday workers.
Sunday 14 December 2003 9:28:18 am
I'm not sure about the coffee-making, but maybe we should file a "suggestion" for 3.4 or so ;)
Please note: indexing binary files needs third-party "parsing" tools for the document types that you want to index. Those are usually very strict about application versions etc. This means that being able to index Word 97 files with application "x" does not mean it will work with Word 2003. Another note: you will need quite a lot of RAM and possibly processing power to index large files (manuals and the like), and you may have to increase some MySQL parameters, because eZ (3.2) creates one large SQL query to store the words for each file. Also, your "indexing tools" must take care of character sets and "foreign characters"... Do you have French content?
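To show why one large query per file can become a problem, and one general way around it, here is a rough Python sketch that stores the words in several smaller batches. This is not what eZ publish 3.2 does (it builds a single query, as described above). The `search_word` table, the function names and the batch size are made up for the example, and `sqlite3` is used only as a stand-in so the code runs without a MySQL server.

```python
import sqlite3
from typing import Iterator, List, Sequence


def chunked(words: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield successive slices of at most `size` words."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(words), size):
        yield list(words[start:start + size])


def store_words(conn: sqlite3.Connection, file_id: int,
                words: Sequence[str], batch_size: int = 500) -> int:
    """Insert the words in small batches; return the number of batches."""
    batches = 0
    for batch in chunked(words, batch_size):
        # Each statement stays small, whatever the size of the document.
        conn.executemany(
            "INSERT INTO search_word (file_id, word) VALUES (?, ?)",
            [(file_id, word) for word in batch],
        )
        batches += 1
    conn.commit()
    return batches


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the hypothetical word table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE search_word (file_id INTEGER, word TEXT)")
    return conn
```

With a single query, the statement grows with the number of words in the file, which is why a large manual can run into server limits. Batching trades that for a few more round trips.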
Sunday 14 December 2003 9:36:29 am
With sugar please ;-)
I will pass all your advice on to my hosting partner, who will do the job if we win the competition.
All the content will be in French and there is no other language. I think we'll use windows-1252 for the charset. It seems to be OK for all the "special" characters.
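As a small illustration of the charset point, here is a Python sketch that decodes text coming out of an indexing tool as windows-1252 (`cp1252` in Python) and re-encodes it as UTF-8. The function name and the sample bytes are invented for the example. It assumes the tool really emits windows-1252, which would need checking for each tool.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from the source charset and re-encode them as UTF-8."""
    # A few byte values are undefined in cp1252 and raise UnicodeDecodeError.
    return raw.decode(source_encoding).encode("utf-8")


# French sample in windows-1252: 0x9c is the "oe" ligature, 0xe0 is
# "a" with grave accent, 0x80 is the euro sign.
sample = b"c\x9cur \xe0 100 \x80"
converted = to_utf8(sample)
```

The ligature and the euro sign are the interesting cases: windows-1252 has them, while ISO-8859-1 puts control characters at those byte values, so mixing up the two charsets corrupts exactly these characters.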