Extracting text from a Word or PDF file

jimbo · July 11, 2007, 12:57pm

Hi

I have a document content type which is published out and indexed by Index Server. I would like to attach some metadata to the binary file but don’t know how. Obviously I can set up the metadata fields in the content editor so the user can select the values, but then how do I attach that to the document?

Is there a way of extracting the text from a binary (.doc, .pdf) when saving in the content editor, and saving it to a hidden field so that I can publish this as a search template for Index Server to catalog?

Cheers
James

dbenua · July 11, 2007, 1:32pm

James,

One trick I’ve used in the past is to define a new Variant or Template (depending on what version you’re using) which publishes XML in a format that the indexer understands. The XML contains the file path of the Binary as well as the metadata that you want indexed. Obviously this format depends on which indexer you are using, but most of them have a way to do this.

You can, of course, write a custom publisher plugin for your indexer (I’ve done that, too), but the XML file approach is frequently simpler and quicker.
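
To illustrate the XML file approach, here is a minimal sketch in Python of what such a published record might contain. The element names (`document`, `file`, `meta`), the function name, and the sample path are all hypothetical; the real format depends entirely on what your indexer expects.

```python
import xml.etree.ElementTree as ET


def build_index_record(binary_path, metadata):
    """Return an XML string pairing a published binary's path with its metadata.

    The element names used here are hypothetical placeholders, not a format
    that any particular indexer is known to accept.
    """
    root = ET.Element("document")
    # Path of the published binary, so the indexer can find the file itself.
    ET.SubElement(root, "file").text = binary_path
    # One <meta> element per metadata field, sorted for stable output.
    for name, value in sorted(metadata.items()):
        field = ET.SubElement(root, "meta", {"name": name})
        field.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_index_record("/docs/report.pdf", {"department": "Finance"}))
```

In Rhythmyx itself the equivalent output would come from the Variant or Template rather than from a script; the sketch only shows the shape of the data.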

Dave

dbenua · July 11, 2007, 1:35pm

You’re asking how to get the data, not how to publish the data…

Well there’s always the TextExtraction UDF, but it’s kind of clumsy (you get ALL the text).

jimbo · July 11, 2007, 1:39pm

Hi David

Yes, I’ve also done this in the past, but I’d also like it to index the content of the file as well as the metadata, which is the problem. I can have either a search variant/template which the indexer indexes, or the binary file that is indexed. I can’t seem to have both.

Cheers
James

jimbo · July 11, 2007, 1:41pm

Hi David

>>TextExtraction UDF

Is this part of the product in 6.x or do I have to contact PSO?

Is this also available in 5.7?

Cheers
James

dbenua · July 11, 2007, 1:53pm

Look at Chapter 8 (page 91) of Implementing_Content_Editors_Version_5_7.pdf.

Dave

jimbo · July 11, 2007, 5:04pm

Are there any known issues with using this on shared fields? I can get this working on local fields but not on shared ones. I get the following message:

“The data supplied for extraction contains a file type that is not supported.”

It still saves the binary OK and extracts the MIME type.

my source is set to : PSXParam/item_file_attachment
and my FileTypeParam is set to : item_file_attachment_type

cheers
James

rljohnson · July 16, 2007, 11:21am

For Rhythmyx Version 6.5, see “Implementing Text Extraction” on p. 256 of the Rhythmyx Implementation Guide.

[QUOTE=dbenua;295]Look at Chapter 8 (page 91) of Implementing_Content_Editors_Version_5_7.pdf.

Dave[/QUOTE]

jimbo · July 17, 2007, 5:44am

Thanks but it is Rhythmyx 5.7 I’m having the problem with.

Cheers
James

dbenua · July 17, 2007, 6:46pm

[QUOTE=jimbo;298]Are there any known issues with using this on shared fields? I can get this working on local fields but not on shared ones. I get the following message:

“The data supplied for extraction contains a file type that is not supported.”

It still saves the binary OK and extracts the MIME type.

my source is set to : PSXParam/item_file_attachment
and my FileTypeParam is set to : item_file_attachment_type

cheers
James[/QUOTE]

I’m not familiar with this message. There have historically been some issues with shared fields and “full text search” (which is heavily inter-related with Text Extraction). However, I don’t see any that are still “open”.

Can you give a few more details of what exactly you’ve done and what the exact error message is?

Dave

jimbo · July 18, 2007, 4:53am

Hi David

After capturing screenshots and detailing how it was set up, it now works fine the first time you upload a binary file. However, I get this message if I try to update the content item, even if I use the same binary file:

An error occurred processing the update submitted by session id 6b0ba441f76a7158d0451372e4fd165f84c18867.
1 An exception occurred while processing the internal request handler call: com.percussion.error.PSException: org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified.
2 modify

I can’t attach any screenshots, as I get this message when uploading them:

text_extraction.doc:
Your file of 275.5 KB bytes exceeds the forum’s limit of 19.5 KB for this filetype.

Cheers
James

dbenua · July 18, 2007, 9:50am

James,

This is getting too complicated for the informal processes of the Forum; we need to take this offline. You need to open a support case, but before you do, I’d like to suggest that you:

Turn on tracing for the Content Editor application (at least “HTML Parameters” and “Exit Execution”). Having the trace file will let us know which Extension is failing.
Check whether it makes a difference when you “re-upload” the file. That is, is the behaviour different when an uploaded file is present or not present in the input parameters?

The Text Extraction exit doesn’t do any real XML processing, and the error you are getting is an XML DOM error. Also, we know of nothing different about Shared fields. Text Extraction is just looking at the HTML parameters, which are not any different between local and shared fields.
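
Nothing in this thread establishes the cause of the INVALID_CHARACTER_ERR. One thing that may be worth ruling out, purely as an assumption, is whether text taken from the .doc or .pdf contains characters that XML 1.0 does not permit (for example form feeds or NUL bytes). The Python sketch below is a standalone diagnostic, not part of Rhythmyx; the function names are made up for illustration. It reports and strips characters outside the XML 1.0 `Char` range.

```python
import re

# Matches any character outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return (offset, hex code point) pairs for characters XML 1.0 forbids."""
    return [(m.start(), hex(ord(m.group())))
            for m in _ILLEGAL_XML_CHARS.finditer(text)]


def strip_illegal_xml_chars(text):
    """Return text with every XML 1.0 illegal character removed."""
    return _ILLEGAL_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # Hypothetical extracted text containing a form feed and a NUL byte.
    sample = "ab\x0c" + "cd\x00"
    print(find_illegal_xml_chars(sample))
    print(strip_illegal_xml_chars(sample))
```

If a check like this finds nothing in the extracted text, the problem probably lies elsewhere, which is where the trace file would help.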

Dave