We are working with a content type for which we’d like to produce both HTML and text files. The content type has a number of HTML fields and, if possible, we’d like to include the contents of those fields in our text file output, formatted appropriately and without tags. When I say “appropriately”, I mean things like replacing <p> tags with line breaks, bullets with “*”, etc.
Has anyone had any success (or failure, for that matter) in doing this?
In 6.5.2 you could do it in the page template that generates the text file, using string replacement on the fields and the regular-expression capabilities of the replaceAll method. Something like…
…would strip out all HTML tags. Replacing specific tags with text-only equivalents would take a lot of experimenting with multiple, more complicated regex patterns, but it would be doable.
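For anyone who wants to try the regex route outside the template first, here is a rough sketch in plain Java. It is not the snippet from the post above and not anything shipped with the product: the class name HtmlToTextRegex and the method toText are made up for illustration, and the patterns only cover <p>, <br>, <li> and a handful of entities. It uses nothing beyond String.replaceAll and String.replace from the JDK.

```java
// Hypothetical helper (names are illustrative, not a product API).
// Converts a small subset of HTML to plain text using regex replacement.
class HtmlToTextRegex {

    static String toText(String html) {
        String s = html;
        // Bullets: an opening <li> becomes "* ", a closing </li> ends the line.
        // The (\s[^>]*)? part avoids matching other tags such as <link>.
        s = s.replaceAll("(?i)<li(\\s[^>]*)?>", "* ");
        s = s.replaceAll("(?i)</li\\s*>", "\n");
        // Line breaks and paragraph ends.
        s = s.replaceAll("(?i)<br\\s*/?>", "\n");
        s = s.replaceAll("(?i)</p\\s*>", "\n\n");
        // Strip every remaining tag.
        s = s.replaceAll("<[^>]+>", "");
        // Decode a few common entities; &amp; goes last to avoid double decoding.
        s = s.replace("&nbsp;", " ")
             .replace("&lt;", "<")
             .replace("&gt;", ">")
             .replace("&quot;", "\"")
             .replace("&amp;", "&");
        // Tidy up: drop trailing spaces and collapse runs of blank lines.
        s = s.replaceAll("[ \\t]+\\n", "\n");
        s = s.replaceAll("\\n{3,}", "\n\n");
        return s.trim();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p><ul><li>one</li><li>two</li></ul>";
        System.out.println(toText(html));
    }
}
```

Regexes like these are easy to break with unusual markup (comments, attributes containing ">", nested lists), so treat this as a starting point for experimenting rather than a complete converter.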
Writing a Java extension to do what you want would give you finer control, and the template would not look nearly as messy (i.e. just call the extension and pass it the body field, say). I am sure there is an open-source Java class out there that you can modify to do your bidding as well (it would be doing string modification just as Andrew suggested, but inside a Java class as opposed to the Velocity template.)…
Of course I say this because we will be doing this shortly and that is the path that I will be taking…
I had thought about the regex idea, but wondered how maintainable and readable it would be. I don’t want to try to reinvent the wheel in Java, since I can’t be the first person to do this conversion. But I’ll do that if I end up needing to, I guess.
Anyhow, I do appreciate the suggestions - thanks.
Another approach is to use the psoTransform extension if you have the PSO toolkit installed.
It requires both the PSO toolkit and XSL, so I’m not sure if that is what you’re really looking for.
It is, however, how the FO templates provided by PSO transform HTML into the Formatting Objects that the FOP assembler uses to generate PDFs and the like.
Come to think of it, the FOP assembler is also supposed to be able to generate rich text, etc. But, again, I’m not sure that is what you’re looking for.
Thanks for this pointer. We are already using the PSO Toolkit, so that’s not a barrier. And I’m fine with XSL, if it solves the problem we’re trying to solve. I may give this a try to see if I can make it work.
The psoTransform extension, with XSL, seems to work well. XSL is well suited to dealing with the tags in HTML, so it’s a relatively simple stylesheet. It won’t handle all random HTML tags, but by having templates for the most common tags, and throwing away the others, we will get a close approximation of a text version of the HTML field value.
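For readers who want to see the shape of such a stylesheet, here is a minimal sketch. It does not use psoTransform and is not the stylesheet from this thread: it runs a made-up stylesheet through the standard JAXP API (javax.xml.transform) so it can be tried standalone. The class name HtmlToTextXsl, the method toText and the wrapper element "root" are all assumptions for illustration. It also assumes the field value is well-formed XHTML; named entities such as &nbsp; would need to be handled before parsing.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper (names are illustrative, not part of the PSO Toolkit).
class HtmlToTextXsl {

    // Templates for a few common tags. Any other element falls through to
    // XSLT's built-in rules, which drop the tag and keep its text content.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "  <xsl:output method='text'/>"
      + "  <xsl:template match='p'>"
      + "    <xsl:apply-templates/><xsl:text>&#10;&#10;</xsl:text>"
      + "  </xsl:template>"
      + "  <xsl:template match='br'><xsl:text>&#10;</xsl:text></xsl:template>"
      + "  <xsl:template match='li'>"
      + "    <xsl:text>* </xsl:text><xsl:apply-templates/><xsl:text>&#10;</xsl:text>"
      + "  </xsl:template>"
      + "  <xsl:template match='script|style'/>"
      + "</xsl:stylesheet>";

    static String toText(String xhtmlFragment) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            // Wrap the fragment so several top-level elements still parse as XML.
            String wrapped = "<root>" + xhtmlFragment + "</root>";
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(wrapped)),
                                  new StreamResult(out));
            return out.toString().trim();
        } catch (TransformerException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalStateException("Could not transform field value", e);
        }
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p><ul><li>one</li><li>two</li></ul>";
        System.out.println(toText(html));
    }
}
```

The templates are the part that would carry over to a psoTransform setup; adding support for another tag means adding one more xsl:template rather than another regex.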
Thanks for the suggestion, Creig!