Special Character Publishing

We have some content that was copied and pasted from Microsoft Word. When previewing the pages, the content looks fine. However, when the items are published, the special characters are not rendered correctly.

Example:
Preview Output
world’s

Published Output
world’s

Is there a way within the Publisher to ensure the correct encoding, or do we have to go into each item and remove the special characters?

Thanks

What character set are you publishing in (that is, what encoding is selected in the template)? Also, when you examine the published page, what encoding does the web browser think it is?

The character in question (a “smart quote”) doesn’t exist in ISO-8859-1, only in CP-1252 and Unicode.

Dave
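
To make the point about character sets concrete, here is a minimal Java sketch that checks whether the right single quote (U+2019) can be encoded in each charset and shows the bytes it becomes. The class and method names are made up for illustration and have nothing to do with Rhythmyx itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SmartQuoteEncodings {
    // True if every character of the text has a mapping in the charset.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Space-separated hex dump of the text's bytes in the given charset.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String quote = "\u2019"; // RIGHT SINGLE QUOTATION MARK, Word's "smart" apostrophe
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println(canEncode(quote, StandardCharsets.ISO_8859_1)); // false
        System.out.println(canEncode(quote, cp1252));                      // true
        System.out.println(hex(quote, cp1252));                            // 92
        System.out.println(hex(quote, StandardCharsets.UTF_8));            // E2 80 99
    }
}
```

So the character is one byte (0x92) in CP-1252, three bytes in UTF-8, and has no representation at all in ISO-8859-1.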

Thanks for the reply; we were able to resolve this issue. The template character set was “UTF-8”, but within the head of our global templates (HTML source) we had the following:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

Now that we’ve changed the Content-Type to utf-8, everything appears to be fine.

Thanks
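
For anyone hitting the same symptom, the mismatch described above can be reproduced in a few lines of Java. This is only a sketch with made-up names: it encodes the text as UTF-8 (what the template produced) and then decodes the bytes with the charset the page declared. It assumes the browser treats the "iso-8859-1" label as windows-1252, which is what the WHATWG Encoding Standard specifies.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatch {
    // Encode as UTF-8, then decode with whatever charset the page claims to use.
    static String misread(String text, Charset declared) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, declared);
    }

    public static void main(String[] args) {
        String original = "world\u2019s"; // contains U+2019, the smart apostrophe
        Charset cp1252 = Charset.forName("windows-1252");

        // The three UTF-8 bytes E2 80 99 are read as three separate characters:
        // U+00E2 (a with circumflex), U+20AC (euro sign), U+2122 (trade mark sign).
        String garbled = misread(original, cp1252);
        System.out.println(garbled.equals("world\u00E2\u20AC\u2122s")); // true

        // When the declared charset matches the real one, nothing is damaged.
        System.out.println(misread(original, StandardCharsets.UTF_8).equals(original)); // true
    }
}
```

The bytes on disk were correct all along; only the label in the meta tag was wrong, which is why changing the Content-Type fixed it without touching the content.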

dbenua, do you have any more information on this encoding question? I am interested in the answer as well. Further, when we take characters like this and publish to Oracle in UTF-8, some characters are lost in translation. Are there any solutions for getting Word smart characters into Oracle as UTF-8?

With regards to database publishing, we have found a bug (RX-12613) scheduled to be fixed in the next release that may be related to your problem.

If you are using database publishing to write a snippet’s HTML to a CLOB field in the database, the snippet’s data is converted to a string for writing to the database using the system’s default encoding rather than UTF-8.

Are you publishing to CLOB fields?
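
As a rough illustration of why the choice of encoding matters here (this is not the Rhythmyx publishing code, and the names are hypothetical), the Java sketch below round-trips a string through bytes. Calls such as `getBytes()` or `new String(bytes)` with no charset argument use `Charset.defaultCharset()`, so the result depends on the machine; passing the charset explicitly removes that dependency.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ClobEncodingSketch {
    // Convert text to bytes and back using one charset, as a conversion layer might.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        String snippet = "world\u2019s"; // contains U+2019, the smart apostrophe

        // The charset that no-argument conversions would pick up on this machine.
        System.out.println(Charset.defaultCharset());

        // A charset without U+2019 silently substitutes '?' for it.
        System.out.println(roundTrip(snippet, StandardCharsets.ISO_8859_1)); // world?s

        // An explicit UTF-8 conversion keeps the character intact.
        System.out.println(roundTrip(snippet, StandardCharsets.UTF_8).equals(snippet)); // true
    }
}
```

If the default encoding on the server cannot represent the smart characters, this kind of silent substitution would be one way for them to get "lost in translation" before they ever reach the CLOB.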

Yes, we are publishing to a CLOB.

Tom,

It’s highly likely that this is the same issue that Jay mentioned above. It’s not currently fixed, but is scheduled for the next release. Talk to tech support for an update on status.

Dave

There are several issues with MS Word and Rhythmyx, mostly to do with Microsoft’s idea of “smart” everything, i.e. smart quotes, smart tags, etc.
We advise our users to ‘paste as text’ from Word, but even then some smart tag coding remains and the Content Editor won’t save, because HTML Tidy throws namespace errors. Given the nature of our web content, having MS Word wrap “Darwin” in smart place and city tags is anything but helpful, and it does not endear Rhythmyx to the end user even though it is not an Rx fault as such.

While I accept that the source of the problem is Microsoft, is there something that Rhythmyx can do to make life easier, like making the resulting error messages more meaningful to users, or even stripping the smart tags’ embedded code from the field? As so many users copy their content from Word, and these ‘smart’ tags, quotes, etc. are switched on by default, is there anything we can do to prevent the problems from happening in the first place?

Cara,

The new Ephox version now has a way to add a “paste filter” (written in Java) that can be used to remove extra stuff. (This was always possible, but it required some “deep programming” before the latest version.)

I know we’ve done this for at least one customer, but we can look at making something more generic. I’ll ask around and see who has done what. (I’m currently on my way to the UK, so it might be a few days.)

Since this is standard Ephox functionality, perhaps you can ask about paste filters for Word in their user forums, as I cannot imagine that this problem is unique to us: other Ephox users must paste from Word, and they have to deal with the same garbage that we do.

Dave
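
To give an idea of what such a filter might do, here is a small Java sketch of the cleaning step only. It does not use the Ephox paste filter interface (check their documentation for that); the class and method names are hypothetical, and the regular expression is a simplification that assumes attribute values never contain a ">" character.

```java
import java.util.regex.Pattern;

class WordPasteCleaner {
    // Matches Word's <?xml:namespace ...> declarations and any opening or closing
    // tag whose name has a namespace prefix, e.g. <st1:City w:st="on"> or </o:p>.
    private static final Pattern WORD_MARKUP = Pattern.compile(
            "<\\?xml:namespace[^>]*>|</?[A-Za-z][A-Za-z0-9]*:[A-Za-z][A-Za-z0-9]*[^>]*>");

    // Remove the namespaced tags but keep the text they wrap.
    static String stripSmartTags(String html) {
        return WORD_MARKUP.matcher(html).replaceAll("");
    }

    // Optional step: swap curly quotes for their plain ASCII equivalents.
    static String asciiQuotes(String text) {
        return text
                .replace('\u2018', '\'')
                .replace('\u2019', '\'')
                .replace('\u201C', '"')
                .replace('\u201D', '"');
    }

    public static void main(String[] args) {
        String pasted = "<p><st1:City w:st=\"on\"><st1:place w:st=\"on\">Darwin"
                + "</st1:place></st1:City> is the world\u2019s best city</p>";
        // Prints: <p>Darwin is the world's best city</p>
        System.out.println(asciiQuotes(stripSmartTags(pasted)));
    }
}
```

Replacing the quotes is only needed if the output encoding cannot carry them; with UTF-8 set consistently, stripping the smart tags alone should be enough to keep HTML Tidy from complaining about unknown namespaces.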

Thank you David. I’ll check out the Ephox forum and post any useful responses here for info.