How to html-encode a field's chars, without double-encoding the html tags

april · December 24, 2008, 12:35pm

We have a field with a mix of Korean characters and html tags. In the Rhythmyx tables, the data is stored exactly the way we need to output it: the Korean characters are html-encoded. But Rhythmyx seems to re-encode the characters when it extracts the field from the database.

If we use the codec.encodeForXml method to re-html-encode the Korean characters, then the html tags get encoded. So that doesn’t help us.

Any suggestions?

paulhoward · December 27, 2008, 4:41pm

When you say the chars are ‘html-encoded’ in the db, do you mean as numeric entities? When you say that Rx ‘re-encodes’ them, exactly what form do they take?

april · December 29, 2008, 9:27am

Yes, they are numerically encoded when they are stored in the database. I’m not sure what the encoding is, when they are retrieved by displayfield, but probably UTF-8, since that is what we have chosen for the output when we create the template.

I also should have mentioned that we are using the Edit Live control for this field.

april · December 30, 2008, 10:14am

To give a better idea of what I am talking about, here is an attachment with the text as it is in the database, with the Korean characters numerically encoded. This is how we would like to display it.

And also another file showing how the characters get output by Rhythmyx, in some other kind of encoding.

paulhoward · December 30, 2008, 11:46am

The transformation you are seeing is a side-effect of inline-link processing. To process for inline links, the content is parsed as xml. The numeric entities are converted to their actual character encoding value when this is done and they are not converted back when the xml is serialized. You would have to write your own version of ‘escapeForXml’ that only converted the characters you desired converted.
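
A custom escaping routine of the kind described above could look roughly like the following. This is only a sketch in plain Java: the class and method names (`EntityEncoder`, `encodeNonAscii`) are made up for illustration and are not part of Rhythmyx, and wiring it in as a JEXL/Velocity extension is not shown. It converts every non-ASCII character to a decimal numeric entity and leaves ASCII, including the `<` and `>` of the html tags, untouched.

```java
// Hypothetical helper, not a Rhythmyx API: re-encodes only non-ASCII
// characters as decimal numeric entities and leaves markup alone.
final class EntityEncoder {

    static String encodeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Work on code points so surrogate pairs become one entity.
            int codePoint = text.codePointAt(i);
            if (codePoint < 128) {
                // ASCII (tags, quotes, ampersands) passes through unchanged.
                out.append((char) codePoint);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "\uD55C\uAD6D" is the two Korean characters U+D55C and U+AD6D.
        System.out.println(encodeNonAscii("<p>\uD55C\uAD6D</p>"));
        // prints <p>&#54620;&#44397;</p>
    }
}
```

Because the tags are never touched, this avoids the double-encoding problem from the original question, but it also means it does nothing about stray `&` or `<` characters in the text itself.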

If you are outputting the page as UTF, why do you want to use entities? The characters can be properly rendered in any UTF encoding scheme.

dbenua · December 30, 2008, 11:49am

These characters look like valid UTF-8 characters to me.

Rhythmyx will always output characters in the “most compact” way that is legal for that output encoding. If you specify UTF-8 or UTF-16, you’ll get these types of character.

If you specify ISO-8859-1, you should get the numeric entities.

Why do you need the Korean characters to be entities? The characters will output correctly on modern computers (some old machines and browsers don’t support Unicode properly, but this is quite rare nowadays).

Is there something strange about your delivery system?

Dave

april · December 30, 2008, 2:02pm

We are passing the file to a Siebel app that can’t seem to handle the Korean characters as UTF-8. And it won’t take UTF-16. But it does OK with the numeric encoding.

I just tried out your suggestion. I changed the template, in the General tab, to use character set ISO-8859-1 and now the characters come out as ‘??’

The thing I am displaying is a node property, so I am using node.getProperty("xyz").String
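
One possible explanation for the ‘??’ output, shown here as a plain Java sketch: if the string is converted to ISO-8859-1 bytes by a plain charset encoder rather than by an XML serializer, characters that do not exist in that character set are replaced with `?` instead of being turned into numeric entities. Whether this is what actually happens inside the template is an assumption; the class name `Latin1Demo` is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: shows what a plain charset conversion does to
// characters that ISO-8859-1 cannot represent.
final class Latin1Demo {

    static String roundTrip(String text) {
        // String.getBytes(Charset) substitutes the charset's replacement
        // byte ('?' for ISO-8859-1) for every unmappable character.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\uD55C\uAD6D" is the two Korean characters U+D55C and U+AD6D.
        System.out.println(roundTrip("\uD55C\uAD6D")); // prints ??
    }
}
```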

april · December 30, 2008, 2:05pm

In the meantime, I have put together a SQL query in velocity to get the string directly from the database.

That’s working, except that I don’t know how to get the correct revision. Currently, I am retrieving the latest revision. But what if they are editing a new revision and it publishes?

Is there a way to get the correct revision from $sys.item ?

april · December 30, 2008, 2:26pm

[QUOTE=april;5794]In the meantime, I have put together a SQL query in velocity to get the string directly from the database.
…
Currently, I am retrieving the latest revision. But what if they are editing a new revision and it publishes?

[/QUOTE]

Actually, I just tested it out, and it seems to be publishing and previewing the correct revisions. I wonder how that works. All I’m doing is selecting by the max revisionid and contentid. I’m not explicitly selecting by the live revision or the currently edited revision.

april · December 31, 2008, 9:13am

I figured out how to get the revisionid of the item:
$sys.assemblyItem.getPropertyValue("sys_revision","1")

I still don’t understand, though, why my sql seemed to magically show me the correct revision, without my explicitly asking for it. Perhaps… I never will…

paulhoward · December 31, 2008, 12:46pm

Retrieving directly from the db is very bad practice. Since you are going directly to the db, you must not have any inline links or namespace processing. Therefore, just turn those off and you will get your desired output w/o having to query the db directly. These are disabled in the Control Properties dialog for the control on the field of interest.

april · January 5, 2009, 10:22am

I understand. Getting the content directly was a last resort.

Your suggestion works. So I will use that instead of my db query. The only drawback, as you said, is that we can’t use the inline links (or namespace processing, whatever that is). I guess if we need to use that, then I will have to write something to re-numerically-encode the characters.

dbenua · January 5, 2009, 11:59am

In the past, I’ve used the built-in XML parser and XSLT parser to “post process” content generated by Velocity. One of the things that this allows is changing the character set.

If you output the content in ISO-8859-1 (for example) instead of UTF-8, all of the non-ISO characters (including the Korean characters) will be converted to numeric entities.

Of course, the content must be “well-formed” XML for this to work, but this is generally true of rich-text (Ephox) fields that have inline links in them: they must be well formed or the inline links processing will fail.

The last time I did this, we had a custom publisher in place, and we did it as part of that publisher (in Java).

There is a JEXL function in the PSOToolkit that allows you to take a field and run it through an XSL Transform. If you use the identity transform and change the <xsl:output> tag to specify a character set, you might be able to do this on a field basis.

I have not tried this, but it might well work. If you try it and it does not, please let me know, and I’ll help you dig into it some more.
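
For reference, the same idea can be sketched outside Rhythmyx with the standard `javax.xml.transform` API. This is not the PSOToolkit function (its name and signature are not given here), and the class name `Latin1Serializer` is made up. It runs well-formed content through the JDK's default identity transformer with the output encoding set to ISO-8859-1, which is the programmatic equivalent of setting `encoding` on `<xsl:output>`. With the JDK's built-in serializer, characters outside that character set should come out as numeric entities.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustration only: identity transform that re-serializes XML as ISO-8859-1.
final class Latin1Serializer {

    static String reserialize(String wellFormedXml) {
        try {
            // newTransformer() with no stylesheet is the identity transform.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            // Serialize to bytes so the serializer applies the encoding itself.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            identity.transform(new StreamSource(new StringReader(wellFormedXml)),
                               new StreamResult(bytes));
            return new String(bytes.toByteArray(), StandardCharsets.ISO_8859_1);
        } catch (TransformerException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalStateException("Content could not be transformed", e);
        }
    }

    public static void main(String[] args) {
        // "\uD55C\uAD6D" is the two Korean characters U+D55C and U+AD6D.
        System.out.println(reserialize("<div><p>\uD55C\uAD6D</p></div>"));
    }
}
```

As noted above, the input has to be well-formed XML with a single root element, so a rich-text field body may need to be wrapped in a container element first.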

Dave

april · January 5, 2009, 2:24pm

Thanks for the tip. I am going to look into this. But it turns out that our users don’t use the inline links, so for now, unchecking the ‘Allow inline links’ checkbox is a good-enough solution.