Cleaning User Input with Regular Expressions

All

Our web content group would like end users to be able to enter only plain text into our CMS. Using the Ephox control helps with this a great deal, but not completely.

Rather than writing an Ephox plugin, I was thinking I could just clean the user input text with a regex (regular expression).

If the user text is stored in the database, I could develop a stored procedure to run my regex after an insert (or just hook in with JDBC, using a filesystem event that watches the log files).

We could also just run the regex on the filesystem after a publishing run. To accomplish this I would just add some marker tags to the page templates.

Each template would be modified to have a div tag class of ‘plain text’ so that the regex could use this to tidy up the user input.
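As a rough illustration of the post-publish idea, here is a minimal Java sketch that strips tags only inside the marker div. The class name PlainTextCleaner and the hyphenated class value "plain-text" are assumptions made for the example (a class value containing a space would be read by browsers as two classes), and the pattern assumes marker divs are never nested and contain no inner div. Regex-based HTML handling is fragile, so treat this as a starting point rather than a finished cleaner.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlainTextCleaner {
    // Captures the opening marker div, its body, and the closing tag.
    // Assumption: marker divs are not nested and contain no inner <div>.
    private static final Pattern MARKED_DIV = Pattern.compile(
        "(<div\\s+class=\"plain-text\"\\s*>)(.*?)(</div>)",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Matches any single tag, opening or closing.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Removes every tag found inside each marker div; the rest of the page is untouched. */
    public static String clean(String page) {
        Matcher m = MARKED_DIV.matcher(page);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = ANY_TAG.matcher(m.group(2)).replaceAll("");
            // quoteReplacement keeps '$' and '\' in the page text from being interpreted.
            m.appendReplacement(out,
                Matcher.quoteReplacement(m.group(1) + body + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Markup outside the marker div is left alone, so template markup survives while user-entered formatting inside the div is dropped.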

Has anyone else thought about doing this?

The table that contains the callout and body fragments:

http://forum.percussion.com/showthread.php?t=858

You can add a field input transformer to your content type that runs a regular expression. This is the recommended approach - you should not modify the database directly.

I just stumbled upon the validation mechanism that can be performed on the body field:

"To validate field values, you should add a validation check on each field of interest. This is done in the workbench. Choose the field of interest in the content type editor, then click on the 'Validation' button at the bottom of the editor."

I think we will just search for certain tags and tell the user they should only paste in plain text mode.
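A check along these lines could be sketched in plain Java as follows. The class HtmlTagCheck and the list of tags are hypothetical, chosen only to illustrate "search for certain tags"; it uses find() so that a tag anywhere in the body is detected, and it is not wired into the CMS validation framework.

```java
import java.util.regex.Pattern;

public class HtmlTagCheck {
    // Opening or closing tag for a handful of formatting elements.
    // The tag list is an assumption for illustration; adjust it to your content.
    private static final Pattern FORMATTING_TAG = Pattern.compile(
        "</?(font|span|b|i|u|strong|em)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns true if the body contains at least one of the listed tags. */
    public static boolean containsFormatting(String body) {
        return body != null && FORMATTING_TAG.matcher(body).find();
    }
}
```

The word boundary after the tag name keeps the single-letter alternatives from matching longer names, so text such as "a < b" or a tag like <body> is not flagged.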

Here is a nice link on developing a regex: http://haacked.com/archive/2004/10/2...matchhtml.aspx

Just a heads up, but if you want to validate something like the “body” field, I would put the validation on the global/shared field as opposed to local content types (i.e., so you only need to define the validation rule once as opposed to once for each content type). This would be similar to the validation on the sys_title (http://forum.percussion.com/showthread.php?t=331).

The difference would be that you’d want the validation on the “body”, which I believe will probably be a shared field for you (so edit the field via Content Design > Shared Fields as opposed to Content Design > Content Types > Content Type X > Fields).

Of course if you only want to validate the “body” on certain content types, then you would only want to do the validation on those content types…

Thanks for the tip. Our body field is not a shared field but it seems like it should be.

What are the implications of changing the body field to shared? Just that every content type will have a body field?

If so, then this seems okay except for the image content type. In that case, would the only drawback be that end users submitting a new image would see the body field as an input, even though the body field would not be used when assembling/publishing the image content type?

David,

Creating a body field as part of a shared field set means that it is available to be included in multiple Content Types, as opposed to being available to one Content Type if you create it locally. It does not mean that all Content Types will include the body field. A shared field is only included in a Content Type if you explicitly add it to that Content Type.

I think Jitendra’s point is that when applying a validation or a transformation to a field that will be shared (which is typically the case for a body field), it’s more efficient to define the validations and/or transformations in the shared definition configuration, rather than configuring them locally for each Content Type where you use the field. All Content Types that include the field will use the shared configurations. If you want different validation or transformation rules for a specific Content Type, you can define local configurations that will override the shared configurations in that specific Content Type.

RLJII

On the body field, I am trying to validate that the user did not submit any HTML that has the color attribute with this regex:

(?!color\b)\b\w+)

This should disallow the following HTML fragment:

color="blue" size="3" face="Times New Roman"

If this does not pass validation my failure message displays:
Detected HTML formatting in body…please paste using plain text.

I can’t seem to get this to work. Any ideas?
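Two things may be worth checking here. First, as posted, the pattern ends with an unmatched closing parenthesis, which java.util.regex rejects with a PatternSyntaxException. Second, the validator source quoted in this thread calls Pattern.matches, which succeeds only when the entire value matches, so a word-level pattern cannot pass a multi-word body even after the parenthesis is removed. The sketch below contrasts the word-level pattern with a hypothetical whole-input pattern; the class ColorAttributeCheck and the NO_COLOR_ATTR pattern are assumptions for illustration, not part of the product.

```java
import java.util.regex.Pattern;

public class ColorAttributeCheck {
    // Word-level pattern from the post, without the trailing ')'.
    public static final String WORD_PATTERN = "(?!color\\b)\\b\\w+";

    // Hypothetical whole-input pattern: the lookahead at the start fails if a
    // "color=" attribute appears anywhere; otherwise ".*" consumes the whole value.
    // (?i) ignores case, (?s) lets '.' cross line breaks.
    public static final String NO_COLOR_ATTR = "(?is)(?!.*\\bcolor\\s*=).*";

    /** Mirrors the full-match semantics of Pattern.matches(regex, value). */
    public static boolean passes(String regex, String value) {
        return Pattern.matches(regex, value);
    }
}
```

With these definitions, passes(WORD_PATTERN, "Hello world") is false because \w+ cannot span the space, while passes(NO_COLOR_ATTR, "Hello world") is true and passes(NO_COLOR_ATTR, "<font color=\"blue\">Hello</font>") is false. Note that this only looks for a color attribute; a color set through a style attribute would not be caught.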

=======================
PS: I got this sample regex from:

Excluding Matches With Regular Expressions (http://www.codinghorror.com/blog/archives/000425.html)

Here’s an interesting regex problem:

I seem to have stumbled upon a puzzle that evidently is not new, but for which no (simple) solution has yet been found. I am trying to find a way to exclude an entire word from a regular expression search. The regular expression should find and return everything EXCEPT the text string in the search expression.

But not all regex flavors support negative lookbehind. And those that do typically have severe restrictions on the lookbehind, e.g., it must be a simple fixed-length expression. To avoid incompatibility, we can restate our solution using negative lookahead:

(?!fox\b)\b\w+
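To see what this lookahead pattern does in Java, the following small sketch collects every match with find(). The class and method names are made up for the example. The pattern is meant for scanning: it returns each word other than "fox", so it does not by itself answer whether a string contains "fox".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExcludeWord {
    // At a word boundary, match a run of word characters unless that word is exactly "fox".
    private static final Pattern NOT_FOX = Pattern.compile("(?!fox\\b)\\b\\w+");

    /** Returns every word in the text except the whole word "fox". */
    public static List<String> wordsExceptFox(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = NOT_FOX.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }
}
```

For the input "the quick brown fox jumps" this yields the, quick, brown, jumps. A longer word such as "foxes" is still returned, because the \b inside the lookahead only excludes the exact word.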

Also, does the JEXL validation offer any more power with regard to making sure the body field does not contain a tag attribute string?

In looking at the source for this validation (below), I’m thinking I may be able to use this technique: http://www.velocityreviews.com/forums/t138352-how-to-exclude-a-string-using-regexp-pattern.html

public class PSValidateStringPattern implements IPSFieldValidator
{

    public Object processUdf(Object[] params, IPSRequestContext request) throws PSConversionException
    {
        String value, regex;
        PSExtensionParams ep = new PSExtensionParams(params);
        value = ep.getStringParam(0, null, false);
        regex = ep.getStringParam(1, null, true);

        if (value == null)
        {
            return false;
        }

        if (value == null || !Pattern.matches(regex,value))
        {
            return false;
        }

        return true;