Archive - Originally posted on "The Horse's Mouth" - 2009-06-14 11:18:29 - Graham Ellis
Are you writing or maintaining a web-based application that uses forms? If so, you had better be aware of some of the nasty characters that are around!
The < character, when echoed back 'unchallenged' from a user's input, may form the start of a tag. In a relatively benign case, a user who enters <em> at the start of his name will have his name emphasised back to him ... and to anyone else to whom that data is echoed, unless your application cleans it up.
The " character too can cause problems when echoed - if it gets written into a tag that's already got an attribute that's quoted, you can get some odd results. A user who enters 44" type="password into an unchallenged box that's echoed may be able to make the next form come up with the field he is entering using blobs rather than the actual characters typed in the box.
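As a rough illustration of cleaning up both the < and the " cases before echoing, here is a minimal Python sketch using the standard library's html.escape. The function name webify is borrowed from the Perl sub mentioned later in this post, but this body is only an assumption about what such a cleaner might do, and the form field name "size" is made up for the example.

```python
import html

def webify(text):
    """Escape text so it is safe to echo into HTML content or a quoted attribute."""
    # quote=True (the default) converts " and ' as well as & < and >
    return html.escape(text, quote=True)

name = "<em>Graham"            # the relatively benign tag case
size = '44" type="password'    # the attribute-breaking case

# Both values now come back as visible text rather than as markup
print("<p>Hello, " + webify(name) + "</p>")
print('<input name="size" value="' + webify(size) + '">')
```

With the escaping in place, the browser shows the characters the user typed instead of acting on them.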
The ' character can be a snare too - if your application stores the entry uncleaned in a database, then a user who follows the quote with appropriate code (I am not giving an example here!) can do severe damage.
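One common way to keep a stray ' from ever reaching the SQL text is to pass the value as a bound parameter rather than escaping it by hand. This is a different technique from the slash-adding functions, shown here only as a hedged sketch with Python's built-in sqlite3 module; the table name visitors and the helper store_name are invented for the example.

```python
import sqlite3

# Throwaway in-memory database, purely for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visitors (name TEXT)")

def store_name(connection, name):
    # The ? placeholder sends the value separately from the SQL statement,
    # so a quote inside name cannot end the string and start new SQL
    connection.execute("INSERT INTO visitors (name) VALUES (?)", (name,))

store_name(conn, "O'Brien")
stored = conn.execute("SELECT name FROM visitors").fetchall()
print(stored)
```

The name goes in and comes back out with its quote intact, and nothing the user types is ever interpreted as part of the statement.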
And those are just three examples of special characters that can cause problems if they are not carefully considered; others include ` . + \ & % and even the humble space. And if you are unwise enough to treat a user's input as a regular expression, you're opening the way for the user to start performing all sorts of nasties with other characters too, such as * ? [ ] | ( and ) (and this list is not - and is not intended to be - complete!).
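If user input really must end up inside a pattern, the regular expression characters can be neutralised first. A minimal Python sketch, assuming all you want is a literal "does this text appear" search; the helper name contains_literal is hypothetical.

```python
import re

def contains_literal(needle, haystack):
    # re.escape puts a backslash before regex metacharacters such as * ? [ ] | ( )
    # so the user's text is matched character for character
    return re.search(re.escape(needle), haystack) is not None

print(contains_literal("a*b", "price is a*b"))  # literal a*b is present
print(contains_literal("a*b", "aaab"))          # an unescaped a*b would have matched here
print(contains_literal("(", "f(x)"))            # an unescaped ( would raise re.error
```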
Have I frightened you so much that you never want to provide a user input box again? I hope not, because there are robust and easy solutions!
I find it helpful to draw diagrams to show how the variables flow through my code and are processed, labelling each of the legs with the function / code necessary to clean up and close loopholes. The variable conditions ("from web", "in memory", "as part of XML string", "in database" and "sent back to web") will be the same no matter what language you're using. The labels on the flow lines will vary, depending on the functions in the language and how much work the web / database interfaces in the language do for you, and how much is left up to you.
Here is the diagram for PHP; you'll typically use "stripslashes" to bring a string into memory, with most of the rest of the work done by PHP. "addslashes" or "mysql_real_escape_string" converts the data for database storage, and "htmlspecialchars" gets it ready for sending back to the web.
For Perl, you can use a module like CGI.pm, or you can roll your own. Personally, I have a sub that I call collectform that turns up via a use in most of my apps, and another called webify that cleans for output. They need to handle things like hex codes (%2B) and + characters, which PHP handles silently for you (one of the differences between the ethos of the languages - Perl being general purpose, whereas PHP is written by a web programmer, for web programmers).
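To show what that decoding step involves, here is a small Python sketch built on the standard urllib.parse module. The function collect_form is a hypothetical stand-in (it is not the Perl collectform sub itself), and the sample query string is made up.

```python
from urllib.parse import parse_qs

def collect_form(query_string):
    """Turn a raw form-encoded string into a dict of decoded field values."""
    # parse_qs splits on & and =, turns + into a space and decodes %XX hex codes;
    # it returns a list per field, so keep just the first value of each
    return {key: values[0] for key, values in parse_qs(query_string).items()}

fields = collect_form("name=Graham+Ellis&sum=1%2B1")
print(fields["name"])  # the + has become a space
print(fields["sum"])   # %2B has become a real + character
```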
With Python, the cgi module provides cgi.FieldStorage and cgi.escape, which add, in single calls, the necessary converters to the language. There's an example in our source code library here (and further examples linked from that page too!).
If you're using Tcl as your server side scripting language, we have a sample of source code that tidies up nasty characters here. And if you're a Lua programmer, then we have an example here.