James Cassell's Blog

Monday, March 03, 2008

Parallels Between Blogging and Programming

I have been programming quite a bit lately; my Computer Science 2 class assigns an involved programming problem each week. These take several hours each. When I work on a project, I write my code until I need to reference some other piece of code; then I change gears and write at least the interface for it (i.e., the prototype). Later, I go back and write the implementation of the function or class that I prototyped.

I find that much of the same happens when I am blogging. As I am writing, I think of other posts of mine that I would like to reference (i.e., link to). Many times, I find that I have never written such a post. While I haven't been able to link to non-existent posts, I have written down notes for myself to go back and write such posts in the future. My "about" page is one post that I made so that I would have it for future reference. My post about my Alienware notebook is one that I had tried to reference on many occasions, only to find that it didn't exist. Yesterday, I finally got around to writing the proverbial "implementation" of that which I had tried on many occasions to reference.

In writing code, there are many ways to achieve the same result while having the code look very different. One can write two statements that are logically identical but syntactically different. It is the same with the creative use of language; I can divide my thoughts with a semicolon, a period, or, in less formal writing (such as a blog), an em-dash. There are, of course, many other ways to vary the syntax but retain the same meaning. If one does not master this variation in the use of syntax, his writings end up being quite dull and boring, much like the first several years' worth of posts on this blog.
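As a small illustration of two logically identical but syntactically different statements, here is a sketch in Python. The function and parameter names are made up for the example.

```python
# Hypothetical example: both functions express the same condition.

def rejects_v1(is_member, has_paid):
    # "not (A and B)"
    return not (is_member and has_paid)

def rejects_v2(is_member, has_paid):
    # "(not A) or (not B)", the same logic by De Morgan's law
    return (not is_member) or (not has_paid)
```

Both versions return the same answer for every combination of inputs; only the syntax differs.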

Just as I am happy to continue learning programming syntaxes and constructs, I am interested in learning and practicing proper grammar in all areas of communication. This is the only area of the "humanities" that I have ever enjoyed; I have never liked the critical analysis of literature or other such things.


Friday, February 22, 2008

Validating Input

When writing code, always validate the input. This may seem like common sense -- and it is under most circumstances -- but in one particular case, it is not. In the case of a computer science class when the instructor guarantees proper input, it is very tempting to not validate the input and simply assume that it is correct. This will save you a little time on each assignment, but is not worth it in the long run. I found this out today. I had a computer science project due at midnight last night. I didn't have it done on time, so I had to waste my remaining late day on it. I spent nearly 24 hours trying to debug my program that should have been working.

When I was debugging, I kept seeing things that could go wrong with improper input, but remembered that I only had to deal with proper input. My program seemed very brittle; the difference between a segmentation fault and the program running fine (but exiting early) was the difference between a break; and a continue; statement. It was at this point that I happened to notice that one line of the input was causing the crash. The input was improper despite the instructor's assurances to the contrary.

I checked on-line to verify that I didn't mistakenly modify the file. Sure enough, the file on-line was correct, but the date on the on-line file was more recent than the date on my file. Apparently, the teacher noticed (or was informed of) the mistake, and updated the files on-line. What he did not do was send a general notice of the mistake and subsequent correction. Because of all this, I wasted nearly 24 hours of my time as well as a "late day" for turning in homework.

The moral of the story is that one should always check his input even if it has been guaranteed that it will be correct. The benefits of checking the input greatly outweigh the costs. Besides this, it is simply good practice, especially for any code that will be used in production. I had to deal with this when I was writing the contact form for my site; most of my time was spent writing the validation code to prevent any security problems.
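A minimal sketch of what checking "guaranteed" input can look like, written in Python. The input format here (one "name score" pair per line) and the function name are assumptions for illustration; the actual assignment format is not described in this post.

```python
def parse_scores(lines):
    """Parse hypothetical 'name score' lines, rejecting malformed ones."""
    records = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        # Report the offending line instead of crashing somewhere later.
        if len(fields) != 2:
            raise ValueError(
                f"line {number}: expected 2 fields, got {len(fields)}")
        name, raw_score = fields
        try:
            score = int(raw_score)
        except ValueError:
            raise ValueError(
                f"line {number}: score {raw_score!r} is not an integer"
            ) from None
        records.append((name, score))
    return records
```

With a check like this, a single bad line produces an error message that names the line, rather than a segmentation fault far from the cause.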


Sunday, May 13, 2007

Current Work on NHS Site

Over the past week, I have been working on a web-based management system for the NHS site. I am trying to make it as easy as possible to update and add pages to the site for whoever will maintain it when I am not there next year. To be completely honest, I don't know if anyone actually visits the site besides me; it has been very useful to me in that I can easily keep up with NHS activities.

I am going to store page data as XML. I was going to use SimpleXML to read back the data, but I don't have access to PHP 5. I am essentially building my own XML parser. It should not be too difficult; I plan to store everything opaquely by running it through PHP's htmlspecialchars() function before storing it in the XML file.
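The idea of storing escaped text can be sketched as follows. This is in Python rather than PHP, with html.escape() standing in for htmlspecialchars(); the element names (page, title, body) and both function names are assumptions, not the site's actual format.

```python
import html
import re

def write_page(title, body):
    """Serialize a page as a small XML document with escaped text."""
    return (
        "<page>"
        f"<title>{html.escape(title)}</title>"
        f"<body>{html.escape(body)}</body>"
        "</page>"
    )

def read_field(xml_text, name):
    """Read one field back with a regex.

    This only works because the stored text is escaped, so it can
    never contain a raw "<" that would confuse the pattern.
    """
    match = re.search(rf"<{name}>(.*?)</{name}>", xml_text, re.S)
    if match is None:
        raise ValueError(f"missing <{name}> element")
    return html.unescape(match.group(1))
```

Because every field is escaped before it is written, reading it back does not require a general-purpose XML parser.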

The first thing that I am working on (before implementing the code that writes the XML file) is validating user input. Since the site is served as "application/xhtml+xml" to browsers that support it, I need to make sure that the input is well formed, and properly nested to avoid a draconian XML error (these are good in that they force authors to produce well formed code). Checking for well-formedness was quite a challenge. I eventually came up with a slow, recursive function that checks the input. It isn't fool-proof, but is good enough for my purposes because I am encoding all HTML special characters to their corresponding entities for markup that I don't recognize. I only recognize a pre-defined list of markup elements.
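Encoding the markup that is not recognized might look roughly like this Python sketch. The list of allowed elements and the function name are hypothetical, and the sketch assumes the input is raw text that has not already been entity-encoded.

```python
import html
import re

# Hypothetical list of recognized elements.
ALLOWED = {"p", "em", "strong", "a", "ul", "li"}

# Matches an opening or closing tag and captures its element name.
TAG = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)[^<>]*>")

def escape_unrecognized(text):
    """Escape everything except tags whose names are in ALLOWED."""
    out = []
    pos = 0
    for match in TAG.finditer(text):
        # Plain text between tags is always escaped.
        out.append(html.escape(text[pos:match.start()], quote=False))
        if match.group(1).lower() in ALLOWED:
            out.append(match.group(0))  # keep recognized markup as-is
        else:
            out.append(html.escape(match.group(0), quote=False))
        pos = match.end()
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)
```

Only the recognized tags survive as markup, so the well-formedness check has to worry about just those elements.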

I do hope that this site that I have spent much time on is actually used and appreciated.

In case anyone is interested, here is the function that I came up with to check if the input is well formed:

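function validateInput($content) {
 // Strip comments so that markup inside them is ignored.
 $content = preg_replace('/<!--.*?-->/s', '', $content);
 // No "<" left means there is no markup to check.
 if (strpos($content, '<') === false)
  return true;
 // Match each element as <name attr="value"...>content</name>.
 // The lazy (.*?) stops at the first closing tag with the same name.
 $fullElements = preg_match_all(
  '/<([a-z][a-z0-9]*)(\\s+[a-z][a-z0-9]*="[^"]*")*\\s*>(.*?)<\/\1>/is',
  $content, $matches, PREG_PATTERN_ORDER);
 // A "<" is present, but no complete element was found.
 if ($fullElements == 0)
  return false;
 // Recursively check the content of every matched element.
 foreach ($matches[3] as $value)
  if (!validateInput($value))
   return false;
 return true;
}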

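Edit 27 May 2007 23:59:38 EDT: this function incorrectly identifies proper markup such as "<div>text<div>text</div>text</div>" as invalid. The non-greedy match pairs the outer opening tag with the first closing tag of the same name, so nested elements of the same type are split in the wrong place. I am working on this problem.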
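One common way to handle nested elements of the same name is to track open tags on a stack instead of matching whole elements with a single regular expression. The following is a sketch of that approach in Python, not the fix that was eventually used on the site. The names TOKEN and is_well_nested are made up for the example, and accepting self-closing tags such as <br /> is an added assumption that the function above does not make.

```python
import re

# One token per match: a comment, a closing tag, an opening (or
# self-closing) tag, or a stray "<" that fits none of those.
TOKEN = re.compile(
    r'<!--.*?-->'
    r'|</([a-z][a-z0-9]*)\s*>'
    r'|<([a-z][a-z0-9]*)(?:\s+[a-z][a-z0-9]*="[^"]*")*\s*(/?)>'
    r'|<',
    re.S | re.I,
)

def is_well_nested(content):
    """Return True if every tag in content is closed in the right order."""
    open_tags = []  # stack of element names awaiting a closing tag
    for match in TOKEN.finditer(content):
        closing, opening, self_closing = match.groups()
        if match.group(0).startswith("<!--"):
            continue  # comments are ignored
        if closing is not None:
            # A closing tag must match the most recently opened element.
            if not open_tags or open_tags.pop() != closing:
                return False
        elif opening is not None:
            if not self_closing:
                open_tags.append(opening)
        else:
            return False  # a "<" that is not part of a complete tag
    return not open_tags
```

With this approach, "<div>text<div>text</div>text</div>" is accepted because each closing tag is compared only against the innermost open element, while "<b><i>x</b></i>" is still rejected.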


Monday, February 05, 2007

Regular Expression to Limit Lines to 76 Characters

The other day when I was at school, I typed a fairly long quasi blog post into the NHS contact form to continue to test it. When I got the e-mail at home later in the day, a line-break had been put into my post because it was such a long non-breaking string. It was broken in the middle of a word. I decided that I should automatically break the line after 76 characters. I wanted to use a regular expression. I searched Google, but could find nothing that helped. Today, I wrote one that will limit the line-length to 76 characters except when there is a long string that doesn't have any spaces in it, such as a URL. I'm posting the regex here in case anyone else would find it useful.

Find: (.{1,75}\s)(?!\n)
Replace: $1\n
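To show how the pattern behaves, here is a sketch that applies it with Python's re module (the contact form itself is written in PHP). The name limit_lines and the sample text are made up for the example, and Python writes the backreference as \1 rather than $1.

```python
import re

# The pattern from the post: up to 75 characters plus one whitespace
# character, not already followed by a newline.
LINE_BREAK = re.compile(r"(.{1,75}\s)(?!\n)")

def limit_lines(text):
    """Insert a newline after the last whitespace in each 76-character run."""
    return LINE_BREAK.sub("\\1\n", text)

message = " ".join(["word"] * 40)  # 199 characters, no newlines
wrapped = limit_lines(message)
longest = max(len(line) for line in wrapped.split("\n"))
print(longest <= 76)  # True
```

The whitespace before each inserted newline is kept, so wrapped lines end with a trailing space, and a run with no whitespace, such as a long URL, is left unbroken. Nothing in the pattern requires a line to be long, so a short string such as "hello world" is also broken after its last space.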
