Automated W3C HTML Validation

The free (thanks!!!) HTML validation service http://validator.w3.org/ from the W3C will validate a page from a web site, a single file, or a block of arbitrary HTML for compliance with whatever HTML standard is stated in the DOCTYPE line preceding the HTML page. Unfortunately, it won't do what I really want and need -- validate the image of a website on my hard drive BEFORE I post the files to the Internet. Nor does such a capability seem to be on their todo list. You'd think that an automated technique for persuading W3C to check a local image would be obvious or readily available on the Web. And maybe it is. But I can't see an obvious way to do it and couldn't find one via Google.

I had no choice but to roll my own.

Revised JUNE 2009. Sometime this Spring, the W3C changed their site and the layout of the replies that I was screen scraping to get the validation results.

So, I now have a Version 1.0 W3CVAL.PY script that can either be imported into a Python script or run standalone from a bash (Unix shell) command line. If it is given a file name, it will send the file to the W3C web site for validation. If it is given a directory name, it will recursively walk the directory/directories and send files to the W3C site unless and until an error is detected in one of the files. The script returns a non-zero value if a problem is encountered. Default behavior is to print messages indicating what is going on and to retry up to ten times if the W3C reports that its internal validator script is not available -- which it sometimes does. W3CVAL --help should produce a listing of the options.
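
The file-or-directory behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual W3CVAL.py code: the names find_html_files, validate_tree, and validate_one, the list of HTML suffixes, and the three status strings are all assumptions made for the example. The real work of talking to the W3C is left to whatever validate_one callable is passed in.

```python
import os
import time

HTML_SUFFIXES = (".html", ".htm")  # assumption: which files count as HTML


def find_html_files(path):
    """Yield the path itself if it is a file, else HTML files found by a recursive walk."""
    if os.path.isfile(path):
        yield path
        return
    for dirpath, dirnames, filenames in os.walk(path):
        dirnames.sort()  # visit subdirectories in a predictable order
        for name in sorted(filenames):
            if name.lower().endswith(HTML_SUFFIXES):
                yield os.path.join(dirpath, name)


def validate_tree(path, validate_one, max_tries=10, delay=0.0):
    """Validate files until one fails. Return 0 if all pass, 1 otherwise.

    validate_one(filename) is a hypothetical callable that returns
    "valid", "invalid", or "unavailable" (validator temporarily down).
    """
    for filename in find_html_files(path):
        status = "unavailable"
        for _attempt in range(max_tries):
            status = validate_one(filename)
            if status != "unavailable":
                break
            time.sleep(delay)  # wait before asking the validator again
        print(filename, status)
        if status != "valid":
            return 1  # stop at the first problem, like a failing shell command
    return 0
```

A caller would pass the result to sys.exit() so that the shell sees a non-zero value when a file fails or the validator never becomes available.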

Update DEC 2009. I am posting a Version 1.01 of W3CVAL.py. It has largely been tested. The man file has been verified. The code and man files are in file SCRIPTS/W3CVAL.tgz

There are other HTML validators; see the Web Design Group list of validators. My script could possibly be edited to use one of the other validators if they are more convenient/capable than W3C.

These are all free software. They work for me and might work for you. Then again they might not. Testing is largely completed. License? I dunno -- how about GPL2? Commercial use? Sure ... fine ... OK. (What commercial use could this stuff possibly have?). Warning. Code is included that I did not create. Since it is on the web in numerous places and has neither Copyright nor Licensing declaration that I can find, I assume it is public domain. Use at your own risk.


NOTES and Historical

This section is included to help anyone who has to dig into why W3CVAL is the way it is -- in order to fix it; adapt it to their own situation; or roll their own solution.

An HTTP GET call (what your web browser does when you tell it to go to http://validator.w3.org/) to the W3C validator returns a web page that has three user entry boxes: one for a URI, one for a file to upload, and one for directly entered HTML text.

We can eliminate specifying a URI unless we have an externally accessible HTTP server (W3C, not surprisingly, rejects a URI with a FILE://... scheme). So, we will need to use either the file validation or text validation service. The former will probably be better if we use a shell script to do the validation. The latter might be easier, but will probably need an interpreted language script to deal with getting the text to be validated into an appropriate HTTP POST request. It wasn't entirely clear to me which approach would be better. Initially, I thought that validating blocks of HTML might save me from having to code a special purpose web server or client. Or at least let me use a simpler one.


Saving the web page from W3C to a local file we find that the relevant HTML is

...
<h2>Validate Your Markup</h2>
  <fieldset class="front" id="validate-by-uri">
    <legend>Validate by URL</legend>
    <form method="get" action="check">
      <p>
        <label title="Address of page to Validate" for="uri">Address:
          <input id="uri" name="uri" size="40" /></label>
        <label title="Submit URL for validation">
          <input type="submit" value="Check" />
        </label>
    </p>
    </form>
    <p>
      Enter the <abbr title="Uniform Resource Locator">URL</abbr> of the page
      you want to check. Advanced options are available from the 
      Extended Interface to the W3C Markup Validation Service
  </fieldset>
<br />
  <fieldset class="front" id="validate-by-upload">
    <legend>Validate by File Upload</legend>
    <form method="post" enctype="multipart/form-data" action="check">
      <p>
        <label title="Choose a Local File to Upload and Validate" 
          for="uploaded_file">Local File:
          <input type="file" id="uploaded_file" name="uploaded_file"
            size="30" />
        </label>
        <label title="Submit file for validation">
          <input type="submit" value="Check" />
        </label>
      </p>
    </form>
      <p>
        Select the file you want to upload and check. Advanced options are
        available from the <a
          title="File Upload Interface to the W3C Markup Validation Service"
          Extended File Upload Interface.
      </p>
      <p><strong>Note</strong>: file upload may not work with Internet
        Explorer on some versions of Windows XP Service Pack 2, see our
        <a href="http://www.w3.org/QA/2005/01/Validator-IE_WinXP_SP2">
        information page</a> 
        on the W3C QA Website.
      </p>
  </fieldset>
<br />
  <fieldset class="front" id="validate-by-input">
  <legend>Validate by Direct Input</legend>
  <form method="post" enctype="multipart/form-data" action="check">
    <p>
      Input the markup you would like to validate in the text area below:
    </p>
    <p>
      <label title="Paste a complete (HTML) Document here" for="fragment">
        <textarea cols="75" rows="12" name="fragment" id="fragment"></textarea>
      </label>
      <br />
      <label title="Submit markup for validation">
        <input type="submit" value="Check" />
      </label>
    </p>
  </form>
  ...

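The key to this is the form lines

<form method="get" action="check">
<form method="post" enctype="multipart/form-data" action="check">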

These determine what your browser passes to the W3C site when you click on Check. In the case of URIs, the URI you have typed in is sent to validator.w3.org using a very simple method called GET. The URI is passed in the request line along with the word GET. GET is supported by all browsers and by utilities like wget. In the case of validation of files or direct input, a different method called POST is used. That's not a huge problem, but POST is told to use a horrendous format called multipart/form-data. Many browsers do not support multipart/form-data. Neither does the wget utility (at least I couldn't get it to work if it does). The somewhat similar curl utility does support multipart/form-data, but the user has to format the data.
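
To show what the multipart/form-data format actually looks like, here is a minimal sketch that builds the body of a one-file upload by hand. The function name encode_multipart_file and the text/html content type are assumptions made for the example. The field name uploaded_file is taken from the form HTML above. Nothing is sent anywhere; the function only returns the header value and the bytes.

```python
import uuid


def encode_multipart_file(field_name, filename, content, content_type="text/html"):
    """Return (Content-Type header value, body bytes) for a one-file upload."""
    boundary = uuid.uuid4().hex  # random marker, unlikely to occur in the file
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"  # blank line separates the part headers from the file content
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")  # closing boundary
    return f"multipart/form-data; boundary={boundary}", head + content + tail


header_value, body = encode_multipart_file("uploaded_file", "index.html", b"<p>hi</p>")
print(header_value)
print(body.decode("utf-8"))
```

The boundary string has to appear both in the Content-Type request header and around each part of the body, which is most of what makes the format awkward to produce from a shell script.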

The following alternatives seem worth considering:


Working Solution

OK, how about we (meaning me) create a Python script that implements the FTP and validate solution? Then I'll replace it later with a Python script that does the same thing using curl.


The Python script is called valfile.py. The call (for now) is python file. Files are posted via FTP to donaldkenney.110mb.com/temp.htm, and the output is sent to val.txt in the current directory.
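
The FTP-then-validate idea can be sketched as below. This is an illustration under assumptions, not the contents of valfile.py: the function names, the FTP host name, and the login details are hypothetical, and the check address is simply the form's action="check" resolved against http://validator.w3.org/. The uri parameter name comes from the "Validate by URL" form.

```python
from ftplib import FTP
from urllib.parse import urlencode
from urllib.request import urlopen

VALIDATOR_CHECK = "http://validator.w3.org/check"  # form action "check" on the validator site


def upload_via_ftp(local_path, host, user, password, remote_name="temp.htm"):
    """Copy one local file to the web host over FTP (host and login are assumptions)."""
    with FTP(host) as ftp, open(local_path, "rb") as fh:
        ftp.login(user, password)
        ftp.storbinary(f"STOR {remote_name}", fh)


def validation_url(page_url):
    """Build the GET URL that the 'Validate by URL' form would request."""
    return VALIDATOR_CHECK + "?" + urlencode({"uri": page_url})


def save_report(page_url, out_path="val.txt"):
    """Fetch the validator's reply for page_url and write it to out_path."""
    with urlopen(validation_url(page_url)) as response, open(out_path, "wb") as out:
        out.write(response.read())


print(validation_url("http://donaldkenney.110mb.com/temp.htm"))
```

Only validation_url runs without network access; upload_via_ftp and save_report need a reachable FTP account and the validator itself.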

The Real World Intervenes

The FTP part was no big deal -- when FTP worked, which wasn't always. The problem was that I could not, for the life of me, persuade the GET call to work using wget. As far as I could tell, I sent the same GET message to the validator that my browsers did. But the validation that worked with the browsers got me a simple 200 (OK) response via wget -- but no validation page. I tried the netcat utility as well, with about the same results.

Back to the Drawing Board

By this time, I had come across a set of Python scripts that generate multipart/form-data format. Looking through one of these, I realized that not only did it create the multipart/form-data POST request, it contained code to send it to a website and return a result. I cobbled together a test and sure enough, it returned a web page from the W3C validator. Problem solved, pretty much.
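
For readers who want to try the same thing with only the standard library, the sending half might look something like the sketch below. It is not the script I found. The function names are assumptions, and the body is expected to be already encoded as multipart/form-data with the matching boundary named in content_type.

```python
import urllib.request


def build_upload_request(url, body, content_type):
    """Wrap an already encoded multipart/form-data body in a POST request."""
    return urllib.request.Request(
        url,
        data=body,  # bytes of the encoded form
        headers={"Content-Type": content_type},  # must name the same boundary as the body
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return (HTTP status, reply page as text)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

Calling send(build_upload_request("http://validator.w3.org/check", body, content_type)) would return the validator's reply page, which then has to be examined to find out whether the file passed.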

Copyright 2007,2009 Donald Kenney (Donald.Kenney@GMail.com). Permission is hereby granted to use any materials on this page under the V2.5 Creative Commons License