URL syntax (was: ComCast DNS hijacking)

Ben Scott dragonhawk at gmail.com
Tue Aug 25 17:12:12 EDT 2009


On Tue, Aug 25, 2009 at 1:36 PM, Michael ODonnell
<michael.odonnell at comcast.net> wrote:
> I definitely transmitted a literal ampersand in the URL in the
> original message ...

  That's what Gmail shows me, too, even with "Show original".  Gmail
can be a bit funky, but I think it's telling the truth in this case.

> (cut'n'pasted right out of Firefox's address bar)

  Be aware that for some browsers (including Firefox), what you see in
the "address bar" may not be the URL as the protocol processes it.
Firefox will show you the decoded result (complete with characters not
allowed), but send the encoded version, and if you copy to clipboard,
you get the encoded version.  This is generally what you want, but it
occasionally leads to confusion.  Since you clipboarded it anyway,
this shouldn't be one of those times.

> ... no MIME encoding or anything like that was involved.

  FWIW, the %XX notation (a percent sign followed by two hex digits)
is not something MIME knows about.  AFAIK, it was invented for the
WWW, and is usually just called "URL encoding"; RFC-3986 calls it
"percent-encoding".
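
  As a rough illustration of the decoded and encoded forms of the same
text, here is a minimal Python sketch using the standard urllib.parse
module.  The sample string is made up for the example.

```python
from urllib.parse import quote, unquote

# Hypothetical path segment containing spaces and an ampersand.
decoded = "fish & chips"

# quote() percent-encodes every character outside its "safe" set.
encoded = quote(decoded)
print(encoded)            # fish%20%26%20chips

# unquote() reverses the encoding.
print(unquote(encoded))   # fish & chips
```

  Note that quote() is deliberately conservative: it encodes the
ampersand whether or not the URL syntax strictly requires it.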

> Is it bad form to use literal ampersands in emailed URLs?

  I wanted to say that ampersands are reserved in URLs, but I just
checked, and official sources seem to say they are allowed.

  The original Mar 1994 URL specification calls ampersand "safe", and
does not place it in the reserved character list.
<http://www.w3.org/Addressing/URL/url-spec.txt>

  RFC-1738 (Dec 1994) says ampersand *may* be reserved in some
schemes, but specifically allows it in the path of an HTTP URL.
<http://tools.ietf.org/html/rfc1738>

  RFC-3986 (Jan 2005) says paths consist of segments, segments
consist of pchars, pchars include sub-delims, and an ampersand is a
sub-delim.  It also states that the semantics of query parameters are
outside of the scope of the URI spec, beyond the use of question-mark
(?) to separate the query from the path.  It even specifically
mentions that the equals sign, as with field=value, is not part of
that spec.  <http://tools.ietf.org/html/rfc3986>
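
  The RFC-3986 chain above (segment -> pchar -> sub-delims) can be
sketched as a small Python check.  This is only an approximation
written from the grammar in the RFC; the function name
is_valid_segment is invented for the example.

```python
import re

# Character sets from the RFC 3986 grammar.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = "(?:[" + UNRESERVED + SUB_DELIMS + ":@]|%[0-9A-Fa-f]{2})"

# segment = *pchar
SEGMENT_RE = re.compile(PCHAR + "*")

def is_valid_segment(segment: str) -> bool:
    """Return True if the string matches the RFC 3986 'segment' rule."""
    return SEGMENT_RE.fullmatch(segment) is not None

print(is_valid_segment("a&b"))    # True: "&" is a sub-delim
print(is_valid_segment("a b"))    # False: a raw space is not a pchar
print(is_valid_segment("a%20b"))  # True: percent-encoded space
```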

  The ampersand/equals syntax used in HTML forms submitted via GET
appears to be defined by the HTML specification, as
"application/x-www-form-urlencoded".  It's not mentioned anywhere else
that I can find.
<http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4>
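
  For what it's worth, Python's standard library implements that form
encoding, which makes the ampersand/equals convention easy to see.  A
minimal sketch; the field names and values are invented for the
example.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form fields; one value contains a literal ampersand.
fields = {"q": "fish & chips", "lang": "en"}

# urlencode() joins name=value pairs with "&" and escapes any "&"
# that appears inside a value, so the two cannot be confused.
query = urlencode(fields)
print(query)            # q=fish+%26+chips&lang=en

# parse_qs() splits on "&" and "=" and decodes each value.
print(parse_qs(query))  # {'q': ['fish & chips'], 'lang': ['en']}
```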

  So all that suggests that an ampersand is perfectly legal in a URL.
So I guess the "failure" is in software at Dan's end.

  It may be worth noting that ampersand is reserved in
{SG,HT,X,XHT}ML, so you can't put a URL containing an ampersand in an
HTML document literally; you have to encode the ampersand as an *ML
character entity (&amp;).
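
  A short Python sketch of that escaping step, using the standard
html module.  The URL is a made-up example.

```python
import html

# Hypothetical URL with an ampersand between query parameters.
url = "http://example.com/search?q=fish&lang=en"

# html.escape() turns "&" into "&amp;" (and also escapes <, >, and
# quotes), so the result is safe to place in an attribute value.
escaped = html.escape(url)
link = '<a href="' + escaped + '">search</a>'
print(link)

# html.unescape() recovers the original URL.
print(html.unescape(escaped) == url)   # True
```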

  In general, I suggest avoiding everything but letters, numbers,
dashes, periods, and underscores in URL path components, for just this
sort of reason.
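
  If you want to enforce that conservative rule, a check along these
lines might do.  The function name is_conservative is invented for
the example, and the pattern covers ASCII letters and digits only.

```python
import re

# Letters, numbers, dashes, periods, and underscores only.
CONSERVATIVE_RE = re.compile(r"[A-Za-z0-9._-]+")

def is_conservative(component: str) -> bool:
    """Return True if a path component uses only the 'boring' characters."""
    return CONSERVATIVE_RE.fullmatch(component) is not None

print(is_conservative("my-file_v2.txt"))  # True
print(is_conservative("a&b"))             # False
```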

-- Ben

