I’ve been lost in the details of URL encoding a number of times. Each time I figure it out, move on, then promptly forget everything about it. This article is nowhere near a complete reference; it exists mainly to jog my memory.
For the rest of the article I’m going to say URI instead of URL, but I’m really thinking about URLs – http and https schemes in particular.
URLDecoder) that might pass a smoke test
application/x-www-form-urlencoded, but doesn’t have to be according to the URI spec.
A URI is made up of five components. So-called “reserved characters” are used to separate the components from one another. From RFC 3986:
    foo://example.com:8042/over/there?name=ferret#nose
    \_/   \______________/\_________/ \_________/ \__/
     |           |            |            |        |
  scheme     authority       path        query   fragment
The five components are scheme, authority, path, query, and fragment. Some
components can be broken into sub-components (my term). For example, the path
over/there has two sub-components, over and there, separated by the reserved
character / (at least for the http(s) scheme). Authority has host and port
sub-components, and uses the reserved character : to separate them. Query has
no sub-components. It’s just everything from the first ? to the first
following # or the end of the URI. Typically the data in query is HTML form
encoded.
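To make the component breakdown concrete, here is a minimal sketch that feeds the RFC 3986 example to java.net.URI and prints what it reports for each component. The class name UriComponents and the describe helper are my own, made up for illustration.

```java
import java.net.URI;

class UriComponents {
    // Lists the components java.net.URI reports for a URI string,
    // including the host and port sub-components of the authority.
    static String describe(String s) {
        URI u = URI.create(s);
        return "scheme=" + u.getScheme()
             + " authority=" + u.getAuthority()
             + " host=" + u.getHost()
             + " port=" + u.getPort()
             + " path=" + u.getPath()
             + " query=" + u.getQuery()
             + " fragment=" + u.getFragment();
    }

    public static void main(String[] args) {
        System.out.println(describe("foo://example.com:8042/over/there?name=ferret#nose"));
        // prints: scheme=foo authority=example.com:8042 host=example.com port=8042
        //         path=/over/there query=name=ferret fragment=nose  (all on one line)
    }
}
```

Note that the separators themselves (://, ?, #) are not part of any component; they only mark the boundaries.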
Why care about components and sub-components? Each sub-component needs to be percent-encoded using a component-specific set of reserved characters (e.g. path vs. query), then assembled with the correct URI syntax. In the standard JDK there are no foolproof shortcuts for doing this. You have to break your URL into its smallest components, percent-encode each part, then assemble with the appropriate syntax.
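As a rough sketch of the "encode each piece, then assemble" approach: the hypothetical helper below (UriBuilderSketch and its methods are names I made up, not a JDK API) takes the conservative route of percent-encoding everything outside the RFC 3986 unreserved set, rather than tracking a separate reserved set per component. That over-encodes a few characters that would be legal in a path or query, which should be harmless given how lenient parsers tend to be.

```java
import java.nio.charset.StandardCharsets;

class UriBuilderSketch {
    // RFC 3986 "unreserved" characters: never need percent-encoding.
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    // Percent-encodes every UTF-8 byte that is not an unreserved character.
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xFF);
            if (UNRESERVED.indexOf(c) >= 0) {
                sb.append(c);
            } else {
                sb.append('%').append(String.format("%02X", b & 0xFF));
            }
        }
        return sb.toString();
    }

    // Encodes each path segment on its own, then joins them with '/'.
    static String path(String... segments) {
        StringBuilder sb = new StringBuilder();
        for (String seg : segments) {
            sb.append('/').append(encode(seg));
        }
        return sb.toString();
    }

    // Encodes alternating keys and values, then joins with '=' and '&'.
    static String query(String... keyValues) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            if (sb.length() > 0) sb.append('&');
            sb.append(encode(keyValues[i])).append('=').append(encode(keyValues[i + 1]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String uri = "http://foo.com" + path("foo", "a/b", "c d")
                   + "?" + query("q", "a & b == c");
        System.out.println(uri);
        // prints: http://foo.com/foo/a%2Fb/c%20d?q=a%20%26%20b%20%3D%3D%20c
    }
}
```

The point is the order of operations: the / inside the segment a/b and the & inside the query value are encoded before the real separators are added, so they can’t be mistaken for syntax.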
From ad-hoc testing, it seems that implementations of URI and form data parsing are very lenient: even if you percent-encode something that doesn’t require it, you’ll probably get the result you want.
These are horribly named classes. You might expect them to be useful for
encoding a URL. They’re not. They are good for HTML form encoding,
though. These classes encode and decode the
application/x-www-form-urlencoded MIME type, which is similar to what you
need for general purpose URI encoding, but not quite. These classes don’t know
anything about which component of a URI you’re working on, and they also always
encode spaces to + (instead of %20), which is only a valid encoding for
HTML form data (i.e. in the query string only). You can use these classes as a
blunt instrument to percent-encode strings, but you’ll need component-specific
fixups after encoding depending on what part of a URI you’re working on. Not
an attractive option.
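For illustration, here is roughly what the blunt-instrument approach looks like for a single path segment, assuming the only fixup needed is turning + back into %20. The class and method names are made up. The fixup is safe only because URLEncoder has already turned any literal + in the input into %2B, so every + left in its output stands for a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FormEncodingFixup {
    // Plain HTML form encoding, as done by java.net.URLEncoder.
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical fixup for a path segment: form-encode, then rewrite the
    // '+' that URLEncoder uses for spaces into "%20".
    static String pathSegment(String s) {
        return formEncode(s).replace("+", "%20");
    }

    public static void main(String[] args) {
        System.out.println(formEncode("c d"));   // prints: c+d
        System.out.println(pathSegment("c d"));  // prints: c%20d
        System.out.println(pathSegment("a+b"));  // prints: a%2Bb
    }
}
```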
The URI class gets some parts almost right. Its documentation
is an interesting read, though. It’s where I first learned about different
components having different reserved characters. If you use the many-argument
constructors of URI, the correct encoding rules are applied sometimes,
depending on your data.
(java.net.URI. "http" "user:password" "foo.com" 8080 "/foo/bar/a+b/c d/baz" "a=Mark's stuff&c=yo" "frag")
;; http://user:password@foo.com:8080/foo/bar/a+b/c%20d/baz?a=Mark's%20stuff&c=yo#frag
/ is a path separator
& which is fine unless one of your values is a & b == c, in which case you’re screwed again
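Here is the same problem as a small Java sketch, using a query value of a & b == c with the seven-argument constructor. The wrapper class and the build helper are made up for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriConstructorPitfall {
    // Builds an http URI on foo.com with the seven-argument constructor,
    // passing the path and query through exactly as given.
    static String build(String path, String query) {
        try {
            return new URI("http", null, "foo.com", -1, path, query, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The intent: one parameter q whose value is "a & b == c".
        System.out.println(build("/foo/a/b", "q=a & b == c"));
        // prints: http://foo.com/foo/a/b?q=a%20&%20b%20==%20c
    }
}
```

The spaces are encoded, but & and = are legal in a query, so the constructor leaves them alone. A form parser on the receiving end will see a parameter boundary at the &, not a single value.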
Using URI to do general purpose URL encoding is not a viable option. It does
look like it’s a usable URI parser though, using the single-string constructor.
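A small sketch of URI as a parser, with one caveat worth remembering: the plain getters decode percent-escapes, while the getRaw... variants return the text as written. If a path segment or query value might contain an encoded / or &, splitting has to happen on the raw form. The class and helper names here are made up.

```java
import java.net.URI;

class UriParsing {
    static String path(String s)     { return URI.create(s).getPath(); }
    static String rawPath(String s)  { return URI.create(s).getRawPath(); }
    static String query(String s)    { return URI.create(s).getQuery(); }
    static String rawQuery(String s) { return URI.create(s).getRawQuery(); }

    public static void main(String[] args) {
        String s = "http://foo.com/foo/a%2Fb/c%20d?q=a%20%26%20b";
        System.out.println(rawPath(s));   // prints: /foo/a%2Fb/c%20d
        System.out.println(path(s));      // prints: /foo/a/b/c d
        System.out.println(rawQuery(s));  // prints: q=a%20%26%20b
        System.out.println(query(s));     // prints: q=a & b
    }
}
```

In the decoded path, the segment a/b is indistinguishable from two segments a and b, which is why the raw form is the one to split.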