URL Encoding, Percent-Encoding and Query Strings in Java
I’ve been lost in the details of URL encoding a number of times. Each time I figure it out, move on, then promptly forget everything about it. This article is nowhere near a complete reference, it exists mainly to jog my memory.
For the rest of the article I’m going to say URI instead of URL, but I’m really thinking about URLs – http and https schemes in particular.
The majority of this info comes from RFC3986 and the Wikipedia page on percent-encoding.
High level
- There’s nothing in the JDK to make correct URI encoding/decoding simple
- There are various half-measures in the JDK that will get you occasionally-correct
code (
URI
,URLEncoder
,URLDecoder
) that might pass a smoke test - This stuff only looks easy, there are a million edge cases
- Consider a library
- urlbuilder (small and to the point; doesn’t support matrix params?)
- jersey’s URIBuilder (huge dependency)
- apache httpclient’s URIBuilder (huge dependency)
- If rolling your own
- Each component of a URI has different rules for which characters need percent-encoding
- Encode each component of a URI separately, applying component-specific percent-encoding rules
- For decoding, regex parse incoming URI string into components, then decode each separately using component-specific rules
- percent-encoding is different than
application/x-www-form-urlencoded
- They differ in handling of spaces and newlines
- The query string is typically encoded using
application/x-www-form-urlencoded
, but doesn’t have to be according to the URI spec.
URI Components
A URI is made up of five components. So called “reserved characters” are used to separate the components from one another. From RFC3986:
foo://example.com:8042/over/there?name=ferret#nose
\_/ \______________/\_________/ \_________/ \__/
| | | | |
scheme authority path query fragment
The five components are scheme, authority, path, query, and fragment. Some
components can be broken into sub-components (my term). For example the path
component over/there
has two sub-components: over
, and there
, separated
by the reserved charcter /
(at least for the http(s) scheme). Authority has
host and port sub-components, and uses the reserved character :
to separate
them. Query has no sub-components. It’s just everything from the first ?
to
the first following #
or the end of the URI. Typically the data in query is
HTML form encoded.
Why care about components and sub-components? Each sub-component needs to be percent-encoded using a component-specific set of reserved characters, then assembled with the correct URI syntax. (e.g. path vs. query) In the standard JDK there are no foolproof shortcuts for doing this. You have to break your URL into it’s smallest components, percent-encode each part, then assemble with the appropriate syntax.
From ad-hoc testing, it seems that implmentations of URI and form data parsing are very lenient and even if you percent-encode something that doesn’t require it, you’ll probably get the result you want.
URLEncoder and URLDecoder
These are horribly named classes. You might expect them to be useful for
encoding a URL. They’re not. They are good for HTML form encoding,
though. These classes encode and decode the
application/x-www-form-urlencoded
MIME type, which is similar to what you
need for general purpose URI encoding, but not quite. These classes don’t know
anything about which component of a URI you’re working on, and they also always
encode spaces to +
(instead of %20
), which is only a valid encoding for
HTML form data (i.e. in the query string only). You can use these classes as a
blunt instrument to percent-encode strings, but you’ll need component-specific
fixups after encoding depending on what part of a URI you’re working on. Not
an attractive option.
URI Class
The URI class gets some parts almost right. Its
source
is an interesting read, though. It’s where I first learned about different
components having different reserved characters. If you use the many-argument
constructors for URI
, the correct encoding rules are applied sometimes,
depending on your data.
(java.net.URI. "http" "user:password" "foo.com" 8080 "/foo/bar/a+b/c d/baz" "a=Mark's stuff&c=yo" "frag")
;; http://user:password@foo.com:8080/foo/bar/a+b/c%20d/baz?a=Mark's%20stuff&c=yo#frag
- path looks good in this example
- it did the right thing preserving the
+
and using%20
for spaces - it understood that
/
is a path separator - you’re screwed if your path element contains a “/” that needs encoding though
- it did the right thing preserving the
- query is wrong, but likely works in practice
- it’s percent-encoded, not form encoded (note the apostrophe is not percent-encoded, this is fine by URI spec, but not by form encoding spec)
- it didn’t encode
=
and&
which is fine unless one of your values isa & b == c
, in which case you’re screwed again
Using URI
to do general purpose URL encoding is not a viable option. It does
look like it’s a usable URI parser though, using the single-string constructor
and the raw
getters.
Conclusion
- Use a library
- Write your own URI encoder/decoder helpers, but recognize it’s less straight-forward than you probably want it to be
- Hope for a working URIBuilder in the JDK one day
- A standard set of tests for URI parsing would be nice, too
References
- Detailed discussion of URL encoding with examples (Excellent read)
- RFC3986 - URI: Generic Syntax
- percent-encoding at Wikipedia
- application/x-www-form-urlencoded “spec” (as best I can find…)
- Tests for URI parsing
- Python’s urlparse