Friday, April 24, 2020

Just another regex...


Preface

I like regexes, I really do. And I also like perl, and feel like they both are heavily neglected nowadays and neither of them deserves it. OK, it indeed takes some effort to learn to use them correctly, but that's also true for a chainsaw. Or for a scalpel. Or for a tool that can be used as a chainsaw and as a scalpel and as a lightsaber and ...

OK, back to the regexes. It'll be python, its regex support is almost as good as in perl ;).

I ran into a task that requires some URL mangling: some hostnames have to be replaced by some new hostnames, and so on.

Such an URL looks like: scheme://user:pass@host:port/path1/path2/path3?arg1=val1&arg2=val2

The scheme is usually "http" or "https", but it can be "svn+ssh" as well, and all parts but the host are optional.

Yes, I know, officially the scheme is mandatory as well, but when you ask a user to enter the URL of a server, in 80% of the cases it won't start with "https://", but with a plain hostname. And real-life data comes from real-life users, and they don't care about the RFCs and if you try to enforce it, you'll be asked to remove that "pesky" check and imply that "http://" if there is nothing there.

First I wanted to avoid reinventing the wheel and tried to use the urllib.parse module, but it doesn't handle that "user:pass" part well (for parsing it's said to handle but doesn't, for "unparsing" it's not even considered). Besides, it returns named tuples of ParseResult type, and as they are tuples, they're immutable, so any modification would require creating new instances of them. And there is no such thing as "copy-all-attributes-but-the-hostname" constructor, so I would've had to list each and every attribute for the normal constructor. Which doesn't handle separate username and password, so it would've had to be pre-mangled into the hostname... barf... sorry...

Urllib is not aimed for this task. I'm not yet sure what exactly it's aimed for, but it wouldn't make this task much simpler.

I don't need that immutability, I want a plain class that can parse the input string, lets me access (r+w) the fields and can rebuild them to an output string.

And if I have to write wrapper code to enclose some gory junk, then I prefer it to be my own junk.

It wasn't a hard task, but it lead to this steroid-filled monster regex golem:

^(?:([^:]*)://)?(?:([^:@]*)(?::([^@]*))?@)?([^:/?]*)(?::([0-9]*))?(/[^?]*)?(?:[?](.*))?

Actually, it's much simpler that how it looks, we just have to build it step by step, but it indeed needs the explanation. Which I don't want to embed into the code comments, just the URL of this post.

Even the longest journey begins with a single small step, so:

The structure of the URLs

Let's make it clear what we want to parse:

  • At the beginning of the string
  • There is an optional scheme: anything but ":" characters, followed by "://"
  • Then the also optional auth part:
    • A user part: anything but ":" or "@" characters
    • Then an optional password part: a ":", followed by anything but "@" characters
    • And then an "@" character
  • Then the host part: anything but ":" or "/" or "?" characters
  • Then the optional port part: a ":", followed by digits
  • Then the optional path part: a "/", followed by anything but "?" characters
  • Then the optional args part: a "?", followed by the rest
Let's begin with the scheme part. It'll be a bit longer, as the concepts are explained here as well.

The scheme part

At the beginning of the string, a sequence of anything but ":" characters, followed by "://"
That is: ^[^:]*://
  • ^ matches the beginning of the string
  • [^:] matches one character that is anything but ":"
  • so [^:]* matches (the longest possible) sequence of zero or more such characters
  • : is not a special character, so it matches a ":"
  • / is not a special character either, so it matches a "/"
  • so :// matches the literal string "://"
So ^[^:]*:// matches the scheme prefix followed by "://".

"Matching" only means a yes/no decision whether the text is in this format or not, but doesn't return us anything. However, we want to extract (the term is "capture") that part that was matched by that [^:]*, and this is what grouping does: if we enclose this part of the pattern in parentheses, then the regex engine will (at the end) return us the part of the input matched by this part of the pattern:

^([^:]*)://

Note that we're interested only in the scheme, like "https", and not the "://" that follows it, so that's why I left the latter one outside of the grouping. That is only matched for, but not to be captured.

If we match any input against this regex like x = test_regex.search("https://some.host.com"), then x.group(1) will contain "https". We're almost there :).

There is one more thing with the scheme part: it's optional. If there is no "://" in the input, then it's still valid, it's just the host part follows a "missing" scheme part.

In regex, the multiplicity sign for "0 or 1 occurences" is ?, and it is applied to whatever atomic unit it follows: either a single character, or a grouped sub-expression. The regex aab?cc would match "aabcc" or "aacc": "aa", followed by 1 or 0 occurences of "b", followed by "cc".

We want to use it to check the occurence count of not one character, but a sub-pattern group, so it will follow not a character, but a parenthesized grouping: ( ... )?

Namely, the whole scheme part (now with extra spaces for readability): ^  (  ([^:]*)://  )?

The start of the line, then our scheme part enclosed in a new group, which must have the cardinality of 0 or 1.

And this would work!

But now that we have defined two groups in the pattern, we would get two groups in the result as well:
  • x.group(1) would return the text matched by the 1st starting group: "https://" and
  • x.group(2) would return the text matched by the 2nd starting group: "https"
This would ruin the result group numbering and in fact we don't want the text matched by the 1st (the outer) parenthesized group, we just needed it to mark the scope of that ?.

In extended regex syntax there is a non-capturing group syntax: it's like a normal group, only it doesn't bother to extract the matched text: (?: ... )

Yes, nasty syntax. It wasn't part of the original "basic" regex, and he other braces already had their special meaning, so for such extended features an otherwise meaningless, invalid syntax was introduced, the (?...) notation. Among which the (?: ... ) means the non-capturing groups.

Replacing the 1st (outer) parentheses of our scheme-capturing regex:
Instead of ^  (  ([^:]*)://  )? it becomes ^  (?:  ([^:]*)://  )?

And the without those extra spaces, the real one: ^(?:([^:]*)://)?

The authenticity part

  • A user part: anything but ":" or "@" characters
  • Then an optional password part: a ":", followed by anything but "@" characters
  • And then an "@" character
User part: ([^:@]*)
We've already seen patterns like this, in the scheme part regex.

Password part: :  ([^@]*)
Remember: the leading ":" is just matched for, but we don't need to capture it.

The optional password part: (?:  :([^@]*)  )?
Remember: non-capturing group, 0 or 1 times.

The trailing "@": @
It's not a special character, so it matches itself :).

So the authenticity part together (with didactic spaces): ([^:@]*)  (?::([^@]*))?  @

These non-capturing-groups make it seem more mystical than how it actually is, but that's the
only way to deal with optional parts.

Apropos optional, this whole authenticity is also optional, so it needs a surrounding
(?: ... )? construct: (?:  ([^:@]*)(?::([^@]*))?@  )?

And the without those extra spaces, the real one: (?:([^:@]*)(?::([^@]*))?@)?

(Hint: Don't try to interpret the whole pattern char-by-char. Just follow the way it evolved from simpler sub-patterns, and what those sub-patterns do, and not how they do it. If you want to analyze those sub-patterns, then do it separately, without the complexity of the bigger composite pattern.)


The host part

It's not optional, so it'll be way simpler than the previous ones. A sequence of any characters but ":", "/" or "?", all of it captured as a group.

This is it: ([^:/?]*)


The port part

Starts with a ":", followed by a sequence of digits, which should be captured as a group, so it would be: :([0-9]*)

But it's optional, so: (?:  :([0-9]*)  )?

Without the extra spaces: (?::([0-9]*))?


The path part

Starts with a "/", followed by a sequence of non-"?" characters, all of it captured as a group, which is optional (i.e. occurs 0 or 1 times)

This is it: (/[^?]*)?

Note that since we already have the whole 0-or-1-times part as a group, there is no need for the usual non-capturing `(?: ... )` grouping, but we can affix the "?" right after our plain capturing group.


The arguments part

Starts with a "?", followed by anything (that is, a sequence of any characters).

By now you surely recall that in regex "?" has a special meaning, so how can we match the literal "?" character?

Either as \?, which is good, but as our regex will be a string in python, and python would interpret this as a python backslash-escape sequence (and thus produce a single ? into the regex, making it possibly invalid), so we would need to write it as \\?, from which python would make the \? we actually want.

IMHO regex can be ugly enough by its own means, no need to add some string escaping to it :D.

So I prefer the other syntax, which uses the fact that within a character set, all characters lose their special meaning (well, except for ^ and - and ], except at certain places, but that's another story), so the regex [?] matches any character that is "?" :).

So a "?" followed by any sequence of any characters: [?]  .*

Of which we want to capture that sequence: [?]  (.*)

And this part is optional, so: (?:  [?](.*)  )?

And the without those extra spaces, the real one: (?:[?](.*))?


All parts together

The moment of truth :D

- The scheme: ^(?:([^:]*)://)?
- The authenticity: (?:([^:@]*)(?::([^@]*))?@)?
- The host: ([^:/?]*)
- The port: (?::([0-9]*))?
- The path: (/[^?]*)?
- The args: (?:[?](.*))?

All together, in this order: 
^(?:([^:]*)://)?  (?:([^:@]*)(?::([^@]*))?@)?  ([^:/?]*)  (?::([0-9]*))?  (/[^?]*)?  (?:[?](.*))?

Without the extra spaces:
^(?:([^:]*)://)?(?:([^:@]*)(?::([^@]*))?@)?([^:/?]*)(?::([0-9]*))?(/[^?]*)?(?:[?](.*))?

Don't try to pronounce it, on bad days it may summon the Great Cthulhu and he'll reopen all your bugs. But when matching an URL-like string against it, we get the following results:

$ python3
...
>>> import re

>>> url_split = re.compile("^(?:([^:]*)://)?(?:([^:@]*)(?::([^@]*))?@)?([^:/?]*)(?::([0-9]*))?(/[^?]*)?(?:[?](.*))?")

>>> x = url_split.search("https://user1:password1@host1:8080/p1/p2/p3?qwer=42&asdf=111")

>>> x.group(1)
'https'
>>> x.group(2)
'user1'
>>> x.group(3)
'password1'
>>> x.group(4)
'host1'
>>> x.group(5)
'8080'
>>> x.group(6)
'/p1/p2/p3'
>>> x.group(7)
'qwer=42&asdf=111'

Try it with some of the optional parts missing, their respective group()-s shall return None.

Feelin' the power of regular expressions? We've barely scratched the surface! Here is the book about the subject.

Conclusion

That was it.

A big write-only lump of ascii junk, whose readability wouldn't be any worse after a cryptographically strong encryption either.

On the other hand, a fast and robust tool, that, when seen along its evolution, isn't random at all, but it's a concise and logical description of what we wanted to recognize.

It's just... another regex :D.

I hope you enjoyed this journey!
(For sure I did. It felt like I'm solving problems again, and not just glueing together some pre-built modules.)

And I also hope that the next time you face such a problem, you won't start mucking around with index() and substr() and state flags and multi-level if-elif-elif-else constructs...

No comments:

Post a Comment