When not to use a regex

Published 2017-08-13 on Drew DeVault's blog

The other day, I saw “Learn regex the easy way”. This is a great resource, but I felt the need to pen a post explaining that regexes are usually not the right approach.

Let’s do a little exercise. I googled “URL regex” and here’s the first Stack Overflow result:

https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)

source

This is a bad regex. Here are some valid URLs that this regex fails to match:

  • http://x.org
  • http://nic.science
  • http://名がドメイン.com (warning: this is a parked domain)
  • http://example.org/url,with,commas
  • https://en.wikipedia.org/wiki/Harry_Potter_(film_series)
  • http://127.0.0.1
  • http://[::1] (ipv6 loopback)

Here are some invalid URLs the regex is fine with:

  • http://exam..ple.org
  • http://--example.org
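
If you want to check these claims yourself, here is a minimal sketch. It assumes that "validating" means the whole string has to match, so it uses re.fullmatch; the names URL_RE, VALID_URLS, INVALID_URLS and regex_accepts are mine, not part of the Stack Overflow answer. The last invalid URL is written with two ASCII hyphens.

```python
import re

# The Stack Overflow regex quoted above, copied verbatim.
URL_RE = re.compile(
    r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b"
    r"([-a-zA-Z0-9@:%_\+.~#?&//=]*)"
)

# Valid URLs from the first list.
VALID_URLS = [
    "http://x.org",
    "http://nic.science",
    "http://名がドメイン.com",
    "http://example.org/url,with,commas",
    "https://en.wikipedia.org/wiki/Harry_Potter_(film_series)",
    "http://127.0.0.1",
    "http://[::1]",
]

# Invalid URLs from the second list.
INVALID_URLS = [
    "http://exam..ple.org",
    "http://--example.org",
]


def regex_accepts(url):
    # Assumption: the entire string must match, not just a prefix of it.
    return URL_RE.fullmatch(url) is not None


for url in VALID_URLS + INVALID_URLS:
    verdict = "accepted" if regex_accepts(url) else "rejected"
    print(f"{verdict}  {url}")
```

Every entry in VALID_URLS is rejected and both entries in INVALID_URLS are accepted.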

This answer has been revised 9 times on Stack Overflow, and this is the best they could come up with. Go back and read the regex. Can you tell where each of these bugs is? How long did it take you? If you received a bug report in your application because one of these URLs was handled incorrectly, do you understand this regex well enough to fix it? If your application has a URL regex, go find it and see how it fares with these tests.

Complicated regexes are opaque, unmaintainable, and often wrong. The correct approach to validating a URL is as follows:

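from urllib.parse import urlparse

def is_url_valid(url):
    try:
        # urlparse raises ValueError on malformed input,
        # such as an unbalanced IPv6 bracket
        urlparse(url)
        return True
    except ValueError:
        # Catch only parse failures; a bare except would also
        # swallow KeyboardInterrupt and SystemExit
        return False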
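
Note that urlparse is lenient: it raises on only a few kinds of malformed input and will happily parse a string with no scheme or host at all. If your application needs more than "it parses", one possible tightening, sketched here as an assumption about what you might want rather than a definitive rule, is to also require an http(s) scheme and a non-empty hostname. The function name looks_like_http_url is hypothetical.

```python
from urllib.parse import urlparse


def looks_like_http_url(url):
    """Hypothetical stricter check: parses, uses http(s), and names a host."""
    try:
        parts = urlparse(url)
    except ValueError:
        # Raised for malformed input such as an unbalanced IPv6 bracket.
        return False
    # hostname is None when the URL has no network location.
    return parts.scheme in ("http", "https") and bool(parts.hostname)
```

This still leaves the parsing to the standard library; it only adds two checks on the parsed result.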

A regex is useful for validating simple patterns and for finding patterns in text. For anything beyond that it’s almost certainly a terrible choice. Say you want to…

validate an email address: try to send an email to it!

validate password strength requirements: estimate the complexity with zxcvbn!

validate a date: use your standard library! datetime.datetime.strptime

validate a credit card number: run the Luhn algorithm on it!

validate a social security number: alright, use a regex. But don’t expect the number to be assigned to someone until you ask the Social Security Administration about it!
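
Two of these are short enough to sketch with nothing but the standard library. This is a rough illustration, not a complete validator: the function names is_valid_date and luhn_ok are mine, the default date format is an assumption, and passing the Luhn check only means the digits are self-consistent, not that the card exists.

```python
from datetime import datetime


def is_valid_date(text, fmt="%Y-%m-%d"):
    """Return True if text is a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        # Raised for a wrong format and for impossible dates like Feb 30.
        return False


def luhn_ok(number):
    """Return True if the digit string passes the Luhn checksum."""
    # Assumption: spaces and hyphens are allowed as separators.
    cleaned = number.replace(" ", "").replace("-", "")
    if not cleaned or not cleaned.isascii() or not cleaned.isdigit():
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, ch in enumerate(reversed(cleaned)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0
```

For example, is_valid_date("2017-02-29") is False because 2017 was not a leap year, and luhn_ok("4111 1111 1111 1111") is True for that widely used test number.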

Get the picture?


