Eight Days A Week

The following pattern is quick and dirty, but it pulls out most shorthand dates used in the US and excludes a lot of the false positives in the dataset I was using. The matches can then be handed to something like dateutil.parser.

/(?<![-\/\d\w])\d{1,2}[-\/]\d{1,2}[-\/]\d{2,4}(?![-\/\d\w])/g
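For readers working in Python, here is a minimal sketch of how this quick pattern might be applied. The name QUICK_DATE and the sample string are made up for illustration. The /…/g wrapper is dropped because findall already returns every match.

```python
import re

# The quick pattern, written as a Python raw string.
# Inside a character class the slash needs no escaping in Python.
QUICK_DATE = re.compile(r"(?<![-/\d\w])\d{1,2}[-/]\d{1,2}[-/]\d{2,4}(?![-/\d\w])")

# Invented sample text: two shorthand dates, one ID-like number, one fraction.
sample = "Born 4/12/1987, moved 12-25-99, part no. 123-45-6789, ratio 3/4."

print(QUICK_DATE.findall(sample))  # ['4/12/1987', '12-25-99']
```

The ID-like number is skipped because its first group has three digits and the lookbehind refuses to start a match in the middle of a run of digits.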

The three date formats used worldwide seem to order the parts as DMY, YMD, or MDY, so the year never goes in the middle. Separators would be [\/\.\- ]. To cover Chinese, Japanese, or Korean dates, the separators could include Unicode characters (年, 月, 日 and 년, 월, 일) as well.

Assuming we’re only trying to parse years no more than 1000 or so years in the past and no more than 975 or so years in the future, a four-digit year must start with a 1 or a 2. A two-digit year could be basically anything. From here on I’ll assume that the dates we’re dealing with are birthdates for, say, a website registration (people who are currently alive), so they will be no earlier than 1894. Anyone born in the 19th century will need to enter a four-digit year to avoid ambiguity (97 will not work; a typical parser will assume 1997 for such an entry). With this information, the pattern for year becomes

/(?:(?:189)[4-9]|(?:19)?\d{2}|(?:200)\d|(?:201)[0-6])/g
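A quick way to sanity-check the year pattern is to run it against a few values with Python’s re module. This is only a sketch: the helper name is_year is invented here, and the redundant inner groups such as (?:189) are written without their parentheses.

```python
import re

# Same alternatives as the year pattern: 1894-1899, an optional "19" plus
# two digits (so 1900-1999 or any two-digit year), 2000-2009, 2010-2016.
YEAR = re.compile(r"(?:189[4-9]|(?:19)?\d{2}|200\d|201[0-6])")

def is_year(token):
    """Return True if the whole token is accepted by the year pattern."""
    return YEAR.fullmatch(token) is not None

for token in ("1893", "1894", "97", "1997", "2016", "2017"):
    print(token, is_year(token))
```

Only 1893 and 2017 should come back False, since they fall just outside the assumed range.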

with possible following delimiters of

/[\/\.\- \x{5e74}\x{b144}]/g

Month can only be between 1 and 12, and may or may not have a leading 0. Thus

/(?:0?[1-9]|1[0-2])/g

with possible following delimiters of

/[\/\.\- \x{6708}\x{c6d4}]/g

If a day, then it may be anything between 1 and 31, and may or may not have a leading 0. Therefore,

/(?:0?[1-9]|[12]\d|3[01])/g

with possible following delimiters of

/[\/\.\- \x{65e5}\x{c77c}]/g
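The month and day ranges are easy to check exhaustively. The sketch below does that in Python, with and without the leading zero. The variable names are illustrative. The tens of the day are written as [12]\d so that 10 and 20 are accepted, which a class of [1-9] in that position would skip.

```python
import re

MONTH = re.compile(r"(?:0?[1-9]|1[0-2])")
DAY = re.compile(r"(?:0?[1-9]|[12]\d|3[01])")

def accepted(pattern, numbers):
    """Numbers accepted both bare ("4") and zero-padded ("04")."""
    return [n for n in numbers
            if pattern.fullmatch(str(n)) and pattern.fullmatch(f"{n:02d}")]

months_ok = accepted(MONTH, range(0, 14))  # try 0..13
days_ok = accepted(DAY, range(0, 33))      # try 0..32

print(months_ok == list(range(1, 13)))  # True
print(days_ok == list(range(1, 32)))    # True
```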

The first token can be any of year, month, or day, so it must match the following:

/(?:(?:(?:189)[4-9]|(?:19)?\d{2}|(?:200)\d|(?:201)[0-6])[\/\.\- \x{5e74}\x{b144}]|(?:0?[1-9]|1[0-2])[\/\.\- \x{6708}\x{c6d4}]|(?:0?[1-9]|[12]\d|3[01])[\/\.\- \x{65e5}\x{c77c}])/g

The second token can only be day or month.

/(?:(?:0?[1-9]|1[0-2])[\/\.\- \x{6708}\x{c6d4}]|(?:0?[1-9]|[12]\d|3[01])[\/\.\- \x{65e5}\x{c77c}])/g

The last token can also be day, month, or year, but unless it is part of a Chinese, Japanese, or Korean date, it will have no following delimiter.

/(?:(?:(?:189)[4-9]|(?:19)?\d{2}|(?:200)\d|(?:201)[0-6])[\x{5e74}\x{b144}]?|(?:0?[1-9]|1[0-2])[\x{6708}\x{c6d4}]?|(?:0?[1-9]|[12]\d|3[01])[\x{65e5}\x{c77c}]?)/g

To avoid bumping into tokens on either side, I add this to the beginning

/(?<=\b)(?<![\/\-])/g

and this to the end:

/(?![\/\-])(?=\b)/g
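To see what these two guards buy us, here is a small Python sketch that compares a guarded and an unguarded pattern. CORE is a deliberately simplified stand-in for the date tokens, not the real expression, and the sample text is invented. The (?<=\b) and (?=\b) assertions are written as plain \b, which is equivalent.

```python
import re

# Simplified stand-in for the three date tokens (assumption for this demo).
CORE = r"\d{1,2}[/\-]\d{1,2}[/\-]\d{2}"

bare = re.compile(CORE)
guarded = re.compile(r"\b(?<![/\-])" + CORE + r"(?![/\-])\b")

text = "serial 12-25-99-0001, issued 12-25-99"

print(bare.findall(text))     # ['12-25-99', '12-25-99']
print(guarded.findall(text))  # ['12-25-99']
```

The unguarded pattern picks a "date" out of the serial number; the guarded one rejects it because a hyphen follows the last token.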

leaving me with a full expression (so far) of

/(?<=\b)(?<![\/\-])(?:(?:(?:189)[4-9]|(?:19)?\d{2}|(?:200)\d|(?:201)[0-6])[\/\.\- \x{5e74}\x{b144}]|(?:0?[1-9]|1[0-2])[\/\.\- \x{6708}\x{c6d4}]|(?:0?[1-9]|[12]\d|3[01])[\/\.\- \x{65e5}\x{c77c}])(?:(?:0?[1-9]|1[0-2])[\/\.\- \x{6708}\x{c6d4}]|(?:0?[1-9]|[12]\d|3[01])[\/\.\- \x{65e5}\x{c77c}])(?:(?:(?:189)[4-9]|(?:19)?\d{2}|(?:200)\d|(?:201)[0-6])[\x{5e74}\x{b144}]?|(?:0?[1-9]|1[0-2])[\x{6708}\x{c6d4}]?|(?:0?[1-9]|[12]\d|3[01])[\x{65e5}\x{c77c}]?)(?![\/\-])(?=\b)/g

It still catches some bad tokens, like the following:

  • 2009.4.97
  • 50-24-24
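Both bad tokens can be reproduced in Python with a sketch of the token-by-token pattern. The names are illustrative, \b stands in for (?<=\b) and (?=\b), and the tens of the day are written [12]\d, which does not affect either example.

```python
import re

YEAR = r"(?:189[4-9]|(?:19)?\d{2}|200\d|201[0-6])"
MONTH = r"(?:0?[1-9]|1[0-2])"
DAY = r"(?:0?[1-9]|[12]\d|3[01])"

# First token: year, month, or day, each followed by a generic separator
# or its own CJK marker (年/년, 月/월, 日/일).
first = rf"(?:{YEAR}[/.\- \u5e74\ub144]|{MONTH}[/.\- \u6708\uc6d4]|{DAY}[/.\- \u65e5\uc77c])"
# Second token: month or day only.
second = rf"(?:{MONTH}[/.\- \u6708\uc6d4]|{DAY}[/.\- \u65e5\uc77c])"
# Last token: any of the three, with an optional CJK marker.
last = rf"(?:{YEAR}[\u5e74\ub144]?|{MONTH}[\u6708\uc6d4]?|{DAY}[\u65e5\uc77c]?)"

LOOSE = re.compile(rf"\b(?<![/\-]){first}{second}{last}(?![/\-])\b")

for token in ("2009.4.97", "50-24-24"):
    print(token, bool(LOOSE.fullmatch(token)))  # both print True
```

The first token slips through as year, month, year; the second as year, day, year. Nothing in the pattern yet requires that exactly one of the outer tokens is a year.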

Can we clean that up, too? We’ve determined that there are three possibilities for dates, so let’s group them differently.

YMD:

/(?:(?:189)[4-9]|(?:19)?\d{2}|(?:200)\d|(?:201)[0-6])[\/\.\- \x{5e74}\x{b144}](?:0?[1-9]|1[0-2])[\/\.\- \x{6708}\x{c6d4}](?:0?[1-9]|[12]\d|3[01])[\x{65e5}\x{c77c}]?/g

MDY:

/(?:0?[1-9]|1[0-2])[\/\.\- \x{6708}\x{c6d4}](?:0?[1-9]|[12]\d|3[01])[\/\.\- \x{65e5}\x{c77c}](?:(?:189)[4-9]|(?:19)?\d{2}|(?:200)\d|(?:201)[0-6])[\x{5e74}\x{b144}]?/g

DMY:

/(?:0?[1-9]|[12]\d|3[01])[\/\.\- \x{65e5}\x{c77c}](?:0?[1-9]|1[0-2])[\/\.\- \x{6708}\x{c6d4}](?:(?:189)[4-9]|(?:19)?\d{2}|(?:200)\d|(?:201)[0-6])[\x{5e74}\x{b144}]?/g

Combined, with our lookbehind and lookahead blockers, we get

/(?<=\b)(?<![\/\-])(?:(?:(?:189)[4-9]|(?:19)?\d{2}|(?:200)\d|(?:201)[0-6])[\/\.\- \x{5e74}\x{b144}](?:0?[1-9]|1[0-2])[\/\.\- \x{6708}\x{c6d4}](?:0?[1-9]|[12]\d|3[01])[\x{65e5}\x{c77c}]?|(?:0?[1-9]|1[0-2])[\/\.\- \x{6708}\x{c6d4}](?:0?[1-9]|[12]\d|3[01])[\/\.\- \x{65e5}\x{c77c}](?:(?:189)[4-9]|(?:19)?\d{2}|(?:200)\d|(?:201)[0-6])[\x{5e74}\x{b144}]?|(?:0?[1-9]|[12]\d|3[01])[\/\.\- \x{65e5}\x{c77c}](?:0?[1-9]|1[0-2])[\/\.\- \x{6708}\x{c6d4}](?:(?:189)[4-9]|(?:19)?\d{2}|(?:200)\d|(?:201)[0-6])[\x{5e74}\x{b144}]?)(?![\/\-])(?=\b)/g

That would be the raw regex. But we can at least make it easier to read:

/(?(DEFINE)(?'year'(?:(?:189)[4-9]|(?:200)\d|(?:201)[0-6])|(?:19)?\d{2})(?'month'(?:1[0-2]|0?[1-9]))(?'day'(?:[12]\d|3[01]|0?[1-9]))(?'sep'[\/\.\- ])(?'yearsep'[\x{5e74}\x{b144}])(?'monthsep'[\x{6708}\x{c6d4}])(?'daysep'[\x{65e5}\x{c77c}]))(?<=\b)(?<![\/\-])(?:(?&year)(?:(?&sep)|(?&yearsep))(?&month)(?:(?&sep)|(?&monthsep))(?&day)(?&daysep)?|(?:(?&month)(?:(?&sep)|(?&monthsep))(?&day)(?:(?&sep)|(?&daysep))|(?&day)(?:(?&sep)|(?&daysep))(?&month)(?:(?&sep)|(?&monthsep)))(?&year)(?&yearsep)?)(?![\/\-])(?=\b)/g

Byte-wise, this is technically saving only a few characters, but, as I said, easier to read. I was able to simplify it still further because of the ease of reading it. It may not be optimized, but it’s quite robust (it grabs valid dates and ignores a lot of false positives).

Unfortunately, the DEFINE construct is a PCRE and Perl feature (PHP gets it through PCRE), and many other engines, including Python’s built-in re module, lack it. In those engines, adjusting the valid years means searching through the raw pattern above and editing each instance of the year expression.

It’s still not perfect. It doesn’t prevent dates in the future, won’t automatically update for a shifting calendar, etc., but for about 510 bytes (excluding the slashes and the “g” flag), it’s not bad.
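One hedged way to cover the first of those gaps is to validate each match after the regex has found it, rather than pushing more logic into the pattern. The sketch below assumes the match has already been split into integer year, month, and day; the function name and its parameters are invented for illustration.

```python
from datetime import date

def is_plausible_birthdate(year, month, day, today=None, earliest_year=1894):
    """Reject impossible calendar dates, future dates, and dates before earliest_year."""
    today = today or date.today()
    try:
        candidate = date(year, month, day)  # raises ValueError for e.g. Feb 30
    except ValueError:
        return False
    return date(earliest_year, 1, 1) <= candidate <= today

# Fixed "today" so the example is reproducible.
ref = date(2016, 6, 1)
print(is_plausible_birthdate(2016, 2, 29, today=ref))   # True (leap day)
print(is_plausible_birthdate(2015, 2, 29, today=ref))   # False (no such day)
print(is_plausible_birthdate(2016, 12, 25, today=ref))  # False (in the future)
```

Because the upper bound comes from date.today() by default, this check moves with the calendar even though the year range in the regex does not.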

Just because I need a Python version of the above regex, here it is:

import re
# Raw string, so that \b reaches the regex engine as a word boundary rather than a backspace.
DATE_RE = re.compile(r'(?<=\b)(?<![\/\-])(?:(?:(?:189)[4-9]|(?:200)\d|(?:201)[0-6]|(?:19)?\d{2})[\/\.\- \u5e74\ub144](?:1[0-2]|0?[1-9])[\/\.\- \u6708\uc6d4](?:[12]\d|3[01]|0?[1-9])[\u65e5\uc77c]?|(?:(?:1[0-2]|0?[1-9])[\/\.\- \u6708\uc6d4](?:[12]\d|3[01]|0?[1-9])[\/\.\- \u65e5\uc77c]|(?:[12]\d|3[01]|0?[1-9])[\/\.\- \u65e5\uc77c](?:1[0-2]|0?[1-9])[\/\.\- \u6708\uc6d4])(?:(?:189)[4-9]|(?:200)\d|(?:201)[0-6]|(?:19)?\d{2})[\u5e74\ub144]?)(?![\/\-])(?=\b)')
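Python’s built-in re module has no DEFINE, but ordinary string composition gives much of the same readability, and the year range then lives in one place. The sketch below is one possible arrangement rather than the only one. The constant names and the sep_or helper are invented here, the tens of the day are written [12]\d so that 10 and 20 match, and \b stands in for (?<=\b) and (?=\b).

```python
import re

YEAR = r"(?:189[4-9]|200\d|201[0-6]|(?:19)?\d{2})"  # edit the range here only
MONTH = r"(?:1[0-2]|0?[1-9])"
DAY = r"(?:[12]\d|3[01]|0?[1-9])"
SEP = r"[/.\- ]"
YEAR_SEP = r"[\u5e74\ub144]"   # 年, 년
MONTH_SEP = r"[\u6708\uc6d4]"  # 月, 월
DAY_SEP = r"[\u65e5\uc77c]"    # 日, 일

def sep_or(specific):
    """A generic separator or the token's own CJK marker."""
    return f"(?:{SEP}|{specific})"

YMD = f"{YEAR}{sep_or(YEAR_SEP)}{MONTH}{sep_or(MONTH_SEP)}{DAY}{DAY_SEP}?"
MD = f"{MONTH}{sep_or(MONTH_SEP)}{DAY}{sep_or(DAY_SEP)}"
DM = f"{DAY}{sep_or(DAY_SEP)}{MONTH}{sep_or(MONTH_SEP)}"
YEAR_LAST = f"(?:{MD}|{DM}){YEAR}{YEAR_SEP}?"

DATE = re.compile(rf"\b(?<![/\-])(?:{YMD}|{YEAR_LAST})(?![/\-])\b")

text = "Registered 10/20/2010 and 2009.4.97, born 1987-04-12; code 50-24-24"
print(DATE.findall(text))  # ['10/20/2010', '1987-04-12']
```

The two tokens that the looser pattern let through, 2009.4.97 and 50-24-24, are rejected here, while a standalone string such as 1987年4月12日 is accepted via fullmatch.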