Box Set

In Python for now, eventually in VBA, I’m trying to come up with a way to test whether or not a particular quantity is appropriately specified with valid units and tolerances. It turns out to be slightly simpler to be (semi-)lenient than extremely strict, so here’s what I have so far:

import re

#decimal numbers (lenient: any run of digits and dots, so "1.2.3" also passes)
number=r'[\d.]+'
#unicode superscript digits, minus sign, and parentheses
exponent='[\u2070\u2074-\u2079\u207b\u207d\u207e\xb2\xb3\xb9]*'
#SI prefixes
prefix='(?:[YZEPTGMkhdcm\xb5npfazy]|da)?'
#SI prefixes for indivisible units such as bits and bytes (multiples only)
ind_prefix='(?:[YZEPTGMkh]|da)?'
#JEDEC/IEC Binary prefixes
bin_prefix='(?:[YZEPTGMK]i)?'
#units that can have a prefix (though some are less standard than others)
prefixed_unit='[ABCFHJKLNRSTVWbglmstu\u2126]|Bq|Ci|Da|Gy|Hz|Np|Pa|Sv|Wb|bar|cd|eV|kat|lm|lx|rad|rem|sr|ua|mol'
#units that cannot have a prefix (normally)
solo_unit='[°\'\"dh\xc5]|°C|ha|mmHg|min'
#units that primarily take binary prefixes
bin_unit='B|bit'
#additional non-standard units that cannot have a prefix (because I said so, not because there's any "ban" on it or anything)
solo_unit+='|°F|ft|in|kt|lb|nmi'
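#define a generic unit (each alternation is wrapped in its own group so the prefix applies to every unit, not just the first alternative)
unit='(?:(?:'+prefix+'(?:'+prefixed_unit+'))|(?:'+solo_unit+')|(?:'+bin_prefix+'(?:'+bin_unit+'))|(?:'+ind_prefix+'(?:'+bin_unit+')))'+exponent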
#separator: optional whitespace or thin space
sep=r'[\s\u2009]?'
#multiply operator: times sign, dot operators, middle dot, whitespace, and thin space
multiply=r'[×\u2219\u22c5\xb7\s\u2009]'
#define compound units involving only multiplication
compound_unit='(?:'+unit+multiply+'{1,3})*'+unit
#define compound units involving division as well
mega_unit=compound_unit+sep+r'(?:/(?:'+sep+r'\('+sep+compound_unit+sep+r'\))|(?:/'+sep+unit+'))*'
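#put it all together (\1 forces each tolerance to repeat the unit captured in group 1)
everything=number+sep+'('+mega_unit+')'+sep+'(?:±'+sep+number+sep+r'\1|\+'+sep+number+sep+r'\1'+sep+'/'+sep+'-'+sep+number+sep+r'\1)|[<>\u2264\u2265]'+sep+number+sep+compound_unit
#compile! (keep the result so it can actually be used)
quantity_re=re.compile(everything)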

This handles quantities specified with symmetric or asymmetric tolerances, as well as inequalities with units. It doesn’t enforce that asymmetric tolerances actually differ (the two numbers could, theoretically, be the same), and the separator handling is lazy for now: at most one optional whitespace character between tokens.
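To make the three accepted shapes concrete, here is a cut-down sketch of the same structure. The three-entry unit list, the split into symmetric/asymmetric/inequality pieces, and the is_valid helper are stand-ins I made up for illustration; only the overall layout (a captured unit, back-references in the tolerances, a separate inequality branch) mirrors the pattern above.

```python
import re

# Simplified stand-ins (assumptions, not the full definitions from the post)
number = r'[\d.]+'
sep = r'\s?'
unit = r'(?:mm|kg|V)'  # hypothetical, deliberately tiny unit list

# \1 refers back to the unit captured right after the nominal value
symmetric = '±' + sep + number + sep + r'\1'
asymmetric = (r'\+' + sep + number + sep + r'\1' + sep + '/' + sep +
              '-' + sep + number + sep + r'\1')
inequality = '[<>\u2264\u2265]' + sep + number + sep + unit

quantity = (number + sep + '(' + unit + ')' + sep +
            '(?:' + symmetric + '|' + asymmetric + ')|' + inequality)
QUANTITY_RE = re.compile(quantity)

def is_valid(text):
    """Return True if the whole string is one well-formed quantity."""
    return QUANTITY_RE.fullmatch(text) is not None

print(is_valid('10 mm ± 0.5 mm'))         # True: symmetric tolerance
print(is_valid('10 mm +0.2 mm/-0.1 mm'))  # True: asymmetric tolerance
print(is_valid('≤5V'))                    # True: inequality with a unit
print(is_valid('10 mm ± 0.5 kg'))         # False: tolerance unit differs
```

The back-reference is what rejects the last case: the tolerance has to repeat the exact unit text that was captured, so "mm" followed by "kg" fails even though both are valid units on their own.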

The goal is actually to highlight non-conforming quantities so that they can be corrected. I suspect that will be more than a minor challenge.
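One possible way to approach the highlighting, sketched under heavy assumptions: use a deliberately loose "candidate" pattern to find anything that looks like a number followed by a unit-ish word, then flag the candidates that a strict pattern rejects. The names KNOWN_UNITS, STRICT_RE, CANDIDATE_RE, and highlight, the tiny unit list, and the [[...]] markers are all hypothetical; the strict pattern here is a toy, not the full expression above.

```python
import re

# Hypothetical, tiny unit list standing in for the full unit grammar
KNOWN_UNITS = r'(?:mm|kg|V)'

# Strict: a number, an optional space, and a known unit
STRICT_RE = re.compile(r'[\d.]+\s?' + KNOWN_UNITS)

# Loose: a number followed by any run of letters (will over-match)
CANDIDATE_RE = re.compile(r'\d[\d.]*\s?[A-Za-z]+')

def highlight(text):
    """Wrap candidates that fail the strict pattern in [[...]] markers."""
    def mark(match):
        candidate = match.group(0)
        if STRICT_RE.fullmatch(candidate):
            return candidate              # conforming: leave untouched
        return '[[' + candidate + ']]'    # non-conforming: flag it
    return CANDIDATE_RE.sub(mark, text)

print(highlight('Cut 10 mm from the 5 kgs bar at 12 volts'))
# Cut 10 mm from the [[5 kgs]] bar at [[12 volts]]
```

The weak spot is the candidate pattern: anything like "3 apples" gets flagged too, which is probably where most of the challenge lies.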