Pitfalls in UK postcode validation

Lukas Mai (mauke)
The Perl Conference Glasgow, 2018-08-17

================================================================================

- we want to validate the format of postal codes
- including international addresses
- trivial for most countries: e.g. five digits ([0-9]{5}) for Germany
- unexpectedly difficult: UK

================================================================================

First stop: Wikipedia

https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation

- many possible variants
- many rules and restrictions
- at the end: a regex!

================================================================================

Wikipedia:

    The UK government has also provided the following regular expression that
    can be used for the purpose of validation:[27]

    ^(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))$

    27. ^ "BULK DATA TRANSFER: ADDITIONAL VALIDATION FOR CAS UPLOAD" (PDF)

================================================================================

https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/488478/Bulk_Data_Transfer_-_additional_validation_valid_from_12_November_2015.pdf

3.1 Expression

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$

3.2 Logic

"GIR 0AA"
OR
One letter followed by either one or two numbers
OR
One letter followed by a second letter that must be one of ABCDEFGHJ
KLMNOPQRSTUVWXY (i.e..not I) and then followed by either one or two numbers
OR
One letter followed by one number and then another letter
OR
A two part post code where the first part must be
One letter followed by a second letter that must be one of ABCDEFGH
JKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and optionally a
further letter after that
AND
The second part (separated by a space from the first part) must be One number
followed by two letters.

A combination of upper and lower case characters is allowed.

Note: the length is determined by the regular expression and is between 2 and 8
characters.
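================================================================================

Aside: how do ^ and $ bind in the 3.1 expression?

The 3.1 expression has the shape ^A|B$, while the copy on Wikipedia wraps the
alternation as ^(?:A|B)$. The sketch below is not from the PDF or from
Wikipedia. It uses Python's re module, and the names OFFICIAL, WIKIPEDIA and
accepts are made up for illustration. It assumes the validator runs an
unanchored search and relies on ^ and $ inside the pattern.

```python
import re

# Expression from section 3.1 of the PDF, copied verbatim (split for width).
OFFICIAL_SOURCE = (
    r"^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})"
    r"|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})"
    r"|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))"
    r" [0-9][A-Za-z]{2})$"
)

# The Wikipedia copy is the same body wrapped in ^(?: ... )$.
WIKIPEDIA_SOURCE = "^(?:" + OFFICIAL_SOURCE[1:-1] + ")$"

OFFICIAL = re.compile(OFFICIAL_SOURCE)
WIKIPEDIA = re.compile(WIKIPEDIA_SOURCE)


def accepts(regex, text):
    """Return True if an unanchored search finds a match anywhere in text."""
    return regex.search(text) is not None


if __name__ == "__main__":
    samples = (
        "SW1A 1AA",
        "GIR 0AA plus trailing junk",
        "leading junk SW1A 1AA",
    )
    for text in samples:
        print(repr(text), accepts(OFFICIAL, text), accepts(WIKIPEDIA, text))
```

Since | binds more loosely than ^ and $, the PDF version anchors "GIR 0AA" only
at the start and everything else only at the end, so both junk samples get
through. The wrapped version rejects them.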
================================================================================

^
( [Gg][Ii][Rr][ ]0[Aa]{2} )
|
(
    (
        ( [A-Za-z][0-9]{1,2} )
        |
        (
            ( [A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2} )
            |
            (
                ( [A-Za-z][0-9][A-Za-z] )
                |
                ( [A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z] )
            )
        )
    )
    [ ][0-9][A-Za-z]{2}
)
$

================================================================================

^
( GIR[ ]0A{2} )
|
(
    (
        ( [A-Z][0-9]{1,2} )
        |
        (
            ( [A-Z][A-HJ-Y][0-9]{1,2} )
            |
            (
                ( [A-Z][0-9][A-Z] )
                |
                ( [A-Z][A-HJ-Y][0-9]?[A-Z] )
            )
        )
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
GIR[ ]0A{2}
|
(
    [A-Z][0-9]{1,2}
  | [A-Z][A-HJ-Y][0-9]{1,2}
  | [A-Z][0-9][A-Z]
  | [A-Z][A-HJ-Y][0-9]?[A-Z]
)
[ ][0-9][A-Z]{2}
$

================================================================================

^
GIR[ ]0A{2}
|
(?:
    [A-Z][0-9]{1,2}
  | [A-Z][A-HJ-Y][0-9]{1,2}
  | [A-Z][0-9][A-Z]
  | [A-Z][A-HJ-Y][0-9]?[A-Z]
)
[ ][0-9][A-Z]{2}
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    (?:
        [A-Z][0-9]{1,2}
      | [A-Z][A-HJ-Y][0-9]{1,2}
      | [A-Z][0-9][A-Z]
      | [A-Z][A-HJ-Y][0-9]?[A-Z]
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    (?:
        [A-Z][0-9]{1,2}
      | [A-Z][A-HJ-Y][0-9]{1,2}
      | [A-Z][0-9][A-Z]
      | [A-Z][A-HJ-Y][0-9]?[A-Z]
    )
    [ ][0-9][A-Z]{2}
)
$

3.2 Logic

...
OR
...
OR
...
OR
...
OR
...
AND
...

================================================================================

[A-Z][A-HJ-Y][0-9]{1,2}

"One letter followed by a second letter that must be one of ABCDEFGHJ
KLMNOPQRSTUVWXY (i.e..not I)"

What about Z?

================================================================================

[A-Z][A-HJ-Y][0-9]{1,2}

"One letter followed by a second letter that must be one of ABCDEFGHJ
KLMNOPQRSTUVWXY (i.e..not I)"

What about Z?

Wikipedia: "The letters IJZ are not used in the second position."

What about J?
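================================================================================

Aside: which second letters does each source allow?

The three statements about the second position can be compared mechanically.
The sketch below is an illustration in Python, not part of the talk or the PDF.
The variable names are hypothetical, and the letter list from section 3.2 is
typed in with its embedded space removed.

```python
import re
import string

LETTERS = set(string.ascii_uppercase)

# Character class used for the second position in the 3.1 expression.
second_letter_class = re.compile(r"[A-HJ-Y]")
regex_allows = {c for c in LETTERS if second_letter_class.fullmatch(c)}

# Letter list spelled out in section 3.2 of the PDF.
pdf_list_allows = set("ABCDEFGHJKLMNOPQRSTUVWXY")

# Wikipedia: "The letters IJZ are not used in the second position."
wikipedia_allows = LETTERS - set("IJZ")

excluded_by_regex = LETTERS - regex_allows
regex_but_not_wikipedia = regex_allows - wikipedia_allows

if __name__ == "__main__":
    print("excluded by the regex:", sorted(excluded_by_regex))
    print("regex agrees with the 3.2 list:", regex_allows == pdf_list_allows)
    print("allowed by regex, not by Wikipedia:", sorted(regex_but_not_wikipedia))
```

The class and the spelled-out list agree with each other: both drop I and Z,
although the parenthetical in 3.2 only mentions I. Compared with Wikipedia's
rule, the only extra letter the regex lets through is J.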
================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    (?:
        [A-Z][0-9]{1,2}
      | [A-Z][A-HJ-Y][0-9]{1,2}
      | [A-Z][0-9][A-Z]
      | [A-Z][A-HJ-Y][0-9]?[A-Z]
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

[A-Z][A-HJ-Y][0-9]?[A-Z]

"One letter followed by a second letter that must be one of ABCDEFGH
JKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and optionally a
further letter after that"

[0-9]?[A-Z] makes the digit optional, not the following letter.

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    (?:
        [A-Z][0-9]{1,2}
      | [A-Z][A-HJ-Y][0-9]{1,2}
      | [A-Z][0-9][A-Z]
      | [A-Z][A-HJ-Y][0-9]?[A-Z]
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    (?:
        [A-Z][0-9]{1,2}
      | [A-Z][A-HJ-Y][0-9]{1,2}
      | [A-Z][0-9][A-Z]
      | [A-Z][A-HJ-Y][0-9][A-Z]?
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    [A-Z]
    (?:
        [0-9]{1,2}
      | [A-HJ-Y][0-9]{1,2}
      | [0-9][A-Z]
      | [A-HJ-Y][0-9][A-Z]?
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    [A-Z]
    (?:
        [0-9]{1,2}
      | [A-HJ-Y][0-9]{1,2}
      | [0-9][A-Z]
      | [A-HJ-Y][0-9][A-Z]?
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    [A-Z]
    (?:
        [0-9][0-9]?
      | [A-HJ-Y][0-9][0-9]?
      | [0-9][A-Z]
      | [A-HJ-Y][0-9][A-Z]?
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    [A-Z]
    (?:
        [A-HJ-Y]?[0-9][0-9]?
      | [0-9][A-Z]
      | [A-HJ-Y][0-9][A-Z]?
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    [A-Z]
    (?:
        [A-HJ-Y]?[0-9][0-9]?
      | [0-9][A-Z]
      | [A-HJ-Y][0-9][A-Z]?
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    [A-Z]
    (?:
        [A-HJ-Y]?[0-9][0-9]?
      | [0-9][A-Z]
      | [A-HJ-Y][0-9][A-Z]
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    [A-Z]
    (?:
        [A-HJ-Y]?[0-9][0-9]?
      | [A-HJ-Y]?[0-9][A-Z]
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    [A-Z]
    (?:
        [A-HJ-Y]?[0-9][0-9]?
      | [A-HJ-Y]?[0-9][A-Z]
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    [A-Z]
    [A-HJ-Y]?[0-9]
    (?:
        [0-9]?
      | [A-Z]
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    [A-Z]
    [A-HJ-Y]?[0-9]
    (?:
        [0-9]?
      | [A-Z]
    )
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    [A-Z]
    [A-HJ-Y]?[0-9]
    (?:
        [0-9]
      | [A-Z]
    )?
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    [A-Z]
    [A-HJ-Y]?[0-9]
    (?:
        [0-9A-Z]
    )?
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR[ ]0A{2}
  |
    [A-Z]
    [A-HJ-Y]?[0-9]
    [0-9A-Z]?
    [ ][0-9][A-Z]{2}
)
$

================================================================================

^
(?:
    GIR [ ] 0AA
  |
    [A-Z] [A-HJ-Y]? [0-9] [0-9A-Z]? [ ] [0-9] [A-Z]{2}
)
$

================================================================================

^
(?:
    GIR [ ] 0AA
  |
    [A-Z] [A-HJ-Y]? [0-9] [0-9A-Z]? [ ] [0-9] [A-Z]{2}
)
$

Conclusions:

- the official regex is complicated and wrong
- the official explanation is also wrong, but in different ways
- the regex from Wikipedia is complicated and wrong in a third way
- the explanation on Wikipedia is probably (?) correct

================================================================================
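
Aside: trying out the final expression

A minimal sketch of the simplified regex in use, written in Python because
re.VERBOSE accepts the same whitespace-insensitive layout as the slides. The
names FINAL, is_valid_format and outward_forms_agree are hypothetical. The
second half spot-checks, over a small sample alphabet rather than exhaustively,
that the simplified outward part accepts the same strings as the four-way
alternation with the [0-9][A-Z]? fix applied.

```python
import re
from itertools import product

# Final expression from the last slide, uppercase only as on the slides.
FINAL = re.compile(
    r"""
    ^ (?:
        GIR [ ] 0AA
      |
        [A-Z] [A-HJ-Y]? [0-9] [0-9A-Z]?   # first part
        [ ]
        [0-9] [A-Z]{2}                    # second part
    ) $
    """,
    re.VERBOSE,
)


def is_valid_format(code):
    """Format check only; says nothing about whether the postcode exists."""
    # fullmatch avoids '$' also matching just before a trailing newline.
    return FINAL.fullmatch(code) is not None


# First part before and after the simplification steps on the slides.
LONG_OUTWARD = re.compile(
    r"[A-Z](?:[0-9]{1,2}|[A-HJ-Y][0-9]{1,2}|[0-9][A-Z]|[A-HJ-Y][0-9][A-Z]?)"
)
SHORT_OUTWARD = re.compile(r"[A-Z][A-HJ-Y]?[0-9][0-9A-Z]?")

# Representative characters: letters inside and outside [A-HJ-Y], two digits.
SAMPLE = "AHIJYZ09"


def outward_forms_agree(max_len=5):
    """Compare both forms on every SAMPLE string up to max_len characters."""
    for length in range(1, max_len + 1):
        for chars in product(SAMPLE, repeat=length):
            text = "".join(chars)
            long_ok = LONG_OUTWARD.fullmatch(text) is not None
            short_ok = SHORT_OUTWARD.fullmatch(text) is not None
            if long_ok != short_ok:
                return False
    return True


if __name__ == "__main__":
    for code in ("SW1A 1AA", "M1 1AA", "GIR 0AA", "AJ1 1AA", "AAA 1AA"):
        print(code, is_valid_format(code))
    print("outward forms agree on sample:", outward_forms_agree())
```

"AAA 1AA" is accepted by the 3.1 expression (via [0-9]?[A-Z]) but rejected
here. "AJ1 1AA" still passes, so the open question about J in the second
position remains. Mixed-case input would need normalising first, which this
sketch leaves to the caller.

================================================================================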