I don't (need to) care about performance!

My regex matches the date format dd.mm.yyyy

((([0][1-9]|[12][\d])|[3][01])[./]([0][13578]|[1][02])[./][1-9]\d\d\d)|((([0][1-9]|[12][\d])|[3][0])[./]([0][13456789]|[1][012])[./][1-9]\d\d\d)|(([0][1-9]|[12][\d])[-/][0][2][./][1-9]\d([02468][048]|[13579][26]))|(([0][1-9]|[12][0-8])[./][0][2][./][1-9]\d\d\d)

Here are the dates my regex does not match yet. Any help appreciated.

09. Juni 1997
01.Aug.1995
27.06. 1997
29.02.1996
21. 01. 1999
28.05. 1996
07..4..1995
20:03:1998
9.4.1997
14 .03 - 1995

I started out by trying to add the month names but failed (probably because of the whitespace between the parts).

Here is a regex that validates the spelling of the month names (Januar, Februar, März, April, Mai, Juni, Juli, August, September, Oktober, November, Dezember):

(?:J(anuar|u(n|li))|Februar|Mä(rz|i)|A(pril|ugust)|(((Sept|Nov|Dez)em)|Okto)ber)

I found this on the internet; it focuses on the case where only the first 3 letters of the month are available:

(((([1-9])|([0][1-9])|([1-2][0-9])|(30))\-([A,a][P,p][R,r]|[J,j][U,u][N,n]|[S,s][E,e][P,p]|[N,n][O,o][V,v]))|((([1-9])|([0][1-9])|([1-2][0-9])|([3][0-1]))\-([J,j][A,a][N,n]|[M,m][A,a][R,r]|[M,m][A,a][Y,y]|[J,j][U,u][L,l]|[A,a][U,u][G,g]|[O,o][C,c][T,t]|[D,d][E,e][C,c])))\-[0-9]{4}$)|(^(([1-9])|([0][1-9])|([1][0-9])|([2][0-8]))\-([F,f][E,e][B,b])\-[0-9]{2}(([02468][1235679])|([13579][01345789]))$)|(^(([1-9])|([0][1-9])|([1][0-9])|([2][0-9]))\-([F,f][E,e][B,b])\-[0-9]{2}(([02468][048])|([13579][26]))
  • You could try the dateparser package. Commented Oct 24, 2021 at 9:09

1 Answer

You can use

pattern = r"""(?x)(?<!\d)(?:
  (?:(?:0?[1-9]|[12]\d)|3[01])\s?[./:-][\s.]?(?:0?[13578]|1[02]|J(?:an(?:uar)?|uli?)|M(?:ärz?|ai)|Aug(?:ust)?|Dez(?:ember)?|Okt(?:ober)?)\s?(?:[./:-][\s.]?)?[1-9]\d\d\d|
  (?:(?:0?[1-9]|[12]\d)|30)\s?[./:-][\s.]?(?:0?[13-9]|1[012]|J(?:an(?:uar)?|u[nl]i?)|M(?:ärz?|ai)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|(?:Nov|Dez)(?:ember)?|Okt(?:ober)?)\s?(?:[./:-][\s.]?)?[1-9]\d\d\d|
  (?:0?[1-9]|[12]\d)\s?[./:-][\s.]?(?:0?2|Fe(?:b(?:ruar)?)?)\s?(?:[./:-][\s.]?)?[1-9]\d(?:[02468][048]|[13579][26])|
  (?:0?[1-9]|1\d|2[0-8])\s?[./:-][\s.]?(?:0?2|Fe(?:b(?:ruar)?)?)\s?(?:[./:-][\s.]?)?[1-9]\d\d\d
)(?!\d)"""

See the regex demo.

Main points of interest:

  • The month regex is (?:J(?:an(?:uar)?|u[nl]i?)|Fe(?:b(?:ruar)?)?|M(?:ärz?|ai)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|(?:Nov|Dez)(?:ember)?|Okt(?:ober)?) and it is tested here. Adjust for shortenings as you see fit.
  • The February pattern is used separately in the last two alternatives (they are specific to February) and is removed from the month pattern in the other alternatives
  • In the first alternative, which covers 31-day months, February, April, June, September and November are removed from the month pattern
  • Leading zeros in days and months are made optional by adding a ? quantifier after 0
  • The separator between days and months is changed to \s?[./:-][\s.]?: an optional whitespace, a char from ./:- char set, and then an optional whitespace or .
  • The separator between months and years is changed to \s?(?:[./:-][\s.]?)?: an optional whitespace and then an optional sequence of a char from ./:- char set followed by an optional whitespace or . char
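
The building blocks listed above can be checked in isolation before they are combined. The following is a minimal sketch, assuming Python's re module; the names MONTH, DAY_MONTH_SEP, MONTH_YEAR_SEP, is_month and is_separator are hypothetical helpers introduced only for this illustration.

```python
import re

# Month sub-pattern as given in the answer (full and abbreviated German names).
MONTH = (
    r"(?:J(?:an(?:uar)?|u[nl]i?)|Fe(?:b(?:ruar)?)?|M(?:ärz?|ai)"
    r"|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|(?:Nov|Dez)(?:ember)?|Okt(?:ober)?)"
)
# Separator between day and month: optional whitespace, one of ./:-,
# then an optional whitespace or dot.
DAY_MONTH_SEP = r"\s?[./:-][\s.]?"
# Separator between month and year: the punctuation part is optional as a whole.
MONTH_YEAR_SEP = r"\s?(?:[./:-][\s.]?)?"

month_re = re.compile(MONTH)


def is_month(text: str) -> bool:
    """Return True if the whole string is an accepted month spelling."""
    return month_re.fullmatch(text) is not None


def is_separator(sep_pattern: str, text: str) -> bool:
    """Return True if the whole string is accepted by the given separator pattern."""
    return re.fullmatch(sep_pattern, text) is not None


if __name__ == "__main__":
    for name in ["Januar", "Jun", "Juli", "März", "Sept", "May"]:
        print(f"{name!r}: {is_month(name)}")
    for sep in [".", ". ", "..", " - ", ""]:
        print(f"{sep!r}: day/month={is_separator(DAY_MONTH_SEP, sep)}, "
              f"month/year={is_separator(MONTH_YEAR_SEP, sep)}")
```

Abbreviations that the month pattern does not list, such as "Sept", are rejected, which is where the "adjust for shortenings" remark applies.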

The numeric boundaries, (?<!\d) and (?!\d), are added at both ends so that the match is neither preceded nor followed by another digit.
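
To see the pattern at work on the dates from the question, here is a small sketch. It assumes the lookbehind is spelled (?<!\d) and the day range of the non-leap February alternative is spelled 0?[1-9]|1\d|2[0-8]; the names date_re, samples and find_dates are hypothetical and exist only for this example.

```python
import re

# The answer's pattern in verbose mode; the assumptions named in the
# lead-in affect the first line (lookbehind) and the last alternative.
pattern = r"""(?x)(?<!\d)(?:
  (?:(?:0?[1-9]|[12]\d)|3[01])\s?[./:-][\s.]?(?:0?[13578]|1[02]|J(?:an(?:uar)?|uli?)|M(?:ärz?|ai)|Aug(?:ust)?|Dez(?:ember)?|Okt(?:ober)?)\s?(?:[./:-][\s.]?)?[1-9]\d\d\d|
  (?:(?:0?[1-9]|[12]\d)|30)\s?[./:-][\s.]?(?:0?[13-9]|1[012]|J(?:an(?:uar)?|u[nl]i?)|M(?:ärz?|ai)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|(?:Nov|Dez)(?:ember)?|Okt(?:ober)?)\s?(?:[./:-][\s.]?)?[1-9]\d\d\d|
  (?:0?[1-9]|[12]\d)\s?[./:-][\s.]?(?:0?2|Fe(?:b(?:ruar)?)?)\s?(?:[./:-][\s.]?)?[1-9]\d(?:[02468][048]|[13579][26])|
  (?:0?[1-9]|1\d|2[0-8])\s?[./:-][\s.]?(?:0?2|Fe(?:b(?:ruar)?)?)\s?(?:[./:-][\s.]?)?[1-9]\d\d\d
)(?!\d)"""
date_re = re.compile(pattern)

# The ten dates listed in the question.
samples = [
    "09. Juni 1997", "01.Aug.1995", "27.06. 1997", "29.02.1996",
    "21. 01. 1999", "28.05. 1996", "07..4..1995", "20:03:1998",
    "9.4.1997", "14 .03 - 1995",
]


def find_dates(text: str) -> list[str]:
    """Return every substring of text that the pattern accepts as a date."""
    return [m.group(0) for m in date_re.finditer(text)]


if __name__ == "__main__":
    for s in samples:
        print(f"{s!r}: {date_re.fullmatch(s) is not None}")
    # Invalid calendar dates and digit-adjacent strings are expected to be skipped.
    for s in ["31.04.1997", "29.02.1997", "129.02.1996"]:
        print(f"{s!r}: {find_dates(s)}")
```

The leap-year alternative only inspects the last two digits of the year, so it also accepts 29.02.1900 even though 1900 was not a leap year.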

2 Comments

It looks like you made every single group a non-capturing one -- why? Wouldn't it work just as well with regular groups? The ?: seem to do nothing except adding additional clutter to this already extremely complex regex. (Very impressive nonetheless!)
@fsimonjetz See "Are non-capturing groups redundant?". Capturing groups need extra memory allocation for the captured substrings, and that slows it all down.
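
Apart from the memory argument, the choice of group type also changes what Python's re.findall returns. A minimal sketch with deliberately simplified stand-in patterns (not the full date regex); the names text, capturing and non_capturing are hypothetical:

```python
import re

text = "01.Aug.1995 und 20:03:1998"

# Simplified stand-ins: two digits, a separator, a month, a separator, a year.
capturing = re.compile(r"(\d\d)[.:](\d\d|[A-Z][a-z]{2})[.:](\d{4})")
non_capturing = re.compile(r"(?:\d\d)[.:](?:\d\d|[A-Z][a-z]{2})[.:](?:\d{4})")

# Number of capturing groups the engine has to track for each pattern.
print(capturing.groups, non_capturing.groups)

# With capturing groups, findall returns one tuple of group contents per match;
# without them, it returns the full matched strings.
print(capturing.findall(text))
print(non_capturing.findall(text))
```

With non-capturing groups, findall yields the complete date strings directly, which is usually what is wanted when the regex is only used to locate dates.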