7 Comments

The pedant in me has to report your regex is buggy!

More specifically: the 1970 iteration will match lines like:

DOB 05-24-1952; SSN 123-74-1970; etc

I'm not sure whether such lines exist, but having cleaned messy data before, I worry. No, this is unlikely to change your results meaningfully, but it's the principle of the thing. (Also, is each person's record written on a single long line? If not, you have a second problem!)
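To make the false positive concrete, here is a small sketch. I'm assuming the year loop does something like a bare grep for the year, which is only my guess at the original, and the record is the made-up line from above.

```shell
#!/bin/sh
# Hypothetical record: born in 1952, but "1970" shows up inside the SSN.
record='DOB 05-24-1952; SSN 123-74-1970; etc'

# Assumed naive check: count lines that contain the year anywhere.
naive=$(printf '%s\n' "$record" | grep -c '1970' || true)

# Stricter check: the year must appear inside the DOB field,
# i.e. after "DOB" and before the next ';'.
strict=$(printf '%s\n' "$record" | grep -c 'DOB[^;]*1970' || true)

echo "naive=$naive strict=$strict"
```

The naive count picks up the SSN, while the field-anchored pattern does not. The anchored pattern still assumes the whole record sits on one line.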

Doing this in full generality is difficult, but what I'd probably try is something like (not really tested):

grep DOB file.txt | sed -e 's/.*DOB *\([^;]*\);.*/\1/'

to extract all the DOB strings, then pipe the result to [grep -v '^....$'] to find anything that isn't just a four-character year, and build further sed expressions to canonicalize each such class of lines until every line is just a year.
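Here is a rough sketch of that loop on some made-up records. The function name extract_dob and the sample lines are mine, purely for illustration. This variant also skips the spaces after DOB so that a bare year comes out as exactly four characters, and it checks for four digits rather than any four characters.

```shell
#!/bin/sh
# extract_dob (hypothetical helper): read records on stdin and
# print the contents of each DOB field, one per line.
extract_dob() {
    grep 'DOB' | sed -e 's/.*DOB *\([^;]*\);.*/\1/'
}

# Made-up sample records, one per line.
sample='DOB 05-24-1952; SSN 123-74-1970; etc
DOB 1970; SSN 000-00-0000; etc
DOB circa 1948; SSN 111-11-1111; etc'

# Anything that is not exactly four digits still needs a rule.
leftovers=$(printf '%s\n' "$sample" | extract_dob | grep -v '^[0-9]\{4\}$')
printf 'still messy:\n%s\n' "$leftovers"

# One canonicalizing rule, covering only the MM-DD-YYYY class.
years=$(printf '%s\n' "$sample" | extract_dob |
    sed -e 's/^[0-9][0-9]-[0-9][0-9]-\([0-9]\{4\}\)$/\1/')
printf 'after one rule:\n%s\n' "$years"
```

After the first rule the "circa 1948" line is still left over, which is the point: you keep adding one sed expression per class of mess until the grep -v step prints nothing.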

Normalization is really hard!
