7 Comments

The pedant in me has to report your regex is buggy!

More specifically: the 1970 iteration will match lines like:

DOB 05-24-1952; SSN 123-74-1970; etc

I'm not sure whether such lines exist, but having cleaned messy data before, I worry. This is unlikely to change your results meaningfully, but it's the principle of the thing. (Also, is each person's record written on a single long line? If not, you have a second problem!)

Doing this in full generality is difficult, but what I'd probably try is something like this (not really tested):

grep DOB file.txt | sed -e 's/.*DOB \([^;]*\);.*/\1/'

to extract all the DOB strings, then pipe the result to [grep -v '^....$'] to find anything that is not just a four-character year, and build further sed expressions to canonicalize each such class of lines until every line is just a year.

Normalization is really hard!

author

Well that's concerning! Thanks for pointing it out, I'll look into it. I definitely agree that this is tricky.


I mean, in practice, for rough counts you're fine. I wouldn't use your system for anything that could kill people or lose me money, but for a blog post you're probably OK. But next time you find yourself in such a world...

(Also, my approach of normalizing first and then counting once is asymptotically faster, but again, who cares...)
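A minimal sketch of the normalize-then-count idea, not tested on the real file. It assumes every DOB field has the form "DOB <text without a semicolon>YYYY;", and the function name count_dob_years is made up for illustration. If a line carries several DOB fields, only the last one is counted.

```shell
# Hypothetical helper: one pass over the file instead of one grep per year.
# Assumption: a DOB field looks like "DOB <anything but ';'>YYYY;".
count_dob_years() {
  # Reduce each matching line to its 4-digit birth year, drop all other lines,
  # then count how often each year occurs and print "year, count".
  sed -n 's/.*DOB [^;]*\([0-9]\{4\}\);.*/\1/p' "$@" |
    sort |
    uniq -c |
    awk '{print $2 ", " $1}'
}

# Usage: count_dob_years sdnlist.txt
```

Because the year has to sit directly before the semicolon that closes the DOB field, a later field such as an SSN ending in 1970 cannot be mistaken for the birth year.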

author

What do you think about this: Just change the expression to limit the number of characters in the wildcard?

for i in {1910..2010}
do
    c=($(grep "DOB .\{0,7\}$i;" sdnlist.txt | wc -l))
    echo $i, $c
done
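A quick way to sanity-check the bounded pattern is to run it against the problem line from earlier in the thread. This is only a sketch: the helper name matches_year is made up, and the third sample line is an invented date format, not one taken from sdnlist.txt.

```shell
# Hypothetical helper: does LINE count as a birth in YEAR under the bounded pattern?
# usage: matches_year YEAR LINE
matches_year() {
  printf '%s\n' "$2" | grep -q "DOB .\{0,7\}$1;"
}

# The SSN ending in 1970 is more than 7 characters past "DOB ", so no match.
matches_year 1970 'DOB 05-24-1952; SSN 123-74-1970; etc' || echo "not counted for 1970"
# The real birth year is still found.
matches_year 1952 'DOB 05-24-1952; SSN 123-74-1970; etc' && echo "counted for 1952"
# "24 May " is exactly 7 characters, so this format still fits.
matches_year 1970 'DOB 24 May 1970; etc' && echo "counted for 1970"
```

The flip side of the 7-character limit is that any DOB format with more than seven characters before the year would be silently skipped, if such formats occur in the file.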

I tried this and it does make a small difference for a few of the years. From spot-checking, it seems correct. But... I noticed a couple of other minor things: (1) some people have multiple possible DOBs listed, and (2) some people have their DOB split across a newline, meaning they don't get matched at all.

(I guess I'm fine to just live with these imperfections, but it underlines your point that these things aren't easy!)
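For what it's worth, both imperfections could probably be handled by joining the lines first and then pulling out every DOB field separately. This is a rough sketch under two assumptions about the file that have not been checked: a line break only ever stands in for a space, and every DOB field ends in a semicolon. The name list_dob_fields is made up, and grep -o is a GNU/BSD extension rather than POSIX.

```shell
# Hypothetical helper: print every DOB field, one per line, even when a record
# wraps across lines or lists several DOBs.
# Assumptions: line breaks replace spaces; each DOB field ends with ';'.
list_dob_fields() {
  # Join the whole file into one long line, then emit each "DOB ...;" match.
  tr '\n' ' ' < "$1" | grep -o 'DOB [^;]*;'
}

# Usage: list_dob_fields sdnlist.txt
```

Whether a person with several listed DOBs should be counted once or once per DOB is a separate judgment call; this sketch simply lists them all.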


Yeah, that's most of what I wanted to express: not that any particular solution is best, but that this is really frustrating to do well.

author

OK I finally updated the code and the graph. The difference is very hard to see (fortunately) but it is there. Thanks again!
