AFAIK it's never made it into the legislature, but far-right defence larpers love suggesting letters of marque against China.
https://www.usni.org/magazines/proceedings/2020/april/unleash-privateers#:~:text=What%20are%20Letters%20of%20Marque,capture%20or%20destroy%20enemy%20ships.
https://warontherocks.com/2021/02/unfurl-the-banner-privateers-and-commerce-raiding-of-chinas-merchant-fleet-in-developing-markets/
The pedant in me has to report your regex is buggy!
More specifically: the 1970 iteration will match lines like:
DOB 05-24-1952; SSN 123-74-1970; etc
I'm not sure if they exist but having had experience with cleaning messy data I worry. No, this is not likely to change your results meaningfully but it's the point of the thing. (Also, is each person's record written on a single long line? If not, you have a second problem!)
Doing this in full generality is difficult but what I'd probably try is something like (not really tested):
grep DOB file.txt | sed -e 's/.*DOB *\([^;]*\);.*/\1/'
to extract all the DOB strings, then pipe the result to [grep -v '^....$'] to find anything not matching just a year, and build another sed expression to canonicalize each such class of lines until all lines are just a year.
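Here is a rough sketch of the "normalize, then count once" idea, for what it's worth. It assumes every DOB field looks like the example above: the label `DOB`, a space, some text ending in a four-digit year, then a semicolon. The function name `count_dob_years` is made up for illustration, and `grep -o` is an extension found in GNU and BSD grep rather than something POSIX guarantees.

```shell
#!/bin/sh
# Hypothetical helper: print "count year" pairs for a file of records.
# Assumption: each DOB field is "DOB <text ending in a 4-digit year>;".
count_dob_years() {
    grep -o 'DOB [^;]*;' "$1" |              # one DOB field per output line
        sed -e 's/^DOB *//' -e 's/;$//' |    # strip the label and the terminator
        sed -n 's/.*\([0-9]\{4\}\)$/\1/p' |  # keep only a trailing 4-digit year
        sort | uniq -c                       # count every year in a single pass
}
```

Something like `count_dob_years sdnlist.txt` would then replace the per-year loop. Fields that don't end in a four-digit year are silently dropped by the second sed, so it's still worth eyeballing the leftovers before trusting the counts.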
Normalization is really hard!
Well that's concerning! Thanks for pointing it out, I'll look into it. I definitely agree that this is tricky.
I mean in practice for rough counts you're fine. I wouldn't do anything that could kill people or lose my money with your system but you're ok as a blogger probably. But next time you find yourself in such a world...
(Also, my approach of normalizing and then counting once is asymptotically faster, but again, who cares...)
What do you think about this: Just change the expression to limit the number of characters in the wildcard?
for i in {1910..2010}
do
c=$(grep -c "DOB .\{0,7\}$i;" sdnlist.txt)
echo "$i, $c"
done
I tried this and it does make a small difference for a few of the years. Just from spot-checking it seems correct. But... I noticed a couple of other minor things: (1) some people have multiple possible DOBs listed, and (2) some people have their DOB split across a newline, meaning they don't get matched at all.
(I guess I'm fine to just live with these imperfections, but it underlines your point that these things aren't easy!)
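For what it's worth, here is a minimal sketch of one way to catch both cases. It assumes that a line break inside a record stands in for a single space, which may not hold for the real file. The function name `list_dob_fields` is hypothetical, and `grep -o` is a GNU/BSD extension.

```shell
#!/bin/sh
# Hypothetical helper: list every DOB field, even when a record wraps lines.
# Assumption: a newline inside a record can be treated as a single space.
list_dob_fields() {
    tr '\n' ' ' < "$1" |        # glue the whole file into one long line
        tr -s ' ' |             # squeeze runs of spaces left by the join
        grep -o 'DOB [^;]*;'    # emit each DOB field on its own line
}
```

This prints one line per listed DOB, so someone with several possible DOBs shows up several times. Whether that is the right way to count them is a judgment call, not something the sketch settles.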
Yeah, that's most of what I wanted to express: not that any particular solution is best, but that this is really frustrating to do well.
OK I finally updated the code and the graph. The difference is very hard to see (fortunately) but it is there. Thanks again!