I read an analysis of Drupal commits that included a chart of the most common words in Drupal commit messages. I thought it would be interesting to do something similar with WordPress Core, so I threw together a bash one-liner to find out. It’s not the most elegant solution, but it answers the question that I had. Here is what I initially came up with:
svn log http://develop.svn.wordpress.org/trunk -rHEAD:1 -v --xml | xq '.log.logentry | .[].msg' | sed 's/.$//' | sed 's/^.//' | sed 's/\\n/ /g' | tr ' \t' '\n' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n 25
Let’s walk through this, since there is enough piping going on that it may not be the easiest to follow.
svn log http://develop.svn.wordpress.org/trunk -rHEAD:1 -v --xml
I start by getting an XML version of the SVN history, from the current head back to the first changeset.
xq '.log.logentry | .[].msg'
Next, I use xq, which takes XML and allows me to run jq commands on it. It’s a handy tool if you ever need to use XML data on the command line. In this case, I am taking what is inside <log><logentry> and then, for each sub-element, extracting the msg. At this point, each message is on a single line, wrapped in quotation marks, with \n to signify newlines. So I run three seds to fix that up.
sed 's/.$//' | sed 's/^.//' | sed 's/\\n/ /g'
I’m sure there is a better way to do this, but the first one removes the last character (the closing quotation mark), the next one removes the first character (the opening quotation mark), and the last one converts the literal \n sequences to spaces. Since words are what we are aiming to look at, we need to get all the words onto their own lines.
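As a small illustration of what those three substitutions do, here is a sketch on a made-up message (the sample text is an assumption, not real commit data). It folds the three seds into one invocation and anchors on the quote character rather than on any character, which is one possible tidy-up, not necessarily a better one.

```shell
# A made-up message in the shape described above: wrapped in double quotes,
# with a literal backslash-n where the commit message had a line break.
sample='"Fix the bug.\nProps example."'

# Same three substitutions, expressed as one sed call:
#   s/^"//    drop the leading quote
#   s/"$//    drop the trailing quote
#   s/\\n/ /g turn each literal \n sequence into a space
cleaned=$(printf '%s\n' "$sample" | sed -e 's/^"//' -e 's/"$//' -e 's/\\n/ /g')

printf '%s\n' "$cleaned"
```

Anchoring on the quote means a line that does not start or end with a quotation mark is left alone, whereas the original version always removes one character from each end.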
tr ' \t' '\n'
tr is a powerful program for transforming text. In this case, I am taking spaces and tabs and turning them into actual newlines (as opposed to the escaped \n sequences from the previous step). There is likely a more elegant way to have solved this, but my goal isn’t the best solution, it’s the working one.
tr '[:upper:]' '[:lower:]'
Word and word are not equal, so we need to make everything a single case. In this case, I am again using tr, but now I am transforming uppercase characters to lowercase.
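To see the two tr steps together, here is a minimal sketch on a made-up line of text. The second set is written as two newlines so that both sets have the same length; what happens when the second set is shorter can differ between tr implementations, so this spelling is only a cautious assumption.

```shell
# Made-up input: two spaces after "Fixes" and a tab before "Props".
words=$(printf 'Fixes  a Bug.\tProps Example\n' \
  | tr ' \t' '\n\n' \
  | tr '[:upper:]' '[:lower:]')

# One token per line, all lowercase. The doubled space leaves an empty
# line between "fixes" and "a".
printf '%s\n' "$words"
```

Punctuation stays attached to the words (for example "bug."), since only spaces and tabs are treated as separators.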
sort | uniq -c | sort -nr | head -n 25
Counting things on the command line is something I have done so many times that I have an alias for a version of this. sort puts everything in alphabetical order; uniq -c then collapses each run of identical lines into one and prefixes it with how many times it occurred. uniq requires identical lines to be adjacent, hence the initial sort. Next, sort -nr orders by that count with the highest numbers first. Finally, head -n 25 outputs the top 25.
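The counting idiom can be wrapped in a small shell function. This is only a sketch: the name top_counts and its optional argument are hypothetical, and it is not the alias mentioned above.

```shell
# Hypothetical helper: count identical lines on stdin and print the N most
# frequent ones (N defaults to 25).
top_counts() {
  sort | uniq -c | sort -nr | head -n "${1:-25}"
}

# Made-up input: "fix" three times, "add" twice, "props" once.
result=$(printf '%s\n' fix add fix props fix add | top_counts 2)

# Prints the two most frequent words, each preceded by its count.
printf '%s\n' "$result"
```

The width of the padding that uniq -c puts in front of each count varies between implementations, so anything that parses this output should split on whitespace rather than rely on fixed columns.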
28997 the
27463
20429 fixes
17844 to
17818 props
15251 for
15189 in
14441 see
10856 and
10272 a
7549 of
5594 is
5227 when
5133 add
4444 from
4143 fix
3847 *
3821 on
3489 use
3320 that
3267 this
3064 with
3043 remove
2983 be
2766 as
That’s not super helpful. “The” isn’t my idea of interesting. So I guess I need to remove useless words. Since I have groff on this machine, I can use its eign file of common English words together with fgrep:
fgrep -v -w -f /usr/share/groff/1.19.2/eign
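Here is a minimal sketch of how that filter behaves, using a made-up three-word stop list in a temporary file in place of the groff file. It uses grep -F, which is the same fixed-string matching that fgrep provides.

```shell
# Made-up stop word list standing in for groff's eign file.
stopwords=$(mktemp)
printf '%s\n' the to for > "$stopwords"

# -F: treat patterns as fixed strings (what fgrep does)
# -v: print only lines that do NOT match
# -w: a pattern must match a whole word, so "for" does not remove "format"
# -f: read the patterns from a file, one per line
kept=$(printf '%s\n' the fixes to format for props | grep -F -v -w -f "$stopwords")
rm -f "$stopwords"

printf '%s\n' "$kept"
```

Note that -w checks word boundaries rather than whole lines, so a token such as "docs:" would also be dropped if "docs" were in the list. If only exact whole-line matches should be removed, -x in place of -w is an option.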
I also noticed that the second most common word is whitespace. Remember when we used to put two spaces between sentences? WordPress Core commit messages remember. So let’s add another sed command to the chain:
sed '/^$/d'
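One possible alternative to deleting the empty lines afterwards is to avoid creating most of them, by asking tr to squeeze repeated output characters with -s. This is a sketch on made-up input, not a drop-in replacement that has been run against the real log.

```shell
# Made-up input with a doubled space and a tab.
# -s squeezes each run of newlines produced by the translation into one,
# so the doubled space no longer leaves an empty line behind.
words=$(printf 'Fixes  a bug.\tProps example\n' | tr -s ' \t' '\n\n')

printf '%s\n' "$words"
```

An empty line can still appear at the very start of the stream if the input begins with whitespace, so the sed step remains the more thorough option.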
And now the final command to see the 25 most used words in WordPress Core commit messages:
svn log http://develop.svn.wordpress.org/trunk -rHEAD:1 -v --xml | xq '.log.logentry | .[].msg' | sed 's/.$//' | sed 's/^.//' | sed 's/\\n/ /g' | tr '[:upper:]' '[:lower:]' | tr ' \t' '\n' | fgrep -v -w -f /usr/share/groff/1.19.2/eign | sed '/^$/d' | sort | uniq -c | sort -nr | head -n 25
And since you’ve made it this far, here is the list:
20429 fixes
17818 props
15189 in
5594 is
5133 add
4143 fix
3847 *
3320 that
3267 this
3064 with
3043 remove
2766 as
2435 an
2432 it
2109 post
2103 if
2080 are
1889 don't
1793 update
1735 -
1688 twenty
1523 more
1500 make
1471 docs:
1416 some
Have an idea for another way to do this with the command line? I would love to hear it.