Counting Keywords

08/24/2015 dev1up Comments 0


Today I needed something that would count the occurrences of every word in a document so that I could see which were the most relevant. This is a fairly common task in search engine optimization and ad campaigns, and while you could manually tally up every word in a document, that would be quite tedious. While I was searching, I came across this pretty nifty bash script. It did almost everything I wanted; I just had to refine it to get the type of results I was looking for.

How Does It Work?

If you’ve ever used a bash script before, you’ve probably come across cat as well. cat concatenates and prints files. We will use it in conjunction with a pipe | to “pipe” the contents of the file into the rest of our script.

tr is a useful tool and appears throughout this script. It can translate, squeeze, and delete characters. First, it translates all uppercase characters to lowercase. This is useful because uniq, explained below, is case sensitive, which means that “the” and “The” would count as two separate words. tr is also used to replace all spaces with newlines, so that each word sits on its own line, and to delete punctuation. Removing punctuation matters because otherwise uniq would treat “end” and “end.” or “email” and “e-mail” as different words.
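As a rough illustration of the three tr steps described above, here is a minimal sketch. The function name normalize_words is made up for this example, and the exact tr invocations are an assumption rather than a copy of the original script.

```shell
# Hypothetical helper: read text on stdin, write one lowercase word per line.
normalize_words() {
  tr '[:upper:]' '[:lower:]' |   # fold case so "The" and "the" match
    tr -d '[:punct:]' |          # delete punctuation: "end." -> "end", "e-mail" -> "email"
    tr -s '[:space:]' '\n'       # turn each run of whitespace into a single newline
}

printf 'The end. Send e-mail to the END\n' | normalize_words
```

Note that the character classes are quoted so the shell passes them to tr untouched.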

grep is our simple search and filter tool. It prints the lines that match a pattern, or, when inverted, the lines that do not. In this script it finds empty lines and removes them.
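One way to filter out empty lines with grep is to invert the match with -v. This is only a sketch: the helper name drop_blank_lines is hypothetical, and the pattern here also drops whitespace-only lines, which may be stricter than what the original script does.

```shell
# Hypothetical helper: pass through only lines that contain a non-blank character.
drop_blank_lines() {
  # -v inverts the match, so lines that are empty or all whitespace are removed.
  grep -v '^[[:space:]]*$'
}

printf 'alpha\n\nbeta\n   \ngamma\n' | drop_blank_lines
```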

sort does exactly what you would expect: it sorts its input. This matters here because uniq needs sorted input to work correctly. After uniq is run, sort is used again to order the keyword counts from most to least common.

Once our input is sorted, we can ask uniq (with its -c option) to count the number of times each line is repeated. It is important to run sort before uniq, because uniq only detects repeated lines when they are next to each other.
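Putting the steps described in this section together, the whole pipeline might look something like the sketch below. It is assembled from the description in this post, not copied from the original script, and the function name count_keywords is an assumption made for the example.

```shell
# Hypothetical function: print word counts for the file given as $1, most common first.
count_keywords() {
  cat "$1" |
    tr '[:upper:]' '[:lower:]' |  # fold case
    tr -d '[:punct:]' |           # delete punctuation
    tr -s '[:space:]' '\n' |      # one word per line
    grep -v '^$' |                # drop empty lines
    sort |                        # put identical words next to each other for uniq
    uniq -c |                     # prefix each distinct word with its count
    sort -rn                      # numeric sort, highest counts first
}

# Example run against a throwaway file: show the two most common words.
sample=$(mktemp)
printf 'The cat saw the dog.\nThe dog ran.\n' > "$sample"
count_keywords "$sample" | head -n 2
rm -f "$sample"
```

In this sample, “the” appears three times and “dog” twice, so those are the two lines shown. The amount of padding uniq -c puts in front of each count varies between implementations.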

Improvements

  • One thing I noticed while using this script is that tr -d '[:punct:]' does not remove special characters outside the ASCII punctuation set, such as bullets (the • character, not asterisks).
  • It would also be beneficial to have a blacklist of common words to exclude from the count, such as “the”.
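
The blacklist idea could be sketched with grep reading its patterns from a file. Everything here is an assumption for illustration: the helper name remove_stopwords, the stop-word file, and its contents are not part of the original script. The filter expects one word per line, so it would sit after the tr steps and before sort.

```shell
# Hypothetical helper: drop every input line that exactly matches a line
# in the stop-word file passed as $1 (one stop word per line).
remove_stopwords() {
  # -F: treat patterns as fixed strings, -x: match whole lines only,
  # -v: keep the lines that do NOT match, -f: read patterns from a file.
  grep -vxF -f "$1"
}

# Example with a throwaway stop-word list.
stopwords=$(mktemp)
printf 'the\na\nand\n' > "$stopwords"
printf 'the\ncat\nand\nthe\ndog\n' | remove_stopwords "$stopwords"
rm -f "$stopwords"
```

Because of -x, a stop word such as “the” removes only the line “the” and leaves longer words like “then” alone.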
