The issue of overused words is a tricky one. For me, the problem occurs when I revisit my writing a few days later and make a few edits. I may find a word I would like to replace with a better match, forgetting that the same word was already used a couple of times elsewhere in the text.
I’ve been using Grammarly for some time now, and lately I’ve also tried out TypeAI. Both can help you with syntax, punctuation, and overall style. However, neither tool does a good job of tracking overused words.
If you use ChatGPT or another LLM platform to help with your writing or editing, the problem of overused words becomes even more pronounced. Word- and phrase-frequency analysis is one of the easiest ways to spot AI-generated content. While it may be “important to consider” things and “delve into the intricacies” of them, overusing certain words and phrases may also reveal your dirty little AI secret.
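You don’t need anything fancy to run a quick frequency check yourself. The sketch below uses only standard Unix tools to split a file into words, lowercase them, drop the short ones, and list the most repeated. (`draft.txt` and its contents are just demo stand-ins for whatever file you want to inspect.)

```shell
# Demo input -- replace with your own file.
printf 'It is important to consider... delve... delve... delve...\n' > draft.txt

# Split into words, lowercase, keep words longer than four letters,
# then count and rank them by frequency.
tr -cs '[:alpha:]' '\n' < draft.txt |
  tr '[:upper:]' '[:lower:]' |
  awk 'length($0) > 4' |
  sort | uniq -c | sort -rn | head
```

Anything that floats to the top of this list more than a couple of times is a candidate for rewording.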
The Bash script below (also available in my GitHub repo) analyzes a text file, highlights overused words, and suggests synonyms for them. It uses a combination of shell commands and an embedded Python script to do its work.
Script Breakdown:
- Input Validation:
  - The script starts by checking if an input file (`$infile`) was provided. If not, it displays the correct usage and exits.
  - It then checks if the file exists. If the file is not found, it outputs an error message and exits.
- Download Common Words List:
  - A temporary file is created to store a list of common words downloaded from a specified URL using `curl`. If the download fails or the file is empty, the script reports an error and exits.
- Combine Word Lists:
  - If a custom list of common words exists in the user’s home directory, it is appended to the downloaded list. The combined list is then sorted to keep only unique entries.
- Python Script for Word Analysis:
  - A Python script embedded within the Bash script processes the text from the input file, using the Natural Language Toolkit (`nltk`) to tokenize the text and filter out common words.
  - The script counts occurrences of the remaining words, focusing on overused ones, and prints them in formatted output.
- Output Enhancement with WordNet:
  - If WordNet (`wn`) is installed on the system, the script uses it to find synonyms for each overused word, keeping only the first ten unique synonyms across the word’s multiple senses.
  - The final output includes each overused word, its count, and a list of suggested synonyms.
- Cleanup:
  - Temporary files created during the script’s execution are removed to clean up the working environment.
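To make the steps above concrete, here is a condensed sketch of that flow — not the author’s actual script. The word-list URL, the personal-list path (`~/.common_words`), and the overuse threshold are all placeholders, and a plain regex tokenizer with `collections.Counter` stands in for the `nltk`-based analysis; the WordNet lookup is only noted in a comment.

```shell
# Write the sketch to overused.sh so it can be run like the real script.
cat > overused.sh <<'SCRIPT'
#!/bin/bash
infile="$1"

# 1. Input validation
if [ -z "$infile" ]; then
    echo "Usage: $0 filename.txt" >&2
    exit 1
fi
if [ ! -f "$infile" ]; then
    echo "Error: '$infile' not found" >&2
    exit 1
fi

# 2. Download the common-words list (URL is a placeholder)
common=$(mktemp)
curl -sf "https://example.com/common-words.txt" -o "$common"
if [ ! -s "$common" ]; then
    echo "Error: could not fetch common-words list" >&2
    rm -f "$common"
    exit 1
fi

# 3. Append a personal list, if present, and keep unique entries
[ -f "$HOME/.common_words" ] && cat "$HOME/.common_words" >> "$common"
sort -u -o "$common" "$common"

# 4. Embedded Python: count words that are not on the common list
python3 - "$infile" "$common" <<'PYEOF'
import re, sys
from collections import Counter

text = open(sys.argv[1]).read().lower()
common = set(open(sys.argv[2]).read().split())
words = re.findall(r"[a-z']+", text)
counts = Counter(w for w in words if w not in common and len(w) > 3)
for word, n in counts.most_common():
    if n > 2:  # "overused" threshold -- an assumption, tune to taste
        print(f"{word:<20} {n}")
PYEOF

# 5. (Optional) look up synonyms for each reported word with WordNet's
#    command-line tool here, if it is installed.

# 6. Cleanup
rm -f "$common"
SCRIPT
chmod +x overused.sh
```

Keeping the Python filter as a heredoc inside the Bash wrapper, as the original script does, means the whole tool lives in a single self-contained file.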
Usage:
This script is ideal for writers, editors, and anyone else looking to weed redundancy out of their text or expand their vocabulary. It requires Python with `nltk` installed and optionally uses WordNet for synonym suggestions.
Note:
To run this script, save it to a file, make it executable (`chmod +x scriptname.sh`), and run it by passing a text file as an argument (`./scriptname.sh filename.txt`).