The issue of overused words is a tricky one. For me, the problem occurs when I revisit my writing a few days later and make a few edits. I may find a word I would like to replace with a better match, forgetting that the same word was already used a couple of times elsewhere in the text.
I’ve been using Grammarly for some time now, and lately I’ve also tried out TypeAI. Both can help you with syntax, punctuation, and overall style. However, neither tool does a good job of tracking overused words.
If you use ChatGPT or another LLM platform to help with your writing or editing, the problem of overused words becomes even more pronounced. Word- and phrase-frequency analysis is one of the easiest ways to spot AI-generated content. While it may be “important to consider” things and “delve into the intricacies” of them, overusing certain words and phrases may also reveal your dirty little AI secret.
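You don’t need anything fancy to run a quick frequency check yourself. The sketch below uses only standard Unix tools to split a file into words, lowercase them, drop the short ones, and list the most repeated. (`draft.txt` and its contents are just demo stand-ins for whatever file you want to inspect.)

```shell
# Demo input -- replace with your own file.
printf 'It is important to consider... delve... delve... delve...\n' > draft.txt

# Split into words, lowercase, keep words longer than four letters,
# then count and rank them by frequency.
tr -cs '[:alpha:]' '\n' < draft.txt |
  tr '[:upper:]' '[:lower:]' |
  awk 'length($0) > 4' |
  sort | uniq -c | sort -rn | head
```

Anything that floats to the top of this list more than a couple of times is a candidate for rewording.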
The Bash script below (also available in my GitHub repo) analyzes a text file, highlights overused words, and suggests synonyms for them. It uses a combination of shell commands and an embedded Python script to do its work.
Script Breakdown:
- Input Validation:
  - The script starts by checking if an input file (`$infile`) was provided. If not, it displays the correct usage and exits.
  - It then checks if the file exists. If the file is not found, it outputs an error message and exits.
- Download Common Words List:
  - A temporary file is created to store a list of common words downloaded from a specified URL using `curl`. If the download fails or the file is empty, the script reports an error and exits.
- Combine Word Lists:
  - If a custom list of common words exists in the user’s home directory, it is appended to the downloaded list. The combined list is then sorted to keep only unique entries.
- Python Script for Word Analysis:
  - A Python script embedded within the Bash script processes the text from the input file, using the Natural Language Toolkit (`nltk`) to tokenize the text and filter out common words.
  - The script counts occurrences of the remaining words, focusing on overused ones, and prints them in formatted output.
- Output Enhancement with WordNet:
  - If WordNet (`wn`) is installed on the system, the script uses it to find synonyms for each overused word, keeping only the first ten unique synonyms across the word’s multiple senses.
  - The final output includes each overused word, its count, and a list of suggested synonyms.
- Cleanup:
  - Temporary files created during the script’s execution are removed to clean up the working environment.
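To make the steps above concrete, here is a condensed sketch of that flow — not the author’s actual script. The word-list URL, the personal-list path (`~/.common_words`), and the overuse threshold are all placeholders, and a plain regex tokenizer with `collections.Counter` stands in for the `nltk`-based analysis; the WordNet lookup is only noted in a comment.

```shell
# Write the sketch to overused.sh so it can be run like the real script.
cat > overused.sh <<'SCRIPT'
#!/bin/bash
infile="$1"

# 1. Input validation
if [ -z "$infile" ]; then
    echo "Usage: $0 filename.txt" >&2
    exit 1
fi
if [ ! -f "$infile" ]; then
    echo "Error: '$infile' not found" >&2
    exit 1
fi

# 2. Download the common-words list (URL is a placeholder)
common=$(mktemp)
curl -sf "https://example.com/common-words.txt" -o "$common"
if [ ! -s "$common" ]; then
    echo "Error: could not fetch common-words list" >&2
    rm -f "$common"
    exit 1
fi

# 3. Append a personal list, if present, and keep unique entries
[ -f "$HOME/.common_words" ] && cat "$HOME/.common_words" >> "$common"
sort -u -o "$common" "$common"

# 4. Embedded Python: count words that are not on the common list
python3 - "$infile" "$common" <<'PYEOF'
import re, sys
from collections import Counter

text = open(sys.argv[1]).read().lower()
common = set(open(sys.argv[2]).read().split())
words = re.findall(r"[a-z']+", text)
counts = Counter(w for w in words if w not in common and len(w) > 3)
for word, n in counts.most_common():
    if n > 2:  # "overused" threshold -- an assumption, tune to taste
        print(f"{word:<20} {n}")
PYEOF

# 5. (Optional) look up synonyms for each reported word with WordNet's
#    command-line tool here, if it is installed.

# 6. Cleanup
rm -f "$common"
SCRIPT
chmod +x overused.sh
```

Keeping the Python filter as a heredoc inside the Bash wrapper, as the original script does, means the whole tool lives in a single self-contained file.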
Usage:
This script is ideal for writers, editors, and anyone else looking to weed redundancy out of their text or expand their vocabulary. It requires Python with `nltk` installed and optionally uses WordNet for synonym suggestions.
Note:
To run this script, save it to a file, make it executable (`chmod +x scriptname.sh`), and run it by passing a text file as an argument (`./scriptname.sh filename.txt`).