When dealing with large numbers of files generated programmatically or collected from disparate sources, it's common to end up with near-duplicate filenames that share a base name and differ only in the text appended after a delimiter such as an underscore. For example, you may have files like:
[email protected]_Arizona
[email protected]_Arizona_State
[email protected]_Washington
This presents challenges when processing the files further, as scripts may unintentionally overwrite files or skip ones that should be distinct. To tackle this, we can use a short bash script on Linux to deduplicate the files by their base name.
The core premise is to:
- Iterate through each file
- Extract the base name by removing the first underscore and everything after it
- Check if a file already exists with that base name
- If so, delete the "duplicate" file with the longer name
- If not, rename the current file to the base name
By the end, we condense the files down to:
[email protected]
[email protected]
To implement this in bash, we:
- Set up a for loop to process each file in the current directory
- Use a conditional to check if the filename contains an underscore
- If so, use the cut utility (or bash parameter expansion) to extract the base name
- Append the .txt extension to form the cleaned filename
- Check whether a file with the cleaned name already exists
- Rename or delete the current file accordingly
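The extraction step can be done in two equivalent ways. The following is a minimal sketch, assuming an underscore delimiter; the function names base_with_cut and base_with_expansion are hypothetical and used only for illustration.

```shell
#!/bin/bash
# Hypothetical helper names; neither function appears in the original script.

# Variant 1: cut splits on "_" and keeps the first field.
base_with_cut() {
    printf '%s\n' "$1" | cut -d'_' -f1
}

# Variant 2: ${1%%_*} removes the longest suffix that starts with "_",
# leaving the text before the first underscore without a subprocess.
base_with_expansion() {
    printf '%s\n' "${1%%_*}"
}

base_with_cut "report_Arizona_State"        # prints: report
base_with_expansion "report_Arizona_State"  # prints: report
```

Both variants return the name unchanged when it contains no underscore. The parameter expansion form avoids starting a pipeline for every file, which may matter in directories with many files.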
The key bash capabilities that enable this workflow are:
- File globbing with * to loop through all files
- The cut utility to split on the delimiter
- Conditional logic with if/then statements
- String concatenation and parameter expansion
- Bash pattern matching inside [[ ]] to test whether a name contains the delimiter
- Filesystem commands like mv and rm
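To see globbing and pattern matching working together before any file is changed, here is a small sketch. The function name list_candidates is hypothetical; it only prints the regular files whose names contain an underscore.

```shell
#!/bin/bash
# list_candidates is a hypothetical helper, not part of the original script.
# It prints the regular files in a directory whose names contain "_".
list_candidates() {
    local dir=${1:-.} path name
    for path in "$dir"/*; do
        [ -f "$path" ] || continue       # skips directories and an unmatched glob
        name=${path##*/}                 # strip the leading directory
        if [[ $name == *"_"* ]]; then    # pattern match: any name containing "_"
            printf '%s\n' "$name"
        fi
    done
}

list_candidates .
```

The -f test also covers the empty-directory case, where the unmatched pattern would otherwise be passed through as a literal string.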
Here is what the full script looks like:
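#!/bin/bash
# Collapse files named <base>_<suffix> down to a single <base>.txt.
for file in *; do
    # Skip directories and anything else that is not a regular file.
    [ -f "$file" ] || continue
    if [[ $file == *"_"* ]]; then
        # Keep only the text before the first underscore.
        new_name=$(printf '%s\n' "$file" | cut -d'_' -f1)
        new_name="$new_name.txt"
        if [ -f "$new_name" ]; then
            # The base name is already taken, so drop this file.
            rm -- "$file"
        else
            mv -- "$file" "$new_name"
        fi
    fi
done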
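Since rm cannot be undone, it can be useful to preview what the loop would do first. The following is a dry-run sketch under a few assumptions: the function name plan_dedupe is hypothetical, it relies on associative arrays (bash 4 or later), and it uses parameter expansion in place of cut.

```shell
#!/bin/bash
# plan_dedupe is a hypothetical name. It prints the action the script would
# take for each file in the given directory, without touching anything.
plan_dedupe() {
    local dir=${1:-.} path file new_name
    declare -A claimed=()    # cleaned names already assigned in this run (bash 4+)
    for path in "$dir"/*; do
        [ -f "$path" ] || continue
        file=${path##*/}
        [[ $file == *"_"* ]] || continue
        new_name="${file%%_*}.txt"
        if [ -f "$dir/$new_name" ] || [ -n "${claimed[$new_name]:-}" ]; then
            printf 'rm %s\n' "$file"
        else
            printf 'mv %s %s\n' "$file" "$new_name"
            claimed[$new_name]=1    # a later file with this base would be removed
        fi
    done
}

plan_dedupe .
```

The claimed array stands in for the renames that a real run would have performed, so the preview reports the same keep-the-first, remove-the-rest outcome as the script.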
By combining a few basic bash features in this way, you can deduplicate a file collection and keep your filesystem organized. The same approach can be adapted to other filename delimiters or file types. Bash makes quick work of tasks like this that would otherwise require tedious manual effort.
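As one possible adaptation to other delimiters and file types, here is a sketch of a parameterized version. The function name dedupe_by_delimiter and its argument order are assumptions made for illustration, not part of the original script.

```shell
#!/bin/bash
# dedupe_by_delimiter is a hypothetical generalization of the script above.
# Arguments: directory, delimiter (default "_"), extension (default ".txt").
dedupe_by_delimiter() {
    local dir=${1:-.} delim=${2:-_} ext=${3:-.txt} path file new_name
    for path in "$dir"/*; do
        [ -f "$path" ] || continue
        file=${path##*/}
        [[ $file == *"$delim"* ]] || continue
        # Text before the first delimiter, plus the chosen extension.
        new_name="${file%%"$delim"*}$ext"
        # Never let a file count as a duplicate of itself.
        [ "$file" = "$new_name" ] && continue
        if [ -f "$dir/$new_name" ]; then
            rm -- "$path"
        else
            mv -- "$path" "$dir/$new_name"
        fi
    done
}
```

For example, dedupe_by_delimiter ./exports - .csv would collapse names such as a-x and a-y in a hypothetical ./exports directory into a single a.csv. The self-comparison guard matters once the delimiter can also appear in the extension, for instance when splitting on a dot.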