Processing thousands of files using Terminal commands
How do you fix a problem that spans more than two thousand files without spending days in a word processor? The CLI to the rescue.
Having a JAM stack with a static generator CMS under VC in Git, with automated CI/CD through a global CDN, is great for many reasons (speed, security, scalability…), but what happens when a new version decides to change which characters are accepted in the build and which ones are deemed invalid? Then you have a problem that, in my case, spanned more than two thousand files.
I’m not nearly as much of a CLI master as my friends Alvaro or Santiago, but I love to learn new tricks and get a kick out of the power of the terminal.
In this case I had to locate a number of “offending characters” (leftovers from a previous blog engine migration that handled text encoding differently) in thousands of files, and then do a “replace with” in all of them at once.
What did I do? Thanks to tutorials by CLI Magic, Winaero, Linuxize, and Maketecheasier, I put together the following.
First, I had to identify which files were causing trouble. After all, if it was only one or two, I could fix them manually. So I ran (in the directory containing all my blog posts):
grep -iRl "&#"
Here are the options:
-i - ignore text case
-R - recursively search files in subdirectories
-l - show file names instead of portions of the file contents
The reason I used “&#” is that the offending text encoding always included that partial string.
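Before going any further, it helps to get a sense of scale. Piping the same search into wc gives a file count, and a numbered variant shows the entities in context. This is a small sketch, assuming GNU grep; the explicit . just makes the search path obvious:

grep -iRl "&#" . | wc -l      # count how many files contain the offending string
grep -iRn "&#" . | head -n 5  # peek at a few matching lines, with line numbers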
Once I realized we were talking about over two thousand files, I decided to use the CLI to substitute the offending strings. I won’t list them all, so as not to give away clues and vulnerabilities, but the general command I used is:
find . -type f -exec sed -i 's/…/.../g' {} +
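To make the shape of that command concrete, here is a purely hypothetical example (not one of my actual strings): suppose one of the leftovers were the HTML entity &#8230; and I wanted it replaced with three plain dots. Note that sed -i with no suffix is the GNU form (BSD/macOS sed expects -i ''), and -Z/-0 are the GNU grep/xargs pairing for filenames with spaces:

# hypothetical example: replace the HTML entity &#8230; with three plain dots
find . -type f -exec sed -i 's/&#8230;/.../g' {} +

# more surgical variant: only rewrite the files that actually contain the entity
grep -iRlZ "&#8230;" . | xargs -0 sed -i 's/&#8230;/.../g'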
And, just like magic, in the blink of an eye, all the offending occurrences of “…” were turned into “...”.
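A final sanity check is to rerun the original search; if the substitutions worked, it should come back empty (grep exits with a non-zero status when nothing matches, which the || branch turns into a friendly message):

# rerun the search; no output means nothing offending is left
grep -iRl "&#" . || echo "all clean"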
That is, in essence, what lies at the heart of modern software and data transformation. Unix, still so brilliant after over half a century.