I sped up the execution of a bash command line by more than 500x to make it a breeze to use
When you need to process data in textual or tabular form, one of the long-time favorites is definitely GNU Bash, the Linux flagship shell with “batteries included”. If you have never used it, you are missing out on a lot and should definitely give it a try.
The tools that come with Bash follow the Unix philosophy of “do only one thing and do it well”, and are super-optimized for many different tasks. find, grep, sed, and awk are just a few of the powerful tools that can interoperate thanks to Bash’s pipes and filters architecture for text data processing.
Recently, I had to perform some simple text processing for which bash makes perfect sense. I have an input file containing one absolute file path per line, and I have to generate an output file where each line is the basename of the corresponding path in the input file, with a different extension. In practice, the two files are needed as input to another program that will convert the files mentioned in the input file (in wav format) into the files listed in the output file (in mp4 format, by adding some video).
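To make the mapping concrete, here is a minimal sketch in Python. The sample paths and the helper name target_name are made up for illustration and do not come from my actual file.

```python
from pathlib import PurePosixPath


def target_name(path: str, new_ext: str = ".mp4") -> str:
    """Return the basename of `path` with its extension replaced by `new_ext`."""
    return PurePosixPath(path).with_suffix(new_ext).name


# Hypothetical input lines (the real file has one absolute path per line)
sample_paths = [
    "/data/audio/clip_0001.wav",
    "/data/audio/speaker_a/clip_0002.wav",
]
for p in sample_paths:
    print(target_name(p))
```

Each input line produces exactly one output line, so the two files stay aligned line by line.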
I could have done it in Python, but bash just looked much more practical for this task. Nonetheless, at the end of this story I’ll show a Python implementation for comparison. So, I rushed to my keyboard and produced the following:
$ cat input.txt | while read line; do echo $(echo $(basename $line) | sed "s/.wav/.mp4/") >> output.txt; done
The code is correct, but extremely slow. My file was 3 million lines long and this command would have taken about 1 hour to complete. Just to get an idea of the lines per second, let me run it on a file with 10000 lines and measure its runtime.
$ time cat input.txt | while read line; do echo $(echo $(basename $line) | sed "s/.wav/.mp4/") >> output.txt; done
real 0m13.297s
user 0m19.688s
sys 0m1.881s
Some clear inefficiencies are the use of cat to start the command, and the double use of echo (with nested command substitutions). The nested calls spawn extra processes for every single line, which makes them very slow. They can be easily replaced, and since we know that all paths in the input file have the same extension, we can also remove sed and strip the extension using basename itself. Then, we run the new command on the same file of 10000 lines:
$ time while read line; do name=$(basename $line .wav); echo ${name}.mp4 >> output.txt; done < input.txt
real 0m6.626s
user 0m5.723s
sys 0m1.131s
We have some serious improvement here. Removing only cat, we get a real time right below 13s (relative improvement ~2%), and the rest comes from replacing sed and the second echo with just basename. Unfortunately, it’s still quite slow. With roughly 1500 lines/s, it would take 2000 seconds, or about half an hour, to complete 3,000,000 lines. Fortunately, we can get a serious boost by replacing read. read reads a line from the standard input and assigns its content to one or more variables (it can be used easily with tabular data), but it is not needed in our case, since tools like cut and sed already work line by line on their input.
Unfortunately, we have to give up the handy basename to extract only the file name, but we can replace it with cut, which can select pieces of text according to delimiters, and rev, which just reverses the character sequence of each line. This is a common trick to extract the last field with cut, which is not possible by default.
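To see why the double reversal works, here is a small sketch in Python that mimics the three steps on a single line. The function name last_field and the example path are only illustrative assumptions.

```python
def last_field(line: str, delimiter: str = "/") -> str:
    """Mimic `rev | cut -d/ -f1 | rev` for a single line."""
    reversed_line = line[::-1]                       # rev: last field is now first
    first_field = reversed_line.split(delimiter)[0]  # cut -d/ -f1: keep the first field
    return first_field[::-1]                         # rev: restore the character order


# Hypothetical path, used only to illustrate the trick
print(last_field("/home/user/audio/track.wav"))
```

After the first reversal the file name sits in front of the first delimiter, which is exactly the field that cut can address with -f1.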
The number of operations performed looks higher than before, but we finally get a huge speed-up, as we can see from our toy example file:
$ time rev input.txt | cut -d/ -f1 | rev | sed "s/.wav/.mp4/" >> output.txt
real 0m0.011s
user 0m0.010s
sys 0m0.013s
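With this new speed of ~910 Klines/second we can process 3,000,000 lines in 3.3 SECONDS, which corresponds to a speed-up of about 606x over the previous version (2000 seconds down to 3.3) and of roughly 1200x over the first one (13.297s vs 0.011s on the 10000-line file).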
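The arithmetic behind these figures can be reproduced with a few lines of Python. This is only a sketch: it assumes that the runtime scales linearly with the number of lines, and the labels and the helper name projected_seconds are mine.

```python
LINES_MEASURED = 10_000   # size of the toy file
TOTAL_LINES = 3_000_000   # size of the real file

# Wall-clock ("real") times in seconds, copied from the three runs above
real_times = {
    "cat + while read + echo + sed": 13.297,
    "while read + basename": 6.626,
    "rev + cut + rev + sed": 0.011,
}


def projected_seconds(real_time: float) -> float:
    """Scale a 10,000-line measurement linearly to the full file (assumption)."""
    return real_time * TOTAL_LINES / LINES_MEASURED


fastest = min(real_times.values())
for name, t in real_times.items():
    rate = LINES_MEASURED / t
    print(f"{name}: {rate:,.0f} lines/s, "
          f"{projected_seconds(t):,.1f} s for the full file, "
          f"{t / fastest:,.0f}x the fastest time")
```

Measurements as short as 0.011s are noisy, so the resulting ratios are best read as orders of magnitude.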
Most importantly, while the actual numbers depend on the hardware where the commands are executed, the relative improvement should stay roughly the same across different machines.
Here we can see an equivalent Python implementation for comparison’s sake:
# convert.py
import os
import sys


def convert(tgt_ext: str):
    # Read one path per line from standard input
    for line in sys.stdin:
        # Keep only the file name and drop its extension
        base, _ = os.path.splitext(os.path.basename(line.rstrip("\n")))
        print(base + tgt_ext)


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} TARGET_EXT")
        sys.exit(1)
    convert(sys.argv[1])
    sys.exit(0)
and now we will measure its time:
$ time python3 convert.py .mp4 < input.txt > output.txt
real 0m0.022s
user 0m0.021s
sys 0m0.000s
And the time is about double the best time we get with bash. It requires writing more code, but this code is also super fast and probably easier to modify for many people.
Bash comes in very handy for many data processing tasks when working with textual data. It comes with many highly optimized tools, but some are still faster than others at achieving the same result. The difference may not matter when working with short files, but in this article I show that it starts to matter with hundreds of thousands, or millions, of lines.
Knowing the performance implications of using our favorite programs can save us hours of waiting for our jobs, with big gains for our productivity. We also saw that a Python implementation is very fast for our use case, despite Python’s reputation for being slow. It surely requires more coding but also offers more flexibility. I’d surely reach for Python for the cases that are too complicated to solve with bash.
Thanks for reading this far, and happy scripting!