Parallel PowerShell Loops
I was recently tasked with building an automation that downloaded files from an AWS S3 bucket. The basics of the initial script were: build a list of objects to download based on some criteria, then loop through them and download each one. Seems basic enough, right?
Well... it was slow. In testing, once the list was built it could take an hour or more to download all the files, which could number in the thousands. Each file was relatively small, but the overhead of downloading so many small files added up fast.
PowerShell 7 has been out for a while, but until this project I'd never needed any of its new features. The feature in question is the parallel mode of the ForEach-Object cmdlet. Multi-threading is generally a difficult concept in programming, and it can be tricky to get working properly depending on the complexity of the problem at hand, but with fairly minor tweaks I was able to get the script working with this feature. When I was all done, that one-hour runtime had dropped to about five minutes.
I'm not going to post the complete script here, as it's very proprietary and the point I want to get across is the parallelism and how to accomplish it. Let's move on to the basics of the script:
First we need to build a list of objects that we are going to run other commands in parallel on.
$objects = Get-S3Object -BucketName $bucket -KeyPrefix $prefix | Where-Object Key -Match $criteria
Now, in a linear script we would use:
foreach ($object in $objects) {
    Read-S3Object -BucketName $bucket -Key $object.Key -File "$($object.FileName)" | Out-Null
}
But since this needs to work in parallel we use:
$objects | ForEach-Object -Parallel {
    # do stuff
} -ThrottleLimit 16
Now for the caveats... Since each iteration runs on its own thread, variables declared outside the parallel script block aren't visible inside it the way they normally would be. Previously defined variables can be read with the $using: scope modifier, as such:
$objects | ForEach-Object -Parallel {
    $internalVariable = $using:externalVariable
    Read-S3Object -BucketName $using:bucket -Key $_.Key -File "$($_.FileName)" | Out-Null
} -ThrottleLimit 16
It's also important to note that since the objects are passed in via the pipeline, you use $_ to access the current object inside the script block, which otherwise behaves normally.
One last thing to point out in my script example: by default the throttle limit is set to 5, and this number can be important. For CPU-bound work, setting it higher than the number of cores available to the system provides little benefit, since the processor maxes out its cores and runs the overflow in series anyway; for I/O-bound work like these downloads, a somewhat higher limit can still pay off because each thread spends most of its time waiting on the network.
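If you'd rather tie the throttle limit to the machine's core count than hard-code it, .NET exposes that directly. A small sketch (here $objects and the script block body are placeholders):

```powershell
# Use the number of logical processors as the throttle limit
$limit = [Environment]::ProcessorCount

$objects | ForEach-Object -Parallel {
    # do stuff with $_
} -ThrottleLimit $limit
```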
There are also cases where running things in parallel takes longer, because parallelism has overhead: PowerShell has to instantiate a new runspace for each thread. Trivial tasks, like a simple math computation or a string manipulation, can actually run slower in parallel than in series. For parallelism to be effective, each iteration needs to do something complex or time-consuming.
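You can see that overhead for yourself with Measure-Command. A rough sketch; exact timings will vary by machine, but for trivial work the serial version generally wins:

```powershell
# Compare serial vs. parallel for a trivial operation.
# The runspace startup cost usually outweighs the work itself.
$serial   = Measure-Command { 1..1000 | ForEach-Object { $_ * 2 } }
$parallel = Measure-Command { 1..1000 | ForEach-Object -Parallel { $_ * 2 } }

"Serial:   $($serial.TotalMilliseconds) ms"
"Parallel: $($parallel.TotalMilliseconds) ms"
```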