Add a `Join-Object` cmdlet to the standard PowerShell equipment
As the verb-noun convention suggests, I think that there are a lot of developers looking for a common way to join objects (lists), actually it appears a decennia old quest: as shows from the PowerShell Team Join-Object article and StackOverflow issues along with: In Powershell, what's the best way to join two tables into one?
Personally, I am convinced that a Join-Object cmdlet will generally result in a better syntax were an object (list) join is required (in comparison with alternative cmdlets along with Group-Object) and better performance (then using obvious iterators like ForEach-Object together with Where-Object or similar cmdlets and methods).
For that reason I am maintaining a Join-Object script/module for more than 3 years now. I don't think that my Join-Object version should be added to the standard PowerShell package (for one thing, it is written in PowerShell and not C# 🤔), but it might serve as an example for what I would expect from a syntax that comes together with a Join-Object cmdlet.
What do I expect from a "internal" Join-Object cmdlet?
- An intuitive idiomatic PowerShell syntax
The syntax should be PowerShell like and include features similar to the SQL Join clause where it basically joins (a list of) objects or items based on related properties (
-Onparameter) or a simple side-by-side list join (by omitting the-Onparameter) as a specific PowerShell cmdlet feature (see:#14732) . - Ability to do similar Join types as SQL, along with
InnerJoin,LeftJoin,RightJoin,FullJoin - Ability to join based on multiple property relations, e.g.:
-On Column1, Column2 - Smart property merging (e.g. if a properly is defined as related, it should only appear ones in the result)
- Respect the PowerShell pipeline (for the left object supplied through pipeline)
- A reasonable performance by using a binary search (
HashTable) on the right object (knowing that it needs to be iterated multiple times). Also note the LinqJoin-Objectsolution by @ili101 - Calculated properties
- ...
An aside re:
And it could even prevent to invest too much in traditional operators as shown here.
As I've argued in detail in https://github.com/PowerShell/PowerShell/issues/14724#issuecomment-777801399, there is no reason to pit cmdlet-based solutions against expression / statement-based solutions: both are necessary, and in certain cases use of one over the other is the only option. Instead, we should strive for feature parity, to the extent that is feasible.
vote for this feature
I use this type of feature all the time with a 3rd party module. If this could be natively part of the PowerShell command set, that would be awesome.
It would be a highly useful addition to the suite. There is actually even another maintained version: https://github.com/ili101/Join-Object. This goes to show that there is a realistic need for such a feature.
GitHub
PowerShell Join-Object LINQ Edition. Contribute to ili101/Join-Object development by creating an account on GitHub.
@moreaki
Thank you for your support.
another maintained version: https://github.com/ili101/Join-Object
I did already cover this one in the initial issue request. The Join-Object from ili101 is based on Linq and therefore probably always faster than my version, my counter arguments to this:
The PowerShell pipeline is designed to stream objects (which safes memory), meaning that both¹ lists of input objects usually aren't (shouldn't be) resident in memory. Normally they are retrieved from somewhere else (i.e. a remote server or a disk). Also, the output usually matters where linq solutions are fast but might easily put you on the wrong foot in drawing conclusions because linq literally defers the execution (lazy evaluation), see also: fastest way to get a uniquely index item from the property of an array.
In other words, if it comes to (measuring) performance in PowerShell, it is important to look to the complete end-to-end solution, which is more likely to look something like:
import-csv .\file1.csv |LeftJoin (import-csv .\file2.csv) -On IP |Export-Csv .\results.csv
(1) Note: unfortunately, there is no easy way to build two parallel input streams (see: #15206 Deferred input pipelines)
Anyways, I have just published a new version, see: https://github.com/iRon7/Join-Object which performs about twice as fast as the previous one due to modifications based on this discussion: Optimize static code reusage #19322
GitHub
PowerShell Join-Object LINQ Edition. Contribute to ili101/Join-Object development by creating an account on GitHub.
GitHub
Combines two objects lists based on a related property between them. - GitHub - iRon7/Join-Object: Combines two objects lists based on a related property between them.
@iRon7
Regarding the performance points you mentioned, I just run some tests to check. on 100k rows test data:
JoinModule LeftJoin-Object took 16s.
Join-Object Join-Object took 4s, and 2s when selecting DataTable output.
Regarding Laziness my module is not Lazy, it will only return once everything executed as I wanted to return the simple objects that everyone know and not the Linq lazy objects.
Regarding memory I didn't log the results as it never was a limitation for me but from opening task manager my module seems to consume much less (or Task manager was too slow to show it before it finished?).
Regarding "end-to-end solution" well this is depended on the user use case, I tend to not pipe the output usually. For example a lot of the time I will pass the data to ImportExcel Export-Excel, in this case NOT piping the data and providing it as DataTable type will result in multiple orders of magnitude faster export.
I note that Join-Object is in the PowerShell Gallery.
I don't see any need for it being added to the default install. Doing that prevents it being extended independently of the PowerShell releases.
It means you can run a new version of the module on an older version of PowerShell.
I don't see any need for it being added to the default install. Doing that prevents it being extended independently of the PowerShell releases.
It means you can run a new version of the module on an older version of PowerShell.
Adding it to the standard equipement doesn't exclude the possibility to have a independent corresponding version available on the PowerShell Gallery. In fact, including it in future PowerShell versions will lead to more consistent solutions for this more than a decennia old PowerShell desire.
A huge number of PowerShell questions on StackOverflow are about joining objects and could have a more common answer than the above outdated (Google top ranked) document or any of the Join-Object modules available on the PowerShell Gallery which all have different parameters and features.
Besides, the PowerShell Gallery might not always be easily available in e.g. a restricted environment or a remote session where it might be assumed that PowerShell (core) will once be included with future Window versions.
PowerShell is about mechanism, not policy. It originated from the idea from UNIX of do one thing and do it well. People often make the comparison between PowerShell and bash, but bash does not include any of the POSIX tools, cat/ls/grep etc.
Compare with somebody on Slashdot asking how to write a shell script to use JSON or XML, it does not mean that bash or zsh have to change. Instead, the suggestion is to use "jq" or whatever, if that is not installed, then make it a dependency/requirement for your script and add it to your system.
The locked/restricted down environment is often brought up as a reason why PowerShell must include everything. I suggest it demonstrates the complete opposite. With a locked down environment you have the minimal installed tool set to do the job. If the tool is not required to do the job it is not installed. If it is required, then it is installed. That allows you to have the smallest surface are. If the requirements of the system change and new capabilities are required then go through the change request process.
@rhubarb-geek-nz,
PowerShell is about mechanism, not policy.
Agree, yet I wouldn't gather joining objects under "PowerShell must include everything" and it is compared to a lot of other languages, as Python (which has an .Update() method). Besides, you might be right that it originated from UNIX, I think that PowerShell comes standard with a set of CMDlets to actually replace the old-fasion CMD CLI.
Anyways, as I see it: a Join-Object cmdlet would come close after the ForEach-Object, Where-Object and Select-Object cmdlets where you might even consider a PowerShell Object array method similar to the .foreach() and the .where(scriptblock expression[, WhereOperatorSelectionMode mode [, int numberToReturn]]):
`.Join(<array> [,On property[,Join mode]])`
Which by default does a side-by-side join (see also: #14732, similar to the Unix Paste command) and optionally bases a join on a related property (and/or a specific join type).
Certainly there is demand for Join-Object ( this is not an exhaustive list )
| PSGallery | Author | Total Downloads | Version | Date | Current Downloads |
|---|---|---|---|---|---|
| PoshFunctions | riedyw | 18,993 | 2.2.10 | 30/10/2023 | 9,925 |
| JoinModule | iRon | 149,522 | 3.8.3 | 10/06/2023 | 44,753 |
| Join | iRon | 17,412 | 3.8.3 | 10/06/2023 | 17,412 |
| Join-Object | illy | 114,887 | 2.0.3 | 05/04/2022 | 81,965 |
| PSExcel | ramblingcookiemonster | 623,341 | 1.0.2 | 09/08/2016 | 623,341 |
| WFTools | ramblingcookiemonster | 24,545 | 0.1.58 | 04/06/2018 | 21,099 |
Yet in that popularity lies a problem.
I note that they all are variations on a theme buit often with very few command line arguments in common.
The PowerShell team are very sensitive about breaking changes, be it interpretation of a command line argument or how a particular file is parsed. Unlike a compiled language, a change to PowerShell itself can break every installation, especially with automatic updates.
Now if PowerShell did adopt one of the above (or write a new Core only standards conforming Join-Object) it would potentially break anyone using any of the un-choosen implementations.
You would then have further incompatibility when new scripts assume that the new standard Join-Object exists and this usage is incompatible with any of the other implementations. Something will break, it does not matter which module gets the priority, the gallery one or the one in c:\program files\powershell\7\Modules.
As the Whopper said in War Games, the only way to win is to not play the game. So let people choose the one they want from the PSGallery.
Now if PowerShell did adopt one of the above (or write a new Core only standards conforming Join-Object) it would potentially break anyone using any of the un-choosen implementations.
Afaik, systems that already have a custom module/function ("any of the un-choosen implementations") installed won't break if a future PowerShell would include a common Join-Object cmdlet:
- The cmdlet of the
Microsoft.PowerShell.Corelibrary will be loaded first. Any custom module/function loaded afterwards with a command with the same name as an existing command in the session. The newly imported command replaces takes precedence over the existing command.- Users have the option to move to the core version by removing the module that includes "any of the un-choosen implementations" (they are in control).
- Users of any custom module/function that would like to use both (custom - and core
Join-Object). Might use the qualified name:Microsoft.PowerShell.Core\Join-Object.
Unfortunately, I am well known with how standard proliferate as it happens a lot in organizations that do not have control, but the PowerShell team is in fact the leader here and the entity that might take control of this Join-Object clutter knowing that there can only be one Microsoft.PowerShell.Core\Join-Object.
Besides, with my next JoinModule update, I will hide/archive the Join function version (one down), and if the PowerShell Team decides to equipe PowerShell with a standard Join-Object cmdlet with similar features as proposed in this issue, I will stop further development of my own implementation and happily refer to this standard PowerShell Join-Object instead (along with a migration document if required).
Ironically I found this post because I wanted a general mechanism to combine output from a pipeline, and the existing versions of Join-Object were not want I wanted. I wanted something that was like the existing joins.
PS> get-command -verb Join
CommandType Name Version Source
----------- ---- ------- ------
Cmdlet Join-Array 1.0.0 rhubarb-geek-nz.Joinery
Cmdlet Join-DtcDiagnosticResourceManager 1.0.0.0 MsDtc
Cmdlet Join-Path 7.0.0.0 Microsoft.PowerShell.Management
Cmdlet Join-String 7.0.0.0 Microsoft.PowerShell.Utility
So hence Join-Array.
I wanted a general mechanism to combine output from a pipeline
Isn't that just Foreach-Object { $_ }?
@( @(1,2) , @(3,4,5) ) | Foreach-Object { $_ } | ConvertTo-Json
[
1,
2,
3,
4,
5
]
I wanted a general mechanism to combine output from a pipeline
Isn't that just
Foreach-Object { $_ }?
No. I have some GZip tools that encode and decode GZip and work in a pipeline and a single file may be broken into multiple byte [] during the process
PS> Get-Content -LiteralPath bin\Release\netstandard2.0\publish\RhubarbGeekNz.GZip.dll -AsByteStream -Raw | ConvertTo-GZip | ForEach-Object { $_.Length }
10
8192
2997
The output is multiple byte[].
PS> Get-Content -LiteralPath bin\Release\netstandard2.0\publish\RhubarbGeekNz.GZip.dll -AsByteStream -Raw | ConvertTo-GZip | Join-Array -Type ([byte]) | ForEach-Object { $_.Length }
11199
So that efficiently combines them into a single record.
This would be a highly useful addition to PowerShell!
Seems like this should exist, along with the other LINQ equivalents like Any / Single / First / FirstOrDefault / Single / SingleOrDefault / etc..
While i think the heart of this issue is truly worthwhile i think we should be stressing the need for parity with the rest of the .NET ecosystem and we need the ability to handle both IEnumerable and IQueryable in a robust and powerful way. I've professionally spent 10's or 100's of hours building commandlets manually that would have been trivial if powershell handled IQueryable correctly.
I will take a leap and suggest most people using powershell care about the .NET ecosystem, so parity of features is paramount. Join-Object should have an exact 1-1 parity with system.linq.enumerable.join() And all the other features of LINQ should be fully supported.
@hannasm,
Thank you for supporting this PowerShell enhancement request. I agree with you that Linq is very powerful and there should be a possibility to boost the a join cmdlet where both element lists have strict properties and reside already in memory (see the correction from rhubarb-geek-nz below)
But I personally think that the general vision should be a cmdlet that supports idiomatic PowerShell (see the introduction of: PowerShell scripting performance considerations). Meaning that a recent StackOverflow question as Map csv content with Powershell table output [closed] from a novice PowerShell author could be answered with a single command line and that a Join-Object cmdlet should:
- support the PowerShell pipeline (One-at-a-time processing for one of the element sides)
- keep the element properties loosely, meaning: support for the general
PSCustomObjectclass (rather than custom classes).
Pipeline
In a common situation, the initial source of the object lists will not be resident in memory but usually require to be retrieved from a (slower) long term storage (as the file system or a database) and the same applies for the target storage. This make the PowerShell pipeline very useful where you might just load one (the smaller) element list into memory and do One-at-a-time processing on the (larger) element list. As for the example StackOverflow question:
Import-Csv <data file>
↓
Import-Csv <large data file> | Join-Object (-on Alias) | Export-Csv <output data file>
(Where Import-Csv <large data file> could also be another streaming source as a database, using a cmdlet as Get-ADUser)
This will consume less memory (knowing that the PowerShell objects are quiet heavy by design as they are actually optimized for the pipeline) and would hardly impact the performance if you measure the whole end-to-end solution.
Loosely
PowerShell is a loosely scripting language by design where each object property might contain a different type as apposed to Linq which pushes for strict element properties (as in a DataTable. See also comment from ili101 below) where each property in a column needs to be of the same type to be joined upon), this means that you might need to cast a whole column to a specific data type (as a [string]) where it might be desired to keep the individual types (along with a simple $Null, rather than empty string)
Versatile
Besides, unlike a rather restricted .net Linq Join method, a Join-Object cmdlet might pave the way to add some idiomatic features as proposed in other requests (as a side-by-side join of multiple arrays) without breaking the engine or current syntax.
where both element lists have strict properties and reside already in memory
Technically Linq does not require the data to be in memory. When working with database result sets you can use Linq and the actual select can be delayed until the last moment and the select performed on the database can be very efficient where an Any() can become a count(*) and the results are streamed through the result set. So in theory the database command result set streaming could be paralleled to a cmdlet's output pipeline. That said, there is a lot of infrastructure code that works with the different databases to make that happen.
Loosely
Linqwhich pushes for strict element properties
You can set the type to object, so it accepts practically any type. like here where the compare "column" is an object. it's not fun to use in Powershell as anything Linq/System.Func is very verbose but it's supported.
I don't believe the linq compatibility should be in conflict with conforming to idiomatic powershell. I mostly just think there is a clear feature set to base the design on.
Enumerable.Join() uses separate lambdas to specify the join keys for inner and outer collections being joined and then a final lambda to project the result. Something like this would seem an appropriate starting point.
If the collections implement Iqueryable they should delegate to an IQueryable implementation (I'm sure this will be very tricky because you have to convert the lambdas to expression trees). If the collections support IEnumerable they should delegate to an IEnumerable implementation. If there is a need for special cases against powershell collections that should certainly receive further attention.
Creating PSCustomObject instead of what frequently would be an anonymous type in C# seems completely reasonable. Supporting pipelines / cancellation / etc... would still be mandatory
On Mon, Jun 3, 2024, 12:33 rhubarb-geek-nz @.***> wrote:
where both element lists have strict properties and reside already in memory
Technically Linq does not require the data to be in memory. When working with database result sets you can use Linq and the actual select can be delayed until the last moment and the select performed on the database can be very efficient where an Any() can become a count(*) and the results are streamed through the result set. So in theory the database command result set streaming could be paralleled to a cmdlet's output pipeline. That said, there is a lot of infrastructure code that works with the different databases to make that happen.
— Reply to this email directly, view it on GitHub https://github.com/PowerShell/PowerShell/issues/14994#issuecomment-2145654895, or unsubscribe https://github.com/notifications/unsubscribe-auth/AASHZJHPOWF3JWK7AU37AQTZFSLGPAVCNFSM4ZAJMQP2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMJUGU3DKNBYHE2Q . You are receiving this because you were mentioned.Message ID: @.***>
I don't believe the linq compatibility should be in conflict with conforming to idiomatic powershell..
I would rather Join-Object does PowerShell well rather than Linq. Linq is just a technology to achieve goals. The goal is to provide a consistent Join-Object.
On a similar vein I would rather that Join-Object does not become part of the core product but remains separately available via the PSGallery. Independence will give it greater freedom to evolve without being tied to the PowerShell delivery cycle.