PowerShell icon indicating copy to clipboard operation
PowerShell copied to clipboard

Add a `Join-Object` cmdlet to the standard PowerShell equipment

Open iRon7 opened this issue 5 years ago • 16 comments

As the verb-noun convention suggests, I think that there are a lot of developers looking for a common way to join objects (lists), actually it appears a decennia old quest: as shows from the PowerShell Team Join-Object article and StackOverflow issues along with: In Powershell, what's the best way to join two tables into one? Personally, I am convinced that a Join-Object cmdlet will generally result in a better syntax were an object (list) join is required (in comparison with alternative cmdlets along with Group-Object) and better performance (then using obvious iterators like ForEach-Object together with Where-Object or similar cmdlets and methods). For that reason I am maintaining a Join-Object script/module for more than 3 years now. I don't think that my Join-Object version should be added to the standard PowerShell package (for one thing, it is written in PowerShell and not C# 🤔), but it might serve as an example for what I would expect from a syntax that comes together with a Join-Object cmdlet.

What do I expect from a "internal" Join-Object cmdlet?

  • An intuitive idiomatic PowerShell syntax The syntax should be PowerShell like and include features similar to the SQL Join clause where it basically joins (a list of) objects or items based on related properties (-On parameter) or a simple side-by-side list join (by omitting the -On parameter) as a specific PowerShell cmdlet feature (see: #14732) .
  • Ability to do similar Join types as SQL, along with InnerJoin, LeftJoin, RightJoin, FullJoin
  • Ability to join based on multiple property relations, e.g.: -On Column1, Column2
  • Smart property merging (e.g. if a properly is defined as related, it should only appear ones in the result)
  • Respect the PowerShell pipeline (for the left object supplied through pipeline)
  • A reasonable performance by using a binary search (HashTable) on the right object (knowing that it needs to be iterated multiple times). Also note the Linq Join-Object solution by @ili101
  • Calculated properties
  • ...

iRon7 avatar Mar 11 '21 13:03 iRon7

An aside re:

And it could even prevent to invest too much in traditional operators as shown here.

As I've argued in detail in https://github.com/PowerShell/PowerShell/issues/14724#issuecomment-777801399, there is no reason to pit cmdlet-based solutions against expression / statement-based solutions: both are necessary, and in certain cases use of one over the other is the only option. Instead, we should strive for feature parity, to the extent that is feasible.

mklement0 avatar Mar 11 '21 13:03 mklement0

vote for this feature

chenxizhang avatar Apr 11 '23 08:04 chenxizhang

I use this type of feature all the time with a 3rd party module. If this could be natively part of the PowerShell command set, that would be awesome.

christopherbaxter avatar Apr 11 '23 19:04 christopherbaxter

It would be a highly useful addition to the suite. There is actually even another maintained version: https://github.com/ili101/Join-Object. This goes to show that there is a realistic need for such a feature.

GitHub
PowerShell Join-Object LINQ Edition. Contribute to ili101/Join-Object development by creating an account on GitHub.

moreaki avatar Apr 29 '23 20:04 moreaki

@moreaki

Thank you for your support.

another maintained version: https://github.com/ili101/Join-Object

I did already cover this one in the initial issue request. The Join-Object from ili101 is based on Linq and therefore probably always faster than my version, my counter arguments to this: The PowerShell pipeline is designed to stream objects (which safes memory), meaning that both¹ lists of input objects usually aren't (shouldn't be) resident in memory. Normally they are retrieved from somewhere else (i.e. a remote server or a disk). Also, the output usually matters where linq solutions are fast but might easily put you on the wrong foot in drawing conclusions because linq literally defers the execution (lazy evaluation), see also: fastest way to get a uniquely index item from the property of an array.
In other words, if it comes to (measuring) performance in PowerShell, it is important to look to the complete end-to-end solution, which is more likely to look something like:

import-csv .\file1.csv |LeftJoin (import-csv .\file2.csv) -On IP |Export-Csv .\results.csv

(1) Note: unfortunately, there is no easy way to build two parallel input streams (see: #15206 Deferred input pipelines)

Anyways, I have just published a new version, see: https://github.com/iRon7/Join-Object which performs about twice as fast as the previous one due to modifications based on this discussion: Optimize static code reusage #19322

GitHub
PowerShell Join-Object LINQ Edition. Contribute to ili101/Join-Object development by creating an account on GitHub.
GitHub
Combines two objects lists based on a related property between them. - GitHub - iRon7/Join-Object: Combines two objects lists based on a related property between them.

iRon7 avatar May 12 '23 13:05 iRon7

@iRon7 Regarding the performance points you mentioned, I just run some tests to check. on 100k rows test data: JoinModule LeftJoin-Object took 16s. Join-Object Join-Object took 4s, and 2s when selecting DataTable output.

Regarding Laziness my module is not Lazy, it will only return once everything executed as I wanted to return the simple objects that everyone know and not the Linq lazy objects.

Regarding memory I didn't log the results as it never was a limitation for me but from opening task manager my module seems to consume much less (or Task manager was too slow to show it before it finished?).

Regarding "end-to-end solution" well this is depended on the user use case, I tend to not pipe the output usually. For example a lot of the time I will pass the data to ImportExcel Export-Excel, in this case NOT piping the data and providing it as DataTable type will result in multiple orders of magnitude faster export.

ili101 avatar May 15 '23 18:05 ili101

I note that Join-Object is in the PowerShell Gallery.

I don't see any need for it being added to the default install. Doing that prevents it being extended independently of the PowerShell releases.

It means you can run a new version of the module on an older version of PowerShell.

rhubarb-geek-nz avatar Apr 14 '24 07:04 rhubarb-geek-nz

I don't see any need for it being added to the default install. Doing that prevents it being extended independently of the PowerShell releases.

It means you can run a new version of the module on an older version of PowerShell.

Adding it to the standard equipement doesn't exclude the possibility to have a independent corresponding version available on the PowerShell Gallery. In fact, including it in future PowerShell versions will lead to more consistent solutions for this more than a decennia old PowerShell desire.

A huge number of PowerShell questions on StackOverflow are about joining objects and could have a more common answer than the above outdated (Google top ranked) document or any of the Join-Object modules available on the PowerShell Gallery which all have different parameters and features.

Besides, the PowerShell Gallery might not always be easily available in e.g. a restricted environment or a remote session where it might be assumed that PowerShell (core) will once be included with future Window versions.

iRon7 avatar Apr 14 '24 09:04 iRon7

PowerShell is about mechanism, not policy. It originated from the idea from UNIX of do one thing and do it well. People often make the comparison between PowerShell and bash, but bash does not include any of the POSIX tools, cat/ls/grep etc.

Compare with somebody on Slashdot asking how to write a shell script to use JSON or XML, it does not mean that bash or zsh have to change. Instead, the suggestion is to use "jq" or whatever, if that is not installed, then make it a dependency/requirement for your script and add it to your system.

The locked/restricted down environment is often brought up as a reason why PowerShell must include everything. I suggest it demonstrates the complete opposite. With a locked down environment you have the minimal installed tool set to do the job. If the tool is not required to do the job it is not installed. If it is required, then it is installed. That allows you to have the smallest surface are. If the requirements of the system change and new capabilities are required then go through the change request process.

rhubarb-geek-nz avatar Apr 14 '24 11:04 rhubarb-geek-nz

@rhubarb-geek-nz,

PowerShell is about mechanism, not policy.

Agree, yet I wouldn't gather joining objects under "PowerShell must include everything" and it is compared to a lot of other languages, as Python (which has an .Update() method). Besides, you might be right that it originated from UNIX, I think that PowerShell comes standard with a set of CMDlets to actually replace the old-fasion CMD CLI. Anyways, as I see it: a Join-Object cmdlet would come close after the ForEach-Object, Where-Object and Select-Object cmdlets where you might even consider a PowerShell Object array method similar to the .foreach() and the .where(scriptblock expression[, WhereOperatorSelectionMode mode [, int numberToReturn]]):

`.Join(<array> [,On property[,Join mode]])`

Which by default does a side-by-side join (see also: #14732, similar to the Unix Paste command) and optionally bases a join on a related property (and/or a specific join type).

iRon7 avatar Apr 14 '24 14:04 iRon7

Certainly there is demand for Join-Object ( this is not an exhaustive list )

PSGallery Author Total Downloads Version Date Current Downloads
PoshFunctions riedyw 18,993 2.2.10 30/10/2023 9,925
JoinModule iRon 149,522 3.8.3 10/06/2023 44,753
Join iRon 17,412 3.8.3 10/06/2023 17,412
Join-Object illy 114,887 2.0.3 05/04/2022 81,965
PSExcel ramblingcookiemonster 623,341 1.0.2 09/08/2016 623,341
WFTools ramblingcookiemonster 24,545 0.1.58 04/06/2018 21,099

Yet in that popularity lies a problem.

I note that they all are variations on a theme buit often with very few command line arguments in common.

The PowerShell team are very sensitive about breaking changes, be it interpretation of a command line argument or how a particular file is parsed. Unlike a compiled language, a change to PowerShell itself can break every installation, especially with automatic updates.

Now if PowerShell did adopt one of the above (or write a new Core only standards conforming Join-Object) it would potentially break anyone using any of the un-choosen implementations.

You would then have further incompatibility when new scripts assume that the new standard Join-Object exists and this usage is incompatible with any of the other implementations. Something will break, it does not matter which module gets the priority, the gallery one or the one in c:\program files\powershell\7\Modules.

As the Whopper said in War Games, the only way to win is to not play the game. So let people choose the one they want from the PSGallery.

rhubarb-geek-nz avatar Apr 15 '24 17:04 rhubarb-geek-nz

Now if PowerShell did adopt one of the above (or write a new Core only standards conforming Join-Object) it would potentially break anyone using any of the un-choosen implementations.

Afaik, systems that already have a custom module/function ("any of the un-choosen implementations") installed won't break if a future PowerShell would include a common Join-Object cmdlet:

Unfortunately, I am well known with how standard proliferate as it happens a lot in organizations that do not have control, but the PowerShell team is in fact the leader here and the entity that might take control of this Join-Object clutter knowing that there can only be one Microsoft.PowerShell.Core\Join-Object. Besides, with my next JoinModule update, I will hide/archive the Join function version (one down), and if the PowerShell Team decides to equipe PowerShell with a standard Join-Object cmdlet with similar features as proposed in this issue, I will stop further development of my own implementation and happily refer to this standard PowerShell Join-Object instead (along with a migration document if required).

iRon7 avatar Apr 16 '24 09:04 iRon7

Ironically I found this post because I wanted a general mechanism to combine output from a pipeline, and the existing versions of Join-Object were not want I wanted. I wanted something that was like the existing joins.

PS> get-command -verb Join

CommandType     Name                                               Version    Source
-----------     ----                                               -------    ------
Cmdlet          Join-Array                                         1.0.0      rhubarb-geek-nz.Joinery
Cmdlet          Join-DtcDiagnosticResourceManager                  1.0.0.0    MsDtc
Cmdlet          Join-Path                                          7.0.0.0    Microsoft.PowerShell.Management
Cmdlet          Join-String                                        7.0.0.0    Microsoft.PowerShell.Utility

So hence Join-Array.

rhubarb-geek-nz avatar Apr 18 '24 00:04 rhubarb-geek-nz

I wanted a general mechanism to combine output from a pipeline

Isn't that just Foreach-Object { $_ }?

@( @(1,2) , @(3,4,5) ) | Foreach-Object { $_ } | ConvertTo-Json
[
  1,
  2,
  3,
  4,
  5
]

iRon7 avatar Apr 18 '24 06:04 iRon7

I wanted a general mechanism to combine output from a pipeline

Isn't that just Foreach-Object { $_ }?

No. I have some GZip tools that encode and decode GZip and work in a pipeline and a single file may be broken into multiple byte [] during the process

PS> Get-Content -LiteralPath bin\Release\netstandard2.0\publish\RhubarbGeekNz.GZip.dll -AsByteStream -Raw | ConvertTo-GZip | ForEach-Object { $_.Length }
10
8192
2997

The output is multiple byte[].

PS> Get-Content -LiteralPath bin\Release\netstandard2.0\publish\RhubarbGeekNz.GZip.dll -AsByteStream -Raw | ConvertTo-GZip | Join-Array -Type ([byte])  | ForEach-Object { $_.Length }
11199

So that efficiently combines them into a single record.

rhubarb-geek-nz avatar Apr 18 '24 07:04 rhubarb-geek-nz

This would be a highly useful addition to PowerShell!

0x7FFFFFFFFFFFFFFF avatar Apr 19 '24 16:04 0x7FFFFFFFFFFFFFFF

Seems like this should exist, along with the other LINQ equivalents like Any / Single / First / FirstOrDefault / Single / SingleOrDefault / etc..

While i think the heart of this issue is truly worthwhile i think we should be stressing the need for parity with the rest of the .NET ecosystem and we need the ability to handle both IEnumerable and IQueryable in a robust and powerful way. I've professionally spent 10's or 100's of hours building commandlets manually that would have been trivial if powershell handled IQueryable correctly.

I will take a leap and suggest most people using powershell care about the .NET ecosystem, so parity of features is paramount. Join-Object should have an exact 1-1 parity with system.linq.enumerable.join() And all the other features of LINQ should be fully supported.

hannasm avatar Jun 03 '24 03:06 hannasm

@hannasm,

Thank you for supporting this PowerShell enhancement request. I agree with you that Linq is very powerful and there should be a possibility to boost the a join cmdlet where both element lists have strict properties and reside already in memory (see the correction from rhubarb-geek-nz below) But I personally think that the general vision should be a cmdlet that supports idiomatic PowerShell (see the introduction of: PowerShell scripting performance considerations). Meaning that a recent StackOverflow question as Map csv content with Powershell table output [closed] from a novice PowerShell author could be answered with a single command line and that a Join-Object cmdlet should:

  • support the PowerShell pipeline (One-at-a-time processing for one of the element sides)
  • keep the element properties loosely, meaning: support for the general PSCustomObject class (rather than custom classes).

Pipeline

In a common situation, the initial source of the object lists will not be resident in memory but usually require to be retrieved from a (slower) long term storage (as the file system or a database) and the same applies for the target storage. This make the PowerShell pipeline very useful where you might just load one (the smaller) element list into memory and do One-at-a-time processing on the (larger) element list. As for the example StackOverflow question:

                               Import-Csv <data file>
                                         ↓
Import-Csv <large data file> | Join-Object (-on Alias) | Export-Csv <output data file>

(Where Import-Csv <large data file> could also be another streaming source as a database, using a cmdlet as Get-ADUser) This will consume less memory (knowing that the PowerShell objects are quiet heavy by design as they are actually optimized for the pipeline) and would hardly impact the performance if you measure the whole end-to-end solution.

Loosely

PowerShell is a loosely scripting language by design where each object property might contain a different type as apposed to Linq which pushes for strict element properties (as in a DataTable. See also comment from ili101 below) where each property in a column needs to be of the same type to be joined upon), this means that you might need to cast a whole column to a specific data type (as a [string]) where it might be desired to keep the individual types (along with a simple $Null, rather than empty string)

Versatile

Besides, unlike a rather restricted .net Linq Join method, a Join-Object cmdlet might pave the way to add some idiomatic features as proposed in other requests (as a side-by-side join of multiple arrays) without breaking the engine or current syntax.

iRon7 avatar Jun 03 '24 08:06 iRon7

where both element lists have strict properties and reside already in memory

Technically Linq does not require the data to be in memory. When working with database result sets you can use Linq and the actual select can be delayed until the last moment and the select performed on the database can be very efficient where an Any() can become a count(*) and the results are streamed through the result set. So in theory the database command result set streaming could be paralleled to a cmdlet's output pipeline. That said, there is a lot of infrastructure code that works with the different databases to make that happen.

rhubarb-geek-nz avatar Jun 03 '24 16:06 rhubarb-geek-nz

Loosely Linq which pushes for strict element properties

You can set the type to object, so it accepts practically any type. like here where the compare "column" is an object. it's not fun to use in Powershell as anything Linq/System.Func is very verbose but it's supported.

ili101 avatar Jun 03 '24 17:06 ili101

I don't believe the linq compatibility should be in conflict with conforming to idiomatic powershell. I mostly just think there is a clear feature set to base the design on.

Enumerable.Join() uses separate lambdas to specify the join keys for inner and outer collections being joined and then a final lambda to project the result. Something like this would seem an appropriate starting point.

If the collections implement Iqueryable they should delegate to an IQueryable implementation (I'm sure this will be very tricky because you have to convert the lambdas to expression trees). If the collections support IEnumerable they should delegate to an IEnumerable implementation. If there is a need for special cases against powershell collections that should certainly receive further attention.

Creating PSCustomObject instead of what frequently would be an anonymous type in C# seems completely reasonable. Supporting pipelines / cancellation / etc... would still be mandatory

On Mon, Jun 3, 2024, 12:33 rhubarb-geek-nz @.***> wrote:

where both element lists have strict properties and reside already in memory

Technically Linq does not require the data to be in memory. When working with database result sets you can use Linq and the actual select can be delayed until the last moment and the select performed on the database can be very efficient where an Any() can become a count(*) and the results are streamed through the result set. So in theory the database command result set streaming could be paralleled to a cmdlet's output pipeline. That said, there is a lot of infrastructure code that works with the different databases to make that happen.

— Reply to this email directly, view it on GitHub https://github.com/PowerShell/PowerShell/issues/14994#issuecomment-2145654895, or unsubscribe https://github.com/notifications/unsubscribe-auth/AASHZJHPOWF3JWK7AU37AQTZFSLGPAVCNFSM4ZAJMQP2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMJUGU3DKNBYHE2Q . You are receiving this because you were mentioned.Message ID: @.***>

hannasm avatar Jun 03 '24 17:06 hannasm

I don't believe the linq compatibility should be in conflict with conforming to idiomatic powershell..

I would rather Join-Object does PowerShell well rather than Linq. Linq is just a technology to achieve goals. The goal is to provide a consistent Join-Object.

On a similar vein I would rather that Join-Object does not become part of the core product but remains separately available via the PSGallery. Independence will give it greater freedom to evolve without being tied to the PowerShell delivery cycle.

rhubarb-geek-nz avatar Jun 03 '24 20:06 rhubarb-geek-nz