Initcap behaves differently in Spark and in DataFusion (also Comet) #1052

@Blizzara

Describe the bug

DataFusion's initcap behaves differently from Spark's. Both uppercase the first letter of each word and lowercase the rest, but they disagree on what counts as a word: Spark treats only whitespace (' ') as a word separator, while DataFusion treats every character that is not ASCII alphanumeric as a separator. (DataFusion's code would also fail to uppercase or lowercase non-ASCII characters, but that does not surface as a separate issue because it already treats them as separators in the first place.)
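To make the difference concrete, here is a minimal Python sketch of the two word-splitting rules as described above. It is not the actual Spark or DataFusion code: the function names and the lowercase sample inputs are hypothetical, chosen only so that the outputs line up with the two differing rows in the test results.

```python
def spark_like_initcap(s: str) -> str:
    """Sketch of the whitespace-only rule: lowercase everything, then
    uppercase the first character and any character following a space."""
    out = []
    prev_is_space = True  # treat the start of the string as a word start
    for ch in s.lower():
        out.append(ch.upper() if prev_is_space else ch)
        prev_is_space = ch == " "
    return "".join(out)


def datafusion_like_initcap(s: str) -> str:
    """Sketch of the ASCII-alphanumeric rule: any other character is copied
    unchanged and starts a new word."""
    out = []
    start_of_word = True
    for ch in s:
        if ch.isascii() and ch.isalnum():
            out.append(ch.upper() if start_of_word else ch.lower())
            start_of_word = False
        else:
            # Dashes, spaces, and non-ASCII letters all act as separators.
            out.append(ch)
            start_of_word = True
    return "".join(out)


# Hypothetical inputs (assumed, not taken from the test data).
for name in ["robert rose-smith", "james ähtäri"]:
    print(spark_like_initcap(name), "|", datafusion_like_initcap(name))
```

Under these assumptions the dash only starts a new word in the second function, and the non-ASCII letters are only case-mapped in the first.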

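#1051 shows the problem by adding two cases to the test, one using a dash and one using non-ASCII letters (from Finnish). In the test output below, the rows marked with ! differ: the left column follows the whitespace-only rule described for Spark, and the right column follows the non-ASCII-alphanumeric rule described for DataFusion.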

== Results ==
!== Correct Answer - 7 ==       == Spark Answer - 7 ==
 struct<initcap(name):string>   struct<initcap(name):string>
 [James Smith]                  [James Smith]
 [James Smith]                  [James Smith]
![James Ähtäri]                 [James äHtäRi]
 [Michael Rose]                 [Michael Rose]
 [Rames Rose]                   [Rames Rose]
![Robert Rose-smith]            [Robert Rose-Smith]
 [Robert Williams]              [Robert Williams]

Steps to reproduce

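Call initcap with an input that contains characters that are neither ASCII alphanumeric nor whitespace, for example a hyphenated name (rose-smith) or a name with non-ASCII letters (ähtäri).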

Expected behavior

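initcap should match Spark: only whitespace separates words, and non-ASCII letters are uppercased or lowercased like any other letter.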

Additional context

No response

Labels: bug (Something isn't working)
