[SPARK-36081][SPARK-36066][SQL] Update the document about the behavior change of trimming characters for cast

sarutak · cloud-fan · commit 57a4f310df30 · 2021-07-13T20:28:47.000+08:00
### What changes were proposed in this pull request?

This PR updates the comment for `UTF8String.trimAll` and `sql-migration-guide.md`.

The comment for `UTF8String.trimAll` currently reads:

```
Trims whitespaces ({@literal <=} ASCII 32) from both ends of this string.
```

Similarly, `sql-migration-guide.md` describes the behavior of `cast` as follows:

```
In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint), datetime types(date, timestamp and interval) and boolean type, the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values, for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`, `cast('2019-10-10\t as date)` results the date value `2019-10-10`. In Spark version 2.4 and below, when casting string to integrals and booleans, it does not trim the whitespaces from both ends; the foregoing results is `null`, while to datetimes, only the trailing spaces (= ASCII 32) are removed.
```

However, SPARK-32559 (#29375) changed this behavior: since Spark 3.0.1, only whitespace ASCII characters are trimmed.

### Why are the changes needed?

To bring the documentation in line with that earlier change.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed that the documentation builds with the following command:

```
SKIP_API=1 bundle exec jekyll build
```

Closes #33287 from sarutak/fix-utf8string-trim-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
diff --git a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
@@ -562,24 +562,24 @@ public UTF8String trim() {
   }
 
   /**
-   * Trims whitespaces ({@literal <=} ASCII 32) from both ends of this string.
+   * Trims whitespace ASCII characters from both ends of this string.
    *
-   * Note that, this method is the same as java's {@link String#trim}, and different from
-   * {@link UTF8String#trim()} which remove only spaces(= ASCII 32) from both ends.
+   * Note that, this method is different from {@link UTF8String#trim()} which removes
+   * only spaces(= ASCII 32) from both ends.
    *
    * @return A UTF8String whose value is this UTF8String, with any leading and trailing white
    * space removed, or this UTF8String if it has no leading or trailing whitespace.
    *
    */
   public UTF8String trimAll() {
     int s = 0;
-    // skip all of the whitespaces (<=0x20) in the left side
+    // skip all of the whitespaces in the left side
     while (s < this.numBytes && Character.isWhitespace(getByte(s))) s++;
     if (s == this.numBytes) {
       // Everything trimmed
       return EMPTY_UTF8;
     }
-    // skip all of the whitespaces (<=0x20) in the right side
+    // skip all of the whitespaces in the right side
     int e = this.numBytes - 1;
     while (e > s && Character.isWhitespace(getByte(e))) e--;
     if (s == 0 && e == numBytes - 1) {
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
@@ -159,6 +159,8 @@ license: |
 
 - In Spark 3.0, JSON datasource and JSON function `schema_of_json` infer TimestampType from string values if they match to the pattern defined by the JSON option `timestampFormat`. Since version 3.0.1, the timestamp type inference is disabled by default. Set the JSON option `inferTimestamp` to `true` to enable such type inference.
 
+- In Spark 3.0, when casting string to integral types(tinyint, smallint, int and bigint), datetime types(date, timestamp and interval) and boolean type, the leading and trailing characters (<= ASCII 32) will be trimmed. For example, `cast('\b1\b' as int)` results `1`. Since Spark 3.0.1, only the leading and trailing whitespace ASCII characters will be trimmed. For example, `cast('\t1\t' as int)` results `1` but `cast('\b1\b' as int)` results `NULL`.
+
 ## Upgrading from Spark SQL 2.4 to 3.0
 
 ### Dataset/DataFrame APIs
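
The difference between the two trimming rules can be sketched in plain Java. This is a minimal illustration, not Spark's implementation: the class `TrimAllSketch`, the helper `trim`, and the predicate names `LEGACY` and `CURRENT` are hypothetical, and `LEGACY` is only an assumption modelled on the old "`<= ASCII 32`" wording. `CURRENT` defers to `Character.isWhitespace`, the call that appears in the `trimAll` loop in the diff above.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.IntPredicate;

public class TrimAllSketch {
    // Assumption: the rule the old comment described, i.e. any byte in [0, 32] is trimmed.
    static final IntPredicate LEGACY = b -> b >= 0 && b <= 0x20;

    // The rule the updated comment describes: only what Character.isWhitespace accepts.
    static final IntPredicate CURRENT = Character::isWhitespace;

    // Hypothetical helper mirroring the shape of the trimAll loop on a UTF-8 byte array.
    static String trim(String input, IntPredicate trimmable) {
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        int s = 0;
        // skip trimmable bytes on the left side
        while (s < bytes.length && trimmable.test(bytes[s])) s++;
        if (s == bytes.length) {
            return ""; // everything trimmed
        }
        int e = bytes.length - 1;
        // skip trimmable bytes on the right side
        while (e > s && trimmable.test(bytes[e])) e--;
        return new String(bytes, s, e - s + 1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Tab (0x09) is whitespace under both rules.
        System.out.println(trim("\t1\t", LEGACY).equals("1"));   // true
        System.out.println(trim("\t1\t", CURRENT).equals("1"));  // true

        // Backspace (0x08) is <= 32 but is not whitespace.
        System.out.println(trim("\b1\b", LEGACY).equals("1"));   // true
        System.out.println(trim("\b1\b", CURRENT).length());     // 3
    }
}
```

Under the `CURRENT` rule the backspace characters survive trimming, which is consistent with the migration guide entry stating that `cast('\b1\b' as int)` results in `NULL` since Spark 3.0.1.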
