Improve NormalizedFilePath#453
Conversation
|
|
||
| We always store UTF-8 encoded file path in 'NormalizedFilePath'. This function first convert 'OsPath' | ||
| to 'FilePath' using current system encoding (see 'decodeFS'), then convert 'FilePath' to an UTF-8 | ||
| encoded 'ShortByteString'. Due to the encoding conversion, this function can fail. But DO NOTE THAT | ||
| encoding mismatch doesn't always mean an exception will be thrown. [Possibly your encoding simply won't | ||
| throw exception on failure](https://hackage.haskell.org/package/base-4.17.0.0/docs/src/GHC.IO.Encoding.html#initFileSystemEncoding). | ||
| Possibly the conversion function can't find any invalid byte sequence, giving a sucessful but wrong result. |
There was a problem hiding this comment.
Why do we need OsPath at all?
There was a problem hiding this comment.
Why do we not store an OsPath directly in NormalizedFilePath?
There was a problem hiding this comment.
Storing OsPath in NormalizedFilePath will make the conversion functions from/to FilePath having a MonadThrow constraint. Then we need to modify plenty of code in HLS to adapt it. And we can not avoid encoding/decoding completely before we adopt OsPath everywhere in HLS, so using OsPath has no significant benefit.
Since conversion functions to/from OsPath in lsp-types is provided now, we can make the migration to OsPath in HLS gradually when GHC 9.6 is released (or maybe when 9.6 becomes the oldest version of GHC we support? then we can avoid plenty of CPP)
There was a problem hiding this comment.
Storing OsPath in NormalizedFilePath will make the conversion functions from/to FilePath having a MonadThrow constraint.
I don't understand, why can't you just write something like the below that instantiates the MonadThrow constraint and immediately fails:
fromNormalizedFilePath (NormalizedFilePath _ osPath) = either (error . show) id (OsPath.decodeWith localEncoding osPath)
I don't see how this can ever error assuming that decoding is the inverse of encoding.
And we can not avoid encoding/decoding completely before we adopt OsPath everywhere in HLS, so using OsPath has no significant benefit.
Maybe it would help clarify what is the adoption plan for HLS. Are you saying that this is an intermediate state? Then lay out the plan in the comment so that we can ensure it gets executed when the time comes!
There was a problem hiding this comment.
why can't you just write something like the below that instantiates the
MonadThrowconstraint and immediately fails:
The idea is to avoid partial functions, and report errors properly.
I don't see how this can ever error assuming that decoding is the inverse of encoding.
Yes, but toNormalizedFilePath can fail if a given Char is not representable in the system encoding.
Then lay out the plan in the comment so that we can ensure it gets executed when the time comes!
Sure.
And note that OsPath is system-dependent. On Windows, it uses UTF16, which consumes more memory. On Unix, it depends on the system locale. Using OsPath inside NormalizedFilePath will lead to significantly different memory footprints on different systems.
In summary, OsPath doesn't provide enough benefits compared to UTF-8 encoded ShortByteString now for these reasons:
- can't avoid encoding/decoding, just like
ShortByteString - consume more memory on Windows and other non-UTF-8 systems.
- incurs partial functions
1 and 3 are solvable when OsPath becomes mainstream. 2 is worth discussing when we decide to use OsPath. So, I'm really not so sure about whether this is an intermediate situation or not. It will be more clear in the future.
There was a problem hiding this comment.
In summary, OsPath doesn't provide enough benefits compared to UTF-8 encoded ShortByteString now for these reasons.
Thanks, that clarifies the current choices a little bit for me. So this change is not about migrating to OsPath, but to a compact encoding.
Then I would vouch to simplify and use just the text library. To encode, use Text.pack and then extract the Array from inside, and to decode wrap in a Text constructor and call. Or simplify and just store a Text value.
This is compact only when text-2.0 is used, and semi-compact in earlier versions of the text library. I think that's fine, still a massive improvement over Strings.
7b38c20 to
4a009b3
Compare
4a009b3 to
77a4340
Compare
| encodeFilePath = BS.toShort . T.encodeUtf8 . T.pack | ||
| -- | Convert 'FilePath' to a UTF-8 encoded 'ShortByteString' | ||
| encodeFilePath :: FilePath -> ShortByteString | ||
| encodeFilePath = BS.pack . UTF8.encode |
There was a problem hiding this comment.
My local benchmark results show that this takes more time than BS.toShort . T.encodeUtf8 . T.pack in most cases. Quite surprising. Waiting for benchmark results on CI. haskell/haskell-language-server#3067
There was a problem hiding this comment.
Ok, maybe not that surprising given that text is likely much more optimised than utf8-string.
The best option here would be to leverage the fact that text-2.0 is internally isomorphic to a ShortByteString plus a pair of ints for slicing, and encoded in UTF-8. So you could just do:
#if min_version_text(2,0)
import Data.ByteString.Short
import qualified Data.Text as Text
import Data.Text.Array(Array(..))
import Data.Text.Internal (Text(Text))
...
encodeFilePath :: FilePath -> ShortByteString
encodeFilePath fp = case Text.pack fp of
Text (ByteArray sbs) _ _ -> SBS sbs
decodeFilePath :: ShortByteString -> FilePath
decodeFilePath (SBS fp) = Text.unpack $ Text (ByteArray fp) 0 (length fp)
michaelpj
left a comment
There was a problem hiding this comment.
Thanks everyone for the great discussion 👍
* improve NormalizedFilePath * add adoption plan * try Text * finalize

No description provided.