Skip to content

Conversation

@norberttech
Copy link
Member

@norberttech norberttech commented Jul 26, 2024

Change Log

Added

  • Filesystem appendTo - which allows to append data to existing stream

Fixed

  • External Sort memory consumption

Changed

  • Simplified Cache abstraction
  • Cache implementations does not need to maintain index anymore, this logic was moved outside of Cache
  • Extracted Sorting related configuration to standalone Config/ConfigBuilder
  • Extracted Caching related configuration to standalone Config/ConfigBuilder
  • Reduced default cache batch size to 100 from 1000

Removed

  • NullCache implementation that wasn't used anywhere

Deprecated

Security


Description

Resolves: #1142
Resolves: #1035

@domis86
Copy link

domis86 commented Jul 26, 2024

some results of manual checks in local:

  • sorting of 10k rows (4.2 MB file): memory peak usage went down from ~150MB to ~67MB
  • sorting of 100k rows (42 MB file): memory peak usage went down from ~1500MB to ~100MB
  • sorting of 1000k rows (420 MB file): error:
PHP Warning:  fopen(/tmp/flow-php-external-sort/4b7d9f696d01dbf3fbbaffc532f514fc/rows.php.cache): Failed to open stream: Too many open files in /var/www/other/open-source/flow-php/flow/src/lib/filesystem/src/Flow/Filesystem/Stream/NativeLocalSourceStream.php on line 31
PHP Warning:  include(/var/www/other/open-source/flow-php/flow/src/lib/filesystem/src/Flow/Filesystem/Exception/RuntimeException.php): Failed to open stream: Too many open files in /var/www/other/open-source/flow-php/flow/vendor/composer/ClassLoader.php on line 576
PHP Warning:  include(): Failed opening '/var/www/other/open-source/flow-php/flow/vendor/composer/../../src/lib/filesystem/src/Flow/Filesystem/Exception/RuntimeException.php' for inclusion (include_path='.:/usr/share/php') in /var/www/other/open-source/flow-php/flow/vendor/composer/ClassLoader.php on line 576
PHP Fatal error:  Uncaught Error: Class "Flow\Filesystem\Exception\RuntimeException" not found in /var/www/other/open-source/flow-php/flow/src/lib/filesystem/src/Flow/Filesystem/Stream/NativeLocalSourceStream.php:34
Stack trace:
#0 /var/www/other/open-source/flow-php/flow/src/lib/filesystem/src/Flow/Filesystem/Local/NativeLocalFilesystem.php(80): Flow\Filesystem\Stream\NativeLocalSourceStream::open()
#1 /var/www/other/open-source/flow-php/flow/src/core/etl/src/Flow/ETL/Sort/ExternalSort/RowCache/FilesystemSortRowCache.php(36): Flow\Filesystem\Local\NativeLocalFilesystem->readFrom()
#2 [internal function]: Flow\ETL\Sort\ExternalSort\RowCache\FilesystemSortRowCache->get()
#3 /var/www/other/open-source/flow-php/flow/src/core/etl/src/Flow/ETL/Sort/CacheSort.php(30): Generator->valid()
#4 /var/www/other/open-source/flow-php/flow/src/core/etl/src/Flow/ETL/Sort/ExternalSort.php(48): Flow\ETL\Sort\CacheSort->sort()
#5 /var/www/other/open-source/flow-php/flow/src/core/etl/src/Flow/ETL/DataFrame.php(810): Flow\ETL\Sort\ExternalSort->sortBy()
#6 /var/www/other/open-source/flow-php/flow/src/test_issue_1035.php(74): Flow\ETL\DataFrame->sortBy()

^ stats of tmp directory after this error:

tree /tmp/flow-php-external-sort | tail -1
10000 directories, 10000 files

du -sh /tmp/flow-php-external-sort
5,5G	/tmp/flow-php-external-sort

^ If possible we should try to implement closing of handles to files which are not used at certain moment. Or advice to increase file system open files limit?

Checked example file "/tmp/flow-php-external-sort/*/rows.php.cache"

  • its size is ~577172 . Has 100 lines, so each has ~5700 size. In original csv file size of line is ~500 - like:
8a7c3f29-e669-3aba-8777-8d5075f0960b,2023-05-20T08:59:30+00:00,2023-10-12T03:24:25+00:00,,239.94,47.79,"{""name"":""Kasandra"",""last_name"":""Reynolds"",""email"":""sadie.herzog@ledner.com"",""has_account"":false}","{""name"":""Kasandra"",""last_name"":""Reynolds"",""email"":""sadie.herzog@ledner.com"",""has_account"":false}","{""name"":""Kasandra"",""last_name"":""Reynolds"",""email"":""sadie.herzog@ledner.com"",""has_account"":false}"
  • example serialized line is
TzoxMjoiRmxvd1xFVExcUm93IjoxOntzOjIxOiIARmxvd1xFVExcUm93AGVudHJpZXMiO086MjA6IkZsb3dcRVRMXFJvd1xFbnRyaWVzIjoxOntzOjI5OiIARmxvd1xFVExcUm93XEVudHJpZXMAZW50cmllcyI7YTo5OntzOjg6Im9yZGVyX2lkIjtPOjI4OiJGbG93XEVUTFxSb3dcRW50cnlcVXVpZEVudHJ5Ijo0OntzOjM0OiIARmxvd1xFVExcUm93XEVudHJ5XFV1aWRFbnRyeQB0eXBlIjtPOjM0OiJGbG93XEVUTFxQSFBcVHlwZVxMb2dpY2FsXFV1aWRUeXBlIjoxOntzOjQ0OiIARmxvd1xFVExcUEhQXFR5cGVcTG9naWNhbFxVdWlkVHlwZQBudWxsYWJsZSI7YjowO31zOjM1OiIARmxvd1xFVExcUm93XEVudHJ5XFV1aWRFbnRyeQB2YWx1ZSI7TzoyMzoiRmxvd1xFVExcUEhQXFZhbHVlXFV1aWQiOjE6e3M6MzA6IgBGbG93XEVUTFxQSFBcVmFsdWVcVXVpZAB2YWx1ZSI7czozNjoiMDA5NTYxZjEtNjYxOC0zMjEwLWFkMWQtYzhjZTc1OTRiM2NiIjt9czozNDoiAEZsb3dcRVRMXFJvd1xFbnRyeVxVdWlkRW50cnkAbmFtZSI7czo4OiJvcmRlcl9pZCI7czozMzoiAEZsb3dcRVRMXFJvd1xFbnRyeVxVdWlkRW50cnkAcmVmIjtOO31zOjEwOiJjcmVhdGVkX2F0IjtPOjMwOiJGbG93XEVUTFxSb3dcRW50cnlcU3RyaW5nRW50cnkiOjQ6e3M6MzY6IgBGbG93XEVUTFxSb3dcRW50cnlcU3RyaW5nRW50cnkAdHlwZSI7TzozNToiRmxvd1xFVExcUEhQXFR5cGVcTmF0aXZlXFNjYWxhclR5cGUiOjI6e3M6NDE6IgBGbG93XEVUTFxQSFBcVHlwZVxOYXRpdmVcU2NhbGFyVHlwZQB0eXBlIjtzOjY6InN0cmluZyI7czo0NToiAEZsb3dcRVRMXFBIUFxUeXBlXE5hdGl2ZVxTY2FsYXJUeXBlAG51bGxhYmxlIjtiOjA7fXM6MzY6IgBGbG93XEVUTFxSb3dcRW50cnlcU3RyaW5nRW50cnkAbmFtZSI7czoxMDoiY3JlYXRlZF9hdCI7czozNzoiAEZsb3dcRVRMXFJvd1xFbnRyeVxTdHJpbmdFbnRyeQB2YWx1ZSI7czoyNToiMjAyMy0wNS0yNVQyMDoyMzoyOSswMDowMCI7czozNToiAEZsb3dcRVRMXFJvd1xFbnRyeVxTdHJpbmdFbnRyeQByZWYiO047fXM6MTA6InVwZGF0ZWRfYXQiO086MzA6IkZsb3dcRVRMXFJvd1xFbnRyeVxTdHJpbmdFbnRyeSI6NDp7czozNjoiAEZsb3dcRVRMXFJvd1xFbnRyeVxTdHJpbmdFbnRyeQB0eXBlIjtPOjM1OiJGbG93XEVUTFxQSFBcVHlwZVxOYXRpdmVcU2NhbGFyVHlwZSI6Mjp7czo0MToiAEZsb3dcRVRMXFBIUFxUeXBlXE5hdGl2ZVxTY2FsYXJUeXBlAHR5cGUiO3M6Njoic3RyaW5nIjtzOjQ1OiIARmxvd1xFVExcUEhQXFR5cGVcTmF0aXZlXFNjYWxhclR5cGUAbnVsbGFibGUiO2I6MDt9czozNjoiAEZsb3dcRVRMXFJvd1xFbnRyeVxTdHJpbmdFbnRyeQBuYW1lIjtzOjEwOiJ1cGRhdGVkX2F0IjtzOjM3OiIARmxvd1xFVExcUm93XEVudHJ5XFN0cmluZ0VudHJ5AHZhbHVlIjtzOjI1OiIyMDIzLTA5LTE5VDIxOjAxOjQwKzAwOjAwIjtzOjM1OiIARmxvd1xFVExcUm93XEVudHJ5XFN0cmluZ0VudHJ5AHJlZiI7Tjt9czoxMjoiY2FuY2VsbGVkX2F0IjtPOjMwOiJGbG93XEVUTFxSb3dcRW50cnlcU3RyaW5nRW50cnkiOjQ6e3M6MzY6IgBGbG93XEVUTFxSb3dcRW50cnlcU3RyaW5nRW50cnkAdHlwZSI7TzozNToiRmxvd1xFVExcUEhQXFR5cGVcTmF0aXZlXFNjYWxhclR5cGUiOjI6e3M6NDE6IgBGbG93XEVUTFxQSFBcVHlwZVxOYXRpdmVcU2NhbGFyVHlwZQB0eXBlIjtzOjY6InN0cmluZyI7czo0NToiAEZsb3dcRVRMXFBIUFxUeXBlXE5hdGl2ZVxTY2FsYXJUeXBlAG51bGxhYmxlIjtiOjE7fXM6MzY6IgBGbG93XEVUTFxSb3dcRW50cnlcU3RyaW5nRW50cnkAbmFtZSI7czoxMjoiY2FuY2VsbGVkX2F0IjtzOjM3OiIARmxvd1xFVExcUm93XEVudHJ5XFN0cmluZ0VudHJ5AHZhbHVlIjtOO3M6MzU6IgBGbG93XEVUTFxSb3dcRW50cnlcU3RyaW5nRW50cnkAcmVmIjtOO31zOjExOiJ0b3RhbF9wcmljZSI7TzozMDoiRmxvd1xFVExcUm93XEVudHJ5XFN0cmluZ0VudHJ5Ijo0OntzOjM2OiIARmxvd1xFVExcUm93XEVudHJ5XFN0cmluZ0VudHJ5AHR5cGUiO086MzU6IkZsb3dcRVRMXFBIUFxUeXBlXE5hdGl2ZVxTY2FsYXJUeXBlIjoyOntzOjQxOiIARmxvd1xFVExcUEhQXFR5cGVcTmF0aXZlXFNjYWxhclR5cGUAdHlwZSI7czo2OiJzdHJpbmciO3M6NDU6IgBGbG93XEVUTFxQSFBcVHlwZVxOYXRpdmVcU2NhbGFyVHlwZQBudWxsYWJsZSI7YjowO31zOjM2OiIARmxvd1xFVExcUm93XEVudHJ5XFN0cmluZ0VudHJ5AG5hbWUiO3M6MTE6InRvdGFsX3ByaWNlIjtzOjM3OiIARmxvd1xFVExcUm93XEVudHJ5XFN0cmluZ0VudHJ5AHZhbHVlIjtzOjY6IjI3Ni45MiI7czozNToiAEZsb3dcRVRMXFJvd1xFbnRyeVxTdHJpbmdFbnRyeQByZWYiO047fXM6ODoiZGlzY291bnQiO086MzA6IkZsb3dcRVRMXFJvd1xFbnRyeVxTdHJpbmdFbnRyeSI6NDp7czozNjoiAEZsb3dcRVRMXFJvd1xFbnRyeVxTdHJpbmdFbnRyeQB0eXBlIjtPOjM1OiJGbG93XEVUTFxQSFBcVHlwZVxOYXRpdmVcU2NhbGFyVHlwZSI6Mjp7czo0MToiAEZsb3dcRVRMXFBIUFxUeXBlXE5hdGl2ZVxTY2FsYXJUeXBlAHR5cGUiO3M6Njoic3RyaW5nIjtzOjQ1OiIARmxvd1xFVExcUEhQXFR5cGVcTmF0aXZlXFNjYWxhclR5cGUAbnVsbGFibGUiO2I6MDt9czozNjoiAEZsb3dcRVRMXFJvd1xFbnRyeVxTdHJpbmdFbnRyeQBuYW1lIjtzOjg6ImRpc2NvdW50IjtzOjM3OiIARmxvd1xFVExcUm93XEVudHJ5XFN0cmluZ0VudHJ5AHZhbHVlIjtzOjQ6IjguOTUiO3M6MzU6IgBGbG93XEVUTFxSb3dcRW50cnlcU3RyaW5nRW50cnkAcmVmIjtOO31zOjg6ImN1c3RvbWVyIjtPOjI4OiJGbG93XEVUTFxSb3dcRW50cnlcSnNvbkVudHJ5Ijo1OntzOjM2OiIARmxvd1xFVExcUm93XEVudHJ5XEpzb25FbnRyeQBvYmplY3QiO2I6MTtzOjM0OiIARmxvd1xFVExcUm93XEVudHJ5XEpzb25FbnRyeQB0eXBlIjtPOjM0OiJGbG93XEVUTFxQSFBcVHlwZVxMb2dpY2FsXEpzb25UeXBlIjoxOntzOjQ0OiIARmxvd1xFVExcUEhQXFR5cGVcTG9naWNhbFxKc29uVHlwZQBudWxsYWJsZSI7YjowO31zOjM1OiIARmxvd1xFVExcUm93XEVudHJ5XEpzb25FbnRyeQB2YWx1ZSI7YTo0OntzOjQ6Im5hbWUiO3M6NzoiS2FtZXJvbiI7czo5OiJsYXN0X25hbWUiO3M6MzoiVm9uIjtzOjU6ImVtYWlsIjtzOjIwOiJlbGlzZW81NkBob3RtYWlsLmNvbSI7czoxMToiaGFzX2FjY291bnQiO2I6MDt9czozNDoiAEZsb3dcRVRMXFJvd1xFbnRyeVxKc29uRW50cnkAbmFtZSI7czo4OiJjdXN0b21lciI7czozMzoiAEZsb3dcRVRMXFJvd1xFbnRyeVxKc29uRW50cnkAcmVmIjtOO31zOjc6ImFkZHJlc3MiO086Mjg6IkZsb3dcRVRMXFJvd1xFbnRyeVxKc29uRW50cnkiOjU6e3M6MzY6IgBGbG93XEVUTFxSb3dcRW50cnlcSnNvbkVudHJ5AG9iamVjdCI7YjoxO3M6MzQ6IgBGbG93XEVUTFxSb3dcRW50cnlcSnNvbkVudHJ5AHR5cGUiO086MzQ6IkZsb3dcRVRMXFBIUFxUeXBlXExvZ2ljYWxcSnNvblR5cGUiOjE6e3M6NDQ6IgBGbG93XEVUTFxQSFBcVHlwZVxMb2dpY2FsXEpzb25UeXBlAG51bGxhYmxlIjtiOjA7fXM6MzU6IgBGbG93XEVUTFxSb3dcRW50cnlcSnNvbkVudHJ5AHZhbHVlIjthOjQ6e3M6NDoibmFtZSI7czo3OiJLYW1lcm9uIjtzOjk6Imxhc3RfbmFtZSI7czozOiJWb24iO3M6NToiZW1haWwiO3M6MjA6ImVsaXNlbzU2QGhvdG1haWwuY29tIjtzOjExOiJoYXNfYWNjb3VudCI7YjowO31zOjM0OiIARmxvd1xFVExcUm93XEVudHJ5XEpzb25FbnRyeQBuYW1lIjtzOjc6ImFkZHJlc3MiO3M6MzM6IgBGbG93XEVUTFxSb3dcRW50cnlcSnNvbkVudHJ5AHJlZiI7Tjt9czo1OiJub3RlcyI7TzoyODoiRmxvd1xFVExcUm93XEVudHJ5XEpzb25FbnRyeSI6NTp7czozNjoiAEZsb3dcRVRMXFJvd1xFbnRyeVxKc29uRW50cnkAb2JqZWN0IjtiOjE7czozNDoiAEZsb3dcRVRMXFJvd1xFbnRyeVxKc29uRW50cnkAdHlwZSI7TzozNDoiRmxvd1xFVExcUEhQXFR5cGVcTG9naWNhbFxKc29uVHlwZSI6MTp7czo0NDoiAEZsb3dcRVRMXFBIUFxUeXBlXExvZ2ljYWxcSnNvblR5cGUAbnVsbGFibGUiO2I6MDt9czozNToiAEZsb3dcRVRMXFJvd1xFbnRyeVxKc29uRW50cnkAdmFsdWUiO2E6NDp7czo0OiJuYW1lIjtzOjc6IkthbWVyb24iO3M6OToibGFzdF9uYW1lIjtzOjM6IlZvbiI7czo1OiJlbWFpbCI7czoyMDoiZWxpc2VvNTZAaG90bWFpbC5jb20iO3M6MTE6Imhhc19hY2NvdW50IjtiOjA7fXM6MzQ6IgBGbG93XEVUTFxSb3dcRW50cnlcSnNvbkVudHJ5AG5hbWUiO3M6NToibm90ZXMiO3M6MzM6IgBGbG93XEVUTFxSb3dcRW50cnlcSnNvbkVudHJ5AHJlZiI7Tjt9fX19

which base64 decoded is for example:

O:12:"Flow\ETL\Row":1:{s:21:"Flow\ETL\Rowentries";O:20:"Flow\ETL\Row\Entries":1:{s:29:"Flow\ETL\Row\Entriesentries";a:9:{s:8:"order_id";O:28:"Flow\ETL\Row\Entry\UuidEntry":4:{s:34:"Flow\ETL\Row\Entry\UuidEntrytype";O:34:"Flow\ETL\PHP\Type\Logical\UuidType":1:{s:44:"Flow\ETL\PHP\Type\Logical\UuidTypenullable";b:0;}s:35:"Flow\ETL\Row\Entry\UuidEntryvalue";O:23:"Flow\ETL\PHP\Value\Uuid":1:{s:30:"Flow\ETL\PHP\Value\Uuidvalue";s:36:"009561f1-6618-3210-ad1d-c8ce7594b3cb";}s:34:"Flow\ETL\Row\Entry\UuidEntryname";s:8:"order_id";s:33:"Flow\ETL\Row\Entry\UuidEntryref";N;}s:10:"created_at";O:30:"Flow\ETL\Row\Entry\StringEntry":4:{s:36:"Flow\ETL\Row\Entry\StringEntrytype";O:35:"Flow\ETL\PHP\Type\Native\ScalarType":2:{s:41:"Flow\ETL\PHP\Type\Native\ScalarTypetype";s:6:"string";s:45:"Flow\ETL\PHP\Type\Native\ScalarTypenullable";b:0;}s:36:"Flow\ETL\Row\Entry\StringEntryname";s:10:"created_at";s:37:"Flow\ETL\Row\Entry\StringEntryvalue";s:25:"2023-05-25T20:23:29+00:00";s:35:"Flow\ETL\Row\Entry\StringEntryref";N;}s:10:"updated_at";O:30:"Flow\ETL\Row\Entry\StringEntry":4:{s:36:"Flow\ETL\Row\Entry\StringEntrytype";O:35:"Flow\ETL\PHP\Type\Native\ScalarType":2:{s:41:"Flow\ETL\PHP\Type\Native\ScalarTypetype";s:6:"string";s:45:"Flow\ETL\PHP\Type\Native\ScalarTypenullable";b:0;}s:36:"Flow\ETL\Row\Entry\StringEntryname";s:10:"updated_at";s:37:"Flow\ETL\Row\Entry\StringEntryvalue";s:25:"2023-09-19T21:01:40+00:00";s:35:"Flow\ETL\Row\Entry\StringEntryref";N;}s:12:"cancelled_at";O:30:"Flow\ETL\Row\Entry\StringEntry":4:{s:36:"Flow\ETL\Row\Entry\StringEntrytype";O:35:"Flow\ETL\PHP\Type\Native\ScalarType":2:{s:41:"Flow\ETL\PHP\Type\Native\ScalarTypetype";s:6:"string";s:45:"Flow\ETL\PHP\Type\Native\ScalarTypenullable";b:1;}s:36:"Flow\ETL\Row\Entry\StringEntryname";s:12:"cancelled_at";s:37:"Flow\ETL\Row\Entry\StringEntryvalue";N;s:35:"Flow\ETL\Row\Entry\StringEntryref";N;}s:11:"total_price";O:30:"Flow\ETL\Row\Entry\StringEntry":4:{s:36:"Flow\ETL\Row\Entry\StringEntrytype";O:35:"Flow\ETL\PHP\Type\Native\ScalarType":2:{s:41:"Flow\ETL\PHP\Type\Native\ScalarTypetype";s:6:"string";s:45:"Flow\ETL\PHP\Type\Native\ScalarTypenullable";b:0;}s:36:"Flow\ETL\Row\Entry\StringEntryname";s:11:"total_price";s:37:"Flow\ETL\Row\Entry\StringEntryvalue";s:6:"276.92";s:35:"Flow\ETL\Row\Entry\StringEntryref";N;}s:8:"discount";O:30:"Flow\ETL\Row\Entry\StringEntry":4:{s:36:"Flow\ETL\Row\Entry\StringEntrytype";O:35:"Flow\ETL\PHP\Type\Native\ScalarType":2:{s:41:"Flow\ETL\PHP\Type\Native\ScalarTypetype";s:6:"string";s:45:"Flow\ETL\PHP\Type\Native\ScalarTypenullable";b:0;}s:36:"Flow\ETL\Row\Entry\StringEntryname";s:8:"discount";s:37:"Flow\ETL\Row\Entry\StringEntryvalue";s:4:"8.95";s:35:"Flow\ETL\Row\Entry\StringEntryref";N;}s:8:"customer";O:28:"Flow\ETL\Row\Entry\JsonEntry":5:{s:36:"Flow\ETL\Row\Entry\JsonEntryobject";b:1;s:34:"Flow\ETL\Row\Entry\JsonEntrytype";O:34:"Flow\ETL\PHP\Type\Logical\JsonType":1:{s:44:"Flow\ETL\PHP\Type\Logical\JsonTypenullable";b:0;}s:35:"Flow\ETL\Row\Entry\JsonEntryvalue";a:4:{s:4:"name";s:7:"Kameron";s:9:"last_name";s:3:"Von";s:5:"email";s:20:"eliseo56@hotmail.com";s:11:"has_account";b:0;}s:34:"Flow\ETL\Row\Entry\JsonEntryname";s:8:"customer";s:33:"Flow\ETL\Row\Entry\JsonEntryref";N;}s:7:"address";O:28:"Flow\ETL\Row\Entry\JsonEntry":5:{s:36:"Flow\ETL\Row\Entry\JsonEntryobject";b:1;s:34:"Flow\ETL\Row\Entry\JsonEntrytype";O:34:"Flow\ETL\PHP\Type\Logical\JsonType":1:{s:44:"Flow\ETL\PHP\Type\Logical\JsonTypenullable";b:0;}s:35:"Flow\ETL\Row\Entry\JsonEntryvalue";a:4:{s:4:"name";s:7:"Kameron";s:9:"last_name";s:3:"Von";s:5:"email";s:20:"eliseo56@hotmail.com";s:11:"has_account";b:0;}s:34:"Flow\ETL\Row\Entry\JsonEntryname";s:7:"address";s:33:"Flow\ETL\Row\Entry\JsonEntryref";N;}s:5:"notes";O:28:"Flow\ETL\Row\Entry\JsonEntry":5:{s:36:"Flow\ETL\Row\Entry\JsonEntryobject";b:1;s:34:"Flow\ETL\Row\Entry\JsonEntrytype";O:34:"Flow\ETL\PHP\Type\Logical\JsonType":1:{s:44:"Flow\ETL\PHP\Type\Logical\JsonTypenullable";b:0;}s:35:"Flow\ETL\Row\Entry\JsonEntryvalue";a:4:{s:4:"name";s:7:"Kameron";s:9:"last_name";s:3:"Von";s:5:"email";s:20:"eliseo56@hotmail.com";s:11:"has_account";b:0;}s:34:"Flow\ETL\Row\Entry\JsonEntryname";s:5:"notes";s:33:"Flow\ETL\Row\Entry\JsonEntryref";N;}}}}

^ the info about object properties etc is making serialized value to be ~x10 times bigger than original. Probably different serialization method should be used?

@norberttech
Copy link
Member Author

thanks @domis86 this is super helpful, will take a look into this after weekend!! 🍻

@github-actions github-actions bot added size: XL and removed size: L labels Jul 28, 2024
@norberttech
Copy link
Member Author

I just pushed improved version, some answers below.

sorting of 10k rows (4.2 MB file): memory peak usage went down from ~150MB to ~67MB
sorting of 100k rows (42 MB file): memory peak usage went down from ~1500MB to ~100MB
sorting of 1000k rows (420 MB file): error:

It's because the default batch size that was used by the implementation is 1. This way external sort was creating as many buckets as many rows you tried to sort.
It's now optimized and you can set a number of used buckets in one run, through configuration. The algorithm will sort and merge buckets recursively until it reduces output to one bucket.

^ If possible we should try to implement closing of handles to files which are not used at certain moment. Or advice to increase file system open files limit?

This was a side effect of batch size issue, it wont happen anymore.

its size is ~577172 . Has 100 lines, so each has ~5700 size. In original csv file size of line is ~500 - like:

Well, you are comparing plain CSV with a serialized collection of Row objects, it can't be smaller or even similar in size.

Probably different serialization method should be used?

I don't think this have any relevant impact, sort cache is anyway purged immediately.
But also, improving serialization is out of the scope of this PR.

@norberttech norberttech marked this pull request as ready for review July 28, 2024 12:19
@norberttech
Copy link
Member Author

Ok, I pushed what I believe can become a final version, some stats:

Before:

{"processed_rows":9500,"usage":"182Mb","peak":"199Mb"}
{"processed_rows":9600,"usage":"182Mb","peak":"199Mb"}
{"processed_rows":9700,"usage":"182Mb","peak":"199Mb"}
{"processed_rows":9800,"usage":"182Mb","peak":"199Mb"}
{"processed_rows":9900,"usage":"182Mb","peak":"199Mb"}
{"processed_rows":10000,"usage":"182Mb","peak":"199Mb"}
Time: 1.3926739692688s

after:

{"processed_rows":9500,"usage":"12Mb","peak":"12Mb"}
{"processed_rows":9600,"usage":"12Mb","peak":"12Mb"}
{"processed_rows":9700,"usage":"12Mb","peak":"12Mb"}
{"processed_rows":9800,"usage":"12Mb","peak":"12Mb"}
{"processed_rows":9900,"usage":"12Mb","peak":"12Mb"}
{"processed_rows":10000,"usage":"12Mb","peak":"12Mb"}
Time: 0.86830592155457s

So it's not only consuming less memory but it's also significantly faster now.

@norberttech norberttech force-pushed the bug/external-sort-memory-usage branch from b2d5472 to 6add818 Compare July 29, 2024 10:24
@github-actions
Copy link
Contributor

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+-----------------+------------------+----------------+
| benchmark             | subject           | revs | its | mem_peak        | mode             | rstdev         |
+-----------------------+-------------------+------+-----+-----------------+------------------+----------------+
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 3.922mb +0.38%  | 517.929ms +1.14% | ±0.77% +4.64%  |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 3.954mb +0.38%  | 1.079s +0.99%    | ±0.62% -25.65% |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 28.515mb +0.05% | 439.460ms +3.19% | ±0.28% -3.24%  |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 3.681mb +0.41%  | 34.545ms -0.71%  | ±0.64% -61.80% |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 3.628mb +0.41%  | 445.429ms +0.28% | ±0.94% +47.84% |
+-----------------------+-------------------+------+-----+-----------------+------------------+----------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+---------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev        |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+---------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 115.976mb +0.05% | 59.443ms +0.11% | ±0.45% -3.32% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+---------------+
Loaders
+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode            | rstdev          |
+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 54.077mb +0.15%  | 86.377ms +3.15% | ±1.16% +31.08%  |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 106.509mb +0.02% | 54.186ms +6.49% | ±0.49% +102.89% |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 123.792mb +0.01% | 1.237s +0.74%   | ±0.76% +127.93% |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 16.931mb +0.13%  | 43.142ms -0.18% | ±0.32% +70.47%  |
+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
Building Blocks
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 59.706mb +0.01%  | 451.808ms +6.19% | ±2.89% +628.17% |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 14.245mb +0.04%  | 88.029ms +2.87%  | ±0.57% -83.17%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 116.524mb +0.00% | 509.572ms +2.41% | ±2.75% +82.50%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 60.002mb +0.01%  | 253.368ms +2.01% | ±0.94% +96.39%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.936mb +0.04%  | 54.598ms +2.32%  | ±3.14% +191.22% |
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 86.794mb +0.01%  | 3.597ms +5.40%   | ±2.53% +111.33% |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 102.392mb +0.01% | 189.250ms +0.69% | ±0.09% -93.88%  |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 85.112mb +0.01%  | 19.075ms +0.92%  | ±0.72% -55.51%  |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 88.034mb +0.01%  | 1.878ms +7.95%   | ±1.37% -60.86%  |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 88.034mb +0.01%  | 1.793ms +5.60%   | ±2.52% -17.94%  |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 85.146mb +0.01%  | 2.876ms +9.86%   | ±0.52% -57.77%  |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 85.675mb +0.01%  | 15.233ms -5.54%  | ±1.13% +49.62%  |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 85.675mb +0.01%  | 14.961ms -6.75%  | ±1.04% -14.74%  |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 83.579mb +0.01%  | 1.894μs +17.92%  | ±2.53% -12.50%  |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 83.579mb +0.01%  | 0.300μs 0.00%    | ±0.00% 0.00%    |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 92.929mb +0.01%  | 12.549ms +4.32%  | ±2.49% +171.00% |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 122.300mb +0.00% | 63.602ms +4.46%  | ±3.09% +168.58% |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 86.195mb +0.01%  | 1.600ms +35.31%  | ±0.64% +45.88%  |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 89.541mb +0.01%  | 64.764ms +3.34%  | ±0.73% -60.78%  |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 88.296mb +0.01%  | 4.043ms +5.81%   | ±2.36% +677.89% |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 83.723mb +0.01%  | 40.121ms +3.29%  | ±0.91% -22.39%  |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 83.723mb +0.01%  | 40.060ms +3.66%  | ±0.60% -43.95%  |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 83.723mb +0.01%  | 39.267ms +1.45%  | ±0.85% -62.50%  |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 86.021mb +0.01%  | 7.539ms +3.28%   | ±0.73% -13.81%  |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 83.579mb +0.01%  | 30.432ms +7.54%  | ±0.78% +43.36%  |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 83.579mb +0.01%  | 14.242μs +4.57%  | ±2.09% +66.24%  |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 83.579mb +0.01%  | 16.876μs +6.94%  | ±1.12% +4.99%   |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 102.393mb +0.01% | 191.713ms +0.09% | ±0.81% +2.06%   |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+

@norberttech norberttech merged commit 3f8b6fb into flow-php:1.x Jul 29, 2024
@domis86
Copy link

domis86 commented Jul 30, 2024

sorry for late, but i re-ran local checks (previously #1147 (comment) ) for external sort:

  • sorting of 10k rows (4.2 MB file): memory peak usage went down from ~150MB to ~7MB
  • sorting of 100k rows (42 MB file): memory peak usage went down from ~1500MB to ~11MB
  • sorting of 1000k rows (420 MB file): it went from previously being "error" to memory usage ~11 MB

summary: great job 👍 ✔️

@norberttech norberttech deleted the bug/external-sort-memory-usage branch December 5, 2024 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

External Sort - Cache Implementation - Memory Consumption PSRSimpleCache - Performance issue

2 participants