Skip to content

Fix incorrect output from prints originating from different processes#604

Merged
bsipocz merged 4 commits into executablebooks:master from
basnijholt:coalesce_streams
Sep 19, 2024
Merged

Fix incorrect output from prints originating from different processes#604
bsipocz merged 4 commits into executablebooks:master from
basnijholt:coalesce_streams

Conversation

@basnijholt
Copy link
Copy Markdown
Contributor

@basnijholt basnijholt commented May 31, 2024

In the PipeFunc documentation I have the following problem when executing code with a ProcessPoolExecutor:

image

With this fix it becomes:
image

The root of the issue is that nbconvert --execute produces this output:

  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "92",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-31T05:42:39.297713Z",
     "iopub.status.busy": "2024-05-31T05:42:39.297474Z",
     "iopub.status.idle": "2024-05-31T05:42:40.477462Z",
     "shell.execute_reply": "2024-05-31T05:42:40.475729Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.410279 - Running double_it for x=3"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.408318 - Running double_it for x=0"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.410888 - Running double_it for x=1"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.416024 - Running double_it for x=2"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.431485 - Running half_it for x=0"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.434285 - Running half_it for x=1"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.433559 - Running half_it for x=2"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.439223 - Running half_it for x=3"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:40.459668 - Running take_sum"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "14\n"
     ]
    }
   ],
   "source": [
    "from concurrent.futures import ProcessPoolExecutor\n",
    "import datetime\n",
    "import numpy as np\n",
    "import time\n",
    "from pipefunc import Pipeline, pipefunc\n",
    "\n",
    "\n",
    "@pipefunc(output_name=\"double\", mapspec=\"x[i] -> double[i]\")\n",
    "def double_it(x: int) -> int:\n",
    "    print(f\"{datetime.datetime.now()} - Running double_it for x={x}\")\n",
    "    time.sleep(1)\n",
    "    return 2 * x\n",
    "\n",
    "\n",
    "@pipefunc(output_name=\"half\", mapspec=\"x[i] -> half[i]\")\n",
    "def half_it(x: int) -> int:\n",
    "    print(f\"{datetime.datetime.now()} - Running half_it for x={x}\")\n",
    "    time.sleep(1)\n",
    "    return x // 2\n",
    "\n",
    "\n",
    "@pipefunc(output_name=\"sum\")\n",
    "def take_sum(half: np.ndarray, double: np.ndarray) -> int:\n",
    "    print(f\"{datetime.datetime.now()} - Running take_sum\")\n",
    "    return sum(half + double)\n",
    "\n",
    "\n",
    "pipeline = Pipeline([double_it, half_it, take_sum])\n",
    "inputs = {\"x\": [0, 1, 2, 3]}\n",
    "run_folder = \"my_run_folder\"\n",
    "executor = ProcessPoolExecutor(max_workers=8)  # use 8 processes\n",
    "results = pipeline.map(\n",
    "    inputs,\n",
    "    run_folder=run_folder,\n",
    "    parallel=True,\n",
    "    executor=executor,\n",
    "    storage=\"shared_memory_dict\",\n",
    ")\n",
    "print(results[\"sum\"].output)"
   ]
  },

In the PipeFunc documentation I got:
```
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "92",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-31T05:42:39.297713Z",
     "iopub.status.busy": "2024-05-31T05:42:39.297474Z",
     "iopub.status.idle": "2024-05-31T05:42:40.477462Z",
     "shell.execute_reply": "2024-05-31T05:42:40.475729Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.410279 - Running double_it for x=3"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.408318 - Running double_it for x=0"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.410888 - Running double_it for x=1"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.416024 - Running double_it for x=2"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.431485 - Running half_it for x=0"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.434285 - Running half_it for x=1"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.433559 - Running half_it for x=2"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:39.439223 - Running half_it for x=3"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-05-30 22:42:40.459668 - Running take_sum"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "14\n"
     ]
    }
   ],
   "source": [
    "from concurrent.futures import ProcessPoolExecutor\n",
    "import datetime\n",
    "import numpy as np\n",
    "import time\n",
    "from pipefunc import Pipeline, pipefunc\n",
    "\n",
    "\n",
    "@pipefunc(output_name=\"double\", mapspec=\"x[i] -> double[i]\")\n",
    "def double_it(x: int) -> int:\n",
    "    print(f\"{datetime.datetime.now()} - Running double_it for x={x}\")\n",
    "    time.sleep(1)\n",
    "    return 2 * x\n",
    "\n",
    "\n",
    "@pipefunc(output_name=\"half\", mapspec=\"x[i] -> half[i]\")\n",
    "def half_it(x: int) -> int:\n",
    "    print(f\"{datetime.datetime.now()} - Running half_it for x={x}\")\n",
    "    time.sleep(1)\n",
    "    return x // 2\n",
    "\n",
    "\n",
    "@pipefunc(output_name=\"sum\")\n",
    "def take_sum(half: np.ndarray, double: np.ndarray) -> int:\n",
    "    print(f\"{datetime.datetime.now()} - Running take_sum\")\n",
    "    return sum(half + double)\n",
    "\n",
    "\n",
    "pipeline = Pipeline([double_it, half_it, take_sum])\n",
    "inputs = {\"x\": [0, 1, 2, 3]}\n",
    "run_folder = \"my_run_folder\"\n",
    "executor = ProcessPoolExecutor(max_workers=8)  # use 8 processes\n",
    "results = pipeline.map(\n",
    "    inputs,\n",
    "    run_folder=run_folder,\n",
    "    parallel=True,\n",
    "    executor=executor,\n",
    "    storage=\"shared_memory_dict\",\n",
    ")\n",
    "print(results[\"sum\"].output)"
   ]
  },
```
@welcome
Copy link
Copy Markdown

welcome bot commented May 31, 2024

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out EBP's Code of Conduct and our Contributing Guide, as this will greatly help the review process.

Welcome to the EBP community! 🎉

@basnijholt
Copy link
Copy Markdown
Contributor Author

@agoose77 the failing CI check is unrelated to these changes

@basnijholt
Copy link
Copy Markdown
Contributor Author

@agoose77, friendly ping. Could you take a look at this?

@bsipocz
Copy link
Copy Markdown
Member

bsipocz commented Sep 19, 2024

I have a very similar problem in one of my repos, but this fix doesn't seem to work, I still get the fragmented output in the rendered HTML.

@basnijholt
Copy link
Copy Markdown
Contributor Author

@bsipocz, did you set nb_merge_streams = True?

@bsipocz
Copy link
Copy Markdown
Member

bsipocz commented Sep 19, 2024

@bsipocz, did you set nb_merge_streams = True?

Yeap, I didn't have that, but discovered the option by following the link to your pipefunc/pipefunc#125 PR. So thank you.

So now my issue is fixed even without this PR.

While this PR does pass all tests, and doesn't seem to break anything, I suppose it would be nice to add your failing case to the tests, too.

streams[output["name"]]["text"] += f"{out}\n"
else:
new_outputs.append(output)
output["text"] = output["text"].strip() + "\n"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this line necessary?

Copy link
Copy Markdown
Contributor Author

@basnijholt basnijholt Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, perhaps it would be clearer to move it one line up. We're mutating the dict we just added to new_outputs.

edit: done

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I was just nitpicking that you strip off whitespace and add a newline; practically the same, so is this really needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is in case there are multiple newlines at the end which can happen after merging the cells.

See the example in my first post.

I am hard at work on trying to write a test. Turns out that on MacOS the issue does not exist ... took me a good while to realize that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bsipocz, just added a test!

See my comment here #604 (comment)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@basnijholt @bsipocz just a hint, you should be using rstrip not strip, because now you have removed any possible indentation at the start of the streams 🤷

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'll open a follow-up to fix that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I just opened #640

bsipocz
bsipocz previously approved these changes Sep 19, 2024
Copy link
Copy Markdown
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to break anything, and thus I would go ahead and merge it, however it would be nicer to have a test covering it, too.

@basnijholt
Copy link
Copy Markdown
Contributor Author

I added a test. Without the changes here the test will fail:

________________________________________________ test_merge_streams_parallel ________________________________________________

sphinx_run = <conftest.SphinxFixture object at 0x7e810491f9b0>
file_regression = <conftest.FileRegression object at 0x7e80b65fe3f0>

    @pytest.mark.sphinx_params(
        "merge_streams_parallel.ipynb",
        conf={"nb_execution_mode": "off", "nb_merge_streams": True},
    )
    def test_merge_streams_parallel(sphinx_run, file_regression):
        """Test configuring multiple concurrent stdout/stderr outputs to be merged."""
        sphinx_run.build()
        assert sphinx_run.warnings() == ""
        doctree = sphinx_run.get_resolved_doctree("merge_streams_parallel")
>       file_regression.check(doctree.pformat(), extension=".xml", encoding="utf-8")

/home/bas.nijholt/repos/MyST-NB/tests/test_render_outputs.py:116:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <conftest.FileRegression object at 0x7e80b65fe3f0>
data = '<document source="merge_streams_parallel" translation_progress="{\'total\': 0, \'translated\': 0}">\n    <container c...        \n                \n                \n                \n                \n                \n                \n'
kwargs = {'encoding': 'utf-8', 'extension': '.xml'}

    def check(self, data, **kwargs):
>       return self.file_regression.check(self._strip_ignores(data), **kwargs)
E       AssertionError: FILES DIFFER:
E       /home/bas.nijholt/.tmp/pytest-of-bas.nijholt/pytest-49/test_merge_streams_parallel0/test_render_outputs/test_merge_streams_parallel.xml
E       /home/bas.nijholt/.tmp/pytest-of-bas.nijholt/pytest-49/test_merge_streams_parallel0/test_render_outputs/test_merge_streams_parallel.obtained.xml
E       HTML DIFF: /home/bas.nijholt/.tmp/pytest-of-bas.nijholt/pytest-49/test_merge_streams_parallel0/test_render_outputs/test_merge_streams_parallel.obtained.diff.html
E       ---
E       +++
E       @@ -9,13 +9,13 @@
E                                pass
E                <container classes="cell_output" nb_element="cell_code_output">
E                    <literal_block classes="output stream" language="myst-ansi" linenos="False" xml:space="preserve">
E       +                000000000
E                        0
E       -                0
E       -                0
E       -                0
E       -                0
E       -                0
E       -                0
E       -                0
E       -                0
E       -                0
E       +
E       +
E       +
E       +
E       +
E       +
E       +
E       +

The notebook is executed via jupyter nbconvert --execute merge_streams_parallel.ipynb --to ipynb and I committed the executed ipynb.

@bsipocz bsipocz added the bug Something isn't working label Sep 19, 2024
Copy link
Copy Markdown
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @basnijholt!

@bsipocz bsipocz merged commit 25fa81d into executablebooks:master Sep 19, 2024
basnijholt added a commit to pipefunc/pipefunc that referenced this pull request Sep 27, 2024
basnijholt added a commit to pipefunc/pipefunc that referenced this pull request Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants