Skip to content

Conversation

@merlimat
Copy link
Contributor

Motivation

In some deployment scenarios, applications might want to skip the journal, avoiding the double writing of data to disks.

One such examples is a deployment in a cloud environment with writes on locally attached disks, where absolute throughput is more important than strict data durability. In this case a journal is not providing a lot of value, considering that if the VM is terminated, the local disk is gone.

In such deployments, it's better to just rely on application level replication of data instead of the properties of the journal.

Changes

Added option to bypass the journal at the bookie level.

@merlimat merlimat added this to the 4.12.0 milestone Aug 26, 2020
@merlimat merlimat self-assigned this Aug 26, 2020
@karanmehta93
Copy link
Member

In such deployments, it's better to just rely on application level replication of data instead of the properties of the journal.

@merlimat Are we betting that the replication factor of data and low MTTR of bookies in public cloud would be good enough that we would never loose all copies of data? Or are you okay with all the data going away in case of catastrophe? What is the standard configuration that you use?

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is good and we already discussed about it with @jvrao and @sijie

I see a few limitations with this proposal:

  • we are going to violate fencing and LAC in case of bookie restart
  • we are not coordinating with checkpoint, lastMark mechanism....(the position in the journal does not advance?)
  • the client is not aware of what's going on

I would accept this patch if we make that the bookie is not able to restart, this way as soon as the bookie is up and running every guarantee is safe, when you reboot the bookie it is like you lost the bookie forever.

Otherwise I was thinking with my colleague @nicoloboschi about a more complex solution, with a new WriteFlag.
But we are still working on BP41 and we had not the chance to share our ideas with the community.

Drafting a BP would also be a good step as this will be a significant feature for BK

(We should also add tests that cover the change)

@ivankelly
Copy link
Contributor

@eolivelli we have another change that handles the situation on reboot. The problem here isn't so much about losing the replica of an entry. The problem is losing a replica which we have claimed to persist.

The solution we have is an integrity checker, which on checks on boot (after a non-clean shutdown), whether the ledgers on the bookie have been closed. If they have not, they are marked a in limbo, which means that they will not be able to respond "I don't have this entry" because that could cause data loss during ledger recovery operations. Equally, an in limbo ledger will never accept a write, as we don't know if that ledger has been fenced.

@eolivelli
Copy link
Contributor

@ivankelly so we are on the same page.

Can you share the second part of the story as a PR?
I would feel comfortable in committing both of the parts and ensure that we deliver them in the same release

@ivankelly
Copy link
Contributor

@eolivelli there are a bunch of dependent changes to get that change in, such as some work to make Bookie an interface, dependency injection, changes in cookie handling, etc. I do have a backlog item to start merging these, but time as always is my greatest foe.

I won't get around to looking at this until mid September at least.

IMO, there's no rush in getting the change in this PR in. I think we should go a bit further and completely remove the journal in the case that it is disabled, but again, time hasn't permitted.

@eolivelli
Copy link
Contributor

Yes I agree
I was also thinking that we should not even start/create the journal.

Let's keep this PR open as a reminder. I am leaving my 'request changes' vote

I minimal design document about this important change would be very awesome.

@eolivelli
Copy link
Contributor

@merlimat @ivankelly do you have plans to port all of this precious work to the community repository?

I am very interested in this work and I would like not to duplicate the work and especially diverge on the implementation.

From my side I would like to have a WriteFlag on the client side in order to control this behaviour, because in EmailSuccess.com we are sharing the BK cluster among lots of different components and only one (https://github.com/diennea/blobit) is very interested in this change, because it helps a lot in reducing the overall write amplification.

@karanmehta93
Copy link
Member

@eolivelli How would blobit use it to provide durability guarantees on ledger close (or blob close)? Does the feature ensure that all previous writes have been persisted too?

@joefk
Copy link

joefk commented Nov 11, 2020

It would be good to have journal writes controlled by the client on a per ledger basis. From Pulsar, then this can be controlled on a topic/application level (like all the qourum settings that can be set at the client), and would be useful in any environment, even where disks are not ephemeral.

@eolivelli
Copy link
Contributor

@joefk
I have a similar use case, and I would like to see this controlled by a WriteFlag (that is per-ledger and not per-cluster or per-bookie).
In my case the BK cluster (very small, like 3-5 bookies) is shared by several applications (like Database/HerdDB WAL) and for some of them I need strict durability and for other applications (like bulk data storage) I can accept partial data loss.

it looks like that @ivankelly and @merlimat are going to contribute this fix with a cluster wide configuration.

@eolivelli
Copy link
Contributor

@merlimat @ivankelly
is there any update on your roadmap for the backport of this feature ?

@dlg99 dlg99 modified the milestones: 4.13.0, 4.14.0 Feb 16, 2021
@merlimat
Copy link
Contributor Author

merlimat commented Mar 18, 2021

@eolivelli I think it's important to get this PR in, even though the integrity check is not ready. This is a valid mode of running that uses different tradeoffs compared to the default behavior. Nonetheless it's a very useful mode for several use cases.

This does not preclude a more fine graned approach later on. But in many cases, a cluster wide settings is more than appropriate.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merlimat can we add a unit test (no need to start a bookie)?

@merlimat
Copy link
Contributor Author

What do you mean exactly with "no need to start a bookie" ? And what kind of test?

@eolivelli
Copy link
Contributor

I mean that we can mock the components and just verify the sequence of calls internally to the bookie.
We already have lots of tests that start ZK + BK and they are very slow.

Btw add the test the way you prefer

@eolivelli
Copy link
Contributor

eolivelli commented Mar 18, 2021

The test is about verifying that we are not calling the Journal with that configuration option

@eolivelli
Copy link
Contributor

@merlimat if you want I can pickup this branch and add a test case if you do not have cycles

@hsaputra
Copy link
Contributor

hsaputra commented May 6, 2021

@rdhabalia , @ivankelly - any comments or concerns about this PR?

@dlg99 dlg99 modified the milestones: 4.14.0, 4.15.0 May 7, 2021
@merlimat
Copy link
Contributor Author

merlimat commented May 8, 2021

@eolivelli @hsaputra added test

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@merlimat merlimat modified the milestones: 4.15.0, 4.14.0 May 8, 2021
@merlimat merlimat merged commit 8091096 into apache:master May 8, 2021
@merlimat merlimat deleted the bypass-journal branch May 8, 2021 13:27
merlimat pushed a commit that referenced this pull request May 24, 2021
### Motivation
After add label for prometheus metric by #2650, it will cause prometheus metric format check failed when no label specified for a statsLogger. The metric list as follow.
```
replication_bookkeeper_client_bookkeeper_client_bookie_watcher_NEW_ENSEMBLE_TIME{success="false",quantile="0.9999", } NaN
```

### Modification
1. add label empty check for `PrometheusTextFormatUtil`
2. add label scope check test cover
3. add prometheus metric regex pattern check in test case

Reviewers: lipenghui <penghui@apache.org>, Andrey Yegorov <None>, Matteo Merli <mmerli@apache.org>, Jia Zhai <zhaijia@apache.org>, Addison Higham <None>, Enrico Olivelli <eolivelli@gmail.com>

This closes #2718 from hangc0276/chenhang/fix_bookeeper_metric_bug and squashes the following commits:

8590704 [hangc0276] format code
a6942d4 [chenhang] fix prometheus metric provider bug and add test to cover label scope and metric format check
bb8b1e0 [Andrey Yegorov] Include gradle files into the source artifact for releases, exclude site2/**
732b6cf [Andrey Yegorov] [maven-release-plugin] prepare for next development iteration
b0d9f10 [Andrey Yegorov] [maven-release-plugin] prepare branch branch-4.14
8dc108b [Matteo Merli] y
73e22ca [Don Inghram] ISSUE2620: RocksDB log path configurable
034ef85 [Shoothzj] Fix logger member not correct; (#2605)
b824a60 [hangc0276] fix always select the same region set bug for RegionAwareEnsemblePlacementPolicy (#2658)
683ad45 [Matteo Merli] Allow to attach labels to metrics (#2650)
8091096 [Matteo Merli] Allow to bypass journal for writes (#2401)
63867a9 [Matteo Merli] Impose a memory limit on the bookie journal (#2710)
87579b0 [Matteo Merli] Read entry error should print lastAddConfirmed in the log (#2707)
merlimat pushed a commit that referenced this pull request May 29, 2021
### Motivation
After add label for prometheus metric by #2650, it will cause prometheus metric format check failed when no label specified for a statsLogger. The metric list as follow.
```
replication_bookkeeper_client_bookkeeper_client_bookie_watcher_NEW_ENSEMBLE_TIME{success="false",quantile="0.9999", } NaN
```

### Modification
1. add label empty check for `PrometheusTextFormatUtil`
2. add label scope check test cover
3. add prometheus metric regex pattern check in test case

Reviewers: lipenghui <penghui@apache.org>, Andrey Yegorov <None>, Matteo Merli <mmerli@apache.org>, Jia Zhai <zhaijia@apache.org>, Addison Higham <None>, Enrico Olivelli <eolivelli@gmail.com>

This closes #2718 from hangc0276/chenhang/fix_bookeeper_metric_bug and squashes the following commits:

8590704 [hangc0276] format code
a6942d4 [chenhang] fix prometheus metric provider bug and add test to cover label scope and metric format check
bb8b1e0 [Andrey Yegorov] Include gradle files into the source artifact for releases, exclude site2/**
732b6cf [Andrey Yegorov] [maven-release-plugin] prepare for next development iteration
b0d9f10 [Andrey Yegorov] [maven-release-plugin] prepare branch branch-4.14
8dc108b [Matteo Merli] y
73e22ca [Don Inghram] ISSUE2620: RocksDB log path configurable
034ef85 [Shoothzj] Fix logger member not correct; (#2605)
b824a60 [hangc0276] fix always select the same region set bug for RegionAwareEnsemblePlacementPolicy (#2658)
683ad45 [Matteo Merli] Allow to attach labels to metrics (#2650)
8091096 [Matteo Merli] Allow to bypass journal for writes (#2401)
63867a9 [Matteo Merli] Impose a memory limit on the bookie journal (#2710)
87579b0 [Matteo Merli] Read entry error should print lastAddConfirmed in the log (#2707)
Sunny-Island pushed a commit to Sunny-Island/bookkeeper that referenced this pull request Jun 2, 2021
### Motivation
After add label for prometheus metric by apache#2650, it will cause prometheus metric format check failed when no label specified for a statsLogger. The metric list as follow.
```
replication_bookkeeper_client_bookkeeper_client_bookie_watcher_NEW_ENSEMBLE_TIME{success="false",quantile="0.9999", } NaN
```

### Modification
1. add label empty check for `PrometheusTextFormatUtil`
2. add label scope check test cover
3. add prometheus metric regex pattern check in test case

Reviewers: lipenghui <penghui@apache.org>, Andrey Yegorov <None>, Matteo Merli <mmerli@apache.org>, Jia Zhai <zhaijia@apache.org>, Addison Higham <None>, Enrico Olivelli <eolivelli@gmail.com>

This closes apache#2718 from hangc0276/chenhang/fix_bookeeper_metric_bug and squashes the following commits:

8590704 [hangc0276] format code
a6942d4 [chenhang] fix prometheus metric provider bug and add test to cover label scope and metric format check
bb8b1e0 [Andrey Yegorov] Include gradle files into the source artifact for releases, exclude site2/**
732b6cf [Andrey Yegorov] [maven-release-plugin] prepare for next development iteration
b0d9f10 [Andrey Yegorov] [maven-release-plugin] prepare branch branch-4.14
8dc108b [Matteo Merli] y
73e22ca [Don Inghram] ISSUE2620: RocksDB log path configurable
034ef85 [Shoothzj] Fix logger member not correct; (apache#2605)
b824a60 [hangc0276] fix always select the same region set bug for RegionAwareEnsemblePlacementPolicy (apache#2658)
683ad45 [Matteo Merli] Allow to attach labels to metrics (apache#2650)
8091096 [Matteo Merli] Allow to bypass journal for writes (apache#2401)
63867a9 [Matteo Merli] Impose a memory limit on the bookie journal (apache#2710)
87579b0 [Matteo Merli] Read entry error should print lastAddConfirmed in the log (apache#2707)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants