Add address-based index (attempt 4?) by marcinja · Pull Request #14053 · bitcoin/bitcoin

marcinja · 2018-08-24T18:51:45Z

Adds index to transactions by scriptPubKey. Based off of #2802. Stores hashes of scripts (hashed using Murmurhash3, with a hash seed that is stored in the index database), along with the COutPoint for the output which was spent/created, and the CDiskTxPos for the transaction in which this happened.
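
As a rough illustration of the hashing step described above, here is a minimal sketch of a seeded MurmurHash3 applied to a scriptPubKey. It assumes the 32-bit x86 variant; the thread does not say which variant or output width the PR actually uses. The names `murmur3_32` and `addr_id` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int) -> int:
    """MurmurHash3 (x86, 32-bit) of `data` under `seed`."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    body_len = n - (n % 4)

    # Body: mix the input four bytes at a time (little-endian words).
    for i in range(0, body_len, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = data[body_len:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def addr_id(script_pubkey: bytes, seed: int) -> int:
    """Hypothetical helper: index identifier for a script under the stored seed."""
    return murmur3_32(script_pubkey, seed)


# Usage: a P2PKH-shaped script with an all-zero key hash (placeholder data).
example_script = bytes.fromhex("76a914" + "00" * 20 + "88ac")
example_id = addr_id(example_script, seed=12345)
```

Because the seed is part of the hash input, two databases with different seeds assign different identifiers to the same script.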

DrahtBot · 2018-08-24T19:23:51Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#20404 (Remove names from translatable strings by hebasto)
#20012 (rpc: Remove duplicate name and argNames from CRPCCommand by MarcoFalke)
#19873 (Flush dbcache early if system is under memory pressure by luke-jr)
#19806 (validation: UTXO snapshot activation by jamesob)
#19521 (Coinstats Index (without UTXO set hash) by fjahr)
#15946 (Allow maintaining the blockfilterindex when using prune by jonasschnelli)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

ryanofsky

Reviewed most of the code, but just skimmed tests. It looks to me like this PR could be merged basically in its current form, so I'm curious if you're intending to make the improvements cited in the PR description above here or in separate PRs.

I left a few minor suggestions about the code, which you should feel free to ignore. The only changes I would definitely like to see here are:

adding some python code in test/functional/ to call the new RPC method
adding a blurb in doc/release-notes.md to describe the feature and maybe mention some use-cases

ryanofsky · 2018-08-29T16:53:38Z

+std::unique_ptr<AddrIndex> g_addrindex;
+
+/**
+ * Access to the addrindex database (indexes/addrindex/)


In commit "Introduce address index" (3c7cc3c)

Note: new index/addrindex.cpp, index/addrindex.h, and test/addrindex_tests.cpp files in this commit mirror existing index/txindex.cpp and index/txindex.h, test/txindex_tests.cpp files and have some code and comments in common. It can help to diff the addr files against the tx files when reviewing this PR.

luke-jr · 2018-08-30T12:40:10Z

+            "\nArguments:\n"
+            "1. \"address\"    (string, required) The address to search for\n"
+            "2. \"verbose\"    (bool, optional, default = false) If set to false, only returns data for hex-encoded `txid`s. \n"
+            "3. \"skip\"       (numeric, optional, default = 0) If set, the result skips this number of initial values. \n"


skip and count probably make sense on an options object instead.
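
One way to read this suggestion is sketched below: the paging parameters travel in a single options object rather than as separate positional arguments. The payload shape, the function name `build_search_request`, and the exact option keys are assumptions for illustration only; the thread names `skip` and `count` but does not define the final interface.

```python
import json

ALLOWED_OPTIONS = {"skip", "count"}  # option keys assumed from the review comment


def build_search_request(address, verbose=False, options=None, request_id=1):
    """Build a JSON-RPC payload with paging parameters grouped in one object."""
    options = dict(options or {})
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError("unknown option(s): " + ", ".join(sorted(unknown)))
    return json.dumps({
        "jsonrpc": "1.0",
        "id": request_id,
        "method": "searchrawtransactions",
        "params": [address, verbose, options],
    })


# Usage: skip the first 10 results and return at most 5.
request_body = build_search_request("<address>", True, {"skip": 10, "count": 5})
```

Grouping optional parameters this way lets new options be added later without changing the positional signature.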

jimpo

Nice work! I'm glad someone's working on this. Concept ACK.

1. The AddrIndex should return information about the outpoints and differentiate between outputs and spends, not just return the raw transactions. In fact, the AddrIndex could just return outpoints, then the client code could use the TxIndex to fetch the tx bodies. It'd involve a separate lookup though.
2. I don't think it's necessary to delete keys from the database when a block is disconnected. There's no harm in leaving it. The higher level methods to search the index can then filter for results that are on the main chain if that's what the client wants. It'd have to do this anyway to avoid races with reorgs and such.
3. I'm worried about collisions on address IDs because they are only 64 bits. I can think of three options: 1) use a 32 byte cryptographic hash, 2) use a 20 byte cryptographic hash of the script plus some randomly generated, database-global salt, 3) use a 32- or 64-bit non-cryptographic hash (might as well use Murmur3 or SipHash, not truncated SHA256), then store the full script as the database value to double check against. Option 3 feels best to me.
4. What's the purpose of having the first byte of the block hash as the value? It doesn't seem robust nor particularly useful.
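
The third collision-handling option above (a short non-cryptographic hash as the key, with the full script stored in the value for a double check) can be sketched as follows. This is an in-memory stand-in, not the PR's database code; `ScriptIndex` and `weak_hash` are hypothetical names, and the hash is deliberately weak so that collisions are easy to produce.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Outpoint = Tuple[bytes, int]  # (txid, output index)


class ScriptIndex:
    """Index keyed by a short hash; the stored script resolves collisions."""

    def __init__(self, short_hash: Callable[[bytes], int]):
        self._short_hash = short_hash
        self._rows: Dict[int, List[Tuple[bytes, Outpoint]]] = defaultdict(list)

    def add(self, script: bytes, outpoint: Outpoint) -> None:
        # The value keeps the full script next to the outpoint.
        self._rows[self._short_hash(script)].append((script, outpoint))

    def find(self, script: bytes) -> List[Outpoint]:
        # Look up by short hash, then discard rows whose script differs.
        bucket = self._rows.get(self._short_hash(script), [])
        return [outpoint for stored, outpoint in bucket if stored == script]


def weak_hash(script: bytes) -> int:
    """Deliberately collision-prone hash, for demonstration only."""
    return sum(script) % 4


# Usage: b"\x01" and b"\x05" collide under weak_hash but stay separate on lookup.
index = ScriptIndex(weak_hash)
index.add(b"\x01", (b"\xaa" * 32, 0))
index.add(b"\x05", (b"\xbb" * 32, 1))
matches = index.find(b"\x01")
```

The cost of this design is the extra bytes per row for the stored script; the benefit is that a short key cannot silently return another script's outpoints.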

marcinja · 2018-08-31T15:19:08Z

Thanks for all the reviews.

To answer some of @jimpo's questions:
2 & 4. I included part of the block hash so that in BlockDisconnected we are sure to remove the entries in the index from this block only (that's where filter_by_value is used). The reason I chose to remove entries from the database is to prevent reading into a block file using an old CDiskTxPos that may no longer be a valid position. Otherwise in FindTxsByScript you could run into errors. You're right that this problem would be better handled by higher level methods.

I think that returning just the outpoint is a better idea than the current choice so I'll switch to that and try to incorporate all the other feedback here.
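
The alternative raised in this exchange, keeping entries from disconnected blocks and filtering at query time, might look roughly like the sketch below. It assumes each entry records the hash of the block it came from and that the caller supplies the set of block hashes on the active chain; `ReorgTolerantIndex` and its methods are hypothetical names.

```python
from typing import Iterable, List, Set, Tuple

Outpoint = Tuple[bytes, int]  # (txid, output index)


class ReorgTolerantIndex:
    """Append-only index; stale entries are filtered out when reading."""

    def __init__(self) -> None:
        self._entries: List[Tuple[bytes, Outpoint, bytes]] = []

    def block_connected(self, block_hash: bytes,
                        outputs: Iterable[Tuple[bytes, Outpoint]]) -> None:
        # Record every (script, outpoint) together with its block hash.
        for script, outpoint in outputs:
            self._entries.append((script, outpoint, block_hash))

    def find(self, script: bytes, active_chain: Set[bytes]) -> List[Outpoint]:
        # Nothing is deleted on disconnect; entries from blocks that are
        # not on the active chain are simply skipped here.
        return [outpoint
                for stored, outpoint, block_hash in self._entries
                if stored == script and block_hash in active_chain]


# Usage: block B was reorganized out, so only block A's entry is returned.
idx = ReorgTolerantIndex()
idx.block_connected(b"A", [(b"script", (b"\x01" * 32, 0))])
idx.block_connected(b"B", [(b"script", (b"\x02" * 32, 0))])
on_main_chain = idx.find(b"script", active_chain={b"A"})
```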

Adds index that relates scriptPubKeys to location of transactions in the filesystem, along with the hash of the block they are found in, and the outpoint information of the txout with the related script.

Setup address index in initialization process. Add initialization warning and wallet feature request warning as suggested by ryanofsky.

Adds searchrawtransactions RPC that uses the address index to lookup transactions and outpoints by script and address. Adds basic functional tests for searchrawtransactions.

c78867886 · 2020-10-13T07:01:53Z

Can someone plz approve this?

decryp2kanon · 2020-11-08T01:48:30Z

Concept ACK

Talkless · 2020-12-13T12:47:55Z

@marcinja will this allow other wallets like Electrum to utilize this feature (RPC credentials provided, of course) and avoid having to run ElectrumX (https://github.com/spesmilo/electrumx), EPS (https://github.com/chris-belcher/electrum-personal-server) or BWT (https://github.com/shesek/bwt) intermediary software?

@ecdsa @SomberNight @chris-belcher @shesek could you provide your input on how this feature might be useful (or not)?

romanz · 2020-12-13T14:30:13Z

The main issue AFAIU is that Electrum is using SHA256(scriptPubKey) while this PR is using MurmurHash3(scriptPubKey).
Also, ElectrumX & electrs are using RocksDB for the index storage - resulting in better performance and disk usage (compared to LevelDB).

SomberNight · 2020-12-13T15:04:58Z

@Talkless note that while some Electrum users run their own bitcoind, many do not. Electrum wants to support both use cases, and in fact the suspicion is that most users just use a public server. When using a public server, the client cannot use bitcoind RPC, hence in that case I don't see how the middleware (e.g. ElectrumX) could be avoided.

For the own bitcoind use case, maybe the client could have another optional mode of operation where it uses bitcoind RPC directly, which is I guess what you are asking about. For that, an address-index in bitcoind is the main thing missing indeed, however not the only one. For one, the electrum protocol (the client<->"middleware server" connection) has address subscriptions - the client gets a notification when the history of one of its addresses changes. We are also planning on soon adding another method into the protocol that allows txoutpoint->spender_txid lookups (and notifications); I guess that could be implemented using the index in this PR albeit in a very inefficient way for heavily reused addresses.

IMHO there are multiple upsides for having this middleware setup for Electrum:

for the project, it keeps the codebase simpler (again, we want to support users without own bitcoind)
for the project, it also allows for more flexibility for implementing new functionality: just consider present PR here, we have needed such an index for 9 years but bitcoind did not have it or want it, so we could just implement it ourselves
for the server operator, even if they don't want to open up the server for the public, they could share it with their friends and family. I don't think that's feasible with bitcoind RPC. I think this is a common use case.

Nevertheless if someone steps up and contributes patches, this kind of thing could be added.

The total size of the address index right now is 223GB

That sounds much larger than expected.
Even with the txoutpoint->spender_txid map I mentioned above, when using LevelDB, ElectrumX uses around 90 GiB of disk space.

The DB keys are structured as follows: <addr_id, key_type, outpoint>
The DB values are simply: <CDiskTxPos, CScript>

Instead of having both addr_id and CScript, why not just put a long hash, e.g. sha256(CScript) into the key?
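
To make the two layouts concrete, the sketch below packs a key in the quoted `<addr_id, key_type, outpoint>` form and in the suggested form with `sha256(CScript)` in place of `addr_id`. The field widths are assumptions: an 8-byte `addr_id` (the review above mentions 64-bit IDs), a 1-byte `key_type`, and a 36-byte outpoint (32-byte txid plus 4-byte output index). The `key_type` values and function names are hypothetical.

```python
import hashlib
import struct

KEY_CREATED = b"c"  # hypothetical key_type marker for a created output
KEY_SPENT = b"s"    # hypothetical key_type marker for a spent output


def outpoint_bytes(txid: bytes, n: int) -> bytes:
    """Serialize an outpoint as a 32-byte txid plus a 4-byte output index."""
    if len(txid) != 32:
        raise ValueError("txid must be 32 bytes")
    return txid + struct.pack("<I", n)


def key_with_addr_id(addr_id: int, key_type: bytes, txid: bytes, n: int) -> bytes:
    """Quoted layout: the short addr_id leads the key; the script sits in the value."""
    return struct.pack(">Q", addr_id) + key_type + outpoint_bytes(txid, n)


def key_with_sha256(script: bytes, key_type: bytes, txid: bytes, n: int) -> bytes:
    """Suggested layout: a full-length hash leads the key; no script in the value."""
    return hashlib.sha256(script).digest() + key_type + outpoint_bytes(txid, n)


# Usage: compare key sizes for one P2PKH-shaped script (placeholder data).
script = bytes.fromhex("76a914" + "00" * 20 + "88ac")
short_key = key_with_addr_id(0x0123456789ABCDEF, KEY_CREATED, b"\x11" * 32, 0)
long_key = key_with_sha256(script, KEY_CREATED, b"\x11" * 32, 0)
```

Under these assumed widths the key grows from 45 to 69 bytes per entry, while the value no longer needs to carry the script (25 bytes for a P2PKH output), so the net difference depends on the script type.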

Also, I expect most users of this index would also want txindex enabled. You might want to consider making address index dependent on txindex. Have you investigated how much space that would save? There would be no need to store CDiskTxPos.

Another trick that ElectrumX uses is that only the best chain is indexed. We have a tx_num->txid map as raw files on disk. A tx_num is the index of the transaction in the linear history of the blockchain, a 5-byte integer (so e.g. the genesis block coinbase tx has tx_num=0). This map uses around 17 GiB atm. With this, you can encode the txid part of the outpoint as 5 bytes instead of 32 bytes.
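
A minimal sketch of the tx_num idea described above, with an in-memory byte buffer standing in for the raw files on disk. The class and function names are hypothetical, and the little-endian encoding is an assumption; the comment only states that a tx_num is a 5-byte integer.

```python
TXID_LEN = 32
TX_NUM_LEN = 5


class TxNumMap:
    """Flat tx_num -> txid map: entry i lives at byte offset i * 32."""

    def __init__(self) -> None:
        self._txids = bytearray()  # stand-in for the raw file on disk

    def append(self, txid: bytes) -> int:
        """Store the next transaction of the best chain and return its tx_num."""
        if len(txid) != TXID_LEN:
            raise ValueError("txid must be 32 bytes")
        tx_num = len(self._txids) // TXID_LEN
        self._txids += txid
        return tx_num

    def txid(self, tx_num: int) -> bytes:
        """Recover the full txid from its position in the linear history."""
        if not 0 <= tx_num < len(self._txids) // TXID_LEN:
            raise IndexError("unknown tx_num")
        start = tx_num * TXID_LEN
        return bytes(self._txids[start:start + TXID_LEN])


def encode_tx_num(tx_num: int) -> bytes:
    """Pack a tx_num into 5 bytes (byte order is an assumption)."""
    return tx_num.to_bytes(TX_NUM_LEN, "little")


def decode_tx_num(raw: bytes) -> int:
    return int.from_bytes(raw, "little")


# Usage: the first appended transaction gets tx_num 0, the next gets 1.
tx_map = TxNumMap()
first = tx_map.append(b"\x11" * 32)
second = tx_map.append(b"\x22" * 32)
compact = encode_tx_num(second)  # 5 bytes stored instead of the 32-byte txid
```

If the map holds exactly 32 bytes per transaction, 17 GiB corresponds to 17 × 2^30 / 32 ≈ 570 million entries, and each index row that references a txid saves 27 bytes.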

maflcko · 2020-12-17T19:09:02Z

Concept ACK (might have already done that)

sipa · 2020-12-17T19:25:09Z

I'm concept -0 on this.

My primary objection is that I think it's a bad idea for any infrastructure to be built that relies on having fully-indexed blockchain data available (this also applies to txindex, but we can't just remove support...).

However, it seems many people want something like this, and are going to use it anyway. The question is then whether it belongs in the bitcoin-core codebase. Alternatives, presumably more performant ones like electrs, already exist too, so it isn't exactly impossible to do this elsewhere.

Still, given that we now have the indexes infrastructure, it means that things like this are easy to add in a fairly modular way without invading consensus code. So if people really want this, fine.

Overall approach comment: I don't think MurmurHash should be used for anything new; there are strictly better hash functions available. I'd suggest SipHash if that's fast enough.

jamesob · 2020-12-17T19:42:23Z

I'm also a little more negative on having this in Core than I previously was. After working in a few industrial contexts on wallet stuff, it's clear to me that an address index is really only required if you want to implement a block explorer or do chain analysis. For both of these applications, using something like electrs seems sufficient.

For personal wallet management, a full address index is not required. I think the origin of some confusion is that things like the Electrum Personal Server have become synonymous with this kind of usage, but in reality a full index is overkill when descriptor-specific rescans can be done once for a historical backfill and then per-block scanning can be done from there on out.
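
The pattern described above, one historical backfill for a known set of scripts followed by per-block scanning, can be sketched as follows. This is a toy model: a block is a list of `(txid, [output scripts])` pairs, only created outputs are tracked (spends are left out), and `ScriptWatcher` is a hypothetical name rather than any Bitcoin Core interface.

```python
from typing import Iterable, List, Sequence, Tuple

Block = Sequence[Tuple[bytes, Sequence[bytes]]]  # [(txid, [output scripts])]


class ScriptWatcher:
    """Tracks outputs paying to a fixed set of scripts, without a global index."""

    def __init__(self, scripts: Iterable[bytes]) -> None:
        self.scripts = set(scripts)
        self.history: List[Tuple[int, bytes, int]] = []  # (height, txid, vout)
        self.next_height = 0

    def scan_block(self, height: int, block: Block) -> None:
        """Per-block scanning: check each output against the watched scripts."""
        for txid, outputs in block:
            for n, script in enumerate(outputs):
                if script in self.scripts:
                    self.history.append((height, txid, n))
        self.next_height = height + 1

    def backfill(self, chain: Sequence[Block]) -> None:
        """One-time historical rescan from the first unscanned height."""
        for height in range(self.next_height, len(chain)):
            self.scan_block(height, chain[height])


# Usage: backfill two historical blocks, then process a new block as it arrives.
chain = [
    [(b"t0", [b"other"])],
    [(b"t1", [b"mine", b"other"])],
]
watcher = ScriptWatcher([b"mine"])
watcher.backfill(chain)
watcher.scan_block(2, [(b"t2", [b"other", b"mine"])])
```

The work is proportional to the chain length once, and to one block per new tip afterwards, with storage proportional to the wallet's own history.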

I want to point out that this is a nice implementation and good work by @marcinja, but I'm leaning slightly against the inclusion of such an index in Core at this point.

jonasschnelli · 2020-12-17T19:48:52Z

My primary objection is that I think it's a bad idea for any infrastructure to be built that relies on having fully-indexed blockchain data available.

I agree on this.

IMO the only use cases to ever use a full address index are:

1. Instant seed/xpriv backup recovery including spent history
2. Backend service for thousands of wallets
3. Debug/explore purposes

1 (instant backup recovery) could be solved with either scantxoutset (takes a minute or two) or by an address-index for the utxo set only. But neither would restore the spent history.
A scalable non-enterprise solution to restore the spent history is using blockfilters. Scan through the filters and rescan only the relevant blocks (a matter of minutes), see #20664.

2 (a backend for thousands of wallets): out of scope for this project.

3 (explore purposes): I think this is a valid use case. However, adding this PR to Bitcoin Core will lead to many projects using it in production, increasing the traffic in this project and eventually stealing time from existing contributors (rebase, maintenance, drag-along).
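
The filter-based recovery mentioned for use case 1 can be illustrated with the toy sketch below: scan one small filter per block, then rescan only the blocks whose filter matches a wallet script. Exact sets stand in for the filters here; real BIP 158 filters are compact probabilistic structures that can produce false positives and also cover the scripts being spent, which this sketch ignores. All function names are hypothetical.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple

Block = Sequence[Tuple[bytes, Sequence[bytes]]]  # [(txid, [output scripts])]


def build_filter(block: Block) -> FrozenSet[bytes]:
    """Stand-in for a per-block filter: the set of output scripts in the block."""
    return frozenset(script for _txid, outputs in block for script in outputs)


def relevant_heights(filters: Sequence[FrozenSet[bytes]],
                     wallet_scripts: Set[bytes]) -> List[int]:
    """Step 1: scan the filters and keep heights that may involve the wallet."""
    return [height for height, block_filter in enumerate(filters)
            if not block_filter.isdisjoint(wallet_scripts)]


def rescan(chain: Sequence[Block], filters: Sequence[FrozenSet[bytes]],
           wallet_scripts: Set[bytes]) -> List[Tuple[int, bytes, int]]:
    """Step 2: read only the matching blocks and collect the wallet's outputs."""
    hits = []
    for height in relevant_heights(filters, wallet_scripts):
        for txid, outputs in chain[height]:
            for n, script in enumerate(outputs):
                if script in wallet_scripts:
                    hits.append((height, txid, n))
    return hits


# Usage: only block 1 matches, so only that block is read in full.
chain = [
    [(b"t0", [b"other"])],
    [(b"t1", [b"other", b"mine"])],
    [(b"t2", [b"else"])],
]
filters = [build_filter(block) for block in chain]
found = rescan(chain, filters, {b"mine"})
```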

My main fear is that people are going to use this index (a full address index) as an electrum(ish) backend for a handful of wallets.

With multiwallet, watch-only-wallets, PSBT, we have all the tools to serve multiple wallets in a scalable way for external applications.

I also think merging this as it is, would be in contradiction to the process- and repository-separation effort.

Therefore I'm ~0 (slightly towards NACK) to add this.
If this would be in another repository (still under bitcoin/*) and process separated, I would ACK it.

marcinja · 2021-01-04T16:21:10Z

Hi all, thanks for the feedback and review. This was an enjoyable PR to work on and I learned a lot from all your comments.

I'm closing this PR because its size probably requires stronger support from contributors to get in. It also seems more clear now that all of the practical use-cases are covered by existing features and some lightweight alternatives (#20664).

I also agree that it would be bad to incentivize using an address index to support an external electrum wallet, when it's not the intended use-case and would cause unnecessary burden on contributors and maintainers in this project, e.g. from users of those wallets wanting new features or updates.

jonatack · 2021-01-04T16:53:22Z

@marcinja thank you and I hope to see more contributions of this quality from you.

marcinja mentioned this pull request Aug 24, 2018

Utxoscriptindex #14035

Closed

This was referenced Aug 24, 2018

util: Replace boost::signals2 with std::function #13961

Merged

Refactoring CRPCCommand with enum category #13945

Closed

refactor: Removal of circular dependency between index/txindex, validation and index/base #13942

Closed

laanwj added the UTXO Db and Indexes label Aug 25, 2018

DrahtBot added the Needs rebase label Aug 25, 2018

marcinja force-pushed the add-address-index branch 2 times, most recently from 2157893 to 196ff3d Compare August 27, 2018 15:26

DrahtBot removed the Needs rebase label Aug 27, 2018

marcinja force-pushed the add-address-index branch 2 times, most recently from 54a8e72 to 4865275 Compare August 28, 2018 20:57

ryanofsky reviewed Aug 29, 2018

luke-jr reviewed Aug 30, 2018

Comment thread src/init.cpp Outdated

luke-jr suggested changes Aug 30, 2018

DrahtBot mentioned this pull request Aug 30, 2018

index: Create IndexRunner class for activing indexes. #14111

Closed

marcinja commented Aug 30, 2018

Comment thread src/index/addrindex.cpp Outdated

jimpo suggested changes Aug 30, 2018

DrahtBot mentioned this pull request Aug 31, 2018

Index for BIP 157 block filters #14121

Merged