
Orion V2 initial implementation draft #48

@Lezek123

Description


The draft implementation I'm describing in this document can be found HERE.

This repository is based on the current squid-substrate-template and contains all the input schemas, refactored Atlas queries, drafts of the custom GraphQL resolvers etc., as well as some basic setup and reference code to illustrate how certain issues can be addressed.

The official Subsquid documentation, which is often referenced throughout this document, can be found here: https://docs.subsquid.io/

How to run the local setup

I'll be explaining each part of the setup in more depth later in this document, but here is a quick-start reference:

  1. Clone the Joystream repository (if not already done)
git clone https://github.com/Joystream/joystream.git
  2. Run the joystream-node service in the Joystream repository
# You can also specify the usual environment variables like `RUNTIME_PROFILE` etc.
export JOYSTREAM_NODE_TAG=$(./scripts/runtime-code-shasum.sh)
docker-compose up -d joystream-node
  3. Clone the Subsquid-Orion repository
cd ..
git clone https://github.com/Lezek123/subsquid-orion.git
  4. Run the archive (indexer)
cd subsquid-orion/archive
docker-compose up -d
  5. Build the processor
cd ..
npm install
make codegen
make build
  6. Run and migrate the processor database
make up
make migrate
  7. Run the processor
make process
  8. Run the GraphQL server
make serve

After performing those steps you should be able to go to http://localhost:4350/graphql and see something like this:
(screenshot: orion-graphql)

Currently the processor will produce some mock data on each block so you can also test some of the existing queries:
(screenshot: orion-query)

On-chain data indexing and processing

Squid Archive

The Squid Archive is a concept analogous to the Hydra Indexer: it uses the Joystream node's WebSocket RPC endpoint to fetch data about on-chain events and extrinsics and stores it in a relational database (PostgreSQL).

We can configure the archive via a docker-compose file located in archive/docker-compose.yml.

The current squid archive configuration uses a local Joystream docker node (ws://joystream-node:9944) running on the joystream_default network as a source.

SubstrateBatchProcessor

SubstrateBatchProcessor is the class we use to instantiate the events processor. As opposed to Hydra, where we would only implement the "mapping" functions (or "mappings"), Subsquid lets us instantiate and programmatically configure the processor ourselves (a manifest.yml file is no longer required), which gives us more control over its behavior.

SubstrateBatchProcessor is just one of the many processor implementations available in Subsquid, but it's the one currently recommended for processing substrate events and extrinsics. This specific processor implementation queries all blocks, along with the events of interest, from the Squid Archive (using the @subsquid/substrate-gateway service). The maximum number of blocks in a single batch currently depends on the @subsquid/substrate-gateway implementation. It's still a little unclear how this will work in the future, but currently there are two main components that affect the batch size:

Current processor implementation:

In the current draft implementation:

  • the processor is given a new TypeormDatabase({ isolationLevel: 'READ COMMITTED' }) instance (which is then used to insert data into the database). As you can see, we can easily provide config for TypeormDatabase here, which we take advantage of by specifying isolationLevel: 'READ COMMITTED'. This isolation level reduces the possibility of conflicts, since the database state will be modified both by the processor and through the external API (i.e. counting video views, featuring, reporting video/channel etc., as explained further down below);
  • I added some code which populates the database with a bunch of "mock" entities on each System.ExtrinsicSuccess event;
  • I specified Content.VideoCreated and Content.ChannelCreated among the events of interest for illustration purposes.

This implementation provides a decent general overview of how the "mappings" are written in Subsquid and how one can extract the events and data of interest from a batch and then perform bulk inserts/updates at the end of processing a batch, which considerably increases performance.
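The batch-mapping pattern described above can be sketched in plain TypeScript. Note this is an illustration only: the Event, Channel and Video shapes below are simplified stand-ins for the real Subsquid/TypeORM types, not the actual implementation.

```typescript
// Sketch of the batch-mapping pattern: extract entities of interest from a
// whole batch of blocks first, then persist everything in one bulk operation.
// All types here are simplified stand-ins used for illustration only.
type Event =
  | { name: 'Content.ChannelCreated'; channelId: string; blockNumber: number }
  | { name: 'Content.VideoCreated'; videoId: string; channelId: string; blockNumber: number }

interface Channel { id: string; createdInBlock: number }
interface Video { id: string; channelId: string; createdInBlock: number }

// Accumulate entities in memory while iterating over the batch...
function processBatch(events: Event[]): { channels: Channel[]; videos: Video[] } {
  const channels: Channel[] = []
  const videos: Video[] = []
  for (const e of events) {
    if (e.name === 'Content.ChannelCreated') {
      channels.push({ id: e.channelId, createdInBlock: e.blockNumber })
    } else {
      videos.push({ id: e.videoId, channelId: e.channelId, createdInBlock: e.blockNumber })
    }
  }
  // ...so that a single bulk insert per entity type can be issued at the end
  // (in the real mappings this would be something like a bulk store insert)
  return { channels, videos }
}
```

The key point is that no database round-trip happens per event; all writes are deferred to the end of the batch.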

The API

Input schema

The current input schema files can be found here: https://github.com/Lezek123/subsquid-orion/tree/main/schema
I tried to preserve a similar schema to the one we currently use in Hydra, however there are some notable differences:

  • I only used entities that are actually of interest for Atlas (so no proposals, working group, forum etc.);
  • In some entities I reduced the set of fields to only those that are currently of use for Atlas. For example, since Atlas currently doesn't support channel collaborators, I removed collaborators field from the Channel entity. This is mainly for simplification purposes and to reduce the initial scope of work, they can of course be added later if needed;
  • Interfaces are no longer supported in Subsquid; however, since unions can be used as an alternative, I refactored the events to use an EventData union instead. The result can be seen here. I will explain the differences that come from this change further down below;
  • Since deeply nested filters are now supported, as well as nested field queries, I removed some redundant entity relationships etc., which I assumed were mostly serving as a workaround for the lack of those features before;
  • I replaced fields like nftOwnerMember, isNftOwnerChannel, nftOwnerCuratorGroup in the NFT entity with an NftOwner union for better clarity;
  • Unified state: The input schema now also includes entities that were previously only existing in Orion, like ChannelReport etc. Some of those entities will probably be moved away from the input schema, unless we want to take advantage of the autogenerated queries that Subsquid will provide for them. If not - we can just use custom models instead (as described further below).
  • Entities like Video and Channel now include new followsNum and videoViewsNum counters, as Atlas relies on being able to execute queries that include sorting based on those values, as well as receiving those values as part of the result set. With that in mind, I decided that the cost of introducing those additional fields is relatively low compared to the efficiency and simplicity benefits; however, some custom aggregation queries (like sorting based on the number of follows within a given time period) will also be needed, as will be explained further below.
  • activeVideoCounter fields have been removed, as this data is now accessible through extended queries. See Custom type-graphql Resolvers section for more details;
  • createdAt and updatedAt fields are no longer automatically added, so in some cases I included them in the input schema (for Event entity however I decided to name the field timestamp instead);
  • Many-to-Many relationships are not supported in Subsquid, so they were refactored into two Many-to-One relationships through a dedicated "join entity".
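As an illustration of the join-entity refactoring (Playlist and PlaylistVideo are hypothetical entities used only as an example, not part of the actual schema), a Many-to-Many relation in the input schema becomes:

```graphql
# The join entity holds two Many-to-One (lookup) relations...
type PlaylistVideo @entity {
  id: ID!
  playlist: Playlist!
  video: Video!
}

# ...and each side exposes the relation as a derived list:
type Playlist @entity {
  id: ID!
  videos: [PlaylistVideo!] @derivedFrom(field: "playlist")
}

type Video @entity {
  id: ID!
  playlists: [PlaylistVideo!] @derivedFrom(field: "video")
}
```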

Custom models

Subsquid comes with a nice directory structure allowing us to define our own TypeORM models separately from the autogenerated ones; however, they will all become part of the same database.

Use cases:

The primary use case for defining those custom models is when we don't want Subsquid to autogenerate the public API endpoints for querying certain (private) data, but we still want to keep this data as part of the same database to take advantage of the relational model. Take the User entity for example. We want to be able to connect users with channels through the User>-ChannelFollow-<Channel relationship, but we don't necessarily want to expose any User data through the API. That's why we define custom models for User and ChannelFollow, but we don't include those entities in the input schema.
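To sketch the idea in plain TypeScript (these are illustrative shapes, not the actual TypeORM models; the User fields are assumptions): the private entities stay queryable internally, while the public API only ever sees derived values such as a channel's follower count.

```typescript
// Private entities: stored in the database, absent from the input schema,
// so no public queries are autogenerated for them (shapes are illustrative).
interface User { id: string }
interface ChannelFollow { id: string; userId: string; channelId: string }

// The public API surface only exposes derived data, e.g. a follow counter:
function followsNum(follows: ChannelFollow[], channelId: string): number {
  return follows.filter((f) => f.channelId === channelId).length
}
```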

Custom GraphQL api extensions

Subsquid allows us to add some custom extensions to the autogenerated GraphQL API. Those are stored in the src/server-extension/ directory and constitute a significant part of the project.

Custom type-graphql Resolvers

Custom type-graphql resolvers are classes where we can define our custom GraphQL queries, mutations and subscriptions that will then be included in the final API.

Normally we run a Subsquid GraphQL server using the @subsquid/graphql-server library/service, which generates and runs a GraphQL server based on the input schema. For the purpose of generating the final ("output") schema and resolvers, it uses another library called @subsquid/openreader. The schema generated by @subsquid/openreader is then merged with the schema generated from the custom resolvers that we provide in src/server-extension/resolvers. For this merge, the mergeSchemas method from the graphql-tools library is used.

The interesting property of mergeSchemas is that it also merges all individual GraphQL types defined in both schemas, which lets us reuse the autogenerated types like Video, VideoWhereInput, VideoOrderByInput etc. All we have to do is define a GraphQL object with the same name in our resolvers space, with at least one property that matches the autogenerated object (for entities it can be, for example, id: string). Then, when the types are merged, we get a consistent Video object with all the expected properties in the final schema.

This can probably be better understood by looking at the implementation inside https://github.com/Lezek123/subsquid-orion/tree/main/src/server-extension/resolvers, especially https://github.com/Lezek123/subsquid-orion/blob/main/src/server-extension/resolvers/baseTypes.ts where the "placeholders" for the to-be-autogenerated types are defined.
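A minimal illustration of the placeholder idea in SDL form (field names are illustrative):

```graphql
# Schema derived from the custom resolvers: only a stub of the
# autogenerated Video type is declared...
type Video {
  id: String!
}

# ...so that custom types can reference it:
type ExtendedVideo {
  video: Video!
  activeVideosCount: Int!
}

# After mergeSchemas combines this with the autogenerated schema, `Video`
# regains all of its autogenerated fields, while `ExtendedVideo` is added.
```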

There are also many other useful references in this directory.

Use-cases summary for custom resolvers:

  • Introducing custom queries or extending the autogenerated ones:
    • extendedChannels query (allows querying channel along with activeVideosCount aggregation)
    • extendedVideoCategories query (allows querying video categories along with activeVideosCount aggregation)
    • mostRecentChannels query (a query that allows filtering and ordering results among X most recent channels)
    • channelNftCollectors query (allows querying the list of members who collected the highest number of nfts issued by a given channel)
    • searchChannels query (allows implementing custom channel search logic)
    • searchVideos query (allows implementing custom video search logic)
    • mostViewedVideosConnection query (allows querying videos with the highest number of views in a given time period)
    • getKillSwitch query (allows retrieving the current Atlas "killSwitch" status)
    • videoHero query (allows retrieving information about content currently featured in the Atlas Hero section)
  • Introducing mutations:
    • followChannel
    • unfollowChannel
    • reportChannel
    • reportVideo
    • addVideoView
    • setKillSwitch (operator only)
    • setVideoHero (operator only)
  • Introducing subscriptions:
    • processorState (allows Atlas to stay updated about the current processing state, similar to Hydra's stateSubscription)
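To illustrate the kind of aggregation a query like mostViewedVideosConnection performs, here is a plain TypeScript sketch (in the real resolver this would be done in SQL; the shapes and names below are assumptions, not the actual implementation):

```typescript
// Illustrative sketch of a "most viewed within a period" aggregation.
interface VideoViewEvent { videoId: string; timestamp: number }

function mostViewed(
  views: VideoViewEvent[],
  periodStart: number,
  limit: number
): { videoId: string; views: number }[] {
  // Count views per video, ignoring those outside the requested period
  const counts = new Map<string, number>()
  for (const v of views) {
    if (v.timestamp >= periodStart) {
      counts.set(v.videoId, (counts.get(v.videoId) ?? 0) + 1)
    }
  }
  // Order by view count descending and apply the limit
  return Array.from(counts.entries())
    .map(([videoId, n]) => ({ videoId, views: n }))
    .sort((a, b) => b.views - a.views)
    .slice(0, limit)
}
```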

checkRequest plugin

The checkRequest plugin is a Subsquid feature that allows us to act on the Apollo server's requestDidStart event. The handler function can be implemented inside src/server-extension/checkRequest.ts and receives information like request headers, the IP of the origin, all the data specific to the GraphQL request etc.

The current example implementation shows how this plugin can be used to introduce some authentication for all mutation requests.
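A minimal sketch of that idea follows. The request shape, header name and token handling are simplified assumptions for illustration, not the actual checkRequest implementation:

```typescript
// Simplified sketch of authenticating mutation requests, in the spirit of
// the checkRequest handler. Shapes and the token source are illustrative.
interface RequestInfo {
  headers: Record<string, string | undefined>
  query: string // the raw GraphQL document
}

const OPERATOR_TOKEN = 'secret' // illustrative; would come from configuration

function checkRequest(req: RequestInfo): void {
  // Naive mutation detection for the sketch; a real implementation would
  // inspect the parsed GraphQL operation instead of the raw string
  const isMutation = /^\s*mutation\b/m.test(req.query)
  if (isMutation && req.headers['x-operator-secret'] !== OPERATOR_TOKEN) {
    throw new Error('Unauthorized')
  }
}
```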

Use cases:

  • Authentication & access restriction: The main use-case I see here is restricting access to operator-only mutations like setKillSwitch, setVideoHero etc., however, as described in the Known issues section, I believe this is a suboptimal way of doing this.

Atlas queries: Refactored!

I refactored all existing Atlas queries (https://github.com/Joystream/atlas/tree/master/packages/atlas/src/api/queries) to match the new schema.

The results can be seen here. The directory structure matches the one in the Atlas repository, which makes it easy to do a side-by-side comparison. I also added a CHANGE: comment in all places where changes were introduced.

The most notable changes can be observed in the notifications/events queries, due to the refactoring of the Event entities. It is now easier to query all the events of interest together and apply filtering, sorting and a limit on the results of one query, instead of making separate queries for each event type and then post-processing the results client-side.

Some other notable changes include:

  • Some unused (by Atlas) queries were removed;
  • Wherever there was a reference to entityId, it had to be replaced with entity.id, as Subsquid doesn't support the former syntax anymore;
  • Wherever there was a reference to the ID GraphQL type, it had to be replaced with String, as Subsquid doesn't support the former anymore;
  • Wherever the entityByUniqueInput query was used, it had to be replaced with entityById, as entityByUniqueInput is not supported in Subsquid. For members, which used to be queryable by handle, we can now either add a custom query or use the existing members query (providing handle in the where clause);
  • Some redundant relations (like event.data.bidder, if it can also be accessed through, for example, event.data.bid.bidder) were removed, so filtering now goes deeper in some cases;
  • For NFTs, fields like nftOwnerMember, isNftOwnerChannel, nftOwnerCuratorGroup were replaced with NftOwner union;
  • Some queries were renamed (like admin => getKillSwitch);
  • Channel/videoCategory queries that included activeVideosCounter were changed to extendedChannels/extendedVideoCategories queries;
  • Videos featured in a category are now simply queried via category.featuredVideos relation;
  • Some very specific, REST-API-like Orion queries like top10Videos were replaced with slightly more generic/customizable queries like mostViewedVideosConnection;
  • New search queries (separate one for channels and videos);
  • Changes related to Many-to-Many relationships no longer being supported.
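For instance, the member-by-handle lookup mentioned above can be expressed through the generic members query. The handle_eq filter suffix follows the usual Subsquid/openreader convention; the exact query shape here is an illustration:

```graphql
# Instead of a memberByUniqueInput(where: { handle: ... }) query:
query GetMemberByHandle($handle: String!) {
  members(where: { handle_eq: $handle }, limit: 1) {
    id
    handle
  }
}
```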

Custom migrations: Setting up the database

Subsquid allows us to generate database migration files that we can then use to setup the processor database.
Besides that, we can also specify some custom migrations that will be run before or after the generated ones.
In the draft implementation I introduced two custom migrations: Views and Indexes (since the filenames and class names need to include a timestamp, I've just chosen some arbitrarily high values to make sure those migrations are always run after the autogenerated ones).

Use cases for custom migrations:

  • We can specify indexes on jsonb fields or expressions, which is not possible through the input schema. This is useful when dealing with unions where some of the variants include a reference to another entity, like the new EventData union.
  • We can introduce views, which has a few benefits:
    • We can simplify complex queries
    • We can replace tables with views; for example, the channel table can be replaced with a channel view, which allows us to filter out certain channels from the results of any autogenerated query. In the draft implementation I use a channel view to exclude moderated channels; this way Atlas doesn't need to worry about including this filter in each query, and the censored channels are also hidden from anyone trying to query the server directly. If the channel gets "unmoderated", however, it will instantly re-appear in the results (which wouldn't be possible if we just deleted moderated channels permanently).
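A custom migration of this kind might be sketched as follows. The class follows the TypeORM migration shape (with the query runner reduced to a minimal structural type), and the table/column names and the view definition are illustrative assumptions, not the actual draft migration:

```typescript
// Sketch of a custom "views" migration. The high timestamp in the class name
// keeps it ordered after the autogenerated migrations, as described above.
export class Views9000000000000 {
  name = 'Views9000000000000'

  async up(queryRunner: { query(sql: string): Promise<unknown> }): Promise<void> {
    // Replace direct access to the channel table with a view that hides
    // censored channels from every autogenerated query (illustrative SQL):
    await queryRunner.query(`ALTER TABLE "channel" RENAME TO "channel_data"`)
    await queryRunner.query(
      `CREATE VIEW "channel" AS SELECT * FROM "channel_data" WHERE "is_censored" = false`
    )
  }

  async down(queryRunner: { query(sql: string): Promise<unknown> }): Promise<void> {
    await queryRunner.query(`DROP VIEW "channel"`)
    await queryRunner.query(`ALTER TABLE "channel_data" RENAME TO "channel"`)
  }
}
```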

Performance

Using the mocked data I did some performance tests against the current implementation, here are some results:

  1. GetExtendedBasicChannels query

Arguments:

  where: { activeVideosCount_gt: 2 },
  orderBy: createdAt_DESC,
  limit: 50

Number of channel entries: 12,921
Number of video entries: 257,400
Time to execute the query: 86ms

  2. GetNotifications query

Arguments:

  channelId: "1",
  memberId: "1",
  limit: 50

Number of event entries: 2,574,000
Time to execute the query: 880ms

  3. GetNftHistory query

Arguments:

  nftId: "1"

Number of event entries: 2,974,000
Time to execute the query: ~9 seconds (!)

Potential candidate for optimization

Benchmarks to be continued...

Known issues and unresolved questions

  • The Context provided by @subsquid/graphql-server is very limited; for example, we cannot access the client's IP address or any request headers inside the GraphQL resolver. This is problematic, as it makes authentication more complex (we have to use the separate "checkRequest" plugin) and also makes it difficult to handle mutations like addVideoView, where such data would be necessary to prevent abuse.
  • There are two ways I can think of in terms of introducing global category filtering, but both have their pros and cons:
    1. We can avoid storing unrelated videos/events completely, which seems like a natural approach to avoid unnecessary bloat of the database with unrelated content; however, this has two obvious drawbacks:
    • if the video category changes, there is no easy way to get all the required information in order to (re)store this video,
    • if the operator changes the supported category set, we run into the same problem - we have no access to data from categories that were not supported before (unless we re-process the chain from scratch)
    2. We store all videos/events, which kind of defeats the idea of vertical scaling; however, we can then use views to filter out unrelated content, as described in the Custom migrations: Setting up the database section.

Alternatives to consider

"Manually" setting up graphql server instead of using @subsquid/graphql-server

To have more control over the setup, we can run the GraphQL server from within our own codebase instead of using @subsquid/graphql-server. We can still take advantage of @subsquid/openreader, however, to generate the initial schema and resolvers.

Pros of this approach:

  • We solve the Context issue, so we can more easily implement our own authentication, authorization, rate limits and other restrictions.
  • Generally more freedom when writing our own resolvers, we can also decide how we want to merge them with the autogenerated schema.
  • We avoid potential issues if the @subsquid/graphql-server changes in the future in a way that no longer supports the current assumptions.

Cons:

  • Requires more work
  • We would probably be repeating a lot of work that the Subsquid team already did/does, while we could try to push for more customizability instead (which I think Subsquid as a project would benefit from too)
