
Orion V2 initial implementation draft #48

@Lezek123

Description


The draft implementation I'm describing in this document can be found HERE.

This repository is based on the current squid-substrate-template and contains all the input schemas, refactored Atlas queries, drafts of the custom GraphQL resolvers etc., as well as some basic setup and reference code to illustrate how certain issues can be addressed.

The official Subsquid documentation, which is often referenced throughout this document, can be found here: https://docs.subsquid.io/

How to run the local setup

I'll be explaining each part of the setup in more depth later in this document, but here is a quick-start reference:

  1. Clone the Joystream repository (if not already done)
git clone https://github.com/Joystream/joystream.git
  2. Run the joystream-node service in the Joystream repository
# You can also specify the usual environment variables like `RUNTIME_PROFILE` etc.
export JOYSTREAM_NODE_TAG=$(./scripts/runtime-code-shasum.sh)
docker-compose up -d joystream-node
  3. Clone the Subsquid-Orion repository
cd ..
git clone https://github.com/Lezek123/subsquid-orion.git
  4. Run the archive (indexer)
cd subsquid-orion/archive
docker-compose up -d
  5. Build the processor
cd ..
npm install
make codegen
make build
  6. Run and migrate the processor database
make up
make migrate
  7. Run the processor
make process
  8. Run the GraphQL server
make serve

After performing those steps you should be able to go to http://localhost:4350/graphql and see something like this:
(screenshot: orion-graphql)

Currently the processor will produce some mock data on each block so you can also test some of the existing queries:
(screenshot: orion-query)

On-chain data indexing and processing

Squid Archive

The Squid Archive is a concept analogous to the Hydra Indexer: it uses the Joystream node's WebSocket RPC endpoint to fetch data about on-chain events and extrinsics and stores it in a relational database (PostgreSQL).

We can configure the archive via a docker-compose file located in archive/docker-compose.yml.

The current squid archive configuration uses a local Joystream docker node (ws://joystream-node:9944) running on the joystream_default network as a source.

SubstrateBatchProcessor

SubstrateBatchProcessor is the class we use to instantiate the events processor. As opposed to Hydra, where we would only implement the "mapping" functions (or "mappings"), Subsquid lets us instantiate and programmatically configure the processor ourselves (a manifest.yml file is no longer required), which gives us more control over its behavior.

SubstrateBatchProcessor is just one of the many processor implementations available in Subsquid, but it's the one currently recommended for processing substrate events and extrinsics. This specific processor implementation queries all blocks, along with the events of interest, from the Squid Archive (using the @subsquid/substrate-gateway service). The maximum number of blocks in a single batch currently depends on the @subsquid/substrate-gateway implementation. It's still a little unclear how this will work in the future, but currently there are two main components that affect the batch size:

Current processor implementation:

In the current draft implementation:

  • the processor is given a new TypeormDatabase({ isolationLevel: 'READ COMMITTED' }) instance (which is then used to insert data into the database). As you can see, we can easily provide config for TypeormDatabase here, which we take advantage of by specifying isolationLevel: 'READ COMMITTED'. This isolation level reduces the possibility of conflicts, since the database state will be modified both by the processor and through the external API (i.e. counting video views, featuring, reporting video/channel etc., as explained further down below);
  • I added some code which populates the database with a bunch of "mock" entities on each System.ExtrinsicSuccess event;
  • I specified Content.VideoCreated and Content.ChannelCreated among the events of interest for illustration purposes.

This implementation provides a decent general overview of how the "mappings" are written in Subsquid and how one can extract the events and data of interest from a batch and then perform bulk inserts/updates at the end of processing a batch, which considerably increases performance.
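The batch-mapping pattern described above can be sketched in plain TypeScript. Note this is an illustration only: the Event, Channel and Video shapes below are simplified stand-ins for the real Subsquid/TypeORM types, not the actual implementation.

```typescript
// Sketch of the batch-mapping pattern: extract entities of interest from a
// whole batch of blocks first, then persist everything in one bulk operation.
// All types here are simplified stand-ins used for illustration only.
type Event =
  | { name: 'Content.ChannelCreated'; channelId: string; blockNumber: number }
  | { name: 'Content.VideoCreated'; videoId: string; channelId: string; blockNumber: number }

interface Channel { id: string; createdInBlock: number }
interface Video { id: string; channelId: string; createdInBlock: number }

// Accumulate entities in memory while iterating over the batch...
function processBatch(events: Event[]): { channels: Channel[]; videos: Video[] } {
  const channels: Channel[] = []
  const videos: Video[] = []
  for (const e of events) {
    if (e.name === 'Content.ChannelCreated') {
      channels.push({ id: e.channelId, createdInBlock: e.blockNumber })
    } else {
      videos.push({ id: e.videoId, channelId: e.channelId, createdInBlock: e.blockNumber })
    }
  }
  // ...so that a single bulk insert per entity type can be issued at the end
  // (in the real mappings this would be something like a bulk store insert)
  return { channels, videos }
}
```

The key point is that no database round-trip happens per event; all writes are deferred to the end of the batch.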

The API

Input schema

The current input schema files can be found here: https://github.com/Lezek123/subsquid-orion/tree/main/schema
I tried to preserve a similar schema to the one we currently use in Hydra, however there are some notable differences:

  • I only used entities that are actually of interest for Atlas (so no proposals, working group, forum etc.);
  • In some entities I reduced the set of fields to only those that are currently of use for Atlas. For example, since Atlas currently doesn't support channel collaborators, I removed collaborators field from the Channel entity. This is mainly for simplification purposes and to reduce the initial scope of work, they can of course be added later if needed;
  • Interfaces are no longer supported in Subsquid; however, since unions can be used as an alternative, I refactored the events to use an EventData union instead. The result can be seen here. I will explain the differences that come from this change further down below;
  • Since deeply nested filters are now supported, as well as nested field queries, I removed some redundant entity relationships etc., which I assumed were mostly serving as a workaround for the lack of those features before;
  • I replaced fields like nftOwnerMember, isNftOwnerChannel, nftOwnerCuratorGroup in the NFT entity with an NftOwner union for better clarity;
  • Unified state: The input schema now also includes entities that were previously only existing in Orion, like ChannelReport etc. Some of those entities will probably be moved away from the input schema, unless we want to take advantage of the autogenerated queries that Subsquid will provide for them. If not - we can just use custom models instead (as described further below).
  • Entities like Video and Channel now include new followsNum and videoViewsNum counters, as Atlas relies on being able to execute queries that include sorting based on those values, as well as receiving those values as part of the result set. With that in mind, I decided that the cost of introducing those additional fields is relatively low compared to the efficiency and simplicity benefits; however, some custom aggregation queries (like sorting based on the number of follows within a given time period) will also be needed, as will be explained further below.
  • activeVideoCounter fields have been removed, as this data is now accessible through extended queries. See Custom type-graphql Resolvers section for more details;
  • createdAt and updatedAt fields are no longer automatically added, so in some cases I included them in the input schema (for Event entity however I decided to name the field timestamp instead);
  • Many-to-Many relationships are not supported in Subsquid, so they were refactored into two Many-to-One relationships through a dedicated "join entity".
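As an illustration of the join-entity refactoring (Playlist and PlaylistVideo are hypothetical entities used only as an example, not part of the actual schema), a Many-to-Many relation in the input schema becomes:

```graphql
# The join entity holds two Many-to-One (lookup) relations...
type PlaylistVideo @entity {
  id: ID!
  playlist: Playlist!
  video: Video!
}

# ...and each side exposes the relation as a derived list:
type Playlist @entity {
  id: ID!
  videos: [PlaylistVideo!] @derivedFrom(field: "playlist")
}

type Video @entity {
  id: ID!
  playlists: [PlaylistVideo!] @derivedFrom(field: "video")
}
```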

Custom models

Subsquid comes with a nice directory structure allowing us to define our own TypeORM models separately from the autogenerated ones; however, they will all become part of the same database.

Use cases:

The primary use case for defining those custom models is when we don't want Subsquid to autogenerate the public API endpoints for querying certain (private) data, but we still want to keep this data as part of the same database to take advantage of the relational model. Take the User entity for example. We want to be able to connect users with channels through the User>-ChannelFollow-<Channel relationship, but we don't necessarily want to expose any User data through the API. That's why we define custom models for User and ChannelFollow, but we don't include those entities in the input schema.
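To sketch the idea in plain TypeScript (these are illustrative shapes, not the actual TypeORM models; the User fields are assumptions): the private entities stay queryable internally, while the public API only ever sees derived values such as a channel's follower count.

```typescript
// Private entities: stored in the database, absent from the input schema,
// so no public queries are autogenerated for them (shapes are illustrative).
interface User { id: string }
interface ChannelFollow { id: string; userId: string; channelId: string }

// The public API surface only exposes derived data, e.g. a follow counter:
function followsNum(follows: ChannelFollow[], channelId: string): number {
  return follows.filter((f) => f.channelId === channelId).length
}
```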

Custom GraphQL api extensions

Subsquid allows us to add some custom extensions to the autogenerated GraphQL API. Those are stored in the src/server-extension/ directory and constitute a significant part of the project.

Custom type-graphql Resolvers

Custom type-graphql resolvers are classes where we can define our custom GraphQL queries, mutations and subscriptions that will then be included in the final API.

Normally we run a Subsquid GraphQL server using the @subsquid/graphql-server library/service, which generates and runs a GraphQL server based on the input schema. For the purpose of generating the final ("output") schema and resolvers, it uses another library called @subsquid/openreader. The schema generated by @subsquid/openreader is then merged with the schema generated from the custom resolvers that we provide in src/server-extension/resolvers. For this merge, the mergeSchemas method from the graphql-tools library is used.

The interesting property of mergeSchemas is that it also merges all individual GraphQL types defined in both schemas, which lets us reuse the autogenerated types like Video, VideoWhereInput, VideoOrderByInput etc. All we have to do is define a GraphQL object with the same name in our resolvers space, with at least one property that matches the autogenerated object (for entities it can be, for example, id: string). Then, when the types are merged, we get a consistent Video object with all the expected properties in the final schema.

This can probably be better understood by looking at the implementation inside https://github.com/Lezek123/subsquid-orion/tree/main/src/server-extension/resolvers, especially https://github.com/Lezek123/subsquid-orion/blob/main/src/server-extension/resolvers/baseTypes.ts where the "placeholders" for the to-be-autogenerated types are defined.
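A minimal illustration of the placeholder idea in SDL form (field names are illustrative):

```graphql
# Schema derived from the custom resolvers: only a stub of the
# autogenerated Video type is declared...
type Video {
  id: String!
}

# ...so that custom types can reference it:
type ExtendedVideo {
  video: Video!
  activeVideosCount: Int!
}

# After mergeSchemas combines this with the autogenerated schema, `Video`
# regains all of its autogenerated fields, while `ExtendedVideo` is added.
```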

There are also many other useful references in this directory.

Use-cases summary for custom resolvers:

  • Introducing custom queries or extending the autogenerated ones:
    • extendedChannels query (allows querying channel along with activeVideosCount aggregation)
    • extendedVideoCategories query (allows querying video categories along with activeVideosCount aggregation)
    • mostRecentChannels query (a query that allows filtering and ordering results among X most recent channels)
    • channelNftCollectors query (allows querying the list of members who collected the highest number of nfts issued by a given channel)
    • searchChannels query (allows implementing custom channel search logic)
    • searchVideos query (allows implementing custom video search logic)
    • mostViewedVideosConnection query (allows querying videos with the highest number of views in a given time period)
    • getKillSwitch query (allows retrieving the current Atlas "killSwitch" status)
    • videoHero query (allows retrieving information about content currently featured in the Atlas Hero section)
  • Introducing mutations:
    • followChannel
    • unfollowChannel
    • reportChannel
    • reportVideo
    • addVideoView
    • setKillSwitch (operator only)
    • setVideoHero (operator only)
  • Introducing subscriptions:
    • processorState (allows Atlas to stay updated about the current processing state, similar to Hydra's stateSubscription)
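To illustrate the kind of aggregation a query like mostViewedVideosConnection performs, here is a plain TypeScript sketch (in the real resolver this would be done in SQL; the shapes and names below are assumptions, not the actual implementation):

```typescript
// Illustrative sketch of a "most viewed within a period" aggregation.
interface VideoViewEvent { videoId: string; timestamp: number }

function mostViewed(
  views: VideoViewEvent[],
  periodStart: number,
  limit: number
): { videoId: string; views: number }[] {
  // Count views per video, ignoring those outside the requested period
  const counts = new Map<string, number>()
  for (const v of views) {
    if (v.timestamp >= periodStart) {
      counts.set(v.videoId, (counts.get(v.videoId) ?? 0) + 1)
    }
  }
  // Order by view count descending and apply the limit
  return Array.from(counts.entries())
    .map(([videoId, n]) => ({ videoId, views: n }))
    .sort((a, b) => b.views - a.views)
    .slice(0, limit)
}
```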

checkRequest plugin

The checkRequest plugin is a Subsquid feature that allows us to act on the Apollo server's requestDidStart event. The handler function can be implemented inside src/server-extension/checkRequest.ts and receives information like request headers, the IP of the origin, all the data specific to the GraphQL request etc.

The current example implementation shows how this plugin can be used to introduce some authentication for all mutation requests.
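A minimal sketch of that idea follows. The request shape, header name and token handling are simplified assumptions for illustration, not the actual checkRequest implementation:

```typescript
// Simplified sketch of authenticating mutation requests, in the spirit of
// the checkRequest handler. Shapes and the token source are illustrative.
interface RequestInfo {
  headers: Record<string, string | undefined>
  query: string // the raw GraphQL document
}

const OPERATOR_TOKEN = 'secret' // illustrative; would come from configuration

function checkRequest(req: RequestInfo): void {
  // Naive mutation detection for the sketch; a real implementation would
  // inspect the parsed GraphQL operation instead of the raw string
  const isMutation = /^\s*mutation\b/m.test(req.query)
  if (isMutation && req.headers['x-operator-secret'] !== OPERATOR_TOKEN) {
    throw new Error('Unauthorized')
  }
}
```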

Use cases:

  • Authentication & access restriction: The main use-case I see here is restricting access to operator-only mutations like setKillSwitch, setVideoHero etc., however, as described in the Known issues section, I believe this is a suboptimal way of doing this.

Atlas queries: Refactored!

I refactored all existing Atlas queries (https://github.com/Joystream/atlas/tree/master/packages/atlas/src/api/queries) to match the new schema.

The results can be seen here. The directory structure matches the one in the Atlas repository, which makes it easy to do a side-by-side comparison. I also added a CHANGE: comment in all places where changes were introduced.

The most notable changes can be observed in the notifications/events queries, due to the refactoring of the Event entities. It is now easier to query all the events of interest together and apply filtering, sorting and a limit on the results of one query, instead of making separate queries for each event type and then post-processing the results client-side.

Some other notable changes include:

  • Some unused (by Atlas) queries were removed;
  • Wherever there was a reference to entityId, it had to be replaced with entity.id, as Subsquid doesn't support the former syntax anymore;
  • Wherever there was a reference to the ID GraphQL type, it had to be replaced with String, as Subsquid doesn't support the former anymore;
  • Wherever the entityByUniqueInput query was used, it had to be replaced with entityById, as entityByUniqueInput is not supported in Subsquid. For members, which used to be queryable by handle, we can now either add a custom query or use the existing members query (providing handle in the where clause);
  • Some redundant relations (like event.data.bidder, if it can also be accessed through, for example, event.data.bid.bidder) were removed, so filtering now goes deeper in some cases;
  • For NFTs, fields like nftOwnerMember, isNftOwnerChannel, nftOwnerCuratorGroup were replaced with NftOwner union;
  • Some queries were renamed (like admin => getKillSwitch);
  • Channel/videoCategory queries that included activeVideosCounter were changed to extendedChannels/extendedVideoCategories queries;
  • Videos featured in a category are now simply queried via category.featuredVideos relation;
  • Some very specific, REST-API-like Orion queries like top10Videos were replaced with slightly more generic/customizable queries like mostViewedVideosConnection;
  • New search queries (separate one for channels and videos);
  • Changes related to Many-to-Many relationships no longer being supported.
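For instance, the member-by-handle lookup mentioned above can be expressed through the generic members query. The handle_eq filter suffix follows the usual Subsquid/openreader convention; the exact query shape here is an illustration:

```graphql
# Instead of a memberByUniqueInput(where: { handle: ... }) query:
query GetMemberByHandle($handle: String!) {
  members(where: { handle_eq: $handle }, limit: 1) {
    id
    handle
  }
}
```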

Custom migrations: Setting up the database

Subsquid allows us to generate database migration files that we can then use to setup the processor database.
Besides that, we can also specify some custom migrations that will be run before or after the generated ones.
In the draft implementation I introduced two custom migrations: Views and Indexes (since the filenames and class names need to include a timestamp, I've just chosen some arbitrarily high values to make sure those migrations are always run after the autogenerated ones).

Use cases for custom migrations:

  • We can specify indexes on jsonb fields or expressions, which is not possible through the input schema. This is useful when dealing with unions where some of the variants include a reference to another entity, like the new EventData union.
  • We can introduce views, which has a few benefits:
    • We can simplify complex queries
    • We can replace tables with views; for example, the channel table can be replaced with a channel view, which allows us to filter out certain channels from the results of any autogenerated query. In the draft implementation I use a channel view to exclude moderated channels; this way Atlas doesn't need to worry about including this filter in each query, and the censored channels are also hidden from anyone trying to query the server directly. If the channel gets "unmoderated", however, it will instantly re-appear in the results (which wouldn't be possible if we just deleted moderated channels permanently).
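A custom migration of this kind might be sketched as follows. The class follows the TypeORM migration shape (with the query runner reduced to a minimal structural type), and the table/column names and the view definition are illustrative assumptions, not the actual draft migration:

```typescript
// Sketch of a custom "views" migration. The high timestamp in the class name
// keeps it ordered after the autogenerated migrations, as described above.
export class Views9000000000000 {
  name = 'Views9000000000000'

  async up(queryRunner: { query(sql: string): Promise<unknown> }): Promise<void> {
    // Replace direct access to the channel table with a view that hides
    // censored channels from every autogenerated query (illustrative SQL):
    await queryRunner.query(`ALTER TABLE "channel" RENAME TO "channel_data"`)
    await queryRunner.query(
      `CREATE VIEW "channel" AS SELECT * FROM "channel_data" WHERE "is_censored" = false`
    )
  }

  async down(queryRunner: { query(sql: string): Promise<unknown> }): Promise<void> {
    await queryRunner.query(`DROP VIEW "channel"`)
    await queryRunner.query(`ALTER TABLE "channel_data" RENAME TO "channel"`)
  }
}
```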

Performance

Using the mocked data I did some performance tests against the current implementation, here are some results:

  1. GetExtendedBasicChannels query

Arguments:

  where: { activeVideosCount_gt: 2 },
  orderBy: createdAt_DESC,
  limit: 50

Number of channel entries: 12,921
Number of video entries: 257,400
Time to execute the query: 86ms

  2. GetNotifications query

Arguments:

  channelId: "1",
  memberId: "1",
  limit: 50

Number of event entries: 2,574,000
Time to execute the query: 880ms

  3. GetNftHistory query

Arguments:

  nftId: "1"

Number of event entries: 2,974,000
Time to execute the query: ~9 seconds (!)

Potential candidate for optimization

Benchmarks to be continued...

Known issues and unresolved questions

  • The Context provided by @subsquid/graphql-server is very limited; for example, we cannot access the client's IP address or any request headers inside the GraphQL resolver. This is problematic, as it makes authentication more complex (we have to use the separate "checkRequest" plugin) and also makes it difficult to handle mutations like addVideoView, where such data would be necessary to prevent abuse.
  • There are two ways I can think of in terms of introducing global category filtering, but both have their pros and cons:
    1. We can avoid storing unrelated videos/events completely, which seems like a natural approach to avoid unnecessary bloat of the database with unrelated content; however, this has two obvious drawbacks:
    • if the video category changes, there is no easy way to get all the required information in order to (re)store this video,
    • if the operator changes the supported category set, we run into the same problem - we have no access to data from categories that were not supported before (unless we re-process the chain from scratch)
    2. We store all videos/events, which kind of defeats the idea of vertical scaling; however, we can then use views to filter out unrelated content, as described in the Custom migrations: Setting up the database section.

Alternatives to consider

"Manually" setting up graphql server instead of using @subsquid/graphql-server

To have more control over the setup, we can run the GraphQL server from within our own codebase instead of using @subsquid/graphql-server. We can still take advantage of @subsquid/openreader, however, to generate the initial schema and resolvers.

Pros of this approach:

  • We solve the Context issue, so we can more easily implement our own authentication, authorization, rate limits and other restrictions.
  • Generally more freedom when writing our own resolvers, we can also decide how we want to merge them with the autogenerated schema.
  • We avoid potential issues if the @subsquid/graphql-server changes in the future in a way that no longer supports the current assumptions.

Cons:

  • Requires more work
  • We would probably be repeating a lot of work that the Subsquid team already did/does, while we could try to push for more customizability instead (which I think Subsquid as a project would benefit from too)
