fix: video language detection fix #309

Merged: zeeshanakram3 merged 3 commits into Joystream:master from zeeshanakram3:fix-detect-video-language on Mar 8, 2024

Conversation

@zeeshanakram3
Contributor

To accurately detect the language of a video based on its title and description, this fix makes the following changes:

  • Removes all URLs from the input string
  • Removes all hashtags from the input string
  • Removes the `const cleanedString = input.replace(/[\p{P}\p{S}\p{N}\p{M}]/gu, '')` regular-expression step, as it unnecessarily strips a lot of characters from the input string and changes its composition
  • Uses only the title for language detection; if the detected language accuracy is not acceptable, uses title + description as the input instead
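
The first two bullets could look roughly like the sketch below. This is only an illustration: the helper name `cleanForLanguageDetection` and both patterns are assumptions, not the code merged in this PR. Punctuation, symbols, digits, and marks are deliberately left alone, in line with the third bullet.

```typescript
// Hypothetical helper: the name and the exact patterns are assumptions,
// not the implementation from this PR.
const URL_PATTERN = /(?:https?:\/\/|www\.)\S+/gi
const HASHTAG_PATTERN = /#[^\s#]+/g

function cleanForLanguageDetection(input: string): string {
  return input
    .replace(URL_PATTERN, ' ') // drop URLs, which carry no language signal
    .replace(HASHTAG_PATTERN, ' ') // drop hashtags such as #shorts
    .replace(/\s+/g, ' ') // collapse the whitespace left behind
    .trim()
}
```

With these patterns, `cleanForLanguageDetection('Visita https://example.com/a?b=1 hoy #viaje #2024')` returns `'Visita hoy'`, while a string such as `"C'est l'été!"` passes through unchanged.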

zeeshanakram3 requested a review from WRadoslaw on March 3, 2024 18:58

// console.log(`Cleaned text: ${cleanedText}`)
// Get the most accurate language prediction
return detectAll(cleanedText).length ? detectAll(cleanedText)[0] : undefined
Contributor

I'm not sure how costly it is to run detection over a string, but I'd be in favor of calling it only once. We don't even have to assign it.

Suggested change
- return detectAll(cleanedText).length ? detectAll(cleanedText)[0] : undefined
+ return detectAll(cleanedText)?.[0]

This should have the same outcome with a single call.
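
To make the equivalence concrete, here is a small sketch that counts calls. The `detectAll` below is a hypothetical stand-in, not the real library function; it is assumed to always return an array (possibly empty), and the `Prediction` shape is inferred from the `accuracy` field used elsewhere in this PR.

```typescript
// Shape of one prediction; assumed from the `accuracy` field used in the PR.
interface Prediction {
  lang: string
  accuracy: number
}

let calls = 0

// Hypothetical stand-in for the real `detectAll`: counts how often it runs.
function detectAll(text: string): Prediction[] {
  calls += 1
  return text.trim().length > 0 ? [{ lang: 'en', accuracy: 0.9 }] : []
}

// Original form: runs detection twice whenever there is at least one prediction.
function firstPredictionTwice(text: string): Prediction | undefined {
  return detectAll(text).length ? detectAll(text)[0] : undefined
}

// Suggested form: a single call; indexing an empty array already yields undefined.
function firstPredictionOnce(text: string): Prediction | undefined {
  return detectAll(text)?.[0]
}
```

Both functions return the same value for the same input; the only difference under this stub is that the first increments `calls` twice for non-empty text and the second increments it once.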

Contributor Author

Ah, right, that's a mistake on my end.

let detectedLang: string | undefined

const titleLang = predictLanguage(title ?? '')
if (titleLang && titleLang?.accuracy < 0.5) {
Contributor

Did you do any benchmarking to find the threshold below which the prediction is mostly wrong?

Contributor Author

No, it's just a numerical guess, i.e., if the prediction confidence is less than 50%, then apply the logic above.
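
For reference, the title-first approach with a confidence cutoff might be sketched as follows. Everything here is illustrative: `pickVideoLanguage`, the `Detector` type, and the `minAccuracy` parameter are hypothetical names, and the detector is injected so the cutoff can be tuned or benchmarked later. Unlike the snippet under review, this sketch also falls back to title + description when the title yields no prediction at all, which is an assumption about the desired behavior.

```typescript
interface Prediction {
  lang: string
  accuracy: number
}

// Hypothetical detector signature: returns the best prediction, if any.
type Detector = (text: string) => Prediction | undefined

// Assumed cutoff, taken from the 0.5 in the reviewed snippet; it was not benchmarked.
const MIN_TITLE_ACCURACY = 0.5

function pickVideoLanguage(
  detect: Detector,
  title?: string,
  description?: string,
  minAccuracy: number = MIN_TITLE_ACCURACY
): string | undefined {
  const titleLang = detect(title ?? '')
  // A confident title prediction is used as-is.
  if (titleLang && titleLang.accuracy >= minAccuracy) {
    return titleLang.lang
  }
  // Low confidence (or no prediction): retry with title + description.
  const combined = `${title ?? ''} ${description ?? ''}`.trim()
  return detect(combined)?.lang
}
```

Note that the fallback result is returned whatever its accuracy; whether a second cutoff should apply there is left open.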

zeeshanakram3 merged commit 8e5b98b into Joystream:master on Mar 8, 2024
zeeshanakram3 added a commit that referenced this pull request Mar 12, 2024
…ication events (#314)

* Add is short field to video entity (#301)

* add isShort field to video entity

* regenerate db migrations

* remove @joystream/metadata-protobuf patch from assets/patches

* fix lint issue

* Disable both in Appp and eail notifications for video posted events (#299)

* bump package version and update CHANGELOG (#302)

* bump package version and update CHANGELOG

* change release version

* Simple public homefeed query and mutation (#304)

* update graphql schema

* add partial index on 'video.include_in_home_feed' field

* update video view definition to only include public videos

* regenerate migrations

* add dumbPublicFeedVideos custom query

* add setPublicFeedVideos mutation

* fix lint issue

* add arg to skip video IDs

* revert: update video view definition to only include public videos

* add feat. to unset public feed videos

* address requested change

* bump package version and update CHANGELOG

* Update `nara` from `master` (#300)

* Adds mappings for `ChannelAssetsDeletedByModerator` & `VideoAssetsDeletedByModerator` events (#199)

* mark 'VideoDeletedByModerator' & 'ChannelDeletedByModerator' events deprecated

* Implements mappings for 'Content.VideoAssetsDeletedByModerator and 'Content.ChannelAssetsDeletedByModerator' runtime events

* remove unused import

* Nara/crt update (#244)

* feat: build orion

* feat: start generating schema

* fix: extra entities

* fixup!

* fix: continue implementing design specs

* fix: review and fix foreign key relationships

* fix: formatting

* fix: generation errors

* fix: add comment

* fix: relations

* fix: final review

* fixup!

* fix: add ending blocks

* fix: generate type & set typegen to ipv4

* fix: add support for event backward compatibility

* feat: start adding mappings

* fix: continue with mappnigs

* feat: init sale

* feat: patronage decreased to & fixed build

* feat: claim patronage event

* feat: tokens bought on amm

* feat: tokens sold on amm

* fix: add relation between sales and vesting schedules

* feat: add Tokens sold on sale vente

* feat: update upcoming sale

* feat: revenue share issued

* feat: member joined whitelist

* feat: amm deactivated

* feat: burned token

* feat: transfer policy changed to permissionless

* feat: sale finalized

* feat: finish mappings

* fix: review

* fix: remove cascade deletions

* fix: renaming & formatting

* fixup!

* fixup!

* fix: patched protobuf packages with token proto

* feat: update metadata and add event handler scheleton

* feat: token metadata

* feat: sale metadata

* fix: review comments

* fix: formatting

* fix: revenue

* Revert "fix: revenue"

This reverts commit 0821abe.

* fix: token status after sale

* fix: fixmes

* fix: formatting

* fix: funds accounting during sale

* fix: amount accounting

* fix: linter

* fix: review

* fix: review 2

* fix: review

* fix: linter

* feat: migration for new db scheam

* fix: update event versions

* fix: patch types with crt_release types

* fix: patch types

* fix: generate all events versions since mainnet

* fix: temp fix after event version generation

* fix: event versioning

* fix: add migration

* fix: mignations

* fix: solve channel not being added

* fix: add id to TokenChannel

* fix: non-nullable deleted field set

* fix: format

* feat: creator token init sale re enabling

* feat: re enable sale init code

* fix: update types

* fix: amm id

* fix: id computation for revenue share

* fix: amm id computation for token

* fix: issuer transfer accounting

* fix: amm tx id

* fix: destination accounting

* feat: minor fix on holder transfer processing

* fix: re-enable metadata

* fix: metadata parsing

* fix: post reword cleanup

* fix: format

* fix: silence ci checks

* fix: event version

* fix: address PR changes

I edited all the entity that have a composite index like TokenAccount so that they have
a synthetic ID and an optionally unique @index

* fix: add hidden entities conditions

* fix: add extra fields to token in order to keep track of ongoing status

* fix: build errors

* fix: adapt mapping to new token fields

* fix: format

* feat: add trailer video entity

this is required so we can simply make trailer video hidden if video is hidden

* fix: linter

* chore: prettier

* fix: from PR review

* fix: vesting schedule schema & mappings

I have replaced the vesting schedule back to the original schema with:
- VestingSchedule: holding vesting schedule information such being amount agnostic
- VestedAccount: contains information regarded to a vested account, the goal is to mimic the
runtime logic

* fix: burning from vesting

* patch: metadata-protobuf package

* patch: metadata-protobuf package

* fix: generate migrations

* fix: purchase token on sale

* Update schema/token.graphql

Co-authored-by: Leszek Wiesner <leszek@jsgenesis.com>

* Update schema/token.graphql

Co-authored-by: Leszek Wiesner <leszek@jsgenesis.com>

* fix: address PR

* fix: hidden entities

* fix: migration ok

* feat: add extra check for migrations

* fix: docker network

* fix: format

* fix: remove unrequired constraint

* fix: 🐛 post rebase fixes

* feat: 🎨 add metadata processing for issue token

* feat(crt-v1): ✨ chain metadata for v 2003

* fix(crt-v1): 🚑 comment out view element for orion playgroud

* fix(crt-v1): 🎨 add playground config variable to .env

* feat: ✅ add tests

* fix(crt-v1): 📦 packages and patches

* fix(crt-v1): ✅ update entity id used and other minor fixes

* fix(crt-v1): ✅ update entity id used and other minor fixes

* test(crt-v1): 🐛 misc fixes to have tests working

* test(crt-v1): 🐛 misc fixes to have tests working

* fix(crt-v1): 🐛 metadata and trailer video

* feat(crt-v1): 🎨 update types

* fix(crt-v1): ✨ Add correct Ratio denomination (Permill)

* update with master

* fix: 🐛 metadata not being set

* fix: 🐛 parameters order

* test: 🧪 fixing integration tests

* test(crt-v1): 🧪 fix integration tests

* feat(crt-v1): ✨ last price for token and recovered field for rev share part

* feat: ✨ add resolver for dividend amount

* feat(crt-v1): ✨ start adding channel fields for trackingtotal revenue

* feat(crt-v1): ✨ add utils for royalty computation

* feat(crt-v1): ✨ cumulative revenue on channel

* feat(crt-v1): ✨ add resolver for transferrable amount

* fix(crt-v1): ✨ add `acquiredAt` to pinpoint latest vesting schedule for account

* Token metadata processing update

* Prettier

* chore(crt-v1): ⚡ dbgen

* fix(crt-v1): 🧪 fix integration tests

* fix(crt-v1): 🐛 missing fields in token sale vesting source

* test(crt-v1): 🧪 test for transferrable balance amount

* fix(crt-v1): 🐛 transferrable amount

* test: 🧪 update tests after resolver fix

* fix: 🐛 error on vesting schedules array

* fix: 🎨 CI fixes

* docs: update gitignore

* fix: 🚨 prettier

* build: 📌 chai depnedencies

---------

Co-authored-by: Leszek Wiesner <leszek@jsgenesis.com>
Co-authored-by: WRadoslaw <r.wyszynski00@gmail.com>

* Clear benefits even if not passed (#282)

* 🤑 Fix revenue share dividend estimation (#297)

* Fix on revenue share dividend estimation

* Fix type on result

* 🛕 Historical revenue share participants (#286)

* New field for revenue share

* Set potential revenue share particitants at the time of start

* fix: .gitignore not working

* fix lint issues

* re-generate db migrations

* commit register.html.mst file

* fix: notifications integration test

---------

Co-authored-by: Ignazio Bovo <ignazio@jsgenesis.com>
Co-authored-by: Leszek Wiesner <leszek@jsgenesis.com>
Co-authored-by: WRadoslaw <r.wyszynski00@gmail.com>
Co-authored-by: WRadoslaw <92513933+WRadoslaw@users.noreply.github.com>

* Revert "Update `nara` from `master` (#300)" (#306)

This reverts commit 887427c.

* generate auth api docs and types

* add is short derived field to video entity (#310)

* add is shirt derived field to video entity

* add indices on is short fields

* fix: video language detection fix (#309)

* fix: video language detection fix

* address requested changes

* fix: predictVideoLanguage function

* fix: include max 1 video per channel in homepage videos (#313)

* fix: include max 1 video per channel in homepage videos

* update setOrionLanguage Migration script

* format updateVideoRelevanceValue SQL query

* fix: use UTC midnight epoch instead of current epoch to calculate video relevance score

* bump package version and update CHANGELOG

* fix: lint bug

* add CRT token 'channelId' to amm burn/mint and sale mint notification events

---------

Co-authored-by: Ignazio Bovo <ignazio@jsgenesis.com>
Co-authored-by: Leszek Wiesner <leszek@jsgenesis.com>
Co-authored-by: WRadoslaw <r.wyszynski00@gmail.com>
Co-authored-by: WRadoslaw <92513933+WRadoslaw@users.noreply.github.com>
zeeshanakram3 added a commit that referenced this pull request Mar 13, 2024
* Add is short field to video entity (#301)

* add isShort field to video entity

* regenerate db migrations

* remove @joystream/metadata-protobuf patch from assets/patches

* fix lint issue

* Disable both in Appp and eail notifications for video posted events (#299)

* bump package version and update CHANGELOG (#302)

* bump package version and update CHANGELOG

* change release version

* Simple public homefeed query and mutation (#304)

* update graphql schema

* add partial index on 'video.include_in_home_feed' field

* update video view definition to only include public videos

* regenerate migrations

* add dumbPublicFeedVideos custom query

* add setPublicFeedVideos mutation

* fix lint issue

* add arg to skip video IDs

* revert: update video view definition to only include public videos

* add feat. to unset public feed videos

* address requested change

* bump package version and update CHANGELOG

* Update `nara` from `master` (#300)

* Adds mappings for `ChannelAssetsDeletedByModerator` & `VideoAssetsDeletedByModerator` events (#199)

* mark 'VideoDeletedByModerator' & 'ChannelDeletedByModerator' events deprecated

* Implements mappings for 'Content.VideoAssetsDeletedByModerator and 'Content.ChannelAssetsDeletedByModerator' runtime events

* remove unused import

* Nara/crt update (#244)

* feat: build orion

* feat: start generating schema

* fix: extra entities

* fixup!

* fix: continue implementing design specs

* fix: review and fix foreign key relationships

* fix: formatting

* fix: generation errors

* fix: add comment

* fix: relations

* fix: final review

* fixup!

* fix: add ending blocks

* fix: generate type & set typegen to ipv4

* fix: add support for event backward compatibility

* feat: start adding mappings

* fix: continue with mappnigs

* feat: init sale

* feat: patronage decreased to & fixed build

* feat: claim patronage event

* feat: tokens bought on amm

* feat: tokens sold on amm

* fix: add relation between sales and vesting schedules

* feat: add Tokens sold on sale vente

* feat: update upcoming sale

* feat: revenue share issued

* feat: member joined whitelist

* feat: amm deactivated

* feat: burned token

* feat: transfer policy changed to permissionless

* feat: sale finalized

* feat: finish mappings

* fix: review

* fix: remove cascade deletions

* fix: renaming & formatting

* fixup!

* fixup!

* fix: patched protobuf packages with token proto

* feat: update metadata and add event handler scheleton

* feat: token metadata

* feat: sale metadata

* fix: review comments

* fix: formatting

* fix: revenue

* Revert "fix: revenue"

This reverts commit 0821abe.

* fix: token status after sale

* fix: fixmes

* fix: formatting

* fix: funds accounting during sale

* fix: amount accounting

* fix: linter

* fix: review

* fix: review 2

* fix: review

* fix: linter

* feat: migration for new db scheam

* fix: update event versions

* fix: patch types with crt_release types

* fix: patch types

* fix: generate all events versions since mainnet

* fix: temp fix after event version generation

* fix: event versioning

* fix: add migration

* fix: mignations

* fix: solve channel not being added

* fix: add id to TokenChannel

* fix: non-nullable deleted field set

* fix: format

* feat: creator token init sale re enabling

* feat: re enable sale init code

* fix: update types

* fix: amm id

* fix: id computation for revenue share

* fix: amm id computation for token

* fix: issuer transfer accounting

* fix: amm tx id

* fix: destination accounting

* feat: minor fix on holder transfer processing

* fix: re-enable metadata

* fix: metadata parsing

* fix: post reword cleanup

* fix: format

* fix: silence ci checks

* fix: event version

* fix: address PR changes

I edited all the entity that have a composite index like TokenAccount so that they have
a synthetic ID and an optionally unique @index

* fix: add hidden entities conditions

* fix: add extra fields to token in order to keep track of ongoing status

* fix: build errors

* fix: adapt mapping to new token fields

* fix: format

* feat: add trailer video entity

this is required so we can simply make trailer video hidden if video is hidden

* fix: linter

* chore: prettier

* fix: from PR review

* fix: vesting schedule schema & mappings

I have replaced the vesting schedule back to the original schema with:
- VestingSchedule: holding vesting schedule information such being amount agnostic
- VestedAccount: contains information regarded to a vested account, the goal is to mimic the
runtime logic

* fix: burning from vesting

* patch: metadata-protobuf package

* patch: metadata-protobuf package

* fix: generate migrations

* fix: purchase token on sale

* Update schema/token.graphql

Co-authored-by: Leszek Wiesner <leszek@jsgenesis.com>

* Update schema/token.graphql

Co-authored-by: Leszek Wiesner <leszek@jsgenesis.com>

* fix: address PR

* fix: hidden entities

* fix: migration ok

* feat: add extra check for migrations

* fix: docker network

* fix: format

* fix: remove unrequired constraint

* fix: 🐛 post rebase fixes

* feat: 🎨 add metadata processing for issue token

* feat(crt-v1): ✨ chain metadata for v 2003

* fix(crt-v1): 🚑 comment out view element for orion playgroud

* fix(crt-v1): 🎨 add playground config variable to .env

* feat: ✅ add tests

* fix(crt-v1): 📦 packages and patches

* fix(crt-v1): ✅ update entity id used and other minor fixes

* fix(crt-v1): ✅ update entity id used and other minor fixes

* test(crt-v1): 🐛 misc fixes to have tests working

* test(crt-v1): 🐛 misc fixes to have tests working

* fix(crt-v1): 🐛 metadata and trailer video

* feat(crt-v1): 🎨 update types

* fix(crt-v1): ✨ Add correct Ratio denomination (Permill)

* update with master

* fix: 🐛 metadata not being set

* fix: 🐛 parameters order

* test: 🧪 fixing integration tests

* test(crt-v1): 🧪 fix integration tests

* feat(crt-v1): ✨ last price for token and recovered field for rev share part

* feat: ✨ add resolver for dividend amount

* feat(crt-v1): ✨ start adding channel fields for trackingtotal revenue

* feat(crt-v1): ✨ add utils for royalty computation

* feat(crt-v1): ✨ cumulative revenue on channel

* feat(crt-v1): ✨ add resolver for transferrable amount

* fix(crt-v1): ✨ add `acquiredAt` to pinpoint latest vesting schedule for account

* Token metadata processing update

* Prettier

* chore(crt-v1): ⚡ dbgen

* fix(crt-v1): 🧪 fix integration tests

* fix(crt-v1): 🐛 missing fields in token sale vesting source

* test(crt-v1): 🧪 test for transferrable balance amount

* fix(crt-v1): 🐛 transferrable amount

* test: 🧪 update tests after resolver fix

* fix: 🐛 error on vesting schedules array

* fix: 🎨 CI fixes

* docs: update gitignore

* fix: 🚨 prettier

* build: 📌 chai depnedencies

---------

Co-authored-by: Leszek Wiesner <leszek@jsgenesis.com>
Co-authored-by: WRadoslaw <r.wyszynski00@gmail.com>

* Clear benefits even if not passed (#282)

* 🤑 Fix revenue share dividend estimation (#297)

* Fix on revenue share dividend estimation

* Fix type on result

* 🛕 Historical revenue share participants (#286)

* New field for revenue share

* Set potential revenue share particitants at the time of start

* fix: .gitignore not working

* fix lint issues

* re-generate db migrations

* commit register.html.mst file

* fix: notifications integration test

---------

Co-authored-by: Ignazio Bovo <ignazio@jsgenesis.com>
Co-authored-by: Leszek Wiesner <leszek@jsgenesis.com>
Co-authored-by: WRadoslaw <r.wyszynski00@gmail.com>
Co-authored-by: WRadoslaw <92513933+WRadoslaw@users.noreply.github.com>

* Revert "Update `nara` from `master` (#300)" (#306)

This reverts commit 887427c.

* generate auth api docs and types

* add is short derived field to video entity (#310)

* add is shirt derived field to video entity

* add indices on is short fields

* fix: video language detection fix (#309)

* fix: video language detection fix

* address requested changes

* fix: predictVideoLanguage function

* fix: include max 1 video per channel in homepage videos (#313)

* fix: include max 1 video per channel in homepage videos

* update setOrionLanguage Migration script

* format updateVideoRelevanceValue SQL query

* fix: use UTC midnight epoch instead of current epoch to calculate video relevance score

* bump package version and update CHANGELOG

* fix: lint bug

* remove NextEntityIdManager migration script

* [offchainState] add v4.0.0 (CRT release) migrations

* [offchainState] remove ORDER BY clause from UPDATE statements

* add migration for NextEntityId

* bump package version and update CHANGELOG

---------

Co-authored-by: Ignazio Bovo <ignazio@jsgenesis.com>
Co-authored-by: Leszek Wiesner <leszek@jsgenesis.com>
Co-authored-by: WRadoslaw <r.wyszynski00@gmail.com>
Co-authored-by: WRadoslaw <92513933+WRadoslaw@users.noreply.github.com>
malchililj added a commit to malchililj/orion that referenced this pull request Sep 3, 2024
* fix: video language detection fix

* address requested changes

* fix: predictVideoLanguage function