
Topic modelling is an unsupervised machine learning method for discovering hidden semantic structure in a corpus, which lets us learn topic representations of the papers it contains. The same approach can be applied to any kind of labels attached to documents, such as tags on posts on a website.

Topic modelling can also be viewed as soft bi-clustering (documents and terms are clustered into topics simultaneously).

Models

  • Unsupervised
    • Latent Dirichlet Allocation (LDA)
    • Expectation–maximization algorithm (EM Algorithm)
    • Probabilistic latent semantic analysis (PLSA)
    • LDA2Vec
    • Sentence-BERT (SBERT)
  • Supervised or semi-supervised
    • Guided Latent Dirichlet Allocation (Guided LDA)
    • Anchored CorEx: Hierarchical Topic Modeling with Minimal Domain Knowledge (CorEx)
  • BERTopic

Dataset for this task:

Papers

Articles


Typical applications:

  • document classification and categorization
  • topic segmentation of documents
  • automatic annotation of documents
  • automatic summarization of document collections

Preprocessing and cleaning the texts

  • remove formatting and hyphenation
  • remove fragmentary and non-textual content
  • fix typos
  • merge short texts
  • Build the vocabulary:
    • Lemmatization
    • Stemming
    • Term extraction
    • Remove stopwords and overly rare words (fewer than 10 occurrences)
    • Use bigrams

Basic assumptions

  • Word order within a document does not matter (bag of words)
  • Each pair (d, w) is associated with some topic t ∈ T
  • Conditional independence hypothesis (words in documents are generated by a topic, not by the document): p(w | t, d) = p(w | t)
  • Sparsity assumptions are sometimes added:
    • A document relates to a small number of topics
    • A topic consists of a small number of terms
  • A document d is a mixture of the distributions p(w | t) with weights p(t | d)
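The last assumption is the mixture p(w | d) = Σ_t p(w | t) · p(t | d). A minimal sketch of that computation in plain Python (the tiny Φ and Θ values below are made up for illustration, not learned from data):

```python
# Mixture model: p(w | d) = sum over topics t of p(w | t) * p(t | d).
# phi[t][w] = p(w | t), theta[d][t] = p(t | d) -- toy values, not fitted.
phi = {
    "sports":   {"game": 0.6, "team": 0.3, "market": 0.1},
    "business": {"game": 0.1, "team": 0.2, "market": 0.7},
}
theta = {"doc1": {"sports": 0.8, "business": 0.2}}

def p_word_given_doc(word, doc):
    """p(w | d) under the topic mixture."""
    return sum(phi[t][word] * theta[doc][t] for t in theta[doc])

print(p_word_given_doc("game", "doc1"))  # 0.6*0.8 + 0.1*0.2 = 0.5
```

Because each row of Θ and each topic distribution in Φ sums to 1, the resulting p(w | d) is itself a probability distribution over the vocabulary.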

Problem statement

Given:

  • W - a vocabulary of terms (words or phrases)
  • D - a collection of text documents d ⊂ W
  • n_dw - how many times term (word) w occurs in document d
  • n_d - the length of document d

Find: the parameters of the probabilistic topic model (law of total probability):

p(w | d) = Σ_{t ∈ T} p(w | t) · p(t | d)

Conditional distributions:

  • φ_wt = p(w | t) - probabilities of terms w in each topic t
  • Θ_td = p(t | d) - probabilities of topics t in each document d

Stochastic matrix - its columns are discrete probability distributions

  • Φ holds non-negative normalized values; the sum over each column = 1, so every column is a discrete probability distribution

The solution is not unique

Remedy: use regularizers
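Before regularization enters the picture, Φ and Θ for PLSA are typically fit with the EM algorithm from the Models list above. A minimal pure-Python sketch on a toy count matrix (variable names and data are mine, not from any particular library):

```python
# EM for PLSA on toy counts n[d][w] (rows: documents, columns: words).
# E-step: p(t | d, w) is proportional to phi[w][t] * theta[d][t]
# M-step: re-estimate phi and theta from the expected counts.
import random

n = [[4, 2, 0], [0, 1, 5]]          # word counts: 2 documents, 3 words
D, W, T = len(n), len(n[0]), 2      # documents, vocabulary size, topics

rng = random.Random(0)
phi = [[rng.random() for _ in range(T)] for _ in range(W)]    # p(w | t)
theta = [[rng.random() for _ in range(T)] for _ in range(D)]  # p(t | d)

def normalize_columns(m):  # each column of phi is a distribution over words
    for t in range(T):
        s = sum(row[t] for row in m)
        for row in m:
            row[t] /= s

def normalize_rows(m):     # each row of theta is a distribution over topics
    for row in m:
        s = sum(row)
        for t in range(T):
            row[t] /= s

normalize_columns(phi)
normalize_rows(theta)

for _ in range(50):  # EM iterations
    n_wt = [[0.0] * T for _ in range(W)]
    n_dt = [[0.0] * T for _ in range(D)]
    for d in range(D):
        for w in range(W):
            if n[d][w] == 0:
                continue
            # E-step: posterior over topics for this (d, w) pair
            p = [phi[w][t] * theta[d][t] for t in range(T)]
            z = sum(p)
            for t in range(T):
                # M-step: accumulate expected counts
                n_wt[w][t] += n[d][w] * p[t] / z
                n_dt[d][t] += n[d][w] * p[t] / z
    phi, theta = n_wt, n_dt
    normalize_columns(phi)
    normalize_rows(theta)
```

Different random initializations converge to different Φ, Θ factorizations of the same p(w | d) — the non-uniqueness that regularizers are meant to resolve.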

Latent Dirichlet Allocation (LDA)

  • Generative model

Articles

The Process

  • We pick the number of topics ahead of time even if we’re not sure what the topics are.
  • Each document is represented as a distribution over topics.
  • Each topic is represented as a distribution over words.

Two important parameters of the algorithm are:

  • Topic concentration / Beta
  • Document concentration / Alpha

For the symmetric distribution, a high alpha value means that each document is likely to contain a mixture of most of the topics, and not any single topic specifically. A low alpha value places fewer such constraints on documents and makes it more likely that a document contains a mixture of just a few, or even only one, of the topics. Likewise, a high beta value means that each topic is likely to contain a mixture of most of the words, and not any word specifically, while a low value means that a topic may contain a mixture of just a few of the words.

If, on the other hand, the distribution is asymmetric, a high alpha value means that a specific topic distribution (depending on the base measure) is more likely for each document. Similarly, a high beta value means each topic is more likely to contain a specific word mix defined by the base measure.

In practice, a high alpha value will lead to documents being more similar in terms of what topics they contain. A high beta value will similarly lead to topics being more similar in terms of what words they contain.
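As a concrete illustration, scikit-learn's LatentDirichletAllocation exposes these two parameters as doc_topic_prior (alpha) and topic_word_prior (beta); the toy corpus below is invented for the sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the game",
    "the game was close but the team lost",
    "stocks fell as the market reacted",
    "the market rallied and stocks rose",
]

# Bag-of-words counts (word order is discarded, as assumed above)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,          # number of topics, chosen ahead of time
    doc_topic_prior=0.1,     # low alpha: each document favours few topics
    topic_word_prior=0.01,   # low beta: each topic favours few words
    random_state=0,
)
doc_topics = lda.fit_transform(X)  # rows: documents, columns: p(topic | doc)
print(doc_topics.shape)            # (4, 2)
```

Raising doc_topic_prior here would flatten each row of doc_topics toward a uniform mix of both topics.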

Code examples

Semi-Supervised Topic Modeling

Guided Latent Dirichlet Allocation (LDA)

GuidedLDA (also called SeededLDA) implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling. GuidedLDA can be guided by setting some seed words per topic, which makes those topics converge in that direction.
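Seeding works by mapping seed words to topic ids before fitting. A sketch of that mapping in plain Python (the vocabulary and seed lists are invented; the fit call at the end follows the GuidedLDA README but is left commented out because the package is tricky to install, as the issues below show):

```python
vocab = ["game", "team", "win", "market", "stock", "price"]
word2id = {w: i for i, w in enumerate(vocab)}

# One list of seed words per topic we want to steer
seed_topic_list = [["game", "team", "win"], ["market", "stock"]]

# GuidedLDA expects a {word_id: topic_id} mapping
seed_topics = {}
for topic_id, seeds in enumerate(seed_topic_list):
    for word in seeds:
        seed_topics[word2id[word]] = topic_id

print(seed_topics)  # {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}

# With the package installed and X a document-term count matrix:
# import guidedlda
# model = guidedlda.GuidedLDA(n_topics=2, n_iter=100, random_state=7)
# model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
```

seed_confidence controls how strongly the sampler biases seeded words toward their assigned topics.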

Articles

Code examples

Issues

Solution:

Running on OS X 10.14 with Python 3.7, it was possible to get past the build errors by upgrading Cython:

```shell
pip3 install -U cython
```

Then remove the following two lines from setup.cfg:

```
[sdist]
pre-hook.sdist_pre_hook = guidedlda._setup_hooks.sdist_pre_hook
```

And then run the original installation instructions:

```shell
git clone https://github.com/vi3k6i5/GuidedLDA
cd GuidedLDA
sh build_dist.sh
python setup.py sdist
pip3 install -e .
```
  • The GuidedLDA package fails to install

Solution:

The package that this is built on top of is LDA, and it installed with no issue. I managed to copy the files guidedlda.py, utils.py, and datasets.py, plus the few NYT dataset items, from the GuidedLDA package into the original LDA package after installation.

GuidedLDA_WorkAround

  1. Pull down the repository.

  2. Install the original LDA package. https://pypi.org/project/lda/

  3. Drop the *.py files from the GuidedLDA_WorkAround repo into the lda folder under site-packages for your specific environment.

  4. Profit...

Anchored CorEx: Hierarchical Topic Modeling with Minimal Domain Knowledge (2017)

Correlation Explanation (CorEx) is a topic model that yields rich topics that are maximally informative about a set of documents. The advantage of using CorEx versus other topic models is that it can be easily run as an unsupervised, semi-supervised, or hierarchical topic model depending on a user's needs. For semi-supervision, CorEx allows a user to integrate their domain knowledge via "anchor words". This integration is flexible and allows the user to guide the topic model in the direction of those words. This allows for creative strategies that promote topic representation, separability, and aspects. More generally, this implementation of CorEx is good for clustering any sparse binary data.

Modifications of CorEx

Useful Articles

Issues on GitHub

Regularization

Kullback–Leibler divergence (relative entropy)

A way to measure the distance between two probability distributions: KL(p ‖ q) = Σ_i p_i · log(p_i / q_i)
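A minimal computation in plain Python (the two toy distributions are arbitrary):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); non-negative, zero iff p == q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q))  # small positive number
print(kl_divergence(p, p))  # 0.0 for identical distributions
```

Note that KL divergence is asymmetric, KL(p ‖ q) ≠ KL(q ‖ p), so strictly speaking it is not a metric.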

Other models, tools and possible solutions

Tools

  • pyLDAvis
    • Python library for interactive topic model visualization.