microtc.weighting

class microtc.weighting.Entropy(docs, X=None, **kwargs)[source]

Vector Space using 1 - entropy as the weighting scheme

Usage:

>>> from microtc.weighting import Entropy
>>> tokens = [['buenos', 'dia', 'microtc'], ['excelente', 'dia'], ['buenas', 'tardes'], ['las', 'vacas', 'me', 'deprimen', 'al', 'dia'], ['odio', 'los', 'lunes'], ['odio', 'el', 'trafico'], ['la', 'computadora'], ['la', 'mesa'], ['la', 'ventana']]
>>> y = [0, 0, 0, 2, 2, 2, 1, 1, 1]
>>> ent = Entropy(tokens, X=[dict(text=t, klass=k) for t, k in zip(tokens, y)])
>>> vector = ent['buenos', 'X', 'dia']
static entropy(corpus, docs, word2id)[source]

Compute entropy

Parameters:
  • corpus (list) – Tokenized corpus, i.e., as a list of tokens list

  • docs (list) – Original corpus as a list of dictionaries where the key klass contains the class or label

  • word2id (dict) – Map token to identifier

Return type:

np.array

property wordWeight

Weight associated with each word, i.e., the entropy-based weight per token
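
The 1 - entropy scheme gives a high weight to tokens concentrated in one class and a weight near zero to tokens spread evenly across classes. The following is an illustrative sketch of that idea, not microtc's actual implementation; the function name entropy_weights is hypothetical.

```python
# Sketch of the 1 - entropy weighting scheme (hypothetical helper, not microtc's code).
from collections import defaultdict
from math import log

def entropy_weights(corpus, labels):
    """corpus: list of token lists; labels: one class label per document.

    Returns a dict mapping each token to 1 - normalized entropy of the
    class distribution of the documents that contain it.
    """
    classes = sorted(set(labels))
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, klass in zip(corpus, labels):
        for tok in set(tokens):
            counts[tok][klass] += 1
    weights = {}
    for tok, per_class in counts.items():
        total = sum(per_class.values())
        ent = 0.0
        for klass in classes:
            p = per_class[klass] / total
            if p > 0:
                ent -= p * log(p, 2)
        # Normalize by log2(number of classes) so the entropy lies in [0, 1].
        ent /= log(len(classes), 2) if len(classes) > 1 else 1.0
        weights[tok] = 1.0 - ent
    return weights
```

A token that only occurs in documents of a single class gets weight 1.0; one that occurs equally often in every class gets weight 0.0.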

class microtc.weighting.TF(docs, X=None, token_min_filter: Union[int, float] = 0, token_max_filter: Union[int, float] = 1, max_dimension: bool = False)[source]
property wordWeight

Weight associated with each word; in TF this weight is one (1) for every word

class microtc.weighting.TFIDF(docs, X=None, token_min_filter: Union[int, float] = 0, token_max_filter: Union[int, float] = 1, max_dimension: bool = False)[source]

Vector Space model using TFIDF

Parameters:
  • docs (list) – corpus as a list of list of tokens

  • X (list) – original corpus, useful to pass extra information in a dict

  • token_min_filter (int or float) – Keep only those tokens that appear more times than this value

  • token_max_filter (int or float) – Keep only those tokens that appear fewer times than this value

Usage:

>>> from microtc.weighting import TFIDF
>>> tokens = [['buenos', 'dia', 'microtc'], ['excelente', 'dia'], ['buenas', 'tardes'], ['las', 'vacas', 'me', 'deprimen'], ['odio', 'los', 'lunes'], ['odio', 'el', 'trafico'], ['la', 'computadora'], ['la', 'mesa'], ['la', 'ventana']]
>>> tfidf = TFIDF(tokens)
>>> vector = tfidf['buenos', 'X', 'trafico']
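
The IDF component that gives TFIDF its per-word weight can be sketched in a few lines. This is an illustrative computation of the standard inverse document frequency, not microtc's code; the function name idf_weights is hypothetical.

```python
# Sketch of inverse document frequency (hypothetical helper, not microtc's code).
from math import log

def idf_weights(corpus):
    """corpus: list of token lists. Returns a dict token -> log(N / df),
    where N is the number of documents and df the document frequency."""
    N = len(corpus)
    df = {}
    for tokens in corpus:
        for tok in set(tokens):  # count each token once per document
            df[tok] = df.get(tok, 0) + 1
    return {tok: log(N / d) for tok, d in df.items()}
```

A token that appears in every document gets weight 0; rarer tokens get larger weights.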
classmethod counter(counter, token_min_filter=0, token_max_filter=1)[source]

Create from microtc.utils.Corpus

Parameters:
  • counter (microtc.utils.Corpus) – Tokens

doc2weight(tokens)[source]

Weight associated to each token

Parameters:

tokens (list) – list of tokens

Return type:

tuple – (ids, term frequency, wordWeight)
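
Conceptually, a doc2weight-style method maps a document's tokens to identifiers, computes their relative term frequency, and looks up the per-word weight for each identifier. The following is a hypothetical sketch of that three-part return value, not microtc's implementation.

```python
# Hypothetical sketch of a doc2weight-style computation (not microtc's code).
from collections import Counter

def doc2weight(tokens, word2id, word_weight):
    """tokens: list of tokens; word2id: token -> id; word_weight: id -> weight.

    Returns (ids, term frequencies, word weights), one entry per known token.
    """
    counts = Counter(tok for tok in tokens if tok in word2id)
    total = sum(counts.values())
    if total == 0:  # no known tokens in this document
        return [], [], []
    ids = sorted(word2id[tok] for tok in counts)
    id2tok = {v: k for k, v in word2id.items()}
    tf = [counts[id2tok[i]] / total for i in ids]  # relative term frequency
    weights = [word_weight[i] for i in ids]
    return ids, tf, weights
```

Tokens missing from word2id (e.g., filtered out during fitting) are simply skipped.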

property num_terms

Number of terms

property word2id

Map word to id

property wordWeight

Weight associated with each word; this could be the inverse document frequency