microtc.weighting

class microtc.weighting.Entropy(docs, X=None, **kwargs)[source]

Vector Space using 1 - entropy as the weighting scheme

Usage:

>>> from microtc.weighting import Entropy
>>> tokens = [['buenos', 'dia', 'microtc'], ['excelente', 'dia'], ['buenas', 'tardes'], ['las', 'vacas', 'me', 'deprimen', 'al', 'dia'], ['odio', 'los', 'lunes'], ['odio', 'el', 'trafico'], ['la', 'computadora'], ['la', 'mesa'], ['la', 'ventana']]
>>> y = [0, 0, 0, 2, 2, 2, 1, 1, 1]
>>> ent = Entropy(tokens, X=[dict(text=t, klass=k) for t, k in zip(tokens, y)])
>>> vector = ent['buenos', 'X', 'dia']
static entropy(corpus, docs, word2id)[source]

Compute entropy

Parameters:
  • corpus (list) – Tokenized corpus, i.e., as a list of tokens list

  • docs (list) – Original corpus as a list of dictionaries where the key klass contains the class or label

  • word2id (dict) – Map token to identifier

Return type:

np.array

property wordWeight

Weight associated with each word, i.e., the entropy-based weight per token
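
The 1 - entropy scheme gives a high weight to tokens concentrated in one class and a weight near zero to tokens spread evenly across classes. The following is an illustrative sketch of that idea, not microtc's actual implementation; the function name entropy_weights is hypothetical.

```python
# Sketch of the 1 - entropy weighting scheme (hypothetical helper, not microtc's code).
from collections import defaultdict
from math import log

def entropy_weights(corpus, labels):
    """corpus: list of token lists; labels: one class label per document.

    Returns a dict mapping each token to 1 - normalized entropy of the
    class distribution of the documents that contain it.
    """
    classes = sorted(set(labels))
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, klass in zip(corpus, labels):
        for tok in set(tokens):
            counts[tok][klass] += 1
    weights = {}
    for tok, per_class in counts.items():
        total = sum(per_class.values())
        ent = 0.0
        for klass in classes:
            p = per_class[klass] / total
            if p > 0:
                ent -= p * log(p, 2)
        # Normalize by log2(number of classes) so the entropy lies in [0, 1].
        ent /= log(len(classes), 2) if len(classes) > 1 else 1.0
        weights[tok] = 1.0 - ent
    return weights
```

A token that only occurs in documents of a single class gets weight 1.0; one that occurs equally often in every class gets weight 0.0.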

class microtc.weighting.TF(docs, X=None, token_min_filter: Union[int, float] = 0, token_max_filter: Union[int, float] = 1, max_dimension: bool = False)[source]
property wordWeight

Weight associated with each word; in TF this weight is one (1) for every word

class microtc.weighting.TFIDF(docs, X=None, token_min_filter: Union[int, float] = 0, token_max_filter: Union[int, float] = 1, max_dimension: bool = False)[source]

Vector Space model using TFIDF

Parameters:
  • docs (list) – corpus as a list of list of tokens

  • X (list) – original corpus, useful to pass extra information in a dict

  • token_min_filter (int or float) – Keep only those tokens that appear more times than this value

  • token_max_filter (int or float) – Keep only those tokens that appear fewer times than this value

Usage:

>>> from microtc.weighting import TFIDF
>>> tokens = [['buenos', 'dia', 'microtc'], ['excelente', 'dia'], ['buenas', 'tardes'], ['las', 'vacas', 'me', 'deprimen'], ['odio', 'los', 'lunes'], ['odio', 'el', 'trafico'], ['la', 'computadora'], ['la', 'mesa'], ['la', 'ventana']]
>>> tfidf = TFIDF(tokens)
>>> vector = tfidf['buenos', 'X', 'trafico']
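
The IDF component that gives TFIDF its per-word weight can be sketched in a few lines. This is an illustrative computation of the standard inverse document frequency, not microtc's code; the function name idf_weights is hypothetical.

```python
# Sketch of inverse document frequency (hypothetical helper, not microtc's code).
from math import log

def idf_weights(corpus):
    """corpus: list of token lists. Returns a dict token -> log(N / df),
    where N is the number of documents and df the document frequency."""
    N = len(corpus)
    df = {}
    for tokens in corpus:
        for tok in set(tokens):  # count each token once per document
            df[tok] = df.get(tok, 0) + 1
    return {tok: log(N / d) for tok, d in df.items()}
```

A token that appears in every document gets weight 0; rarer tokens get larger weights.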
classmethod counter(counter, token_min_filter=0, token_max_filter=1)[source]

Create from microtc.utils.Corpus

Parameters:
  • counter (microtc.utils.Corpus) – Tokens

doc2weight(tokens)[source]

Weight associated to each token

Parameters:

tokens (list) – list of tokens

Return type:

tuple – (ids, term frequency, wordWeight)
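
Conceptually, a doc2weight-style method maps a document's tokens to identifiers, computes their relative term frequency, and looks up the per-word weight for each identifier. The following is a hypothetical sketch of that three-part return value, not microtc's implementation.

```python
# Hypothetical sketch of a doc2weight-style computation (not microtc's code).
from collections import Counter

def doc2weight(tokens, word2id, word_weight):
    """tokens: list of tokens; word2id: token -> id; word_weight: id -> weight.

    Returns (ids, term frequencies, word weights), one entry per known token.
    """
    counts = Counter(tok for tok in tokens if tok in word2id)
    total = sum(counts.values())
    if total == 0:  # no known tokens in this document
        return [], [], []
    ids = sorted(word2id[tok] for tok in counts)
    id2tok = {v: k for k, v in word2id.items()}
    tf = [counts[id2tok[i]] / total for i in ids]  # relative term frequency
    weights = [word_weight[i] for i in ids]
    return ids, tf, weights
```

Tokens missing from word2id (e.g., filtered out during fitting) are simply skipped.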

property num_terms

Number of terms

property word2id

Map word to id

property wordWeight

Weight associated with each word; this could be the inverse document frequency