microtc.weighting
- class microtc.weighting.Entropy(docs, X=None, **kwargs)[source]
Vector Space using 1 - entropy as the weighting scheme
Usage:
>>> from microtc.weighting import Entropy
>>> tokens = [['buenos', 'dia', 'microtc'], ['excelente', 'dia'],
...           ['buenas', 'tardes'],
...           ['las', 'vacas', 'me', 'deprimen', 'al', 'dia'],
...           ['odio', 'los', 'lunes'], ['odio', 'el', 'trafico'],
...           ['la', 'computadora'], ['la', 'mesa'], ['la', 'ventana']]
>>> y = [0, 0, 0, 2, 2, 2, 1, 1, 1]
>>> ent = Entropy(tokens, X=[dict(text=t, klass=k) for t, k in zip(tokens, y)])
>>> vector = ent['buenos', 'X', 'dia']
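The idea behind the 1 - entropy scheme can be illustrated with a short, self-contained sketch: for each token, measure the entropy of its distribution over classes and normalize by the maximum possible entropy, so tokens concentrated in one class get weights near 1 and tokens spread evenly over classes get weights near 0. This is only an illustration under those assumptions, not microtc's actual implementation; the function name `entropy_weights` is hypothetical.

```python
from collections import defaultdict
from math import log2

def entropy_weights(tokens, labels):
    """Illustrative sketch: weight = 1 - normalized class entropy per token.

    Not microtc's implementation; assumes at least two classes so that
    log2(#classes) is nonzero.
    """
    classes = sorted(set(labels))
    # Count, per token, the number of documents of each class it appears in
    counts = defaultdict(lambda: defaultdict(int))
    for doc, y in zip(tokens, labels):
        for tok in set(doc):
            counts[tok][y] += 1
    weights = {}
    for tok, per_class in counts.items():
        total = sum(per_class.values())
        h = -sum((c / total) * log2(c / total) for c in per_class.values())
        # Normalize by the maximum possible entropy, log2(#classes)
        weights[tok] = 1 - h / log2(len(classes))
    return weights

tokens = [['buenos', 'dia', 'microtc'], ['excelente', 'dia'],
          ['buenas', 'tardes'],
          ['las', 'vacas', 'me', 'deprimen', 'al', 'dia'],
          ['odio', 'los', 'lunes'], ['odio', 'el', 'trafico'],
          ['la', 'computadora'], ['la', 'mesa'], ['la', 'ventana']]
y = [0, 0, 0, 2, 2, 2, 1, 1, 1]
w = entropy_weights(tokens, y)
```

Here `odio` occurs only in class 2, so its weight is 1, while `dia` is split between classes 0 and 2 and gets a weight strictly between 0 and 1.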
- static entropy(corpus, docs, word2id)[source]
Compute entropy
- Parameters:
corpus (list) – Tokenized corpus, i.e., a list of token lists
docs (list) – Original corpus, a list of dictionaries where the key klass contains the class or label
word2id (dict) – Map from token to identifier
- Return type:
np.array
- property wordWeight
Weight associated with each word, i.e., the entropy per token
- class microtc.weighting.TF(docs, X=None, token_min_filter: Union[int, float] = 0, token_max_filter: Union[int, float] = 1, max_dimension: bool = False)[source]
- property wordWeight
Weight associated with each word; under the TF scheme this weight is one
- class microtc.weighting.TFIDF(docs, X=None, token_min_filter: Union[int, float] = 0, token_max_filter: Union[int, float] = 1, max_dimension: bool = False)[source]
Vector Space model using TFIDF
- Parameters:
docs (list) – Corpus as a list of token lists
X (list) – Original corpus; useful to pass extra information in a dict
token_min_filter (int or float) – Keep tokens that appear more times than this value
token_max_filter (int or float) – Keep tokens that appear fewer times than this value
Usage:
>>> from microtc.weighting import TFIDF
>>> tokens = [['buenos', 'dia', 'microtc'], ['excelente', 'dia'],
...           ['buenas', 'tardes'], ['las', 'vacas', 'me', 'deprimen'],
...           ['odio', 'los', 'lunes'], ['odio', 'el', 'trafico'],
...           ['la', 'computadora'], ['la', 'mesa'], ['la', 'ventana']]
>>> tfidf = TFIDF(tokens)
>>> vector = tfidf['buenos', 'X', 'trafico']
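The weighting behind this class can be sketched in plain Python: term frequency in the query document multiplied by the inverse document frequency estimated from the corpus. This sketch uses the textbook formulation tf · log(N/df); microtc's exact normalization and smoothing may differ, and the function name `tfidf_vector` is hypothetical.

```python
from collections import Counter
from math import log

def tfidf_vector(query, corpus):
    """Illustrative TF-IDF sketch: tf(query) * log(N / df), textbook form."""
    n = len(corpus)
    # Document frequency: number of documents each token appears in
    df = Counter(tok for doc in corpus for tok in set(doc))
    tf = Counter(query)
    # Tokens absent from the corpus vocabulary (e.g. 'X') get no weight
    return {tok: (freq / len(query)) * log(n / df[tok])
            for tok, freq in tf.items() if tok in df}

tokens = [['buenos', 'dia', 'microtc'], ['excelente', 'dia'],
          ['buenas', 'tardes'], ['las', 'vacas', 'me', 'deprimen'],
          ['odio', 'los', 'lunes'], ['odio', 'el', 'trafico'],
          ['la', 'computadora'], ['la', 'mesa'], ['la', 'ventana']]
vec = tfidf_vector(['buenos', 'X', 'trafico'], tokens)
```

As in the doctest above, the out-of-vocabulary token 'X' simply drops out of the resulting vector, while 'buenos' and 'trafico' each appear in a single document and therefore receive the highest IDF.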
- classmethod counter(counter, token_min_filter=0, token_max_filter=1)[source]
Create an instance from a microtc.utils.Corpus
- Parameters:
counter (microtc.utils.Corpus) – Tokens
- doc2weight(tokens)[source]
Weight associated with each token
- Parameters:
tokens (list) – List of tokens
- Return type:
tuple – (ids, term frequency, wordWeight)
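A minimal sketch of what such a return tuple might look like, assuming tokens outside the vocabulary are dropped and term frequencies are normalized by the number of known tokens (both assumptions; microtc may handle these differently, and this standalone `doc2weight` function is hypothetical):

```python
from collections import Counter

def doc2weight(tokens, word2id, word_weight):
    """Illustrative sketch: map known tokens to ids and pair each id with
    its (normalized) term frequency and its global weight."""
    # Keep only tokens present in the vocabulary
    freq = Counter(t for t in tokens if t in word2id)
    ids = sorted(word2id[t] for t in freq)
    id2tok = {v: k for k, v in word2id.items()}
    total = sum(freq.values())
    tf = [freq[id2tok[i]] / total for i in ids]
    ww = [word_weight[i] for i in ids]
    return ids, tf, ww

# Toy vocabulary and per-token weights (hypothetical values)
word2id = {'odio': 0, 'dia': 1, 'la': 2}
word_weight = {0: 2.0, 1: 0.5, 2: 1.5}
ids, tf, ww = doc2weight(['odio', 'los', 'odio', 'dia'], word2id, word_weight)
```

Here 'los' is out of vocabulary and is ignored, so the three parallel lists describe only 'odio' (frequency 2 of 3) and 'dia' (frequency 1 of 3).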
- property num_terms
Number of terms
- property word2id
Map from word to identifier
- property wordWeight
Weight associated with each word; this could be the inverse document frequency