\(\mu\text{TC}\)¶
A great variety of text tasks, such as topic or spam identification, user profiling, and sentiment analysis, can be posed as supervised learning problems and tackled with a text classifier. A text classifier consists of several subprocesses; some are general enough to apply to any supervised learning problem, whereas others are designed for a particular task and rely on complex and computationally expensive processes such as lemmatization and syntactic analysis. In contrast to traditional approaches, \(\mu\text{TC}\) is a minimalist, multipurpose text classifier able to tackle tasks independently of domain and language.
\(\mu\text{TC}\)'s core entry point is microtc.textmodel.TextModel, which can be seen as a function of the form \(m(\text{text}) \rightarrow \Re^d\), where \(d\) is the size of the vocabulary, i.e., the dimension of the vector space. In other words, \(m\) transforms a text into a vector; consequently, it can turn a training set of pairs (text, label) into a training set of pairs (vector, label), which can be used directly by any supervised learning algorithm to obtain a text classifier.
microtc.textmodel.TextModel follows the idea of http://scikit-learn.org transformers. That is, it implements a method microtc.textmodel.TextModel.fit() that receives the training set, and a method microtc.textmodel.TextModel.transform() that receives a list of texts and returns a sparse matrix corresponding to the representation of the given texts in the vector space.
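For concreteness, the whole pipeline looks as follows. This is a minimal sketch using a toy corpus and labels invented for illustration, with LinearSVC as one possible supervised learner (any scikit-learn classifier works); it mirrors the usage examples in the TextModel reference below.
from microtc.textmodel import TextModel
from sklearn.svm import LinearSVC

# Toy training set: pairs of text and label (invented for illustration).
corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
labels = [1, 0, 0]

# m(text) -> R^d: fit the text model, then map the texts into the vector space.
textmodel = TextModel().fit(corpus)
X = textmodel.transform(corpus)  # sparse matrix, one row per text

# Any supervised learning algorithm can consume the vectors.
classifier = LinearSVC().fit(X, labels)
print(classifier.predict(textmodel.transform(['buenos dias ingeotec'])))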
\(\mu\text{TC}\) is described in: An Automated Text Categorization Framework based on Hyperparameter Optimization. Eric S. Tellez, Daniela Moctezuma, Sabino Miranda-Jiménez, Mario Graff. Knowledge-Based Systems, Volume 149, 1 June 2018, Pages 110-123.
Quickstart Guide¶
We have decided to make a live quickstart guide; it covers the installation and the use of \(\mu\text{TC}\) with different parameters. The notebook can be found in the docs directory on GitHub.
Citing¶
If you find \(\mu\text{TC}\) useful for any academic/scientific purpose, we would appreciate citations to the following reference:
@article{Tellez2018110,
title = "An automated text categorization framework based on hyperparameter optimization",
journal = "Knowledge-Based Systems",
volume = "149",
pages = "110--123",
year = "2018",
issn = "0950-7051",
doi = "10.1016/j.knosys.2018.03.003",
url = "https://www.sciencedirect.com/science/article/pii/S0950705118301217",
author = "Eric S. Tellez and Daniela Moctezuma and Sabino Miranda-Jiménez and Mario Graff",
keywords = "Text classification",
keywords = "Hyperparameter optimization",
keywords = "Text modelling"
}
Installing \(\mu\text{TC}\)¶
\(\mu\text{TC}\) can be easily installed using conda
conda install -c conda-forge microtc
or it can be installed using pip; \(\mu\text{TC}\) depends on numpy, scipy, and scikit-learn.
pip install numpy
pip install scipy
pip install scikit-learn
pip install microtc
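To verify the installation, a quick check (this assumes the package exposes a __version__ attribute, as most Python packages do):
python -c "import microtc; print(microtc.__version__)"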
Text Model¶
This class is \(\mu\text{TC}\)'s main entry point; it receives a corpus, i.e., a list of texts, and builds a text model from it.
- class microtc.textmodel.TextModel(docs=None, text: str = 'text', num_option: str = 'group', usr_option: str = 'group', url_option: str = 'group', emo_option: str = 'group', hashtag_option: str = 'none', ent_option: str = 'none', lc: bool = True, del_dup: bool = True, del_punc: bool = False, del_diac: bool = True, token_list: list = [-1], token_min_filter: Union[int, float] = 0, token_max_filter: Union[int, float] = 1, select_ent: bool = False, select_suff: bool = False, select_conn: bool = False, weighting: str = 'tfidf', q_grams_words: bool = False, max_dimension: bool = False)[source]¶
- Parameters:
docs (list) – Corpus
text (str) – If the corpus elements are dicts, text is the key containing the text
num_option (str) – Transformations on numbers (none | group | delete)
usr_option (str) – Transformations on users (none | group | delete)
url_option (str) – Transformations on urls (none | group | delete)
emo_option (str) – Transformations on emojis and emoticons (none | group | delete)
hashtag_option (str) – Transformations on hashtags (none | group | delete)
ent_option (str) – Transformations on entities (none | group | delete)
lc (bool) – Lower case
del_dup (bool) – Remove consecutive duplicated characters, e.g., hooola -> hola
del_punc (bool) – Remove punctuation symbols
del_diac (bool) – Remove diacritics
token_list (list) – Tokens to compute: a positive integer q stands for character q-grams, a negative integer -n for word n-grams, and a pair (q, skip) for skip-grams
token_min_filter (int or float) – Keep only the tokens that appear more times than this value (used in the weighting class)
token_max_filter (int or float) – Keep only the tokens that appear fewer times than this value (used in the weighting class)
q_grams_words (bool) – Compute q-grams only on words
select_ent (bool) –
select_suff (bool) –
select_conn (bool) –
weighting (class or str) – Weighting scheme (tfidf | tf | entropy)
Usage:
>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
Using default parameters
>>> textmodel = TextModel().fit(corpus)
Represent a text whose words are in the corpus and one whose words are not
>>> vector = textmodel['categorizacion ingoetec']
>>> vector2 = textmodel['cat']
Using a different token_list
>>> textmodel = TextModel(token_list=[[2, 1], -1, 3, 4]).fit(corpus)
>>> vector = textmodel['categorizacion ingoetec']
>>> vector2 = textmodel['cat']
Train a classifier
>>> from sklearn.svm import LinearSVC
>>> y = [1, 0, 0]
>>> textmodel = TextModel().fit(corpus)
>>> m = LinearSVC().fit(textmodel.transform(corpus), y)
>>> m.predict(textmodel.transform(corpus))
array([1, 0, 0])
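The *_option parameters listed above control how special tokens are normalized before tokenization. The following is a minimal sketch, assuming the placeholder convention shown in the text_transformations() example below (user mentions become _usr); the exact output may vary between versions.
from microtc.textmodel import TextModel

# usr_option='group' should map the mention to a placeholder (e.g., _usr),
# while num_option='delete' should remove the number (hedged illustration).
tm = TextModel(usr_option='group', num_option='delete', del_dup=False)
print(tm.text_transformations('@mgraffg scored 42'))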
- compute_q_grams_words(textlist)[source]¶
>>> from microtc import TextModel
>>> tm = TextModel(token_list=[3])
>>> tm.compute_q_grams_words(['abc', 'def'])
['q:~ab', 'q:abc', 'q:bc~', 'q:~de', 'q:def', 'q:ef~']
- compute_tokens(text)[source]¶
Compute tokens from a text using q-grams of characters and words, and skip-grams.
- Parameters:
text (str) – Text transformed by microtc.textmodel.TextModel.text_transformations().
- Return type:
list
Example:
>>> from microtc.textmodel import TextModel
>>> tm = TextModel(token_list=[-2, -1])
>>> tm.compute_tokens("~Good morning~")
[['Good~morning', 'Good', 'morning'], [], []]
>>> tm = TextModel(token_list=[3])
>>> tm.compute_tokens('abc def')
[[], [], ['q:abc', 'q:bc ', 'q:c d', 'q: de', 'q:def']]
>>> tm = TextModel(token_list=[(2, 1)])
>>> tm.compute_tokens('~abc x de~')
[[], ['abc~de'], []]
>>> tm = TextModel(token_list=[3], q_grams_words=True)
>>> tm.compute_tokens('~abc def~')
[[], [], ['q:~ab', 'q:abc', 'q:bc~', 'q:~de', 'q:def', 'q:ef~']]
- property id2token¶
Token identifier to token
>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> textmodel = TextModel().fit(corpus)
>>> _ = textmodel.transform(corpus)
>>> textmodel.id2token[5]
'de'
- property n_grams¶
n-grams of words
>>> from microtc import TextModel
>>> tm = TextModel(token_list=[-1, 3, (2, 1)])
>>> tm.n_grams
[-1]
- property num_terms¶
Dimension of the vector space, i.e., the number of terms in the corpus
>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> textmodel = TextModel().fit(corpus)
>>> _ = textmodel.transform(corpus)
>>> textmodel.num_terms
8
- Return type:
int
- classmethod params()[source]¶
Parameters
>>> from microtc.textmodel import TextModel
>>> TextModel.params()
odict_keys(['docs', 'text', 'num_option', 'usr_option', 'url_option', 'emo_option', 'hashtag_option', 'ent_option', 'lc', 'del_dup', 'del_punc', 'del_diac', 'token_list', 'token_min_filter', 'token_max_filter', 'select_ent', 'select_suff', 'select_conn', 'weighting', 'q_grams_words', 'max_dimension'])
- property q_grams¶
q-grams of characters
>>> from microtc import TextModel
>>> tm = TextModel(token_list=[-1, 3, (2, 1)])
>>> tm.q_grams
[3]
- select_tokens(L)[source]¶
Filter tokens using suffixes or connections (controlled by the select_suff and select_conn parameters)
- Parameters:
L (list) – list of tokens
- Return type:
list
- property skip_grams¶
skip-grams
>>> from microtc import TextModel
>>> tm = TextModel(token_list=[-1, 3, (2, 1)])
>>> tm.skip_grams
[(2, 1)]
- text_transformations(text)[source]¶
Text transformations. It starts by handling emojis, hashtags, entities, lower-casing, numbers, URLs, and users. After these transformations are applied to the text, it calls microtc.textmodel.norm_chars().
- Parameters:
text (str) –
- Return type:
str
Example:
>>> from microtc.textmodel import TextModel
>>> tm = TextModel(del_dup=False)
>>> tm.text_transformations("Life is good at México @mgraffg.")
'~life~is~good~at~mexico~_usr~'
- property token2id¶
Token to token identifier
>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> textmodel = TextModel().fit(corpus)
>>> _ = textmodel.transform(corpus)
>>> textmodel.token2id['de']
5
- property token_weight¶
Weight associated with each token identifier
>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> textmodel = TextModel().fit(corpus)
>>> _ = textmodel.transform(corpus)
>>> textmodel.token_weight[5]
1.584962500721156
- tokenize(text)[source]¶
Transform text into tokens.
- Parameters:
text (str or list) – Text
- Return type:
list
Example:
>>> from microtc.textmodel import TextModel
>>> tm = TextModel()
>>> tm.tokenize("buenos dias")
['buenos', 'dias']
>>> tm.tokenize(["buenos", "dias", "tenga usted"])
['buenos', 'dias', 'tenga', 'usted']
- transform(texts)[source]¶
Convert texts into vectors
- Parameters:
texts (list) – List of text to be transformed
- Return type:
sparse matrix
Example:
>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias catedras', 'catedras conacyt']
>>> textmodel = TextModel().fit(corpus)
>>> X = textmodel.transform(corpus)
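Since transform returns a sparse matrix (as stated in the introduction), it should have one row per text and one column per term in the vocabulary; a quick way to check is to compare its shape against num_terms:
from microtc.textmodel import TextModel

corpus = ['buenos dias catedras', 'catedras conacyt']
textmodel = TextModel().fit(corpus)
X = textmodel.transform(corpus)
# One row per text, one column per term in the vocabulary.
print(X.shape, textmodel.num_terms)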
- microtc.textmodel.norm_chars(text, del_diac=True, del_dup=True, del_punc=False)[source]¶
Transform the text by removing diacritics, duplicated characters, and punctuation. It adds ~ at the beginning and at the end, and replaces spaces with ~.
- Parameters:
text (str) – Text
del_diac (bool) – Delete diacritics
del_dup (bool) – Delete duplicates
del_punc (bool) – Delete punctuation symbols
- Return type:
str
Example:
>>> from microtc.textmodel import norm_chars
>>> norm_chars("Life is good at Méxicoo.")
'~Life~is~god~at~Mexico.~'
- microtc.textmodel.get_word_list(text)[source]¶
Transform a text (beginning and ending with ~) into a list of words. It is called after microtc.textmodel.norm_chars().
Example
>>> from microtc.textmodel import get_word_list
>>> get_word_list("~Someone's house.~")
['Someone', 's', 'house']
- Parameters:
text (str) – text
- Return type:
list
- microtc.textmodel.expand_qgrams(text, qsize, output)[source]¶
Expands a text into a list of q-grams
- Parameters:
text (str) – Text
qsize (int) – q-gram size
output (list) – List where the computed q-grams are appended
- Returns:
output
- Return type:
list
Example:
>>> from microtc.textmodel import expand_qgrams
>>> output = list()
>>> expand_qgrams("Good morning.", 3, output)
['q:Goo', 'q:ood', 'q:od ', 'q:d m', 'q: mo', 'q:mor', 'q:orn', 'q:rni', 'q:nin', 'q:ing', 'q:ng.']
- microtc.textmodel.expand_qgrams_word_list(wlist, qsize, output, sep='~')[source]¶
Expands a list of words into a list of q-grams of words. It uses sep to join the words.
- Parameters:
wlist (list) – List of words computed by microtc.textmodel.get_word_list().
qsize (int) – q-gram size of words
output (list) – List where the computed q-grams are appended
sep (str) – String used to join the words
- Returns:
output
- Return type:
list
Example:
>>> from microtc.textmodel import expand_qgrams_word_list
>>> wlist = ["Good", "morning", "Mexico"]
>>> expand_qgrams_word_list(wlist, 2, list())
['Good~morning', 'morning~Mexico']
- microtc.textmodel.expand_skipgrams_word_list(wlist, qsize, output, sep='~')[source]¶
Expands a list of words into a list of skip-grams. It uses sep to join the words.
- Parameters:
wlist (list) – List of words computed by microtc.textmodel.get_word_list().
qsize (tuple) – (qsize, skip), where qsize is the q-gram size and skip is the number of words ahead.
output (list) – List where the computed skip-grams are appended
sep (str) – String used to join the words
- Returns:
output
- Return type:
list
Example:
>>> from microtc.textmodel import expand_skipgrams_word_list
>>> wlist = ["Good", "morning", "Mexico"]
>>> expand_skipgrams_word_list(wlist, (2, 1), list())
['Good~Mexico']