\(\mu\text{TC}\)


A great variety of text tasks such as topic or spam identification, user profiling, and sentiment analysis can be posed as supervised learning problems and tackled using a text classifier. A text classifier consists of several subprocesses; some of them are general enough to be applied to any supervised learning problem, whereas others are specifically designed to tackle a particular task using complex and computationally expensive processes such as lemmatization, syntactic analysis, etc. Contrary to traditional approaches, \(\mu\text{TC}\) is a minimalist and multi-purpose text classifier able to tackle tasks independently of domain and language.

\(\mu\text{TC}\)'s core entry point is microtc.textmodel.TextModel, which can be seen as a function of the form \(m(\text{text}) \rightarrow \Re ^d\), where \(d\) is the size of the vocabulary, i.e., the dimension of the vector space. As can be seen, \(m\) can be used to transform a text into a vector and, consequently, to transform a training set of pairs (text, label) into a training set of pairs (vector, label), which can be directly used by any supervised learning algorithm to obtain a text classifier.
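As an illustration only, the mapping \(m\) can be sketched with a hypothetical minimal bag-of-words model (this is not \(\mu\text{TC}\)'s actual implementation, which uses richer tokenizers and weighting schemes):

```python
def fit_vocabulary(corpus):
    # Assign an index in the vector space to every distinct token.
    vocab = {}
    for text in corpus:
        for tok in text.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def m(text, vocab):
    # Map a text to a vector of token counts in R^d, with d = len(vocab).
    vec = [0.0] * len(vocab)
    for tok in text.split():
        if tok in vocab:  # out-of-vocabulary tokens are ignored
            vec[vocab[tok]] += 1.0
    return vec
```

Once every text is a fixed-length vector, any supervised learning algorithm can be trained on the (vector, label) pairs.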

microtc.textmodel.TextModel follows the idea of http://scikit-learn.org transformers. That is, it implements a method microtc.textmodel.TextModel.fit() that receives the training set and a method microtc.textmodel.TextModel.transform() that receives a list of texts and returns a sparse matrix that corresponds to the representation of the given texts in the vector space.

\(\mu\text{TC}\) is described in "An Automated Text Categorization Framework based on Hyperparameter Optimization" by Eric S. Tellez, Daniela Moctezuma, Sabino Miranda-Jiménez, and Mario Graff. Knowledge-Based Systems, Volume 149, 1 June 2018, Pages 110-123.

Quickstart Guide

We have decided to make a live quickstart guide; it covers the installation and the use of \(\mu\text{TC}\) with different parameters. The notebook can be found in the docs directory on GitHub.

Citing

If you find \(\mu\text{TC}\) useful for any academic/scientific purpose, we would appreciate citations to the following reference:

@article{Tellez2018110,
title = "An automated text categorization framework based on hyperparameter optimization",
journal = "Knowledge-Based Systems",
volume = "149",
pages = "110--123",
year = "2018",
issn = "0950-7051",
doi = "10.1016/j.knosys.2018.03.003",
url = "https://www.sciencedirect.com/science/article/pii/S0950705118301217",
author = "Eric S. Tellez and Daniela Moctezuma and Sabino Miranda-Jiménez and Mario Graff",
keywords = "Text classification",
keywords = "Hyperparameter optimization",
keywords = "Text modelling"
}

Installing \(\mu\text{TC}\)

\(\mu\text{TC}\) can be easily installed using conda:

conda install -c conda-forge microtc

or it can be installed using pip; \(\mu\text{TC}\) depends on numpy, scipy, and scikit-learn.

pip install numpy
pip install scipy
pip install scikit-learn
pip install microtc

Text Model

This class is \(\mu\text{TC}\)'s main entry point; it receives a corpus, i.e., a list of texts, and builds a text model from it.

class microtc.textmodel.TextModel(docs=None, text: str = 'text', num_option: str = 'group', usr_option: str = 'group', url_option: str = 'group', emo_option: str = 'group', hashtag_option: str = 'none', ent_option: str = 'none', lc: bool = True, del_dup: bool = True, del_punc: bool = False, del_diac: bool = True, token_list: list = [-1], token_min_filter: Union[int, float] = 0, token_max_filter: Union[int, float] = 1, select_ent: bool = False, select_suff: bool = False, select_conn: bool = False, weighting: str = 'tfidf', q_grams_words: bool = False, max_dimension: bool = False)[source]
Parameters:
  • docs (list) – Corpus

  • text (str) – If the corpus elements are dicts, text is the key containing the text

  • num_option (str) – Transformations on numbers (none | group | delete)

  • usr_option (str) – Transformations on users (none | group | delete)

  • url_option (str) – Transformations on urls (none | group | delete)

  • emo_option (str) – Transformations on emojis and emoticons (none | group | delete)

  • hashtag_option (str) – Transformations on hashtag (none | group | delete)

  • ent_option (str) – Transformations on entities (none | group | delete)

  • lc (bool) – Lower case

  • del_dup (bool) – Remove duplicated characters, e.g., hooola -> hola

  • del_punc (bool) – Remove punctuation symbols

  • del_diac (bool) – Remove diacritics

  • token_list (list) – Tokenizers: a positive integer n produces character q-grams of size n, a negative integer -n produces word n-grams, and a pair (n, s) produces skip-grams

  • token_min_filter (int or float) – Keep those tokens that appear more times than the parameter (used in weighting class)

  • token_max_filter (int or float) – Keep those tokens that appear less times than the parameter (used in weighting class)

  • q_grams_words (bool) – Compute q-grams only on words

  • select_ent (bool) –

  • select_suff (bool) –

  • select_conn (bool) –

  • weighting (class or str) – Weighting scheme (tfidf | tf | entropy)

Usage:

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']

Using default parameters

>>> textmodel = TextModel().fit(corpus)

Represent a text whose words are in the corpus and one whose words are not

>>> vector = textmodel['categorizacion ingoetec']
>>> vector2 = textmodel['cat']

Using a different token_list

>>> textmodel = TextModel(token_list=[[2, 1], -1, 3, 4]).fit(corpus)
>>> vector = textmodel['categorizacion ingoetec']
>>> vector2 = textmodel['cat']

Train a classifier

>>> from sklearn.svm import LinearSVC
>>> y = [1, 0, 0]
>>> textmodel = TextModel().fit(corpus)
>>> m = LinearSVC().fit(textmodel.transform(corpus), y)
>>> m.predict(textmodel.transform(corpus))
array([1, 0, 0])
compute_q_grams_words(textlist)[source]
>>> from microtc import TextModel
>>> tm = TextModel(token_list=[3])
>>> tm.compute_q_grams_words(['abc', 'def'])
['q:~ab', 'q:abc', 'q:bc~', 'q:~de', 'q:def', 'q:ef~']
compute_tokens(text)[source]

Compute tokens from a text using q-grams of characters and words, and skip-grams.

Parameters:

text (str) – Text transformed by microtc.textmodel.TextModel.text_transformations().

Return type:

list

Example:

>>> from microtc.textmodel import TextModel
>>> tm = TextModel(token_list=[-2, -1])
>>> tm.compute_tokens("~Good morning~")
[['Good~morning', 'Good', 'morning'], [], []]
>>> tm = TextModel(token_list=[3])
>>> tm.compute_tokens('abc def')
[[], [], ['q:abc', 'q:bc ', 'q:c d', 'q: de', 'q:def']]
>>> tm = TextModel(token_list=[(2, 1)])
>>> tm.compute_tokens('~abc x de~')
[[], ['abc~de'], []]
>>> tm = TextModel(token_list=[3], q_grams_words=True)
>>> tm.compute_tokens('~abc def~')
[[], [], ['q:~ab', 'q:abc', 'q:bc~', 'q:~de', 'q:def', 'q:ef~']]
fit(X)[source]

Train the model

Parameters:

X (list) – Corpus

Return type:

instance

get_text(text)[source]

Return self._text key from text

Parameters:

text (dict) – Text

property id2token

Token identifier to token

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> textmodel = TextModel().fit(corpus)
>>> _ = textmodel.transform(corpus)
>>> textmodel.id2token[5]
'de'
property n_grams

n-grams of words

>>> from microtc import TextModel
>>> tm = TextModel(token_list=[-1, 3, (2, 1)])
>>> tm.n_grams
[-1]

property num_terms

Dimension of the vector space, i.e., the number of terms in the corpus

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> textmodel = TextModel().fit(corpus)
>>> _ = textmodel.transform(corpus)
>>> textmodel.num_terms
8
Return type:

int

classmethod params()[source]

Parameters

>>> from microtc.textmodel import TextModel
>>> TextModel.params()
odict_keys(['docs', 'text', 'num_option', 'usr_option', 'url_option', 'emo_option', 'hashtag_option', 'ent_option', 'lc', 'del_dup', 'del_punc', 'del_diac', 'token_list', 'token_min_filter', 'token_max_filter', 'select_ent', 'select_suff', 'select_conn', 'weighting', 'q_grams_words', 'max_dimension'])
property q_grams

q-grams of characters

>>> from microtc import TextModel
>>> tm = TextModel(token_list=[-1, 3, (2, 1)])
>>> tm.q_grams
[3]

select_tokens(L)[source]

Filter tokens using suffix or connections

Parameters:

L (list) – list of tokens

Return type:

list

property skip_grams

skip-grams

>>> from microtc import TextModel
>>> tm = TextModel(token_list=[-1, 3, (2, 1)])
>>> tm.skip_grams
[(2, 1)]

text_transformations(text)[source]

Text transformations. It starts by analyzing emojis, hashtags, entities, lower case, numbers, URL, and users. After these transformations are applied to the text, it calls microtc.textmodel.norm_chars().

Parameters:

text (str) –

Return type:

str

Example:

>>> from microtc.textmodel import TextModel
>>> tm = TextModel(del_dup=False)
>>> tm.text_transformations("Life is good at México @mgraffg.")
'~life~is~good~at~mexico~_usr~'
property token2id

Token to token identifier

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> textmodel = TextModel().fit(corpus)
>>> _ = textmodel.transform(corpus)
>>> textmodel.token2id['de']
5
property token_weight

Weight associated to each token id

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> textmodel = TextModel().fit(corpus)
>>> _ = textmodel.transform(corpus)
>>> textmodel.token_weight[5]
1.584962500721156
tokenize(text)[source]

Transform text to tokens.

Parameters:

text (str or list) – Text

Return type:

list

Example:

>>> from microtc.textmodel import TextModel
>>> tm = TextModel()
>>> tm.tokenize("buenos dias")
['buenos', 'dias']
>>> tm.tokenize(["buenos", "dias", "tenga usted"])
['buenos', 'dias', 'tenga', 'usted']
transform(texts)[source]

Convert texts into vectors

Parameters:

texts (list) – List of texts to be transformed

Return type:

list

Example:

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias catedras', 'catedras conacyt']
>>> textmodel = TextModel().fit(corpus)
>>> X = textmodel.transform(corpus)
microtc.textmodel.norm_chars(text, del_diac=True, del_dup=True, del_punc=False)[source]

Transform text by removing diacritics, duplicated characters, and punctuation. It adds ~ at the beginning and the end, and spaces are replaced by ~.

Parameters:
  • text (str) – Text

  • del_diac (bool) – Delete diacritics

  • del_dup (bool) – Delete duplicates

  • del_punc (bool) – Delete punctuation symbols

Return type:

str

Example:

>>> from microtc.textmodel import norm_chars
>>> norm_chars("Life is good at Méxicoo.")
'~Life~is~god~at~Mexico.~'
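The behavior above can be approximated with the standard library. The following sketch assumes diacritics are removed via Unicode decomposition and duplicated characters are collapsed with a backreference; it is an illustration, not microtc's actual code:

```python
import re
import unicodedata

def norm_chars_sketch(text, del_diac=True, del_dup=True, del_punc=False):
    if del_diac:
        # Decompose characters and drop combining marks (category 'Mn').
        text = ''.join(c for c in unicodedata.normalize('NFD', text)
                       if unicodedata.category(c) != 'Mn')
    if del_dup:
        # Collapse runs of the same character, e.g., 'hooola' -> 'hola'.
        text = re.sub(r'(.)\1+', r'\1', text)
    if del_punc:
        # Drop anything that is not a word character or whitespace.
        text = re.sub(r'[^\w\s]', '', text)
    # Mark the boundaries and the spaces with ~.
    return '~' + text.replace(' ', '~') + '~'
```

With the defaults, this reproduces the documented output on the example above.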
microtc.textmodel.get_word_list(text)[source]

Transform a text (beginning and ending with ~) into a list of words. It is called after microtc.textmodel.norm_chars().

Example

>>> from microtc.textmodel import get_word_list
>>> get_word_list("~Someone's house.~")
['Someone', 's', 'house']
Parameters:

text (str) – text

Return type:

list
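A regex-based approximation of this splitting step (a sketch, not microtc's actual implementation) is:

```python
import re

def get_word_list_sketch(text):
    # Take maximal runs of word characters; '~' and punctuation are not
    # word characters, so the boundary markers are dropped automatically.
    return re.findall(r'\w+', text)
```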

microtc.textmodel.expand_qgrams(text, qsize, output)[source]

Expands a text into a list of q-grams

Parameters:
  • text (str) – Text

  • qsize (int) – q-gram size

  • output (list) – output

Returns:

output

Return type:

list

Example:

>>> from microtc.textmodel import expand_qgrams
>>> output = list()
>>> expand_qgrams("Good morning.", 3, output)
['q:Goo', 'q:ood', 'q:od ', 'q:d m', 'q: mo', 'q:mor', 'q:orn', 'q:rni', 'q:nin', 'q:ing', 'q:ng.']
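Character q-gram expansion is a sliding window of fixed size over the text. The following sketch (with the 'q:' prefix taken from the documented output; not microtc's actual code) reproduces the example above:

```python
def expand_qgrams_sketch(text, qsize, output):
    # Slide a window of qsize characters over the text, one position
    # at a time, and prefix each gram with 'q:'.
    for i in range(len(text) - qsize + 1):
        output.append('q:' + text[i:i + qsize])
    return output
```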
microtc.textmodel.expand_qgrams_word_list(wlist, qsize, output, sep='~')[source]

Expands a list of words into a list of q-grams. It uses sep to join words

Parameters:
  • wlist (list) – List of words computed by microtc.textmodel.get_word_list().

  • qsize (int) – q-gram size of words

  • output (list) – output

  • sep (str) – String used to join the words

Returns:

output

Return type:

list

Example:

>>> from microtc.textmodel import expand_qgrams_word_list
>>> wlist = ["Good", "morning", "Mexico"]
>>> expand_qgrams_word_list(wlist, 2, list())
['Good~morning', 'morning~Mexico']
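Word q-grams are consecutive windows of qsize words joined with sep. A sketch of this expansion (illustrative, not microtc's actual code):

```python
def expand_qgrams_word_list_sketch(wlist, qsize, output, sep='~'):
    # Take every run of qsize consecutive words and join them with sep.
    for i in range(len(wlist) - qsize + 1):
        output.append(sep.join(wlist[i:i + qsize]))
    return output
```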
microtc.textmodel.expand_skipgrams_word_list(wlist, qsize, output, sep='~')[source]

Expands a list of words into a list of skipgrams. It uses sep to join words

Parameters:
  • wlist (list) – List of words computed by microtc.textmodel.get_word_list().

  • qsize (tuple) – (qsize, skip) qsize is the q-gram size and skip is the number of words ahead.

  • output (list) – output

  • sep (str) – String used to join the words

Returns:

output

Return type:

list

Example:

>>> from microtc.textmodel import expand_skipgrams_word_list
>>> wlist = ["Good", "morning", "Mexico"]
>>> expand_skipgrams_word_list(wlist, (2, 1), list())
['Good~Mexico']
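Skip-grams generalize word q-grams by leaving a fixed gap between the selected words; with skip=0 they reduce to ordinary word q-grams. A sketch consistent with the example above (not microtc's actual code):

```python
def expand_skipgrams_sketch(wlist, qsize, output, sep='~'):
    qs, skip = qsize
    # A gram covers qs words, stepping skip+1 positions between them,
    # so it spans qs + (qs - 1) * skip consecutive words.
    span = qs + (qs - 1) * skip
    for start in range(len(wlist) - span + 1):
        output.append(sep.join(wlist[start + i * (skip + 1)]
                               for i in range(qs)))
    return output
```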

Modules