\(\mu\text{TC}\)


A great variety of text tasks such as topic or spam identification, user profiling, and sentiment analysis can be posed as supervised learning problems and tackled using a text classifier. A text classifier consists of several subprocesses; some of them are general enough to be applied to any supervised learning problem, whereas others are specifically designed to tackle a particular task using complex and computationally expensive processes such as lemmatization, syntactic analysis, etc. Contrary to traditional approaches, \(\mu\text{TC}\) is a minimalist and multi-purpose text classifier able to tackle tasks independently of domain and language.

\(\mu\text{TC}\)'s core entry point is microtc.textmodel.TextModel, which can be seen as a function of the form \(m(\text{text}) \rightarrow \Re ^d\), where \(d\) is the size of the vocabulary, i.e., the dimension of the vector space. As can be seen, \(m\) can be used to transform a text into a vector and, consequently, to transform a training set of pairs (text, label) into a training set of pairs (vector, label), which can be directly used by any supervised learning algorithm to obtain a text classifier.
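As an illustration only, the mapping \(m\) can be sketched with a hypothetical minimal bag-of-words model (this is not \(\mu\text{TC}\)'s actual implementation, which uses richer tokenizers and weighting schemes):

```python
def fit_vocabulary(corpus):
    # Assign an index in the vector space to every distinct token.
    vocab = {}
    for text in corpus:
        for tok in text.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def m(text, vocab):
    # Map a text to a vector of token counts in R^d, with d = len(vocab).
    vec = [0.0] * len(vocab)
    for tok in text.split():
        if tok in vocab:  # out-of-vocabulary tokens are ignored
            vec[vocab[tok]] += 1.0
    return vec
```

Once every text is a fixed-length vector, any supervised learning algorithm can be trained on the (vector, label) pairs.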

microtc.textmodel.TextModel follows the idea of http://scikit-learn.org transformers. That is, it implements a method microtc.textmodel.TextModel.fit() that receives the training set and a method microtc.textmodel.TextModel.transform() that receives a list of texts and returns a sparse matrix that corresponds to the representation of the given texts in the vector space.

\(\mu\text{TC}\) is described in "An Automated Text Categorization Framework based on Hyperparameter Optimization" by Eric S. Tellez, Daniela Moctezuma, Sabino Miranda-Jiménez, and Mario Graff. Knowledge-Based Systems, Volume 149, 1 June 2018, Pages 110-123.

Quickstart Guide

We have decided to make a live quickstart guide; it covers the installation and the use of \(\mu\text{TC}\) with different parameters. The notebook can be found in the docs directory on GitHub.

Citing

If you find \(\mu\text{TC}\) useful for any academic/scientific purpose, we would appreciate citations to the following reference:

@article{Tellez2018110,
title = "An automated text categorization framework based on hyperparameter optimization",
journal = "Knowledge-Based Systems",
volume = "149",
pages = "110--123",
year = "2018",
issn = "0950-7051",
doi = "10.1016/j.knosys.2018.03.003",
url = "https://www.sciencedirect.com/science/article/pii/S0950705118301217",
author = "Eric S. Tellez and Daniela Moctezuma and Sabino Miranda-Jiménez and Mario Graff",
keywords = "Text classification",
keywords = "Hyperparameter optimization",
keywords = "Text modelling"
}

Installing \(\mu\text{TC}\)

\(\mu\text{TC}\) can be easily installed using conda:

conda install -c conda-forge microtc

or it can be installed using pip; \(\mu\text{TC}\) depends on numpy, scipy, and scikit-learn.

pip install numpy
pip install scipy
pip install scikit-learn
pip install microtc

Text Model

This class is \(\mu\text{TC}\)'s main entry point; it receives a corpus, i.e., a list of texts, and builds a text model from it.

class microtc.textmodel.TextModel(docs=None, text: str = 'text', num_option: str = 'group', usr_option: str = 'group', url_option: str = 'group', emo_option: str = 'group', hashtag_option: str = 'none', ent_option: str = 'none', lc: bool = True, del_dup: bool = True, del_punc: bool = False, del_diac: bool = True, token_list: list = [-1], token_min_filter: Union[int, float] = 0, token_max_filter: Union[int, float] = 1, select_ent: bool = False, select_suff: bool = False, select_conn: bool = False, weighting: str = 'tfidf', q_grams_words: bool = False, max_dimension: bool = False)[source]
Parameters:
  • docs (list) – Corpus

  • text (str) – If the corpus elements are dicts, text is the key containing the text

  • num_option (str) – Transformations on numbers (none | group | delete)

  • usr_option (str) – Transformations on users (none | group | delete)

  • url_option (str) – Transformations on urls (none | group | delete)

  • emo_option (str) – Transformations on emojis and emoticons (none | group | delete)

  • hashtag_option (str) – Transformations on hashtag (none | group | delete)

  • ent_option (str) – Transformations on entities (none | group | delete)

  • lc (bool) – Lower case

  • del_dup (bool) – Remove duplicated characters, e.g., hooola -> hola

  • del_punc (bool) – Remove punctuation symbols

  • del_diac (bool) – Remove diacritics

  • token_list (list) – Tokenizers: a positive integer n produces character q-grams of size n, a negative integer -n produces word n-grams, and a pair (n, s) produces skip-grams

  • token_min_filter (int or float) – Keep those tokens that appear more times than the parameter (used in weighting class)

  • token_max_filter (int or float) – Keep those tokens that appear less times than the parameter (used in weighting class)

  • q_grams_words (bool) – Compute q-grams only on words

  • select_ent (bool) –

  • select_suff (bool) –

  • select_conn (bool) –

  • weighting (class or str) – Weighting scheme (tfidf | tf | entropy)

Usage:

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']

Using default parameters

>>> textmodel = TextModel().fit(corpus)

Represent a text whose words are in the corpus and one whose words are not

>>> vector = textmodel['categorizacion ingoetec']
>>> vector2 = textmodel['cat']

Using a different token_list

>>> textmodel = TextModel(token_list=[[2, 1], -1, 3, 4]).fit(corpus)
>>> vector = textmodel['categorizacion ingoetec']
>>> vector2 = textmodel['cat']

Train a classifier

>>> from sklearn.svm import LinearSVC
>>> y = [1, 0, 0]
>>> textmodel = TextModel().fit(corpus)
>>> m = LinearSVC().fit(textmodel.transform(corpus), y)
>>> m.predict(textmodel.transform(corpus))
array([1, 0, 0])
compute_q_grams_words(textlist)[source]
>>> from microtc import TextModel
>>> tm = TextModel(token_list=[3])
>>> tm.compute_q_grams_words(['abc', 'def'])
['q:~ab', 'q:abc', 'q:bc~', 'q:~de', 'q:def', 'q:ef~']
compute_tokens(text)[source]

Compute tokens from a text using q-grams of characters and words, and skip-grams.

Parameters:

text (str) – Text transformed by microtc.textmodel.TextModel.text_transformations().

Return type:

list

Example:

>>> from microtc.textmodel import TextModel
>>> tm = TextModel(token_list=[-2, -1])
>>> tm.compute_tokens("~Good morning~")
[['Good~morning', 'Good', 'morning'], [], []]
>>> tm = TextModel(token_list=[3])
>>> tm.compute_tokens('abc def')
[[], [], ['q:abc', 'q:bc ', 'q:c d', 'q: de', 'q:def']]
>>> tm = TextModel(token_list=[(2, 1)])
>>> tm.compute_tokens('~abc x de~')
[[], ['abc~de'], []]
>>> tm = TextModel(token_list=[3], q_grams_words=True)
>>> tm.compute_tokens('~abc def~')
[[], [], ['q:~ab', 'q:abc', 'q:bc~', 'q:~de', 'q:def', 'q:ef~']]
fit(X)[source]

Train the model

Parameters:

X (list) – Corpus

Return type:

instance

get_text(text)[source]

Return self._text key from text

Parameters:

text (dict) – Text

property id2token

Token identifier to token

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> textmodel = TextModel().fit(corpus)
>>> _ = textmodel.transform(corpus)
>>> textmodel.id2token[5]
'de'
property n_grams

n-grams of words

>>> from microtc import TextModel
>>> tm = TextModel(token_list=[-1, 3, (2, 1)])
>>> tm.n_grams
[-1]

property num_terms

Dimension of the vector space, i.e., the number of terms in the corpus

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> textmodel = TextModel().fit(corpus)
>>> _ = textmodel.transform(corpus)
>>> textmodel.num_terms
8
Return type:

int

classmethod params()[source]

Parameters

>>> from microtc.textmodel import TextModel
>>> TextModel.params()
odict_keys(['docs', 'text', 'num_option', 'usr_option', 'url_option', 'emo_option', 'hashtag_option', 'ent_option', 'lc', 'del_dup', 'del_punc', 'del_diac', 'token_list', 'token_min_filter', 'token_max_filter', 'select_ent', 'select_suff', 'select_conn', 'weighting', 'q_grams_words', 'max_dimension'])
property q_grams

q-grams of characters

>>> from microtc import TextModel
>>> tm = TextModel(token_list=[-1, 3, (2, 1)])
>>> tm.q_grams
[3]

select_tokens(L)[source]

Filter tokens using suffix or connections

Parameters:

L (list) – list of tokens

Return type:

list

property skip_grams

skip-grams

>>> from microtc import TextModel
>>> tm = TextModel(token_list=[-1, 3, (2, 1)])
>>> tm.skip_grams
[(2, 1)]

text_transformations(text)[source]

Text transformations. It starts by analyzing emojis, hashtags, entities, lower case, numbers, URL, and users. After these transformations are applied to the text, it calls microtc.textmodel.norm_chars().

Parameters:

text (str) –

Return type:

str

Example:

>>> from microtc.textmodel import TextModel
>>> tm = TextModel(del_dup=False)
>>> tm.text_transformations("Life is good at México @mgraffg.")
'~life~is~good~at~mexico~_usr~'
property token2id

Token to token identifier

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> textmodel = TextModel().fit(corpus)
>>> _ = textmodel.transform(corpus)
>>> textmodel.token2id['de']
5
property token_weight

Weight associated to each token id

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> textmodel = TextModel().fit(corpus)
>>> _ = textmodel.transform(corpus)
>>> textmodel.token_weight[5]
1.584962500721156
tokenize(text)[source]

Transform text to tokens.

Parameters:

text (str or list) – Text

Return type:

list

Example:

>>> from microtc.textmodel import TextModel
>>> tm = TextModel()
>>> tm.tokenize("buenos dias")
['buenos', 'dias']
>>> tm.tokenize(["buenos", "dias", "tenga usted"])
['buenos', 'dias', 'tenga', 'usted']
transform(texts)[source]

Convert texts into vectors

Parameters:

texts (list) – List of texts to be transformed

Return type:

list

Example:

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias catedras', 'catedras conacyt']
>>> textmodel = TextModel().fit(corpus)
>>> X = textmodel.transform(corpus)
microtc.textmodel.norm_chars(text, del_diac=True, del_dup=True, del_punc=False)[source]

Transform text by removing diacritics, duplicated characters, and punctuation. It adds ~ at the beginning and the end, and spaces are replaced by ~.

Parameters:
  • text (str) – Text

  • del_diac (bool) – Delete diacritics

  • del_dup (bool) – Delete duplicates

  • del_punc (bool) – Delete punctuation symbols

Return type:

str

Example:

>>> from microtc.textmodel import norm_chars
>>> norm_chars("Life is good at Méxicoo.")
'~Life~is~god~at~Mexico.~'
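The behavior above can be approximated with the standard library. The following sketch assumes diacritics are removed via Unicode decomposition and duplicated characters are collapsed with a backreference; it is an illustration, not microtc's actual code:

```python
import re
import unicodedata

def norm_chars_sketch(text, del_diac=True, del_dup=True, del_punc=False):
    if del_diac:
        # Decompose characters and drop combining marks (category 'Mn').
        text = ''.join(c for c in unicodedata.normalize('NFD', text)
                       if unicodedata.category(c) != 'Mn')
    if del_dup:
        # Collapse runs of the same character, e.g., 'hooola' -> 'hola'.
        text = re.sub(r'(.)\1+', r'\1', text)
    if del_punc:
        # Drop anything that is not a word character or whitespace.
        text = re.sub(r'[^\w\s]', '', text)
    # Mark the boundaries and the spaces with ~.
    return '~' + text.replace(' ', '~') + '~'
```

With the defaults, this reproduces the documented output on the example above.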
microtc.textmodel.get_word_list(text)[source]

Transform a text (beginning and ending with ~) into a list of words. It is called after microtc.textmodel.norm_chars().

Example

>>> from microtc.textmodel import get_word_list
>>> get_word_list("~Someone's house.~")
['Someone', 's', 'house']
Parameters:

text (str) – text

Return type:

list
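A regex-based approximation of this splitting step (a sketch, not microtc's actual implementation) is:

```python
import re

def get_word_list_sketch(text):
    # Take maximal runs of word characters; '~' and punctuation are not
    # word characters, so the boundary markers are dropped automatically.
    return re.findall(r'\w+', text)
```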

microtc.textmodel.expand_qgrams(text, qsize, output)[source]

Expands a text into a list of q-grams

Parameters:
  • text (str) – Text

  • qsize (int) – q-gram size

  • output (list) – output

Returns:

output

Return type:

list

Example:

>>> from microtc.textmodel import expand_qgrams
>>> output = list()
>>> expand_qgrams("Good morning.", 3, output)
['q:Goo', 'q:ood', 'q:od ', 'q:d m', 'q: mo', 'q:mor', 'q:orn', 'q:rni', 'q:nin', 'q:ing', 'q:ng.']
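Character q-gram expansion is a sliding window of fixed size over the text. The following sketch (with the 'q:' prefix taken from the documented output; not microtc's actual code) reproduces the example above:

```python
def expand_qgrams_sketch(text, qsize, output):
    # Slide a window of qsize characters over the text, one position
    # at a time, and prefix each gram with 'q:'.
    for i in range(len(text) - qsize + 1):
        output.append('q:' + text[i:i + qsize])
    return output
```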
microtc.textmodel.expand_qgrams_word_list(wlist, qsize, output, sep='~')[source]

Expands a list of words into a list of q-grams. It uses sep to join words

Parameters:
  • wlist (list) – List of words computed by microtc.textmodel.get_word_list().

  • qsize (int) – q-gram size of words

  • output (list) – output

  • sep (str) – String used to join the words

Returns:

output

Return type:

list

Example:

>>> from microtc.textmodel import expand_qgrams_word_list
>>> wlist = ["Good", "morning", "Mexico"]
>>> expand_qgrams_word_list(wlist, 2, list())
['Good~morning', 'morning~Mexico']
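Word q-grams are consecutive windows of qsize words joined with sep. A sketch of this expansion (illustrative, not microtc's actual code):

```python
def expand_qgrams_word_list_sketch(wlist, qsize, output, sep='~'):
    # Take every run of qsize consecutive words and join them with sep.
    for i in range(len(wlist) - qsize + 1):
        output.append(sep.join(wlist[i:i + qsize]))
    return output
```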
microtc.textmodel.expand_skipgrams_word_list(wlist, qsize, output, sep='~')[source]

Expands a list of words into a list of skipgrams. It uses sep to join words

Parameters:
  • wlist (list) – List of words computed by microtc.textmodel.get_word_list().

  • qsize (tuple) – (qsize, skip) qsize is the q-gram size and skip is the number of words ahead.

  • output (list) – output

  • sep (str) – String used to join the words

Returns:

output

Return type:

list

Example:

>>> from microtc.textmodel import expand_skipgrams_word_list
>>> wlist = ["Good", "morning", "Mexico"]
>>> expand_skipgrams_word_list(wlist, (2, 1), list())
['Good~Mexico']
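Skip-grams generalize word q-grams by leaving a fixed gap between the selected words; with skip=0 they reduce to ordinary word q-grams. A sketch consistent with the example above (not microtc's actual code):

```python
def expand_skipgrams_sketch(wlist, qsize, output, sep='~'):
    qs, skip = qsize
    # A gram covers qs words, stepping skip+1 positions between them,
    # so it spans qs + (qs - 1) * skip consecutive words.
    span = qs + (qs - 1) * skip
    for start in range(len(wlist) - span + 1):
        output.append(sep.join(wlist[start + i * (skip + 1)]
                               for i in range(qs)))
    return output
```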

Modules