\(\mu\text{TC}\)

A great variety of text tasks, such as topic or spam identification, user profiling, and sentiment analysis, can be posed as supervised learning problems and tackled with a text classifier. A text classifier consists of several subprocesses; some of them are general enough to be applied to any supervised learning problem, whereas others are specifically designed to tackle a particular task using complex and computationally expensive processes such as lemmatization, syntactic analysis, etc. Contrary to traditional approaches, \(\mu\text{TC}\) is a minimalist and multi-purpose text classifier able to tackle tasks independently of domain and language.

\(\mu\text{TC}\)'s core entry point is microtc.textmodel.TextModel, which can be seen as a function of the form \(m(\text{text}) \rightarrow \mathbb{R}^d\), where \(d\) is the size of the vocabulary, i.e., the dimension of the vector space. That is, \(m\) transforms a text into a vector; consequently, it can transform a training set of pairs (text, label) into a training set of pairs (vector, label), which can be used directly by any supervised learning algorithm to obtain a text classifier.

microtc.textmodel.TextModel follows the idea of http://scikit-learn.org transformers. That is, it implements a method microtc.textmodel.TextModel.fit() that receives the training set, and a method microtc.textmodel.TextModel.transform() that receives a list of texts and returns a sparse matrix corresponding to the representation of the given texts in the vector space.
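
For instance, a minimal sketch of this workflow (full examples appear in the Text Model section below):

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> tm = TextModel().fit(corpus)
>>> X = tm.transform(corpus)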

\(\mu\text{TC}\) is described in An Automated Text Categorization Framework based on Hyperparameter Optimization. Eric S. Tellez, Daniela Moctezuma, Sabino Miranda-Jiménez, Mario Graff. Knowledge-Based Systems, Volume 149, 1 June 2018, Pages 110-123.

Citing

If you find \(\mu\text{TC}\) useful for any academic/scientific purpose, we would appreciate citations to the following reference:

@article{Tellez2018110,
title = "An automated text categorization framework based on hyperparameter optimization",
journal = "Knowledge-Based Systems",
volume = "149",
pages = "110--123",
year = "2018",
issn = "0950-7051",
doi = "10.1016/j.knosys.2018.03.003",
url = "https://www.sciencedirect.com/science/article/pii/S0950705118301217",
author = "Eric S. Tellez and Daniela Moctezuma and Sabino Miranda-Jiménez and Mario Graff",
keywords = "Text classification",
keywords = "Hyperparameter optimization",
keywords = "Text modelling"
}

Installing \(\mu\text{TC}\)

\(\mu\text{TC}\) can be easily installed using anaconda

conda install -c ingeotec microtc

or it can be installed using pip; note that it depends on numpy, scipy, and scikit-learn.

pip install numpy
pip install scipy
pip install scikit-learn
pip install microtc
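
To verify the installation, a quick check is that the package imports without errors:

python -c "import microtc"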

Text Model

This class is \(\mu\text{TC}\)'s main entry point; it receives a corpus, i.e., a list of texts, and builds a text model from it.

class microtc.textmodel.TextModel(docs=None, text='text', num_option='group', usr_option='group', url_option='group', emo_option='group', hashtag_option='none', ent_option='none', lc=True, del_dup=True, del_punc=False, del_diac=True, token_list=[-1], token_min_filter=0, token_max_filter=1, select_ent=False, select_suff=False, select_conn=False, weighting='tfidf')[source]
Parameters
  • docs (list) – Corpus

  • text (str) – When the corpus entries are dicts, text is the key containing the text

  • num_option (str) – Transformations on numbers (none | group | delete); see the sketch after this parameter list

  • usr_option (str) – Transformations on users (none | group | delete)

  • url_option (str) – Transformations on urls (none | group | delete)

  • emo_option (str) – Transformations on emojis and emoticons (none | group | delete)

  • hashtag_option (str) – Transformations on hashtags (none | group | delete)

  • ent_option (str) – Transformations on entities (none | group | delete)

  • lc (bool) – Lower case

  • del_dup (bool) – Remove duplicates e.g. hooola -> hola

  • del_punc (bool) – Remove punctuation symbols

  • del_diac (bool) – Remove diacritics

  • token_list (list) – Tokens to use; a positive integer q denotes character q-grams, a negative integer -n denotes word n-grams, and a pair [qsize, skip] denotes skip-grams

  • token_min_filter (int or float) – Keep the tokens that appear more times than this value (used by the weighting class)

  • token_max_filter (int or float) – Keep the tokens that appear fewer times than this value (used by the weighting class)

  • select_ent (bool) –

  • select_suff (bool) –

  • select_conn (bool) –

  • weighting (class or str) – Weighting scheme (tfidf | tf | entropy)
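
For instance, the preprocessing options can be combined as follows; this is a minimal sketch of the flags described above, and the exact tokens depend on the remaining defaults:

>>> from microtc.textmodel import TextModel
>>> # delete numbers and replace user mentions with a group token
>>> tm = TextModel(num_option='delete', usr_option='group')
>>> tokens = tm.tokenize("@mgraffg scored 100 points")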

Usage:

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']

Using default parameters

>>> textmodel = TextModel().fit(corpus)

Represent a text whose words are in the corpus and one whose words are not

>>> vector = textmodel['categorizacion ingeotec']
>>> vector2 = textmodel['cat']

Using a different token_list

>>> textmodel = TextModel(token_list=[[2, 1], -1, 3, 4]).fit(corpus)
>>> vector = textmodel['categorizacion ingeotec']
>>> vector2 = textmodel['cat']

Train a classifier

>>> from sklearn.svm import LinearSVC
>>> y = [1, 0, 0]
>>> textmodel = TextModel().fit(corpus)
>>> m = LinearSVC().fit(textmodel.transform(corpus), y)
>>> m.predict(textmodel.transform(corpus))
array([1, 0, 0])
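
The fitted model and classifier can then label unseen text; a short sketch (the predicted class depends on the training data):

>>> unseen = textmodel.transform(['buenos dias ingeotec'])
>>> labels = m.predict(unseen)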
compute_tokens(text)[source]

Compute tokens from a text using q-grams of characters and words, and skip-grams.

Parameters

text (str) – Text transformed by microtc.textmodel.TextModel.text_transformations().

Return type

list

Example:

>>> from microtc.textmodel import TextModel
>>> tm = TextModel(token_list=[-2, -1])
>>> tm.compute_tokens("~Good morning~")
[['Good~morning'], ['Good', 'morning']]
fit(X)[source]

Train the model

Parameters

X (list) – Corpus

Return type

instance

get_text(text)[source]

Return the value stored under the self._text key of text

Parameters

text (dict) – Text
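
Example (a minimal sketch using the default text='text' key):

>>> from microtc.textmodel import TextModel
>>> tm = TextModel()
>>> tm.get_text({'text': 'buenos dias'})
'buenos dias'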

property num_terms

Dimension of the vector space, i.e., the number of terms in the corpus

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> textmodel = TextModel().fit(corpus)
>>> _ = textmodel.transform(corpus)
>>> textmodel.num_terms
8
Return type

int

classmethod params()[source]

Parameters

>>> from microtc.textmodel import TextModel
>>> TextModel.params()
odict_keys(['docs', 'text', 'num_option', 'usr_option', 'url_option', 'emo_option', 'hashtag_option', 'ent_option', 'lc', 'del_dup', 'del_punc', 'del_diac', 'token_list', 'token_min_filter', 'token_max_filter', 'select_ent', 'select_suff', 'select_conn', 'weighting'])
select_tokens(L)[source]

Filter tokens using suffix or connections

Parameters

L (list) – list of tokens

Return type

list

text_transformations(text)[source]

Text transformations. It starts by handling emojis, hashtags, entities, lower casing, numbers, URLs, and users. After these transformations are applied to the text, it calls microtc.textmodel.norm_chars().

Parameters

text (str) –

Return type

str

Example:

>>> from microtc.textmodel import TextModel
>>> tm = TextModel(del_dup=False)
>>> tm.text_transformations("Life is good at México @mgraffg.")
'~life~is~good~at~mexico~_usr~'
tokenize(text)[source]

Transform text into tokens. The text is first normalized with microtc.textmodel.TextModel.text_transformations() and then tokenized with microtc.textmodel.TextModel.compute_tokens().

Parameters

text (str or list) – Text

Return type

list

Example:

>>> from microtc.textmodel import TextModel
>>> tm = TextModel()
>>> tm.tokenize("buenos dias")
['buenos', 'dias']
>>> tm.tokenize(["buenos", "dias", "tenga usted"])
['buenos', 'dias', 'tenga', 'usted']
tonp(X)[source]

Convert a sparse representation, i.e., a list of (index, weight) pairs per text, into a sparse csr_matrix.

Parameters

X (list) – Sparse representation of matrix

Return type

csr_matrix

Example:

>>> from microtc.textmodel import TextModel
>>> tm = TextModel()
>>> class A: pass  # stub standing in for a fitted model
>>> tm.model = A()
>>> tm.model.num_terms = 4  # number of columns in the resulting matrix
>>> matrix = [[(1, 0.5), (3, -0.2)], [(2, 0.3)], [(0, 1), (3, -1.2)]]
>>> r = tm.tonp(matrix)
>>> r.toarray()
array([[ 0. ,  0.5,  0. , -0.2],
       [ 0. ,  0. ,  0.3,  0. ],
       [ 1. ,  0. ,  0. , -1.2]])
transform(texts)[source]

Convert texts into vectors

Parameters

texts (list) – List of texts to be transformed

Return type

csr_matrix

Example:

>>> from microtc.textmodel import TextModel
>>> corpus = ['buenos dias catedras', 'catedras conacyt']
>>> textmodel = TextModel().fit(corpus)
>>> X = textmodel.transform(corpus)
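
The result has one row per text and one column per term; a quick sanity check:

>>> X.shape == (2, textmodel.num_terms)
True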
microtc.textmodel.norm_chars(text, del_diac=True, del_dup=True, del_punc=False)[source]

Transform text by removing diacritics, duplicate characters, and punctuation, depending on the flags. It adds ~ at the beginning and the end, and spaces are replaced by ~.

Parameters
  • text (str) – Text

  • del_diac (bool) – Delete diacritics

  • del_dup (bool) – Delete duplicates

  • del_punc (bool) – Delete punctuation symbols

Return type

str

Example:

>>> from microtc.textmodel import norm_chars
>>> norm_chars("Life is good at Méxicoo.")
'~Life~is~god~at~Mexico.~'
microtc.textmodel.get_word_list(text)[source]

Transform a text (beginning and ending with ~) into a list of words. It is called after microtc.textmodel.norm_chars().

Parameters

text (str) – Text

Return type

list

Example:

>>> from microtc.textmodel import get_word_list
>>> get_word_list("~Someone's house.~")
["Someone's", 'house']

microtc.textmodel.expand_qgrams(text, qsize, output)[source]

Expands a text into a list of character q-grams

Parameters
  • text (str) – Text

  • qsize (int) – q-gram size

  • output (list) – output

Returns

output

Return type

list

Example:

>>> from microtc.textmodel import expand_qgrams
>>> output = list()
>>> expand_qgrams("Good morning.", 3, output)
['q:Goo', 'q:ood', 'q:od ', 'q:d m', 'q: mo', 'q:mor', 'q:orn', 'q:rni', 'q:nin', 'q:ing', 'q:ng.']
microtc.textmodel.expand_qgrams_word_list(wlist, qsize, output, sep='~')[source]

Expands a list of words into a list of word q-grams. It uses sep to join the words.

Parameters
  • wlist (list) – List of words computed by microtc.textmodel.get_word_list().

  • qsize (int) – q-gram size of words

  • output (list) – output

  • sep (str) – String used to join the words

Returns

output

Return type

list

Example:

>>> from microtc.textmodel import expand_qgrams_word_list
>>> wlist = ["Good", "morning", "Mexico"]
>>> expand_qgrams_word_list(wlist, 2, list())
['Good~morning', 'morning~Mexico']
microtc.textmodel.expand_skipgrams_word_list(wlist, qsize, output, sep='~')[source]

Expands a list of words into a list of skip-grams. It uses sep to join the words.

Parameters
  • wlist (list) – List of words computed by microtc.textmodel.get_word_list().

  • qsize (tuple) – Pair (qsize, skip), where qsize is the q-gram size and skip is the number of words skipped

  • output (list) – output

  • sep (str) – String used to join the words

Returns

output

Return type

list

Example:

>>> from microtc.textmodel import expand_skipgrams_word_list
>>> wlist = ["Good", "morning", "Mexico"]
>>> expand_skipgrams_word_list(wlist, (2, 1), list())
['Good~Mexico']