Text Analytics Toolbox
Analyze and model text data
Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling.
Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.
Using machine learning techniques such as latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data.
Import and Visualize Text Data
Extract text data from sources such as social media, news feeds, equipment logs, reports, and surveys.
Extract Text Data
Import text data into MATLAB® from single files or large collections of files, including PDF, HTML, and Microsoft® Word® and Excel® files.
Visually explore text datasets using word clouds and text scatter plots.
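As a rough sketch of the import-and-explore workflow, the following MATLAB code reads a text file and draws a word cloud. It assumes the example file sonnets.txt that ships with the toolbox is on the path; substitute your own PDF, HTML, Word, or plain text file.

```matlab
% Read the raw text of a file. extractFileText also accepts PDF,
% HTML, and Microsoft Word files.
str = extractFileText("sonnets.txt");   % assumed example file

% Split into one string per line and tokenize.
textData = split(str,newline);
documents = tokenizedDocument(textData);

% Visualize the most frequent words.
figure
wordcloud(documents);
title("Word Cloud of Imported Text")
```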
Text Analytics Toolbox provides language-specific preprocessing capabilities for English and Japanese. Most functions also work with text in other languages.
Preprocess Text Data
Extract meaningful words from raw text.
Clean Text Data
Apply high-level filtering functions to remove extraneous content such as URLs, HTML tags, and punctuation.
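For illustration, one possible cleaning pass is sketched below. The input string is made up for this example, and the order of the steps is one reasonable choice rather than a required one.

```matlab
% Made-up input containing HTML tags, a URL, and punctuation.
str = "Read the <b>full report</b> at https://www.example.com/report today!";

str = eraseTags(str);    % strip HTML and XML tags
str = eraseURLs(str);    % strip HTTP and HTTPS URLs

% Tokenize first so that punctuation can be removed token by token.
documents = tokenizedDocument(str);
documents = erasePunctuation(documents);
```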
Filter Stop Words and Normalize Words to Root Form
Prioritize meaningful text data in your analysis by filtering out common words, words that appear too frequently or infrequently, and very long or very short words. Reduce the vocabulary and focus on the broader sense or sentiment of a document by stemming words to their root form or lemmatizing them to their dictionary form.
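A minimal sketch of such a filtering step might look like the following. The sentence and the length thresholds (2 and 15) are illustrative assumptions, not recommended settings.

```matlab
% Made-up example sentence.
documents = tokenizedDocument( ...
    "The engines were overheating during the longest test runs.");

% Part-of-speech details help the lemmatizer choose the right root form.
documents = addPartOfSpeechDetails(documents);

documents = removeStopWords(documents);                 % drop common words
documents = normalizeWords(documents,'Style','lemma');  % or 'stem' for stemming
documents = erasePunctuation(documents);

documents = removeShortWords(documents,2);    % words with 2 or fewer characters
documents = removeLongWords(documents,15);    % words with 15 or more characters
```

Words that appear too rarely across a collection can be filtered later, once word counts exist, for example with removeInfrequentWords on a bag-of-words model.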
Identify Tokens, Sentences, and Parts of Speech
Automatically split raw text into a collection of words using a tokenization algorithm. Add sentence boundaries, part-of-speech details, and other relevant information for context.
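The steps above can be sketched as follows, using a made-up two-sentence string. The resulting table has one row per token.

```matlab
% Made-up input text.
str = "The pump failed on Monday. A technician replaced the seal.";

documents = tokenizedDocument(str);              % split into tokens
documents = addSentenceDetails(documents);       % add sentence numbers
documents = addPartOfSpeechDetails(documents);   % add part-of-speech tags

% Inspect the per-token details as a table.
tdetails = tokenDetails(documents);
head(tdetails)
```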
Convert Text to Numeric Formats
Convert text data to numeric form for use in machine learning and deep learning.
Word and N-Gram Counting
Calculate word frequency statistics to represent text data numerically.
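As a small illustration, the sketch below builds word and bigram counts for three made-up documents and derives a tf-idf matrix from the word counts.

```matlab
% Three made-up documents.
documents = tokenizedDocument([
    "the pump failed and the valve leaked"
    "the valve was replaced"
    "the pump restarted after the valve was replaced"]);

% Word counts: bag.Counts is a sparse documents-by-words matrix.
bag = bagOfWords(documents);
topWords = topkwords(bag,3);

% Term frequency-inverse document frequency matrix, one row per document.
M = tfidf(bag);

% Counts of two-word sequences (bigrams).
ngramBag = bagOfNgrams(documents,'NgramLengths',2);
topBigrams = topkngrams(ngramBag,3);
```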
Word Embedding and Encoding
Train word-embedding models such as word2vec continuous bag-of-words (CBOW) and skip-gram models. Import pretrained models including fastText and GloVe.
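A minimal training sketch is shown below. It assumes the example file sonnetsPreprocessed.txt that ships with the toolbox; the embedding dimension, number of epochs, and query word are arbitrary choices for illustration.

```matlab
% Load preprocessed example text, one document per line (assumed file).
str = extractFileText("sonnetsPreprocessed.txt");
documents = tokenizedDocument(split(str,newline));

% Train a continuous bag-of-words model; use 'skipgram' for skip-gram.
emb = trainWordEmbedding(documents, ...
    'Model','cbow', ...
    'Dimension',50, ...
    'NumEpochs',10, ...
    'Verbose',0);

% Map a word to its vector, then find the nearest words to that vector.
vec = word2vec(emb,"love");
nearest = vec2word(emb,vec,5);
```

Pretrained embeddings can be loaded instead of trained, for example with fastTextWordEmbedding (which requires a separate support package) or readWordEmbedding for embedding files on disk.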
Machine Learning with Text Data
Perform topic modeling, classification, and dimensionality reduction with machine learning algorithms such as latent Dirichlet allocation (LDA) and latent semantic analysis (LSA).
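For illustration, the sketch below fits an LDA topic model and an LSA model to the same bag-of-words data. It assumes the example file sonnetsPreprocessed.txt that ships with the toolbox; the number of topics (4) and components (20) are arbitrary choices.

```matlab
% Build a bag-of-words model from the example text (assumed file).
str = extractFileText("sonnetsPreprocessed.txt");
documents = tokenizedDocument(split(str,newline));
bag = bagOfWords(documents);

rng("default")   % make the random initialization repeatable

% Topic modeling with latent Dirichlet allocation.
numTopics = 4;
ldaMdl = fitlda(bag,numTopics,'Verbose',0);
topWords = topkwords(ldaMdl,5,1);   % top 5 words of topic 1

% Dimensionality reduction with latent semantic analysis.
lsaMdl = fitlsa(bag,20);
scores = lsaMdl.DocumentScores;     % one 20-dimensional row per document
```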
Discover and visualize underlying patterns, trends, and complex relationships in large sets of text data.
Identify the attitudes and opinions expressed in text data to categorize statements as positive, neutral, or negative. Build models that can predict sentiment in real time.
Use deep learning with word embeddings to classify text descriptions into categories.
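As a simple starting point, the sketch below scores three made-up statements with the rule-based VADER algorithm rather than a trained model. The cutoff of 0.05 for separating neutral from positive or negative is an assumption made for this example, not a toolbox default.

```matlab
% Three made-up statements.
documents = tokenizedDocument([
    "The new firmware is excellent and very reliable."
    "The unit arrived on Tuesday."
    "Terrible support, the device keeps crashing."]);

% Compound sentiment scores in the range [-1, 1], one per document.
scores = vaderSentimentScores(documents);

% Map scores to labels using an assumed cutoff.
cutoff = 0.05;
labels = repmat("neutral",size(scores));
labels(scores >= cutoff) = "positive";
labels(scores <= -cutoff) = "negative";
```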
Use deep learning to generate new text based on observed text.
Japanese Language Support
Perform text analytics on Japanese language text, including tokenization, stop word removal, lemmatization, and part-of-speech tagging
Convert words to their dictionary form using lemmatization with parts of speech and other information
Identify parts of speech, such as adjectives, adverbs, nouns, and verbs
Extract HTML from specific parts of a web page using HTML structure and CSS classes
Train deep learning networks using word embedding layers (requires Deep Learning Toolbox)
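To illustrate the HTML extraction item in the list above, the following sketch parses a made-up HTML snippet and pulls text only from elements with a given CSS class. The markup and the class name "article" are hypothetical.

```matlab
% Made-up HTML with a navigation block and an article block.
code = "<html><body>" + ...
    "<div class=""menu"">Home | About</div>" + ...
    "<div class=""article""><p>The pump failed on Monday.</p></div>" + ...
    "</body></html>";

% Parse the HTML and select elements by CSS class.
tree = htmlTree(code);
subtrees = findElement(tree,".article");

% Extract the text of the selected elements only.
str = extractHTMLText(subtrees);
```

For a live page, the HTML code could come from webread instead of a literal string.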
Deep Learning Examples
Learn about generating text and working with out-of-memory text data (requires Deep Learning Toolbox)