Prepare Text Data for Analysis

This example shows how to create a function that cleans and preprocesses text data for analysis.

Text data can be large and noisy, which can negatively affect statistical analysis. For example, text data can contain the following:

  • Variations in case, for example "new" and "New"

  • Variations in word forms, for example "walk" and "walking"

  • Words which add noise, for example stop words such as "the" and "of"

  • Punctuation and special characters

  • HTML and XML tags

These word clouds illustrate word frequency analysis applied to some raw text data from weather reports, and a preprocessed version of the same text data.
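The first two noise sources above can be illustrated with a toy sketch (hypothetical input string; it uses only the lower, tokenizedDocument, addPartOfSpeechDetails, and normalizeWords functions, which this example covers in detail):

```matlab
% Toy sketch: lowercasing collapses case variants ("New" vs. "new"),
% and lemmatization collapses word forms ("walking" vs. "walk").
str = "New trees walking Walk";
docs = tokenizedDocument(lower(str));
docs = addPartOfSpeechDetails(docs);   % part-of-speech details improve lemmatization
docs = normalizeWords(docs,'Style','lemma')
```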

Load and Extract Text Data

Load the example data. The file weatherReports.csv contains weather reports, including a text description and categorical labels for each event.

filename = "weatherReports.csv";
data = readtable(filename,'TextType','string');

Extract the text data from the field event_narrative, and the label data from the field event_type.

textData = data.event_narrative;
labels = data.event_type;
textData(1:10)
ans = 10×1 string array
    "Large tree down between Plantersville and Nettleton."
    "One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour. One vehicle was stalled in the water."
    "NWS Columbia relayed a report of trees blown down along Tom Hall St."
    "Media reported two trees blown down along I-40 in the Old Fort area."
    ""
    "A few tree limbs greater than 6 inches down on HWY 18 in Roseland."
    "Awning blown off a building on Lamar Avenue. Multiple trees down near the intersection of Winchester and Perkins."
    "Quarter size hail near Rosemark."
    "Tin roof ripped off house on Old Memphis Road near Billings Drive. Several large trees down in the area."
    "Powerlines down at Walnut Grove and Cherry Lane roads."

Create Tokenized Documents

Convert the text data to lowercase.

cleanTextData = lower(textData);
cleanTextData(1:10)
ans = 10×1 string array
    "large tree down between plantersville and nettleton."
    "one to two feet of deep standing water developed on a street on the winthrop university campus after more than an inch of rain fell in less than an hour. one vehicle was stalled in the water."
    "nws columbia relayed a report of trees blown down along tom hall st."
    "media reported two trees blown down along i-40 in the old fort area."
    ""
    "a few tree limbs greater than 6 inches down on hwy 18 in roseland."
    "awning blown off a building on lamar avenue. multiple trees down near the intersection of winchester and perkins."
    "quarter size hail near rosemark."
    "tin roof ripped off house on old memphis road near billings drive. several large trees down in the area."
    "powerlines down at walnut grove and cherry lane roads."

Create an array of tokenized documents.

cleanDocuments = tokenizedDocument(cleanTextData);
cleanDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

   (1,1)   8 tokens: large tree down between plantersville and nettleton .
   (2,1)  39 tokens: one to two feet of deep standing water developed on a stre…
   (3,1)  14 tokens: nws columbia relayed a report of trees blown down along to…
   (4,1)  14 tokens: media reported two trees blown down along i-40 in the old …
   (5,1)   0 tokens:
   (6,1)  15 tokens: a few tree limbs greater than 6 inches down on hwy 18 in r…
   (7,1)  20 tokens: awning blown off a building on lamar avenue . multiple tre…
   (8,1)   6 tokens: quarter size hail near rosemark .
   (9,1)  21 tokens: tin roof ripped off house on old memphis road near billing…
  (10,1)  10 tokens: powerlines down at walnut grove and cherry lane roads .

Erase the punctuation from the documents.

cleanDocuments = erasePunctuation(cleanDocuments);
cleanDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

   (1,1)   7 tokens: large tree down between plantersville and nettleton
   (2,1)  37 tokens: one to two feet of deep standing water developed on a stre…
   (3,1)  13 tokens: nws columbia relayed a report of trees blown down along to…
   (4,1)  13 tokens: media reported two trees blown down along i40 in the old f…
   (5,1)   0 tokens:
   (6,1)  14 tokens: a few tree limbs greater than 6 inches down on hwy 18 in r…
   (7,1)  18 tokens: awning blown off a building on lamar avenue multiple trees…
   (8,1)   5 tokens: quarter size hail near rosemark
   (9,1)  19 tokens: tin roof ripped off house on old memphis road near billing…
  (10,1)   9 tokens: powerlines down at walnut grove and cherry lane roads

Words like "a", "and", "to", and "the" (known as stop words) can add noise to data. Remove a list of stop words using the removeStopWords function.

cleanDocuments = removeStopWords(cleanDocuments);
cleanDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

   (1,1)   5 tokens: large tree down plantersville nettleton
   (2,1)  18 tokens: two feet deep standing water developed street winthrop uni…
   (3,1)  10 tokens: nws columbia relayed report trees blown down tom hall st
   (4,1)  10 tokens: media reported two trees blown down i40 old fort area
   (5,1)   0 tokens:
   (6,1)  10 tokens: few tree limbs greater 6 inches down hwy 18 roseland
   (7,1)  13 tokens: awning blown off building lamar avenue multiple trees down…
   (8,1)   5 tokens: quarter size hail near rosemark
   (9,1)  16 tokens: tin roof ripped off house old memphis road near billings d…
  (10,1)   7 tokens: powerlines down walnut grove cherry lane roads

Remove words with 2 or fewer characters, and words with 15 or more characters.

cleanDocuments = removeShortWords(cleanDocuments,2);
cleanDocuments = removeLongWords(cleanDocuments,15);
cleanDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

   (1,1)   5 tokens: large tree down plantersville nettleton
   (2,1)  18 tokens: two feet deep standing water developed street winthrop uni…
   (3,1)   9 tokens: nws columbia relayed report trees blown down tom hall
   (4,1)  10 tokens: media reported two trees blown down i40 old fort area
   (5,1)   0 tokens:
   (6,1)   8 tokens: few tree limbs greater inches down hwy roseland
   (7,1)  13 tokens: awning blown off building lamar avenue multiple trees down…
   (8,1)   5 tokens: quarter size hail near rosemark
   (9,1)  16 tokens: tin roof ripped off house old memphis road near billings d…
  (10,1)   7 tokens: powerlines down walnut grove cherry lane roads

Lemmatize the words using normalizeWords. To improve lemmatization, first add part of speech details to the documents using addPartOfSpeechDetails.

cleanDocuments = addPartOfSpeechDetails(cleanDocuments);
cleanDocuments = normalizeWords(cleanDocuments,'Style','lemma');
cleanDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

   (1,1)   5 tokens: large tree down plantersville nettleton
   (2,1)  18 tokens: two foot deep standing water develop street winthrop unive…
   (3,1)   9 tokens: nws columbia relayed report tree blow down tom hall
   (4,1)  10 tokens: medium report two tree blow down i40 old fort area
   (5,1)   0 tokens:
   (6,1)   8 tokens: few tree limb great inches down hwy roseland
   (7,1)  13 tokens: awning blow off building lamar avenue multiple tree down n…
   (8,1)   5 tokens: quarter size hail near rosemark
   (9,1)  16 tokens: tin roof rip off house old memphis road near billings driv…
  (10,1)   7 tokens: powerlines down walnut grove cherry lane road

Create Bag-of-Words Model

Create a bag-of-words model.

cleanBag = bagOfWords(cleanDocuments)
cleanBag = 
  bagOfWords with properties:

          Counts: [36176×18410 double]
      Vocabulary: [1×18410 string]
        NumWords: 18410
    NumDocuments: 36176

Remove words that appear two or fewer times in the bag-of-words model.

cleanBag = removeInfrequentWords(cleanBag,2)
cleanBag = 
  bagOfWords with properties:

          Counts: [36176×6952 double]
      Vocabulary: [1×6952 string]
        NumWords: 6952
    NumDocuments: 36176

Some preprocessing steps, such as removeInfrequentWords, leave empty documents in the bag-of-words model. To ensure that no empty documents remain after preprocessing, use removeEmptyDocuments as the last step.

Remove empty documents from the bag-of-words model and the corresponding labels from labels.

[cleanBag,idx] = removeEmptyDocuments(cleanBag);
labels(idx) = [];
cleanBag
cleanBag = 
  bagOfWords with properties:

          Counts: [28137×6952 double]
      Vocabulary: [1×6952 string]
        NumWords: 6952
    NumDocuments: 28137

Create a Preprocessing Function

It can be useful to create a function which performs preprocessing so you can prepare different collections of text data in the same way. For example, you can use a function so that you can preprocess new data using the same steps as the training data.

Create a function that tokenizes and preprocesses the text data so it can be used for analysis. The function preprocessWeatherNarratives performs the following steps in order:

  1. Convert the text data to lowercase using lower.

  2. Tokenize the text using tokenizedDocument.

  3. Erase punctuation using erasePunctuation.

  4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

  7. Lemmatize the words using normalizeWords. To improve lemmatization, first add part of speech details using addPartOfSpeechDetails.

Use the preprocessing function preprocessWeatherNarratives to prepare new text data.

newText = "A tree is downed outside Apple Hill Drive, Natick";
newDocuments = preprocessWeatherNarratives(newText)
newDocuments = 
  tokenizedDocument:

   7 tokens: tree down outside apple hill drive natick
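
After preprocessing, a common next step is to encode the new documents against the vocabulary of the training bag-of-words model. A minimal sketch using the encode function (this assumes cleanBag from the earlier steps; words outside the training vocabulary are ignored):

```matlab
% Encode the preprocessed new documents as word counts over the
% training vocabulary. The result is a sparse 1-by-NumWords row.
newCounts = encode(cleanBag,newDocuments);
size(newCounts)
```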

Compare with Raw Data

Compare the preprocessed data with the raw data.

rawDocuments = tokenizedDocument(textData);
rawBag = bagOfWords(rawDocuments)
rawBag = 
  bagOfWords with properties:

          Counts: [36176×23302 double]
      Vocabulary: [1×23302 string]
        NumWords: 23302
    NumDocuments: 36176

Calculate the reduction in data.

numWordsClean = cleanBag.NumWords;
numWordsRaw = rawBag.NumWords;
reduction = 1 - numWordsClean/numWordsRaw
reduction = 0.7017

Compare the raw data and the cleaned data by visualizing the two bag-of-words models using word clouds.

figure
subplot(1,2,1)
wordcloud(rawBag);
title("Raw Data")
subplot(1,2,2)
wordcloud(cleanBag);
title("Clean Data")

Example Preprocessing Function

The function preprocessWeatherNarratives performs the following steps in order:

  1. Convert the text data to lowercase using lower.

  2. Tokenize the text using tokenizedDocument.

  3. Erase punctuation using erasePunctuation.

  4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

  7. Lemmatize the words using normalizeWords. To improve lemmatization, first add part of speech details using addPartOfSpeechDetails.

function documents = preprocessWeatherNarratives(textData)
% Convert the text data to lowercase.
cleanTextData = lower(textData);

% Tokenize the text.
documents = tokenizedDocument(cleanTextData);

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

% Add part of speech details, then lemmatize the words.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');
end
