How to display a line from a text file that contains a certain set of words

Hi. Suppose I have a text file containing, say, " twinkle twinkle little start; how i wonder what you are ; up above the word so high ; like a diamond in the sky;". I want to check for a particular word, such as wonder, and if that word is present, look for a second, supporting word, such as high. If the line contains both words, it should be displayed. Basically I am trying to do a kind of extractive summarization using MATLAB. Can anyone help me create something like this?

Answers (2)

Although strfind can do some things, it has problems with overlapping words, and it matches substrings rather than whole words. For example, "if" would be found in "lover's tiff". To avoid this, it is easiest to use regexp() and word boundary markers:
if regexp(sentence, '(\<star\>.*\<twinkle\>)|(\<twinkle\>.*\<star\>)') ...
To automate this further:
pattern = sprintf('(\\<%s\\>.*\\<%s\\>)|(\\<%s\\>.*\\<%s\\>)', word1, word2, word2, word1);
if regexp(sentence, pattern) ...
Caution: here, the '.*' will match any character, including newline and including punctuation. If your sentence variable is not already broken up into distinct English sentences then this code will match across multiple grammatical sentences.
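One way to sidestep the '.*' issue is to test each word separately instead of building a single combined pattern. The following is only a minimal sketch: the helper names wholeWord and hasWord are made up for illustration, and the 'ignorecase' option is added on the assumption that case should not matter.

```matlab
% Sketch only: wholeWord and hasWord are illustrative names, not built-ins.
sentence = 'twinkle twinkle little start; how i wonder what you are';
word1 = 'wonder';
word2 = 'twinkle';

% Wrap a word in word-boundary markers, escaping any regexp metacharacters.
wholeWord = @(w) ['\<' regexptranslate('escape', w) '\>'];

% True when the word occurs as a whole word, ignoring case.
hasWord = @(s, w) ~isempty(regexp(s, wholeWord(w), 'once', 'ignorecase'));

bothFound = hasWord(sentence, word1) && hasWord(sentence, word2);
if bothFound
    disp(sentence)
end
```

Because each word is tested on its own, the order of the two words does not matter and no '.*' is needed. Note that hasWord(sentence, 'star') is false here, since the text contains "start" and not the word "star".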
The pattern to restrict to a single grammatical sentence is not easy, because grammatical sentence boundaries are tricky to detect. Grammatical sentences can end in a period, exclamation mark, or question mark, but none of these necessarily ends the grammatical sentence, "... especially if there are quotations in the grammatical sentence!", or if there are parenthetical comments (don't you think those are important?) in some portion. Periods are a nuisance: they can signal the end of the sentence or they can signal an abbr., or they can signal a decimal point. A period that occurs after a value preceded by a currency unit is sometimes a decimal point that will cost you $10. per hair that you pull out trying to get the code to work. Sometimes apostrophes after whitespace signal quotations and sometimes 'tis not a quotation at all, and apostrophes after words might be signalling the words' possessiveness.
Your code to reliably break your input into sentences is going to be much longer than your code to find multiple words within the resulting sentences.
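To make the difficulty concrete, here is a deliberately naive splitter, offered as a sketch rather than a recommendation. It assumes a sentence ends at any '.', '!' or '?' that is followed by whitespace, which is exactly the assumption the paragraphs above warn about.

```matlab
% Naive sketch: treat '.', '!' or '?' followed by whitespace as a sentence end.
text = 'How I wonder what you are. Up above the world so high! Like a diamond in the sky?';
sentences = regexp(text, '(?<=[.!?])\s+', 'split');   % 1-by-3 cell array

% The same rule breaks on abbreviations: the period after "Dr" is taken
% as a sentence end, so this single sentence comes back as two pieces.
pieces = regexp('Dr. Smith arrived.', '(?<=[.!?])\s+', 'split');
```

The first call gives three sentences as hoped; the second shows how quickly the simple rule goes wrong.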

7 Comments

Oh yes, while I remember: If a grammatical sentence has a parenthetical clause at the end of it, then most style guides say that the closing punctuation should be before the closing parenthesis. Personally I think that is a Bad Thing (but I have no authority in the matter.)
Likewise if a grammatical sentence ends in a quotation then most style guides require that the closing punctuation be before the closing quotation. Again I think that is a Bad Thing, but like I said, "I have no authority in the matter."
There have been some textbooks in which the publishers have required that punctuation be before a final quotation even when computer code was being quoted; "disp('bad idea! Is the period part of the code or is it outside the code?')."
Notice by the way that neither the ! nor the ? ended that sentence, as they were being quoted.
Did I mention the arguments about whether an ellipsis at the end of a sentence should have a period after it or not? Go figure....
Hi, thanks for your help. Initially I had a PDF file, which I converted into a text file, so my text file has different lines, and right now I am working on that. First I need to find a word and then look for a second word in the same line. If the line has both words, it should be displayed.
PDFs have linebreaks in the
middle of grammatical sen-
tences, and they often have hy-
phenation too, so you need to
think more about whether you
wish to restrict the search to
pdf lines.
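If you decide not to restrict the search to PDF lines, the lines can be stitched back together first. This is a rough sketch under two assumptions: that a hyphen at the very end of a line always marks a split word (it will also wrongly join genuinely hyphenated words), and that every remaining line break can be replaced by a single space.

```matlab
% Sketch only: sample text with PDF-style line breaks and hyphenation.
raw = sprintf('%s\n', 'Line breaks fall in the', ...
    'middle of grammatical sen-', ...
    'tences, and they often have hy-', ...
    'phenation too');

% Assumption: a hyphen directly before a line break is a split word.
joined = regexprep(raw, '-\r?\n', '');

% Replace the remaining line breaks (and surrounding blanks) with one space.
joined = strtrim(regexprep(joined, '\s*\r?\n\s*', ' '));
disp(joined)
```

The result is one long string, which could then be handed to whatever sentence splitting you settle on.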
I didn't know how to read a PDF file with MATLAB, so I thought I would first convert it to text format and then use MATLAB to read it. But as you said, my text file is not properly formatted. I need it to have one sentence per line. I would be really grateful if you could help me with this. Is there any way to read a PDF or DOC file like that, or any way to convert them to a formatted text file?
Why? What are you doing anyway? What's the big picture here? Why do you need to have text strings with the same line breaks they would have in the PDF? And why do you need to extract only the lines that have a word repeated on them? What's your "use case"?
Let me clarify. I need to read a file, basically a PDF file or a Word document, and draw some conclusions from it. For example, if under certain headings I have sentences containing words like "x", "y", "z", then I apply a condition: for any meaningful sentence that has the words "x" and "z", I can conclude 'A'.
It's like decision making on the basis of data extracted from the Word or PDF file.
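The rule described above might be sketched as follows. Everything here is a placeholder: the sample sentences, the required words 'x' and 'z', and the label 'A' come from the description, while the variable names and the hasWord helper are invented for illustration. It assumes the text has already been split into one sentence per cell.

```matlab
% Sketch only: placeholder sentences, already split one per cell.
sentences = {'the x value exceeds z', 'only y appears here'};
required  = {'x', 'z'};   % words that must all be present
label     = 'A';          % conclusion drawn when they are

% Whole-word, case-insensitive test (hasWord is an illustrative helper).
hasWord = @(s, w) ~isempty(regexp(s, ...
    ['\<' regexptranslate('escape', w) '\>'], 'once', 'ignorecase'));

conclusions = cell(size(sentences));
for k = 1:numel(sentences)
    if all(cellfun(@(w) hasWord(sentences{k}, w), required))
        conclusions{k} = label;
    else
        conclusions{k} = '';
    end
end
```

Here the first sentence gets the label 'A' and the second gets an empty string. More rules could be handled by looping over several required/label pairs.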


Look up "strfind" and "if" in the help. I'm assuming you already know how to import the words from the file. If not, look into fread(), importdata(), textscan(), textread(), etc.
You might also like to use John D'Errico's allwords to split your sentence up into words.
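In case the file reading part is the sticking point, here is a minimal sketch that reads a text file line by line with fgetl() and displays the lines containing both words. The file is a temporary one written by the sketch itself so that it runs on its own, and the variable names are only illustrative. Keep in mind that strfind matches substrings, not whole words.

```matlab
% Sketch only: write a small temporary file so the example is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', 'twinkle twinkle little star', ...
    'how i wonder what you are', 'up above the world so high');
fclose(fid);

word1 = 'twinkle';
word2 = 'star';
matches = {};

% Read one line at a time; fgetl returns -1 (not a char) at end of file.
fid = fopen(fname, 'r');
tline = fgetl(fid);
while ischar(tline)
    if ~isempty(strfind(tline, word1)) && ~isempty(strfind(tline, word2))
        disp(tline)
        matches{end+1} = tline; %#ok<SAGROW>
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);
```

Only the first line of the sample file contains both words, so only that line is displayed.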

6 Comments

Hi. If I'm not wrong, strfind gives you an exact match. My problem is that I need to find a group of words, say two words, not necessarily in the same sequence, in a line, irrespective of case. If the line has both words, it should display, say, yes.
strcmp() gives an exact match. strfind() locates one sequence of letters or numbers anywhere in another, larger sequence of letters or numbers. It should work.
sentence = 'twinkle twinkle little start; how i wonder what you are ; up above the word so high ; like a diamond in the sky';
locations = strfind(sentence, 'twinkle')
locations =
1 9
Thanks for clearing up the difference between the two. It's fine that we can find a word using the strfind command, but what I want is this: if I find a word, say twinkle, then it should search for one more word related to it, say "star". If it has star as well, then my output should be yes, otherwise no. It's more like a set of words that have some relation, like twinkle and star in this case, that I need to check for in the text file. If my condition is fulfilled, i.e. if both words are present, then I should get yes displayed.
You have "star" in your string, but not the word "star", just "start". You can just use strfind() twice to find if both words are in the sentence.
if ~isempty(sentence, word1) && ~isempty(sentence, word2)
% Then both words occur in the sentence string.
end
I think you mean
if ~isempty(strfind(sentence, word1)) && ~isempty(strfind(sentence, word2))
% Then both words occur in the sentence string.
end
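Since a yes/no answer irrespective of case was asked for, the strfind() test can be sketched as below. Lowercasing both sides with lower() is an assumption about what "irrespective of case" should mean, and the variable names are only illustrative.

```matlab
% Sketch only: case-insensitive substring test with a yes/no display.
sentence = 'twinkle twinkle little start; how i wonder what you are ; up above the word so high ; like a diamond in the sky';
word1 = 'Twinkle';
word2 = 'star';

s = lower(sentence);
found = ~isempty(strfind(s, lower(word1))) && ~isempty(strfind(s, lower(word2)));

if found
    disp('yes')
else
    disp('no')
end
```

This displays yes, but only because strfind finds "star" inside "start". If whole-word matching matters, regexp() with word boundary markers is the safer choice.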



Asked: 30 Dec 2015

Commented: 4 Jan 2016
