How to access itemprop = "name" from within a data structure in HTML code using Matlab?

HTML code
<div class="itemName largestFont" itemprop="name"> Information which I want to extract </div>
<div class="itemCategory largeFont"><a href="/somerandomwebsitelink"> Information which I dont need </a></div>
I want to extract only the information from the element with itemprop="name".
Using the selector feature of the Text Analytics Toolbox, I can do
selector = "DIV.itemHeader"
itemHeader is the class of the div that contains both of those div elements, so the information within both divs is extracted.
I only want the information from itemprop="name".
How do I go about doing that?

 Accepted Answer

I don't have the toolbox you mentioned, but it most likely uses XPath to parse the HTML...
I think the best options are XPath or regular expressions.
As far as I know, to use XPath in MATLAB you have to use Java classes, but regular expressions are built into MATLAB and they are very convenient.
The regex pattern could be something like this:
str = ['<div class="itemName largestFont" itemprop="name"> Information which I want to extract </div>'...
'<div class="itemCategory largeFont"><a href="/somerandomwebsitelink"> Information which I dont need </a></div>'];
match = regexp(str, '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>', 'names')
match =
struct with fields:
data: ' Information which I want to extract '
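If you need the plain text rather than a struct, here is a minimal sketch of how the result could be unpacked. It reuses the sample string and pattern from above; the variable names pattern and name are just for illustration. Note that \w+ in the pattern will not match hyphenated attribute names such as data-id, so this assumes the div only carries simple attributes.

```matlab
% Sample HTML copied from the question (two sibling div elements).
str = ['<div class="itemName largestFont" itemprop="name"> Information which I want to extract </div>' ...
       '<div class="itemCategory largeFont"><a href="/somerandomwebsitelink"> Information which I dont need </a></div>'];

% Same pattern as above: a div whose attribute list contains itemprop="name".
pattern = '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>';
match = regexp(str, pattern, 'names');

if isempty(match)
    name = '';                       % no matching element found
else
    name = strtrim(match(1).data);   % first match, surrounding whitespace removed
end
disp(name)
```

The isempty check matters because regexp returns an empty struct array when nothing matches, and indexing into it would throw an error.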

11 Comments

anyway, I edited my answer with a regular expression solution to your problem
function [name] = getTitle(tree)
selector = "DIV.itemHeader";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);
end
Sorry but based on this function, how do I integrate your code into this?
I realize you don't have it but this is all using the text analytics toolbox.
Would I not still need some sort of selector to point to the correct tag and class before going further to extract the information?
what do you get back from this line?
name = extractHTMLText(nameSection);
do you get the HTML you mentioned?
if so you can simply run the regexp line on this string:
function name = getTitle(tree)
selector = "DIV.itemHeader";
nameSection = findElement(tree, selector);
% i'm not sure because I don't have your original HTML nor that toolbox
% but I suspect that this line returns the HTML you mentioned
html = extractHTMLText(nameSection);
regexPattern = '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>';
name = regexp(html, regexPattern, 'names');
end
Come to think of it, "DIV.itemHeader" is a CSS selector.
If you post the original HTML document you are mining data from, it will help.
This is a wild guess, but if I'm right you can try this instead:
selector = "DIV.itemHeader.itemName";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);
or this (although I'm not sure the toolbox supports attribute selectors; itemprop itself is a standard HTML microdata attribute):
selector = "DIV.itemHeader[itemprop=""name""]";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);
NOTE: When I ran these functions, the statements did not end with semicolons, so the results were printed in the command window. Please ignore the semicolons in the code shown here.
When I run this
function [name] = getTitle(tree)
selector = "DIV.itemHeader";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);
end
I get this in the command window
name =
Information I want
Information I don't want
When I run this
selector = "DIV.itemHeader.itemName";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);
I get this in the command window
name =
0×1 empty double column vector
When I run this
selector = "DIV.itemHeader[itemprop=""name""]";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);
I get this in the command window
Error using htmlTree/findElement (line 99)
Attribute selector 'itemprop="name"' is not supported.
When I run this
function name = getTitle(tree)
selector = "DIV.itemHeader";
nameSection = findElement(tree, selector);
html = extractHTMLText(nameSection);
regexPattern = '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>';
name = regexp(html, regexPattern, 'names');
end
I get this in the command window
name =
0×0 empty struct array with fields:
data
I want the output of
name = regexp(title, regexPattern, 'names');
to give me this in the command window
name =
Information I want
Here is the website I am trying to get HTML information from
You will notice that
Manticore of Darkness - IOC-067 - Ultra Rare Unlimited
and
Invasion of Chaos [IOC] Unlimited Singles
are both in the
<div class="itemHeader"><div class="itemName largestFont" itemprop="name">Manticore of Darkness - IOC-067 - Ultra Rare Unlimited</div><div class="itemCategory largeFont"><a href="/invasion-of-chaos-ioc-unlimited-singles/11257">Invasion of Chaos [IOC] Unlimited Singles</a></div></div>
I just want
Manticore of Darkness - IOC-067 - Ultra Rare Unlimited
NOT
Invasion of Chaos [IOC] Unlimited Singles
Thanks !!
this was a real long shot:
selector = "DIV.itemHeader[itemprop=""name""]";
The regex doesn't work because extractHTMLText returns a string array of the text and not the HTML...
Can you post your HTML document so I can at least try the CSS selectors?
Also, I made a mistake with the selector earlier.
Try this instead:
% this CSS selector is now valid if I got the structure of your HTML right
% and if MATLAB handles CSS selectors correctly
selector = "DIV.itemHeader .itemName";
or this (probably won't work either):
selector = "DIV.itemHeader [itemprop=""name""]";
or maybe (not sure, as htmlTree is only available starting with R2018b, so I don't have it):
function name = getTitle(tree)
selector = "DIV.itemHeader";
nameSection = findElement(tree, selector);
html = nameSection.Content; % hopefully this will return the inner HTML
regexPattern = '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>';
match = regexp(html, regexPattern, 'names');
name = match.data;
end
OK,
so the element you want to find is the only one with the "itemName" CSS class.
The simplest CSS selector for that one would be ".itemName".
This should work:
function name = getTitle(tree)
selector = ".itemName";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);
end
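To try the ".itemName" selector without fetching the web page, a small sketch like the following could be used. It assumes the Text Analytics Toolbox is installed (htmlTree is available from R2018b) and builds the tree from the HTML fragment posted in the comment above; the variable name code is just for illustration.

```matlab
% Requires Text Analytics Toolbox (htmlTree, findElement, extractHTMLText).
% HTML fragment taken from the comment above, written as a MATLAB string
% (double quotes inside the string are escaped by doubling them).
code = "<div class=""itemHeader"">" + ...
       "<div class=""itemName largestFont"" itemprop=""name"">Manticore of Darkness - IOC-067 - Ultra Rare Unlimited</div>" + ...
       "<div class=""itemCategory largeFont""><a href=""/invasion-of-chaos-ioc-unlimited-singles/11257"">Invasion of Chaos [IOC] Unlimited Singles</a></div>" + ...
       "</div>";

tree = htmlTree(code);                        % parse the fragment
nameSection = findElement(tree, ".itemName"); % only elements with class itemName
name = extractHTMLText(nameSection)           % no semicolon, so the result is printed
```

With this fragment I'd expect name to contain only the card title and not the category link text, since only the first inner div carries the itemName class.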

More Answers (1)

Using htmlTree, this is trivial:
tree = htmlTree(fileread('yourfile.html'))
div = tree.findElement('div')
item = div.getAttribute("itemprop")
names = item == "name"
div(names).extractHTMLText

4 Comments

This also worked. However, while the output of
div(names).extractHTMLText
was what I wanted, when the function returned the value and this value was assigned to a variable
name = getName(tree);
the output was
name =
377×1 logical array
followed by a column of 377 zeros.
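A 377×1 logical array is what you would see if the function's output variable ended up holding the comparison mask rather than the extracted text. The actual body of getName was not posted, so the following is only an assumed reconstruction of how the steps from the answer could be wrapped so that the text is what gets returned. It needs the Text Analytics Toolbox.

```matlab
function name = getName(tree)
% getName  Return the text of every div whose itemprop attribute is "name".
% The function name comes from the comment above; the body is an assumption.
    divs = findElement(tree, "div");         % all div elements in the tree
    item = getAttribute(divs, "itemprop");   % itemprop value of each div (missing if absent)
    isName = item == "name";                 % logical mask; missing values compare as false
    name = extractHTMLText(divs(isName));    % assign the text, not the mask, to the output
end
```

The key point is that the output variable name is assigned on the last line from extractHTMLText, and the logical mask is kept in a separate variable.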
You gotta give TADA and Walter a raise, they've been helping me over literally the past few days. At this point, I might as well throw them on my script as co-authors :D
Neither Walter Robertson nor I (as far as I know) work for MathWorks... I'd gladly take that raise though :)
@TADA, we're always hiring into MathWorks and have a distributor in Israel who may or may not be looking for MATLAB users.
@Shivam, this returns exactly what you want from your comment above:
s = string(webread("https://beta.trollandtoad.com/yugioh/invasion-of-chaos-ioc-unlimited-singles/manticore-of-darkness-ioc-067-ultra-rare-unlimited/1155511", weboptions('Timeout', 15)));
%%
tree = htmlTree(s)
%%
div = tree.findElement('div')
%%
item = div.getAttribute("itemprop")
%%
names = item == "name"
%%
div(names).extractHTMLText
ans =
"Manticore of Darkness - IOC-067 - Ultra Rare Unlimited"

Release: R2019a

Asked: on 26 Mar 2019

Commented: on 29 Mar 2019
