How to access itemprop = "name" from within a data structure in HTML code using Matlab?

Question

HTML code

<div class="itemName largestFont" itemprop="name"> Information which I want to extract </div>
<div class="itemCategory largeFont"><a href="/somerandomwebsitelink"> Information which I dont need </a></div>

I want to extract only the text inside the div with itemprop="name".

Using the selector feature in Text Analytics Toolbox, I can set

selector = "DIV.itemHeader"

itemHeader is the class of the element that contains both of these divs, so the text of both divs is extracted.

I only want the text from the div with itemprop="name".

How do I go about doing that?

3 Comments

N/A on 26 Mar 2019

Yup, that's correct

Walter Roberson on 26 Mar 2019

Unfortunately I do not have that toolbox to test with.

My own implementation would probably be to use regexp with named tokens and the 'names' option.

Answer 1

TADA on 26 Mar 2019

Edited: TADA on 27 Mar 2019

I don't have the toolbox you mentioned, but it most likely uses XPath to parse the HTML.

I think the best options are XPath or regular expressions.

As far as I know, using XPath in MATLAB requires Java classes, whereas regular expressions are built into MATLAB and are very convenient.

The regex pattern could be something like this:

str = ['<div class="itemName largestFont" itemprop="name"> Information which I want to extract </div>'...
'<div class="itemCategory largeFont"><a href="/somerandomwebsitelink"> Information which I dont need </a></div>'];
match = regexp(str, '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>', 'names')
match = 
  struct with fields:
    data: ' Information which I want to extract '
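
The match above keeps the padding spaces around the text, and a real page may contain more than one such div. A minimal sketch of the post-processing, using the same pattern and base MATLAB only; the second itemprop="name" div and the variable names `html`, `pattern`, and `names` are made up for illustration:

```matlab
% Sample markup based on the question, with a second itemprop="name" div
% added purely for illustration.
html = ['<div class="itemName largestFont" itemprop="name"> First item </div>' ...
        '<div class="itemCategory largeFont"><a href="/somerandomwebsitelink"> Not needed </a></div>' ...
        '<div itemprop="name" class="itemName"> Second item </div>'];

pattern = '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>';

% With the 'names' option, regexp returns a struct array with one element per match.
match = regexp(html, pattern, 'names');

% Collect the captured text into a cell array and trim the surrounding spaces.
names = strtrim({match.data});
```

Note that `\w+` in the pattern only covers attribute names made of letters, digits, and underscores, so a div carrying a hyphenated attribute such as `data-id` would not match.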

11 Comments

N/A on 28 Mar 2019

Edited: N/A on 28 Mar 2019

NOTE: When I ran these, the lines that produced output did not end in semicolons. Please ignore the semicolons in the code shown below.

When I run this

function [name] = getTitle(tree)
    selector = "DIV.itemHeader";
    nameSection = findElement(tree, selector);
    name = extractHTMLText(nameSection);
end

I get this in the command window

name = 
     Information I want
     
     Information I don't want

When I run this

selector = "DIV.itemHeader.itemName";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);

I get this in the command window

name =
  0×1 empty double column vector

When I run this

selector = "DIV.itemHeader[itemprop=""name""]";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);

I get this in the command window

Error using htmlTree/findElement (line 99)
Attribute selector 'itemprop="name"' is not supported.

When I run this

function name = getTitle(tree)
    selector = "DIV.itemHeader";
    nameSection = findElement(tree, selector);
    html = extractHTMLText(nameSection);
    
    regexPattern = '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>';
    name = regexp(html, regexPattern, 'names');
end
     

I get this in the command window

name = 
  0×0 empty struct array with fields:
    data
    

I want the output of

name = regexp(html, regexPattern, 'names');

to give me this in the command window

name = 
     Information I want
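
One likely reason for the empty struct in the regexp attempt above, offered as an assumption rather than a confirmed diagnosis: extractHTMLText returns the text content with the markup removed, so a pattern that looks for `<div ...>` tags has nothing left to match. The sketch below uses base MATLAB only; `plainText` is a hand-written stand-in for what extractHTMLText would return, not actual toolbox output:

```matlab
pattern = '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>';

% Raw markup still contains the tags the pattern is anchored on.
rawHtml = '<div class="itemName largestFont" itemprop="name"> Information I want </div>';

% Stand-in (assumption) for the tag-free text that extractHTMLText returns.
plainText = 'Information I want';

fromRaw  = regexp(rawHtml, pattern, 'names');    % one match, field "data"
fromText = regexp(plainText, pattern, 'names');  % no match: empty struct array

name = strtrim(fromRaw.data);
```

Under that assumption, the pattern would need to be applied to the raw HTML rather than to the extracted text.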
     

N/A on 28 Mar 2019

HALLELUJAH! :D

TADA on 28 Mar 2019

Cheers

Answer 2

Sean de Wolski on 28 Mar 2019

Edited: Sean de Wolski on 28 Mar 2019

Using htmlTree, this is trivial:

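tree = htmlTree(fileread('yourfile.html'))  % parse the HTML file into a tree
div = tree.findElement('div')               % all div elements in the tree
item = div.getAttribute("itemprop")         % itemprop value of each div
names = item == "name"                      % logical mask: divs with itemprop="name"
div(names).extractHTMLText                  % text of the matching divs only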
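
The same steps can be tried without a file by building the tree from an inline string. This is a sketch that assumes Text Analytics Toolbox is installed; the markup is the snippet from the question, wrapped in the itemHeader container the asker describes, and the variable names are illustrative:

```matlab
% Requires Text Analytics Toolbox (htmlTree, findElement, getAttribute, extractHTMLText).
code = "<div class=""itemHeader"">" + ...
    "<div class=""itemName largestFont"" itemprop=""name""> Information which I want to extract </div>" + ...
    "<div class=""itemCategory largeFont""><a href=""/somerandomwebsitelink""> Information which I dont need </a></div>" + ...
    "</div>";
tree = htmlTree(code);

divs = findElement(tree, "div");             % every div in the tree
itemprops = getAttribute(divs, "itemprop");  % one itemprop value per div
isName = itemprops == "name";                % true only where itemprop is "name"
nameText = extractHTMLText(divs(isName));    % text of the matching div
```

Filtering on the attribute value after findElement sidesteps the "Attribute selector ... is not supported" error reported in the comments, because the selector itself only names the tag.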

4 Comments

TADA on 28 Mar 2019

Neither Walter Roberson nor I (as far as I know) work for MathWorks... I'd gladly take that raise though :)

Sean de Wolski on 29 Mar 2019

@TADA, we're always hiring into MathWorks and have a distributor in Israel who may or may not be looking for MATLAB users.

@Shivam, this returns exactly what you want from your comment above:

s = string(webread("https://beta.trollandtoad.com/yugioh/invasion-of-chaos-ioc-unlimited-singles/manticore-of-darkness-ioc-067-ultra-rare-unlimited/1155511", weboptions('Timeout', 15)));
%%
tree = htmlTree(s)
%%
div = tree.findElement('div')
%%
item = div.getAttribute("itemprop")
%%
names = item == "name"
%%
div(names).extractHTMLText
ans = 
    "Manticore of Darkness - IOC-067 - Ultra Rare Unlimited"
