Iteratively search in a website (for dummies)

Hi all,
I have a list of thousands of chemical formulas (or potential formulas). What I'd like to do is loop over the list (for i=1:size(FormulaList,1) ... end), insert each formula into the search bar of the website ( https://pubchem.ncbi.nlm.nih.gov/ ), and check whether I get any matches or something like "0 results found".
I've tried to apply the method described here ( https://it.mathworks.com/matlabcentral/answers/400522-retrieving-data-from-a-web-page ), but I could not work out how to get the "curl" command (sorry: I'm completely ignorant about this!).
Cheers,
Luca
[SL: removed the parenthesis from the end of one of the hyperlinks]

 Accepted Answer

I've found the solution.
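% MassList: column vector (string array or cell array) of molecular formulas
ResNum = zeros(size(MassList,1),1); % preallocate the result counts
tic
for mass = 1:size(MassList,1)
    % char() makes strcat return a plain character vector for webread
    url = strcat('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/',char(MassList(mass,1)),'/cids/JSON?list_return=cachekey');
    try
        jsonData = webread(url);
        ResNum(mass,1) = jsonData.IdentifierList.Size;
    catch
        % any failed request is counted as zero matches
        ResNum(mass,1) = 0;
    end
    pause(0.205) % the website asks for max 5 requests / second
end
toc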
The resulting column vector ResNum gives, for each formula, the number of compounds in PubChem with that molecular formula.
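One limitation of the loop above is that every failed request is recorded as 0, so a network problem looks the same as a formula with no matches. The following is a minimal sketch of one way to separate the two cases, not a tested replacement for the accepted code. The names `formulas`, `buildUrl`, `opts` and `hits` are illustrative, and the code assumes that PubChem answers an empty search with an HTTP 404 error and that the identifier of the MATLAB error then contains `HTTP404`; check both against your own output before relying on them.

```matlab
% Sketch only: variable names and the 404 handling are illustrative assumptions.
formulas = ["C9H8O4"; "H2O"];            % example input; use your own list
baseUrl  = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/';
% urlencode protects characters such as '+' in charged formulas
buildUrl = @(f) [baseUrl, urlencode(char(f)), '/cids/JSON?list_return=cachekey'];
opts     = weboptions('Timeout', 15);    % seconds to wait for each response

hits = nan(numel(formulas), 1);          % NaN marks "request failed"
for k = 1:numel(formulas)
    try
        data    = webread(buildUrl(formulas(k)), opts);
        hits(k) = data.IdentifierList.Size;
    catch err
        if contains(err.identifier, 'HTTP404')
            hits(k) = 0;                 % assumed: 404 means nothing matched
        else
            warning('Request for %s failed: %s', char(formulas(k)), err.message);
        end
    end
    pause(0.205)                         % stay below 5 requests per second
end
disp(table(formulas, hits))
```

Entries that are still NaN after the loop can be retried later instead of being reported as "0 results found".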

More Answers (1)

Your best bet is probably to use one of the access methods that PubChem provides, as described on this page. Note the usage policy. With thousands of requests the run is likely to take minutes or longer; if that is too slow, the bulk data download functionality linked in the usage policy may be a better fit for your needs.
From the MATLAB side of things, the functions in this documentation category will likely be of use to you, as may the functions on this documentation page. [Before you ask: no, I don't have any examples specific to using those functions to access that database.]
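To put "minutes or longer" in numbers, here is a small back-of-the-envelope calculation. The list length of 5000 is a made-up example, and the limit of 5 requests per second is the one quoted from the usage policy in the comments below; the real run time will be longer because each request also takes time to answer.

```matlab
% Lower bound on the run time when throttling to the policy limit.
nFormulas  = 5000;                 % hypothetical list length
maxRate    = 5;                    % requests per second allowed by the policy
minSeconds = nFormulas / maxRate;  % 1000 s if every request were instantaneous
fprintf('At least %.1f minutes for %d requests\n', minSeconds/60, nFormulas);
```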

3 Comments

Thanks for your kind reply. The PubChem usage policy states: "we ask that any script or application not make more than 5 requests per second, in order to avoid overloading these servers".
I've tried with a simple test (also with the help of ChatGPT):
opt=weboptions("Timeout",5);
molecularFormula = 'C9H8O4'; % Example molecular formula
apiUrl = sprintf('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/formula/%s/cids/JSON', molecularFormula);
attempt = 1;
% Loop to check the status of the request
while attempt <= maxAttempts
% Fetch JSON response from PubChem API
jsonData = webread(apiUrl);
% Check if the request is still processing
if ~isfield(jsonData, 'Waiting') || isempty(jsonData.Waiting) %|| ~strcmpi(jsonData.Waiting, 'true')
break; % Exit loop if request is not waiting anymore
end
% Increment attempt counter
attempt = attempt + 1;
% Wait for some time before making the next attempt
pause(waitTime);
end
% Check if the request is still processing after the loop
if isfield(jsonData, 'Waiting') && ~isempty(jsonData.Waiting) && strcmpi(jsonData.Waiting, 'true')
disp('Your request is still processing. Please wait and try again later.');
return;
end
% Check if the request was successful
if isfield(jsonData, 'Fault')
disp(['Error: ', jsonData.Fault.Message]);
return;
end
% Step 3: Parse JSON response to extract the number of search results
numResults = 0; % Initialize number of results
if isfield(jsonData, 'IdentifierList') && isfield(jsonData.IdentifierList, 'CID')
numResults = numel(jsonData.IdentifierList.CID); % Number of search results
end
% Display the number of results
disp(['Number of results for molecular formula "', molecularFormula, '": ', num2str(numResults)]);
This doesn't work because it always reports:
>> jsonData
jsonData =
struct with fields:
Waiting: [1×1 struct]
>> jsonData.Waiting
ans =
struct with fields:
ListKey: '4044371352122785656'
Message: 'Your request is running'
Do you have any clue how to solve this?
Cheers
You haven't shown us what values you're using for the maxAttempts and waitTime variables in your code.
opt=weboptions("Timeout",5);
molecularFormula = 'C9H8O4'; % Example molecular formula
apiUrl = sprintf('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/formula/%s/cids/JSON', molecularFormula);
maxAttempts = 10; % Maximum number of attempts
waitTime = 5; % Time to wait between attempts (in seconds)
attempt = 1;
while attempt <= maxAttempts
jsonData = webread(apiUrl);
if ~isfield(jsonData, 'Waiting') || isempty(jsonData.Waiting) %|| ~strcmpi(jsonData.Waiting, 'true')
break; % Exit loop if request is not waiting anymore
end
attempt = attempt + 1;
pause(waitTime);
end
% Check if the request is still processing after the loop
if isfield(jsonData, 'Waiting') && ~isempty(jsonData.Waiting) && strcmpi(jsonData.Waiting, 'true')
disp('Your request is still processing. Please wait and try again later.');
return;
end
if isfield(jsonData, 'Fault')
disp(['Error: ', jsonData.Fault.Message]);
return;
end
numResults = 0; % Initialize number of results
if isfield(jsonData, 'IdentifierList') && isfield(jsonData.IdentifierList, 'CID')
numResults = numel(jsonData.IdentifierList.CID); % Number of search results
end
disp(['Number of results for molecular formula "', molecularFormula, '": ', num2str(numResults)]);
It doesn't really matter, actually. Most of the previous code was written by ChatGPT, but it's useless. The main lines are:
opt=weboptions("Timeout",5);
molecularFormula = 'C9H8O4'; % Example molecular formula
apiUrl = sprintf('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/formula/%s/cids/JSON', molecularFormula);
jsonData = webread(apiUrl);
If webread worked, maybe I would be able to find the information I am looking for. The problem, I think, is that the function launches the search but doesn't wait for the website to ‘load’ the result, so it returns ‘Your request is running’. Maybe I should find a way to launch the command, wait, and then check whether the webpage has 'loaded' the results. What do you think?
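One possible reading of the output shown earlier is that the `formula` search runs asynchronously: the first reply only contains a `Waiting.ListKey`, and asking for the same URL again starts a new search instead of fetching the result. A minimal sketch of the "launch, wait, then check" idea is to poll the returned key. The `listkey/<key>/cids/JSON` URL pattern is taken from my understanding of the PUG REST documentation and should be verified there; `base`, `opts`, `pollUrl` and the values of `maxAttempts` and `waitTime` are illustrative choices.

```matlab
% Sketch only: the listkey URL pattern and all settings are assumptions to verify.
molecularFormula = 'C9H8O4';
base        = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/';
opts        = weboptions('Timeout', 15);
maxAttempts = 10;   % give up after this many polls
waitTime    = 2;    % seconds between polls

numResults = NaN;   % NaN means "no answer obtained"
try
    % First request: starts the search and usually returns only a ListKey
    jsonData = webread([base, 'formula/', molecularFormula, '/cids/JSON'], opts);
    attempt  = 1;
    % While the server answers "Waiting", poll the ListKey it handed out
    while isfield(jsonData, 'Waiting') && attempt <= maxAttempts
        pause(waitTime);
        pollUrl  = [base, 'listkey/', char(jsonData.Waiting.ListKey), '/cids/JSON'];
        jsonData = webread(pollUrl, opts);
        attempt  = attempt + 1;
    end
    if isfield(jsonData, 'IdentifierList') && isfield(jsonData.IdentifierList, 'CID')
        numResults = numel(jsonData.IdentifierList.CID);
    end
catch err
    warning('PubChem request failed: %s', err.message);
end
fprintf('Number of results for %s: %g\n', molecularFormula, numResults);
```

The `fastformula` endpoint used in the accepted answer avoids this polling step, which is probably the simpler choice when only the number of matches is needed.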

Release

R2023a
