MATLAB Answers

Finding and Extracting Instances from a Text File

Nima on 15 Dec 2020
Commented: Nima on 16 Dec 2020
Hi All,
I have a large text file and I need to save every instance of a string that comes after a key_name and sits between specific characters.
For example, the file contains many strings of unknown length (they could be numbers, characters, or letters), but they all appear after a known key_name phrase and between known symbols. Example: key_name/':"/This_is_the_string"/
The format always repeats like this. I want to search for < key_name/':"/ > and store in an array whatever comes after the searched phrase, up to the next < "/ > character set. There are no spaces in the text file.
Thank you for your help
UPDATE:
A sample text file is attached. I'd like the MATLAB script to return an array called "answer" with three values:
answer = [abc def3ghi JK]

  2 Comments

Matt Gaidica on 16 Dec 2020
Can you attach a sample file? It will help cover all cases when someone tries to help.


Accepted Answer

Walter Roberson on 16 Dec 2020
Edited: Walter Roberson on 16 Dec 2020
Another approach is to use regexp with named tokens:
S = fileread('sample_text.txt');
t = regexp(S, 'keyname/'':"(?<kn>[^"]*)', 'names')
This should return a struct array with a field kn, where each element holds one captured value.
Subhadeep's use of regexp is fine for the exact task you laid out, but if you later want to capture the key names along with their corresponding values, the named-token approach is easier to extend.
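To get from the struct array to the flat list the question asks for, the kn fields can be collected into a cell array. This is a minimal sketch: the inline string stands in for the file and is made up for illustration, and the second pattern, with its hypothetical `key` token, only suggests how named tokens could be extended to capture the key as well.

```matlab
% Stand-in for fileread('sample_text.txt'); contents are invented for illustration.
S = 'keyname/'':"abc"/keyname/'':"def3ghi"/keyname/'':"JK"/';

% One struct element per match, each with a field kn.
t = regexp(S, 'keyname/'':"(?<kn>[^"]*)', 'names');

% Collect the captured values into a 1-by-N cell array of char vectors.
answer = {t.kn};

% Hypothetical extension: capture the key in a second named token.
kv = regexp(S, '(?<key>\w+)/'':"(?<kn>[^"]*)', 'names');
keyList = {kv.key};
```

If a string array is preferred over a cell array, `string(answer)` should convert it in releases that support strings.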

  1 Comment

Nima on 16 Dec 2020
Walter,
Thank you for your answer. Your script worked great. As I replied on Subhadeep's answer, his code also found all instances, but it took a few hours to run, whereas your script did the job in less than 5 seconds. I will use this one.
Thanks


More Answers (1)

Subhadeep Koley on 16 Dec 2020
Hi, the code below might help.
% Open the file
fid = fopen('sample_text.txt');
% Read the file by character
str = fread(fid, 'uint8=>char');
% Close the file
fclose(fid);
% Search for <key_name/':"/> by regular expression matching
[~, endIdx] = regexpi(str', 'key_name/'':"/');
% Extract the value that follows each match, up to the next " or /
result = "";
for ind = 1:length(endIdx)
    temp = string(str(endIdx(ind):end)');
    result(ind) = strtok(temp, '"/');
end

  3 Comments

Nima on 16 Dec 2020
Thank you, Subhadeep. The code worked well and found all instances, about 14,000 cases. A few notes:
  • The text file has around 6.5 million lines, and the code took more than two hours to scan it. Is there a more efficient way to do this?
  • I am not sure how the code captures and stores the characters, but when I saved the final string variable with its 14k values to .csv and .txt files, the saved files did not show the names that I see in MATLAB. I guess there are some Unicode issues. I assume this can be resolved with a simple conversion of the final results in MATLAB.
Thank you for your answer.
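One possible cause of the garbled names, offered as an assumption because the thread does not state the file's encoding: fread(fid, 'uint8=>char') maps each byte to one character, so a multi-byte UTF-8 sequence would be split into several wrong characters. If the file is UTF-8, decoding the raw bytes with native2unicode and opening the output file with an explicit UTF-8 encoding may help. The sample values and the output name results.txt are hypothetical.

```matlab
% Hypothetical extracted values, including one non-ASCII name.
vals = {'abc', ['na' char(239) 've'], 'JK'};   % char(239) is i with diaeresis

% Write one value per line, requesting UTF-8 output explicitly.
outFile = fullfile(tempdir, 'results.txt');
fid = fopen(outFile, 'w', 'n', 'UTF-8');
fprintf(fid, '%s\n', vals{:});
fclose(fid);

% Read back: decode the raw bytes as UTF-8 instead of casting bytes to char.
fid = fopen(outFile, 'r');
bytes = fread(fid, Inf, 'uint8=>uint8')';
fclose(fid);
txt = native2unicode(bytes, 'UTF-8');
```

The same native2unicode step could replace the 'uint8=>char' read in the answer above, again assuming the input really is UTF-8.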
Walter Roberson on 16 Dec 2020
regexp() with 'match' option would have cut out a lot of code. Also, regexp() can work on entire long character vectors, not restricted to line-by-line.
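A sketch of the single-call version this comment points toward. It uses the 'tokens' option so that only the captured value is returned; with 'match', each result would also include the key prefix. It assumes the literal delimiters from the question, key_name/':"/ before and "/ after, and uses a short made-up string in place of the file.

```matlab
% Stand-in for the file contents; invented for illustration.
S = 'key_name/'':"/abc"/key_name/'':"/def3ghi"/key_name/'':"/JK"/';

% Lazy capture of everything between the key prefix and the next "/ pair.
tok = regexp(S, 'key_name/'':"/(.*?)"/', 'tokens');

% Each match is a 1-by-1 cell; flatten into a 1-by-N cell array.
result = [tok{:}];
```

Because regexp scans the whole character vector once, this avoids copying the remainder of the file into a new string on every loop iteration, which is a likely reason the loop-based version was slow on a large file.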

Release

R2017a
