Collect information from files with the same kind of text pattern

Dear all
I have text files of the same kind as the one attached to this question, stored in different folders on my MATLAB path. As you can see, from line 42 to the end of the file there are text blocks like:
Cr1 Cr2 ( 0, 0, 0) 2.6832 ( 0.000, -4.001, 0.000) 4.001
J_iso: 2.6832
[Testing!] Jprime: 10.082, B: -4.132
[Testing!] DMI: ( 0.0635 0.0000 -0.0306)
[Testing!]J_ani:
[[-1.401 0. 0.396]
[ 0. -0.327 0. ]
[ 0.396 0. -8.263]]
I would like MATLAB to collect information from this kind of pattern as follows:
The first line of the example block above ends with the number 4.001. Other blocks also have a first line ending in that number, while others end in different values, such as 6.930. For each block whose first line ends in 4.001 (and likewise for 6.930), I would like MATLAB to create a row with the following information:
(i) First column: index of the first Crx (in the example above, 1)
(ii) Second column: index of the second Crx (in the example above, 2)
(iii) Third to fifth columns: the three elements in the first parenthesis of the first line (in the example above, 0 0 0)
(iv) Sixth to eighth columns: the three elements inside the parenthesis on the fourth line of the text block, that is, what follows the text "[Testing!] DMI:" (in this case, 0.0635 0.0000 -0.0306).
So the first row would be 1 2 0 0 0 0.0635 0.0000 -0.0306. The same applies to all blocks whose first line ends in 6.930.
Any idea on how to do this efficiently?

Accepted Answer

Stephen23 on 30 Jul 2024
Edited: Stephen23 on 30 Jul 2024
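% Read the whole file into one character vector.
txt = fileread('file.txt');
% Eight tokens per block: two Cr indices, three cell offsets, three DMI components.
% The '@' is a placeholder that STRREP swaps for the value to look for.
rgx = 'Cr(\d+)\s*Cr(\d+)\s*\(\s*(\S+),\s*(\S+),\s*(\S+)\).+@.*\n.+\n.+\n[^:]*:\s*\(\s*(\S+)\s+(\S+)\s+(\S+)\)';
tkn = regexp(txt,strrep(rgx,'@','4.001'),'tokens','dotexceptnewline');
% Stack the per-match token cells into one N-by-8 cell array.
tkn = vertcat(tkn{:})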
tkn = 24x8 cell array
    {'1'}    {'2'}    {'0' }    {'0' }    {'0'}    {'0.0635' }    {'0.0000' }    {'-0.0306'}
    {'2'}    {'1'}    {'0' }    {'0' }    {'0'}    {'-0.0635'}    {'-0.0000'}    {'0.0306' }
    {'3'}    {'4'}    {'0' }    {'0' }    {'0'}    {'-0.0638'}    {'-0.0002'}    {'0.0252' }
    {'4'}    {'3'}    {'0' }    {'0' }    {'0'}    {'0.0638' }    {'0.0002' }    {'-0.0252'}
    {'2'}    {'1'}    {'-1'}    {'-1'}    {'0'}    {'0.0316' }    {'-0.0550'}    {'0.0306' }
    {'4'}    {'3'}    {'-1'}    {'-1'}    {'0'}    {'-0.0316'}    {'0.0552' }    {'-0.0252'}
    {'1'}    {'2'}    {'1' }    {'1' }    {'0'}    {'-0.0316'}    {'0.0550' }    {'-0.0306'}
    {'3'}    {'4'}    {'1' }    {'1' }    {'0'}    {'0.0316' }    {'-0.0552'}    {'0.0252' }
    {'2'}    {'1'}    {'0' }    {'-1'}    {'0'}    {'0.0315' }    {'0.0548' }    {'0.0306' }
    {'4'}    {'3'}    {'0' }    {'-1'}    {'0'}    {'-0.0316'}    {'-0.0550'}    {'-0.0251'}
    {'1'}    {'2'}    {'0' }    {'1' }    {'0'}    {'-0.0315'}    {'-0.0548'}    {'-0.0306'}
    {'3'}    {'4'}    {'0' }    {'1' }    {'0'}    {'0.0316' }    {'0.0550' }    {'0.0251' }
    {'1'}    {'4'}    {'0' }    {'0' }    {'0'}    {'-0.0001'}    {'-0.0000'}    {'-0.0001'}
    {'2'}    {'3'}    {'0' }    {'0' }    {'0'}    {'0.0001' }    {'-0.0001'}    {'0.0001' }
    {'3'}    {'2'}    {'0' }    {'0' }    {'0'}    {'-0.0001'}    {'0.0001' }    {'-0.0001'}
    {'4'}    {'1'}    {'0' }    {'0' }    {'0'}    {'0.0001' }    {'0.0000' }    {'0.0001' }
    {'1'}    {'2'}    {'-1'}    {'0' }    {'0'}    {'0.0197' }    {'-0.0340'}    {'-0.0618'}
    {'3'}    {'4'}    {'-1'}    {'0' }    {'0'}    {'-0.0200'}    {'0.0345' }    {'0.0608' }
    {'2'}    {'1'}    {'1' }    {'0' }    {'0'}    {'-0.0197'}    {'0.0340' }    {'0.0618' }
    {'4'}    {'3'}    {'1' }    {'0' }    {'0'}    {'0.0200' }    {'-0.0345'}    {'-0.0608'}
    {'2'}    {'1'}    {'-1'}    {'0' }    {'0'}    {'-0.0197'}    {'-0.0341'}    {'0.0618' }
    {'4'}    {'3'}    {'-1'}    {'0' }    {'0'}    {'0.0200' }    {'0.0346' }    {'-0.0608'}
    {'1'}    {'2'}    {'1' }    {'0' }    {'0'}    {'0.0197' }    {'0.0341' }    {'-0.0618'}
    {'3'}    {'4'}    {'1' }    {'0' }    {'0'}    {'-0.0200'}    {'-0.0346'}    {'0.0608' }
tkn = regexp(txt,strrep(rgx,'@','6.930'),'tokens','dotexceptnewline');
tkn = vertcat(tkn{:})
tkn = 40x8 cell array
    {'1'}    {'1'}    {'-1'}    {'-1'}    {'0'}    {'0.0448' }    {'0.0746' }    {'0.0808' }
    {'2'}    {'2'}    {'-1'}    {'-1'}    {'0'}    {'-0.0422'}    {'-0.0761'}    {'-0.0808'}
    {'3'}    {'3'}    {'-1'}    {'-1'}    {'0'}    {'0.0423' }    {'0.0761' }    {'0.0808' }
    {'4'}    {'4'}    {'-1'}    {'-1'}    {'0'}    {'-0.0448'}    {'-0.0747'}    {'-0.0808'}
    {'1'}    {'1'}    {'1' }    {'1' }    {'0'}    {'-0.0448'}    {'-0.0746'}    {'-0.0808'}
    {'2'}    {'2'}    {'1' }    {'1' }    {'0'}    {'0.0422' }    {'0.0761' }    {'0.0808' }
    {'3'}    {'3'}    {'1' }    {'1' }    {'0'}    {'-0.0423'}    {'-0.0761'}    {'-0.0808'}
    {'4'}    {'4'}    {'1' }    {'1' }    {'0'}    {'0.0448' }    {'0.0747' }    {'0.0808' }
    {'1'}    {'1'}    {'0' }    {'-1'}    {'0'}    {'-0.0422'}    {'0.0761' }    {'-0.0808'}
    {'2'}    {'2'}    {'0' }    {'-1'}    {'0'}    {'0.0448' }    {'-0.0746'}    {'0.0808' }
    {'3'}    {'3'}    {'0' }    {'-1'}    {'0'}    {'-0.0447'}    {'0.0747' }    {'-0.0808'}
    {'4'}    {'4'}    {'0' }    {'-1'}    {'0'}    {'0.0423' }    {'-0.0761'}    {'0.0808' }
    {'1'}    {'1'}    {'0' }    {'1' }    {'0'}    {'0.0422' }    {'-0.0761'}    {'0.0808' }
    {'2'}    {'2'}    {'0' }    {'1' }    {'0'}    {'-0.0448'}    {'0.0746' }    {'-0.0808'}
    {'3'}    {'3'}    {'0' }    {'1' }    {'0'}    {'0.0447' }    {'-0.0747'}    {'0.0808' }
    {'4'}    {'4'}    {'0' }    {'1' }    {'0'}    {'-0.0423'}    {'0.0761' }    {'-0.0808'}
    {'1'}    {'1'}    {'-1'}    {'0' }    {'0'}    {'0.0870' }    {'-0.0015'}    {'-0.0808'}
    {'2'}    {'2'}    {'-1'}    {'0' }    {'0'}    {'-0.0870'}    {'-0.0015'}    {'0.0808' }
    {'3'}    {'3'}    {'-1'}    {'0' }    {'0'}    {'0.0871' }    {'0.0014' }    {'-0.0808'}
    {'4'}    {'4'}    {'-1'}    {'0' }    {'0'}    {'-0.0870'}    {'0.0014' }    {'0.0808' }
    {'1'}    {'1'}    {'1' }    {'0' }    {'0'}    {'-0.0870'}    {'0.0015' }    {'0.0808' }
    {'2'}    {'2'}    {'1' }    {'0' }    {'0'}    {'0.0870' }    {'0.0015' }    {'-0.0808'}
    {'3'}    {'3'}    {'1' }    {'0' }    {'0'}    {'-0.0871'}    {'-0.0014'}    {'0.0808' }
    {'4'}    {'4'}    {'1' }    {'0' }    {'0'}    {'0.0870' }    {'-0.0014'}    {'-0.0808'}
    {'1'}    {'2'}    {'-1'}    {'0' }    {'0'}    {'0.0197' }    {'-0.0340'}    {'-0.0618'}
    {'3'}    {'4'}    {'-1'}    {'0' }    {'0'}    {'-0.0200'}    {'0.0345' }    {'0.0608' }
    {'2'}    {'1'}    {'1' }    {'0' }    {'0'}    {'-0.0197'}    {'0.0340' }    {'0.0618' }
    {'4'}    {'3'}    {'1' }    {'0' }    {'0'}    {'0.0200' }    {'-0.0345'}    {'-0.0608'}
    {'2'}    {'1'}    {'-1'}    {'0' }    {'0'}    {'-0.0197'}    {'-0.0341'}    {'0.0618' }
    {'4'}    {'3'}    {'-1'}    {'0' }    {'0'}    {'0.0200' }    {'0.0346' }    {'-0.0608'}
Use STR2DOUBLE to convert to numeric:
mat = str2double(tkn)
mat = 40x8
    1.0000    1.0000   -1.0000   -1.0000         0    0.0448    0.0746    0.0808
    2.0000    2.0000   -1.0000   -1.0000         0   -0.0422   -0.0761   -0.0808
    3.0000    3.0000   -1.0000   -1.0000         0    0.0423    0.0761    0.0808
    4.0000    4.0000   -1.0000   -1.0000         0   -0.0448   -0.0747   -0.0808
    1.0000    1.0000    1.0000    1.0000         0   -0.0448   -0.0746   -0.0808
    2.0000    2.0000    1.0000    1.0000         0    0.0422    0.0761    0.0808
    3.0000    3.0000    1.0000    1.0000         0   -0.0423   -0.0761   -0.0808
    4.0000    4.0000    1.0000    1.0000         0    0.0448    0.0747    0.0808
    1.0000    1.0000         0   -1.0000         0   -0.0422    0.0761   -0.0808
    2.0000    2.0000         0   -1.0000         0    0.0448   -0.0746    0.0808
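One thing to watch in the outputs above: rows such as 1 2 -1 0 0 0.0197 -0.0340 -0.0618 appear under both 4.001 and 6.930. That would be consistent with `.+@.*` accepting the value anywhere on the first line (for example inside the second parenthesis) rather than only as its last element, although without the attached file this is only a guess. The unescaped dot in '4.001' also matches any character. Below is a minimal sketch of a stricter variant, assuming the block layout shown in the question. The sample text in `txt` is made up for illustration (the numbers on the second block's first line are invented), and the names `val`, `head`, `tail`, `esc`, `loose` and `strict` are not from the original answer.

```matlab
% Made-up sample: block 1 ends in 4.001; block 2 merely contains 4.001
% inside its second parenthesis and ends in 8.002.
txt = sprintf('%s\n', ...
    'Cr1 Cr2 ( 0, 0, 0) 2.6832 ( 0.000, -4.001, 0.000) 4.001', ...
    'J_iso: 2.6832', ...
    '[Testing!] Jprime: 10.082, B: -4.132', ...
    '[Testing!] DMI: ( 0.0635 0.0000 -0.0306)', ...
    'Cr1 Cr2 ( -1, 0, 0) 0.1000 ( -6.930, -4.001, 0.000) 8.002', ...
    'J_iso: 0.1000', ...
    '[Testing!] Jprime: 1.000, B: -1.000', ...
    '[Testing!] DMI: ( 0.0197 -0.0340 -0.0618)');

val  = '4.001';   % value wanted as the last element of the first line
head = 'Cr(\d+)\s*Cr(\d+)\s*\(\s*(\S+),\s*(\S+),\s*(\S+)\)';
tail = '.+\n.+\n[^:]*:\s*\(\s*(\S+)\s+(\S+)\s+(\S+)\)';

% Pattern from the answer: the value may sit anywhere on the first line.
loose = regexp(txt, [head '.+' val '.*\n' tail], 'tokens', 'dotexceptnewline');

% Stricter variant: escape the dot and require the value, optionally
% followed by blanks, directly before the line break.
esc    = regexptranslate('escape', val);
strict = regexp(txt, [head '.*[ \t]' esc '[ \t]*\r?\n' tail], ...
    'tokens', 'dotexceptnewline');

mat = str2double(vertcat(strict{:}));
fprintf('loose: %d match(es), strict: %d match(es)\n', numel(loose), numel(strict));
```

On this sample the loose pattern picks up both blocks, while the stricter one keeps only the block whose first line actually ends in 4.001. Applied to the real file, the stricter pattern may return fewer rows than the 24 and 40 shown above.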
  2 Comments
Roderick on 31 Jul 2024
Thank you very much, @Stephen23! I was wondering whether similar logic could be used to save the first two columns of the two lines following the line "Cell (Angstrom):". Also, I imagine that if I am only interested in saving to the tkn variable the cases with "Cr3 Cr4 ( 0, 0, 0) ... 4.001", I just need to change the rgx variable to
rgx='Cr3\s*Cr4\s*\(\s*0,\s*0,\s*0\).+@.*\n.+\n.+\n[^:]*:\s*\(\s*(\S+)\s+(\S+)\s+(\S+)\)';
right?
Stephen23 on 31 Jul 2024
Edited: Stephen23 on 31 Jul 2024
@Richard Wood: that looks like it should work.
As an alternative, rather than filtering your data by modifying the regular expression, it may be easier and more robust to extract all of those fields with one regular expression and then filter the numeric data afterwards, e.g. with logical indexing.
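The filter-afterwards approach could look roughly like this. It is a sketch only: `mat` is a small hand-typed stand-in holding three rows copied from the 4.001 output above (in practice it would come from STR2DOUBLE applied to the full token array), and the names `isPair`, `isCell0` and `dmi` are illustrative.

```matlab
% Stand-in for the numeric matrix: three rows copied from the 4.001 output.
% Columns: Cr index 1, Cr index 2, three cell offsets, three DMI components.
mat = [1 2 0 0 0  0.0635  0.0000 -0.0306
       3 4 0 0 0 -0.0638 -0.0002  0.0252
       3 4 1 1 0  0.0316 -0.0552  0.0252];

isPair  = mat(:,1) == 3 & mat(:,2) == 4;   % rows for the Cr3 Cr4 pair
isCell0 = all(mat(:,3:5) == 0, 2);         % rows with offset (0, 0, 0)
dmi     = mat(isPair & isCell0, 6:8);      % DMI components of the matching rows
```

Keeping the regular expression general means a different pair or offset only requires changing the two comparison lines, not the pattern.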
