MATLAB Answers


How to read xml file with binary data into Matlab? (VTK/VTU File)

Asked by Richard Crozier on 4 Aug 2016
Latest activity Commented on by Richard Crozier on 23 Nov 2018 at 10:53
First of all, I've also asked this question on another site but not had any luck, so I thought I'd try here too. I'll cross-post any answer to either site.
I would like to be able to read in an XML-format file which has a section containing binary data. An example file is shown below:
<?xml version="1.0"?>
<VTKFile type="UnstructuredGrid" version="0.1" byte_order="LittleEndian">
<UnstructuredGrid>
<Piece NumberOfPoints="1941" NumberOfCells="11339">
<PointData>
<DataArray type="Float64" Name="magnetic field strength" NumberOfComponents="3" format="appended" offset="0"/>
<DataArray type="Float64" Name="magnetic flux density" NumberOfComponents="3" format="appended" offset="46588"/>
<DataArray type="Float64" Name="magnetic vector potential" NumberOfComponents="3" format="appended" offset="93176"/>
</PointData>
<CellData>
<DataArray type="Int32" Name="GeometryIds" format="appended" offset="139764"/>
</CellData>
<Points>
<DataArray type="Float64" NumberOfComponents="3" format="appended" offset="185124"/>
</Points>
<Cells>
<DataArray type="Int32" Name="connectivity" format="appended" offset="231712"/>
<DataArray type="Int32" Name="offsets" format="appended" offset="403396"/>
<DataArray type="Int32" Name="types" format="appended" offset="448756"/>
</Cells>
</Piece>
</UnstructuredGrid>
<AppendedData encoding="raw">
_XF@Loû1q@`@!?V7^W@9DCz@bd@Yb@r <snip>
</AppendedData>
</VTKFile>
This is a VTK data file, specifically the unstructured grid type, for which the .vtu extension is used. The format is normal XML, but with an AppendedData section containing an underscore followed by binary data. The XML describes where each of the data sequences start and end in this data.
Matlab's xmlread can't read this file, I presume because of the binary portion. I get the error below:
[Fatal Error] elmer_3d_magnet_mesh.dat0001.vtu:24:1: Invalid byte 1 of 1-byte UTF-8 sequence.
Error using xmlread (line 97)
Java exception occurred:
org.xml.sax.SAXParseException; systemId: file:/home/rcrozier/Sync/cad_models/elmer_3D_magnet/elmer_3d_magnet_mesh/elmer_3d_magnet_mesh.dat0001.vtu; lineNumber: 24;
columnNumber: 1; Invalid byte 1 of 1-byte UTF-8 sequence.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
However, I can successfully read in the XML portion of the file (using fgetl to read up to the AppendedData tag). I can then create a temporary XML file by adding the missing closing tags, read this in using xmlread, and parse the XML to determine the data structure. This just leaves reading in the binary portion. After reading the XML data, fgetl leaves me at the file position corresponding to the start of the line with the underscore.
How can I ignore the underscore character, then read in the binary data?
Actually, it is the 'ignoring the underscore character' part that is proving difficult, as I can't figure out how to do this without knowing the character encoding of the file (file -bi returns application/xml; charset=binary on one example).
In case it's of interest, the actual vtk file format specification can be found here (pdf)
Below is the code to get the first text xml part of the file with fgetl
% open the file
fid = fopen(filename, 'r');
% close file when we're done
CC = onCleanup (@() fclose(fid));
xmlstrs = {fgetl(fid)};
ind = 1; % line counter (named ind so the built-in find is not shadowed)
while ischar (xmlstrs{ind})
    ind = ind + 1;
    xmlstrs{ind,1} = fgetl(fid);
    if ~isempty(strfind (xmlstrs{ind,1}, 'AppendedData'))
        xmlstrs = [ xmlstrs; {'</AppendedData>'; '</VTKFile>'} ];
        % could get file position like this? how many bytes?
        datapos = ftell (fid) + 4;
        break;
    end
end
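The temporary-file step described above could look roughly like the sketch below. It is only an illustration: the header lines are a hand-written stand-in for the contents of xmlstrs, and the names tmpfile and arrays are made up for this example.

```matlab
% Stand-in for the header lines collected with fgetl, with the closing
% tags already appended (assumed content, not read from a real file).
xmlstrs = { ...
    '<?xml version="1.0"?>'; ...
    '<VTKFile type="UnstructuredGrid" version="0.1" byte_order="LittleEndian">'; ...
    '<UnstructuredGrid>'; ...
    '<Piece NumberOfPoints="1941" NumberOfCells="11339">'; ...
    '<Points>'; ...
    '<DataArray type="Float64" NumberOfComponents="3" format="appended" offset="185124"/>'; ...
    '</Points>'; ...
    '</Piece>'; ...
    '</UnstructuredGrid>'; ...
    '<AppendedData encoding="raw">'; ...
    '</AppendedData>'; ...
    '</VTKFile>'};

% Write the text-only header to a temporary file that xmlread can parse.
tmpfile = [tempname(), '.xml'];
tfid = fopen(tmpfile, 'w');
fprintf(tfid, '%s\n', xmlstrs{:});
fclose(tfid);

doc = xmlread(tmpfile);
delete(tmpfile);

% Collect the type and offset of every DataArray element.
nodes = doc.getElementsByTagName('DataArray');
arrays = struct('type', {}, 'offset', {});
for k = 0:nodes.getLength()-1          % Java node lists are zero-based
    node = nodes.item(k);
    arrays(end+1).type = char(node.getAttribute('type')); %#ok<AGROW>
    arrays(end).offset = str2double(char(node.getAttribute('offset')));
end
```

Each entry of arrays then holds what is needed to locate one block in the binary section, relative to the first byte after the underscore.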

  1 Comment

I should add that solutions which require mex files are acceptable!


2 Answers

Answer by Richard Crozier on 5 Aug 2016
 Accepted Answer

The answer to determining the position was to calculate the number of bytes per character from the first line read in, like so:
% open the file
fid = fopen(filename, 'r');
% close file when we're done
CC = onCleanup (@() fclose(fid));
xmlstrs = {fgetl(fid)};
firstlinebytes = ftell (fid) - 1;
bytesperchar = round (firstlinebytes / numel (xmlstrs{1}));
then the position of the first byte in the data section is
datapos = ftell (fid) + bytesperchar;
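A different way to find the same position, which avoids estimating the bytes per character, is to read the file as raw bytes and search for the tag. This is only a sketch and not the method used above: it assumes the tag name is stored as single-byte ASCII and that the first underscore after the opening tag marks the start of the data. The small file written at the top is a made-up stand-in for a real .vtu file.

```matlab
% Build a small stand-in file: text header, '_' and then binary bytes.
fname = [tempname(), '.vtu'];
fid = fopen(fname, 'w');
fwrite(fid, uint8(sprintf('<VTKFile>\n<AppendedData encoding="raw">\n  _')), 'uint8');
fwrite(fid, uint8([8 0 0 0 1 2 3 4 5 6 7 8]), 'uint8');
fwrite(fid, uint8(sprintf('\n</AppendedData>\n</VTKFile>\n')), 'uint8');
fclose(fid);

% Read the whole file as bytes so that no character decoding takes place.
fid = fopen(fname, 'r');
bytes = fread(fid, Inf, '*uint8')';
fclose(fid);
delete(fname);

txt = char(bytes);                        % one char per byte, no decoding
tagpos = strfind(txt, '<AppendedData');   % start of the opening tag
tagend = tagpos(1) + find(txt(tagpos(1):end) == '>', 1) - 1;
uscore = tagend + find(txt(tagend+1:end) == '_', 1);  % 1-based index of '_'

% fseek/ftell offsets are zero-based, so the 1-based index of '_' equals
% the zero-based offset of the first data byte.
datapos = uscore;

% Decode the first little-endian 32-bit block length by hand.
nbytes = double(bytes(datapos+1:datapos+4)) * 256.^(0:3)';
```

Because the search is done on bytes, any indentation before the underscore is skipped automatically. The cost is that the whole file is held in memory.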
Note that this isn't the whole answer to reading 'raw' type data in the AppendedData section, which is poorly documented. You will find more info on the format of 'raw' (rather than 'base64') data here, but the short answer is that it's encoded as follows:
_NNNN<data>NNNN<data>NNNN<data>
 ^         ^         ^
 1         2         3
where each "NNNN" is an unsigned 32-bit integer, and <data> consists of
a number of bytes equal to the preceding NNNN value. The corresponding
DataArray elements must have format="appended" and offset attributes
equal to the following:
1.) offset="0"
2.) offset="(4+NNNN1)"
3.) offset="(4+NNNN1+4+NNNN2)"
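The layout above can be tried out on a small synthetic file. The sketch below writes two length-prefixed blocks after an underscore and reads the second one back using the offset rule. The array values and variable names are invented for the example, and little-endian byte order is assumed, as in the file from the question.

```matlab
% Two example arrays standing in for real mesh data (assumed values).
a1 = [1.5 2.5 3.5];            % stored as Float64 -> 24 bytes
a2 = int32([10 20 30 40]);     % stored as Int32   -> 16 bytes

% Write '_' followed by NNNN<data> blocks, little-endian.
fname = [tempname(), '.bin'];
fid = fopen(fname, 'w', 'ieee-le');
fwrite(fid, uint8('_'), 'uint8');
fwrite(fid, 8 * numel(a1), 'uint32');
fwrite(fid, a1, 'float64');
fwrite(fid, 4 * numel(a2), 'uint32');
fwrite(fid, a2, 'int32');
fclose(fid);

% Offsets as they would appear in the DataArray elements.
offset1 = 0;
offset2 = 4 + 8 * numel(a1);   % 4-byte length header + first data block

% Read the second block back: seek, read the byte count, then the values.
datapos = 1;                   % first byte after the underscore
fid = fopen(fname, 'r', 'ieee-le');
fseek(fid, datapos + offset2, 'bof');
nbytes = fread(fid, 1, 'uint32');
b2 = fread(fid, nbytes / 4, 'int32=>int32')';
fclose(fid);
delete(fname);
```

The same rule can be checked against the header in the question: the first array holds 1941 points * 3 components * 8 bytes = 46584 bytes, so the second offset is 4 + 46584 = 46588, which matches the offset attribute of the second DataArray.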

  3 Comments

Hi Richard,
This was of great help. To help the others I will post my code, inspired by yours, for reading a vtk file containing a PolyData.
function [v,f,s] = read_vtkpoly(fname,scalar_name)
% Read vertices v, triangular faces f and point scalar s from a VTK
% PolyData file whose arrays are stored as raw appended data.
fid = fopen(fname,'r');
if fid < 0
    error('read_vtkpoly:open','Could not open file %s',fname);
end
CC = onCleanup(@() fclose(fid)); % close the file when we're done
fline = {fgetl(fid)};
firstlinebytes = ftell(fid)-1;
bytesperchar = round(firstlinebytes / numel (fline{1}));
datapos = [];
% collect the text header up to the opening AppendedData tag
while ischar(fline{end})
    fline{end+1} = fgetl(fid);
    if ischar(fline{end}) && ~isempty(strfind(fline{end},'<AppendedData encoding="raw">'))
        % the factor 4 appears to assume three characters of indentation
        % before the underscore; adjust it if your files differ
        datapos = ftell(fid) + bytesperchar*4;
        break
    end
end
if isempty(datapos)
    error('read_vtkpoly:format','No raw AppendedData section found');
end
header_type = lower(get_string(fline,'header_type'));
if isempty(header_type)
    header_type = 'uint32'; % block length type when the attribute is absent
end
num_vertices = get_int(fline,'NumberOfPoints');
num_faces = get_int(fline,'NumberOfPolys');
scalar = get_dataarray(fline,scalar_name);
vertices = get_dataarray(fline,'Points');
faces = get_nameless_dataarray(fline,'Polys','connectivity');
% each block starts with its byte count, followed by the values
fseek(fid,datapos+scalar.offset,-1);
scalar.nbytes = fread(fid,1,header_type);
s = fread(fid,num_vertices,scalar.dtype);
fseek(fid,datapos+vertices.offset,-1);
vertices.nbytes = fread(fid,1,header_type);
v = fread(fid,num_vertices*3,vertices.dtype);
fseek(fid,datapos+faces.offset,-1);
faces.nbytes = fread(fid,1,header_type);
f = fread(fid,num_faces*3,faces.dtype); % assumes every face is a triangle
v = reshape(v,[3 num_vertices])';
f = reshape(f,[3 num_faces])' + 1; % VTK indices are zero-based, MATLAB's start at one
end
function mystr = get_string(fline,ziel)
% value of the attribute ziel="..." on the first header line containing it
mystr = '';
line_nb = find(contains(fline,ziel),1);
if ~isempty(line_nb)
    regex_str = ['(?<=' ziel '\=\")[\w]+'];
    mystr = regexp(fline{line_nb},regex_str,'match','once');
end
end
function myint = get_int(fline,ziel)
% integer value of the attribute ziel="..."
line_nb = find(contains(fline,ziel),1);
regex_str = ['(?<=' ziel '\=\")[\d]+'];
myint = str2double(regexp(fline{line_nb},regex_str,'match','once'));
end
function darray = get_dataarray(fline,dname)
% DataArray identified by its Name attribute alone
line_nb = find(contains(fline,['Name="' dname '"' ]),1);
darray = parse_dataarray(fline{line_nb});
end
function darray = get_nameless_dataarray(fline,group,dname)
% DataArray called dname that lies between <group> and </group>
line_open = find(contains(fline,['<' group '>']),1);
line_close = find(contains(fline,['</' group '>']),1);
line_nb = find(contains(fline,['Name="' dname '"' ]));
line_nb = line_nb(line_nb<line_close & line_nb>line_open);
darray = parse_dataarray(fline{line_nb(1)});
end
function darray = parse_dataarray(str)
% extract type, value range and offset from one DataArray header line;
% min and max are NaN when the RangeMin/RangeMax attributes are absent
num = '[+-]?\d*\.?\d*(?:[eE][+-]?\d+)?';
darray.dtype = lower(regexp(str,'(?<=type\=\")[\w]+','match','once'));
darray.min = str2double(regexp(str,['(?<=RangeMin\=\")' num],'match','once'));
darray.max = str2double(regexp(str,['(?<=RangeMax\=\")' num],'match','once'));
darray.offset = str2double(regexp(str,'(?<=offset\=\")[\d]+','match','once'));
end
function bytes_size = get_nbytes(dtype)
% size in bytes of one value of class dtype (helper, not called above)
a = cast(1,dtype);
b = whos('a');
bytes_size = b.bytes;
end
Thanks for posting a function! However, I was trying to use your code and it doesn't recognize the function "contains". Is that one of your own functions, or is it part of a toolbox that I don't have?
contains is a recent MATLAB addition (introduced in R2016b).
The equivalent for older MATLAB versions, for a single character vector, is
~isempty (strfind (TEXT,PATTERN))
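When TEXT is a cell array of lines, as fline is in the function above, strfind returns one cell per line, so the emptiness test has to be applied per element. A minimal sketch, with a made-up header in hdr:

```matlab
% Example cell array of header lines (assumed content).
hdr = {'<Piece NumberOfPoints="4">', '<Points>', '</Points>'};

% Logical mask equivalent to contains(hdr, 'NumberOfPoints').
mask = ~cellfun(@isempty, strfind(hdr, 'NumberOfPoints'));

% Indices of the matching lines, as find(contains(...)) would return.
line_nb = find(mask);
```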



Answer by Zakia Tasnim on 18 Oct 2018

Where should I put the file name?

  2 Comments

function [v,f,s] = read_vtkpoly(fname,scalar_name)
v are the vertices, f the faces, and s the scalar field you want to retrieve.
fname is the file name and scalar_name is the name of the variable to be read.
Why have you put this as an answer to the question???
