Importing large .csv by READTABLE gives wrong results

Question

Weronika Huss on 17 Jul 2023

0
Link

Direct link to this question

https://uk.mathworks.com/matlabcentral/answers/1997198-importing-large-csv-by-readtable-gives-wrong-results

Answered: Abhas on 9 Oct 2024

Comb_16_20230513 - Kopia.csv

Hello everyone !

I have a lot of .csv files (example attached). I need to extract column F and G from them. In col F some cells are empty and in col G it's reverse.

My script is as follows:

day = input('Give day ','s');
month = input('Give month ','s');
year = '2023';
filename = ['L:\XT4Comb_16_' [year month day] '.csv']    % I have a lot of these files with subsequent dates in their names
tablica = readtable(filename,'Delimiter',',','ReadVariableNames',true,'HeaderLines',0,'Format','%*s%*s%*s%*s%*s%s%s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s');   %% column F and G only

Of course it's just the most important part of the whole script

The problem is:

If a file is small, like that one I attached, there is no problem to get the same values (in F column) as in .csv.

But if a file is larger (like 35MB or larger) values are mixed and they don't match the original .csv file.

Why is that ? Can you tell me what am I doing wrong ?

Many thanks in advance !!!

7 Comments
Show 5 older commentsHide 5 older comments

DGM on 18 Jul 2023

Edited: DGM on 18 Jul 2023

Open in MATLAB Online

Error using readtable (line 223)

Reading failed at line 17693. All lines of a text file must have the same number of delimiters. Line 17693 has 20 delimiters, while preceding lines have 16.

There are multiple lines which have extra delimiters or different content

T = fileread(filename); % read the whole file
T = split(T);
D = cellfun(@(x) nnz(x == ','),T) % find number of delimiters
badidx = find(D~=16) % find lines with nonstandard number of delimiters
badlines = T(badidx) % print those lines
numdelim = D(badidx) % number of delimiters on those lines

badidx =

7256

11005

17693

46822

badlines =

4×1 cell array

{'1683916462929,53.53142,10.00707,0,356.7,,"TRANSPORT",,,,"[(4,114),(8,11.5),(12,15.5),(16,16.5),(20,14.5),(24,4.5),(28,3.5),(32,5),(36,2),(40,2),(44,0.5),(48,2),(52,0.5),(56,4),(60,0),(64,2.5),(68,0),(72,0),(76,0),(80,0.5),(84,1),(88,0),(92,0),(96,1),(100,0.5),(104,1),(108,0),(112,0),(116,0.5),(120,0),(124,0),(128,0),(132,0.5)]","[(4,36154.5),(8,9495),(12,3450.5),(16,2616),(20,1320),(24,1005.5),(28,552),(32,470),(36,249),(40,219),(44,136),(48,108),(52,75.5),(56,54.5),(60,30),(64,37.5),(68,18.5),(72,17),(76,13),(80,7.5),(84,10.5),(88,7),(92,7.5),(96,9),(100,7),(104,8),(108,3.5),(112,4.5),(116,4),(120,1),(124,2),(128,3.5),(132,2.5),(136,0.5),(140,0.5),(144,0.5)]",3170,401,-760,0.000754496,2.13207e-05'}

{'1683916597744,53.53086,10.00883,0,52.7,,"TRANSPORT",,,,"[(4,30),(8,8),(12,6),(16,3.5),(20,1),(24,0.5),(28,3.5),(32,0),(36,1.5),(40,0),(44,1),(48,1.5),(52,0.5),(56,0),(60,0),(64,0.5),(68,0),(72,1),(76,0),(80,0),(84,0),(88,0),(92,0),(96,0),(100,0),(104,0),(108,0.5)]","[(4,36184.5),(8,9503),(12,3456.5),(16,2619.5),(20,1321),(24,1006),(28,555.5),(32,470),(36,250.5),(40,219),(44,137),(48,109.5),(52,76),(56,54.5),(60,30),(64,38),(68,18.5),(72,18),(76,13),(80,7.5),(84,10.5),(88,7),(92,7.5),(96,9),(100,7),(104,8),(108,4),(112,4.5),(116,4),(120,1),(124,2),(128,3.5),(132,2.5),(136,0.5),(140,0.5),(144,0.5)]",3170,401,-760,0.000759143,4.54643e-06' }

{'1683916837559,53.53370,10.00302,0,451683996200710,,,0,0,149.255,,,,,,,,,,,' }

{0×0 char }

numdelim =

152

140

20

0

I would say that line 17693 is where logging got interrupted and a new entry was appended in the middle of a partial line.

Walter Roberson on 18 Jul 2023

%q format should handle the commas but %s format would not.

Sign in to comment.

Sign in to answer this question.

Answer 1

Abhas on 9 Oct 2024

0
Link

Direct link to this answer

https://uk.mathworks.com/matlabcentral/answers/1997198-importing-large-csv-by-readtable-gives-wrong-results#answer_1529440

Open in MATLAB Online

Hi @Weronika Huss,

One way to fix this issue is to adjust the formatting of the CSV file. Some lines have an inconsistent number of comma delimited fields. Ensuring that each line has the same number of comma delimited fields will help the import process progress successfully.

Alternatively, you can apply the import options from the version of the CSV file that works correctly to the version of the file with the additional lines. This would involve code that looks like this:

opts = detectImportOptions('<Working CSV Filename>');
originalTable = readtable('<Working CSV Filename>', opts); % This reads the original table
newTable = readtable('<Longer CSV Filename>', opts); % This reads the modified table with additional lines with the original formatting options.

You may refer to the MathWorks documentation link below to know more about "detectImportOptions" :

https://www.mathworks.com/help/matlab/ref/detectimportoptions.html

If the files aren't the same format exactly, they can pass a detection hint to detectImportOptions. In this case, disable header line detection by providing a value.

opts = detectImportOptions('<Longer CSV Filename>', 'NumHeaderLines',0)

More information on the "delimitedTextImportOptions" created by "detectImportOptions" on a CSV can be found by referencing the link below:

https://www.mathworks.com/help/matlab/ref/matlab.io.text.delimitedtextimportoptions.html

I hope this helps!

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Importing large .csv by READTABLE gives wrong results

7 Comments
Show 5 older commentsHide 5 older comments

Answers (1)

0 Comments
Show -2 older commentsHide -2 older comments

See Also

Categories

Tags

Community Treasure Hunt

Importing large .csv by READTABLE gives wrong results

7 Comments Show 5 older commentsHide 5 older comments

Answers (1)

0 Comments Show -2 older commentsHide -2 older comments

See Also

Categories

Tags

Community Treasure Hunt

7 Comments
Show 5 older commentsHide 5 older comments

0 Comments
Show -2 older commentsHide -2 older comments