Bug in readtable()? - if the first values in a CSV's column are missing, the whole column is misinterpreted

I am reading a big CSV file (500K lines) with readtable(). In the CSV, some columns have their first 250+ lines empty (e.g. ",,,," rows), while the rare non-missing values below are either text strings or dates (in the DD-MM-YYYY format). readtable() somehow interprets these columns as numeric and converts all the strings and dates to NaN - so I end up with columns that are 100% NaN instead of sparsely populated data (among empty strings and NaTs).
Furthermore, if I move the lines with data up - even a few dozen positions - readtable() starts to read everything normally!
So it looks like readtable() checks only the first ~250 values to determine a column's type, which, in my opinion, is a bug! (Although I understand this was likely done to improve speed.)
Is there a way to fix this systematically? I have lots of such CSVs with thousands of columns in them, so a manual check and manual fix is not an option...
UPD: a test file (truncated to 510 lines) is attached - the behaviour is still the same. The problem columns are the 2nd and the 3rd (p190, p191); the first non-empty value is on data line 270.

10 Comments

Hello, could you share one file (or an extract if the size is over 5 MB) so we can test?
"Is there a way to fix it systematically?"
Specify a DelimitedTextImportOptions object and use that for importing all of the files:
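A minimal sketch of that workflow, assuming all files share the same layout as the attached test file (the folder name and the dd-MM-yyyy date format are assumptions):

```matlab
% Sketch: detect the options once, correct the sparsely-populated columns,
% then reuse the same options object for every file.
files = dir('data/*.csv');                     % folder name is hypothetical
opts  = detectImportOptions(fullfile(files(1).folder, files(1).name));
opts  = setvartype(opts, ["p190","p191"], ["string","datetime"]);
opts  = setvaropts(opts, "p191", 'InputFormat', 'dd-MM-yyyy');

T = cell(numel(files), 1);
for k = 1:numel(files)
    T{k} = readtable(fullfile(files(k).folder, files(k).name), opts);
end
```

With one options object, detection runs once instead of once per file, and every file is guaranteed to be read with the same column types.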
"So, it looks like readtable() checks only ~250 first values to determine the type of the column, which, in my opinion, is a bug! (Although I understand that it was likely made to improve speed.)"
You consider it a bug when READTABLE does not check all rows before deciding what class a column has.
Another hypothetical user considers it a bug when READTABLE hypothetically imports the entire file into memory (taking five minutes) just to check ten billion rows because there might be something inconsistent on the final row.
Who is right?
Perhaps TMW should add an option: "read all rows before deciding the class, even if this takes five minutes".
@Stephen23 In normal usage, readtable() eventually reads everything into memory anyway, so full checking would not be SO slow. Even if some huge CSV cannot fit into memory and you read it in chunks, it still seems reasonable that the format should be determined from the full chunk that was read, not from its first ~250 lines.
"In normal usage, readtable() eventually reads everything into the memory..."
Yes, but as interpreted data (not as a bytestream like the raw file data).
"...and so full checking would not be SO slow."
It would basically require reading the file twice OR storing the file data twice in memory.
Below, I suggested a heuristic variant "to check the first M non-empty values". For "dense" tables, it will be the same performance, but for sparse tables, it will drastically reduce the number of errors at a reasonable performance cost.
"it will drastically reduce the number of errors at a reasonable performance cost."
Unless the first non-empty value in that column is in the last row of the table, in which case it will have to read in the whole file.
If you know that your data may be sparsely populated in your file, I would recommend using the detectImportOptions approach Cris LaPierre suggested to allow readtable to take advantage of that knowledge.
Note that when you use detectImportOptions(), it checks more lines than readtable() does when called without it.
@Steven Lord Yes, the case where the first non-empty value is at the bottom of the table is exactly the "worst case scenario". But my argument is that this scenario is rare, and in the vast majority of real cases it won't happen. So the "net gain" will be positive, albeit at the cost of some rare negative outliers.
detectImportOptions doesn't actually check more lines, it just defaults to treating the variable as text instead of double when the first 250 rows are all empty. The reason for the difference is to preserve the behavior from before detectImportOptions was introduced. Before that, it only considered the first row of data for the variable types and empty fields were treated as double by default.


Answers (1)

I wouldn't call it a bug. It's an artifact of trying to autodetect the format, which naturally won't get it right every time. I would recommend specifying import options for the columns it does not detect correctly.
opts = detectImportOptions('TEST_ongoing.csv');
opts = setvartype(opts,["p190","p191"],["string","datetime"]);
data = readtable('TEST_ongoing.csv',opts)
data = 510x7 table
        eid          p190         p191    p20005                                            p20143        p20144                        p20145
    __________    _________    ____    ______________________________________________    __________    __________________________    ______
    4.3685e+06    <missing>    NaT     {'Current, email address assumed working'    }    2019-07-15    {'Online questionnaire'   }     2
    3.8845e+06    <missing>    NaT     {0x0 char                                    }    2006-03-21    {'Assessment centre visit'}     1
    5.8641e+06    <missing>    NaT     {0x0 char                                    }    2021-11-08    {'Online questionnaire'   }    14
    5.5967e+06    <missing>    NaT     {0x0 char                                    }    2010-06-22    {'Assessment centre visit'}     1
    5.3448e+06    <missing>    NaT     {'Current, email address assumed working'    }    2021-10-27    {'Online questionnaire'   }    16
    4.7725e+06    <missing>    NaT     {'Unusable, email address failed or withdrawn'}   2022-07-19    {'Online questionnaire'   }    11
    5.0307e+06    <missing>    NaT     {'Current, email address assumed working'    }    2022-11-08    {'Online questionnaire'   }    27
    1.9987e+06    <missing>    NaT     {'Current, email address assumed working'    }    2022-11-07    {'Online questionnaire'   }    32
    3.9414e+06    <missing>    NaT     {'Current, email address assumed working'    }    2007-12-06    {'Assessment centre visit'}     1
    3.9442e+06    <missing>    NaT     {'Current, email address assumed working'    }    2019-06-15    {'Online questionnaire'   }    11
     5.113e+06    <missing>    NaT     {'Current, email address assumed working'    }    2016-08-23    {'Online questionnaire'   }     4
    3.2642e+06    <missing>    NaT     {'Current, email address assumed working'    }    2022-02-22    {'Online questionnaire'   }     7
    5.4625e+06    <missing>    NaT     {'Current, email address assumed working'    }    2023-03-04    {'Online questionnaire'   }    20
    4.3812e+06    <missing>    NaT     {'Current, email address assumed working'    }    2022-11-11    {'Online questionnaire'   }    30
    3.4906e+06    <missing>    NaT     {0x0 char                                    }    2008-04-03    {'Assessment centre visit'}     1
    3.6481e+06    <missing>    NaT     {'Current, email address assumed working'    }    2022-11-03    {'Online questionnaire'   }    38
data([270,508],:)
ans = 2x7 table
        eid          p190                                                 p191          p20005                                        p20143        p20144                        p20145
    __________    _________________________________________________    __________    __________________________________________    __________    __________________________    ______
    3.5307e+06    "UK Biobank sources report they have left the UK"    2013-05-09    {0x0 char                                }    2008-07-22    {'Assessment centre visit'}     1
     1.717e+06    "UK Biobank sources report they have left the UK"    2012-07-19    {'Current, email address assumed working'}    2022-11-03    {'Online questionnaire'   }    21

12 Comments

Alas, I cannot manually mark up thousands of columns in the database.
And I totally agree with @Stephen23: the optimal solution would be to add an option like "read and check ALL the values before autodetecting the format" (instead of the excuse "it is not a bug, it's a feature").
BTW, another option is instead of "check first N lines to autodetect the format" to do something like "check first M non-empty values". This heuristic can be much better in 99% of usage cases.
Alas, I cannot manually mark up thousands of columns in the database.
What if I were to tell you that the column is really meant to hold numeric values, with lots and lots of leading missing values followed by a number of "mistakes"?
What if I were to tell you that the real purpose of the column is of strings?
Detection of import options cannot guess both of those possibilities simultaneously. Without a hint about which case it is, detection of import options will surely guess incorrectly sometimes.
@Walter Roberson Of course, autodetection cannot get the real purpose of a column from the head of the author. But autodetection can check whether ALL non-empty values in the column are formattable to the same type, which would make a good guess about the "real purpose".
But, importantly, if there is nonuniformity in the data-values, then, in my opinion, readtable() should throw a warning (just as it does when renaming variables to make them Matlab-compatible) and, possibly, format everything as text-strings.
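The "check all non-empty values" idea can already be prototyped in user code: read every column as text first, then upgrade the columns whose non-empty values all parse to one richer type. A rough sketch, where the dd-MM-yyyy format is an assumption taken from the question:

```matlab
% Sketch: force a text-only import, then type-check ALL non-empty values
% per column before converting.
opts = detectImportOptions('TEST_ongoing.csv');
opts = setvartype(opts, 'string');              % everything as text first
raw  = readtable('TEST_ongoing.csv', opts);

for v = string(raw.Properties.VariableNames)
    col  = raw.(v);
    vals = col(~ismissing(col) & strlength(col) > 0);
    if isempty(vals), continue, end
    if ~any(isnan(str2double(vals)))            % every non-empty value numeric?
        raw.(v) = str2double(col);
    else
        try                                     % ...or every one a valid date?
            datetime(vals, 'InputFormat','dd-MM-yyyy');
            raw.(v) = datetime(col, 'InputFormat','dd-MM-yyyy');
        catch
            % mixed content: leave as string (or issue a warning here)
        end
    end
end
```

This reads the data only once, at the cost of holding it as strings during detection; for truly huge files, the same per-column check could be run on a small text-only sample instead.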
Alas, I cannot manually mark up thousands of columns in the database.
I made an assumption that you had thousands of files, all of them with the same format as the file you shared, just with more rows.
I'd agree that, if you have thousands of columns, this is a cumbersome approach. There may be other approaches, but in order to provide a more specific answer for your use case, we need to see an actual example of what you are working with.
Is there no definition of the(se) database(s)? How does any person or application know how to populate/read them otherwise? Unless the fields are not controlled and are just text that allows for mistakes - which would seem to be exactly what @Konstantin is running into if he has columns containing both date strings and other text (or date strings in inconsistent formats?). It seems the fundamental problem is in the database itself, not in MATLAB.
Lacking controls on the database or at least information about its column types that can be used, it would seem the only real solution would be to read the whole file as a cell array and then try to parse it.
Perhaps a workable solution would be to put the import into a try...catch...end block and if the auto-import fails then fall back to the cell import and try to deal with it programmatically and only if that fails get the human involved.
But clearly, it isn't practical for the auto-detect functionality to have to parse the whole file first as a generic implementation.
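The fallback described above might be sketched like this. Note that readtable() rarely errors in this situation (the failure is silent NaNs), so the sketch also treats a fully-NaN numeric column as suspect; the file name is the question's attachment:

```matlab
% Hedged sketch of the try/catch fallback: attempt the auto-import, and
% fall back to a raw cell read if it errors or yields an all-NaN column.
try
    data = readtable('TEST_ongoing.csv');
    allNaN = varfun(@(x) isnumeric(x) && all(isnan(x)), data, ...
                    'OutputFormat', 'uniform');
    if any(allNaN)
        error('suspect:import', 'column(s) imported as all-NaN');
    end
catch
    raw = readcell('TEST_ongoing.csv');   % header in row 1, raw fields below
    % ... parse raw programmatically, or flag the file for a human
end
```

readcell() keeps each field as raw text or number, so nothing is silently converted and the data can still be rescued programmatically.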
@Cris LaPierre @dpb The database is an extremely typical "medical" database. The columns are consistent within themselves (i.e. all values in one column are of one type). The list of "column types" doesn't exist in a convenient form - at least on the level of data access that I have at the moment. But, after some persistent googling, I've found a list in a multipage HTML form - so, I can write a parser for this and re-format the main data accordingly.
Yet I still consider the silent failure of readtable() to preserve the actual data during formatting to be a bug. Maybe readtable() could at least throw a warning like "formatting of column X is based on missing values only - misformatting is possible" or "there were non-empty values that were converted to NaNs".
But clearly, it isn't practical for the auto-detect functionality
to have to parse the whole file first as a generic implementation.
I don't ask to make it the generic/default behaviour. I'm asking for warnings to be produced by default, and for an option to force the "wasteful" full-file behaviour via additional parameters.
You have a good use case here for an enhancement. I'd suggest submitting an enhancement suggestion to MathWorks to address this need. You can do that here: https://supportcases.mathworks.com/mwsupport/s/casetypeselection?language=en_US&c__caseParameter=productusage&s_tid=srlp_product
@Cris LaPierre Thanks for the suggestion. The enhancement ticket is submitted.
@Cris LaPierre I've stumbled upon a "continuation bug-o-feature" with readtable():
After obtaining the "creator's" data-types for all columns and forcing them on readtable() using setvartype(), I get the following warning when I read fully empty columns marked as "dates":
Warning: Unable to convert one or more variables to datetime. Specify the correct InputFormat and DatetimeLocale properties in the DatetimeVariableImportOptions.
Interestingly, when I read equally empty 'char' or 'categorical' columns, there is no such warning...
I think, if the options are forcing some variable to be datetime, then readtable() should not fight back even if the actual variable is empty.
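The warning text itself points at a possible workaround: give the datetime columns an explicit InputFormat so readtable() does not have to guess one from an all-empty column. A hedged sketch (column name and format taken from the question; whether this silences the warning in every release is an assumption worth verifying):

```matlab
% Sketch: force the type AND the input format for the datetime column,
% so an all-empty column can still be parsed (yielding all NaT).
opts = detectImportOptions('TEST_ongoing.csv');
opts = setvartype(opts, "p191", "datetime");
opts = setvaropts(opts, "p191", 'InputFormat', 'dd-MM-yyyy');
data = readtable('TEST_ongoing.csv', opts);
```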
The datetime conversion routines are pretty picky, failing if valid date/time strings aren't all the same format in a single call. This is clearly an implementation choice for performance reasons, but can become annoying.
If this can really happen and isn't just a test case, for an immediate fix it may be more expeditious to read the date column as cellstr and then test for the empty case and convert the existing values, or use the try...catch construction to deal with the (presumed) aberration. I agree that NaT should be the result... although this seems to be an artifact of readtable and not part of datetime's repertoire...
s = ""; s = standardizeMissing(s,"")
s = <missing>
datetime(s,'InputFormat',"yyyy-MMM-dd")
ans = datetime
NaT
Another possibility I haven't tested might be to use 'MissingRule' and/or 'ImportErrorRule', although they are global rather than offering the granularity of setting a replacement value per column/data type - another enhancement I think I've previously suggested.
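For reference, those two knobs are properties of the import-options object and apply to the whole import (a hedged sketch; the documented values include 'fill', 'error', 'omitrow', and 'omitvar'):

```matlab
% Sketch: control what happens to missing and unconvertible fields
% globally via the import options.
opts = detectImportOptions('TEST_ongoing.csv');
opts.MissingRule     = 'fill';    % replace missing fields with each type's FillValue
opts.ImportErrorRule = 'fill';    % same for fields that fail to convert
data = readtable('TEST_ongoing.csv', opts);
```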


Release: R2024b
Asked: on 3 Apr 2025
Edited: on 2 May 2025
