parquetread
Read columnar data from a Parquet file
Description
Examples
Read Parquet File into Table
Get information about a Parquet file, read the data from the file into a table, and then read a subset of the variables into a table.
Create a ParquetInfo object for the file outages.parquet.
info = parquetinfo('outages.parquet')
info =
  ParquetInfo with properties:
               Filename: "/mathworks/devel/bat/filer/batfs2561-0/Bdoc24b.2679053/build/runnable/matlab/toolbox/matlab/demos/outages.parquet"
               FileSize: 44202
           NumRowGroups: 1
        RowGroupHeights: 1468
          VariableNames: ["Region" "OutageTime" "Loss" "Customers" "RestorationTime" "Cause"]
          VariableTypes: ["string" "datetime" "double" "double" "datetime" "string"]
    VariableCompression: ["snappy" "snappy" "snappy" "snappy" "snappy" "snappy"]
       VariableEncoding: ["plain" "plain" "plain" "plain" "plain" "plain"]
                Version: "2.0"
Read data from the file into a table and display the first 10 rows.
T = parquetread('outages.parquet');
T(1:10,:)
ans=10×6 table
Region OutageTime Loss Customers RestorationTime Cause
___________ ____________________ ______ __________ ____________________ _________________
"SouthWest" 01-Feb-2002 12:18:00 458.98 1.8202e+06 07-Feb-2002 16:50:00 "winter storm"
"SouthEast" 23-Jan-2003 00:49:00 530.14 2.1204e+05 NaT "winter storm"
"SouthEast" 07-Feb-2003 21:15:00 289.4 1.4294e+05 17-Feb-2003 08:14:00 "winter storm"
"West" 06-Apr-2004 05:44:00 434.81 3.4037e+05 06-Apr-2004 06:10:00 "equipment fault"
"MidWest" 16-Mar-2002 06:18:00 186.44 2.1275e+05 18-Mar-2002 23:23:00 "severe storm"
"West" 18-Jun-2003 02:49:00 0 0 18-Jun-2003 10:54:00 "attack"
"West" 20-Jun-2004 14:39:00 231.29 NaN 20-Jun-2004 19:16:00 "equipment fault"
"West" 06-Jun-2002 19:28:00 311.86 NaN 07-Jun-2002 00:51:00 "equipment fault"
"NorthEast" 16-Jul-2003 16:23:00 239.93 49434 17-Jul-2003 01:12:00 "fire"
"MidWest" 27-Sep-2004 11:09:00 286.72 66104 27-Sep-2004 16:37:00 "equipment fault"
Select and import the variables Region, OutageTime, and Cause into a table and display the first 10 rows.
SelVarNames = {'Region','OutageTime','Cause'};
T_subset = parquetread('outages.parquet','SelectedVariableNames',SelVarNames);
T_subset(1:10,:)
ans=10×3 table
Region OutageTime Cause
___________ ____________________ _________________
"SouthWest" 01-Feb-2002 12:18:00 "winter storm"
"SouthEast" 23-Jan-2003 00:49:00 "winter storm"
"SouthEast" 07-Feb-2003 21:15:00 "winter storm"
"West" 06-Apr-2004 05:44:00 "equipment fault"
"MidWest" 16-Mar-2002 06:18:00 "severe storm"
"West" 18-Jun-2003 02:49:00 "attack"
"West" 20-Jun-2004 14:39:00 "equipment fault"
"West" 06-Jun-2002 19:28:00 "equipment fault"
"NorthEast" 16-Jul-2003 16:23:00 "fire"
"MidWest" 27-Sep-2004 11:09:00 "equipment fault"
Read Parquet File into Timetable
Read the data from the file into a timetable, and then use timetable functions to determine if the timetable is regular and sorted.
Read data from outages.parquet into a timetable and display the first 10 rows. Use the second variable OutageTime in the data as the time vector for the timetable.
TT = parquetread('outages.parquet','RowTimes','OutageTime');
TT(1:10,:)
ans=10×5 timetable
OutageTime Region Loss Customers RestorationTime Cause
____________________ ___________ ______ __________ ____________________ _________________
01-Feb-2002 12:18:00 "SouthWest" 458.98 1.8202e+06 07-Feb-2002 16:50:00 "winter storm"
23-Jan-2003 00:49:00 "SouthEast" 530.14 2.1204e+05 NaT "winter storm"
07-Feb-2003 21:15:00 "SouthEast" 289.4 1.4294e+05 17-Feb-2003 08:14:00 "winter storm"
06-Apr-2004 05:44:00 "West" 434.81 3.4037e+05 06-Apr-2004 06:10:00 "equipment fault"
16-Mar-2002 06:18:00 "MidWest" 186.44 2.1275e+05 18-Mar-2002 23:23:00 "severe storm"
18-Jun-2003 02:49:00 "West" 0 0 18-Jun-2003 10:54:00 "attack"
20-Jun-2004 14:39:00 "West" 231.29 NaN 20-Jun-2004 19:16:00 "equipment fault"
06-Jun-2002 19:28:00 "West" 311.86 NaN 07-Jun-2002 00:51:00 "equipment fault"
16-Jul-2003 16:23:00 "NorthEast" 239.93 49434 17-Jul-2003 01:12:00 "fire"
27-Sep-2004 11:09:00 "MidWest" 286.72 66104 27-Sep-2004 16:37:00 "equipment fault"
Determine if the timetable is regular and sorted. A regular timetable has the same time interval between consecutive row times, and a sorted timetable has a row time vector that is in ascending order.
isregular(TT)
ans = logical
0
issorted(TT)
ans = logical
0
Sort the timetable on its row times using the sortrows function and display the first 10 rows of the sorted data.
TT = sortrows(TT);
TT(1:10,:)
ans=10×5 timetable
OutageTime Region Loss Customers RestorationTime Cause
____________________ ___________ ______ __________ ____________________ __________________
01-Feb-2002 12:18:00 "SouthWest" 458.98 1.8202e+06 07-Feb-2002 16:50:00 "winter storm"
05-Mar-2002 17:53:00 "MidWest" 96.563 2.8666e+05 10-Mar-2002 14:41:00 "wind"
16-Mar-2002 06:18:00 "MidWest" 186.44 2.1275e+05 18-Mar-2002 23:23:00 "severe storm"
26-Mar-2002 01:59:00 "MidWest" 388.04 5.6422e+05 28-Mar-2002 19:55:00 "winter storm"
20-Apr-2002 16:46:00 "MidWest" 23141 NaN NaT "unknown"
08-May-2002 20:34:00 "SouthWest" 50.732 34481 08-May-2002 22:21:00 "thunder storm"
18-May-2002 11:04:00 "MidWest" 1389.1 1.3447e+05 21-May-2002 01:22:00 "unknown"
20-May-2002 10:57:00 "NorthEast" 9116.6 2.4983e+06 21-May-2002 15:22:00 "unknown"
27-May-2002 09:44:00 "SouthEast" 237.28 1.7101e+05 27-May-2002 16:19:00 "wind"
02-Jun-2002 16:11:00 "SouthEast" 0 0 05-Jun-2002 05:55:00 "energy emergency"
Conditionally Import Subset of Data Using Row Filter
Import a subset of data by specifying variables and rows to import by using a row filter.
To import a subset of the outages.parquet file, create a filter to import only the OutageTime, Region, and Cause variables. Then, refine the filter to import only rows with values that meet certain conditions.
rf = rowfilter(["OutageTime" "Region" "Cause"]);
rf2 = (rf.OutageTime > datetime("2013-02-01")) & (rf.Region == "NorthEast") & (rf.Cause == "winter storm");
d = parquetread("outages.parquet",RowFilter=rf2,SelectedVariableNames=["OutageTime" "Region" "Cause"])
d=6×3 table
OutageTime Region Cause
____________________ ___________ ______________
09-Feb-2013 00:55:00 "NorthEast" "winter storm"
13-Feb-2013 01:44:00 "NorthEast" "winter storm"
25-Dec-2013 11:24:00 "NorthEast" "winter storm"
30-Dec-2013 11:40:00 "NorthEast" "winter storm"
22-Feb-2013 02:17:00 "NorthEast" "winter storm"
23-Feb-2013 01:53:00 "NorthEast" "winter storm"
The resulting subset of filtered data contains only the 6 rows that meet the filter conditions and the 3 specified variables.
Input Arguments
filename — Name of Parquet file
character vector | string scalar
Name of Parquet file, specified as a character vector or string scalar.
parquetread works with Parquet 1.0 or Parquet 2.0 files.
Depending on the location of the file, filename can take on one of these forms.
Location | Form
---|---
Current folder or folder on the MATLAB® path | Specify the name of the file in filename.
File in a folder | If the file is not in the current folder or in a folder on the MATLAB path, then specify the full or relative path name in filename.
Internet URL | If the file is specified as an internet uniform resource locator (URL), then filename must contain the protocol type 'http://' or 'https://'.
Remote Location | If the file is stored at a remote location, then the required form of filename depends on that location. For more information, see Work with Remote Data.
The parquetread function can import structured data from Parquet files. For more information on Parquet data types supported for reading, see Apache Parquet Data Type Mappings.
Data Types: char | string
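As a small illustration of the first two forms in the table, the sketch below reads the same file once by bare name and once by full path. It assumes that outages.parquet (the sample file used in the examples on this page) is on the MATLAB path, and it uses the which function to resolve the full path; the variable names are illustrative only.

```matlab
% Resolve the full path of a file that is on the MATLAB path.
% Assumption: outages.parquet is available on the path, as in the examples above.
fullName = which('outages.parquet');

% Read the file by bare name (file on the path) and by full path.
T1 = parquetread('outages.parquet');
T2 = parquetread(fullName);

% isequaln treats NaN and NaT entries as equal when comparing the tables.
sameData = isequaln(T1,T2)
```

Both calls refer to the same file, so the comparison is expected to report that the two tables match.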
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'OutputType','table' imports the data in the Parquet file as a table.
OutputType — Output datatype
'auto' (default) | 'table' | 'timetable'
Output datatype, specified as the comma-separated pair consisting of 'OutputType' and 'auto', 'table', or 'timetable'.
'auto' — Return a table or a timetable. The parquetread function detects whether the output should be a table or a timetable based on the other name-value pairs that you specify. For example, when you set timetable-related name-value pairs, parquetread infers that the output is a timetable. Setting any of these name-value pairs indicates that the output is a timetable: RowTimes, StartTime, SampleRate, or TimeStep.
'table' — Return a table. For more information on the table datatype, see table.
'timetable' — Return a timetable. For more information on the timetable datatype, see timetable.
Example: 'OutputType','timetable'
Data Types: char | string
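The following sketch shows how the 'auto' setting can be observed in practice. It reuses the outages.parquet sample file and its OutageTime variable from the examples on this page. Passing 'OutputType' together with 'RowTimes' in the third call is an assumption about a reasonable combination, not a documented requirement.

```matlab
% No timetable-related arguments: with the default 'auto', the result is a table.
T = parquetread('outages.parquet');

% RowTimes is a timetable-related argument, so 'auto' yields a timetable.
TT = parquetread('outages.parquet','RowTimes','OutageTime');

% Request the output type explicitly while also naming the row times variable.
TT2 = parquetread('outages.parquet','OutputType','timetable','RowTimes','OutageTime');

% Inspect the class of each result.
class(T)
class(TT)
class(TT2)
```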
SelectedVariableNames — Subset of variables to import
character vector | string scalar | cell array of character vectors | string array
Subset of variables to import, specified as the comma-separated pair consisting of 'SelectedVariableNames' and a character vector, string scalar, cell array of character vectors, or string array.
SelectedVariableNames must be a subset of the variable names contained in the Parquet file. To get the names of all the variables in the file, use the VariableNames property of the ParquetInfo object.
If you do not specify the SelectedVariableNames name-value pair, parquetread reads all the variables from the file.
Data Types: char | string | cell
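One way to build the list of names is to query the file first, as suggested above. This sketch uses the VariableNames and VariableTypes properties of the ParquetInfo object (both shown as string arrays in the first example on this page) to select only the variables of type double. The variable names isDouble, numericNames, and Tnum are illustrative.

```matlab
% Inspect the file metadata without reading the data.
info = parquetinfo('outages.parquet');

% Logical mask of the variables whose type is reported as "double".
isDouble = info.VariableTypes == "double";
numericNames = info.VariableNames(isDouble);

% Import only those variables.
Tnum = parquetread('outages.parquet','SelectedVariableNames',numericNames);
head(Tnum,3)
```

For outages.parquet, the metadata shown earlier lists Loss and Customers as the double variables, so those are the two that should be imported.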
RowTimes — Row times variable
variable name | time vector
Row times variable, specified as the comma-separated pair consisting of 'RowTimes' and a variable name or a time vector.
Variable name must be a character vector or string scalar containing the name of any variable in the input table that contains datetime or duration values. The variable specified by the variable name provides row time labels for the rows. The remaining variables of the input table become the variables of the timetable.
Time vector must be a datetime vector or a duration vector. The number of elements of the time vector must equal the number of rows of the input table. The time values in the time vector do not need to be unique, sorted, or regular. All the variables of the input table become variables of the timetable.
Data Types: char | string | datetime | duration
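The time-vector form can be sketched as follows. The example derives the row count from the RowGroupHeights property of the ParquetInfo object and builds a duration vector of matching length. The one-minute spacing is an arbitrary choice for illustration and has no relation to the data in the file.

```matlab
% Total number of rows is the sum of the row group heights.
info = parquetinfo('outages.parquet');
nRows = double(sum(info.RowGroupHeights));

% Hypothetical time vector: one row per minute, starting at 0 minutes.
rowTimes = minutes(0:nRows-1)';

% Supply the time vector directly. All file variables stay in the timetable.
TT = parquetread('outages.parquet','RowTimes',rowTimes);
size(TT)
```

Because the row times come from an external vector rather than from a file variable, OutageTime remains an ordinary variable of the timetable.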
StartTime — Start time of row times
datetime scalar | duration scalar
Start time of the row times, specified as the comma-separated pair consisting of 'StartTime' and a datetime scalar or duration scalar.
If the start time is a datetime, then the row times of T are datetime values.
If the start time is a duration, then the row times of T are duration values.
If the time step is a calendar duration, then the start time must be a datetime value.
StartTime is a timetable-related parameter. The parquetread function uses StartTime along with SampleRate or TimeStep to define the time vector for the output T.
Data Types: datetime | duration
SampleRate — Sample rate
numeric scalar
Sample rate, specified as the comma-separated pair consisting of 'SampleRate' and a numeric scalar. The sample rate is the number of samples per second (Hz) of the time vector of the output timetable T.
SampleRate is a timetable-related parameter. The parquetread function uses SampleRate along with other timetable parameters to define the time vector for the output T.
Data Types: double
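A minimal sketch of SampleRate combined with StartTime follows. The rate of 100 Hz and the zero-second start are arbitrary illustrative values. The data in outages.parquet is not actually sampled at a fixed rate, so this only demonstrates how the time vector is generated.

```matlab
% Generate row times at 100 samples per second, starting at 0 seconds.
TT = parquetread('outages.parquet','SampleRate',100,'StartTime',seconds(0));

% The generated row times are evenly spaced, so the timetable is regular.
TT.Properties.RowTimes(1:3)
isregular(TT)
```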
RowGroups — Indices of row groups to import
positive numeric scalar | vector of positive integers
Indices of row groups to import, specified as a positive numeric scalar or vector of positive integers, referring to indices of row groups to read.
If you specify a scalar, then the function reads a single row group.
If you specify a vector, then the function reads all the specified row groups.
If you do not specify row groups, then parquetread imports the entire file.
Example: RowGroups=701:720
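Reading one row group at a time is a way to process a large file in pieces. The sketch below loops over the row groups reported by the NumRowGroups property of the ParquetInfo object. For outages.parquet there is only a single row group, so the loop runs once. The variable names are illustrative.

```matlab
info = parquetinfo('outages.parquet');

% Read each row group separately and accumulate the row count.
totalRows = 0;
for k = 1:info.NumRowGroups
    Tk = parquetread('outages.parquet','RowGroups',k);
    fprintf('Row group %d: %d rows\n',k,height(Tk));
    totalRows = totalRows + height(Tk);
end

% The per-group heights should add up to the total reported in the metadata.
matchesInfo = (totalRows == double(sum(info.RowGroupHeights)))
```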
RowFilter — Filter to select rows to import
matlab.io.RowFilter object
Filter to select rows to import, specified as a matlab.io.RowFilter object. The matlab.io.RowFilter object designates conditions each row must satisfy to be included in your output table or timetable. If you do not specify RowFilter, then parquetread imports all rows from the input Parquet file.
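As a further sketch of how conditions are expressed, the example below filters on two numeric variables of outages.parquet and imports all variables of the matching rows. The thresholds are arbitrary illustrative values. The Name=Value syntax follows the row filter example earlier on this page.

```matlab
% Create a filter on the two numeric variables of interest.
rf = rowfilter(["Loss" "Customers"]);

% Keep only rows where both conditions hold (thresholds chosen for illustration).
cond = (rf.Loss > 1000) & (rf.Customers > 1e5);

% No SelectedVariableNames is given, so all variables are imported.
big = parquetread("outages.parquet",RowFilter=cond);
head(big,5)
```

Rows whose Loss or Customers value is NaN do not satisfy the comparisons, so they are not part of the result.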
TimeStep — Time step of time vector
duration | calendarDuration
Time step of time vector, specified as the comma-separated pair consisting of 'TimeStep' and a duration scalar or calendarDuration scalar.
If you specify the time step as a calendar duration (for example, calendar months), then the vector of row times must be a datetime vector.
If you specify the time step as a duration (for example, seconds), then the vector of row times can either be a datetime or duration vector.
TimeStep is a timetable-related parameter. The parquetread function uses TimeStep along with other timetable parameters to define the time vector for the output T.
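The calendar-duration case can be sketched as follows. A time step of one calendar month is paired with a datetime start time, as the rule above requires. The start date is an arbitrary illustrative value and is unrelated to the outage data.

```matlab
% Calendar-duration time step, so the start time must be a datetime.
t0 = datetime(2000,1,1);
TT = parquetread('outages.parquet','TimeStep',calmonths(1),'StartTime',t0);

% The row times are datetime values spaced one calendar month apart.
TT.Properties.RowTimes(1:3)
```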
VariableNamingRule — Flag to preserve variable names
"modify" (default) | "preserve"
Flag to preserve variable names, specified as either "modify" or "preserve".
"modify" — Convert invalid variable names (as determined by the isvarname function) to valid MATLAB identifiers.
"preserve" — Preserve variable names that are not valid MATLAB identifiers, such as variable names that include spaces and non-ASCII characters.
Starting in R2019b, variable names and row names can include any characters, including spaces and non-ASCII characters. Also, they can start with any characters, not just letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname function). To preserve these variable names and row names, set the value of VariableNamingRule to "preserve".
Variable names are not refreshed when the value of VariableNamingRule is changed from "modify" to "preserve".
Data Types: char | string
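To see the two settings side by side, the sketch below writes a small temporary Parquet file whose only variable name contains a space, and then reads it back with each rule. The file location (from tempname) and the variable name 'my var' are made up for illustration.

```matlab
% Build a table whose variable name is not a valid MATLAB identifier.
T = table([1;2],'VariableNames',{'my var'});

% Write it to a temporary Parquet file (hypothetical location).
fname = [tempname '.parquet'];
parquetwrite(fname,T);

% Default rule "modify": the name is converted to a valid identifier.
Tmod = parquetread(fname);

% Rule "preserve": the original name is kept as written.
Tpre = parquetread(fname,'VariableNamingRule','preserve');

Tmod.Properties.VariableNames
Tpre.Properties.VariableNames

% Clean up the temporary file.
delete(fname);
```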
Output Arguments
T — Output data
table | timetable
Output data, returned as a table or timetable. The output of the parquetread function depends on the value of the OutputType name-value pair. For more information, see the name-value pair description for OutputType.
Limitations
In some cases, parquetwrite creates files that do not represent the original array T exactly. If you use parquetread or datastore to read the files, then the result might not have the same format or contents as the original table. For more information, see Apache Parquet Data Type Mappings.
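One way to check whether a particular table survives a round trip is to compare the variable classes before writing and after reading. The sketch below does this for a small made-up table written to a temporary file. It makes no claim about which classes change; that depends on the data type mappings referenced above.

```matlab
% Small illustrative table with a cell array of character vectors and a double.
T = table({'a';'b'},[1;2],'VariableNames',{'Name','Value'});

% Round trip through a temporary Parquet file (hypothetical location).
fname = [tempname '.parquet'];
parquetwrite(fname,T);
T2 = parquetread(fname);
delete(fname);

% Collect the class of each variable before and after the round trip.
origClasses = varfun(@class,T,'OutputFormat','cell');
readClasses = varfun(@class,T2,'OutputFormat','cell');

comparison = table(string(T.Properties.VariableNames)', ...
    string(origClasses)',string(readClasses)', ...
    'VariableNames',{'Variable','Original','AfterRead'})
```

Any row where Original and AfterRead differ marks a variable that was not represented exactly in the file.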
Extended Capabilities
Thread-Based Environment
Run code in the background using MATLAB® backgroundPool or accelerate code with Parallel Computing Toolbox™ ThreadPool.
This function fully supports thread-based environments. For more information, see Run MATLAB Functions in Thread-Based Environment.
Version History
Introduced in R2019a
R2022b: Read Parquet files containing structured data
Read structured data from Parquet files as nested tables.
R2022b: Use function in thread-based environments
This function supports thread-based environments.
R2022a: Read Parquet file data more efficiently using rowfilter to conditionally filter rows
Conditionally filter and read data faster (Predicate Pushdown) from Parquet files when using parquetread and parquetDatastore.
You can create conditions for filtering by using the rowfilter function, matlab.io.RowFilter object, and RowFilter name-value argument.
R2022a: Determine and define row groups in Parquet file data
A Parquet file can store a range of rows as a distinct row group for increased granularity and targeted analysis. parquetread uses the RowGroups name-value argument to determine row groups while reading Parquet file data. parquetwrite uses the RowGroupHeights name-value argument to define row groups while writing Parquet file data.
R2022a: Import nested Parquet file data
You can now import nested Parquet file data with:
LogicalType as LIST.
LogicalType as NONE and PhysicalType as either BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY.
The parquetread function converts and imports these data architectures as cell arrays.
R2021b: Read and write datetimes with original time zones
Parquet files require time-zone-aware timestamps to be in the UTC time zone. When writing datetimes, parquetwrite converts them to equivalent UTC values and stores the original time zone values in the metadata of the Parquet file. parquetread uses the stored original time zone values to enable roundtripping.
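A minimal sketch of the round trip described above, assuming R2021b or later. The table, the variable name Stamp, the time zone, and the temporary file location are all illustrative.

```matlab
% Zoned datetime values to write (illustrative time zone).
t = datetime(2021,6,1,12,0,0,'TimeZone','America/New_York');
T = table([t; t + hours(1)],'VariableNames',{'Stamp'});

% Round trip through a temporary Parquet file (hypothetical location).
fname = [tempname '.parquet'];
parquetwrite(fname,T);
T2 = parquetread(fname);
delete(fname);

% Compare the time zone and the time points after reading.
T2.Stamp.TimeZone
sameInstants = isequal(T.Stamp,T2.Stamp)
```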
R2021a: Read online data
Read files from an internet URL by specifying filename as a string scalar or character vector that contains the protocol type 'http://' or 'https://'.
R2021a: Use categorical data in Parquet data format
Import Parquet data that contains the categorical data type.
R2019b: Read tabular data containing any characters
Import tabular data that has variable names containing any Unicode characters, including spaces and non-ASCII characters. To read tabular data that contains arbitrary variable names, such as variable names with spaces and non-ASCII characters, set the PreserveVariableNames parameter to true.
See Also
parquetinfo | parquetwrite | timetable | table | parquetDatastore | rowfilter