Data Sets for Signal Processing

Use these data sets with MATLAB® and Signal Processing Toolbox™ to get started with signal processing applications.

Audio and Acoustics Data Sets

Air Compressor

Air compressor data set. The figure shows eight signals in the time domain.

This data set comprises 1,800 acoustic recordings collected on a single-stage reciprocating-type air compressor [1]. Each recording is sampled at 16 kHz and lasts 3.125 seconds.

The data set contains single-channel signals in eight subfolders, each of which corresponds to one of these operational states:

  • Healthy state — Normal compressor operation with no faults

  • Leakage inlet valve (LIV) fault — Fault in the inlet valve causing air leakage

  • Leakage outlet valve (LOV) fault — Fault in the outlet valve causing air leakage

  • Non-return valve (NRV) fault — Fault in the non-return valve preventing proper check valve operation

  • Piston ring fault — Worn or damaged piston rings affecting compression

  • Flywheel fault — Issues with the flywheel mechanism

  • Rider belt fault — Problems with the drive belt system

  • Bearing fault — Damaged or worn bearings causing vibration and noise

Phonocardiogram (PCG) Data — PhysioNet Challenge 2016

Phonocardiogram data set. The figure shows four signals in the time domain.

This data set comprises 3,829 acoustic recordings of heart sounds from the PhysioNet Computing in Cardiology Challenge 2016 [2][3]. Each recording is sampled at 2 kHz and lasts 5 seconds.

The data set supports binary classification of cardiac health status through automated phonocardiogram interpretation in resource-limited settings, remote cardiac health screening, and early detection of heart conditions.

  • Normal (2,575 recordings) — Persons with normal cardiac function, representing healthy heart sound patterns and reference baseline.

  • Abnormal (1,254 recordings) — Persons with abnormal cardiac function including various cardiac abnormalities, murmurs, valve disorders, and other pathologies.
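
Because the two classes are unevenly represented, a classifier trained on this data may benefit from class weighting. The following Python sketch derives inverse-frequency weights from the counts listed above. The weighting formula is a common convention and an assumption here, not something the data set prescribes.

```python
# Class counts as stated in the data set description.
counts = {"normal": 2575, "abnormal": 1254}

total = sum(counts.values())
num_classes = len(counts)

# Inverse-frequency weights (assumed convention): total / (num_classes * count).
weights = {label: total / (num_classes * n) for label, n in counts.items()}

# Fraction of the data set that each class represents.
fractions = {label: n / total for label, n in counts.items()}

for label in counts:
    print(f"{label}: {fractions[label]:.1%} of recordings, weight {weights[label]:.3f}")
```

With these weights, each class contributes equally to a weighted loss even though roughly two thirds of the recordings are normal.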

Applications include automated cardiac screening systems, telemedicine and remote diagnostics, point-of-care cardiac assessment, medical training and education, and algorithm development for cardiac sound analysis.

Acoustic scenes — Detection and Classification of Acoustic Scenes (DCASE) 2013 Challenge

Acoustic-scene DCASE data set. The figure shows two signals in the time domain.

This data set comprises 200 environmental audio recordings from the Detection and Classification of Acoustic Scenes and Events (DCASE) 2013 challenge [4][5]. Each recording is collected at a sample rate of 44.1 kHz for 30 seconds.

The data set contains two-channel signals in two subfolders, comprising training data and test data.

  • The training and test sets consist of 100 training waveforms and 100 test waveforms recorded in 10 different environments: bus, busy street, office, open-air market, park, quiet street, restaurant, supermarket, tube, and tube station.

  • This data set supports acoustic scene classification for environmental sound recognition. Supported applications include smart city monitoring and urban planning, environmental noise assessment, context-aware mobile applications, acoustic event detection systems, audio surveillance and security, and assistive listening devices.

Free Spoken Digits

Free spoken digits data set. The figure shows 10 signals in the time domain.

This data set comprises 2000 voice recordings of spoken digits (0-9) by four individuals [6]. Each recording is collected at a sample rate of 8 kHz for a variable duration from 0.14 to 2.28 seconds.

The data set contains 16-bit single-channel signals distributed in 200 recordings per spoken digit. The multiple speakers provide diversity in accents and vocal characteristics, supporting development of robust digit recognition systems that generalize across different voices.
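
Because the recordings vary in duration, many workflows bring them to a common length before feature extraction. The Python sketch below converts the stated duration range to sample counts at 8 kHz and zero-pads or truncates a signal to a fixed length. The helper names and the choice of zero-padding are illustrative assumptions, and the example signal is a stand-in rather than real audio.

```python
SAMPLE_RATE_HZ = 8000

def duration_to_samples(seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a duration in seconds to a whole number of samples."""
    return round(seconds * sample_rate_hz)

def pad_or_trim(signal, target_len):
    """Return a copy of `signal` zero-padded or truncated to `target_len`."""
    if len(signal) >= target_len:
        return list(signal[:target_len])
    return list(signal) + [0.0] * (target_len - len(signal))

shortest = duration_to_samples(0.14)   # shortest recording in the stated range
longest = duration_to_samples(2.28)    # longest recording in the stated range

# Hypothetical stand-in for a short recording; real data would come from the audio files.
example = [0.1] * shortest
fixed = pad_or_trim(example, longest)
print(shortest, longest, len(fixed))
```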

Mozilla.org® Common Voice Speech Denoising

Common Voice Speech Denoising data set. The figure shows four signals in the time domain.

This data set comprises 2,800 speech recordings curated from the Mozilla.org Common Voice open-source speech corpus [7]. Each recording is collected at a sample rate of 48 kHz for a variable duration from 2 to 10 seconds.

The data consists of 16-bit single-channel signals distributed in three subfolders: training, validation, and test.

  • The recordings contain read sentences from diverse text sources with varied vocabulary and sentence structures, capturing natural prosody and complete utterances from multiple diverse speakers across different genders, age groups (18-80+ years), English dialects, and recording quality levels.

  • The data enables development of speech enhancement systems that handle various noise scenarios, including environmental noise, electronic noise, and signal distortions. The clean speech serves as reference, with noise augmentation that can be applied to create noisy versions for signal processing applications.

Biomedical Data Sets

Electrocardiogram (ECG) Data — QT Wave

QT-wave electrocardiogram data set. The figure shows eight signals in the time domain.

This data set comprises 210 ECG recordings from 105 patients for automated waveform segmentation [2][8]. To obtain each recording, examiners placed two electrodes on different locations on a patient's chest and collected ECG waveforms at a sample rate of 250 Hz for approximately 15 minutes.

The data set contains two-channel ECG signals and four labeled cardiac regions:

  • P wave (atrial depolarization, <20 Hz frequency band)

  • QRS complex (ventricular depolarization, 10-40 Hz frequency band, most prominent ECG feature)

  • T wave (ventricular repolarization, <10 Hz frequency band)

  • N/A (unlabeled baseline and isoelectric segments)

The segmentation of these regions can provide the basis for measurements useful for assessing the overall health of the human heart and the presence of abnormalities [9].

Myoelectric (EMG) Data — Arm motion

Arm-motion myoelectric data set. The figure shows eight signals in the time domain.

This data set comprises 720 EMG signal recordings measuring electrical muscle activity from 30 subjects performing various arm movements [10]. Each recording captured 3 seconds of motion at a sample rate of 3 kHz.

The data set contains 720 files with eight-channel EMG signals and 720 label files.

  • The recordings were collected across 4 sessions with 6 trials per session, with sensors placed on the subjects' forearm muscles to detect muscle activation patterns.

  • The label data includes seven motion categories: hand open, hand close, wrist flexion, wrist extension, supination, pronation, and rest. The motion arrays in each label file mark the forearm movements with numeric values from 1 to 6, and the rest periods have a mark of –1.

  • The multi-channel EMG configuration enables capture of complex muscle activation patterns across the forearm during different movements.
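
The per-sample label arrays described above can be collapsed into motion intervals before windowing the EMG signals. The Python sketch below run-length encodes a label array and keeps only the non-rest runs. The example array is hypothetical, and the sketch does not assume which numeric value maps to which named motion.

```python
from itertools import groupby

REST_LABEL = -1  # rest periods, as stated in the description

# Recording count implied by the description: subjects x sessions x trials.
NUM_RECORDINGS = 30 * 4 * 6

def label_runs(labels):
    """Collapse a per-sample label array into (label, start, stop) runs, stop exclusive."""
    runs = []
    start = 0
    for label, group in groupby(labels):
        length = sum(1 for _ in group)
        runs.append((label, start, start + length))
        start += length
    return runs

# Hypothetical label array: rest, motion 3, rest, motion 5.
labels = [-1] * 4 + [3] * 6 + [-1] * 2 + [5] * 3
runs = label_runs(labels)
motion_runs = [run for run in runs if run[0] != REST_LABEL]
print(NUM_RECORDINGS, motion_runs)
```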

ECG Data — PhysioNet MIT-BIH

PhysioNet MIT-BIH ECG data set. The figure shows two signals in the time domain.

This data set comprises 46,080 segments of ECG signals and noise specifically prepared for ECG signal denoising applications. The recording captured 43.5 hours of ECG data from 47 subjects at a sample rate of 300 Hz. Each segment has 1024 samples, covering approximately 3.4 seconds of data.

The data set contains high-quality clinical-grade single-lead ECG recordings (typically MLII - Modified Lead II) from the PhysioNet MIT-BIH Arrhythmia database [2][11] and MIT-BIH Noise Stress Test database [2][12].

  • Three kinds of noise were added to the clean signals: baseline wander, muscle artifact, and electrode motion. The noise types are combined and rescaled to produce target SNRs of –2.5, 0, 2.5, 5, and 7.5 dB, and added to the clean signals. This operation creates training pairs of noisy and clean ECG segments for developing and evaluating denoising systems.

  • The data is pre-split into training, validation, and test partitions. The data set provides pairs of noisy predictors and clean targets that you can use to develop and evaluate denoising systems for removing artifacts commonly encountered in clinical ECG recordings.
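
The rescaling step that produces a target SNR can be sketched as follows. This Python example scales a noise segment so that the clean-to-noise power ratio matches each listed SNR and then adds it to the clean segment. The signals are synthetic stand-ins (a sinusoid and Gaussian noise), and the function names are illustrative; the exact procedure used to build the data set may differ.

```python
import math
import random

def mean_power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    gain = math.sqrt(mean_power(clean) / (mean_power(noise) * 10 ** (snr_db / 10)))
    scaled = [gain * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled)]
    return noisy, scaled

SAMPLE_RATE_HZ = 300
SEGMENT_LEN = 1024
TARGET_SNRS_DB = [-2.5, 0, 2.5, 5, 7.5]

rng = random.Random(0)
# Synthetic stand-ins, not real ECG or recorded noise.
clean = [math.sin(2 * math.pi * 1.2 * n / SAMPLE_RATE_HZ) for n in range(SEGMENT_LEN)]
noise = [rng.gauss(0.0, 1.0) for _ in range(SEGMENT_LEN)]

for snr_db in TARGET_SNRS_DB:
    noisy, scaled = mix_at_snr(clean, noise, snr_db)
    achieved = 10 * math.log10(mean_power(clean) / mean_power(scaled))
    print(f"target {snr_db:+.1f} dB, achieved {achieved:+.2f} dB")
```

The gain follows from solving 10·log10(P_clean / (g²·P_noise)) = SNR for g.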

Fetal ECG Data — Source Separation

Source-separation fetal ECG data set. The figure shows three signals in the time domain.

This data set comprises 600 synthetic fetal ECG recordings [2][13], designed for separating fetal and maternal ECG components from mixed abdominal surface recordings for non-invasive fetal cardiac monitoring during pregnancy. Each recording captured between 10 and 30 seconds of abdominal electrocardiogram (aECG) data from 10 subjects at a sample rate of 1 kHz.

Fetal signals are 3-10 times weaker than maternal signals, requiring sophisticated separation techniques to extract pure fetal ECG from mixed recordings captured by 4-8 abdominal electrode channels. The data set includes four SNR configurations to test model robustness:

  • High SNR (fetal signal relatively strong)

  • Medium-High SNR (moderate fetal signal strength)

  • Medium-Low SNR (weak fetal signal)

  • Low SNR (very weak fetal signal, most challenging condition)

The data set comprises 540 training recordings from nine synthetic pregnant patient profiles and 60 test recordings from one held-out subject.

  • Each recording contains three signal components: mixed abdominal ECG (aECG) as input, pure maternal ECG as ground truth source 1, and pure fetal ECG as ground truth source 2.

  • The training configuration of 9 subjects × 4 SNR levels × 3 cases × 5 repetitions provides 540 diverse examples, while the test configuration of 1 subject × 4 SNR × 3 cases × 5 repetitions provides 60 independent evaluation samples.
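
The configuration counts above can be checked by enumerating every combination. In this Python sketch, the subject numbering (1 to 9 for training, 10 held out) and the integer case and repetition indices are hypothetical placeholders; only the SNR level names and the factor sizes come from the description.

```python
from itertools import product

SNR_LEVELS = ["High", "Medium-High", "Medium-Low", "Low"]
NUM_CASES = 3          # "cases" per the description; their meaning is not detailed here
NUM_REPETITIONS = 5

def enumerate_recordings(subjects):
    """List every (subject, snr_level, case, repetition) combination."""
    return list(product(subjects, SNR_LEVELS,
                        range(1, NUM_CASES + 1),
                        range(1, NUM_REPETITIONS + 1)))

# Hypothetical subject numbering: 1-9 for training, 10 held out for testing.
train = enumerate_recordings(range(1, 10))
test = enumerate_recordings([10])
print(len(train), len(test), len(train) + len(test))  # 540 60 600
```

Holding out an entire subject keeps the test recordings independent of every patient profile seen during training.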

Electroencephalogram (EEG) and Electrooculogram (EOG) Data — Brain Activity

Brain activity data set. The figure shows an overlay of two signals in the time domain.

This data set comprises 4,514 artifact-free EEG signal segments and 3,400 pure-artifact EOG signal segments, designed for removing eye movement artifacts from brain activity recordings [14]. Both types of segments are recordings that measure brain activity. Each segment contains 2 seconds of data at a sample rate of 256 Hz.

Eye movements create strong electrical artifacts that contaminate EEG signals, requiring removal while preserving genuine brain activity for clean EEG analysis in research and diagnostics. The data set contains clean (artifact-free) EEG signals and EOG artifact signals.

  • Clean EEG signals contain frequency content from 0.5-45 Hz (typical EEG bandwidth) with amplitudes in the microvolt range, capturing alpha, beta, theta, and delta brain rhythms in artifact-free epochs verified by experts.

  • EOG artifacts contain frequency content from 0.5-10 Hz (low frequency) with amplitudes typically 10-100 times larger than EEG, including eye blinks, horizontal eye movements, and vertical eye movements characterized by sharp transients and slow drifts.

The data set supports multiple denoising approaches, including additive mixing (clean EEG + scaled EOG creates contaminated signal) and adaptive filtering (reference-based artifact removal). This support enables you to explore different signal processing methodologies for optimal artifact removal performance across various application scenarios.
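
The two approaches named above can be illustrated together. The Python sketch below first builds a contaminated segment by additive mixing, then removes the artifact with a single-tap LMS filter that uses the EOG segment as the reference. The single-tap LMS filter, the step size, the scale factor, and the sinusoidal test signals are all assumptions chosen for illustration; they are not taken from the data set.

```python
import math

SAMPLE_RATE_HZ = 256
SEGMENT_LEN = 2 * SAMPLE_RATE_HZ  # 2-second segments

def rms(x):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def lms_cancel(contaminated, reference, step=0.02):
    """Single-tap LMS: estimate the reference's contribution and subtract it."""
    ref_power = sum(r * r for r in reference) / len(reference)
    mu = step / ref_power          # step size normalized by mean reference power
    weight = 0.0
    cleaned = []
    for d, r in zip(contaminated, reference):
        error = d - weight * r     # the error signal doubles as the cleaned output
        weight += mu * error * r
        cleaned.append(error)
    return cleaned, weight

t = [n / SAMPLE_RATE_HZ for n in range(SEGMENT_LEN)]
eeg = [math.sin(2 * math.pi * 10 * ti) for ti in t]        # synthetic 10 Hz rhythm
eog = [20 * math.sin(2 * math.pi * 1 * ti) for ti in t]    # synthetic slow, large artifact
scale = 0.5
contaminated = [e + scale * o for e, o in zip(eeg, eog)]   # additive mixing

cleaned, weight = lms_cancel(contaminated, eog)

# Compare residual artifact over the second half, after the filter has adapted.
half = SEGMENT_LEN // 2
before = rms([c - e for c, e in zip(contaminated[half:], eeg[half:])])
after = rms([c - e for c, e in zip(cleaned[half:], eeg[half:])])
print(f"estimated scale {weight:.3f}; residual artifact RMS {before:.2f} -> {after:.2f}")
```

A single tap suffices here only because the synthetic artifact is an exact scaled copy of the reference; real recordings would likely need more taps or a different method.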

Geoscience Data Sets

Stanford earthquake

Stanford earthquake data set. The figure shows a noisy vibrational signal and a clear vibrational signal, both in the time domain.

This data set comprises 20,000 seismic measurements from the Stanford Earthquake Dataset (STEAD) [15]. The data is collected at a sample rate of 100 Hz.

The data set provides clean/noisy seismic signal pairs for development of diverse applications, including earthquake early warning systems, seismic signal denoising and enhancement, and automatic phase picking and wave detection.

  • The data set contains 10,000 noisy signals and 10,000 noiseless (clean) signals.

  • The data set has three subsets (training, validation, and testing) using an 80-10-10 split method.
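
The data set ships already partitioned, but the proportions are easy to reproduce. This Python sketch shuffles pair indices with a fixed seed and cuts them 80-10-10. Treating the split as applying to the 10,000 noisy/clean pairs, and the shuffling itself, are assumptions; how the published partitions were drawn is not stated here.

```python
import random

def split_indices(num_items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle item indices and cut them into train/validation/test partitions."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * num_items)
    n_val = round(fractions[1] * num_items)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

NUM_PAIRS = 10_000  # one index per noisy/clean pair (assumed pairing)
train, val, test = split_indices(NUM_PAIRS)
print(len(train), len(val), len(test))  # 8000 1000 1000
```

Splitting by pair index keeps each noisy signal in the same partition as its clean counterpart.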

Noise, Vibration, and Harshness Data Sets

Colored Noise

Colored noise data set. The figure shows two signals in the time domain.

This data set comprises a synthetic collection of 750 random noise process realizations designed for evaluating classifiers on time-series signal data. Each signal has 2,000 samples.

The data set provides 250 signals for each of three distinct noise types with unique spectral characteristics:

  • White noise: Homogeneous power spectral density (PSD) across frequencies.

  • Brown noise: PSD is proportional to 1/f² (low-frequency dominated), with sinusoidal frequencies at 0.19π rad/sample and 0.33π rad/sample.

  • Pink noise: PSD is proportional to 1/f (moderate roll-off), with sinusoidal frequencies at 0.17π rad/sample and 0.31π rad/sample.

The normalized random processes provide controlled conditions for demonstrating signal classification workflows.
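
The spectral exponents above translate directly into roll-off rates, which is what a classifier or a visual inspection of the PSD picks up on. The Python sketch below converts a PSD proportional to 1/f^α into dB per octave and dB per decade; the function names are illustrative.

```python
import math

# Spectral exponents: PSD proportional to 1 / f**alpha.
NOISE_EXPONENTS = {"white": 0, "pink": 1, "brown": 2}

def psd_slope_db_per_octave(alpha):
    """Change in PSD level, in dB, each time the frequency doubles."""
    return -10 * alpha * math.log10(2)

def psd_slope_db_per_decade(alpha):
    """Change in PSD level, in dB, each time the frequency increases tenfold."""
    return -10 * alpha

for name, alpha in NOISE_EXPONENTS.items():
    print(f"{name}: {psd_slope_db_per_octave(alpha):.2f} dB/octave, "
          f"{psd_slope_db_per_decade(alpha):.0f} dB/decade")
```

Pink noise therefore falls by about 3 dB per octave and brown noise by about 6 dB per octave, while white noise stays flat.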

Radar and Wireless Data Sets

Rectangular Pulse and Linear Frequency Modulated (RPLFM) Simulated Radar

RPLFM simulated radar data set. The figure shows six signals in the time domain.

This data set comprises 900 simulated radar waveforms in noise designed to model realistic spectrum-sharing scenarios in the Citizens Broadband Radio Service (CBRS) band at 3.5 GHz [16]. Each waveform is sampled at 10 MHz and lasts 80 milliseconds.

The data set contains two subfolders, comprising training data and test data.

  • The training data comprises 400 rectangular-pulse (RP) radar waveforms and 400 linear-frequency-modulated (LFM) radar waveforms.

  • The test data comprises 50 RP radar waveforms and 50 LFM radar waveforms.

  • All the waveforms have complex-valued white Gaussian noise added to achieve a realistic simulation environment.
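
The sample rate and duration imply long complex-valued waveforms, so it helps to estimate memory use before loading everything at once. The Python sketch below computes the samples per waveform and a rough in-memory footprint. The single- and double-precision byte sizes are assumptions about how the data might be held in memory, not a statement about the file format.

```python
SAMPLE_RATE_HZ = 10e6
DURATION_S = 80e-3
WAVEFORM_COUNTS = {"train_rp": 400, "train_lfm": 400, "test_rp": 50, "test_lfm": 50}

samples_per_waveform = round(SAMPLE_RATE_HZ * DURATION_S)
total_waveforms = sum(WAVEFORM_COUNTS.values())

# Assumed in-memory precisions; the storage format of the files is not stated here.
BYTES_PER_COMPLEX = {"single": 8, "double": 16}

for precision, nbytes in BYTES_PER_COMPLEX.items():
    per_waveform_mb = samples_per_waveform * nbytes / 1e6
    total_gb = per_waveform_mb * total_waveforms / 1e3
    print(f"{precision}: {per_waveform_mb:.1f} MB per waveform, "
          f"{total_gb:.2f} GB for all {total_waveforms} waveforms")
```

Under these assumptions the full set may not fit comfortably in memory, which favors reading waveforms one file at a time.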

Ultra-Wideband Radar Sensed Gestures

Ultra-wideband radar sensed gestures data set. The figure shows eight signals in the time domain.

This data set comprises 9,600 radar-sensed 2-D recordings of 12 dynamic hand gestures [17], gathered from eight different human volunteers.

To obtain each recording, the examiners placed a separate UWB impulse radar at the left, top, and right sides of their experimental setup, resulting in three received radar signal data matrices.

  • The data set contains eight subfolders with 12 radar files each. Each folder corresponds to a subject and each file corresponds to a hand gesture, for a total of 96 trials stored in 96 MAT files.

  • Each radar file has three matrices, each corresponding to one of the radars used in the experimental setup: Left, Top, and Right.

  • Each matrix has 100 recordings concatenated from top to bottom and has a size of 9,000 × 189 (slow-time × fast-time bins). Each matrix is labeled as the hand gesture that generated it.
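
Since each matrix stacks 100 recordings vertically, recovering an individual recording is a matter of row slicing. The Python sketch below assumes that all recordings are the same length (9,000 / 100 = 90 slow-time rows each); that equal-length assumption and the helper names are illustrative rather than confirmed by the description.

```python
NUM_SUBJECTS = 8
NUM_GESTURES = 12
RECORDINGS_PER_MATRIX = 100
MATRIX_ROWS, MATRIX_COLS = 9000, 189   # slow-time x fast-time bins

# Assumes equal-length recordings stacked top to bottom.
ROWS_PER_RECORDING = MATRIX_ROWS // RECORDINGS_PER_MATRIX

def recording_rows(index):
    """Row range (start, stop) of recording `index` (0-based), stop exclusive."""
    if not 0 <= index < RECORDINGS_PER_MATRIX:
        raise IndexError("recording index out of range")
    start = index * ROWS_PER_RECORDING
    return start, start + ROWS_PER_RECORDING

def split_recordings(matrix):
    """Split a stacked matrix (a list of rows) into per-recording blocks."""
    return [matrix[slice(*recording_rows(i))] for i in range(RECORDINGS_PER_MATRIX)]

total_recordings = NUM_SUBJECTS * NUM_GESTURES * RECORDINGS_PER_MATRIX
print(total_recordings, ROWS_PER_RECORDING, recording_rows(99))
```

The product 8 × 12 × 100 reproduces the 9,600 recordings quoted for the data set, counted per radar.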

Movement-based signal data acquired using sensors, like UWB impulse radars, contain patterns specific to different gestures. Correlating motion data with movement benefits several avenues of work, including hand gesture recognition for contactless human-computer interaction.

Continuous-Wave (CW) Radar Health Monitoring

Continuous-wave radar health monitoring data set. The figure shows eight signals in the time domain.

This data set comprises 2,060 files (1,030 CW radar segments and 1,030 ECG segments) of non-contact vital signs gathered from six healthy adult subjects [18]. The data is collected at a downsampled sample rate of 200 Hz for 5.12 seconds per segment.

The data set contains two subfolders, comprising training/validation data and test data.

  • The training/validation set is collected from subjects 1-5 and has 830 segments (704 for training and 126 for validation).

  • The test set is collected from subject 6 and has 200 segments.

The recordings were collected in controlled laboratory settings with subjects in seated or resting positions under normal respiration conditions.

  • The radar signals capture chest wall motion caused by cardiopulmonary activity, with frequency content primarily in the vital signs band (0.5-20 Hz), normalized to arbitrary units. These signals contain physiological information including heartbeat (mechanical cardiac contractions detected via chest displacement), respiration (breathing rate from thoracic expansion/contraction), heart rate variability (beat-to-beat interval variations), and respiratory sinus arrhythmia (heart rate modulation with breathing cycles).

  • The synchronized ECG reference signals are normalized by subtracting the median and rescaling so the maximum peak equals 1, providing ground-truth cardiac timing aligned with radar measurements.
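
The ECG normalization described above can be sketched in a few lines of Python. This version interprets "maximum peak" as the largest value of the median-subtracted signal, which is an assumption; the example segment is a hypothetical flat baseline with a single spike, not real ECG.

```python
from statistics import median

def normalize_ecg(ecg):
    """Subtract the median, then rescale so the largest value equals 1."""
    baseline = median(ecg)
    centered = [v - baseline for v in ecg]
    peak = max(centered)
    if peak <= 0:
        raise ValueError("signal has no positive peak after median removal")
    return [v / peak for v in centered]

SAMPLE_RATE_HZ = 200
SEGMENT_LEN = round(5.12 * SAMPLE_RATE_HZ)   # samples per 5.12-second segment

# Hypothetical ECG-like segment: flat baseline at 0.2 with one R-peak-like spike.
segment = [0.2] * SEGMENT_LEN
segment[500] = 1.7
normalized = normalize_ecg(segment)
print(SEGMENT_LEN, max(normalized), median(normalized))
```

Using the median rather than the mean keeps the baseline estimate from being pulled upward by the tall, narrow R peaks.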

The time-aligned radar and ECG data enable development of systems that map radar signals to cardiac metrics, supporting applications in remote patient monitoring, sleep apnea detection, elderly care, infant monitoring, smart home health monitoring, and vital signs monitoring in environments where contact sensors are impractical.

Radio-Frequency (RF) Frame Detection

RF frame detection data set. The figure shows eight signals in the time domain.

This data set comprises 4,831 complex I/Q baseband signals and a pretrained network [19]. The data is collected at a sample rate of 25 MHz for an average duration of 4.5 milliseconds.

The data set contains labeled radio frequency (RF) signal recordings from software-defined radio (SDR) captures of real-world wireless transmissions in the 2.4 GHz ISM band for identifying wireless communication protocols. The data set includes six RF signal classes:

  • BLE_1MHz — Bluetooth® Low Energy (BLE) 1 MHz bandwidth with frequency-hopping spread spectrum for IoT devices.

  • BLE_2MHz — Bluetooth Low Energy (BLE) 2 MHz bandwidth with extended advertising mode for high-throughput devices.

  • BT_classic — Bluetooth Classic BR/EDR with 1 MHz channels for audio streaming and phone calls.

  • WLAN — Wi-Fi® 802.11b/g/n with OFDM or DSSS modulation and 20-22 MHz channels for wireless networking.

  • Collision — Overlapping transmissions from multiple standards with superimposed frequency patterns.

  • Undefined — Background noise without active transmissions for idle spectrum detection.

You can obtain signal spectrograms using a 256-sample Hann window, 50% overlap between adjoining segments, and 256 discrete Fourier transform points. The resulting time-frequency images show magnitude on a dB scale with a minimum threshold of -80 dB. The high sample rate and detailed time-frequency domain representation enable effective spectrum sensing for cognitive radio and coexistence management in crowded RF environments.
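
The spectrogram settings above fix the time and frequency resolution of the resulting images. The Python sketch below works out the hop size, the number of columns for an average-length signal, the bin spacing, and the time step between columns. It assumes that only full windows are used, with no padding at the signal edges.

```python
SAMPLE_RATE_HZ = 25e6
WINDOW_LEN = 256
OVERLAP = WINDOW_LEN // 2       # 50% overlap
NFFT = 256

HOP = WINDOW_LEN - OVERLAP      # samples between the starts of adjoining windows

def num_frames(num_samples):
    """Number of full windows that fit in the signal (no padding assumed)."""
    if num_samples < WINDOW_LEN:
        return 0
    return 1 + (num_samples - WINDOW_LEN) // HOP

bin_spacing_hz = SAMPLE_RATE_HZ / NFFT     # frequency spacing between DFT bins
frame_step_s = HOP / SAMPLE_RATE_HZ        # time between spectrogram columns

avg_samples = round(SAMPLE_RATE_HZ * 4.5e-3)   # average-length signal
print(num_frames(avg_samples), bin_spacing_hz, frame_step_s)
```

Because the signals are complex I/Q, all 256 bins carry independent information and together span the full 25 MHz of captured bandwidth.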

References

[1] Verma, N. K., Sevakula, R. K., Dixit, S., & Salour, A. (2016). "Intelligent Condition Based Monitoring Using Acoustic Signals for Air Compressors." IEEE Transactions on Reliability, Vol. 65, Number 1, pp. 291–309.

[2] Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. Ch., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., and Stanley, H. E. (2000) "PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals." Circulation, Vol. 101, Number 23, pp. e215–e220.

[3] Liu et al. (2016) "An open access database for the evaluation of heart sound algorithms." Physiological Measurement, Vol. 37, Number 12, pp. 2181–2213.

[4] Giannoulis, D., Stowell, D., Benetos, E., Rossignol, M., Lagrange, M., and Plumbley, M. D. (2013) "A database and challenge for acoustic scene classification and event detection." 21st European Signal Processing Conference (EUSIPCO 2013), pp. 1–5.

[5] Stowell, D., Giannoulis, D., Benetos, E., Lagrange, M., and Plumbley, M. D. (2015) "Detection and Classification of Acoustic Scenes and Events." IEEE Transactions on Multimedia, Vol. 17, Number 10, pp. 1733–46.

[6] Jakobovski. “Jakobovski/Free-Spoken-Digit-Dataset.” GitHub, May 30, 2019. https://github.com/Jakobovski/free-spoken-digit-dataset.

[7] Mozilla Common Voice Corpus, https://commonvoice.mozilla.org/.

[8] Laguna, P., Mark, R. G., Goldberger, A. L., and Moody, G. B. (1997) "A Database for Evaluation of Algorithms for Measurement of QT and Other Waveform Intervals in the ECG." Computers in Cardiology, Vol. 24, pp. 673–676.

[9] Laguna, P., Jané, R., and Caminal, P. (1994) "Automatic detection of wave boundaries in multilead ECG signals: Validation with the CSE database." Computers and Biomedical Research, Vol. 27, Number 1, pp. 45–60.

[10] Chan, A. D. C., and Green, G. C. (2007) "Myoelectric Control Development Toolbox." 30th Conference of the Canadian Medical & Biological Engineering Society, Toronto, Canada.

[11] Moody, G. B., and Mark, R. G. (2001) "The impact of the MIT-BIH Arrhythmia Database." IEEE Engineering in Medicine and Biology Magazine, Vol. 20, Number 3, pp. 45–50.

[12] Moody, G. B., Muldrow, W. E., and Mark, R. G. (1984) "A noise stress test for arrhythmia detectors." Computers in Cardiology, Vol. 11, pp. 381–384.

[13] Andreotti, F., Behar, J., and Clifford, G. D. (2016) "Fetal ECG Synthetic Database" https://physionet.org/content/fecgsyndb/1.0.0/.

[14] Zhang, H., Zhao, M., Wei, C., Mantini, D., Li, Z., and Liu, Q. (2021) "EEGdenoiseNet: A Benchmark Dataset for End-to-End Deep Learning Solutions of EEG Denoising." arXiv:2009.11662 https://arxiv.org/abs/2009.11662.

[15] Mousavi, S. M., Sheng, Y., Zhu, W., and Beroza, G. C. (2019) "STanford EArthquake Dataset (STEAD): A Global Data Set of Seismic Signals for AI." IEEE Access, Vol. 7, pp. 179464–76.

[16] Caromi, R., Souryal, M., and Hall, T. A. (2017). "RF Dataset of Incumbent Radar Signals in the 3.5GHz CBRS Band." Journal of Research of the National Institute of Standards and Technology, Vol. 124, Number 124038.

[17] Ahmed, S., Wang, D., Park, J., et al. (2021). "UWB-gestures, a public dataset of dynamic hand gestures acquired using impulse radar sensors." Scientific Data, Vol. 8, Article 102.

[18] Schellenberger, S., Shi, K., Steigleder, T. et al. (2020) "A dataset of clinically recorded radar vital signs with synchronized reference sensor signals." Scientific Data, Vol. 7, Article 291.

[19] Wicht, J., Wetzker, U., and Jain, V. (2022) "Spectrogram Data Set for Deep Learning Based RF-Frame Detection." Data, Vol. 7, Number 12, p. 168.
