Hi Finn,
Given the nature of your problem, where each data point is a set of vectors (rework dimensions) with a label ('Scrap' or 'Reuse'), and the order of the vectors within each set is not fixed, you are dealing with an input structure that needs careful handling. Here are some strategies you might consider:
1. Feature Engineering:
Try to extract meaningful features from each set of rework dimensions that can be used to represent each data point in a fixed-size feature space.
- Aggregation: Compute statistical measures (mean, median, standard deviation, min, max, etc.) across the vectors in each data point.
- Dimensionality Reduction: Use techniques like PCA (Principal Component Analysis) to reduce the set of vectors to a smaller set of principal components that capture most of the variance. Refer to this documentation link: Principal component analysis of raw data - MATLAB pca (mathworks.com)
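As a minimal sketch of the aggregation idea (assuming each rework vector is a numeric list of equal width), you can collapse a variable-size set into a fixed-size feature vector by computing per-column statistics:

```python
import statistics

def aggregate_features(vectors):
    """Collapse a variable-size set of rework-dimension vectors into a
    fixed-size feature vector by aggregating each column independently."""
    columns = list(zip(*vectors))  # transpose: one tuple per dimension
    features = []
    for col in columns:
        features += [
            statistics.mean(col),
            statistics.median(col),
            statistics.pstdev(col),
            min(col),
            max(col),
        ]
    return features

# Two rework events, each described by 3 dimensions
point = [[1.0, 2.0, 3.0],
         [3.0, 4.0, 5.0]]
print(aggregate_features(point))  # fixed length: 3 dims x 5 stats = 15
```

The resulting vector has the same length no matter how many rework events a part has, so it can feed any standard classifier.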
2. Sequence Models:
- Sorting: As you mentioned, sorting the vectors by an important column could be a good preprocessing step before feeding them into a sequence model.
- Padding: If necessary, pad the sequences to a fixed length.
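The sort-then-pad preprocessing can be sketched as follows (the sort column, pad value, and maximum length are illustrative choices, not fixed by your data):

```python
def sort_and_pad(vectors, sort_col, max_len, pad_value=0.0):
    """Sort vectors by one column, then pad (or truncate) to a fixed length
    so every data point becomes a same-shape sequence."""
    ordered = sorted(vectors, key=lambda v: v[sort_col])[:max_len]
    width = len(vectors[0])
    while len(ordered) < max_len:
        ordered.append([pad_value] * width)  # zero rows mark "no rework event"
    return ordered

seq = sort_and_pad([[5.0, 1.0], [2.0, 9.0]], sort_col=0, max_len=4)
# rows are now ordered by column 0 and padded with zero rows to length 4
```

A masking step in the downstream model can then tell it to ignore the padded rows.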
3. Set Functions:
Explore neural network architectures that are invariant to the order of the input vectors, such as set functions.
- Deep Sets: This architecture can process sets of vectors and is inherently permutation-invariant. You can refer to this research paper for deep sets: [1703.06114] Deep Sets (arxiv.org)
- Attention Mechanisms: Use attention to weigh the importance of different vectors within each set, which can help the model focus on the most relevant rework dimensions.
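To make the permutation-invariance property concrete, here is a toy Deep Sets forward pass with random (untrained) weights; the layer sizes and tanh activations are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def deep_sets_forward(X, W_phi, W_rho):
    """Minimal Deep Sets forward pass: embed each vector with phi,
    sum-pool across the set (order-independent), then map with rho."""
    H = np.tanh(X @ W_phi)          # per-vector embedding phi
    pooled = H.sum(axis=0)          # permutation-invariant pooling
    return np.tanh(pooled @ W_rho)  # set-level representation rho

d_in, d_hid, d_out = 3, 8, 2
W_phi = rng.normal(size=(d_in, d_hid))
W_rho = rng.normal(size=(d_hid, d_out))

X = rng.normal(size=(5, d_in))      # a set of 5 rework vectors
perm = rng.permutation(5)
out1 = deep_sets_forward(X, W_phi, W_rho)
out2 = deep_sets_forward(X[perm], W_phi, W_rho)
print(np.allclose(out1, out2))  # True: output ignores vector order
```

Replacing the sum pooling with an attention-weighted sum gives the attention variant while keeping the same overall structure.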
4. Graph-Based Models:
If there is a relationship between the rework dimensions, you could represent each data point as a graph, with dimensions as nodes and some logical relationship as edges.
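One plausible (hypothetical) edge rule is to connect two rework dimensions when their vectors lie within some distance threshold; the adjacency matrix below could then be handed to a graph neural network:

```python
import math

def build_graph(vectors, threshold):
    """Build an adjacency matrix over rework dimensions (nodes), linking
    two nodes when their vectors are within `threshold` Euclidean
    distance -- one illustrative choice of edge relationship."""
    n = len(vectors)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(vectors[i], vectors[j]) <= threshold:
                adj[i][j] = adj[j][i] = 1
    return adj

adj = build_graph([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]], threshold=2.0)
# nodes 0 and 1 are linked; node 2 is isolated
```

The right edge definition depends entirely on your domain knowledge of how rework dimensions interact.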
5. Multiple Instance Learning (MIL):
- Instance-Level Predictions: Make predictions on each vector and then aggregate these predictions to make a bag-level prediction.
- Aggregate Features: Extract features from each instance and aggregate them to form a single feature vector for the bag.
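The instance-level variant can be sketched as follows; the `model` here is a toy stand-in for whatever per-vector classifier you train, and the max rule reflects the intuition that one bad rework event may be enough to scrap a part:

```python
def bag_prediction(instances, instance_model, rule="max"):
    """Multiple-instance learning: score each vector, then aggregate
    the instance scores into one bag-level score."""
    scores = [instance_model(v) for v in instances]
    if rule == "max":  # 'Scrap' if any single instance looks bad
        return max(scores)
    return sum(scores) / len(scores)  # "mean" rule

# toy instance model: scrap probability grows with the first dimension
model = lambda v: min(1.0, max(0.0, v[0] / 10.0))
bag = [[2.0, 0.1], [9.0, 0.3], [4.0, 0.2]]
print(bag_prediction(bag, model))  # 0.9 under the max rule
```

Whether max, mean, or a learned pooling fits best depends on how scrap decisions are actually made in your process.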
6. Custom Model:
Combine the ideas above into a custom architecture tailored to your data, for example a permutation-invariant embedding stage followed by a standard classifier head.
7. Similarity-Based Methods:
Use similarity or distance metrics to compare the sets of vectors and use these metrics as features for a machine learning model.
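One order-independent set-to-set metric you could use this way is the symmetric Chamfer distance (the choice of metric is an assumption; Hausdorff or an earth-mover style distance would work similarly):

```python
import math

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between two sets of vectors: average
    nearest-neighbour distance in each direction, summed. The result
    does not depend on the order of vectors within either set."""
    def one_way(S, T):
        return sum(min(math.dist(s, t) for t in T) for s in S) / len(S)
    return one_way(A, B) + one_way(B, A)

a = [[0.0, 0.0], [1.0, 0.0]]
b = [[0.0, 0.0], [1.0, 0.0]]
print(chamfer_distance(a, b))  # 0.0 for identical sets
```

Distances from each data point to a few reference sets (e.g. prototypical 'Scrap' and 'Reuse' parts) can then serve as features for an ordinary classifier.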
Before implementing any of these strategies, it's crucial to understand the nature of the rework dimensions and the domain knowledge behind the aircraft maintenance data. This understanding can guide the feature engineering process and the choice of model architecture. Additionally, it's often beneficial to start with a simple model to establish a baseline before moving on to more complex models.