How to use SVM in MATLAB for my binary feature vectors.
Let's say I have a main feature set that combines six binary feature vectors. These six binary feature vectors form a 105x6 logical array. E.g.:
1. 10100001000001111111100000000001..
2. 00001010101111000010101010110001..
3. 00101011101111111100001000000000..
4. 11111111110000101010101001010111..
5. 0000011110000101010101001010111..
6. 11111111110000101010101001010110..
Three of the feature vectors are for benign samples, and the other three are for malware. How can I train on these feature vectors using svmtrain and svmclassify? I have no idea how to start; please guide me.
Answers (2)
Walter Roberson
on 8 Apr 2017
0 votes
Do you mean you have 105 samples, each of which has a feature vector totaling 6 bits, or do you mean you have 6 samples, each of which has a total of 105 bits of features?
If you only have 6 samples with 105 bits of features per sample, then you do not have enough data to do classification.
2 Comments
ai ping Ng
on 9 Apr 2017
Walter Roberson
on 9 Apr 2017
To do the calculations for classifications, you need at least as many samples as you have bits of features. More than that, actually.
"user2030669, @cbeleites answer below is superb but as a rough rule of thumb: you need at least 6 times the number of cases (samples) as features." – BGreene Mar 7 '13 at 14:48
"... in each class. I've also seen recommendations of 5p and 3p / class." – cbeleites Mar 7 '13 at 20:02
[...] but you need a minimum of 96 observations to accurately predict the probability of a binary outcome even if there are no features to be examined [this is to achieve, with 0.95 confidence, a margin of error of 0.1 in estimating the actual marginal probability that Y=1].
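The 96-observation figure is consistent with the standard normal-approximation sample-size formula for estimating a proportion, n = z²·p(1−p)/E², with worst-case p = 0.5, margin E = 0.1, and z ≈ 1.96. A quick check (sketched here in Python; identifying this particular formula is my reading, not something stated in the thread):

```python
# Sample size for estimating a proportion via the normal approximation:
#   n = z^2 * p * (1 - p) / E^2
z = 1.96   # two-sided 95% confidence
p = 0.5    # worst case: maximizes p * (1 - p)
E = 0.1    # desired margin of error

n = z**2 * p * (1 - p) / E**2
print(round(n, 2))  # 96.04, i.e. at least 96 observations
```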
Ilya
on 11 Apr 2017
0 votes
You most certainly do not need as many samples as you have features. Statements like "you need at least 6 times the number of cases (samples) as features" are sheer nonsense.
However, with so few observations (6) you will likely find that several, perhaps many, features individually give perfect separation between the two classes. For example, staring at the posted patterns, I observe that the 6th bit is 0 for the first three samples and 1 for the last three samples. So if the first three are benign and the last three are malware, the 6th bit is a perfect predictor. And there may be more.
You do not need SVM or any clever classifier for this problem. Just find all such perfect predictors and see if they make sense. Passing data to smart black boxes shouldn't be the first step in your analysis. Think about what your data means first. See if you can get a simple classification model by hand. If you fail, proceed with sophisticated algorithms.
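Scanning for such perfect predictors is easy to automate. A minimal sketch (in Python rather than MATLAB, and with made-up 6-sample toy data mimicking the pattern described above, where only the 6th bit separates the classes):

```python
# Six samples (rows), each a short binary feature vector; the first three
# are benign (label 0), the last three malware (label 1). Toy data only.
X = [
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
]
y = [0, 0, 0, 1, 1, 1]

def perfect_predictors(X, y):
    """Return column indices whose values alone separate the two classes."""
    hits = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        vals0 = {c for c, label in zip(col, y) if label == 0}
        vals1 = {c for c, label in zip(col, y) if label == 1}
        if vals0.isdisjoint(vals1):   # no overlap -> perfect separation
            hits.append(j)
    return hits

print(perfect_predictors(X, y))  # [5]: the "6th bit" separates perfectly
```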
11 Comments
Walter Roberson
on 11 Apr 2017
Edited: Walter Roberson
on 11 Apr 2017
If you do not have several times more samples than you have features, then you cannot have any confidence in the results you get.
Suppose all of the bits are completely random with 50% probability. With 6 samples, perfect separation would occur for the pattern 111000 or 000111, which is 2 out of 64 possibilities. The probability that an event of probability 2/64 occurs at least once in 150 independent trials is
1 - (1 - 2/64)^150
which is about 0.9915, i.e. 99.15%.
The average number of occurrences of such an event would be (2/64) * 150 = 75/16 = 4.6875 . So on the order of 5 apparent perfect separations in the bunch would be expected just by random chance.
You might be able to find a perfect separation, but you cannot have any confidence in it at all unless you have many more samples.
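The arithmetic above is easy to double-check (a quick Python sketch, not part of the original thread):

```python
# Probability that a pattern of chance probability 2/64 appears at least
# once among 150 independent random binary features, plus the expected count.
p_pattern = 2 / 64      # 111000 or 000111, out of 2^6 equally likely patterns
n_features = 150

p_at_least_one = 1 - (1 - p_pattern) ** n_features
expected_count = p_pattern * n_features

print(round(p_at_least_one, 4))  # 0.9915
print(expected_count)            # 4.6875
```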
Ilya
on 11 Apr 2017
You can have certain confidence in your results when there are more features than samples, and sometimes that confidence would suffice to publish a paper, submit a grant proposal etc, that is, for all practical purposes. Analysis of wide data (more features than samples) is what people do routinely in various fields. One example is microarray experiments. Many statistics and machine learning papers are written about such applications.
Walter Roberson
on 12 Apr 2017
6 samples is not enough for any confidence for binary features.
My calculation is that if you have F binary features and you want the probability to be less than 1/20 that a perfect separation arose by chance, then you need the number of samples, S, to satisfy S >= 1 - ln(1 - exp(ln(19/20)/F))/ln(2). Near F = 150, that works out to S > 12.5 or so. This bound does not increase quickly as F increases, needing an increase of approximately 1 for each doubling of F.
So, okay, Yes, if one examines only this aspect of binary features, then you do not need many times as many samples as you have features.
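The bound is straightforward to reproduce numerically. A Python sketch, assuming (my reading of the derivation) that it comes from requiring 1 - (1 - 2^(1-S))^F <= 1/20 under the same coin-flip model:

```python
from math import exp, log

def min_samples(F, alpha=1/20):
    """Smallest S such that the chance that any of F random binary features
    perfectly separates S balanced samples stays below alpha."""
    # Solve 1 - (1 - 2**(1 - S))**F <= alpha for S:
    return 1 - log(1 - exp(log(1 - alpha) / F)) / log(2)

print(round(min_samples(150), 1))                     # about 12.5
print(round(min_samples(300) - min_samples(150), 2))  # ~1 more per doubling of F
```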
Ilya
on 12 Apr 2017
I agree that 6 samples are not enough for high confidence. Your original response contained far broader statements, such as "To do the calculations for classifications, you need at least as many samples as you have bits of features." or "as a rough rule of thumb: you need at least 6 times the number of cases (samples) as features." I think making such broad statements in response to questions from inexperienced users is counterproductive, especially without knowing their analysis goal.
Walter Roberson
on 12 Apr 2017
The part I quoted, "in each class. I've also seen recommendations of 5p and 3p / class. – cbeleites Mar 7 '13 at 20:02", was written by the main author of the paper linked to on Stack Overflow: https://arxiv.org/pdf/1211.1323.pdf
Ilya
on 13 Apr 2017
I do not see any discussion of the relation between sample size and number of features in that paper. The "6 times the number of cases as features" recommendation is not in that paper. I have trouble seeing how that recommendation can be made based on that paper; in fact, reading the summary and knowing that Raman spectroscopy typically involves thousands of features, I reach the opposite conclusion: far fewer than 6p samples per class are needed.
Back to my point. We (people who reply to questions on this site) should not be in the business of copying and pasting someone's hectic advice from some web page, without any justification or source given. This does not help inexperienced users.
Walter Roberson
on 13 Apr 2017
It is not, in my opinion, reasonable to expect that we independently research and validate answers provided by someone who wrote a paper on the topic. At some point we are allowed to assume that such a person knows what they are talking about.
I can't help but notice that you have not provided any source or formulae for your claims.
Ilya
on 14 Apr 2017
It is, in my opinion, unreasonable to assume that if someone wrote a paper on a related topic, not on the topic of concern in the question, we should take for granted statements from that person on that related subject, especially when such statements look like gross over-simplifications. You are not allowed to assume that, because someone wrote a paper about a learning curve, that someone knows something about the relation between the number of samples and the number of features. These are different subjects, and the mere fact that someone knows how to compute a learning curve for a dataset in a specific application does not mean that they know how to capture, in meaningful terms, the relation between data size and dimensionality. It is reasonable to expect that you at least read a paper before offering it to someone, and make sure that the paper has material related to the question.
The only claim I am making is that it is difficult to capture the relation between number of samples and features, and for that reason over-simplifications you quoted are sheer nonsense. I am happy to provide a bunch of references illustrating how analysis is done on wide datasets in violation of rules such as "6p" and all that other nonsense in your post. Unlike you, I read such papers for a living and have a large collection on my hard disk.
Walter Roberson
on 14 Apr 2017
"You are not allowed to assume that if someone wrote a paper about a learning curve, that someone knows something about the relation between number or samples and number of features."
Incorrect. You will find that I am allowed to assume exactly that. Or are you paying my bills?
Walter Roberson
on 14 Apr 2017
Ilya, this resource (MATLAB Answers) is not an academic journal: it is a resource in which people do the best they can in their spare time to help other people.
Cross-checking competing papers takes time, and might require years of background experience to know all the relevant factors, and to know things like which papers were later refuted. From time to time someone with a lot of deep knowledge in a topic wanders by here and helps out.
But... mostly topic experts do not wander here and help out. That leaves the volunteers with a choice:
A) Leave nearly all the questions here unanswered because we are not the topic experts; or
B) Do some surface-level research of appropriate papers and books, hoping that our S/T/E/M backgrounds are enough to guide us to something useful that we can interpret for the people asking the questions; or
C) Answer based upon our memory, and using past postings of how other people have answered similar questions in the past (people who might not have been topic experts either.)
I often end up answering questions that involve matters outside my topic expertise, including on topics that I may never have heard of before. I would prefer if there were experts on hand on every topic, ready to step in promptly... but those people simply are not available.
So I have a look; and I answer what I can, in the time I have available; to the extent that my health allows.
It is not the best of situations, but unfortunately a lot of the time I am the only help people have. It is, to be frank, a very heavy burden at times.
Ilya
on 15 Apr 2017
Walter, I appreciate this explanation.
I agree that this resource is not an academic journal, and the threshold for posting an answer is much lower than that for a publication. I also note that there are no consistent rules for people who answer on Answers (at least I am not aware of any), and for that reason you can choose any philosophy you like with respect to the quality/thoughtfulness of your answers. Yet I believe that answering questions outside your expertise without doing some verification first is dangerous and often produces plain wrong (not just somewhat incorrect) answers, which is worse than not giving any answer at all. Just like you said - "those people simply are not available", where "those people" means "experts". Because experts are not available, no one is there to refute a wrong answer, and the wrong answer stays on this site forever, serving as a source of confusion and support for similarly misguided future answers.
On my part, I choose to answer only questions for which I consider myself an expert. I doubt that by doing so I fail to provide critical help to people out there. Many people asking questions on this site are students, and they can certainly find other sources of help such as, for instance, their professors. This is especially true for questions such as this one, where the entire discussion revolves around theory and has nothing to do with MATLAB. It's just that submitting a question to Answers takes less effort than scheduling an appointment with faculty, and they resort to this easy way. If they knew the likelihood of getting a plain wrong answer was high, they would likely not resort to this easy way.
I appreciate your desire to help and am not asking you to apply the same level of scrutiny as that for an academic publication. I think though that raising the bar a bit higher would be a positive change toward improving quality, perhaps at the expense of reducing the overall number of answers; I think such a reduction would be acceptable since it would also lead to reduction of plain wrong answers. Also, doing more verification would allow you to learn the material at a deeper level and develop knowledge of new areas. I do not know to what extent you are interested in learning, of course.