How to use SVM in MATLAB for my binary feature vectors.
Let's say I have a main feature set that combines six binary feature vectors. These six binary feature vectors form a 105x6 logical array. E.g.:
1. 10100001000001111111100000000001..
2. 00001010101111000010101010110001..
3. 00101011101111111100001000000000..
4. 11111111110000101010101001010111..
5. 0000011110000101010101001010111..
6. 11111111110000101010101001010110..
Three of the feature vectors are for benign samples, and the other three are for malware. How can I train on these feature vectors using svmtrain and svmclassify? I have no idea how to start; please guide me.
Answers (2)
Walter Roberson
on 8 Apr 2017
0 votes
Do you mean you have 105 samples, each of which has a feature vector totaling 6 bits, or do you mean you have 6 samples, each of which has a total of 105 bits of features?
If you only have 6 samples with 105 bits of features per sample, then you do not have enough data to do classification.
2 Comments
ai ping Ng
on 9 Apr 2017
Walter Roberson
on 9 Apr 2017
To do the calculations for classifications, you need at least as many samples as you have bits of features. More than that, actually.
"user2030669, @cbeleites answer below is superb but as a rough rule of thumb: you need at least 6 times the number of cases (samples) as features." – BGreene Mar 7 '13 at 14:48
"... in each class. I've also seen recommendations of 5p and 3p / class." – cbeleites Mar 7 '13 at 20:02
[...] but you need a minimum of 96 observations to accurately predict the probability of a binary outcome even if there are no features to be examined [this is to achieve, with 0.95 confidence, a margin of error of 0.1 in estimating the actual marginal probability that Y=1].
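The 96-observation figure is consistent with the standard normal-approximation sample-size formula for estimating a proportion, n = z²·p(1−p)/E², with worst-case p = 0.5, margin E = 0.1, and z ≈ 1.96. A quick check (sketched here in Python; identifying this particular formula is my reading, not something stated in the thread):

```python
# Sample size for estimating a proportion via the normal approximation:
#   n = z^2 * p * (1 - p) / E^2
z = 1.96   # two-sided 95% confidence
p = 0.5    # worst case: maximizes p * (1 - p)
E = 0.1    # desired margin of error

n = z**2 * p * (1 - p) / E**2
print(round(n, 2))  # 96.04, i.e. at least 96 observations
```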
Ilya
on 11 Apr 2017
0 votes
You most certainly do not need as many samples as you have features. Statements like "you need at least 6 times the number of cases (samples) as features" are sheer nonsense.
However, with so few observations (6) you will likely find that several, perhaps many, features individually give perfect separation between the two classes. For example, staring at the posted patterns, I observe that the 6th bit is 0 for the first three samples and 1 for the last three samples. So if the first three are benign and the last three are malware, the 6th bit is a perfect predictor. And there may be more.
You do not need SVM or any clever classifier for this problem. Just find all such perfect predictors and see if they make sense. Passing data to smart black boxes shouldn't be the first step in your analysis. Think about what your data means first. See if you can get a simple classification model by hand. If you fail, proceed with sophisticated algorithms.
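Scanning for such perfect predictors is easy to automate. A minimal sketch (in Python rather than MATLAB, and with made-up 6-sample toy data mimicking the pattern described above, where only the 6th bit separates the classes):

```python
# Six samples (rows), each a short binary feature vector; the first three
# are benign (label 0), the last three malware (label 1). Toy data only.
X = [
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
]
y = [0, 0, 0, 1, 1, 1]

def perfect_predictors(X, y):
    """Return column indices whose values alone separate the two classes."""
    hits = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        vals0 = {c for c, label in zip(col, y) if label == 0}
        vals1 = {c for c, label in zip(col, y) if label == 1}
        if vals0.isdisjoint(vals1):   # no overlap -> perfect separation
            hits.append(j)
    return hits

print(perfect_predictors(X, y))  # [5]: the "6th bit" separates perfectly
```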
11 Comments
Walter Roberson
on 11 Apr 2017
Edited: Walter Roberson
on 11 Apr 2017
If you do not have several times more samples than you have features, then you cannot have any confidence in the results you get.
Suppose all of the bits are completely random with 50% probability. With 6 samples, perfect separation would occur for the pattern 111000 or 000111, which is 2 out of 64 possibilities. The probability that an event of probability 2/64 occurs at least once in 150 independent trials is
1 - (1 - 2/64)^150
which is about 0.9915, i.e. 99.15%.
The average number of occurrences of such an event would be (2/64) * 150 = 75/16 = 4.6875 . So on the order of 5 apparent perfect separations in the bunch would be expected just by random chance.
You might be able to find a perfect separation, but you cannot have any confidence in it at all unless you have many more samples.
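The arithmetic above is easy to double-check (a quick Python sketch, not part of the original thread):

```python
# Probability that a pattern of chance probability 2/64 appears at least
# once among 150 independent random binary features, plus the expected count.
p_pattern = 2 / 64      # 111000 or 000111, out of 2^6 equally likely patterns
n_features = 150

p_at_least_one = 1 - (1 - p_pattern) ** n_features
expected_count = p_pattern * n_features

print(round(p_at_least_one, 4))  # 0.9915
print(expected_count)            # 4.6875
```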
Ilya
on 11 Apr 2017
You can have certain confidence in your results when there are more features than samples, and sometimes that confidence would suffice to publish a paper, submit a grant proposal etc, that is, for all practical purposes. Analysis of wide data (more features than samples) is what people do routinely in various fields. One example is microarray experiments. Many statistics and machine learning papers are written about such applications.
Walter Roberson
on 12 Apr 2017
6 samples is not enough for any confidence for binary features.
My calculation is that if you have F binary features and you want the probability to be less than 1/20 that a perfect separation arose by chance, then you need the number of samples, S, to satisfy S >= 1 - ln(1 - exp(ln(19/20)/F))/ln(2). Near F = 150, that works out to S > 12.5 or so. This bound does not increase quickly as F increases, needing an increase of approximately 1 for each doubling of F.
So, okay, Yes, if one examines only this aspect of binary features, then you do not need many times as many samples as you have features.
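The bound is straightforward to reproduce numerically. A Python sketch, assuming (my reading of the derivation) that it comes from requiring 1 - (1 - 2^(1-S))^F <= 1/20 under the same coin-flip model:

```python
from math import exp, log

def min_samples(F, alpha=1/20):
    """Smallest S such that the chance that any of F random binary features
    perfectly separates S balanced samples stays below alpha."""
    # Solve 1 - (1 - 2**(1 - S))**F <= alpha for S:
    return 1 - log(1 - exp(log(1 - alpha) / F)) / log(2)

print(round(min_samples(150), 1))                     # about 12.5
print(round(min_samples(300) - min_samples(150), 2))  # ~1 more per doubling of F
```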
Ilya
on 12 Apr 2017
I agree that 6 samples are not enough for high confidence. Your original response contained far broader statements, such as "To do the calculations for classifications, you need at least as many samples as you have bits of features." or "as a rough rule of thumb: you need at least 6 times the number of cases (samples) as features." I think making such broad statements in response to questions from inexperienced users is counterproductive, especially without knowing their analysis goal.
Walter Roberson
on 12 Apr 2017
The part I quoted, "in each class. I've also seen recommendations of 5p and 3p / class. – cbeleites Mar 7 '13 at 20:02", was written by the main author of the paper linked to on Stack Overflow: https://arxiv.org/pdf/1211.1323.pdf
Ilya
on 13 Apr 2017
I do not see any discussion of the relation between sample size and number of features in that paper. The "6 times the number of cases as features" recommendation is not in that paper. I have trouble seeing how that recommendation can be made based on that paper; in fact, reading the summary and knowing that Raman spectroscopy typically involves thousands of features, I reach the opposite conclusion: far fewer than 6p samples per class are needed.
Back to my point. We (people who reply to questions on this site) should not be in the business of copying and pasting someone's hectic advice from some web page, without any justification or source given. This does not help inexperienced users.
Walter Roberson
on 13 Apr 2017
It is not, in my opinion, reasonable to expect that we independently research and validate answers provided by someone who wrote a paper on the topic. At some point we are allowed to assume that such a person knows what they are talking about.
I can't help but notice that you have not provided any source or formulae for your claims.
Ilya
on 14 Apr 2017
It is, in my opinion, unreasonable to assume that if someone wrote a paper on a related topic, not on the topic of concern in the question, we should take for granted statements from that person on that related subject, especially when such statements look like gross over-simplifications. You are not allowed to assume that, because someone wrote a paper about a learning curve, that someone knows something about the relation between the number of samples and the number of features. These are different subjects, and the mere fact that someone knows how to compute a learning curve for a dataset in a specific application does not mean that they know how to capture, in meaningful terms, the relation between data size and dimensionality. It is reasonable to expect that you at least read a paper before offering it to someone, and make sure that the paper has material related to the question.
The only claim I am making is that it is difficult to capture the relation between number of samples and features, and for that reason over-simplifications you quoted are sheer nonsense. I am happy to provide a bunch of references illustrating how analysis is done on wide datasets in violation of rules such as "6p" and all that other nonsense in your post. Unlike you, I read such papers for a living and have a large collection on my hard disk.
Walter Roberson
on 14 Apr 2017
"You are not allowed to assume that if someone wrote a paper about a learning curve, that someone knows something about the relation between number or samples and number of features."
Incorrect. You will find that I am allowed to assume exactly that. Or are you paying my bills?
Walter Roberson
on 14 Apr 2017
Ilya, this resource (MATLAB Answers) is not an academic journal: it is a resource in which people do the best they can in their spare time to help other people.
Cross-checking competing papers takes time, and might require years of background experience to know all the relevant factors, and to know things like which papers were later refuted. From time to time someone with a lot of deep knowledge in a topic wanders by here and helps out.
But... mostly topic experts do not wander here and help out. That leaves the volunteers with a choice:
A) Leave nearly all the questions here unanswered because we are not the topic experts; or
B) Do some surface-level research of appropriate papers and books, hoping that our S/T/E/M backgrounds are enough to guide us to something useful that we can interpret for the people asking the questions; or
C) Answer based upon our memory, and using past postings of how other people have answered similar questions in the past (people who might not have been topic experts either.)
I often end up answering questions that involve matters outside my topic expertise, including on topics that I may never have heard of before. I would prefer if there were experts on hand on every topic, ready to step in promptly... but those people simply are not available.
So I have a look; and I answer what I can, in the time I have available; to the extent that my health allows.
It is not the best of situations, but unfortunately a lot of the time I am the only help people have. It is, to be frank, a very heavy burden at times.
Ilya
on 15 Apr 2017
Walter, I appreciate this explanation.
I agree that this resource is not an academic journal, and the threshold for posting an answer is much lower than that for a publication. I also note that there are no consistent rules for people who answer on Answers (at least I am not aware of any), and for that reason you can choose any philosophy you like with respect to the quality/thoughtfulness of your answers. Yet I believe that answering questions outside your expertise without doing some verification first is dangerous and often produces plain wrong (not just somewhat incorrect) answers, which is worse than not giving any answer at all. Just like you said - "those people simply are not available", where "those people" means "experts". Because experts are not available, no one is there to refute a wrong answer, and the wrong answer stays on this site forever, serving as a source of confusion and support for similarly misguided future answers.
On my part, I choose to answer only questions for which I consider myself an expert. I doubt that by doing so I fail to provide critical help to people out there. Many people asking questions on this site are students, and they can certainly find other sources of help such as, for instance, their professors. This is especially true for questions such as this one, where the entire discussion revolves around theory and has nothing to do with MATLAB. It's just that submitting a question to Answers takes less effort than scheduling an appointment with faculty, and they resort to this easy way. If they knew the likelihood of getting a plain wrong answer was high, they would likely not resort to this easy way.
I appreciate your desire to help and am not asking you to apply the same level of scrutiny as that for an academic publication. I think though that raising the bar a bit higher would be a positive change toward improving quality, perhaps at the expense of reducing the overall number of answers; I think such a reduction would be acceptable since it would also lead to reduction of plain wrong answers. Also, doing more verification would allow you to learn the material at a deeper level and develop knowledge of new areas. I do not know to what extent you are interested in learning, of course.