Supervised/Unsupervised learning (Data Mining) help

sports_brah · 5 May 2014 at 22:26

https://www.mediafire.com/?o8iyec169tutf7x

As per the attached link (for anyone else interested), I was wondering if someone could give me some more information on this, with reference to data mining.

I'm struggling to get any real depth of knowledge on this. I know it supports methods such as prediction, clustering, etc., but I'm in dire need of some pertinent examples/explanations.

Any advice?

(I realise this is wonderful stuff to be talking about on a Bank holiday!)

dowie · 5 May 2014 at 22:55

How much time do you have?

Andrew Ng's lectures are a good intro

http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1

The following book will give you more detail - can be downloaded as a free pdf

http://statweb.stanford.edu/~tibs/ElemStatLearn/

Also check out some of the videos on this YouTube channel:

http://m.youtube.com/user/mathematicalmonk

Tuppy_Glossop · 5 May 2014 at 23:02

It isn't a great description at all.

In supervised learning you have a training data set in which you are able to give the learning algorithm a label or category or other form of result for each record in the set. The algorithm then maps the input values of each record in the training data to the output labels/categories/results (according to some type of distance measurement) and creates decision boundaries that are used to determine which category any future record ought to belong to.

In unsupervised learning there is no a priori classification or labelling of the training data, and the learning algorithm partitions the output data according to an optimality criterion derived from the values of the output data itself (i.e. it is data-driven).

Example one (supervised): we have measured the height and weight (input variables) of a class of kids and calculated their BMI value. The BMI value can, of course, be classified as obese, overweight, etc. We do this for each record in the training data (using overweight/obese etc. as the output variable... in reality they will be integer numbers, not words) and run our learning algorithm. We should end up with decision boundaries (inside the model) that delineate the different categories we used. Take the model and present it with the same measurements from other kids (i.e. test data) and it should be able to tell which category they are in. In essence it is learning the BMI equation and the boundaries on a BMI chart. How good it is depends on using the right distance measure, enough (but not too much) training data, and an appropriate learning algorithm with suitable parameters.
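A minimal sketch of example one, under stated assumptions: the learner is a 1-nearest-neighbour classifier (one possible choice, not necessarily what a course would use), the labels come from the usual adult BMI cut-offs of 18.5, 25 and 30 (real children's charts depend on age and sex), and the records are synthetic. All function names are hypothetical.

```python
import random

def bmi(height_m, weight_kg):
    return weight_kg / height_m ** 2

def label(height_m, weight_kg):
    """Teacher's label: 0=underweight, 1=normal, 2=overweight, 3=obese.
    Cut-offs are the adult ones, assumed here purely for illustration."""
    value = bmi(height_m, weight_kg)
    if value < 18.5:
        return 0
    if value < 25:
        return 1
    if value < 30:
        return 2
    return 3

def make_records(n, rng):
    """Synthetic (inputs, label) pairs; the learner never sees the BMI formula."""
    records = []
    for _ in range(n):
        h = rng.uniform(1.2, 1.9)   # height in metres
        w = rng.uniform(25, 110)    # weight in kilograms
        records.append(((h, w), label(h, w)))
    return records

def predict(train, point):
    """1-nearest neighbour: copy the label of the closest training record.
    Inputs are divided by their ranges (0.7 m, 85 kg) so neither dominates;
    the squared distance is enough because only the ranking matters."""
    def dist(a, b):
        return ((a[0] - b[0]) / 0.7) ** 2 + ((a[1] - b[1]) / 85) ** 2
    nearest = min(train, key=lambda rec: dist(rec[0], point))
    return nearest[1]

rng = random.Random(0)
train = make_records(2000, rng)   # labelled training data
test = make_records(200, rng)     # "other kids" the model has not seen

correct = sum(predict(train, x) == y for x, y in test)
accuracy = correct / len(test)
print(f"accuracy on unseen records: {accuracy:.2f}")
```

The mistakes this model makes sit close to the category boundaries, which is where the choice of distance measure and the amount of training data matter most.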

Example two (unsupervised): Using the same training data as example one, we would skip the labelling and use the BMI value as the output value. During training, the model attempts to discover structure in the input and output data. For our BMI data, the number of output classes the model finds will depend on the method (and parameters) used and the data itself (it is tempting to imagine that if there were no clinically obese kids the model might find only four classes, but with unsupervised learning it is possible that only two are found, or even ten...). When presented with the data of the other kids (test data), the model will separate them into the categories it found, according to the decision boundaries it learned.
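A rough sketch of example two, assuming k-means clustering (Lloyd's algorithm) on the BMI values alone. k-means is only one unsupervised method, and it takes the number of clusters k as a parameter, which is one concrete way the number of classes found depends on "the method (and parameters)". The data are synthetic and the function names are hypothetical.

```python
import random

def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm on 1-D data; returns the k cluster centres, sorted."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centres across the sorted data.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return sorted(centres)

def assign(centres, value):
    """Index of the learned cluster a new BMI value falls into."""
    return min(range(len(centres)), key=lambda i: abs(value - centres[i]))

rng = random.Random(1)
# Unlabelled BMI values; no obese/overweight tags are given to the algorithm.
bmi_values = ([rng.gauss(17, 1.0) for _ in range(100)]
              + [rng.gauss(22, 1.0) for _ in range(100)]
              + [rng.gauss(32, 1.5) for _ in range(100)])

# The same data yields two, three or five "classes" depending on k.
for k in (2, 3, 5):
    print(k, [round(c, 1) for c in kmeans_1d(bmi_values, k)])

centres = kmeans_1d(bmi_values, 3)
print("cluster for a new kid with BMI 31:", assign(centres, 31.0))
```

Note that the clusters carry no names: the algorithm returns groups 0, 1, 2, and deciding that one of them means "obese" is an interpretation added afterwards.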

Supervised learning has known outputs such that, for any given predicted output, a distance measure can be used to decide if the prediction is right or wrong (and often, how wrong). Unsupervised learning has no a priori knowledge of structure within the data.
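As a small illustration of "right or wrong (and often, how wrong)", here are two possible error measures, assuming the categories are coded as ordered integers (0=underweight up to 3=obese). The function names and the sample predictions are made up for the example.

```python
def zero_one_error(predicted, actual):
    """Fraction of predictions that are simply wrong."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)

def mean_absolute_error(predicted, actual):
    """Average number of categories each prediction is off by (how wrong)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

actual = [0, 1, 1, 2, 3]       # known outputs from the labelled data
predicted = [0, 1, 2, 2, 0]    # hypothetical model outputs

print(zero_one_error(predicted, actual))
print(mean_absolute_error(predicted, actual))
```

Both wrong predictions count equally under the first measure, while the second penalises calling an obese child underweight three times as heavily as confusing normal with overweight.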

A great book is Pattern Classification by Duda, Hart, and Stork.

sports_brah · 5 May 2014 at 23:31

Thanks for the links, appreciate it. My data allowance won't let me check out YouTube until tomorrow, when I've got alternative internet.

My exam is on Wednesday, so I'm kind of limited....

*brain explodes*

Thanks very much for taking the time to post this, I'll (try to) digest this before bed tonight.

dowie · 5 May 2014 at 23:44

sports_brah said:
My exam is on Wednesday, so I'm kind of limited....

Yikes... the best thing to do then is get hold of all available past papers and just work through them. Questions (or variants of them) often get recycled, so even if you're in a massive hole, simply working through each paper slowly and writing out the answers could be enough for you to tackle similar questions in the exam.

sports_brah · 5 May 2014 at 23:50

Here's a sample question.

I feel like with a day of work tomorrow I can get a brief overview. (This is for a Business Intelligence module, so it is only a small topic within a vast subject, hence the limited time I can spend on it.)

The problem is that for the amount of marks each question is worth, I'm struggling to get real depth down.

sports_brah · 6 May 2014 at 10:35

Any other advice?

Appreciate it!
