Data mining functions

Hi, unsure where to place this thread, so I put it in General Discussion.

I'm looking for some information on data mining functions. I've been searching for a while and came across nothing helpful. I'm doing an assignment that basically says I have to explain the purpose of data mining, which I guess means what data mining is, and then I also have to explain the functions of data mining, but I can't find any proper info on it.



Thanks.
Willz.
 
sorry, that was incredibly unhelpful.

the Wikipedia articles on this sort of thing are normally pretty comprehensive and accurate. Are you talking about data mining from an ethical standpoint, or a spammy standpoint?
 
What context are you referring to here? Data mining is a pretty broad term, and I'd expect that if you're doing a course along these lines you've had some preamble about the history of data mining / data analysis.
 
Is this in the context of factor analysis, principal component analysis (PCA), information/data extraction, machine learning, or prediction?
 
I don't know, I'm finding it hard to focus at the moment, sat on a helpdesk in a college learning center with hundreds of people for some sort of 'work experience'.

I am looking at the IVA feedback sheet we have, which basically shows us what we have to do. Don't know if you've heard of IVA; it's a sort of assignment due in at Easter.

This is what we've been told to do:

13P4 understand the principles of SQL

Task 13/3 Information sheet or presentation:
● Principles of SQL
● Structure of SQL (DML, DDL, DCL)


13P5 describe the function and purpose of data mining.
Task 13/3 Information sheet or presentation as above including:
● Data mining function
● Data mining purpose

I've done most of it; it's just data mining and DCL, well basically SQL, that I'm having difficulty finding the right information on. Can't ask my teacher (who BTW is not usually that helpful anyway).
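For the DML/DDL/DCL part of the brief, here's a toy sketch using Python's built-in sqlite3 module (the table and data are made up for illustration). SQLite only supports DDL and DML, so the DCL statements are shown as comments — you'd run those on a server database like MySQL or PostgreSQL.

```python
import sqlite3

# In-memory database, just to demonstrate the SQL sub-languages
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL (Data Definition Language): defines the structure of the database
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# DML (Data Manipulation Language): works with the rows themselves
cur.execute("INSERT INTO customers (name) VALUES (?)", ("Willz",))
cur.execute("UPDATE customers SET name = 'Will' WHERE id = 1")
rows = cur.execute("SELECT id, name FROM customers").fetchall()
print(rows)  # [(1, 'Will')]

# DCL (Data Control Language): permissions — not supported by SQLite,
# but on a server RDBMS it looks like:
#   GRANT SELECT ON customers TO some_user;
#   REVOKE SELECT ON customers FROM some_user;
conn.close()
```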
 
DM is a very diverse subject area covering lots of different concepts. The basic overall reason it exists, however, is to find ways to automatically extract meaning from something that inherently has little initial meaning: data.

I'm not going to reiterate what others have said, but one very interesting piece of software is WEKA, developed by a small team at Waikato University in NZ.

It's essentially a free collection of the most common data-mining methodologies, all wrapped up in a reasonable Java GUI. You can download it here: http://www.cs.waikato.ac.nz/ml/weka/ They also have some quite nice in-depth documentation.

If nothing else, it will at least give you the names of some of the more common algorithms out there (such as OneR).
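To give a feel for how simple some of these algorithms are, here's a toy sketch of OneR on a made-up weather-style dataset (the data and column names are invented for illustration, not from WEKA):

```python
from collections import Counter

def one_r(rows, target):
    """Minimal OneR sketch: for each attribute, build a one-level rule
    set (predict the most frequent class for each attribute value) and
    keep the attribute whose rules make the fewest training errors."""
    attrs = [a for a in rows[0] if a != target]
    best = None
    for attr in attrs:
        # Count the classes seen for each value of this attribute
        counts = {}
        for row in rows:
            counts.setdefault(row[attr], Counter())[row[target]] += 1
        # The rule predicts the majority class for each value
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Made-up training data
data = [
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
    {"outlook": "rainy",    "windy": "no",  "play": "yes"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
]
attr, rule, errors = one_r(data, "play")
print(attr, rule, errors)
```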

Enjoy :)
 
Some typical data mining techniques:

Segmentation / Clustering - Using a stats package such as SPSS or SAS to cluster data into similar groups (normally groups of customers). Pioneered by Tesco / Don Humby in the 90s; the famous example was when Tesco segmented their Clubcard customers into different groups and analysed what they were spending their money on. They realised that their "mother and baby" group weren't spending enough on baby products (as they didn't trust cheapo Tesco for stuff for their babies) and were shopping at places like Boots instead. Tesco spent a lot of money marketing to these customers to convince them that their baby stuff was OK and managed to steal a lot of custom from Boots.

The most common clustering technique is K-Means, which is popular as it can handle processing large chunks of data. There is probably a technical explanation of how it works on Wikipedia (something to do with establishing centroids and then sending each data point to its nearest centroid).
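That centroid idea is simple enough to sketch in a few lines of plain Python. This is a toy version on two made-up customer "blobs" (e.g. spend vs visits), not production code:

```python
import random

def k_means(points, k, iters=20, seed=0):
    """Minimal 2-D k-means sketch: pick k starting centroids, then
    alternate between (1) assigning each point to its nearest centroid
    and (2) moving each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Squared Euclidean distance to each centroid
            nearest = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                                + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        for i, c in enumerate(clusters):
            if c:  # keep the old centroid if its cluster emptied out
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids, clusters

# Two obvious groups of customers (made-up data)
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))  # [(1.5, 1.5), (8.5, 8.5)]
```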

Decision Trees / CHAID / CART - This is a way of linking independent variables to a dependent variable (similar to regression models) and works by running lots of t-tests. The output will then give you a tree diagram that shows which of the independent variables is the biggest predictor of the value of the dependent variable (and also what the natural cut-off is for the independent variable in question). An example application of this would be spotting users who are likely to default on loan repayments. You would feed in independent variables such as number of payments missed, age of customer, etc., and the resultant model would "score" each of your customers on how likely they are to default (so the model would be established on a known group of older customers that you already know whether they defaulted or not, and then applied to a newer group of customers to decide which are in the high-risk category).
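The "natural cut-off" search is easy to see with a one-variable toy example (the data is made up; a real CHAID/CART tree searches over many variables and recurses on each branch):

```python
def best_stump(values, labels):
    """Minimal decision-stump sketch: try every cut-off on one
    independent variable and keep the one that best separates the
    known defaulters (label 1) from the non-defaulters (label 0)."""
    best_cut, best_correct = None, -1
    for cut in sorted(set(values)):
        # Rule: predict "will default" when value >= cut
        correct = sum((v >= cut) == bool(y) for v, y in zip(values, labels))
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut, best_correct

# Made-up training data: number of missed payments, did they default?
missed  = [0, 0, 1, 1, 2, 3, 4, 5]
default = [0, 0, 0, 0, 1, 1, 1, 1]
cut, correct = best_stump(missed, default)
print(cut, correct)  # cut-off of 2 classifies all 8 known customers correctly
```

New customers would then be "scored" by the same rule: missed payments >= 2 puts them in the high-risk group.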

Linear Regression / Binary Regression - Similar to decision trees (linking independent variables to a dependent variable) but works by running "least squares regression" tests for all of the variables rather than t-tests. The application and output are almost identical to decision trees (linear regression is when you are predicting a scale variable such as "value of customer", and the binary version is when the dependent variable is a yes/no, e.g. "will the customer default on the loan?").
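For the one-variable case, least squares has a closed form that fits in a few lines. A toy sketch on made-up data (age vs yearly spend, i.e. a scale dependent variable):

```python
def least_squares(xs, ys):
    """Minimal simple-linear-regression sketch: fit y = a + b*x by
    ordinary least squares, using b = cov(x, y) / var(x) and
    a = mean(y) - b * mean(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Made-up data: customer age vs yearly spend
age   = [20, 30, 40, 50, 60]
spend = [100, 150, 200, 250, 300]
a, b = least_squares(age, spend)
print(a, b)  # intercept 0.0, slope 5.0 — predicted spend = 5 * age
```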
 

Thanks a bunch for that :).

Did you write all that or did you get it from somewhere? If you got it from somewhere, please could you provide a link for the reference, as I'll get some notes from it and reference it.

Thanks.
Willz.

Thanks all of you for the help, really appreciate it.
 
Woah. Be careful there. I was wrong about CHAID, it uses Chi-Squared tests, not t-tests. And it's "Dunnhumby" not "Don Humby". I think the rest of it is pretty much correct though.

Links:

K-Means Clustering

http://en.wikipedia.org/wiki/K-means

Article on Dunnhumby

http://people.ischool.berkeley.edu/...h07/Lectures/Lock-in/Articles/club-cards.html

CHAID

http://en.wikipedia.org/wiki/CHAID

Regression

http://en.wikipedia.org/wiki/Multiple_regression

Ok, well thanks for the help, all I really need is about 1 or 2 sentences on it :p
 
Ah OK. Thought you had to write a whole essay.

OK, well, mention clustering (K-Means, two-step and hierarchical), multiple linear regression, and decision trees (CHAID and CART). Oh yeah, and there's also a technique called scorecard, which is done using a package called Statistica. I don't know a lot about that though.

That's all the data-mining techniques that I've ever seen used professionally (there are other bespoke techniques that get touted around here and there by consultants, but they are mostly just refinements or combinations of the above).

Fred
 
Hmm, it appears I have to do 2 pages on data mining :eek:

Some person on the helpdesk with me printed off 45 pages of it, but I can't make notes from it, too complicated. I think I've actually found about 6 pages I can get some notes from.
 
If you're looking for applications of data mining, maybe you ought to look at data warehousing. Marketing is a big user of data mining, btw.
 