Degree-level stats/maths help.. I feel my job is beginning to get beyond my level of understanding

I posted a while ago about worries over my job and they are increasing.
This is a long read, apologies.


I think the role I currently do at work is beyond my level of understanding when it comes to statistics.

The company did say that they had problems finding anyone to meet the three criteria the job required and felt I was most suitable: biological degree / farming knowledge / dealing with large datasets.

My CV was honest and I explained my limitations and strengths in the interview.

Up until now I have mainly been doing/learning VBA from scratch to amalgamate different messy datasets, which is fine. I'm happy and progressing, my line manager is happy, all good.

The second step is stats. The dataset in Excel has variables out to column GA (both independent Xs and dependent Ys).

Basically the company wants to find any underlying patterns in the data. The data is far from perfect: lots of correlated variables, and it wasn't originally produced with this purpose in mind.

The stats package is expansive to say the least and I don't think my stats knowledge is good enough to use it properly. I am also dubious about what has been done before.

i.e. chuck some variables into PCA, press go,
accept what PCA churns out with respect to removing correlated variables,
feed the new variables into model creation,
don't worry about this box, or that one, press run,
run the models (mainly non-linear regression / partition-based trees),
ask whether what comes out is good (R², etc.),
and conclude that this model explains this much variation.
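
To make that concrete, this is roughly what I understand the push-button workflow to amount to, sketched in Python with scikit-learn stand-ins for the JMP steps. The file and column names here are made up, not our actual data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Hypothetical file and column names, not the real dataset
df = pd.read_csv("farm_data.csv")
y = df["yield_kg"]
X = df.drop(columns=["yield_kg"]).select_dtypes("number")
X = X.fillna(X.mean())

# "chuck some variables into PCA, press go": scale, then keep enough
# components to cover 95% of the variance in the Xs
Z = PCA(n_components=0.95, svd_solver="full").fit_transform(
    StandardScaler().fit_transform(X))

# feed the components into a partition-based (tree) regression model
Z_tr, Z_va, y_tr, y_va = train_test_split(Z, y, test_size=0.3, random_state=0)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(Z_tr, y_tr)

# "is what comes out good?": compare training and validation R^2;
# a big gap between the two is the overfitting issue mentioned later
print("train R^2:", r2_score(y_tr, tree.predict(Z_tr)))
print("valid R^2:", r2_score(y_va, tree.predict(Z_va)))
```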

It's not this easy, I know, and herein lies the problem. I know the way I have been introduced to the software is too simple, but doing it properly is hard when there is no simple right answer.

I feel it is beyond me and seems to require a degree, or at least some stats education at degree level beyond what I have done. I'm not sure I can teach myself this level of stats. If it were pure maths, where there is more likely to be a single right answer, I could figure it out, but stats is a different beast.


To sum up:

Basically I'm asking whether I have a chance at this (getting my head around stats properly) without a stats degree, if I try to teach myself what I need?
or
Do I need to start now on after-work training in something else, so that if I am released at some point I have a fallback? I was looking at finance/accounting, on which a forum member has been helpful.
 
I do that kind of stuff all the time, although I code it myself and don't use software packages. I don't have a degree in stats, but I took every damn stats class I could get my hands on and have a background in machine learning, which is basically advanced statistics.

What you are discussing is not stats-degree material though, but stats you might cover in a degree like biology, physics, CS, artificial intelligence, or even a good psychology/social science degree. Statistical techniques are reused and renamed across the sciences: PCA is a fundamental algorithm used in anything from questionnaire factor analysis in psychometrics, to eigenvector-based face recognition in computer vision, to data reduction in LHC datasets in physics.

Some textbooks might help, otherwise some online lecture material. It helps if you can view the data and the maths more abstractly. You might find general textbooks on linear algebra useful for understanding things like correlation matrices and SVD.
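
As a tiny illustration of how those pieces fit together, here is some throwaway numpy on random data: the correlation matrix of standardised data, its eigendecomposition, and the SVD of the data matrix are really the same object viewed in different ways.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)    # make two columns correlated

Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardise the columns
R = (Z.T @ Z) / len(Z)                            # correlation matrix

eigvals, eigvecs = np.linalg.eigh(R)              # eigendecomposition of R
U, s, Vt = np.linalg.svd(Z, full_matrices=False)  # SVD of the data matrix

# The PCA eigenvalues of R equal the squared singular values of Z divided by n
print(np.allclose(np.sort(s**2 / len(Z)), np.sort(eigvals)))
```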

You will go much further when you understand the underlying mathematical principles than by plugging numbers into software. Having said that, this kind of stuff is a bit above and beyond your average workplace stats. If the work involves heavy data processing then your employer probably should have looked more closely for a maths/stats/physics/CS graduate and let them learn any of the underlying biology on the job.

I would take this as a challenge and try to learn how it all works. If you get comfortable going through big datasets then it opens up many career possibilities. A large part of my job is basically statistical inference on large datasets and I get about £65k a year; some of my friends earn 50% more doing the same thing in banking.
 
Most of the work I do in this area is the pre-work: data preparation/exploration and some predictive modelling.

Understanding the relationships between the variables can be pretty simple (chucking the data into various programs, in my case SPSS, and running a basic CHAID model to find the key variables). What's much harder is understanding the data fields and finding out which may be self-fulfilling or unavailable until after the fact (if the intention is prediction).

Before jumping into modelling I'd find out which key variables they want to understand the correlation with (say if a certain number of columns are more important), then play around with pivots across the populations.

See if you can bring in additional reference data to see if that has a stronger correlation with the keys, create new fields (by linking/merging different variables) or even create bands (fixed or percentile). Usually analytical requests beforehand give a good indication as to which fields have the strongest correlation, but without knowing the data it's hard to give more specific advice.
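
If you end up doing any of this outside the stats package, the same sort of exploration looks roughly like this in pandas. The file and column names ("farm_data.csv", "weight", "feed_kg", "location") are invented, just to show the shape of it.

```python
import pandas as pd

df = pd.read_csv("farm_data.csv")

# Correlation of every numeric column with a key output, strongest first
corr_with_key = (df.corr(numeric_only=True)["weight"]
                   .drop("weight").abs().sort_values(ascending=False))
print(corr_with_key.head(10))

# Create bands from a continuous variable: fixed-width or percentile
df["feed_band"] = pd.cut(df["feed_kg"], bins=5)
df["feed_decile"] = pd.qcut(df["feed_kg"], q=10, duplicates="drop")

# Pivot the key output across populations, e.g. location vs feed band
print(pd.pivot_table(df, values="weight", index="location",
                     columns="feed_band", aggfunc="mean", observed=True))
```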

I studied art & music so I don't have any formal qualifications in this area; a degree really isn't necessary for this kind of work (I've been working in the team for 2.5 years now and performing well above average).
 
Thanks for taking the time to read

I was unfortunate in that I do have a molecular biology degree but never did any stats in it at all; I literally never used Minitab or SAS. My stats education comes from A level, which at 28 seems an age ago, and only covered a few distributions.


The program I have access to is SAS JMP Pro 11.

Yes, jumping into the modelling is something I have been encouraged to do, but I don't like it. I'm the sort of person who just can't accept doing something without understanding it, and matrices are something else I have not come across, and one of the things I am trying to learn, literally today.

What I have been given is that they want to look at only a few outputs, which at least helps. What doesn't help is very limited support within the company, due to no one being an expert in this. According to the current person I am approaching being the best in the company at this, which is worrying (big company, plenty of cash).

Currently I am trying to merge/remove obvious columns to reduce duplication and dimensions.
The way I have been shown to reduce the variables going into a model, to stop overfitting (which is likely occurring), is to use PCA/factor analysis and something called cluster analysis, which does not reform the variables like PCA does (although it uses PCA internally), but I don't trust its output.


I have been reading a lot about what PCA/FA/clustering is (as that is what was used in the past, before I joined) to produce the variables for model making, and I somewhat have an idea of how eigenvectors/eigenvalues work, but again, without matrices this is tough.

It produces clusters which I have been told (by SAS and in-house) remove correlated variables. Put var 1, 2, 3, 4, 5, 6, 7 etc. in and out comes:
group 1 - var 1, 4, 6
group 2 - var 2
group 3 - var 3, 5

Use the most representative member from each of these and produce a model.
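
For my own sanity I have been thinking about a transparent stand-in I could check the JMP clustering against. Something like this (not JMP's actual procedure, just a Python sketch assuming X is a numeric DataFrame of the candidate variables): cluster the variables on 1 minus |correlation|, cut the tree, and keep the member most correlated with the rest of its own cluster.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def representatives(X: pd.DataFrame, threshold: float = 0.5) -> list:
    corr = X.corr().abs()
    dist = 1.0 - corr.to_numpy()            # distance between variables
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=threshold, criterion="distance")

    keep = []
    for lab in np.unique(labels):
        members = corr.columns[labels == lab]
        if len(members) == 1:
            keep.append(members[0])
        else:
            # keep the member with the highest mean |correlation| to the others
            sub = corr.loc[members, members]
            keep.append(sub.mean().idxmax())
    return keep

# print(representatives(X))  # roughly one variable kept per correlated group
```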

I can tell the model is often overfitted, as the validation R² is often lower than the training R², which I feel is an issue.

I agree about the prospects; I feel I need to learn the underlying mathematics and, like you say, if I can get to grips with it, it would be quite an interesting prospect for the future. If this was a dead-end job I would be doing something else already. I am not one for just pushing buttons, which is why I just won't accept what I have been told.
I also feel a dedicated maths graduate would have been more appropriate.


I guess I feel a little at sea with all of this. I have done what I am comfortable with and am now at the uncomfortable stage.
SAS JMP appears very complex, with more options than any single user will ever need, and I do not know which are important and which are not.

I don't think it helps that the first dataset I have to learn on is so large and messy.
Work are happy with the speed at which I have got the dataset together by picking up VBA, instead of doing it manually as they envisaged.
 
You’ve expressed quite a few questions in your postings – but let me deal with your confidence first! The great advantage of having gone through some sort of formal training in statistics is that it gives you an overview of how all the tools and techniques fit together; without this overview the subject tends to look like a miscellaneous collection of bits and pieces simply connected by the general theme of modelling uncertainty. However, I know several statisticians who started out in other fields and gained experience by a combination of doing and training. So, I think my first suggestion to you is not to lose heart, but to try and identify some training courses that will help you get started. (I’m assuming your company, given their difficulty in finding suitable candidates, would look kindly on developing the skills of the staff they do have!) If you want some help identifying suitable training, let me know and I’ll see what I can do.

Now, to the particular problem that you’ve described; are you able to say any more about it? It sounds as if you have a dataset with ~150 variables, some of which might be considered as outcome variables and others which may be considered explanatory variables. Are all of these variables observations on units of some sort? If so, how many units do you have? (i.e. we know the dataset goes up to column GA in Excel, but how many rows does it have?) The answer to this simple question may determine restrictions on the types of techniques that you can apply.

You’ve said that your company is interested in finding underlying patterns in the data; this is a bit of a vague objective, and I think a key part of how you approach your task is to refine what the objectives of the analysis are. So, why are they looking for patterns? What sort of patterns are they interested in? How would these patterns be interpreted? What would the non-existence of certain types of patterns mean?

Once you can be specific about the questions, the tools and techniques that you use often become much clearer!
 
That's an interesting question in itself: approaching the company and saying I need help.

Yes, the dataset contains at least 1,000 units and is static at the moment, but it can be added to as time goes on.
These units can be partially grouped: maybe 20 were from one location and another 20 were from another, etc. Location is a variable in itself.

The variables themselves range from percentages to grammes to simple on/off discrete variables.
The outputs are related to farm outputs.

But other datasets I would be given could vary widely from this

More specifically, the outputs (from this dataset) are about maximising yields and efficiencies.
If model A explains B percent of the variation in, say, animal weight, which X variable has a significant impact on this?
Is this variable something we can change in the real world?
If it isn't, at least we are aware of it.

The company chose the software it did partly because of its output profiler, which shows how the output changes as all the variables change, as well as when you change only one single input.
You can tell this profiler to maximise the output variable, for example, and it will show you the best outcome by optimising all the significant inputs.
They liked this as it is visual and a good showcase for external customers.
 
From your description, I think that you’re well into the realms of the tricky. The sort of size of dataset you’re talking about (let’s say 1000 observations on 150 explanatory variables) means that you’ve got to apply some sort of data reduction: the rule of thumb to avoid over-fitting in regression models is roughly 20 observations per included explanatory variable, so with 1000 observations you could justify a model with at most around 50 explanatory variables, well short of 150. But the choice of data reduction technique depends upon what you’re attempting to get out of the analysis.

If you want to get the best predictive model, then you might go for a technique like PCA; but in doing so you tend to lose the ability to interpret the model in terms of the original variables. PCA gives you linear combinations of the original explanatory variables that explain the most variation in those variables, but the combinations may not themselves stand for anything interpretable.

If you want an explanatory model then I’d be tempted (with the number of potential explanatory variables you have) to do the reduction manually based on expert knowledge. That is, if you have sets of highly correlated explanatory variables, pick one from each set based on expert knowledge of the situation. You can do automatic variable selection in this situation (techniques like stepwise variable selection) but these are actually very tricky to get right (there’s been a lot of research in recent years showing how badly these automatic techniques can behave!)

If you’re not able to persuade your employer to get you trained up, then I suggest that you start looking at books like:

Steyerberg: Clinical Prediction Models (an easy overview, but incomplete and focussed on clinical prognostics)
Harrell: Regression Modeling Strategies (very useful, but hard, and a new edition is coming out soon)
Hastie, Tibshirani & Friedman: The Elements of Statistical Learning (a good overview of the techniques available in the field, but light on details; best used as an index into the literature and other textbooks)
 
I can't believe you would have so little knowledge of stats with your degree.

Read up on:

Principal Component Analysis
Fast Fourier Transform
Pearson's product-moment correlation coefficient
 
I used to use Minitab a lot for work (Six Sigma/lean/process improvement) but only really scratched the surface of what it can do, purely because I was interested in capability charts, standard deviation, R² values, x-bar, Cp and Cpk values and basic stuff like that. The rest of the functions are way beyond my understanding, and manipulating the data to make sense of it took me most of my time.

I think you're going about it the right way: understanding the data, rather than just shoving it into a modelling tool and accepting it at face value. You're examining/questioning what you've got in front of you, which is laudable and refreshing.

What I'd suggest is that you ask for help, either from someone you trust within the organisation, by asking to be put on a course, or by asking any mathematical/data-analyst friends you may have to help you with the tasks you're faced with. It might mean breaking it down into smaller datasets and getting your head around how to manipulate them. I'm not very good with stats (other than questioning the quality of the data presented to me by peers/colleagues :p); however, I've built up the experience to know how to interpret data that has been presented.

Can you get any further tuition? Will work offer you support? Can you get external support? Is the data sensitive, or would you be able to get someone from OcUK to help? Have you looked at online/self-taught guides for data modelling/statistical modelling tools?

If you haven't used maths/stats for a while it disappears quickly. Heck, I did electronic and systems engineering, with lots of stats and Fourier series and suchlike, and I don't really remember much of it, but that's because I don't use it. So it's not surprising you've forgotten if you haven't had to use it extensively.
 
@gregorius

This has actually reassured me; I have come to exactly the same conclusion.
I know I don't want too many variables and was aiming for fewer than 15.
I understand that PCA reforming the variables means I lose the ability to select one X to change.
I am at the point of asking which of these variables can be removed/combined.
Of the ones that are left, I'm looking for a valid way to model.

The clustering analysis available in our stats package does this, but I don't trust its output.
It is supposed to achieve the same result as PCA in a different way.
It does not reform the variables, but uses correlation and PCA to put variables into these clusters.
How it does this I am not sure, as it does not always make sense to me. It's not as simple as producing a correlation matrix and putting members into a group based on that, e.g. put all variables in cluster 1 if they correlate above 0.8; the cluster output doesn't match this method (it does, however, with simple datasets).
Someone from SAS is actually coming in next week for a Q&A and this is something I need to ask: how does it work?

@Oulton
Tbh I didn't realise how much I was missing until now. And I'm a little disappointed, to say the least.

@Freefaller
The data is too sensitive for me to show, which is annoying. I am wary about asking for too much help as it may appear as incompetence. It's something I will have to gauge.
I am currently looking into outside help, even if I have to pay. Tbh I don't want to lose this opportunity.

I genuinely believe I can do it, I'm not stupid, but whether I can in time is another issue.

I will definitely check out those books.

Sorry if this reply is a bit scrappy, I'm on my phone atm.
 
I'm beginning to worry a bit about the future of this tbh, both about my ability to fully explain what is desired, and also because I do not think there is anything significant to be extracted from what I have.

We had a statistics tech in from SAS who told us (me and the others involved) that PCA was not at all needed and that the models themselves (for non-linear) and PLS (for linear) would effectively tell you what is correlated and what is not for variable reduction. This turned out to be absolutely not the case, which I half suspected; this was a little disappointing to say the least.

I'm still mainly having problems with sorting what goes into the model with regard to correlated variables, and am really fearing that this extra step is just too much to learn whilst work expect results. PCA sorts the variables well, and I can use linear regression and the VIFs to show that, within the independent variables, when var1 ≈ var2 + var3 + var4 reduction needs to be performed (this happens a lot in the data and it isn't obvious).
We don't want PCA outputs in their native form, which complicates things; we want to be able to say that vars X, Y and Z are correlated enough to just use var Z, because we want to be able to adjust Z to achieve an improvement in the final outcome, knowing it is Z and not a combination of variables mixed into new ones. It also doesn't help that, by convention, the whole dataset still throws up too many PCs for a reliable model.
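
For reference, the VIF check I mean is roughly this (a sketch using statsmodels, assuming X is a DataFrame of the candidate Xs; a VIF much above 10 is the usual rough warning sign that a variable is close to a linear combination of the others).

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    Xc = sm.add_constant(X)  # include an intercept so the VIFs are meaningful
    vifs = {col: variance_inflation_factor(Xc.values, i)
            for i, col in enumerate(Xc.columns) if col != "const"}
    return pd.Series(vifs).sort_values(ascending=False)

# print(vif_table(X))  # e.g. var1 showing a huge VIF when var1 ≈ var2 + var3 + var4
```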

I have also discovered a potential time effect. Tbh it's a horrendous dataset to learn on.

I am thinking I should at least prepare for this not ending well. I can't think of much to do except AAT; I don't think I will have a problem with that.
I will still persist with the stats, as I would love to feel confident enough to do this as a career.
I can't afford to lose this job and have nothing to back me up; it would mean losing more than my job.
 
If it makes you feel better, a lot of maths undergraduates would not cover this type of material, and what you do with datasets of this type is often quite arbitrary anyway.

If I were presented with such data I would do the following:
1) Cluster it with varying numbers of clusters, and use this to detect outliers and delete them.
2) PCA the data, and use it to project the data onto a lower-dimensional subspace.
3) Fit some interpolant/approximating function to your low-dimensional data; then you can optimize that function under some constraints you choose, or do whatever you like with it. If in doubt just choose low-order polynomials or something.
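
Roughly, in Python (a sketch on synthetic data, with arbitrary thresholds and cluster counts, just to show the shape of the three steps):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                       # synthetic predictors
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=1000)

# 1) cluster, flag points far from their cluster centre as outliers, drop them
Xs = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(Xs)
dists = np.linalg.norm(Xs - km.cluster_centers_[km.labels_], axis=1)
keep = dists < np.percentile(dists, 97.5)
Xs, y = Xs[keep], y[keep]

# 2) PCA the data down onto a lower-dimensional subspace
Z = PCA(n_components=5).fit_transform(Xs)

# 3) fit a low-order polynomial to the reduced data, ready to optimize
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(Z, y)
print("R^2 on the reduced data:", model.score(Z, y))
```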
 
Is there any way that you could release some part of the data so that I (or someone else qualified here) could have a quick look at it?

If not, I'm beginning to think that you may need to bring in a consultant under NDA. If the company want the job done and it's important to them, they should look favourably on someone who shows the initiative to do that!
 
@Token
It does and it doesn't. At least it means I would not really be expected to know it, but obviously it doesn't help with the job difficulties.
The bridging between the PCA and the models is the difficult part. As said, we really want not to reform the variables, but to select the best representative of a group of correlated ones. Even using PCA, the dataset does not look nice at this point: many, many PCs with no distinct or obvious place to cut. Cutting at an eigenvalue of 1 leaves more variables than I would like, and doesn't really explain much of the X variation anyway. This is why I feel the dataset isn't brilliant; it has a lot of data on a small area of what affects the Y, imo.

@gregorius
I am beginning to think that too, but wonder where it will leave me.
The guy from SAS was a consultant, and what he said was not really appropriate.
I am wary about releasing the data. Tbh I don't think it's that exciting, but it is fairly protected; I would love to, but can't risk it really.

Thanks dowie, I will enrol on that. Looking at the prerequisites, I haven't done that since A level! I don't really have an exciting home life tbh, so I'll get to work on this as well, as a backup plan.


I made the mistake of mentioning my worries to my gf, which has stressed her out and just makes me feel worse. Lesson learnt on that one.
 
I did a 2-week stats course down in London; it helped me a ****load, giving me a solid groundwork to build up my stats knowledge and get my head around some of the more advanced stuff the postdocs use.
 
What was your prior background?
 

Have you examined any statistical properties of the data? What are the distributions of the variables like? Ultimately, if the data is badly behaved, making sensible models will be complex and the PCA results misleading.

Instead of picking an arbitrary threshold on the eigenvalues, you should construct models with various subsets of the variables and compare performance on a validation dataset. Then I would look for a statistically significant difference in the predictive capability: e.g. if there is no significant difference when adding a tenth variable, then simply exclude it from the model. Depending on the data size and your model complexity it may become computationally infeasible to do an exhaustive search across all variable subsets, but if it takes a couple of weeks of number crunching then it isn't such a big deal.
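
Something along these lines; this is only a sketch using a fixed improvement margin on validation R², where a proper version would use a significance test or cross-validation, and X is assumed to be a numpy array of candidate predictors with y the outcome.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

def greedy_subset(X, y, candidate_order, margin=0.01, random_state=0):
    # Hold out a validation set, then add candidate variables one at a time,
    # keeping a variable only if it improves validation R^2 by the margin.
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                              random_state=random_state)
    chosen, best = [], -np.inf
    for j in candidate_order:
        trial = chosen + [j]
        score = LinearRegression().fit(X_tr[:, trial], y_tr).score(X_va[:, trial], y_va)
        if score > best + margin:
            chosen, best = trial, score
    return chosen, best

# chosen, r2 = greedy_subset(X, y, candidate_order=range(X.shape[1]))
```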


I don't know what you are doing to create the models, but in theory a good automatic model-creation methodology won't see much advantage in using PCA to trim the data first; the model creation will find its own variables that perform well. Furthermore, PCA is linear, but you might have non-linear relationships between variables. A simple example: you want a model to predict the weight of a person and you have two predictor variables, age and height. Both will correlate with weight, and with each other, but each has different predictive capability. Age alone is going to be a strong predictor of weight in children, since height correlates strongly with age there, but in adults age won't be a big factor and height will. Really, you can have arbitrarily complex relationships between variables that PCA will not easily highlight, but a complex model-learning method might be able to pick up on.
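
A toy illustration with made-up numbers: age predicts weight well for children but not for adults, so a single linear fit on age does worse than a partition-based model that can split on age.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
age = rng.uniform(2, 60, size=2000)
# height grows with age up to ~18, then plateaus; weight follows height
height = np.where(age < 18, 80 + 5 * age, 170) + rng.normal(0, 5, size=2000)
weight = 0.5 * height - 60 + rng.normal(0, 4, size=2000)

X = age.reshape(-1, 1)
print("linear R^2:", LinearRegression().fit(X, weight).score(X, weight))
print("tree   R^2:", DecisionTreeRegressor(max_depth=4).fit(X, weight).score(X, weight))
```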

I mostly use things like neural nets, radial basis function networks, SVMs etc., and I tend to throw all the data at my model-learning procedures (various stochastic and meta-heuristic approaches) and let the model-creation process find out about variable relationships and correlations. With a model in place you can then do further examinations, e.g. pruning variables from the model and examining performance.
 
I've read half the thread and you appear to have got useful and constructive help, so this may be useless to you or you may have already done it, but I put "SAS JMP Pro 11" into a YouTube search and got plenty of hits. It may help you, or simply exhibit my ignorance :o :D
 

Whether the eigenvalues are above or below one is neither here nor there unless the data is normalized properly.

Divide them by their total sum (so they sum to one) and take the n biggest ones, in order, until they add up to, say, 0.95. Now you know you can roughly represent your data with n components.

If n is more than about a third of the total number of eigenvalues, PCA is wasting your time.
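
In code that's roughly (plain numpy, with made-up eigenvalues):

```python
import numpy as np

def n_components_for(eigenvalues, target=0.95):
    # Normalise the eigenvalues so they sum to one, sort largest first,
    # and count how many are needed to reach the target fraction of variance.
    frac = np.sort(eigenvalues)[::-1] / np.sum(eigenvalues)
    return int(np.searchsorted(np.cumsum(frac), target) + 1)

eigvals = np.array([5.2, 2.1, 1.0, 0.4, 0.2, 0.1])
n = n_components_for(eigvals)
print(n, "components cover 95% of the variance out of", len(eigvals))
# If n is more than about a third of len(eigvals), PCA isn't buying you much.
```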

At that point the standard thing is to choose the new "pseudo-basis" as the set of PC vectors, but you aren't obligated to do so. You can select a new set of vectors how you like.

What you should remember is that the PC vectors are just, in some sense, n linear equations, and you can write those equations in infinitely many arbitrary ways.

What you could do is manually examine the PC vectors and choose a small number of variables k < n which you think are representative. Then just apply Gram-Schmidt to partially orthogonalize with respect to those variables. Then you can span the same subspace that the PC components give you with a different basis.

This seems a pointless endeavour to me as the model will be the same, just in disguise, but it might fool an idiot upstairs who doesn't know what he is talking about.

The alternative is to chuck out PCA completely, build the basis how you like and find the covariance of it with respect to your data. It is trial and error but if you have a good intuition for your data you might be able to cook up something useful.
 