Statistical Analysis :S

Soldato
Joined
20 Oct 2004
Posts
13,144
Location
Nottingham
Help me OCUK, you're my only hope...

A year or so back I asked you chaps to undertake a survey for some stuff I was doing with work / uni and you very helpfully did and provided me with really good data. Anyway I didn't finish that piece of work due to work/family (children being born)/covid, general real life stuff that gets in the way etc, Until now!

I've got to do some form of data analysis on this but its not in anyway something I understand (Im in Architecture). I have an excel spread sheet which I can take very basic visual graphs and pie charts from but I am wondering if there's a better way to do it given in quite a varied data set.

For reference it related to physical movement around your working office environment and sedentary behaviour within the office and thus each participant provided quite a lot of data which is stored in each cell (example below).

Basic Example;
Q. What physical activities outside of work do you participate in?
A1. Gym, Walking, House work, Gardening
A2. Badminton, Walking
A3. Cycling
...
A100. Running, Gym, Cycling, Walking, House work

Can anyone offer any ideas as to how I can analyse this? I have enabled the analysis plug-in for excel but its beyond me really as to what its asking lol.
 
What do you want to show with the data? Frequency of a particular activity? Activity trends versus participant demographic? Ideally you need to break down the free text and group it into sensible categories - sedentary, light and heavy exercise for example?

There’s nothing at all wrong with a basic graph or figure if it demonstrates your data wel.
 
Frequency of particular answer and correlation between say age sets or sex ect.

I agree I need to break this down a bit more into something more useable
 
You will need to do some data preparation, and I don't think excel will be a good place for this.

some very simple python code would let you split the text into tokens, you could then map them to particular categories like tom said.


Sre these text strings likely to be coherent, i.e. no spelling errors or alternative names for the ssme thing (walk vs walking)?
 
You will need to do some data preparation, and I don't think excel will be a good place for this.

some very simple python code would let you split the text into tokens, you could then map them to particular categories like tom said.


Are these text strings likely to be coherent, i.e. no spelling errors or alternative names for the same thing (walk vs walking)?

you lost me at python sadly. Its not a massive data set, its spread over 2 surveys with 62 and 100 participants and the answers were for the most part multiple choice but some anomalies crept in because of the use of an "other" field in some questions.
 
is the input data in a standard format? you mention it's all in a single sell is this a long string of text you have to sift through?

i'm guessing you want the data sorted into a numerical form, so you can list each question and have a number associated, for example:
candidate: walking: cycling: badminton:
1 0 1 0
2 1 0 0
3 1 1 0

edit: ocuk doesn't like a bunch of spaces but i hope you get the idea

as others have mentioned sorting through text strings isn't excel's strongest point, however you might be able to get around it (probably not as gracefully as could be done elsewhere though)

lets say if you know that if a candidate said they played badminton that somewhere in their response would be the phrase "badminton" (or at least a phrase that's always spelled the same way if they answered badminton) then you could have a formula that looks for that phrase and returns 1 or 0 (or true/false) if it finds/doesn't find that word.
 
If the text answers are all separated by commas then you might be able to use 'text to columns' to get them all as individual activities, one per cell. From there easiest thing I can think of would be copy and paste then all into a single column as a reference.

Keeping it as simple as I can, I'd copy that somewhere else and remove duplicates, sort alphabetically and then do a 'countif' on the reference column with each cell in the deduped column as the criteria. That would give you the number of people that responded with each activity, as each count in the reference column has come from an individual person. Might need to do some manual work to combine like categories after the count before turning it into a chart (e.g. if you have 5 'walk's and 10 'walking's you really want to have a single category of 15).

Hope that's not too hard to follow. Should be plenty of examples of how to use 'countif' online if you're unfamiliar with it. Good luck!
 
Can you post a link to the data set on here - if not confidential in any way ?

Really hard to know how to approach this without a clearer idea of the data.
 
Can you post a link to the data set on here - if not confidential in any way ?

Really hard to know how to approach this without a clearer idea of the data.

LINK

I've stripped out names and any personally identifying data, but that is what it looks like in its current state. The other one is the same but with more people and slightly different questions.
 
Do you have access to IBM SPSS?

Also, rather that correlation, you're probably looking for statistically significant differences - i.e. "There was a statistically significant difference in activity intensity between the males and females (F(1,23)=33.07, p<.05) with the male activity intensity (µMAI=106.40s) being significantly higher than the female activity intensity (µFAI=210.90s).

There are various tests to run to look for significant differences - from simple t-tests for a pair of means to one-way and two-way ANOVA etc.
 
Last edited by a moderator:
This is going to be an evening with a large glass of wine, some music and digging through your spreadsheet to sort the data out.

Python, R and SPSS are going to overcomplicate matters at this stage.
 
Do you have access to IBM SPSS?

Also, rather that correlation, you're probably looking for statistically significant differences - i.e. "There was a statistically significant difference in activity intensity between the males and females (F(1,23)=33.07, p<.05) with the male activity intensity (µMAI=106.40s) being significantly higher than the female activity intensity (µFAI=210.90s).

There are various tests to run to look for significant differences - from simple t-tests for a pair of means to one-way and two-way ANOVA etc.

No access to IBM SPSS no, I am very green with statistics as its way outside of my field.

This is going to be an evening with a large glass of wine, some music and digging through your spreadsheet to sort the data out.

Python, R and SPSS are going to overcomplicate matters at this stage.

I think you are probably correct, how would you sort the information out? (sorry)

Am I the only one that read that as sex acts?
Pervert :P
 
Did you get the answer choices grouped in the way you've shown above? Cycling shows in both A3 and A100?

Are those the choices respodents gave, or have they been grouped?

Ah, it was a "tick all that apply" thing.

OK,
1 break your answer up using text to columns in excel
2 then use COUNTIF to search the broken down answers to that question and get numbers for "cycling" etc in new columns at the end. Use "1" as the return value for the countif. That'll let you analyse what you had for each type.

Then count up how many individual entries you have on each row.

Decide how you want to analyse that- just fir the individual entries, or whether the correlations for multiple choices are significant, eg "I found cyclists were most likely to select multiple activity types, though this was less significant in women under 30" or whatever. Otherwise you're just counting stuff...

Choice of graph should be last thing you do, once your data is clean. Bar charts are easy to read. Pie charts don't work well where you have a lot of types, or small differences between types.
 
Last edited:
Did you get the answer choices grouped in the way you've shown above? Cycling shows in both A3 and A100?

Are those the choices respodents gave, or have they been grouped?

Not sure I understand, cell A3 is Male and A100 is blank.
 
LINK

I've stripped out names and any personally identifying data, but that is what it looks like in its current state. The other one is the same but with more people and slightly different questions.

looks pretty feasable for many of the columns.

had a go just with a few options (age, gender, activity):
https://drive.google.com/file/d/1Qf81s-v_NiOj7Y8KC97YJJrh9N37ypRp/view?usp=sharing

not the prettiest method, but it's just an example of how you could go splitting up the strings of text into something that you can make comparisons with.
 
I think you are probably correct, how would you sort the information out? (sorry)

For a dataset this size you’re probably just as well doing it manually. As long as you’re clear what information you want out you can sort the data into more meaningful groups.
 
Back
Top Bottom