Very large sample of English needed.

Hi all,

I'm after a very large sample of English, on the scale of 50,000+ words. It needs to be English, and preferably UK English as opposed to American, because they can't spell and use too many Z's.
The reason is that I need to calculate the letter frequencies for the English language, and the result needs to be accurate. Ideally the sample would be a .txt file, or something I can copy into a .txt file, for reading with the program I've created.
Any help would be appreciated. Cheers
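
For anyone wanting to try the same thing, here is a minimal sketch of the counting step, assuming the sample has been saved as a UTF-8 plain-text file. The function names and the file name `sample.txt` are made up for illustration and are not taken from the poster's program. It counts only the 26 letters a-z, ignoring case, digits and punctuation.

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Return {letter: relative frequency} for a-z, ignoring case and non-letters."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        return {}  # no letters at all, so there is nothing to normalise
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}


def letter_frequencies_from_file(path):
    """Read a text file (assumed to be UTF-8) and return its letter frequencies."""
    with open(path, encoding="utf-8") as handle:
        return letter_frequencies(handle.read())
```

Calling `letter_frequencies_from_file("sample.txt")` and sorting the result by value gives the usual most-to-least-common ordering for that particular sample.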
 
Why not just browse online newspapers, and copy and paste the articles into Word to get a word count? I doubt many people will have theses lying around (and even then they won't all necessarily be that big). By doing the newspaper thing you get a different selection of writers. There are not many large blocks of text like that around in day-to-day life.
 
Well, I have my thesis which comes in at around 55k words...

To be honest though, you would be better off looking at a wider range of material, rather than just a single large body of text. A single piece of scientific literature might not be the best thing to base the estimate on.
 
Cheers guys. I'm well aware of what the frequencies are and their order, but that is all from data produced by someone else. One of the aims is to produce this myself, including digraphs and trigraphs. The Gutenberg website could be useful though.
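
Digraphs and trigraphs can be counted in much the same way as single letters. A rough sketch follows, with a hypothetical function name `ngram_counts`. It assumes that sequences should only be counted inside words, so nothing spans a space or punctuation mark, and that an apostrophe splits a word in two. Other conventions are just as defensible.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count n-letter sequences inside words; sequences never span spaces or punctuation."""
    counts = Counter()
    # Assumption: a "word" is any unbroken run of a-z, so "it's" becomes "it" and "s".
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts
```

`ngram_counts(text, 2).most_common(10)` would then list the ten most frequent digraphs in the sample, and `n=3` does the same for trigraphs.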
 
You do know that the use of Z in British English is accepted, and is widely (and correctly) used? Indeed, the OED uses 'z' in place of 's' in words such as 'organize'.
It's a particular pet hate of mine that people think it is an Americanization when z is used, when it is in fact appropriate use of British English.
 
I would say look for brute-force password lists or a password list generator. That's all I'm going to say on the matter, but they are normal .txt files with a single word per line.
 
I've looked at those lists and the problem is they don't use standard English, so wrxzt would be counted as a word, which isn't what I'm after. I've now created a file from a few theses and other books from the Gutenberg site. This way it covers varying authors and styles.
Cheers for all the help.
 
Gutenberg plus the works of Dickens should be OK, I'd have thought. HG Wells might be another good one, or alternatively get some epic translated works like War and Peace.
 
Just remember, if this is for an academic project of some kind, sometimes it is best to almost deliberately have a weakness in your sample set such that you can point out the weakness and say that in future you would do x, y, z to rectify. :o
 
Surely you'd need to qualify your sample text: letter frequency will vary over the years (well, maybe decades), regionally, with the type of writing and even with the subject matter...
If, for example, you pulled the whole message text from the OCUK forums, I'd imagine you'd get a different profile than from the complete works of Shakespeare, a couple of copies of the Sun or all the Harry Potter books...
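
One way to put a number on how different two samples are is to compare their letter profiles directly. The sketch below uses half the sum of absolute frequency differences (the total variation distance). That is just one possible measure, chosen here for simplicity, and the function names are hypothetical.

```python
from collections import Counter
import string


def letter_profile(text):
    """Relative frequency of each letter a-z (all zeros if the text has no letters)."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in string.ascii_lowercase}


def profile_distance(text_a, text_b):
    """Half the sum of absolute frequency differences between two non-empty samples.

    0 means identical profiles; 1 means the samples share no letters at all.
    """
    a, b = letter_profile(text_a), letter_profile(text_b)
    return 0.5 * sum(abs(a[c] - b[c]) for c in string.ascii_lowercase)
```

Running this on, say, a newspaper sample against a Gutenberg novel would show whether the choice of source shifts the frequencies enough to matter for the project.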
 