Storing data in a database VS .txt file

Hi,

I am creating a program which is constantly saving/accessing data.

At present I am using simple .txt files to save this data.

However, I would like to know what advantages I would gain by setting up a database to save data to and retrieve data from.

Speed is VERY important to me.

Would it be worth me changing from a .txt file data storage system to a full blown database?

Thanks
 
I'm using VB.NET to program my software.
Would it not be easier for me to use the inbuilt database offered in Visual Studio?
 
I've been reading around.

I like the look of MS SQL Express, which is free. The only problem is that it has a 10GB size limit...which is quite frankly laughable.

I fully expect the size of the database to exceed 100GB...easily. The server (program) is going to be directly connected to the database (initially, both will be run on the same machine). At no stage will the client connect directly to the database...only the server will ever connect to the database itself.

What is the maximum database size which SQLite can handle?

I'm thinking of using SQLite, with a GUI for it, to create the tables. I shall then use the server (programmed in .NET) to connect to the database and read from and write to the database file. Does this sound like a decent plan?
 
At this point, the data contained in the database shall be all text. Lots and lots of text.

In time (may be in a few years), I plan to be storing other file types (most certainly binary).

I'm creating a natural language processing AI, which is able to learn English natural language. To give you an idea of the amount of text which can be generated...in a 10 minute conversation with the AI, it can quite easily generate text files over 2MB in size. This is with a single client, storing/processing text only. No images yet.

The plan is to have 100s of 1000s of clients hooked up to this thing. 100GB will be filled with ease. I shall also be creating an algorithm which discards the least used data, so that only the top x number of files shall be stored on the database.
 
Be careful that you are only storing data you need, and that it is carefully designed; otherwise you will find database operations hugely expensive.

Yep. During testing, I shall iron out any issues I have with slowness. I am currently storing all data in multiple text files. The most regularly used info is cached in RAM. I spent a month developing the caching algorithm and it works beautifully.

It was only yesterday, while talking to someone, they suggested that I use a database.

What I like about using text files (instead of a database) is that it gives me a lot of flexibility. I can use threads to read each file...e.g. if I have 100 text files of 100MB each, I can have 100 separate threads all reading data at the same time. In practice, I have got this working nicely. However, I was wondering whether a formal database might increase speed further, especially when I start reaching very large file sizes.
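As a rough sketch of that pattern (in Python rather than VB.NET, and with hypothetical file names), one worker thread per file:

```python
from concurrent.futures import ThreadPoolExecutor

def read_file(path):
    # Each worker reads one file independently; no shared state, so no locks needed.
    with open(path, encoding="utf-8") as f:
        return path, f.read()

def read_all(paths, workers=100):
    # One thread per file (capped at `workers`); the reads block on disk I/O,
    # so the waits can overlap instead of running back to back.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(read_file, paths))
```

Whether this actually beats sequential reads depends on the disk: on a single spinning disk, 100 concurrent readers can even be slower because of seek thrashing, so it is worth benchmarking both.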

I've just read that SQLite is fine with smaller databases, but when you hit, say, 3M+ records, there are issues.

I'm now looking into MySQL.

Here's a question. Google use databases and they are able to search masses of data within milliseconds. How do they achieve this, if people are scoffing at 100GB of text? I mean, seriously, 100GB is child's play. It's quite possible that my server will have text files totalling over 1TB within a year or so...especially if I decide to deal with the Spanish language.
 
You could have a look into search engines such as Lucene and see if you could modify one to suit your needs. It may well have a lot of the functionality you need already.

It uses text based indexes, similar to Google, I believe.

Thank you.
 
For an A.I. training algorithm, I would recommend something like Firebird: http://www.firebirdfaq.org/faq59/; as outlined in its FAQ, and indeed true for most modern, robust RDBMSs, you will be hitting the file size limit of the file system in use way before you exhaust the database's constraints. The good thing about Firebird is that it is lightweight and has good performance, and you still get nice perks like advanced data compression, backups and indexing. Check it out.

I've done a few hours of reading on Firebird. It seems pretty good. Its performance is decent and scalable.

I downloaded all the files. I spent an hour trying to register the server...unfortunately it was throwing up all sorts of errors, so I shall be giving MySQL a go. MySQL has a much bigger community, so if I do run into problems, the support should be better.

Given that I've had this problem, other collaborators who are working with me are also likely to run into similar issues, so Firebird is definitely a no-go.

Thank you for suggesting Firebird...it would've been great if it had worked, but unfortunately it didn't.
 
Okay, I've had a thought here.

I was looking at some comparisons of SQLite vs MySQL. It would seem that for smaller DBs, SQLite is faster.

Also, apparently an SQLite database is just a file. This is why it is so lightweight (a good thing).

So, I was thinking, is it possible to run multiple SQLite databases on a single machine? This way, I could run a separate database for each table. Of course, I would lose the ability to relate these tables to one another; however, I can easily do this in my AI program itself.

If I arrange the DBs in this fashion, my server could use multiple threads to access each table simultaneously, without needing to queue up (when an SQLite DB is being written to), as the locks will be minimal.

So, for example,
we have one database called DB1, another called DB2...DB1000, say. Each database is a separate table. Could I run multiple databases in this fashion using SQLite?
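Yes - each SQLite database is a single file, so nothing stops you opening one connection per file. A minimal sketch using Python's bundled sqlite3 module (file and table names are invented for illustration):

```python
import sqlite3

def open_table_db(name):
    # One database file per table (e.g. DB1.db, DB2.db). Each file has its
    # own lock, so a writer on one file never blocks readers on the others.
    conn = sqlite3.connect(f"{name}.db")
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {name} (id INTEGER PRIMARY KEY, value TEXT)"
    )
    return conn

db1 = open_table_db("DB1")
db2 = open_table_db("DB2")
db1.execute("INSERT INTO DB1 (value) VALUES (?)", ("some text",))
db1.commit()
```

Two caveats: sqlite3 connections refuse cross-thread use by default (the `check_same_thread` flag), so each thread should open its own connection to the file it needs; and SQLite's `ATTACH DATABASE` can join several files on one connection if a cross-file query is ever needed.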
 
That's a hell of a project you've got there sunama. :p

Way above anything I'll do, but choose your DB wisely and good luck with it. :cool:

Thanks roger.

I was writing down the concepts for the program 18 months ago, but only started coding in February (I think). So it's been a tad over 6 months. Summer 2011 will be the make or break time for this.

The direction I take now, regarding the database will dictate how this program ultimately moves forward.
 
100GB of data and you weren't even convinced you might need a real database!?

Nope.

Multiple text files are able to do the job. It's just a question of how fast/slow they would be.

Even now, I am deliberating on whether to go with a proper database, as I am currently getting the job done pretty well using caching algorithms and multiple text files. If I use a database, then my caching algorithm will be bypassed and performance may decrease.
 
ZFS with flat files or ZFS with MySQL would be my choices.

First off, thank you for all your responses.
I haven't had time to reply to your suggestions as I've been a little busy...apologies.

ZFS and Windows
I looked into this.

It appears to be the file system of the future. This is what I was after.

In particular, I liked the fact that its limits are practically impossible to hit.

Unfortunately, it would appear that this system is not runnable on Windows...yet.

As I am already in the process of developing the AICore on a Windows based machine, this is a no go.

I am trying my best to incorporate the latest systems. Unfortunately, the older, more mature systems are the best supported, well documented and user friendly; and any problems I have with these older systems have plenty of information available on the net to offer solutions.

Firebird
I really liked the idea of Firebird, but I just couldn't get it to run on my PC, so I ditched it.

SQLite
I liked the idea of having something really lightweight, but all my research told me that SQLite is great for very small databases. However, on larger databases, it loses ground to other database systems.

MySQL
In the end, I settled for MySQL.
It isn't the best thing out there, however, it has none of the problems listed above.

Database Creation
I have already started incorporating a MySQL database into the AICore. I have coded some algorithms which are working nicely with a MySQL database I have running on my PC, 24/7.

Database Development
During the next month or two I shall be spending all my time incorporating the (MySQL) database into the AICore. It is a major task, but all my research tells me that a "formal" database is unavoidable (at this point in time). This goes against my original plan of having a proprietary (simple), text file based system of storing data.

The plan is to keep the text file based storage system and run it alongside the formal database.

All data shall be stored in the database; however, the data which is regularly used/accessed shall also be stored in, and may be accessed from, the text files. This means that if an instance of the AICore server is ever run in isolation, away from the database (or unable to connect to it), it will still be fully functional (though not to the same level as if it had full access to the database).

The text file based storage system also allows the AICore to be completely portable. If the text files take up say 4GB of storage space, the AICore can, in theory, be run off a single DVD, without even installing it on a PC.

If the AICore has access to the database, it has access to extra knowledge.

Last Night
I tweaked and tested the general information verification algorithm last night and it was working flawlessly. This means that if a user types in some information, the information is verified as a fact or not (i.e. true or inconclusive), which is done by checking on the internet.

In summary, AICore can now access the internet and use Google, Bing, Yahoo, etc. to verify information which a user inputs. Any information which is verified by the AICore is stored away in the database. In effect, the more the AICore verifies information (on the internet), the more knowledgeable it becomes. This shall assist AICore in answering questions from the user.
 
For some random reason I thought about this the other day. Since you appear to be using raw .txt files, could you not run a compression algorithm, at the cost of some additional overhead for data retrieval?

I've thought about this, too.

The problem is that every time we need to read from or write to a text file, the file would need to be uncompressed and recompressed. This may take extra time when generating a reply to a user's query.

Due to the extra CPU overhead and the extra time taken to compress/uncompress the file, this idea is a no-go.
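For what it's worth, that overhead is measurable rather than guessable. A rough Python sketch timing a gzip round-trip on a chunk of text (the sample text is arbitrary):

```python
import gzip
import time

def roundtrip(text):
    # Compress then decompress once, returning (compression ratio, seconds taken).
    raw = text.encode("utf-8")
    start = time.perf_counter()
    packed = gzip.compress(raw)
    restored = gzip.decompress(packed)
    elapsed = time.perf_counter() - start
    assert restored == raw  # sanity check: the round-trip is lossless
    return len(packed) / len(raw), elapsed
```

English prose typically compresses to roughly a third of its size with gzip, and a 2MB file round-trips in a fraction of a second even on 2010-era hardware, so the cost may be smaller than feared - but measuring on real conversation logs is the only way to know.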

For me, rather than spending extra CPU cycles on compression, I prefer to use some form of encryption, so that any data saved to file cannot be read by a human. I haven't started working on the encryption algorithm yet...this is something I intend to do in March 2011.
 
The problem with running multiple databases like that is you will have to establish a connection to each database each time you want to access something, which will slow things down.

This information was very helpful in my final decision, when choosing MySQL. Thank you.
 
Be careful that you are only storing data you need, and that it is carefully designed; otherwise you will find database operations hugely expensive.

Another note on database design... Is all the data you are storing in text files unique, or is there quite a bit of duplication of strings?

Thanks for the advice.

My design avoids storing duplicate data.

Pretty much all information is given a unique ID. The relationships are made using these IDs.

The AI has the ability to sort data in such a way that if a piece of data is rarely accessed/used/written-to/read-from, it eventually falls to the bottom of the table or gets deleted.

The idea is that the AICore behaves much like a human: if we regularly use information, we tend to remember it; the less we use it, the more likely we are to completely forget it, at which point we would need to revise or re-learn it. AICore will operate in a similar way, which will help it avoid getting bogged down with superfluous data/information.
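That forgetting rule maps naturally onto a usage counter in the database. A hedged sketch (schema and names invented for illustration; SQLite in memory for brevity, not the poster's MySQL setup):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE facts (
    id INTEGER PRIMARY KEY,
    statement TEXT UNIQUE,
    uses INTEGER DEFAULT 0)""")

def recall(statement):
    # Every access bumps the usage counter, like rehearsing a memory.
    conn.execute("UPDATE facts SET uses = uses + 1 WHERE statement = ?", (statement,))

def forget_least_used(keep):
    # Keep only the `keep` most-used facts; everything else "falls away".
    conn.execute("""DELETE FROM facts WHERE id NOT IN (
        SELECT id FROM facts ORDER BY uses DESC, id LIMIT ?)""", (keep,))

for s in ("blue is a colour", "london is in england", "rarely used trivia"):
    conn.execute("INSERT INTO facts (statement) VALUES (?)", (s,))
for _ in range(3):
    recall("blue is a colour")
recall("london is in england")
forget_least_used(2)
```

A refinement would be to decay the counter over time as well, so old-but-once-popular facts also fade, mirroring the revise/re-learn behaviour described above.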
 
If you are going down that route, then you might also want to look into SQL Server Compact Edition.

I looked into MS SQL Server.
The free version is limited to 4GB, which I felt was laughable.

The enterprise version (which allows for huge databases) is expensive. My research also told me that although MS SQL Server has a shed-load of functions, it suffers in the speed department because of this.

MYSQL won out.
 
Very interesting research project. Notoriously difficult though :)

Would be interested to hear more about it.

It is very difficult.

The lines of code have hit the 13k mark now.

What I find has made the coding difficult is the multi-threading. Had there been no multi-threading, I actually feel the coding process would've been fairly predictable.

There was a stage where I spent a long while adding locks to prevent various threads from accessing the same files/variables, etc.

One thing I will say about threads is that "I LOVE THREADS". The performance gain I've experienced when I switched to using threads was MASSIVE. To give you an idea of the sort of performance boost observed...

At start up, a lot of information is loaded into memory.
StartUp Time Without Multi-threading: approximately 2m30s
StartUp Time With Multi-threading: under 1s

This sort of performance gain was observed throughout runtime.

These results are taken from a computer running a Core2Duo, [email protected] with 4GB of RAM.
 
Sorry to say it, but there's zero chance you're getting a 150x speed-up from multi-threading.
What you're probably doing is avoiding blocking the GUI thread, so the app becomes more responsive. That is a very different thing from a massive performance gain, though.

That's probably what has happened. I put in some other optimisations, which were done over a period of about 3-4 weeks. By the end of that period, the start time (which involved loading data from files) was significantly reduced.

By running the file loading processes in separate/individual threads, I probably avoided conflicts/competition with the GUI...as a result, start time was reduced.
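The distinction is easy to demonstrate: threads give a real wall-clock win when the work is blocking I/O (the waits overlap), but not when it is pure computation. A small Python sketch simulating blocking loads with sleeps (the delay is arbitrary):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_load(_):
    # Stand-in for a blocking file read: the thread just waits.
    time.sleep(0.05)

def load_sequential(n):
    # All waits happen one after another: roughly n * 0.05 seconds.
    start = time.perf_counter()
    for i in range(n):
        fake_load(i)
    return time.perf_counter() - start

def load_threaded(n):
    # All waits happen at once: roughly 0.05 seconds total.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(fake_load, range(n)))
    return time.perf_counter() - start
```

Swap the sleep for a CPU-bound loop and the gap largely disappears on CPython, which is why a 150x gain points to something other than raw parallelism.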
 
Incredible.

To what extent can it determine something to be true? For example, could it determine if blue is a colour? If so, in what format do you structure the question so the algorithm can "understand" it? What types of questions can't it answer?

Some results which have been run through the AICore:
[Input,,output]...direct copy and paste.

blue is a colour,,true
blue is not a colour,,false
london is the biggest city in britain,,true
london is in southeast england,,inconclusive
london was the first city in the world to have an underground railway,,true


The thing that I found most satisfying was when it was running the search all by itself and making deductions based on the results. This was all happening after it received fresh, new input through a client which I was typing into.

The beauty of this algorithm is that AICore cannot be fooled. In the case of most "bots", if enough users give false input (e.g. red is not a colour), the bot's algorithms will eventually believe this to be the case. AICore will always verify this information. Information is also routinely re-verified...this covers the scenario where something that was true 2 weeks ago isn't necessarily true today (e.g. "Gordon Brown is the prime minister of Britain" - after an election, this information may no longer be true, so it will get a new value).

In the above way AICore is able to adapt to changes which happen in the real world...just like a human.

PS. All input is in natural language. No codes/commands needed.
 