Storing data in a database VS .txt file

Tilluss · 15 Oct 2010 at 08:22

sunama said:
Okay, I've had a thought here.

I was looking at some comparisons with Sqlite vs mysql. It would seem that for smaller DBs sqlite is faster.

Also, apparently sqlite is just a file. This is why it is so lightweight (a good thing).

So, I was thinking, is it possible to run multiple sqlite databases on a single machine? This way, I could run a separate database for each table. Of course I would lose the ability to relate these tables to one another, however, I can easily do this in my AI program itself.

If I arrange the DBs in this fashion, my server could use multiple threads to access each table, simultaneously, without needing to queue up (if the sqlite DB is being written to), as the locks will be minimal.

So for example,
we have 1 database called DB1, another called DB2.....DB1000, say. Each database is a separate table. Could I run multiple databases, in this fashion using sqlite?

The problem with running multiple databases like that is you will have to establish a connection to each database each time you want to access something which will slow things down. Having 1 database connection and then just using multiple query's on that database is going to be a lot faster than establishing a new connection each time you want to query something.

daz · 15 Oct 2010 at 08:53

ZFS with flat files or ZFS with MySQL would be my choices.

BunnyKillBot · 20 Oct 2010 at 12:13

For some random reason i thought about this the other day. Since you appear to be using raw txt files, could you not run a compression algorithm at the cost of some additional overheads for data retrieval?

ZombieFan · 20 Oct 2010 at 12:49

sunama said:
Okay, I've had a thought here.

I was looking at some comparisons with Sqlite vs mysql. It would seem that for smaller DBs sqlite is faster.

Also, apparently sqlite is just a file. This is why it is so lightweight (a good thing).

So, I was thinking, is it possible to run multiple sqlite databases on a single machine? This way, I could run a separate database for each table. Of course I would lose the ability to relate these tables to one another, however, I can easily do this in my AI program itself.

If I arrange the DBs in this fashion, my server could use multiple threads to access each table, simultaneously, without needing to queue up (if the sqlite DB is being written to), as the locks will be minimal.

So for example,
we have 1 database called DB1, another called DB2.....DB1000, say. Each database is a separate table. Could I run multiple databases, in this fashion using sqlite?

If you are going down that route, then you might also want to look into SQL Server Compact Edition. Compact Edition is the same as express but it is a single file which you just add into your VB.NET project from within Visual Studio. It's (IMO) the quickest and easiest way to hook a database up in the .NET envronment. You are limited to a 4GB database though...

In my opinion, using a proper database is going to be much quicker than your current text files. A database is not just there to store data; there is also hell of a lot of work going on under the covers to make sure data is served back to you in the most efficent way possible. As long as you've got a good database design behind it with good indexes etc then it should be like lightning. If nothing else, learning how to design and configure databases is a very useful skill to have in your belt

More info about MS SQL CE here - http://www.microsoft.com/sqlserver/2008/en/us/compact.aspx

Another note on database design... Is all the data you are storing in text files unqiue, or is there quite a bit of duplication of strings? If there is quite a bit of duplication then designing the database properly before you begin could reduce the size of the data by a huge amount. It will also increase performance since the database will make full use of internal caching etc.

sniffy · 20 Oct 2010 at 13:14

Very interesting research project. Notoriously difficult though

Would be interested to hear more about it.

Dj_Jestar · 20 Oct 2010 at 13:26

I can't help but think you are being way over optimistic if you think text files for 100GB+ of data is actually faster than a database. The lack of indexes for a start.

What is this text you are storing?

Speed is important to everyone, but never as important as having a working and maintainable solution.

You need to give info to get info, tbh. Far too much of what it is you are trying to do is being left out, thus all you are gettin in return are generic "use mssql/sqllite/other well known DB engine" replies.

What are your load test results as your system is now? What are they when using some like mssql server?

And just to be blunt, you are coming across as someone who doesn't know what they are talking about, but is trying hard to look like they do. Apparently not even knowing what a database is, let alone the benefits of using one.

daz · 20 Oct 2010 at 14:49

Dj_Jestar said:
I can't help but think you are being way over optimistic if you think text files for 100GB+ of data is actually faster than a database. The lack of indexes for a start.

The index I would imagine would be the filename, which the file system would handle as part of its duties. Something like ZFS would deal with it easily enough.

That said I would still recommend a proper database system for this.

Dj_Jestar · 20 Oct 2010 at 15:44

I can't even tell if they are files or just long strings from the information given. If the former, then yes file names, if the latter (or a combination of the two) then an index on the content will be needed as well.

sunama · 20 Oct 2010 at 21:07

daz said:
ZFS with flat files or ZFS with MySQL would be my choices.

First off, thank you for all your responses.
I haven't had time to reply to your suggestions as I've been a little busy...apologies.

ZFS and Windows
I looked into this.

It appears to be the file system of the future. This is what I was after.

Particularly, I liked the fact that its limits were almost unlimited.

Unfortunately, it would appear that this system is not runnable on Windows...yet.

As I am already in the process of developing the AICore on a Windows based machine, this is a no go.

I am trying my best to incorporate the latest systems, unfortunately, the older systems which are more mature, are the most well supported, are well documented and user friendly; and any problems I have with these older systems have plenty of information available on the net, to offer solutions.

Firebird
I really liked the idea of firebird, but I just couldn't get it to run on my PC, so ditched it.

MySQLite
I liked the idea of having something really lightweight, but all my research told me that for very small databases, the "lite" version is great. However, on larger databases, it loses ground to other database systems.

MySQL
In the end, I settled for MySQL.
It isn't the best thing out there, however, it has none of the problems listed above.

Database Creation
I have already started incorporating a MySQL database into the AICore. I have coded some algorithms which are working nicely with a MySQL database I have running on my PC, 24/7.

Database Development
During the next month or 2 I shall spending all my time incorporating the (MySQL) database into the AICore. It is a major task, but all my research tells me that a "formal" database is unavoidable (at this point in time). This goes against my original plan of having a proprietary (simple), text file based system of storing data.

The plan is run the text file based storage system and run it alongside the formal database.

All data shall be stored in the database, however, the data which is regularly used/accessed in the database shall also be stored and maybe accessed from the text files. This means, that if an instance of the AICore server is ever run in isolation, away from the database (or unable to connect to it), it will still be fully functional (but not to the same level, as if it would have full access to the database).

The text file based storage system also allows the AICore to be completely portable. If the text files take up say 4GB of storage space, the AICore can, in theory, be run off a single DVD, without even installing it on a PC.

If the AICore has access to the database, it has access to extra knowledge.

Last Night
I tweaked and tested out the general information verification algorithm last night and it was working flawlessly. This means that if a user types in some information, the information is verified as a fact or not (ie. true or inconclusive), which is done by checking on the internet.

In summary, AICore can now access the internet and use Google, Bing, yahoo, etc to verify information which a user inputs. Any information which is verified by the AICore, is stored away in the database. In effect, the more the AICore verifies information (on the internet), the more knowledgeable it becomes. This shall assist AICore in answering questions from the user.

sunama · 20 Oct 2010 at 21:12

BunnyKillBot said:
For some random reason i thought about this the other day. Since you appear to be using raw txt files, could you not run a compression algorithm at the cost of some additional overheads for data retrieval?

I've thought about this, too.

The problem is that every time we need to read or write from a text file, the file would need to be compressed and uncompressed. This may take extra time when generating a reply to a user's query.

Due to the extra CPU overhead and the extra time taken to compress/uncompress the file, this idea is a no-go.

For me, rather than spending extra cpu cycles on compression, I prefer to use some form of encryption, so that any data saved to file, cannot be read by a human, as it is encrypted. I haven't started working on the encryption algorithm yet...this is something I intend to do in March 2011.

sunama · 20 Oct 2010 at 21:15

Tilluss said:
The problem with running multiple databases like that is you will have to establish a connection to each database each time you want to access something which will slow things down.

This information was very helpful in my final decision, when choosing MySQL. Thank you.

sunama · 20 Oct 2010 at 21:28

BlackDragon said:
Be careful that you are only storing data you need, and that it is carefully designed otherwise you will find database operations will be hugely expensive.

ZombieFan said:
Another note on database design... Is all the data you are storing in text files unqiue, or is there quite a bit of duplication of strings?

Thanks for the advice.

My design avoids the use of storing duplication of data.

Pretty much all information is given a unique id. The relationships are made using the ids.

AI has the ability to sort data in such a way that if a peice of data is rarely accessed/used/written-to/read-from, it eventually falls away to the bottom of the table or gets deleted.

The idea is that the AICore behaves much like a human so that if we regularly use information, we tend to remember it. However, the less we use it, the more likely we are to completely forget it. At this point we would need to revise or re-learn the information. AICore will operate in a similar way which will help avoid the situation of getting bogged down with superfluous data/information.

sunama · 20 Oct 2010 at 21:31

ZombieFan said:
If you are going down that route, then you might also want to look into SQL Server Compact Edition.

I looked into MS SQL Server.
The free version is limited to 4GB which I felt was laughable.

The enterprise version (which allows for huge databases) is expensive. My research also told me that although MS SQL Server has a shed-load of functions, it suffers in the speed department because of this.

MYSQL won out.

sunama · 20 Oct 2010 at 21:35

Dj_Jestar said:
stuff

Mr DJ.
You wrote plenty of stuff.
But I'm sorry to say, you were no help whatsoever.
I'm bemused as to why you even bothered replying.

sunama · 20 Oct 2010 at 21:49

sniffy said:
Very interesting research project. Notoriously difficult though

Would be interested to hear more about it.

It is very difficult.

The lines of code have hit the 13k mark now.

What I find has made the coding difficult is the multi-threading. Had there been no multi-threading, I actually feel the coding process would've been fairly predictable.

There was a stage where I spent about adding locks to prevent various threads from accessing the same file/variables, etc.

One thing I will say about threads is that "I LOVE THREADS". The performance gain I've experienced when I switched to using threads was MASSIVE. To give you an idea of the sort of performance boost observed...

At start up, a lot of information is loaded into memory.
StartUp Time Without Multi-threading: proximately 2m30s
StartUp Time Witht Multi-threading: proximately <1s

This sort of performance gain was observed throughout runtime.

These results are taken from a computer running a Core2Duo, [email protected] with 4GB of RAM.

Haircut · 20 Oct 2010 at 21:58

sunama said:
One thing I will say about threads is that "I LOVE THREADS". The performance gain I've experienced when I switched to using threads was MASSIVE. To give you an idea of the sort of performance boost observed...

At start up, a lot of information is loaded into memory.
StartUp Time Without Multi-threading: proximately 2m30s
StartUp Time Witht Multi-threading: proximately <1s

This sort of performance gain was observed throughout runtime.

These results are taken from a computer running a Core2Duo, [email protected] with 4GB of RAM.

Sorry to say it, but there's zero chance you're getting a 150x speed up by multi-threading.
What you're probably doing is avoiding blocking the GUI thread so the app becomes more responsive. It is a very different thing to being a massive performance gain though.

sunama · 20 Oct 2010 at 22:07

Haircut said:
Sorry to say it, but there's zero chance you're getting a 150x speed up by multi-threading.
What you're probably doing is avoiding blocking the GUI thread so the app becomes more responsive. It is a very different thing to being a massive performance gain though.

That's probably what has happened. I put in some other optimisations, which were done over a period of about 3-4 weeks. But, by the end that 3-4 week period, the start time (which involved loading data from files), was significantly reduced.

By running the file loading processes in separate/individual threads, this probably avoided conflicts/competition with the GUI...as a result start time was reduced.

sniffy · 20 Oct 2010 at 22:24

sunama said:
Last Night
I tweaked and tested out the general information verification algorithm last night and it was working flawlessly. This means that if a user types in some information, the information is verified as a fact or not (ie. true or inconclusive), which is done by checking on the internet.

In summary, AICore can now access the internet and use Google, Bing, yahoo, etc to verify information which a user inputs. Any information which is verified by the AICore, is stored away in the database. In effect, the more the AICore verifies information (on the internet), the more knowledgeable it becomes. This shall assist AICore in answering questions from the user.

Incredible.

To what extent can it determine something to be true? For example could it determine if blue is a colour? If so in what format do you structure the question so the algorithm can "understand" it? What type of questions can't it answer?

sunama · 20 Oct 2010 at 23:04

sniffy said:
Incredible.

To what extent can it determine something to be true? For example could it determine if blue is a colour? If so in what format do you structure the question so the algorithm can "understand" it? What type of questions can't it answer?

Some results which have been run through the AICore:
[Input,,output]...direct copy and paste.

blue is a colour,,true
blue is not a colour,,false
london is the biggest city in britain,,true
london is in southeast england,,inconclusive
london was the first city in the world to have an underground railway,,true

The thing that I found most satisfying was when it was running the search all by itself and making deductions, based on the results. This was all happening after it received fresh, new input, through a client which I was typing in.

The beauty of this algorithm is that AICore cannot be fooled. In the case of most "bots", if enough users give false input (eg. red is not a colour), the bot will have algorithms which will eventually believe this to be the case. AICore will always verify this information. Information is routinely re-verified...this covers the scenario where if something was true 2 weeks, it won't necessarily be true today (eg. Gordon Brown is the prime minister of Britain - after the elections, 2 weeks later, this information will no longer be true, so will have a new value).

In the above way AICore is able to adapt to changes which happen in the real world...just like a human.

PS. All input is in natural language. No codes/commands needed.

sniffy · 20 Oct 2010 at 23:08

Oh my god, that gives me a bit of a semi. Absolutely brilliant work.