.NET question about List - is '.Add' thread-safe?

Hi,

I have a quick question regarding Lists in VB.NET and multi threading.

Let's say we have a list of strings, called alphabetList.

We then have 26 threads, each adding a single letter of the alphabet to alphabetList.

so...

thread1: alphabetList.add("a")
thread2: alphabetList.add("b")
...
thread26: alphabetList.add("z")

Now let us assume that we create all threads and then start them all at exactly the same time.

Would the above scenario be thread-safe, i.e. would there be 26 distinct elements, all added correctly to alphabetList?

My plan is to load 100,000+ words into a list using multiple threads, and I need to know whether locks will be required or whether I can simply add all the words without any locking.

At present, the loading process takes 5-8 seconds and is quite complex, involving multiple locks and threads. If I could add items to the list without locking, it might speed things up.

Thanks
 
No, it's not safe.

Move your "add" logic into one place and use something like:
Code:
public void ThreadSafeAdd(T value)
{
    lock (list)
    {
        list.Add(value);
    }
}
where T is the element type of the List<T> behind list.

Or better yet, create a wrapper and implement IList<T>.
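For illustration, a minimal sketch of such a wrapper (the class and field names here are my own, not a standard type; enumeration hands back a snapshot):
Code:
using System.Collections;
using System.Collections.Generic;

// Minimal locking wrapper around List<T>. Every member takes the same lock,
// so individual operations are atomic. Compound operations (check-then-add
// and the like) still need external locking.
public class SynchronizedList<T> : IList<T>
{
    private readonly List<T> inner = new List<T>();
    private readonly object gate = new object();

    public void Add(T item)           { lock (gate) inner.Add(item); }
    public bool Remove(T item)        { lock (gate) return inner.Remove(item); }
    public void Insert(int i, T item) { lock (gate) inner.Insert(i, item); }
    public void RemoveAt(int i)       { lock (gate) inner.RemoveAt(i); }
    public void Clear()               { lock (gate) inner.Clear(); }
    public bool Contains(T item)      { lock (gate) return inner.Contains(item); }
    public int IndexOf(T item)        { lock (gate) return inner.IndexOf(item); }
    public void CopyTo(T[] a, int i)  { lock (gate) inner.CopyTo(a, i); }
    public int Count                  { get { lock (gate) return inner.Count; } }
    public bool IsReadOnly            { get { return false; } }

    public T this[int i]
    {
        get { lock (gate) return inner[i]; }
        set { lock (gate) inner[i] = value; }
    }

    // Enumerate over a snapshot so the lock isn't held during iteration.
    public IEnumerator<T> GetEnumerator()
    {
        lock (gate) return new List<T>(inner).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}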
 
Just had a bash at this myself, and a simple extension method would do the trick:
Code:
public static void ThreadSafeAdd<T>(this IList<T> source, T value)
{
    lock (source)
    {
        source.Add(value);
    }
}
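Usage is then a one-liner wherever you'd normally call Add, for example:
Code:
var words = new List<string>();

// Safe from any thread, because the extension method locks on the list itself:
words.ThreadSafeAdd("apple");
One caveat with locking on the list object itself: it only protects you if every piece of code that touches the list takes that same lock.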
 
No, it's not safe.

Move your "add" logic into one place and use something like:
Code:
public void ThreadSafeAdd(T value)
{
    lock (list)
    {
        list.Add(value);
    }
}
where T is the element type of the List<T> behind list.

Thanks for your reply.

That's pretty much what I've done (in VB.NET, though), but I feel that the locking is slowing things down. Hey ho.
 
How long does it take with just the one thread adding these words...?

My first (non-threaded) version of the LoadWordList algorithm loaded all words from a text file in over 4 minutes.

I now load the word list from 26 different text files using 26 threads, and it takes less than 8 seconds.
 
You could also load each file into a separate List and collect them (.AddRange) once all threads have finished.
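A rough sketch of that approach, assuming each worker only needs the raw lines from its file:
Code:
using System.Collections.Generic;
using System.IO;
using System.Threading;

static List<string> LoadAll(string[] files)
{
    // Each worker fills its own local list, so no locking is needed while loading.
    List<string>[] partials = new List<string>[files.Length];
    Thread[] workers = new Thread[files.Length];

    for (int i = 0; i < files.Length; i++)
    {
        int n = i; // capture a copy of the index for the lambda
        workers[n] = new Thread(() => partials[n] = new List<string>(File.ReadAllLines(files[n])));
        workers[n].Start();
    }

    foreach (Thread t in workers)
        t.Join(); // wait for every loader to finish

    // Single-threaded merge at the end; AddRange is cheap once it's all in memory.
    List<string> all = new List<string>();
    foreach (List<string> part in partials)
        all.AddRange(part);
    return all;
}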
 
You could also load each file into a separate List and collect them (.AddRange) once all threads have finished.

Yep, if you're reading in from text files you're probably IO bound rather than compute bound, so I'd look to parallelise that bit of the work and then load them into the list once all of that stuff is done.

Loading 100000 items into a list should take next to no time if it's all in memory in the first place.

EDIT: I should probably say, .NET has built in mechanisms for asynchronous IO that don't eat up threads, so you should probably look at using these.
http://msdn.microsoft.com/en-us/library/kztecsys(v=vs.71).aspx
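For example, the classic Begin/End pattern from that article looks roughly like this (a sketch only; a real version would loop, since a single BeginRead isn't guaranteed to return the whole file):
Code:
using System;
using System.IO;
using System.Text;

static void ReadFileAsync(string path, Action<string> onLoaded)
{
    // The final 'true' requests asynchronous I/O: the OS performs the read,
    // and no thread sits blocked waiting for the disk.
    FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                   FileShare.Read, 4096, true);
    byte[] buffer = new byte[fs.Length];

    fs.BeginRead(buffer, 0, buffer.Length, ar =>
    {
        int read = fs.EndRead(ar); // completes the operation
        fs.Close();
        onLoaded(Encoding.UTF8.GetString(buffer, 0, read));
    }, null);
}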
 
It's also worth asking: Do you actually need those 100000 items in the list at once? Would it be possible to chunk them throughout the entire process?

I only ask this, because time and time again I've seen people load everything into memory, when they only need a subset of it for the task at hand. :)
 
Right,

I've been playing around with performance; unfortunately, I think I've hit a brick wall at a fastest load time of around the 8s mark.

You could also load each file into a separate List and collect them (.AddRange) once all threads have finished.

This is actually what I do.

Roughly speaking I do the following:
Spawn 26 threads, each loading up words starting with a different letter of the alphabet.

I declare a global/public variable, word, as a List(Of String). This is the variable holding all the (114717) words.

Inside each thread we do the following:

Load up the full word list (starting with a given letter of the alphabet) into a smallList (of String).

Then I acquire the lock on word.

Then I call word.AddRange(smallList).

Then I release the lock on word.

The remaining 25 threads keep adding to the main 'word' List.

I've found this to be the fastest method of loading up words into a List.

As a double check, the number of words loaded into word is noted and stored.

I then reload the word list as above and compare the new word.Count with the previous one, making sure they are identical. This ensures the same number of words has been loaded both times.

If the two counts differ, we reload again until two successive loads produce equal counts.

The above all takes place when the program is started up and it takes 8s until all words have been loaded (twice).

Without multithreading, the process took over 4 minutes to load all the words just once.
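In C# terms (this thread's examples are C#, though my code is VB.NET), each loader thread is roughly doing the following, with the parsing step passed in as a delegate:
Code:
using System;
using System.Collections.Generic;

static readonly List<string> word = new List<string>();
static readonly object wordLock = new object();

static void LoaderThread(char letter, Func<char, List<string>> loadWordsForLetter)
{
    // Parse this letter's file into a purely local list: no contention here.
    List<string> smallList = loadWordsForLetter(letter);

    // The lock is held only for the one bulk merge, not once per word.
    lock (wordLock)
    {
        word.AddRange(smallList);
    }
}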

Yep, if you're reading in from text files you're probably IO bound rather than compute bound, so I'd look to parallelise that bit of the work and then load them into the list once all of that stuff is done.

In effect, that is what I do above.

It's also worth asking: Do you actually need those 100000 items in the list at once? Would it be possible to chunk them throughout the entire process?

Hehehe. When it initially took 4+ minutes to load everything, I was thinking exactly as you are. As mine is natural language text processing software, it must have the full word list in memory at all times. This is essential: if I were to "bin" the lesser-used words, I might have to reload all the words whenever I came across a word not in the word List. The memory footprint for those 114744 words is actually very small, so it makes sense to keep the full List in memory all the time.

The solution came in the form of multithreading.

I only ask this, because time and time again I've seen people load everything into memory, when they only need a subset of it for the task at hand. :)

Consider that there will potentially be millions of requests to access the word List every minute; the last thing we need is a partial word List that constantly has to load items from a database or text file, all in an effort to save a few MBs of RAM.
 
Unfortunately, I am using VS2008 (I don't feel the need to move to VS2010 yet), so I'm stuck with .NET 3.5.

I was looking at the parallel libs yesterday.

In my program I make extensive use of For loops, and if I can use a parallel For loop (on multicore CPUs), this could speed up my program. However, from what I've read, the speed-up only appears when the loop body is big; if the body is small to medium, a parallel loop may actually be slower than a normal single-threaded For loop.
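For reference, the .NET 4 form is just this; whether it beats a plain loop depends entirely on how heavy the body is:
Code:
using System;
using System.Threading.Tasks;

static void ProcessAll(string[] items, Action<string> process)
{
    // .NET 4 only. Iterations are spread across cores, but the scheduling
    // overhead per iteration means this only pays off for non-trivial bodies.
    Parallel.For(0, items.Length, i => process(items[i]));
}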

I may have a play around on my other computer with VS2010 and parallel For loops, to test out performance, but I don't want to go through the palaver of changing to VS2010 and porting my code to .NET 4.0, only to find that VS2010 sucks (which some reports have stated, forcing many to downgrade back to VS2008).
 
I'm going to be honest, I don't understand what it is you're doing... I can load a single text file with 1.1 million words in 3.35 seconds, single-threaded. 114,000 words is not a lot at all, and for it to take 4+ minutes single-threaded is pretty crazy.

Try using a string (or string array) and simply load the file(s) into a StreamReader and read to the end. Split by spaces if you're using the array...

As has been said, once it's in memory it should be incredibly fast.

edit: You can try out a Parallel.For method in 3.5. There is a version of it floating around on the net somewhere..
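Something along these lines, for example; a naive sketch that chunks the range across the thread pool, not one of the tuned versions you'll find online:
Code:
using System;
using System.Threading;

static void ParallelFor(int from, int to, Action<int> body)
{
    int workers = Environment.ProcessorCount;
    int chunk = (to - from + workers - 1) / workers; // round up
    ManualResetEvent[] done = new ManualResetEvent[workers];

    for (int w = 0; w < workers; w++)
    {
        int start = from + w * chunk;
        int end = Math.Min(start + chunk, to);
        ManualResetEvent signal = done[w] = new ManualResetEvent(false);

        ThreadPool.QueueUserWorkItem(ignored =>
        {
            for (int i = start; i < end; i++)
                body(i);
            signal.Set(); // this chunk is finished
        });
    }

    WaitHandle.WaitAll(done); // block until every chunk completes
}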
 
Ahh.

Something I forgot to mention.

The file is first loaded.
As you say...this is no biggy.

After loading, the words are extracted from each line, processed and stored away in a list, as appropriate.

Any duplicates are removed.

The words are contained in curly brackets. These are removed.
Comment lines are removed.

If necessary, the files are cached.

It's not as simple as just using 5 lines of code to load the contents of a file; there is a lot more to it than that. I should've made this clearer.
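As an illustrative sketch only (the real rules live in my code; the '#' comment marker and brace handling here are assumptions):
Code:
using System.Collections.Generic;
using System.IO;

static List<string> ProcessFile(string path)
{
    HashSet<string> seen = new HashSet<string>(); // HashSet<T> is in .NET 3.5
    List<string> words = new List<string>();

    foreach (string line in File.ReadAllLines(path))
    {
        string trimmed = line.Trim();
        if (trimmed.Length == 0 || trimmed.StartsWith("#")) // assumed comment marker
            continue;

        string word = trimmed.Trim('{', '}'); // strip the curly brackets
        if (seen.Add(word)) // Add returns false for duplicates
            words.Add(word);
    }
    return words;
}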

Parallel.For method
I think I read up on this. Somebody wrote one and tested it. What they found was that on smaller For blocks, their code had similar performance to the native .NET 4.0 Parallel.For method; however, as the block of code got larger, the performance of the two diverged.

Unless you are talking about a different Parallel.For method?
 
I'm not 100% sure of all the operations you are doing, but List<T> might not be your best bet, as mentioned above.

This is an excellent overview of all the collections.

If you are using a List with a large number of items and you know the size beforehand, which you do, use the constructor that sets the initial capacity. It will avoid the expensive copy operations the List performs on its internal array whenever it needs to grow.
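For example, with your word count:
Code:
// Reserve full capacity up front so the internal array never has to grow
// (each growth allocates a new array and copies every element across).
List<string> word = new List<string>(114717);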
 
I understand exactly what it is you're doing. Are you reading line-by-line in that case? That's not exactly the fastest way of doing it. I was thinking more like...

var words = new StreamReader(file).ReadToEnd().Replace("}", " ").Replace("{", " ").Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

You can then do your duplicate checking. The only difficulty with that is removing the comment lines. Do they need to be there? Maybe just put it all into a string first, remove the commented lines up to the end-of-line (\n or whatever), and then continue.
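A sketch of that, assuming a "//" comment marker (substitute whatever the files actually use):
Code:
using System;
using System.IO;
using System.Linq;

static string[] LoadWords(string file)
{
    // Read the whole file once, drop the comment lines, then split what's left.
    string[] lines = File.ReadAllLines(file);
    string joined = string.Join(" ",
        lines.Where(l => !l.TrimStart().StartsWith("//")).ToArray());

    return joined.Replace("}", " ").Replace("{", " ")
                 .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
}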
 
Hi,

I'm not reading line by line. I am reading the contents of the entire file. Once in memory, the manipulation/processing of the lines is done.

If I choose to use the caching algorithm, then the items will be loaded up from the text file if not available in cache. Otherwise, the contents of the file will be loaded from the cache.

The comment lines are very important. In fact, everything held in the text files is important...there is no excess fat...if you get my drift.
 
I'm not 100% sure of all the operations you are doing, but List<T> might not be your best bet, as mentioned above.

This is an excellent overview of all the collections.

I've read through that, thank you. I may use some of the other collections at a later date and have saved the link for future reference.

If you are using a List with a large number of items and you know the size beforehand, which you do, use the constructor that sets the initial capacity. It will avoid the expensive copy operations the List performs on its internal array whenever it needs to grow.

I spent half an hour doing an experiment.

With Lists, every time we add and the current capacity is hit, the capacity doubles. This doubling allegedly uses resources and wastes time. As you have stated, setting the capacity before the series of .Add calls should, in theory, save time.

I tried this and it didn't work.

I set up a timing system and in general, I got the following results:

typical load time (capacity set before .Add): 7.2s
typical load time (capacity left at the default): 7.0s

I thought that this was a little bizarre, but I have to rely on what I see and the results of my tests, as opposed to the theory. Hence, I shall not be setting the capacity of the List before calling .Add.
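For what it's worth, a minimal sketch of how such a timing comparison might be set up (both runs over the same in-memory words, so disk I/O doesn't drown out the difference):
Code:
using System;
using System.Collections.Generic;
using System.Diagnostics;

static void CompareCapacity(string[] words)
{
    Stopwatch sw = Stopwatch.StartNew();
    List<string> presized = new List<string>(words.Length); // capacity set up front
    foreach (string w in words) presized.Add(w);
    sw.Stop();
    Console.WriteLine("pre-sized: " + sw.ElapsedMilliseconds + "ms");

    sw = Stopwatch.StartNew();
    List<string> growing = new List<string>(); // capacity doubles as needed
    foreach (string w in words) growing.Add(w);
    sw.Stop();
    Console.WriteLine("growing:   " + sw.ElapsedMilliseconds + "ms");
}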
 
I thought that this was a little bizarre, but I have to rely on what I see and the results of my tests, as opposed to the theory. Hence, I shall not be setting the capacity of the List before calling .Add.

Strange! And although I must admit I've never benchmarked it myself, is it possible you're inadvertently newing up a List in a LINQ expression?
 