.Net question about List - is '.add' threadsafe?

Soldato
Joined
18 Oct 2002
Posts
15,861
Location
NW London
Hi,

I have a quick question regarding Lists in VB.NET and multi threading.

Lets say we have a list of strings, called alphabetList.

We then have 26 threads, each adding a single letter of the alphabet to alphabetList.

so...

thread1: alphabetList.add("a")
thread2: alphabetList.add("b")
...
thread26: alphabetList.add("z")

Now let us assume that we create all threads and then start them all at exactly the same time.

Would the above scenario be thread safe, ie. will there be 26 distinct elements, all added correctly to alphabetList?

My plan is to load up 100000+ words, into a list, using multiple threads and I need to know if locks will be required or if I am OK to simply add all the words in without any locking.

At present, the loading up process takes 5-8 seconds and is quite complex, involving multiple locks and threads. However, if I were to input items into the list without locking, it may speed up the loading process.

Thanks
 
No, it's not safe.

Move your "add" logic into one place and use something like:
Code:
public void ThreadSafeAdd(T value)
{
  lock(list)
  {
    list.add(value);
  }
}
where T is the same type as that of List<T> for list.

Thanks for your reply.

Thats pretty much what I've done (in VB.NET, though), but I feel that adding locks is slowing things down. Hey ho.
 
How long does it take with just the one thread adding these words...?

My first (non-threaded) version of the LoadWordList algorithm loaded all words from a text file, in over 4 minutes.

I now load up the word list, using 26 threads, from 26 different text files and it takes less than 8 seconds.
 
Right,

I've been playing around with regards to performance, unfortunately, I think I have hit a brick wall and seem to have reached a fastest load time of around the 8s mark.

You could also load each file in to a separate List and collect them (.AddRange) once all threads have finished.

This is actually what I do.

Roughly speaking I do the following:
Spawn 26 threads, each loading up words starting with a different letter of the alphabet.

I declare a global/public variable word as list (of String). This is the variable holding all the (114717) words.

Inside each thread we do the following:

Load up the full word list (starting with a given letter of the alphabet) into a smallList (of String).

Then I attain word array lock.

Then I: word.addrange(smallList)

release word array lock

The remaining 25 threads keep adding to the main 'word' List.

I've found this to be the fastest method of loading up words into a List.

As a double check, the number of words loaded into word is noted and stored.

I then reload the word list as above and compare the word.count with the previous word.count and make sure that they are identical. This ensures that an identical number of words have been loaded.

If the 2 word.counts are different, we reload again until we get 2 successive word.counts, equal to eachother.

The above all takes place when the program is started up and it takes 8s until all words have been loaded (twice).

Without multithreading, the process was taking over 4 minutes to load all the words up a single time.

Yep, if you're reading in from text files you're probably IO bound rather than compute bound, so I'd look to parallelise that bit of the work and then load them into the list once all of that stuff is done.

In effect that it was I do above.

It's also worth asking: Do you actually need those 100000 items in the list at once? Would it be possible to chunk them throughout the entire process?

Hehehe. When it initially took 4+ minutes to load everything, I was thinking exactly as you are. As mine is a natural language text processing software, it must have, in memory, all the time, a word list. This is essential and if I decide to "bin" the lesser used words, at some point I may have reload all the words, should I come across a word which is not listed in the word List. The memory footprint for those 114744 words is actual very little, so it makes sense to keep that full List in memory all the time.

The solution came in the form of multithreading.

I only ask this, because time and time again I've seen people load everything into memory, when they only need a subset of it for the task at hand. :)

Consider that there shall potentially be millions of requests being made to access the word List, every minute, the last thing we need is a partial word List which is constantly have to load items from a database or text file...all in the effort to save a few MBs of RAM.
 
Unfortunately, I am using VS2008 (I don't feel the need to move to VS2010 yet), so I'm stuck with .NET 3.5.

I was looking at the parallel libs yesterday.

In my program I make extensive use of for loops and if I can use the parallel for loop (for multicore cpus), this could speed up my program. However, from what I've read, if what is contained in the loop is BIG, there is increase in speed. However, if the contents of the For loop are small-medium, it may actually be slower than a normal/single-threaded For loop.

I may have a play around on my other computer with VS2010 and parallel For loops, to test out performance, but I don't want to go through the pulava of changing to VS2010 and changing my code to .NET 4.0, only to find that VS2010 sucks (which some reports have stated, forcing many to downgrade back to VS2008).
 
Ahh.

Something I forgot to mention.

The file is first loaded.
As you say...this is no biggy.

After loading, the words are extracted from each line, processed and stored away in a list, as appropriate.

Any duplicates are removed.

The words are contained in curly brackets. These are removed.
Comment lines are removed.

If necessary, the files are cached.

Its not as simple as just using 5 lines of code to load up the contents of a file, there is a lot more to it than that. I should've made this more clearer.

Parallel.For method
I think I read up on this. Somebody wrote one and tested it. What they found was that on smaller for blocks, their code has similar performance to the native .net 4.0 parallel for method. However, as the block of code got larger, the performance level between the 2 diverged.

Unless you are talking about a different Parallel.For method?
 
Hi,

I'm not reading line by line. I am reading the contents of the entire file. Once in memory, the manipulation/processing of the lines is done.

If I choose to use the caching algorithm, then the items will be loaded up from the text file if not available in cache. Otherwise, the contents of the file will be loaded from the cache.

The comment lines are very important. In fact, everything held in the text files is important...there is no excess fat...if you get my drift.
 
Im not 100% sure on all the operations you are doing but List<T> might not be your best bet, as mentioned above.

This is an excellent overview of all the collections.

Ive read through that and I thank you for that. I may use some other collections at a later date and have saved the link for future reference.

If you are using List with a large number of items, and you know the size before hand which you do, use the constructor that sets the intial size. It will stop as many expensive copy operations as the Lists internal array does this when it needs to grow.

I spent half an hour doing an experiment.

With Lists, everytime we add and the (natural) limit of the list is hit, the capacity, doubles. This doubling, allegedly, uses resources and wastes time. As you have stated, by resizing the array, before using the series of .add methods, should, in theory save time.

I tried this and it didnt work.

I set up a timing system and in general, I got the following results:

typical load time (when i resize the array, before using .add)..............7.2s
typical load time (when I do not resize the list, before using .add) ......7.0s

I thought that this was a little bizarre, but I have to rely on what I see and the results of my tests, as opposed to the theory. Hence, I shall not be resizing the capacity of the List, before calling .add.
 
How do mean resize the array?

Sorry...not array, but List. (I'm so used to using arrays and arrayList)

So for example.

testList.capacity = newSizeOfList


Not so much when you have <1m items:

Word list is 114718 in size, so this would explain why the benefits are not apparent.

Be aware all these are in milliseconds so this is extremely in micro optimisations category.

This would also explain why I am getting no performance gain by defining the capacity of the List.
 
Could you upload the code somewhere? Pastebin or something?

This is the code of LoadFileContentsIntoList.
I have decided to keep the

List.capacity = xxx

line in the code as it doesnt do any harm and may help when dealing with huge lists.


Code:
    'given a file (dir and name) this sub loads its contents, line by line, into a given array
    Public Sub LoadFileContentsIntoList(ByVal directoryAndFileName As String, ByRef fileContentsList As List(Of String), Optional ByVal extract As Boolean = True, Optional ByVal useCache As Boolean = False, Optional ByVal includeDuplicates As Boolean = True)
        If File.Exists(directoryAndFileName) Then
            Dim extractYesOrNo As String
            If extract = True Then
                extractYesOrNo = "yes"
            Else
                extractYesOrNo = "no"
            End If


            Dim contentsOfFile As String = GetDataFromFileByPrecedingText(directoryAndFileName, "[all]", extractYesOrNo, useCache) ', False, False)
            fileContentsList.Capacity = fileContentsList.Capacity + GetNumberOfLinesInFile(directoryAndFileName, useCache) + 1

            If extract = False Then
                Dim ap As New ArrayProcessingClass
                ap.LoadMultipleLineContentsIntoEmptyList(contentsOfFile, fileContentsList, False)
                Exit Sub
            End If


            Dim maxNumberOfCharsInBigString As Integer = contentsOfFile.Length - 1
            Dim ignoreLine As Boolean = False
            Dim maxNumberOfChars2 As Integer = maxNumberOfCharsInBigString - 5
            Dim tp As New TextProcessingClass
            For charNumber = 0 To maxNumberOfCharsInBigString
                If extract = True Then
                    ''donot copy lines which start with ''''', into the array
                    If charNumber < maxNumberOfChars2 Then
                        If String.Equals(contentsOfFile.Substring(charNumber, 5), "'''''") Then
                            ignoreLine = True
                        End If
                    End If

                    If Char.Equals(contentsOfFile.Chars(charNumber), NEWLINE) Then
                        ignoreLine = False
                    End If
                End If


                If ignoreLine = False Then
                    If String.Equals(contentsOfFile.Chars(charNumber), "{") Then
                        'keep going until we hit "}"
                        Dim internalIndex As Integer
                        Dim startInternalIndex As Integer = charNumber + 1
                        For internalIndex = startInternalIndex To maxNumberOfCharsInBigString
                            While cpuUsage >= maxCpuUsage 'performance management
                                Threading.Thread.Sleep(50) 'halts thread if cpu usage is too high
                            End While

                            If String.Equals(contentsOfFile.Chars(internalIndex), "}") Then
                                'now assign the small string to the array, } amd { are excluded
                                Dim tempLineContents As String = contentsOfFile.Substring(charNumber + 1, internalIndex - charNumber - 1)
                                tempLineContents = tp.RemoveCurlyBracketsFromStartAndEndOfString(tempLineContents)

                                If includeDuplicates Then
                                    fileContentsList.Add(tempLineContents)
                                Else
                                    If Not fileContentsList.Contains(tempLineContents) Then fileContentsList.Add(tempLineContents)
                                End If

                                charNumber = internalIndex + 2 'this takes into account the carriage return char
                                Exit For
                            End If
                        Next internalIndex
                    End If
                End If
            Next charNumber
        Else
            errorLog.Add("ERROR27: An error occurred while reading " & directoryAndFileName & _
                    ". Operation aborted.") ', MsgBoxStyle.OkOnly, "File Read Error")
        End If

        fileContentsList.TrimExcess()
    End Sub
 
My belief is that any program (especially an "intelligent" program), must have the ability to throttle/monitor its own resources. The idea is to avoid the scenario of locking up the computer with near 100% cpu usage. My belief is that the computer which runs this application must be able to deal with other stuff as well.

Performance management is still in an experimental phase and I shall probably be creating a completely separate class which deals with this. Rather than placing sleep/wait statements all over the place, it might be easier to simply hold the entire program.

In any case, cpu usage hasn't really been dealt with.

Initially, when I developed the application, the idea was to be able to run this program on any machine (even a mobile phone). As the program developed, I created a server - client scenario, which now allows the clients to be run on a slow PCs, while the main work is done by the Server.

If there is no cpu management, it is quite possible that the server could run at 100 cpu utilization 24/7. The AI is going to be fairly advanced which means that if it is sitting idle, it shall be able to go on the net and research certain subjects. It shall also have the ability to make posts on the official website. The idea is to also allow it to "tweet" and possibly even post on forums (though I haven't yet come up with the algorithm for this).

The idea is that the AI will be continuously evolving and shall never be sitting idle. If I let it go wild, it will quite happily eat 100% of the cpu, 24/7. IMO CPU management is important.
 
Those were my intial thoughts, however, my understanding is as follows:

Lets say AICore (my program name), is receiving input/output requests at a rate of say 10/second. To process a single request requires a fair amount of processing power and this, when you consider that 80% of the code has not yet been written. By time I get close to completing the final version 1 (end of this year), each input/output request from a client will completely bog the server down.

Consider also that AICore (while processing input/output requests from clients), will also be going on the internet and looking up various "terms". It shall be finding meanings to words and be checking whether or not certain statements which users have written are true or false, there is heavy emphasis on web browsers being opened and closed at rapid speed.

Now, if I were to set a thread priority for each thread (consider that there will be 100s of threads and processes working concurrently), it will become very difficult to manage which thread should have a high priority and which has a low priority. Simply to remember all the different threads which may (or may not) be running, during the coding process would be a nightmare. For this reason, I felt it better to simply use a bog standard cpu throttling mechanism, whereby AICore (or the user) can throttle the entire program and not individual threads.

At present I'm at 20k+ lines of code. By the Summer I will hit the 30k mark and by December should be at 45k. As you can imagine, identifying which threads gets what priority could be prove a nightmare.

To get around this problem I was thinking about throttling the ENTIRE program, at certain points. I felt that using the CPUsage was the best method to achieve this.

Also, as I understand, even if a program has the lowest priority, CPU usage may still hit 100%.

Bear in mind, I'm still thinking how best to tackle the issue of cpu usage, so I'm just sounding out my thoughts.
 
Back
Top Bottom