I don't believe .net can't do this...

Soldato
Joined
18 Oct 2002
Posts
7,139
Location
Ironing
I've got a data source that I pull a whole list of words out of. I just need to count how many times each word comes out and then return the top 10 or so. Easy? Apparently not.

My first thought was to throw everything into some sort of hash or collection, with the word string as the key and the number of times it has appeared as the value. However, there is no class like that which lets you sort by value to return the highest 10. The only way I could think of doing it is throwing the name/value pairs into a hashtable, going through each one and then adding the index of the hashtable key into an arraylist comparing it's value with every existing entry in the arraylist to see where it should go. I'd then end up with an arraylist containig the sorted indexes of the keys in the hashtable. A fairly bad way of doing it.

Any ideas?
 
Perhaps I'm missing something but why are you doing this in the database? DBMS's are much better than code at set based operations.
 
it doesn't depend so much on .Net but on the database your using.

If your using SQL Server Or Oracle then you can use a stored Procedure and TSQL to do what your asking. Don't ask me the exact code to do it I just know that this is possible :)
 
The problem is that all the words are stored in a column of a table, but there might be more than one word in each row, space separated. THerefore, one cell might contain "bunny wopt hacksaw" whereas the next row may be NULL and the next one might contain just one word. I need to get the complete list of individual words out, and because this involves splitting, I was under the impression the database couldn't do this...
 
I'd have said the other way round - chuck the words in an arraylist then increment a value in a hashtable using the string as the key.
I'd probably use a List<string> and a Dictionary<string,int> but an arraylist and hashtable would work fine if you like to live life on the wild side :p

Code:
            List<string> strings = new List<string>();
            strings.Add("a");
            strings.Add("b");
            strings.Add("a");
            strings.Add("a");
            Dictionary<string,int> wordCount = new Dictionary<string,int>();
            foreach (string s in strings)
            {
                int count;
                wordCount.TryGetValue(s, out count);
                wordCount[s] = ++count;
            }

Which yields {[a,3],[b,1]} which is pretty much what I hoped for :p

Sort by value and voila... you have your top 10 ;)
 
Oh, and i nearly forgot, since you can write stored procedures in C# if you're using SQL server 2005 express/full you could even put that in the database...
 
NathanE said:
SQL Server 2000 can do stored procedures also...
But only in T-SQL no?


Anyway, turns out sorting a dictionary based on values isn't that simple...

Here's a slightly more fleshy version that works:

Code:
        class StringCount : IComparable<StringCount>
        {
            public string Str;
            public int Count;

            public int CompareTo(StringCount other)
            {
                return other.Count.CompareTo(Count);
            }
        }

        private void button1_Click(object sender, EventArgs e)
        {
            List<string> strings = new List<string>();
            strings.Add("c");
            strings.Add("b");
            strings.Add("a");
            strings.Add("a");
            strings.Add("a");
            strings.Add("b");
            strings.Add("a");
            strings.Add("a");
            strings.Add("a");
            strings.Add("b");
            strings.Add("a");
            strings.Add("a");
            strings.Add("c");
            strings.Add("c");
            strings.Add("c");
            strings.Add("c");
            strings.Add("c");
            strings.Add("c");
            Dictionary<string,StringCount> wordCount = new Dictionary<string,StringCount>();
            foreach (string s in strings)
            {
                StringCount count;
                wordCount.TryGetValue(s, out count);
                if (count == null)
                {
                    count = new StringCount();
                    count.Str = s;
                    count.Count = 0;
                }
                count.Count ++;
                wordCount[s] = count;
            }
            List<StringCount> valueList = new List<StringCount>(wordCount.Values);
            valueList.Sort();
            StringBuilder sb = new StringBuilder();
            sb.AppendFormat("You're top {0} are.....\r\n", valueList.Count);
            foreach (StringCount sc in valueList)
            {
                sb.AppendFormat("{0} x {1}\r\n", sc.Count, sc.Str);
            }
            MessageBox.Show(sb.ToString());
        }
 
This is definitely an operation a database would do best.

What db are you using?

Top of my head, I'd normalise your data into a temporary table then perform the correct sum,group by, order by, top 10 statement for whichever db you're using.

I can tell you how for SQL Server as it's ace, the others shouldn't be dissimilar.
 
Not got time for the full solution but here is a start

Code:
CREATE FUNCTION dbo.SplitString
(
  @List  varchar(500),
  @Delim varchar(10)
)
RETURNS 
@ParsedList table
(
  ListItem varchar(255)
)
AS
/*----------------------------------------------------------------------------------------------
' Function       : SplitString
'
' Description    : Split a string using the passed delimeter - returns a Table variable
'
' Change History :
'
' WHEN        WHO  WHAT
'----------------------------------------------------------------------------------------------
'--------------------------------------------------------------------------------------------*/
BEGIN
  
  DECLARE @ListItem  varchar(255)
  DECLARE @Pos       int

  -- Ensure we have a trailing delimiter
  IF RIGHT(@List,LEN(@Delim)) <> @Delim
    SET @List = LTRIM(RTRIM(@List))+ @Delim

  SET @Pos = CHARINDEX(@Delim, @List, 1)

  IF REPLACE(@List, @Delim, '') <> ''
  BEGIN

    -- Break up the string
    WHILE @Pos > 0
    BEGIN
      
      SET @ListItem = LTRIM(RTRIM(LEFT(@List, @Pos - 1)))
      
      IF @ListItem <> ''
        INSERT INTO @ParsedList (ListItem) VALUES (@ListItem)

      SET @List = RIGHT(@List, LEN(@List) - @Pos - LEN(@Delim) + 1)
      SET @Pos = CHARINDEX(@Delim, @List, 1)

    END

  END	
  RETURN

END
The above code creates a VB like split function.

You'd then need to loop through the values in the table using a cursor, inserting the results from this function into a new table, probably a temp one if disk and processor are not your concern.


If you want to do this a lot, please rethink your data design though and not store the words in space separated format, they should be in a separate table. It would make life so much easier.
 
First thing I would do is normalize your database structure.

Run a query in a cursor to split the space delimitted words into separate fields and the insert these rows into a new table, so that you have one row for each word - life will be much simpler from there - whether you choose to do the processing in the database (recommmended) on in your code.

If you can't do this you can do the same thing at runtime but thats going to be hideous performance wise.

Here's some links to get you started:

Split in Transact SQL

Cursors

There's plenty about cursors in Transact SQL Help to
 
I've got into this habit of thinking up really good reasons for doing something one way, and then forgetting it. There was a very very very (I assure you) very good reason for putting the words in this way, but the reason for doing that over a normal table has escaped me.

What I'll probably do now is re-engineer the whole thing to have the tags in a separate table, and then when I'm 80% done remember the reason why I'm not doing it that way. I'll get back to you lot then...
 
growse said:
THerefore, one cell might contain "bunny wopt hacksaw" whereas the next row may be NULL and the next one might contain just one word

I've got into this habit of thinking up really good reasons for doing something one way, and then forgetting it. There was a very very very (I assure you) very good reason for putting the words in this way, but the reason for doing that over a normal table has escaped me.

There couldn´t ever be a good reason for modelling your database as poorly as that :p :D
 
You're right, I think I was on drugs or something when I did that. I've re-engineered it and overcome the problem that I thought was a lot bigger than it actually was.

Trivial now. Thanks everyone.
 
Back
Top Bottom