Splitting address and house number

Noto · 26 May 2006 at 09:49

Hi all,

I have a lil problem: Basically I have data of a lot of address that I would like store in a database. But I would like to split the house number / name from the first line of address.

eg. 23 madeup street.

So one field would store the "23" and another store "made up street"

This wouldnt be too much of a problem until you take into account of address' like this:

Flat 27 Made up street
Flat 27a Made up street
Kings Cottage Made up street
Kings house 9 Made up street.

Any ideas or help would be greatly appretiated.

Thanks Ernest

Thrash · 26 May 2006 at 10:07

What database are you intending to use?

Noto · 26 May 2006 at 10:09

I had some ideas for rules:

Flat 27 Made up street
Flat 27a Made up street
Kings Cottage Made up street
Kings house 9 Made up street.

For Flat 27 and Flat 27a, could I do something like:

If text string "flat" is found then take the word flat and everything between the two " " spaces. So this would take Flat 27 and Flat 12232a.

Also if there was no number in address line 1 to store it all into the street field. So then I could search for the blank house number fields and fix them manually if needed.

Kinda funny ideas seem to hit you more when you write things down

iron to gold....

ClearChaos · 26 May 2006 at 10:59

split the string 1 character after the number and you should in theory get what you want providing all the data has been entered in the same format

Beansprout · 26 May 2006 at 11:14

I'm rusty on preg_match and have no idea why it's matching twice and partially (or how to stop it), but it seeems to work...

Code:

<?php

$strings = array("Flat 27 Made up street",
"Flat 27a Made up street",
"Kings Cottage Made up street",
"Kings house 9 Made up street");

echo 'Numbers:<br /><pre>';
foreach($strings as $k => $v){
	preg_match('/^(.*)\d{1,4}[a-f]{0,1}/i',$v,$matches);
	print_r($matches);
}
echo '</pre>';

echo 'Addresses:<br /><pre>';
foreach($strings as $k => $v){
	print_r(preg_split('/^(.*)\d{1,4}[a-f]{0,1}/i',$v));
	//print_r($matches);
}
echo '</pre>';


?>

I used preg_match() to match the first part which is the house number, then preg_split() to split off the house number (if it has one) and leave just the address line.

It outputs this...

Code:

Numbers:

Array
(
    [0] => Flat 27
    [1] => Flat 2
)
Array
(
    [0] => Flat 27a
    [1] => Flat 2
)
Array
(
)
Array
(
    [0] => Kings house 9
    [1] => Kings house 
)

Addresses:

Array
(
    [0] => 
    [1] =>  Made up street
)
Array
(
    [0] => 
    [1] =>  Made up street
)
Array
(
    [0] => Kings Cottage Made up street
)
Array
(
    [0] => 
    [1] =>  Made up street
)

But now I go to lunch so I'll return and find a resident regexp expert has put me to shame

ByteJuggler · 26 May 2006 at 11:26

[Update: Note this comment was written before Beansprouts, so isn't commenting directly on his, though I see it's relevant anyway

]

It's actually even a little more complicated than that. My address for example is something like:

Flat 4, 90 Flemming Road,
etc etc

or

4 Higham House, 70 Flemming Road,
etc etc

So you can't just split after the last number.... you have to effectively split after the _first_ occuring number, together with whatever follows it, excluding punctuation and whitespace, up to the next number/letter...

Possibly I'd suggest investigating using a regular expression (or perhaps 2 or 3) to do the parsing rather than trying to come up with some code to deduce it.

JIMA · 27 May 2006 at 00:06

Hi,

Another way to possibly do this, assuming you're using a relational database (I used Ingres to run a quick test of this), would be to split the address on space characters (also include commas etc), so you end up with multiple parts for each address. These would then be inserted into a database table such as:

address_id integer not null,
seq_no integer not null,
addresspart varchar(50) not null

Each address is given a unique numeric id, each part is then given a unique seq_no. So you end up with a load of different rows for the one address, the parts being ordered by seq_no.

You can then select back addresses based on their parts. For example, you could select back all address_id where the address contains the values "FLAT" and "26" by using something like:

select distinct(a.address_id)
from address a
where a.address_id in
(select a2.address_id
from a2.address
where a2.addresspart = 'FLAT'
and a2.address_id in
(select a3.address_id
from a3.address
where a3.addresspart = '26'));

If you want to ensure that the 26 immediately follows FLAT then you could amend the last part of the select to be:

(select a3.address_id
from a3.address
where a3.addresspart='CRESCENT'
and a3.seq_no = a2.seq_no + 1));

This basic outline could be refined no-end e.g making special cases for commonly used words such as "St", "Road", "Crescent".

Advantages are that complex parsing to cater for different address formats is largely avoided. The downside is that you end up with a pretty big address table. Also some awkward SQL could be needed to run queries, probably constructed dynamically at run-time from an entered search string. It would depend on how you wanted to retrieve addresses.

There could well be better ways to write my SQL above to make it less awkward and possibly faster. I don't know what performance would be like, given appropriate keys on the table, having only run it on a handful of addresses.

Hope that's useful.

Jim

Noto · 30 May 2006 at 15:19

thanks for all the help

I should be able to come up with a solution now.