Idea: Standardise values block?

daniel · March 26, 2012 4:13AM

I am quite interested in the proposed operation for matching up by Steve from another thread. I can see how quickly this would address match up problems rather then slow iterations with Search and replace. I definitely think this would save time for people with data integration problems as well as making the process much simpler.

I do have one more suggestion though is that the match up would allow for Many to One scenarios as I have encountered examples where an area name match up would be many to one as a number of areas merged into a single area i.e:

Old Name of Areas :Chester-le-Street, Derwentside, Durham, Easington, Sedgefield, Teesdale, Wear Valley

Which are all now grouped up as just County Durham due to UK parliamentary boundary changes. On the whole the UK has been shrinking the number of areas so often you may get an outdated list with "extra" areas names than the current UK parliament boundaries list.

Original comment taken from a different thread:

This is quite a common problem, but not one which the merge/join block should attempt to solve. Fuzzy matching won't help here. For example, some common value mismatches, most of which you can't solve with fuzzy matching:

- US, USA, United States, US of A, America
- Cardiff, Caerdydd, Cardiff/Caerdydd
- Russia, Russian Federation
- London, Londres
- GB, UK, United Kingdom, Britain, Great Britain, British Isles
etc.

Currently you have to use a Search/Replace block to manage these, which is very tedious to configure. You can use merge/join to identify unique mismatches, then auto-generate a Search/Replace configuration based on those mismatches, but you are still going to have to manually locate each right-side value for each missing left-side value.

A possible solution is a new kind of block, "Standardise values". It would have two inputs: (1) the main data (2) the lookup file. You would pick a field from each side, much as you do in merge/join. It would then enumerate all unique values from left that are missing from right, with searchable drop-downs to pick the correct mapping. It would result in the main data from the left, with all values standardised according to the right file.

steve · March 26, 2012 5:13AM

Thanks for re-posting separately Daniel. Please clarify the many-to-one issue.

daniel · March 26, 2012 5:51AM

Ok, so I have two lists of Local Area District names with different numbers of area names. One looks like a historical list of 400(ish) LAD Names and the other is a current list obtained from the Codepoint file with only 380(ish) Area names.

What looks like has happened is that the local government boundary changes may have regrouped smaller areas to a single body name. In my example, many areas to just one current area name of County Durham.

What I want to avoid is to not end up with just 1:1 as at most only 380/400 would get matched up.

County Durham is a good example because the old list has a separate area called Durham which only concerned the town of Durham but after some years, other areas have abolished and grouped up with the now current County Durham boundary as below.

Durham - County Durham (1:1 Match)
Chester-le-Street - County Durham
Derwentside - County Durham
Easington - County Durham
Sedgefield - County Durham
Teesdale - County Durham
Wear Valley - County Durham

So from your suggestion I would get the situation where Durham will be matched up to County Durham so it may not be selectable from the drop down list for the other areas which would now fall under the County Durham area. Another area which sees a similar situation is Northumberland. For me an inclusion of a button in the drop down list to give the option to search already matched up names would get around.

Extract from Wikipedia http://en.wikipedia.org/wiki/County_Durham#Local_government

Local government

County Durham (governed by Durham County Council). The unitary district was formed on 1 April 2009 replacing the previous two-tier system of a county council providing strategic services and seven district councils providing more local facilities. It has 126 councillors. The seven districts abolished were:[8][9]
Chester-le-Street, including the Lumley, Pelton and Sacriston areas
Derwentside, including Consett and Stanley
City of Durham, including Durham city and the surrounding areas
Easington, including Seaham and the new town of Peterlee
Borough of Sedgefield, including Spennymoor and Newton Aycliffe
Teesdale, including Barnard Castle and the villages of Teesdale
Wear Valley, including Bishop Auckland, Crook, Willington, Hunwick, and the villages along Weardale
The Borough of Darlington: previous to 1 April 1997 the borough came under the control of Durham County Council.[10]
The Borough of Hartlepool: Until 1 April 1996 the borough was one of four districts in the County of Cleveland.[11][12]
The part of the Borough of Stockton-on-Tees that is north of the centre of the River Tees. Stockton was also part of Cleveland until 1996.[11] The remainder of the borough is part of the ceremonial county of North Yorkshire.[12]

steve · March 26, 2012 8:48AM

That's fine. If you're using merge/join to do a lookup file, your constraint is most likely to be that, for each *record* (not value) on the left, there is only one record on the right.

	Test Fuzzy Match.iok	108K
	CP LADs.iok	5K
	OE LADs.iok	5K

Welcome!

Categories

Ideas Parade

Tagged