What are data cleaning techniques?

I'm going to run very briefly through the processes that we would typically go through. Data hygiene and data cleaning really is a bore and a chore for most database managers. We know that, we understand that, but I think everyone knows it's an absolute necessity, and basically it's a dirty job but we do it for you. The way that we would typically run through a data hygiene process would be, first of all, to format the data. We need to make sure that all the towns are in the town field, all the names are in the right name field, the postcodes are in the postcode field, and so on and so forth.

So the first stage we go through is data formatting, and we typically do that by running what's called PAF-ing. PAF is the Postcode Address File, the single database that Royal Mail maintains of every single address in the UK. So the first thing we do is take the records, match them against the PAF, and update and clean them, so we now have a nice cleaned version of the data with all the fields in the right place.

The next step we then look at would be to run the deduplication exercise. This is where we make sure there are no duplicates in the database. So if you've got multiple versions of Mark Robinson in a database, at the end of the processing that we would run you'd have a single version of Mark Robinson in the database. We use a number of different techniques, called fuzzy matching, where we define different matching rules around the dataset to make sure that records match. Things like phonetics, so that Mark spelt with a K would always match Marc spelt with a C. So if you've got two versions of Mark Robinson because someone spelt it wrongly, our solution will pick that up.

Having deduped the data, the next stage is then to look at the contacts. We would typically run through the contact names, making sure that the forenames are in the forename field, the surnames are in the surname field and, of course, titles are in the title field. It's always useful to make sure that the salutation is correct for the name. Being sent something addressed to Mrs Mark Robinson is not cool and I don't want that; I'd really like my title changed to Mr Mark Robinson. So we run that kind of processing to make sure the salutation is correct.
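To make the phonetic matching idea from the deduplication step a little more concrete, here is a minimal sketch in Python. It is not the process described in the talk, only an illustration under stated assumptions: it uses the classic Soundex code as the phonetic rule, the field names (`forename`, `surname`, `postcode`) are hypothetical, the postcode is made up, and it simply keeps the first record seen for each key.

```python
# Illustrative sketch only: Soundex is an assumed stand-in for whatever
# phonetic rules a real deduplication product uses.

# Map each consonant to its Soundex digit; vowels, h, w and y have no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the four-character Soundex code for a name ('' if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # h and w do not separate same-coded consonants
            prev = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]


def dedupe(records):
    """Keep the first record for each phonetic name + postcode key."""
    seen = {}
    for rec in records:
        key = (soundex(rec["forename"]),
               soundex(rec["surname"]),
               rec["postcode"].replace(" ", "").upper())
        seen.setdefault(key, rec)
    return list(seen.values())


# Hypothetical records; the postcode is invented for the example.
records = [
    {"forename": "Mark", "surname": "Robinson", "postcode": "AB1 2CD"},
    {"forename": "Marc", "surname": "Robinson", "postcode": "ab12cd"},
    {"forename": "Mary", "surname": "Robinson", "postcode": "AB1 2CD"},
]
unique = dedupe(records)
```

In this example Mark and Marc share a phonetic key and collapse into one record, while Mary stays separate. A real fuzzy-matching process would normally combine several rules and choose which record to keep more carefully than "first one wins".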

The next part may be running suppression. Now this is a really crucial part of any campaign hygiene if you're mailing to consumers. If your audience is a consumer audience then you must make sure that no one in your audience has died, so you need to run your database against the deceased files, the mortality files, and make sure that those unfortunately deceased people are removed. We also need to make sure that where people have moved address you've got the correct address, and that if a company has changed its name or changed details like that, those are all correct too. So that would be the next phase of running data hygiene.

Following on from there, we then look at potentially appending or enhancing the data. Many clients want to add demographics, whether they have a B2B database or a B2C database. By demographics I mean, if it's B2B, you may want to know what SIC code or what sector the business is in, or the number of employees so we get a feel for how big the organisation is, and their turnover and financial information. We can append all that information, which will then help with segmentation, targeting and analysis. If it's a consumer, we may want to add things like council tax band, so we can get a feel for the wealth of an individual. We may want to add gender if we don't already have it, or append age so we can look at age bands. All of that information can be enhanced and appended onto the dataset through that data hygiene processing. That, very briefly, is a quick explanation of a very dull subject.
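As a rough illustration of the suppression and append steps just described, the sketch below removes any record that also appears in a suppression file and then adds an age band. Everything in it is an assumption made for the example: the field names (`name`, `postcode`, `age`), the exact-match rule on a normalised name and postcode, the ten-year bands, and the sample data are all hypothetical, and a real suppression run would use licensed files and far more careful matching.

```python
def match_key(record):
    """Normalised name + postcode key (an assumed, deliberately simple rule)."""
    name = " ".join(record["name"].lower().split())
    postcode = record["postcode"].replace(" ", "").upper()
    return (name, postcode)


def suppress(records, suppression_file):
    """Return only the records that do not appear in the suppression file."""
    blocked = {match_key(r) for r in suppression_file}
    return [r for r in records if match_key(r) not in blocked]


def age_band(age):
    """Return a ten-year band such as '40-49' for a given age."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def append_age_band(record):
    """Return a copy of the record with an 'age_band' field when age is known."""
    enriched = dict(record)
    if record.get("age") is not None:
        enriched["age_band"] = age_band(record["age"])
    return enriched


# Hypothetical data; names and postcodes are invented for the example.
mailing_list = [
    {"name": "Mark Robinson", "postcode": "AB1 2CD", "age": 47},
    {"name": "Jane  Smith", "postcode": "ef3 4gh", "age": 82},
]
deceased_file = [{"name": "jane smith", "postcode": "EF3 4GH"}]

cleaned = [append_age_band(r) for r in suppress(mailing_list, deceased_file)]
```

Here the second record matches the suppression file once spacing and case are normalised, so only the first record survives, now carrying an `age_band` field.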