Record linkage (matching) algorithm

From Rodovid Engine


Would you like to combine our efforts? It's very easy for me to create links from names, surnames, and places to special pages, for example a page for a country or a wiki page for a person or family. --Baya 18:24, 4 March 2006 (EET)

I'm interested. What did you have in mind? I could direct people who are interested in a person-wiki to your website, and/or I could help you develop a "record linkage" (matching) algorithm to determine when two different person-pages are likely to be about the same person. I'm open to other suggestions as well. I was the CTO for the Family History Department of the LDS Church until a year ago, when I left to start the "Foundation for On-Line Genealogy", a non-profit organization whose mission is to help people discover and share genealogical information on-line. At some point I'd like to give people the opportunity to upload GEDCOM files to the WeRelate website, and I'd like to tell them when people in their pedigree are probable matches to people in others' pedigrees, but I'm not planning to create wiki pages for the uploaded people. I'm thinking instead about an Ajax-based webpage for viewing the GEDCOM files directly. So I don't think we'd be stepping on each other's toes by partnering.--Dallan 08:38, 6 March 2006 (EET)
  • Most of the data is divided into different fields; the page is only a convenient view. Soundex is a good feature, but it does not work with non-Latin characters. Do you know how to compare data about the same person in different languages? --Baya 15:44, 6 March 2006 (EET)
  • There are four things you need to worry about when trying to match individuals in different pedigrees:
  1. How similar are the names? To compare two names, you can base the score on the number of "edits" (differences) between the two names divided by the length of the longer name, so a lower score means a closer match. John and Jon would get a score of 1/4, since there is one difference (deletion of the h), and the longer name is 4 characters. Katherine and Catherine would get a score of 1/9, since the only difference is the replacement of a K with a C. The next step is to "weight" the edits, so that a K<->C substitution "costs" less than a K<->M substitution, since at least in English, K and C often sound alike. The substitution costs could be made language-specific. And it is possible to "learn" a set of substitution costs from a corpus of hand-labeled similar names. I've done that. You need to augment this automated approach with a hand-created database of nicknames. For example, Jon and Jonathan have a score of 5/8, which isn't very good. But Jon is a nickname for Jonathan, so their score should be set close to 0. WeRelate has a similar-name wiki based upon a combination of learned costs and some manual data-entry. It's not great yet, but I'm hoping it will be improved over time. Unfortunately, I don't have much in the way of non-Latin names right now.
  2. How similar are the places? Here you need a good database of current and historical places, including alternate names that places have been known by over time, and the various counties, provinces, and countries that places have belonged to over time. Then when someone says person X was born in Bavaria, and someone else says person Y was born in Germany, you would know that these might be two different names for the same place. WeRelate also has a wiki (the largest one I'm aware of) of current and historical places to help with this. Again, I'm hoping it gets improved over time.
  3. How similar are the dates? This one's easy. The only trick is that you can't use the number-of-days difference between two dates as a measure. If I say person X was born on Jan 13, 1850, and you say person Y was born on Jan 21, 1850, chances are they are not the same person. But if I say person X was born in 1850 and you say person Y was born in 1851, there's a much higher chance they're the same person.
  4. How do you weight the similarity in personal names and event places and dates for the pair of individuals in question as well as their immediate relatives (parents, spouses, children, siblings)? You can do this using machine-learning techniques. You just need someone to sit down and tell you whether a bunch of potential merges are really merges or not. You can continue to gather decisions based upon whether or not people agree with your suggestions of who to merge, and improve your algorithm over time.--Dallan 08:36, 7 March 2006 (EET)
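
The name and date comparisons in items 1 and 3 above can be sketched roughly as follows. This is only an illustration, not the code WeRelate uses: the substitution costs, the nickname table, the function names, and the date tolerances (5 years, 3 months, 3 days) are all made-up assumptions. Lower scores mean a closer match.

```python
from datetime import date

# Assumed, hand-picked costs for letters that often sound alike in English.
# Real costs would be learned from hand-labeled similar names.
CHEAP_SUBSTITUTIONS = {frozenset("kc"): 0.3, frozenset("iy"): 0.3}

# Assumed hand-made nickname table: these pairs get a fixed low score.
NICKNAMES = {frozenset(("jon", "jonathan")): 0.05}


def substitution_cost(a, b, weighted):
    """Cost of replacing character a with b (0 if they are equal)."""
    if a == b:
        return 0.0
    if weighted:
        return CHEAP_SUBSTITUTIONS.get(frozenset((a, b)), 1.0)
    return 1.0


def edit_cost(s, t, weighted=False):
    """Edit distance between s and t, optionally with weighted substitutions."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [float(i)]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1.0,                                   # delete a
                cur[j - 1] + 1.0,                                # insert b
                prev[j - 1] + substitution_cost(a, b, weighted), # substitute
            ))
        prev = cur
    return prev[-1]


def name_score(n1, n2, weighted=False, use_nicknames=True):
    """Edit cost divided by the length of the longer name."""
    s, t = n1.lower(), n2.lower()
    if use_nicknames and frozenset((s, t)) in NICKNAMES:
        return NICKNAMES[frozenset((s, t))]
    longest = max(len(s), len(t))
    return edit_cost(s, t, weighted) / longest if longest else 0.0


def date_score(d1, d2):
    """Compare (year, month, day) tuples; month and day may be None.

    The comparison is made at the coarsest precision either date has,
    so year-only dates tolerate a larger gap than fully specified dates.
    """
    y1, m1, day1 = d1
    y2, m2, day2 = d2
    if m1 is None or m2 is None:        # year precision, assumed 5-year scale
        return min(abs(y1 - y2) / 5.0, 1.0)
    if day1 is None or day2 is None:    # month precision, assumed 3-month scale
        return min(abs((y1 * 12 + m1) - (y2 * 12 + m2)) / 3.0, 1.0)
    days = abs((date(y1, m1, day1) - date(y2, m2, day2)).days)
    return min(days / 3.0, 1.0)         # day precision, assumed 3-day scale
```

With these assumptions, name_score("John", "Jon") gives 1/4 and name_score("Katherine", "Catherine") gives 1/9, as in the examples above. Turning on weighted=True lowers the Katherine/Catherine score further, and the nickname table overrides the 5/8 that Jon/Jonathan would otherwise get. For dates, Jan 13 vs Jan 21, 1850 scores as a full mismatch, while 1850 vs 1851 scores as a near match.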

very important and new notes for me

The idea is good. How can it be implemented if all the data is stored in MySQL tables? The table has 2000 records at this time, plus 2000 more in engine. (We have only 10 users :) )

Can we continue the discussion on engine?

I think it would be best to index information for each person and their immediate relatives in a full-text database like Lucene. Which page on engine would you like to continue the discussion on?--Dallan 07:24, 8 March 2006 (EET)
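
To show what indexing each person together with their immediate relatives buys you, here is a toy inverted index in plain Python. It is a stand-in for a full-text engine, not Lucene's API, and the record layout (name, place, relatives) and the min_shared threshold are assumptions made for the example. The point is only that the index returns a short list of candidate matches, so the detailed scoring runs on a few pairs instead of on every pair in the table.

```python
from collections import defaultdict


def tokens(person):
    """Lowercased words from a person's name, place, and relatives' names."""
    fields = [person["name"], person.get("place", "")]
    fields += person.get("relatives", [])
    words = set()
    for value in fields:
        words.update(value.lower().split())
    return words


def build_index(people):
    """Map each word to the set of person ids whose record contains it."""
    index = defaultdict(set)
    for pid, person in people.items():
        for tok in tokens(person):
            index[tok].add(pid)
    return index


def candidates(index, person, min_shared=2):
    """Ids of indexed people sharing at least min_shared words with person."""
    counts = defaultdict(int)
    for tok in tokens(person):
        for pid in index.get(tok, ()):
            counts[pid] += 1
    return sorted(pid for pid, n in counts.items() if n >= min_shared)
```

For example, a query for John Smith of Bavaria with relative Mary Smith would pick up an indexed "Jon Smith" of Germany who also has a relative Mary Smith (two shared words), but not an unrelated Peter Brown who merely shares the place Bavaria.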

I need to read up on Lucene first. I'll look into it and answer in a couple of days. --Baya 12:02, 8 March 2006 (EET)
