Talk:Similar search

From Rodovid Engine

Jump to: navigation, search

Contents

[edit] Discussion was started at User talk:Rob Hooft

[edit] 4

Comparison should be a little less sensitive. "Cornelis Hooft" born 1846 is definitely not the same person as "Cornelis Hooft" born 1919. Birth years should be compared along with the name. This could also help to distinguish siblings with the same name. I now sometimes get more than 10 "Similar records exist" for a new import....

Comparison.... I add year checking (but I think it also be optional? ) birth date must be not less then gedcom-5 years and not more then gedcom+5years. Dallan propose other ideas about similar search - but it must be coded.
This is much better. I can think of other ways that may restrict false hits even more:
def areSimilar(this, other):
    if this.name is not similar to other.name:
        return False
    if this.deathdate and other.birthdate:
        if this.deathdate is earlier than other.birthdate:
            return False
    if this.birthdate and other.deathdate:
        if this.birthdate is later than other.deathdate:
            return False
    if this.birthdate and other.birthdate:
        if abs (this.birthdate - other.birthdate) > 5 years:
            return False
    return True

I'm afraid use dates as good cause to add or remove person from similar list. BTW. Maybe score system will be more usefull Dallan comments? But I never work with scoring. How we can decide wich similarity has more score compare to other? Current algoritm

def posibleSimilarFor(this):
   foreach( this.spouses as other ) possibleHW[] += other[].spouses
   foreach( this.ind_refs as refnum ) posibleREFN[] += all with similar refnum
   foreach( allpersons as person ) {
	if abs (this.birthdate - other.birthdate) > 5: next
	nameSimilar = false;
	foreach( this.name as name )
                   if person.name is similar name: nameSimilar = true; break;
	if not nameSimilar: next;
	surnameSimilar = false;
	foreach( this.surname as surname ) 
                   if person.surname is similar surname: surnameSimilar = true; break;
	if not surnameSimilar: next;
	possibleSA[] += person;
    }
    return merge(possibleHW, posibleREFN, possibleSA)
You are using an "OR merge" of three lists. But if the name is not similar, how can the spouse make it the same person anyway? People do get married more than once!

Automatic processing checks only father and mother. So this does not infulence. During manual import it is just a bit additional info.

  • Generally, the main goal of search for similar to protect adding persons twicly and more...
  • What is criteria for ERROR "similar exists", now ?
    Do you use FULL NAME + DATE BIRTH(when exists) + Father + Mother to find similar record ?
    Otherwise, all or very much records will be similar... :-(
    Today I continue testing gedcom parser (http://engine.rodovid.org/wk?title=User_talk:Morais69br&action=edit&section=9) and just some records DON'T show this error... but in select list don't exist this same FULL NAME. Example, file Moraes.ged: Morais 05:03, 17 March 2006 (EET)
@I204@ INDI
1 NAME Ursula Maria De /JEZUS/
1 SEX F
1 FAMS @F202@

Status: clan from surname
Errors: unknown birth date, similar exists

(SA) ♀ Roza Maria De Jezus (* ?)
 (SA) ♀ Maria De Jezus (* ?)
 (SA) ♀ Felisberta Izabel De Jezus (* ?)
 (SA) ♀ Maria Custodia De Jezus (* ?)
 (SA) ♀ Roza Maria De Jezus (* ?)
 (SA) ♀ Delfina Candida De Jezus (* ?)
 (SA) ♀ Maria Antonia De Jezus (* ?)
 (SA) ♀ Maria Antonia De Jezus (* ?)
 (SA) ♀ Joaquina Maria De Jezus (* ?)
 (SA) ♀ Anna Maria De Jezus (* ?)
 (SA) ♀ Joana Maria De Jezus (* ?)
 (SA) ♀ Benta Maria De Jezus (* ?)
 (SA) ♀ Rita Maria De Jezus (* ?)
 (SA) ♀ Thereza Maria De Jezus (* ?)
 (SA) ♀ Joaquina Maria De Jezus (* ?)
 (SA) ♀ Maria Maximina De Jezus (* ?)
 (SA) ♀ Maria De Jezus (* ?)
Clã
  • this is due to same surname prefixies. Now surname prefixes not checkes --Baya 14:54, 17 March 2006 (EET)

[edit] 13

There may be bugs in the "similar name" detection. I just got a "similar exists" on input of "1 NAME Marijtje /de Let/" with the two records " Rachel van Dalen (* ?)" and " Cornelia Marshal (* ?)" listed as similar.

  • similar detection provided by: name and surname comparison (SA mark), existing spouses (HW), refnums (REFN)
Ah, I understand now. This gives very many false positives. I think if the names of the person itself are not similar, then the name of the spouse may be identical but it is not the same person!

[edit] Current similar search - SA

Why use only ONE of surname and ONE of name for search similar ? Each 2 or 3 records found a similar person and STOP importing, but not same people because FULL NAME is other ! :-( Morais 04:29, 18 March 2006 (EET)

  • FNE added. --Baya 15:27, 20 March 2006 (EET)

[edit] Fortran code to calculate similarity between words

Biologists compare protein sequences using special matrix-based comparison algorithms. It is possible to use these kind of algorithms to compare words (or names) to prevent a simple typo from failing the match. Here is a fortran program that does this. This algorithm is O(m*n) in the lengths of the two strings, but it guaranteed to find the best match. For comparison of "english" words, I normally extend the basic algorithm to use a matching matrix. In the algorithm here, replacing "a" with "s" is as bad as replacing "a" with "l", but it is possible to augment the algorithm to make "a"/"s" replacements less bad because they are adjacent on a QUERTY keyboard. Biologists do the same: e.g. the amino acids "D" and "E" are more similar than "A" and "N". Rob Hooft 12:47, 18 March 2006 (EET)

      real      function align(word1,word2)
      implicit  none
      character word1*(*)
      character word2*(*)
c---- constants
      integer   maxchar
      parameter (maxchar=25)
      integer   maxcharplus1
      parameter (maxcharplus1=maxchar+1)
      real      gapopenpenalty
      parameter (gapopenpenalty=-0.3)
      real      gapelongationpenalty
      parameter (gapelongationpenalty=-0.2)
c---- local vars
      real      lh(maxcharplus1,maxcharplus1)
      logical   gapped(maxcharplus1,maxcharplus1)
      integer   ii
      integer   jj
      real      ld
      real      gpv
      real      gph
      real      s
      real      sv
      real      sh
      real      sd
      integer   l1
      integer   l2
      integer   oc1
      integer   oc2

      l1 = len(word1)
      l2 = len(word2)
      if(word1.eq.word2) then
        align=l1+1.0
        return
      endif
      if(l1.gt.maxchar.or.l2.gt.maxchar) then
        align=0.0
        return
      endif

      lh(1,1) = 0.0
      gapped(1,1) = .false.

      lh(2,1) = gapopenpenalty
      gapped(2,1) = .true.

      do ii=3,l1+1
        lh(ii,1) = lh(ii-1,1) + gapelongationpenalty
        gapped(ii,1) = .true.
      enddo

      lh(1,2) = gapopenpenalty
      gapped(1,2) = .true.

      do jj=3,l2+1
        lh(1,jj) = lh(1,jj-1) + gapelongationpenalty
        gapped(1,jj) = .true.
      enddo
c
c---- align
c
      do jj=1,l2
        do ii=1,l1
          oc1=ichar(word1(ii:ii))
          oc2=ichar(word2(jj:jj))
          if (oc1.eq.oc2) then
            ld=1.0
          else
            ld=0.0
          endif
          if (gapped(ii+1,jj)) then
            gph=gapelongationpenalty
          else
            gph=gapopenpenalty
          endif
          if (gapped(ii,jj+1)) then
            gpv=gapelongationpenalty
          else
            gpv=gapopenpenalty
          endif
          s=lh(ii,jj)+ld
          sh=lh(ii+1,jj)+gph
          sv=lh(ii,jj+1)+gpv
          sd = max(sh,sv)
          if (s.ge.sd) then
            lh(ii+1,jj+1) = s
            gapped(ii+1,jj+1) = .false.
          else
            lh(ii+1,jj+1) = sd
            gapped(ii+1,jj+1) = .true.
          endif
        enddo
      enddo
c---- maximum score is in the bottom right corner
      align=lh(l1+1,l2+1)
      return
      end                       ! align

http://engine.rodovid.org/compare.php and just converted php code

function compare( $word1, $word2 ) {
    #c---- constants
    $maxchar = 25;
    $maxcharplus1 = $maxchar + 1;
    $gapopenpenalty = -0.3;
    $gapelongationpenalty = -0.2;
    #c---- local vars

    $l1 = strlen( $word1 );
    $l2 = strlen( $word2 );

    if ( $word1 == $word2 ) return $l1 + 1;
    if ( $l1 > $maxchar || $l2 > $maxchar ) return 0;
    $lh[1][1] = 0;
    $gapped[1][1] = false;

    $lh[2][1] = $gapopenpenalty;
    $gapped[2][1] = true;

    for ( $ii = 3; $ii <= $l1+1; $ii++ ) { // do ii=3,l1+1
        $lh[$ii][1] = $lh[$ii-1][1] + $gapelongationpenalty;
        $gapped[$ii][1] = true;
    }

    $lh[1][2] = $gapopenpenalty;
    $gapped[1][2] = $true;

    for ( $jj = 3; $jj <= $l2+1; $jj++ ) { //  do jj=3,l2+1
        $lh[1][$jj] = $lh[1][$jj-1] + $gapelongationpenalty;
        $gapped[1][$jj] = true;
    }
    #c---- align
    for( $jj = 1; $jj <= $l2; $jj++ ) {
        for( $ii = 1; $ii <= $l1; $ii++ ) {
          $oc1 = $word1[$ii];
          $oc2 = $word2[$jj];
          if ( $oc1 == $oc2 ) $ld=1.0; else $ld=0.0;
          if ($gapped[$ii+1][$jj]) $gph=$gapelongationpenalty; else $gph=$gapopenpenalty;
          if ($gapped[$ii][$jj+1]) $gpv=$gapelongationpenalty; else $gpv=$gapopenpenalty;
          $s=$lh[$ii][$jj]+$ld;
          $sh=$lh[$ii+1][$jj]+$gph;
          $sv=$lh[$ii][$jj+1]+$gpv;
          $sd = max($sh,$sv);
          if ($s >= $sd) {
            $lh[$ii+1][$jj+1] = $s;
            $gapped[$ii+1][$jj+1] = false;
          }
          else {
            $lh[$ii+1][$jj+1] = $sd;
            $gapped[$ii+1][$jj+1] = true;
          }
        }
    }
    #c---- maximum score is in the bottom right corner
      return $lh[$l1+1][$l2+1];
}

Also we can add array of symbol similarity etc...

But. This is a big difference - compare to words and find similar from other tens thousands (or more) words. It is must be useable over sql (with indexies).... (((((( Other question - what about other languages? --Baya 20:12, 20 March 2006 (EET)

Indeed. One way to do this is to look for identical words using the index (like you are doing now), and compare whether it is a true hit using this algorithm. It could be used to reduce false hits. Rob Hooft 23:38, 20 March 2006 (EET)

I can add comparing to sort (SA) similars over calculated compare index. --Baya 12:46, 21 March 2006 (EET)

[edit] Picking up brothers and sisters

The Similar search is a good idea, but it picks up brothers and siters too often. Could you do something to prevent this? For example require the name to be similar.--Bjwebb 22:20, 19 March 2006 (EET)

  • Some times brothers or sisters have similar names (((. I understand unconviniences of many falses on similar search - but, we need a usable information but not a depo of thousand similar records. Who can garantie that your far cousin will import information about you and will be know only one of your two names? After her import your father will have two James (1 - Benjamin James and 2 - James ). --Baya 15:39, 20 March 2006 (EET)
    • The thing is that the brothers/sisters it picks up, haven't really got similar names. For example, importing Ada Baddely picks up her borthers/sisters:
      • Fred Baddeley
      • Florry Baddeley
      • Harry Baddeley
      • Frank Baddeley
    • Whose names are not very similar. The way the importer is at the moment I am getting a message for virtually every message I import. Could you make it so that both the surname AND one of the other names (could be second name, so Benjamin James and James would be a match) are needed before it returns a similar result.--Bjwebb 17:09, 20 March 2006 (EET)
  • What about Fred and Fréd ? --Baya 17:32, 20 March 2006 (EET)
    • Could you code it so it will check how similar two names are. If they are only one or two characters different, return a result, if they are very different, like Frank and Ada, don't.--Bjwebb 17:36, 20 March 2006 (EET)
  • We now discuss way for it. --Baya 17:52, 20 March 2006 (EET)

[edit] False positives vs. False negatives

Wikipedia suffers from double entries as well. This is solved there in a different way: rather than making absolutely sure that nobody makes a duplicate, people are encouraged to merge articles. If rodovid would have a "merge persons" entry, that might make it less severe to have duplicates, and make the comparison less strict. Rob Hooft 23:38, 20 March 2006 (EET)

  • I think about it and, of course, I will do merge engine for sysops. Problems:
  1. if many persons added by script - it will be a big problem to merge two big trees manually.
  2. Into wikipedia they work with concrete interpretation of some fact, process, action etc. Genealogical data of person is not so simple for interpretetion by others, especially if data is very partial. --Baya 12:37, 21 March 2006 (EET)

I agree. On the other hand: the genealogical data is very well structured! It is a nice challenge! Rob Hooft 23:05, 21 March 2006 (EET)

  • I'm optimist :) but people makes so many undeliberate mistakes. Today in uk localisation I delete more then 15 records, due to user adds ansectors without relationships and then he found relation posibility and instead to add relations he add new persons.... Clan list helps me, but when clan will contain hundreds persons? I try prevent foolish actions, but today example put me in unique condition...
At this time I see only one way - continuous checking of all new records and mark similar for user/sysop checking... But it is take so many user/sysop time....
 :) I can't go to sleep - waithing for results of gedcom import enabling. --Baya 23:33, 21 March 2006 (EET)

[edit] Too many false negatives

The problem at the moment is that GEDCOM import is picking up brothers/sisters who have just been imported. It makes GEDCOM import very tedious. I think the best way to solve this is to not include people who you have just added in the same import in similar search. Could this be done?--Bjwebb 19:54, 2 April 2006 (EEST)

Using the database as it was before import (like I said in my previous comment) might not be possible/practical. I've just thought of another idea. If matching people are found in the Rodovid database via MoCh and FaCh could you check to see if these people could be the geditems brothers/sisters. That is, do not return people via MoCh and FaCh who have the same name as a geditems brothers or sisters. Could you possibly impolement this. It would greatly decrease the number of false positives.--Bjwebb 20:16, 2 April 2006 (EEST)

Personal tools