PHP Hidden Gem: similar_text()

Every once in a while you run into a problem that you only have to solve once or twice. You start developing, but quickly wonder what the best solution is. You could write your own, but it pays off to at least search PHP.net first to see whether PHP already has a solution for it.

Today I found myself in such a situation. I am working on a hobby project where I aggregate feeds from several different sources. With the blogs I work with right now, it often happens that an author posts the same post to a few different sites. However, because of site formats, and sometimes also quick edits an author makes on one site but not on the other, the article contents are usually not identical strings. So I needed something that would help me figure out whether two strings are nearly identical.

First I Googled around a bit, trying to figure out a sane approach for this. I was thinking of simply piping the texts into diff and parsing the output, but that seemed highly inefficient. I also came across xdiff, a PECL extension which allows for some very cool functionality, but in the end I didn't really want a diff; I just wanted to figure out whether the strings were similar.

The fact that Google brought me to PHP.net, though, made me consider that PHP might already have a solution to my problem. Then I remembered using soundex() before, which is sort of what I need, except that I need it for more than just a single word. I decided to check the soundex() page anyway, to see if there were similar functions listed that would perhaps help me out. And indeed there were: levenshtein() and similar_text() both seemed to do something approaching what I needed.
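To get a feel for how these three functions differ, here is a small sketch on short example strings. The sample words are my own choice for illustration and are not taken from the feed project.

```php
<?php
// soundex() reduces a single word to a key based on how it sounds,
// so similar-sounding words end up with the same key.
$sameSound = soundex('Robert') === soundex('Rupert');

// levenshtein() counts the single-character edits (insert, replace,
// delete) needed to turn one string into the other. Lower is closer.
$distance = levenshtein('kitten', 'sitting');

// similar_text() returns the number of matching characters and, via the
// optional third argument (passed by reference), a percentage from 0 to 100.
$matching = similar_text('World', 'Word', $percent);

var_dump($sameSound);
printf("distance=%d matching=%d percent=%.2f\n", $distance, $matching, $percent);
```

So soundex() works per word, levenshtein() gives a distance, and similar_text() gives a percentage that is easy to compare against a threshold.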

I first checked levenshtein(). However, it only accepts strings of up to 255 characters, which is too short for comparing full blog post bodies. So I went with similar_text(), and indeed, that worked fine. My current isDuplicate() method is now:

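public static function isDuplicate($item1, $item2)
{
  // similar_text() writes the similarity percentage (0-100) into $perc.
  similar_text($item1->get('text'), $item2->get('text'), $perc);

  // Anything above 75% counts as a duplicate; always return a boolean.
  return $perc > 75;
}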

I am still trying to figure out which percentage catches the duplicates without also catching too many posts that are only similar but not actually duplicates. With the 75% threshold above, I seem to catch quite a few duplicates so far.
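One way to experiment with the threshold is to make it a parameter and run a few known pairs through it. This is only a sketch: isNearDuplicate() is a hypothetical helper that works on plain strings rather than on my item objects, and the sample texts are made up.

```php
<?php
// Hypothetical helper (not the isDuplicate() method from the project):
// compares two plain strings against a tunable threshold.
function isNearDuplicate(string $a, string $b, float $threshold = 75.0): bool
{
    // $percent is filled in by similar_text() with a value from 0 to 100.
    similar_text($a, $b, $percent);

    return $percent > $threshold;
}

// Made-up sample texts, for illustration only.
$original = 'PHP has a function called similar_text() for comparing strings.';
$edited   = 'PHP has a function called similar_text() for comparing two strings.';
$other    = 'An unrelated post about conference travel.';

// Print how each pair is classified at a few candidate thresholds.
foreach ([50.0, 75.0, 90.0] as $threshold) {
    printf(
        "threshold %.0f: edited=%s other=%s\n",
        $threshold,
        var_export(isNearDuplicate($original, $edited, $threshold), true),
        var_export(isNearDuplicate($original, $other, $threshold), true)
    );
}
```

Running real pairs of known duplicates and known non-duplicates through a loop like this should show fairly quickly where the cut-off belongs. Keep in mind that the PHP manual describes similar_text() as O(N**3) in the length of the longest string, so long post bodies can get slow.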

So, next time you are in a similar situation, remember, there is an awesome function out there called similar_text(). And perhaps more importantly: Next time you need to implement some very specific functionality that you usually don't need, first try PHP.net (and Google) before writing your own solution.

Comments

Bertrand: I have used levenshtein() in the past, after following the exact same search pattern. I hope there are a lot of exotic functions like this left in PHP ;)
July 10, 2009
Duane Gran: I wonder how this will behave on large data sets? The solution is a gem for sure, but you would need to perform n^2 calls to isDuplicate (unless you marked item2 as dirty in each test) to determine the answer.

You might be able to prune the comparison by seeing if the two strings are reasonably close in length, but it could still result in a lot of comparison calls.

Another approach to this might be Lucene's "more like this" functionality. It is highly optimized for text retrieval, but there is a non-trivial indexing time.

I hope the above thoughts are helpful. I'm glad I know about this function now.
July 10, 2009
Jeremy Ashcraft: I use similar_text() extensively in an application used for mapping older text to updated/revised text. It works wonderfully. My threshold is only at 50%, but we only examine a few sentences at most. I remember being so happy when I discovered this function. It literally saved me a week of development time! :D
July 13, 2009

© 2004 - 2010 Stefan Koopmanschap