PHP Hidden Gem: similar_text()

Every once in a while you run into one of those unusual problems that you only have to solve once or twice. So you start developing, but quickly wonder what the best solution is. You can write your own, but it pays off to at least search through PHP.net first to see whether PHP already has a solution for it.

Today, I found myself in such a situation. I am working on a hobby project where I aggregate feeds from several different sources. With the blogs I work with right now, it often happens that an author posts the same post to a few different sites. However, because of differing site formats and the occasional quick edit an author makes on one site but not on the other, the article contents are usually not identical strings. So I needed something that would help me figure out whether or not two strings are nearly identical.

First I Googled around a bit, trying to figure out a sane approach for this. I thought of simply piping the texts into diff and parsing the output, but that seemed highly inefficient. I also came across xdiff, a PECL extension which allows for some very cool functionality, but in the end I didn't really want to diff, I just wanted to figure out if strings were similar.

The fact that Google brought me to PHP.net, though, made me consider that PHP might actually have a solution to my problem already. Then I remembered using soundex() before, which is sort of what I need, except that I need it for more than just a single word. I decided to check the soundex() page anyway, to see if there were similar functions listed that would perhaps help me out. And indeed there were: levenshtein() and similar_text() both seemed to do something approaching what I needed.
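To get a feel for how these three functions differ, here is a minimal sketch that runs each of them on a few short strings. The sample words are purely illustrative and have nothing to do with the feed aggregator itself.

```php
<?php
// soundex() encodes pronunciation, so it only suits short, word-like input.
$sameSound = soundex('Robert') === soundex('Rupert');

// levenshtein() counts the single-character edits needed to turn one
// string into the other.
$edits = levenshtein('kitten', 'sitting');

// similar_text() returns the number of matching characters and, optionally,
// writes a similarity percentage into its third (by-reference) argument.
$matching = similar_text('World', 'Word', $percent);

printf("soundex match: %s\n", $sameSound ? 'yes' : 'no');
printf("levenshtein distance: %d\n", $edits);
printf("similar_text: %d matching characters, %.2f%%\n", $matching, $percent);
```

Only similar_text() hands back a percentage directly, which makes it the easiest of the three to compare against a fixed threshold.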

I first checked levenshtein(), but it only accepts strings of at most 255 characters, which is too short when comparing full blog post bodies. So I went with similar_text(), and indeed, that worked fine. My current isDuplicate() method is now:

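public static function isDuplicate($item1, $item2)
{
  // similar_text() writes the similarity percentage (0-100) into $perc
  similar_text($item1->get('text'), $item2->get('text'), $perc);

  // Always return a boolean: true above 75% similarity, false otherwise
  return $perc > 75;
}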

I am still trying to figure out which percentage catches the duplicates without also catching too many posts that are only similar and not actually duplicates. With the above 75%, though, I seem to catch quite a few duplicates so far.
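One way to experiment with the threshold is to run a handful of known pairs through similar_text() at several cut-off values and look at what gets flagged. The sketch below assumes a hypothetical helper called similarityPercent() and uses made-up sample texts in place of real post bodies, so the numbers it prints say nothing about real feeds.

```php
<?php
// Hypothetical helper: returns the similarity percentage of two strings.
function similarityPercent(string $a, string $b): float
{
    similar_text($a, $b, $percent);
    return $percent;
}

// Made-up sample texts standing in for aggregated post bodies.
$original  = 'PHP has a built-in function for comparing two strings.';
$crosspost = 'PHP has a built-in function for comparing 2 strings!';
$unrelated = 'Tomorrow I will be speaking at a conference in Paris.';

// Try a few candidate thresholds and see which pairs would be flagged.
foreach ([50, 75, 90] as $threshold) {
    printf(
        "threshold %d: crosspost=%s unrelated=%s\n",
        $threshold,
        similarityPercent($original, $crosspost) > $threshold ? 'dup' : 'no',
        similarityPercent($original, $unrelated) > $threshold ? 'dup' : 'no'
    );
}
```

Running something like this against a set of posts you have already classified by hand should give a better idea of where the cut-off belongs than guessing.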

So, next time you are in a similar situation, remember, there is an awesome function out there called similar_text(). And perhaps more importantly: Next time you need to implement some very specific functionality that you usually don't need, first try PHP.net (and Google) before writing your own solution.

Comments

Bertrand: I have used levenshtein() in the past, after following the exact same search path. I hope there are a lot of exotic functions like this remaining in PHP ;)
July 10, 2009
Duane Gran: I wonder how this will behave on large data sets? The solution is a gem for sure, but you would need to perform n^2 calls to isDuplicate (unless you marked item2 as dirty in each test) to determine the answer.

You might be able to prune the comparison by seeing if the two strings are reasonably close in length, but it could still result in a lot of comparison calls.

Another approach to this might be with Lucene's "more like this" functionality. It is highly optimized for text retrieval, but there is non-trivial indexing time.

I hope that the above thoughts are helpful. I'm glad I know about this function now.
July 10, 2009
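The length-based pruning suggested above could be sketched roughly as follows. It assumes that similar_text() computes its percentage as twice the number of matching characters divided by the combined length of both strings, so the shorter string's length caps the best possible score. The function names maxPossiblePercent() and isProbableDuplicate() are hypothetical and work on plain strings rather than feed items.

```php
<?php
// Hypothetical helper: upper bound on the percentage similar_text() could
// report for two strings, judged by their lengths alone.
function maxPossiblePercent(string $a, string $b): float
{
    $total = strlen($a) + strlen($b);
    if ($total === 0) {
        return 0.0;
    }

    // At best, every character of the shorter string matches.
    return 2 * min(strlen($a), strlen($b)) * 100 / $total;
}

// Hypothetical variant of isDuplicate() with a cheap length check up front.
function isProbableDuplicate(string $a, string $b, float $threshold = 75.0): bool
{
    // Skip the expensive comparison when even a perfect overlap
    // could not exceed the threshold.
    if (maxPossiblePercent($a, $b) <= $threshold) {
        return false;
    }

    similar_text($a, $b, $percent);

    return $percent > $threshold;
}
```

This does not reduce the number of pairs that need checking, but it makes the obviously mismatched pairs nearly free.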
Jeremy Ashcraft: I use similar_text() extensively in an application used for mapping older text to updated/revised text. It works wonderfully. My threshold is only at 50%, but we only examine a few sentences at most. I remember being so happy when I discovered this function. It literally saved me a week of development time! :D
July 13, 2009

© 2004 - 2010 Stefan Koopmanschap