PHP Hidden Gem: similar_text()

Every once in a while you run into one of those unusual problems that you only have to solve once or twice. You start developing, but quickly wonder what the best solution is. You can write your own, but it pays off to at least search through PHP.net to see whether PHP already has a solution for it.

Today I was in such a situation. I am working on a hobby project where I aggregate feeds from several different sources. With the blogs I work with right now, it often happens that an author posts the same post to a few different sites. However, because of differing site formats, and sometimes quick edits an author makes on one site but not on the other, the article contents are usually not identical strings. So I needed something that would help me figure out whether or not two strings are nearly identical.

First I Googled around a bit, trying to figure out a sane approach. I was thinking of simply piping the text into diff and parsing the output, but that seemed highly inefficient. I also came across xdiff, a PECL extension that allows for some very cool functionality, but in the end I didn't really want a diff; I just wanted to figure out whether the strings were similar.

The fact that Google brought me to PHP.net, though, made me consider that PHP might already have a solution to my problem. Then I remembered using soundex() before, which is sort of what I need, except that I need it for more than just a single word. I decided to check the soundex() page anyway, to see if it listed similar functions that might help me out. And indeed it did: levenshtein() and similar_text() both seemed to do something approaching what I needed.
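To get a feel for how these three functions differ, here is a minimal sketch on short strings. The sample words are made up for illustration and have nothing to do with the feed project.

```php
<?php
// Illustrative sample words only; not taken from the feed aggregator.

// soundex() produces a phonetic key for a word, so words that sound
// alike end up with the same key.
var_dump(soundex('Robert') === soundex('Rupert')); // bool(true)

// levenshtein() returns the number of single-character edits needed
// to turn one string into the other.
$distance = levenshtein('kitten', 'sitting');
echo $distance, PHP_EOL; // 3

// similar_text() returns the number of matching characters and can
// report a percentage through its optional third by-reference argument.
$matching = similar_text('World', 'Word', $percent);
echo $matching, ' matching characters, ', round($percent, 1), '% similar', PHP_EOL;
```

Only similar_text() hands back a percentage directly, which is convenient when all you need is a "how alike are these two texts" number.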

I first checked levenshtein(), but it only accepts strings of at most 255 characters, which is too short for comparing full blog post bodies. So I went with similar_text(), and indeed, that worked fine. My current isDuplicate() method is now:

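public static function isDuplicate($item1, $item2)
{
  // similar_text() stores the similarity percentage in $perc (passed by reference).
  similar_text($item1->get('text'), $item2->get('text'), $perc);

  // Treat anything above 75% similarity as a duplicate.
  return $perc > 75;
}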

I am still tuning the threshold: it should catch the duplicates without also flagging posts that are merely similar. So far, 75% seems to catch quite a few duplicates.
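One way to experiment with the threshold is to print the percentage for a few known pairs and see where duplicates and non-duplicates fall. The sketch below assumes PHP 7 or later for the type declarations; the helper name isNearDuplicate(), its default threshold, and the sample sentences are all hypothetical and are not part of the project.

```php
<?php
// Hypothetical helper, not the project's actual method: the name and
// the default threshold are assumptions for illustration.
function isNearDuplicate(string $a, string $b, float $threshold = 75.0): bool
{
    // similar_text() writes the similarity percentage into $percent.
    similar_text($a, $b, $percent);

    return $percent > $threshold;
}

// Made-up sample texts for experimenting with the threshold.
$original = 'PHP has a hidden gem called similar_text().';
$edited   = 'PHP has a hidden gem called similar_text()!';
$other    = 'Completely unrelated sentence about violins.';

// Print the raw percentage for each candidate to see how far apart
// a near-duplicate and an unrelated text land.
foreach ([$edited, $other] as $candidate) {
    similar_text($original, $candidate, $percent);
    printf("%5.1f%%  %s\n", $percent, $candidate);
}

var_dump(isNearDuplicate($original, $edited));
```

Two things from the PHP manual are worth keeping in mind while tuning: swapping the two arguments may yield a different result, and the algorithm's complexity is listed as O(N**3), so comparing many long post bodies against each other can get slow.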

So, next time you are in a similar situation, remember, there is an awesome function out there called similar_text(). And perhaps more importantly: Next time you need to implement some very specific functionality that you usually don't need, first try PHP.net (and Google) before writing your own solution.

© 2004 - 2013 Stefan Koopmanschap