PHP Hidden Gem: similar_text()

Every once in a while you need to solve a very specific problem, the kind you usually only have to solve once or twice. So you start developing, but quickly wonder what the best solution is. You can write your own, but it pays off to at least search through PHP.net a bit to see whether PHP already has a solution for it.

Today, I was in such a situation. I am working on a hobby project where I aggregate feeds from several different sources. With the blogs I work with right now, it often happens that an author posts the same post to a few different sites. However, because of differing site formats, and sometimes also quick edits an author makes on one site but not on the other, the article contents are usually not identical strings. So I needed something that would help me figure out whether or not two strings are nearly identical.

First I Googled around a bit, trying to figure out a sane approach for this. I was thinking of simply piping the text into diff and parsing the response, but that seemed highly inefficient. I also came across xdiff, a PECL extension which allows for some very cool functionality, but in the end I didn't really want to diff, I just wanted to figure out if strings were similar.

The fact that Google brought me to PHP.net, though, made me consider that PHP might actually have a solution to my problem already. Then I remembered using soundex() before, which is sort of what I need, except that I need it for more than just a single word. I decided to check the soundex() page anyway, to see if there were similar functions listed that would perhaps help me out. And indeed there were: levenshtein() and similar_text() both seemed to do something approaching what I needed.

I first checked levenshtein(), but it only accepts strings of at most 255 characters, which is too short for comparing full blog post bodies. So I went with similar_text(), and indeed, that worked fine. My current isDuplicate() method is now:

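public static function isDuplicate($item1, $item2)
{
  // similar_text() writes the similarity percentage into $perc (passed by reference)
  similar_text($item1->get('text'), $item2->get('text'), $perc);

  // Treat the items as duplicates when they are more than 75% similar;
  // always return a boolean instead of falling through without a value
  return $perc > 75;
}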

I am still trying to figure out which percentage catches the duplicates without also catching too many posts that are only similar but not actually duplicates, but with the above threshold of 75% I seem to catch quite a few duplicates so far.
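To get a feel for the percentages while tuning such a threshold, it can help to run similar_text() on a pair of strings directly. The snippet below is only a minimal sketch: the two sentences are made up for illustration and are not taken from the feeds in my project.

```php
<?php
// Two made-up versions of the same sentence, as they might appear on two sites.
$a = 'PHP has a hidden gem called similar_text() for comparing strings.';
$b = 'PHP has a hidden gem named similar_text() for comparing two strings.';

// similar_text() returns the number of matching characters and writes
// the similarity percentage into the third argument (passed by reference).
$matching = similar_text($a, $b, $percent);

printf("%d matching characters, %.1f%% similar\n", $matching, $percent);

// Comparing a string with itself gives 100%.
similar_text($a, $a, $same);
printf("Identical strings: %.1f%%\n", $same);
```

Printing the percentage for a handful of known duplicates and known non-duplicates is probably the quickest way to see where a sensible cut-off lies for your own data.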

So, next time you are in a similar situation, remember, there is an awesome function out there called similar_text(). And perhaps more importantly: Next time you need to implement some very specific functionality that you usually don't need, first try PHP.net (and Google) before writing your own solution.

Comments

Bertrand: I have used levenshtein() in the past, after following the exact same search path. I hope there are a lot of exotic functions like this left in PHP ;)
July 10, 2009
Duane Gran: I wonder how this will behave on large data sets? The solution is a gem for sure, but you would need to perform n^2 calls to isDuplicate (unless you marked item2 as dirty in each test) to determine the answer.

You might be able to prune the comparison by seeing if the two strings are reasonably close in length, but it could still result in a lot of comparison calls.

Another approach to this might be Lucene's "more like this" functionality. It is highly optimized for text retrieval, but there is non-trivial indexing time.

I hope the above thoughts are helpful. I'm glad I know about this function now.
July 10, 2009
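The length-based pruning suggested above can be sketched as follows. This is a rough illustration, not code from the post: the function names canExceedThreshold() and isNearDuplicate() are hypothetical, and the sketch assumes that similar_text() computes its percentage as twice the number of matching characters divided by the combined length of both strings, so the shorter string's length puts an upper bound on the score.

```php
<?php
/**
 * Hypothetical helper: can these two strings possibly score above
 * $threshold percent in similar_text()?
 *
 * Assumption: percent = 2 * matching / (len1 + len2) * 100, and the number
 * of matching characters can never exceed the length of the shorter string.
 */
function canExceedThreshold(string $a, string $b, float $threshold = 75.0): bool
{
    $total = strlen($a) + strlen($b);
    if ($total === 0) {
        return false; // two empty strings: nothing to compare
    }

    // Best possible score if every character of the shorter string matched.
    $upperBound = 2 * min(strlen($a), strlen($b)) / $total * 100;

    return $upperBound > $threshold;
}

/**
 * Hypothetical wrapper: only run the expensive comparison when the
 * lengths are close enough for the threshold to be reachable.
 */
function isNearDuplicate(string $a, string $b, float $threshold = 75.0): bool
{
    if (!canExceedThreshold($a, $b, $threshold)) {
        return false; // lengths differ too much, skip similar_text()
    }

    similar_text($a, $b, $percent);

    return $percent > $threshold;
}
```

This does not reduce the number of pairs that have to be considered, but the length check is cheap, so pairs that cannot reach the threshold never hit similar_text() at all.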
Jeremy Ashcraft: I use similar_text() extensively in an application used for mapping older text to updated/revised text. It works wonderfully. My threshold is only at 50%, but we only examine a few sentences at most. I remember being so happy when I discovered this function. It literally saved me a week of development time! :D
July 13, 2009
© 2004 - 2012 Stefan Koopmanschap + Powered by Symfony, photos powered by Flickr, links powered by Delicious, Shanghai smilies by Iconbuffet. Feeds: rss / atom. Left on the Web v4.4.0.1