PHP Hidden Gem: similar_text()
Every once in a while you need to solve a unique problem, the kind you usually only have to solve once or twice. So you start developing, but quickly wonder what the best solution is. You can write your own, but it pays to at least search PHP.net first to see whether PHP already has a solution for it.
Today, I was in such a situation. I am working on a hobby project where I aggregate feeds from several different sources. With the blogs I work with right now, it often happens that an author posts the same post to a few different sites. However, because of site formats, and sometimes also quick edits an author makes on one site but not on the other, the article contents are usually not identical strings. So I needed something that would help me figure out whether or not two strings are nearly identical.
First I Googled around a bit, trying to figure out a sane approach. I considered simply piping the text into diff and parsing the output, but that seemed highly inefficient. I also came across xdiff, a PECL extension that allows for some very cool functionality, but in the end I didn't really want a diff; I just wanted to figure out whether two strings were similar.
The fact that Google brought me to PHP.net, though, made me consider that PHP might already have a solution to my problem. Then I remembered using soundex() before, which is sort of what I needed, except that I needed it for more than a single word. I decided to check the soundex() page anyway, to see whether it listed similar functions that might help me out. And indeed it did: levenshtein() and similar_text() both seemed to do something approaching what I needed.
I first checked levenshtein(), but at the time of writing it only accepts strings of up to 255 characters, which is too short for comparing full blog post bodies. So I went with similar_text(), and indeed, that worked fine. My current isDuplicate() method is now:
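public static function isDuplicate($item1, $item2)
{
    // similar_text() writes the similarity percentage into $perc.
    similar_text($item1->get('text'), $item2->get('text'), $perc);

    // Always return a boolean: true when the texts are more than 75% similar.
    return $perc > 75;
}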
I am still trying to figure out which percentage catches the duplicates without also catching posts that are merely similar, but with the 75% threshold above I seem to catch quite a few duplicates so far.
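One way to experiment with the threshold is to run a few known pairs through similar_text() and see how each candidate cut-off classifies them. The snippet below is only a rough sketch: similarityPercent() is a hypothetical helper, and the sample strings are made up rather than taken from real feeds.

```php
<?php
// Hypothetical helper: returns the percentage that similar_text() reports.
function similarityPercent(string $a, string $b): float
{
    $percent = 0.0;
    similar_text($a, $b, $percent); // third argument is filled by reference
    return (float) $percent;
}

// Made-up sample texts standing in for the bodies of feed items.
$original  = 'PHP has a built-in function for comparing two strings.';
$crosspost = 'PHP has a built-in function for comparing 2 strings!';
$unrelated = 'Tomorrow the weather will be sunny with light wind.';

// Show how each candidate threshold would classify the two pairs.
foreach ([50, 75, 90] as $threshold) {
    printf(
        "threshold %d: crosspost=%s unrelated=%s\n",
        $threshold,
        similarityPercent($original, $crosspost) > $threshold ? 'duplicate' : 'different',
        similarityPercent($original, $unrelated) > $threshold ? 'duplicate' : 'different'
    );
}
```

Running this against a handful of real duplicates and near-misses from your own feeds should give a better feel for the right cut-off than guessing.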
So, next time you are in a similar situation, remember that there is an awesome function out there called similar_text(). And perhaps more importantly: next time you need to implement some very specific functionality that you usually don't need, first try PHP.net (and Google) before writing your own solution.
July 10, 2009 - tags: php, hidden gem, strings
Comments
Bertrand: I have used levenshtein() in the past, following the exact same search approach. I hope there are a lot of exotic functions like this remaining in PHP.

Duane Gran: I wonder how this will behave on large data sets? The solution is a gem for sure, but you would need to perform n^2 calls to isDuplicate (unless you marked item2 as dirty in each test) to determine the answer.
You might be able to prune the comparison by seeing if the two strings are reasonably close in length, but it could still result in a lot of comparison calls.
Another approach to this might be Lucene's "more like this" functionality. It is highly optimized for text retrieval, but there is non-trivial indexing time.
I hope the above thoughts are helpful. I'm glad I know about this function now.
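The length-based pruning suggested in this comment could be sketched roughly as follows. It assumes that the percentage reported by similar_text() is computed as matching characters * 2 / (combined length) * 100, so two strings can never score higher than the shorter string's length allows. Both maxPossiblePercent() and isDuplicateText() are hypothetical names used for illustration only.

```php
<?php
// Upper bound on the percentage similar_text() can report for two strings
// of the given lengths, assuming at most min(len1, len2) characters match.
function maxPossiblePercent(int $len1, int $len2): float
{
    if ($len1 + $len2 === 0) {
        return 0.0;
    }
    return min($len1, $len2) * 2 * 100 / ($len1 + $len2);
}

// Hypothetical variant of the duplicate check that works on plain strings.
function isDuplicateText(string $a, string $b, float $threshold = 75.0): bool
{
    // Cheap check first: skip the expensive comparison when the lengths
    // alone already rule out exceeding the threshold.
    if (maxPossiblePercent(strlen($a), strlen($b)) <= $threshold) {
        return false;
    }

    $percent = 0.0;
    similar_text($a, $b, $percent);
    return $percent > $threshold;
}
```

This does not reduce the number of pairs to consider, but it avoids the costly similar_text() call for pairs whose lengths differ too much to ever count as duplicates.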
Jeremy Ashcraft: I use similar_text() extensively in an application used for mapping older text to updated/revised text. It works wonderfully. My threshold is only at 50%, but we only examine a few sentences at most. I remember being so happy when I discovered this function. It literally saved me a week of development time!