Scour: The social search engine

I just started using Scour, a search engine that lets you vote and comment on the results.

Scour queries the top 3 major search engines, Google, Yahoo and Live, to provide results, so it's like using your preferred search engine with a social twist. You can vote each result up or down and comment on it, and Scour then uses this data (votes, comments) to provide better relevancy.

The problem is that when people search they want results quickly, and once they get them they just leave. So, to encourage users to contribute, Scour rewards them with points that can be converted into money using Visa gift cards.

The idea is that since the major search engines are making billions from search, the user should get something (more than just search results) out of it too.

Once you've signed up to Scour you can start using it for your daily searches just like you did with Google, Yahoo or MSN. They even have a search bar plugin for Internet Explorer and Firefox, and in the FAQs you can find instructions on how to make Firefox use it as the default search engine instead of Google. There is also a toolbar, but apparently it's only for Internet Explorer, or at least only for Windows (it's an .exe).

As you keep searching, voting and commenting you accumulate points. You get 1 point for each search, 2 points for each vote and 3 points for each comment, up to a maximum of 4 points per search, and once you reach 6500 points you get a $25 Visa gift card.

I like Scour both for the idea of improving relevancy through votes and comments and for rewarding its users.

Scour is still in its early days and there are some small problems with it (try searching for 'var/log', or the fact that it only displays 3 pages of results), but I'm sure they will be fixed and the search engine will improve over time.

Of course, the whole idea of better relevancy will only work if more users sign up, use it regularly and contribute.

Where's the XML sitemap?

Someone contacted me through the contact form to ask me where the XML sitemap generated by the XML Sitemaps module for Pligg is.

If he had read my first post about this module I think he would have eventually figured out where it is, but since that first post was written a long time ago, let me answer the question in this post.

I'm answering in a post instead of replying privately because others might run into the same problem, and I hate answering the same question over and over.

The module doesn't generate a single sitemap but a sitemap index (basically just an XML list of sitemaps), and unless you're using the cache the module will generate it every time someone requests the sitemap's URL.
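To give an idea of what gets served at that URL, here is a rough sketch, not the module's actual code, of what generating a sitemap index comes down to (the sitemap URLs below are made up for illustration):

<?php
// Illustrative sketch only, not the actual module code: a sitemap index is
// just an XML document whose <sitemap> entries point to the real sitemaps.
header('Content-Type: text/xml');
$sitemaps = array(
    'http://yourpliggsite.com/module.php?module=xml_sitemaps_show_sitemap&page=1',
    'http://yourpliggsite.com/module.php?module=xml_sitemaps_show_sitemap&page=2',
);
echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
echo '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
foreach ($sitemaps as $url) {
    echo '  <sitemap><loc>' . htmlspecialchars($url) . '</loc></sitemap>' . "\n";
}
echo '</sitemapindex>' . "\n";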

If you're not using friendly URLs for sitemaps then the URL to the sitemap will be:
http://yourpliggsite.com/module.php?module=xml_sitemaps_show_sitemap
If you want to use friendly URLs for the sitemap you will have to configure them as described here.

Last time I checked (when I first created the module), ask.com could not be pinged unless your sitemap URL looked like a static URL and/or ended in .xml, and this is why I created the module with this choice in mind. If you don't care about pinging ask.com, or if ask.com has changed its policy (can anyone check this?), then you don't need friendly URLs for sitemaps.

In the future I would appreciate it if such questions were asked in the comments instead of through private contact. I prefer the comments for questions about my posts or the code in them, because that way others can benefit from my answers or contribute their own.

The contact form is for private matters like consultancy requests, business proposals or other things that don't fit into the comments.

MySQL: counting results

You have a query and you want to display the results on a web page, but because there are so many results you want to paginate the data so the user gets links like "prev page, page 1, page 2, next page, last page", as you can see on a lot of sites these days. This is a common problem a web developer faces; it's not hard to solve, but it is often not solved in the best way.

The pagination concept is based on the fact that you can retrieve just part of the results using a LIMIT clause in the query and display them on a page. This usually makes the query faster and allows the user to navigate easily without crashing their browser or having to scroll through long pages.
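As a rough illustration (the table, columns and page parameter are made up, and it assumes an already open connection using the old mysql_* functions), pagination comes down to computing an offset from the page number and putting it in the LIMIT clause:

<?php
// Minimal pagination sketch; table/column names are illustrative only and an
// open mysql connection is assumed.
$per_page = 10;
$page     = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;
$offset   = ($page - 1) * $per_page;

$result = mysql_query("SELECT id, title FROM stories ORDER BY id DESC LIMIT $offset, $per_page");
while ($row = mysql_fetch_assoc($result)) {
    echo $row['title'] . "<br />\n";
}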

If you want to show the user the total number of results or you want to allow them to skip right to the last page then you need to count the total number of results that the query would return without the LIMIT clause.

How some people do it

I have seen some badly designed software that was just removing the LIMIT from the query, running it and then calling mysql_num_rows() to count the rows. That may be OK if your table has just a few rows and the query returns quickly, but if your table grows to a few thousand rows, or if your query joins several big tables, you're going to get into trouble.

So how can this be done better?

There is no single way that is best in every case, but here is what you can do:

  1. if your query is simple enough not to use a "group by" or "having" clause, you can simply replace all the fields in your select list with "count(*)"; this will be really fast, especially if you have the right indexes on the table(s) in the query
  2. if your query does use "group by" then modify the query to use SQL_CALC_FOUND_ROWS.

Here is an example of the second option, which is more general as it works with any query; I think it's preferable even if it may be slower than count(*).

We have this query:

-- table and column names are illustrative
SELECT age, COUNT(*) AS total FROM users GROUP BY age LIMIT 0, 10;

You would use a query like that to display a list of ages and how many users in your table have each age; you want the list to have 10 results per page, and your table is really big, so it's very likely you will have more than one page to display.

As you can see, this query already has a COUNT and a GROUP BY in it, so you can't just use count(*) to get the total number of results.

If we modify this query like this:

-- same query, with SQL_CALC_FOUND_ROWS added right after SELECT
SELECT SQL_CALC_FOUND_ROWS age, COUNT(*) AS total FROM users GROUP BY age LIMIT 0, 10;

the query will return exactly the same results as the previous one, but now if we run:

SELECT FOUND_ROWS();

we will get the total number of rows that the last query would have returned without the LIMIT clause.

This is a lot faster than running the query without the LIMIT and counting the results with mysql_num_rows(), because MySQL does the counting internally and doesn't have to return the whole result set to the client.

Other ideas to improve performance

Fetch details for a record in separate queries. Let's say you have a query that joins several tables and you want to display details from all those tables in a single row of your list. The joins make your query slow because it has to examine a lot of rows when doing the count. Try to remove as many of those joins as you can, do the count, and then for each row in your list run separate queries to get the other details. This way you will examine only a few rows from the other tables, because you'll run the extra queries only for the results you are currently showing on the page.
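Here is a rough sketch of that pattern (table and column names are made up, and it assumes an open connection using the old mysql_* functions):

<?php
// Sketch only: paginate and count on the lean query, then fetch per-row details.
$result = mysql_query("SELECT SQL_CALC_FOUND_ROWS id, user_id, title
                       FROM stories ORDER BY id DESC LIMIT 0, 10");
$total  = mysql_result(mysql_query("SELECT FOUND_ROWS()"), 0);

while ($row = mysql_fetch_assoc($result)) {
    // One cheap primary-key lookup per displayed row instead of a big join.
    $user = mysql_fetch_assoc(
        mysql_query("SELECT username FROM users WHERE id = " . (int) $row['user_id'])
    );
    echo $row['title'] . ' by ' . $user['username'] . "<br />\n";
}
echo "Total results: $total\n";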

Enable the MySQL slow query log, then watch it to see how long your queries take and how many rows they examine.

Use EXPLAIN to see whether your query is using the right indexes, and create indexes where you think they will improve performance. If EXPLAIN shows that the query will use a temporary table, make sure the temporary table can be held entirely in memory, if you have enough (check the tmp_table_size and max_heap_table_size variables).

Enable the query cache so the server can serve results from the cache instead of doing all the work over and over for data that hasn't changed.
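If you want a quick look at the server variables mentioned above from code instead of digging through my.cnf, something like this works (a small sketch, assuming an open connection with the old mysql_* functions):

<?php
// Print a few of the variables discussed above; assumes an open mysql connection.
foreach (array('tmp_table_size', 'max_heap_table_size', 'query_cache%', 'long_query_time') as $pattern) {
    $result = mysql_query("SHOW VARIABLES LIKE '$pattern'");
    while ($row = mysql_fetch_assoc($result)) {
        echo $row['Variable_name'] . ' = ' . $row['Value'] . "\n";
    }
}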

There are a few other techniques described on the official MySQL documentation site, but the ones presented here have helped me a lot when working with lists and counting results.

If you have other tips I'll be happy to see them in the comments.

MySQL and SSL

I have been setting up a few MySQL servers with SSL support for replication.

I used the script provided in the official MySQL documentation for creating the SSL certificates, because I needed to do it on more than one server and it made more sense to use the script than to create each certificate by hand.

If you just follow the documentation and create the certificates one by one you will be fine, but if you use the script your CA certificate will expire after 30 days, and a month later you'll be banging your head trying to figure out why SSL connections suddenly don't work anymore.
You know your certificates should be valid for a year or more, so why did it stop working? Running a command like this on the CA certificate file:

$ openssl x509 -noout -dates -in cacert.pem

reveals it ...

notBefore=Apr 17 12:20:10 2008 GMT
notAfter=May 17 12:20:10 2008 GMT

Ah .... there you go ... just 30 days for the cacert file ... insane...
The problem was actually reported by someone else in the comments on that documentation page, but I was in a hurry (yeah, right) and didn't read that far down.
Note to self: always read the comments on those pages
So if you use that script make sure you modify it to make the CA valid for more than 30 days.
This line:

# the CA is created by a line roughly like this; with no -days option, openssl defaults to 30 days
openssl req -new -x509 -keyout cakey.pem -out cacert.pem -config openssl.cnf

Should be something like:

# add -days so the CA certificate stays valid for 365 days
openssl req -new -x509 -days 365 -keyout cakey.pem -out cacert.pem -config openssl.cnf

That is if you want the CA cert to be valid for a year.

Problem transferring a mysql database with rsync

A little more than a year ago I wrote a post presenting three different methods to transfer a MySQL database. The third method suggested in that post was copying the MySQL database files directly from one server (or location) to another. This involves locking the tables with a read lock, or even shutting down MySQL before the actual copy.

For my work I usually have a main system and a development system, and each system has its own database, so from time to time I need to copy the main database over the dev database. Because the database is very big (a lot of tables, some of them large) and not every table changes, I like using rsync to transfer only the differences, especially when transferring to remote locations, because it saves bandwidth and is faster.

In the case where I found this problem I actually use the same MySQL server to hold both the main and the dev database, but I still use rsync for the transfer simply because it's still faster than cp.

So here is what I do: I stop the MySQL server, run rsync -av /var/lib/mysql/main_db/ /var/lib/mysql/dev_db/, the differences are transferred, I start the MySQL server, look at dev_db and boom! some of the tables are corrupt. The main database was fully functional before shutting MySQL down, no tables were corrupt or needed a repair, and they still don't after starting MySQL back up.

Maybe even more interesting is that it's very likely no one was using either of the databases before MySQL was shut down.

It seems that after the transfer I just have to run "repair table table_name" for some of the tables in dev_db, and the repair statement returns a message saying that the number of rows has changed. Since I don't want to go over each table and check whether it actually needs a repair, I chose to just repair all of them, and I wrote a script for that. So I just run the script below after each transfer, to make sure everything is OK:

#!/bin/bash
# Reconstructed for illustration; set the connection parameters and database
# name to match your setup before running it.
DB_USER="root"
DB_PASS="password"
DB_NAME="dev_db"

for TABLE in $(mysql -u"$DB_USER" -p"$DB_PASS" -N -e "SHOW TABLES" "$DB_NAME"); do
    echo "Repairing $TABLE ..."
    mysql -u"$DB_USER" -p"$DB_PASS" -e "REPAIR TABLE \`$TABLE\`" "$DB_NAME"
done

The script also shows you the messages returned by the repair statements, so you can see whether there really was a problem. Make sure you set the correct db connection parameters and database name before you try it.

When I observed this problem I was using rsync version 2.6.9 and MySQL 5.0.44 on Gentoo x86_64. The problem doesn't come up on every transfer and not on all tables. Could this be a problem with rsync or with MySQL?

I'm thinking that if this is a problem with rsync then... wow... that is a big problem. I rely on rsync for transferring a lot of stuff; what if it didn't transfer something correctly, and who knows what else it didn't transfer?

If it's a MySQL problem, maybe MySQL doesn't update the row counts on the tables correctly before shutting down, so the files were actually transferred correctly but not stored correctly by MySQL. If the row count is the only problem here then it's not such a big deal. I'm hoping this is the case...

I wonder if this problem would show up when using something like cp for the transfer. If it did, it would clearly be a MySQL problem, but I cannot test with cp at the moment as my db is very large, which means I would have to keep the tables locked for too long, and that is just not an option on a system that was just "promoted" to production.

I'll come back with another post once I find out more about this problem, but until then make sure to check your tables after the transfer if you are using something like rsync to copy the files directly.

Xml Sitemaps pligg module v0.9

This is a quick release, just like the previous one, and it fixes a single thing.

All previous versions had a problem where the URLs were not urlencoded, so URLs that contained special characters, like accented letters or diacritics, were invalid, and Google would of course show an error for those sitemaps.

Version 0.9 escapes those URLs, so those of you with such special characters in your URLs can finally enjoy this module.
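For the curious, escaping here means something along these lines (a sketch only, not the module's actual code; the example URL is made up):

<?php
// Illustrative sketch, not the module's code: percent-encode each path segment
// of a link and escape XML special characters before writing it into <loc>.
function sitemap_safe_url($url) {
    $parts = parse_url($url);
    $path  = implode('/', array_map('rawurlencode', explode('/', $parts['path'])));
    $safe  = $parts['scheme'] . '://' . $parts['host'] . $path;
    return htmlspecialchars($safe, ENT_QUOTES); // & becomes &amp; and so on
}
echo '<loc>' . sitemap_safe_url('http://yourpliggsite.com/story/idée-café') . '</loc>';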

It seems the module is getting closer to version 1.0. If you have any suggestions about features you would like to see in 1.0, or you found some other bug that needs to be fixed, don't hesitate to let me know.

Download v0.9 from the module's page

Xml Sitemaps pligg module v0.8

It seems my last version of the Xml Sitemaps module for Pligg didn't really fix the date format problem with the generated sitemaps.

Back when this module was created, Google had less strict rules about the date format in the lastmod field. My module generated a date and time string in the format YYYY-mm-ddTHH:MM:SS, and that used to be accepted, but now it's only valid if it also contains the timezone offset in the format +/-HH:MM, or if the string doesn't contain the time at all.
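In PHP 5 terms, producing the accepted format looks something like this (just an illustration, not necessarily how the module builds the string):

<?php
// 'c' (or the equivalent 'Y-m-d\TH:i:sP') produces e.g. 2008-05-17T12:20:10+02:00,
// which includes the +/-HH:MM timezone offset that Google now requires in <lastmod>.
echo date('c', time()) . "\n";
echo date('Y-m-d\TH:i:sP', time()) . "\n";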

So here is another update to the module that adds the timezone offset, so the sitemap is considered valid by Google.

Download here

WordPress 2.5.1

I have just upgraded to WordPress 2.5.1. My upgrade routine worked without any problems.

The new release seems to bring fixes for a few annoying issues and a security fix, as well as some performance enhancements.

Unfortunately the automatic plugin updater still fails at times and doesn't save the FTP password, and the media uploader still renames .tar.gz files in a stupid way.

vim arrows in MacOSX

I know vim gurus would criticize me for using arrows in vim's insert mode but it's really hard to give them up.

I have this problem when I connect from my Linux box to a Mac OS X or FreeBSD box over SSH, and I find it one of the most annoying things when using vim: when you are in insert mode and hit one of the arrow keys to move around, instead of the expected movement vim just prints A, B, C or D on a new line. This makes vim practically useless.

So you either have to be very careful and always exit insert mode before moving, or fix the keys.

It's hard to always remember to get out of insert mode, and it's an extra operation I find pointless, not to mention you will probably have to enter insert mode again a few seconds later.

So here's the fix for the arrow keys. Edit vimrc, either the global vimrc (I'm using vim from MacPorts so mine is /opt/local/share/vim/vimrc) or ~/.vimrc, like this:

$ vim ~/.vimrc
set t_ku= (now type Ctrl-V and press cursor up)
set t_kd= (now type Ctrl-V and press cursor down)
set t_kr= (now type Ctrl-V and press cursor right)
set t_kl= (now type Ctrl-V and press cursor left)

This solution was taken from the vim tips wiki. I posted it here so I don't have to look for it again if I need it. It's the second time I've been hit by this problem, and every time I had to search through a few pages of solutions that didn't work for me.

xml sitemap for pligg v0.7

This is a quick fix for a bug introduced in a previous version when I tried to make the module compatible with PHP 4's date().

This bug may have made your sitemap invalid because the lastmod date contained the timezone between the date and time.

This version also brings a new feature that could be useful for larger sites.

The cache

I noticed that on a site with over 20000 links it can take a long time to generate the sitemap index and sitemap files, and it puts significant load on the server if Google, Yahoo or Ask try to access the sitemap every few minutes or hours (depending on your site's posting/pinging frequency), so I thought it would be nice to have some kind of cache.

The module saves the generated sitemap index and sitemaps in Pligg's cache directory (which means the directory needs to be writable by the user running the webserver), and if the cache has not expired yet the module serves the sitemaps from the cache instead of regenerating them every time.

You can set the expire time (TTL), and the module will regenerate the sitemap if TTL seconds have passed since the last time it was modified.

You don't need to set up any cron job to generate the sitemap files. The module only generates a sitemap when someone or something (Google, Yahoo, Ask) tries to access it.

Another change related to the caching system is that the site will only ping the services when the cache has expired, so make sure you set your cache's TTL accordingly.
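The logic behind the cache is a simple time-to-live check; stripped down, it looks something like this (the file name and the generate/ping functions are placeholders, not the module's actual code):

<?php
// Illustrative TTL cache check, not the actual module code.
$cache_file = '/path/to/pligg/cache/sitemap_index.xml'; // placeholder path
$ttl        = 3600;                                     // "Cache TTL" in seconds

if (file_exists($cache_file) && (time() - filemtime($cache_file)) < $ttl) {
    // Cache is still fresh: serve it and skip pinging.
    readfile($cache_file);
} else {
    // Cache expired: regenerate, store, serve, and ping the search engines.
    $xml = generate_sitemap_index();   // placeholder for the module's generator
    file_put_contents($cache_file, $xml);
    echo $xml;
    ping_search_engines();             // placeholder for the ping step
}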

Upgrade:

To upgrade to this version just download and unzip it in your module's directory, then go to Pligg admin -> module management, disable and uninstall the module, and then reinstall it so that you can see the new options ("Use Cache" and "Cache TTL").

Download from module's own page