Pompos is a document analysis tool used to index and classify the World Wide Web. This kind of program is also known as a web crawler.
The goal of Pompos is to collect as many documents as possible for the dir.com search engine. On this page you will find
the most frequently asked questions about Pompos.
At what speed does Pompos retrieve my site's pages?
To avoid disrupting the availability of the sites it visits, Pompos is configured to wait between 1 and 50 seconds between requests for pages of the same site. Nevertheless, given the nature of the Internet, the unavailability of part of the network may slow down the frequency of the visits. If you believe Pompos is significantly affecting your web site, you can ask us to slow down the crawling speed by using the form at the end of this page. Note, however, that we already use the time your server takes to answer our requests as an indicator of its load.
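As an illustration only, here is a minimal Python sketch of how a crawler could turn a measured response time into a delay between 1 and 50 seconds. The function name next_delay, the load_factor parameter and the multiplication rule are assumptions made for this example; the actual rule used by Pompos is not described on this page.

```python
def next_delay(response_time_s, min_delay=1.0, max_delay=50.0, load_factor=10.0):
    """Return the number of seconds to wait before the next request to a site.

    Hypothetical rule: wait load_factor times the last response time,
    clamped to the [min_delay, max_delay] range (1 to 50 seconds here).
    """
    if response_time_s < 0:
        raise ValueError("response time cannot be negative")
    return max(min_delay, min(max_delay, load_factor * response_time_s))


# A fast server is revisited after the minimum delay, a slow one after the maximum.
fast = next_delay(0.05)
slow = next_delay(30.0)
```

With such a rule, a server that answers slowly is automatically visited less often, which is the idea behind using response time as a load indicator.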
How do I ask Pompos not to retrieve certain pages of my site?
The robots.txt file is a standard document that specifies which
parts of your web site Pompos may visit, if any. The robots.txt
syntax is defined by the Robot Exclusion Standard.
If you wish to treat Pompos differently from other robots, you can
define rules under a User-agent: line starting with "Pompos". If no such rule is defined, Pompos will obey the User-agent: * directives.
robots.txt example: in the following example,
all robots are excluded from the
/stats/, /cgi-bin/ and /img/ directories.
User-agent:*
Disallow:/stats/
Disallow:/cgi-bin/
Disallow:/img/
Another example, this time defining specific rules for Pompos that cover the /stats/, /cgi-bin/, /img/
and /tmp/ directories.
User-agent:pompos
Disallow:/stats/
Disallow:/cgi-bin/
Disallow:/img/
Disallow:/tmp/
User-agent:*
Disallow:/stats/
Disallow:/cgi-bin/
Disallow:/img/
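If you want to check how rules like these are interpreted before publishing them, a minimal sketch using Python's standard urllib.robotparser module is shown below. It reproduces the second example and tests a few paths. The file names and the robot name "SomeOtherBot" are made up for the illustration, and Pompos itself may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The second example above: specific rules for Pompos, general rules for the rest.
ROBOTS_TXT = """\
User-agent: pompos
Disallow: /stats/
Disallow: /cgi-bin/
Disallow: /img/
Disallow: /tmp/

User-agent: *
Disallow: /stats/
Disallow: /cgi-bin/
Disallow: /img/
"""


def build_parser(text):
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# /tmp/ is closed to Pompos only; other robots fall back to the "*" rules.
pompos_tmp = parser.can_fetch("Pompos", "/tmp/report.html")
other_tmp = parser.can_fetch("SomeOtherBot", "/tmp/report.html")
pompos_home = parser.can_fetch("Pompos", "/index.html")
```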
Why does Pompos keep on asking for a robots.txt file on my site?
robots.txt is a standard document that allows or
forbids robots to retrieve pages of a web site. If you want to learn
how to write your own robots.txt file, please check
the Robot Exclusion Standard.
If you just want to avoid seeing errors in your log files regarding this
file you can put an empty file named robots.txt at the root
of your site.
Why does Pompos try to retrieve nonexistent pages on my site, or on a nonexistent domain?
The World Wide Web contains many "broken" links and links to sites that no longer exist. When a site contains an incorrect link to your site, visitors who follow it cannot reach the document. In the same way, Pompos may try to access a document through an old or incorrect link. This explains why you may see an error in your logs when Pompos follows such a link. These access failures are usually reported as 404 errors in your server logs.
Why is Pompos retrieving pages on our private site?
It's sometimes impossible to keep a site "secret" even if you do
not publish a link to it. There are many reasons for this:
- as soon as a visitor to this "secret" site follows a link from it to another site, the address of the "secret" site appears as the referer in the visited site's logs (it is transmitted by the visitor's own web browser). These logs
are sometimes made public through statistics pages.
- some domain lists are public such as newly registered domains (depending on
the registrar). In a similar way, sometimes companies hosting web sites
maintain a public list of the sites they host.
- you may not know of all the links pointing to your site. You can use the link:
syntax in dir.com to find these links to your site, for example
link:www.free.fr or link:www.mysite.com/page.html (note that not all pages
are necessarily indexed, even if the links are followed).
Apache users should consider using .htaccess files to protect their data from being
accessed by unauthorized users or robots, see http://apache-server.com/tutorials/ATusing-htaccess.html
Why does Pompos not obey my robots.txt rules?
Every time Pompos visits your site, it retrieves the robots.txt
file first in order to obey your directives. This means that changes to the
robots.txt file will not be taken into account until the next
visit of your web site by Pompos. Please check the syntax of your
robots.txt file against http://www.robotstxt.org/wc/exclusion.html#robotstxt.
Most problems come from a misplacement of this file on the site. It
must be placed at the root of your web site; it will have no effect in any
subdirectory. If you have followed all these rules and are still
experiencing problems, you can contact us by using the form at the end of this
page.
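To make the placement rule concrete, here is a small Python sketch that derives the address where a robot would look for robots.txt from the address of any page. The function name robots_url is made up for this example, and www.mysite.com is only a placeholder host.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address at the root of the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop the query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_url("http://www.mysite.com/docs/page.html?x=1")
# location is "http://www.mysite.com/robots.txt"
```

A file stored at http://www.mysite.com/docs/robots.txt would therefore never be consulted.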
Why can I see different host names with the same Pompos signature on my site?
Pompos collects millions of pages, which requires many computers. These computers have been assigned different IP addresses (Internet addresses). This means that more than one robot can crawl your web site at a time.
What kind of links does Pompos follow when visiting a site?
Our crawler follows all the links found in the HREF attributes of a document, whatever the extension (.html, .htm, .php, .asp and so on).
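For readers who want to see which links a page exposes this way, the following Python sketch collects every HREF value from an HTML document using the standard html.parser module. It only illustrates the idea; the class and function names are made up for this example and this is not the code used by Pompos.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the value of every href attribute, whatever the tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases attribute names, so HREF and href both match.
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the address of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute addresses of all links found in html."""
    collector = HrefCollector(base_url)
    collector.feed(html)
    return collector.links


links = extract_links(
    '<a HREF="page.php">one</a> <a href="/a.asp">two</a>',
    "http://www.mysite.com/dir/index.html",
)
```

Note that the extension of the target plays no role in the sketch, in line with the answer above.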
Pompos understands the revisit-after directive. This directive allows you to specify the interval between two crawls,
thus saving resources and giving greater importance to recent pages so that they are quickly brought up to date on
dir.com.
You can modify this interval either by
1) using a META tag in the page:
<meta name="revisit-after" content="10 days">
2) or adding the following to the robots.txt file:
User-agent: Pompos
revisit-after 2 months: /archives*
revisit-after 100 mins: /
revisit-after 7: /bulletin_hebdomadaire*
The default unit is the day. The second option allows you to put the robot's name in case you want a different behaviour for the various search engines crawling your site.
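The interval values shown above ("10 days", "2 months", "100 mins", "7") can be read as a number followed by an optional unit. The Python sketch below converts such a value into a duration. It is an assumption-laden illustration: only the units appearing on this page are handled, a month is counted as 30 days, and the name parse_revisit_after is made up for this example. The exact grammar accepted by Pompos is not specified here.

```python
import re
from datetime import timedelta

# Only the units shown in the FAQ examples; a 30-day month is an assumption.
_UNITS = {
    "min": timedelta(minutes=1),
    "day": timedelta(days=1),
    "month": timedelta(days=30),
}


def parse_revisit_after(value):
    """Convert a value such as "10 days" or "7" into a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", value)
    if match is None:
        raise ValueError(f"unrecognized revisit-after value: {value!r}")
    amount = int(match.group(1))
    # Drop a plural "s"; with no unit at all, fall back to days.
    unit = match.group(2).lower().rstrip("s") or "day"
    if unit not in _UNITS:
        raise ValueError(f"unsupported unit in: {value!r}")
    return amount * _UNITS[unit]


weekly = parse_revisit_after("7")
archives = parse_revisit_after("2 months")
```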
I can't find an answer to my question on this page, what can I do?
You can ask your questions regarding Pompos by filling in the following form: