Wednesday, June 24, 2015

Most users rely on the various available search engines to find the pieces of information they require. But how is this information provided by search engines? Where have they collected this information from?
Basically, most of these search engines maintain their own databases of information. These databases index the sites available on the web and ultimately maintain detailed page information for each available site. Search engines do this background work by using robots to collect information and maintain their databases. They catalog the gathered information and then present it publicly, or at times for private use. In this article we will discuss those entities which loiter in the global internet environment: the web crawlers which move around in netspace. We will learn:
· What they are all about and what purpose they serve
· The pros and cons of using these entities
· How we can keep our pages away from crawlers
· The differences between common crawlers and robots
We will divide the discussion into the following two sections:
I. Search Engine Spider : Robots.txt.
II. Search Engine Robots : Meta-tags Explained.
I. Search Engine Spider : Robots.txt
What is a robots.txt file?
A web robot is a program or piece of search engine software that visits sites regularly and automatically, crawling through the web’s hypertext structure by fetching a document and recursively retrieving all the documents it references.
Sometimes site owners do not want all of their pages to be crawled by web robots. For this reason they can exclude some of their pages from being crawled. Most robots abide by the ‘Robots Exclusion Standard’, a set of constraints that restricts robot behavior.
The ‘Robots Exclusion Standard’ is a protocol used by site administrators to control the movement of robots. When a search engine robot comes to a site, it first looks for a file named robots.txt in the root of the domain (http://www.anydomain.
com/robots.txt). This is a plain text file which implements the Robots Exclusion Protocol by allowing or disallowing specific files within the site’s directories. A site administrator can disallow access to cgi, temporary or private directories by specifying robot user-agent names. The format of the robots.txt file is very simple. It consists of two kinds of fields:
a User-agent field and one or more Disallow fields.
What is User-agent?
This is the technical name for a client program in the world wide networking environment, and it is used to name the specific search engine robot within the robots.txt file.


http://2web-hosting.blogspot.com/

For example :
User-agent: googlebot
We can also use the wildcard character “*” to specify all robots :
User-agent: * 
This means all robots are allowed to visit.
What is Disallow?
The second field in the robots.txt file is known as Disallow: these lines tell the robots which files should or should not be crawled. For example, to prevent downloading email.htm the syntax will be:
Disallow: email.htm
To prevent crawling through a directory the syntax will be:
Disallow: /cgi-bin/
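The effect of such Disallow lines can be checked with Python’s standard urllib.robotparser module. A minimal sketch, where the file contents, the robot name “anybot” and the paths are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks one file and one directory
rules = """\
User-agent: *
Disallow: /email.htm
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("anybot", "/email.htm"))         # False
print(parser.can_fetch("anybot", "/cgi-bin/form.cgi"))  # False
print(parser.can_fetch("anybot", "/index.html"))        # True
```

Disallow rules are prefix matches, so `/cgi-bin/` blocks everything inside that directory while `/index.html`, which matches no rule, stays crawlable.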
White Space and Comments:
Any line in the robots.txt file beginning with # is treated as a comment only. Comments are commonly used to note what the file is for, as in the following example:
# robots.txt for www.anydomain.com
Entry Details for robots.txt :
1) User-agent: *
Disallow:
The asterisk (*) in the User-agent field denotes that “all robots” are invited. As nothing is disallowed, all robots are free to crawl everything.
2) User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private/
All robots are allowed to crawl all files except those in the cgi-bin, temp and private directories.
3) User-agent: dangerbot
Disallow: /
Dangerbot is not allowed to crawl any of the directories. “/” stands for all directories.
4) User-agent: dangerbot
Disallow: /

User-agent: *
Disallow: /temp/
The blank line indicates the start of a new User-agent record. Except for dangerbot, all other bots are allowed to crawl all directories except the “temp” directory.
5) User-agent: dangerbot
Disallow: /links/listing.html

User-agent: *
Disallow: /email.html
Dangerbot is not allowed to crawl the listing.html page of the links directory; all other robots are allowed everywhere except the email.html page.
6) User-agent: abcbot
Disallow: /*.gif$
To exclude all files of a specific file type (e.g. .gif) we will use the above robots.txt entry.
7) User-agent: abcbot
Disallow: /*?
To restrict web crawlers from crawling dynamic pages (URLs containing a “?”) we will use the above robots.txt entry.
Note: The Disallow field may contain “*” to match any sequence of characters and may end with “$” to indicate the end of the name.
E.g.: To exclude all gif files among the image files from Google crawling, while allowing the others:
User-agent: Googlebot-Image
Disallow: /*.gif$
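The “*” and “$” wildcards are extensions honored by major crawlers such as Googlebot rather than part of the original exclusion standard, so not every robot supports them. A minimal sketch of how such patterns can be matched, translating them to regular expressions (the function name robots_pattern_match is our own, for illustration):

```python
import re

def robots_pattern_match(pattern: str, path: str) -> bool:
    """Check whether a URL path matches a robots.txt Disallow pattern.

    '*' matches any sequence of characters; a trailing '$' anchors the
    pattern to the end of the path. Plain patterns are prefix matches.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as "match anything"
    regex = re.escape(pattern).replace(r"\*", ".*")
    regex = "^" + regex + ("$" if anchored else "")
    return re.match(regex, path) is not None

# /*.gif$ blocks any path ending in .gif
print(robots_pattern_match("/*.gif$", "/images/photo.gif"))       # True
print(robots_pattern_match("/*.gif$", "/images/photo.gif?v=2"))   # False
# /*? blocks any path containing a query string
print(robots_pattern_match("/*?", "/page.php?id=7"))              # True
```

Without the trailing “$”, the pattern behaves as a prefix match, which is why `/*.gif$` leaves `photo.gif?v=2` crawlable while `/*?` catches it.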
Disadvantages of robots.txt:
Problem with the Disallow field:
Disallow: /css/  /cgi-bin/  /images/
Different spiders will read the above field in different ways. Some will ignore the spaces and read it as /css//cgi-bin//images/, while others may only consider /images/ or /css/ and ignore the rest.
The correct syntax is:
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/
Listing all files:
Specifying each and every file name within a directory is a commonly made mistake:
Disallow: /ab/cdef.html
Disallow: /ab/ghij.html
Disallow: /ab/klmn.html
Disallow: /op/qrst.html
Disallow: /op/uvwx.html
The above portion can be written as:
Disallow: /ab/
Disallow: /op/
A trailing slash means a lot: it marks the entire directory as off-limits.
Capitalization:
USER-AGENT: REDBOT
DISALLOW:
Though the field names are not case sensitive, the data, such as directory and file names, are case sensitive.
Conflicting syntax:
User-agent: *
Disallow: /
#
User-agent: Redbot
Disallow:
What will happen? Redbot’s own record allows it to crawl everything, but will this permission override the “*” record’s Disallow field, or will the Disallow override the permission? In practice, a well-behaved robot obeys only the record that matches its own name, so here Redbot may crawl everything while all other robots are kept out.
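Python’s standard urllib.robotparser implements this record matching, so the example above can be tested directly. A quick sketch (the robot names are the illustrative ones from the example; “OtherBot” stands in for any other crawler):

```python
from urllib.robotparser import RobotFileParser

# The two records from the example above: "*" blocks everything,
# while the Redbot record disallows nothing.
rules = """\
User-agent: *
Disallow: /

User-agent: Redbot
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Redbot matches its own record, which allows everything...
print(parser.can_fetch("Redbot", "/index.html"))    # True
# ...while every other robot falls back to the "*" record.
print(parser.can_fetch("OtherBot", "/index.html"))  # False
```

The parser consults the record naming the robot first and uses the “*” record only as a fallback, which is the behavior most crawlers follow.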
II. Search Engine Robots: Meta-tags Explained
What is the robots meta tag?
Besides robots.txt, search engines also have another tool for controlling how their spiders crawl web pages: the robots META tag, which tells a web spider whether to index a page and follow the links on it. This may be more helpful in some cases, as it can be used on a page-by-page basis. It is also helpful in case you don’t have the requisite permission to access the server’s root directory to control the robots.txt file.
We place this tag within the header portion of the HTML.
Format of the Robots Meta tag :
In the HTML document it is placed in the HEAD section.
<html>
<head>
<META NAME="robots" CONTENT="index,follow">
<META NAME="description" CONTENT="Welcome to…….">
<title>……………</title>
</head>
<body>
Robots Meta Tag options :
There are four options that can be used in the CONTENT portion of the robots meta tag:
index, noindex, follow, nofollow.
The tag shown above, with “index,follow”, allows search engine robots to index a specific page and follow all the links residing on it. If the site admin doesn’t want a page to be indexed or any links to be followed, they can replace “index,follow” with “noindex,nofollow”.
According to the requirements, the site admin can use the robots tag in the following different combinations:
<META NAME="robots" CONTENT="index,follow"> Index this page, follow links from this page.
<META NAME="robots" CONTENT="noindex,follow"> Don’t index this page, but follow links from this page.
<META NAME="robots" CONTENT="index,nofollow"> Index this page, but don’t follow links from this page.
<META NAME="robots" CONTENT="noindex,nofollow"> Don’t index this page, don’t follow links from this page.
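To see how a crawler reads these directives, here is a minimal sketch that extracts the robots meta tag from a page using Python’s standard html.parser module (the RobotsMetaParser class name and the sample page are our own, for illustration):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the CONTENT directives of any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]

page = ('<html><head>'
        '<META NAME="robots" CONTENT="noindex,follow">'
        '</head><body></body></html>')
parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)  # ['noindex', 'follow']
```

A crawler would then skip indexing this page (“noindex”) while still queueing the links it contains (“follow”).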
