Search Engine Robots Or Web Crawlers

Most common users or visitors use the different available search engines to find the piece of information they require. But how is this information provided by the search engines? Where do they collect it from? Basically, most of these search engines maintain their own database of information. These databases cover the sites available in the web world and ultimately hold the detailed page information for each available site. Search engines do some background work, using robots to collect information and maintain the database. They catalogue the gathered information and then present it publicly, or at times for private use.

In this article we will discuss those entities that loiter in the global internet environment, that is, the web crawlers that move around in netspace. We will learn:

· What they are all about and what purpose they serve
· Pros and cons of using these entities
· How we can keep our pages away from crawlers
· Differences between the common crawlers and robots


In the following portion we will divide the whole discussion under the following two sections:

I. Search Engine Spider: Robots.txt
II. Search Engine Robots: Meta-tags Explained


I. Search Engine Spider: Robots.txt

What is a robots.txt file?

A web robot is a program or piece of search engine software that visits sites regularly and automatically and crawls through the web's hypertext structure by fetching a document and recursively retrieving all the documents it references. Sometimes site owners do not want all of their site pages to be crawled by web robots. For this reason they can exclude some of their pages from being crawled by using standard exclusion rules. Most robots abide by the 'Robots Exclusion Standard', a set of constraints that restricts robot behaviour.
'Robots Exclusion Standard' is a protocol used by the site administrator to control the movement of robots. When a search engine robot comes to a site, it will look for a file named robots.txt in the root of the domain (http://www.anydomain.com/robots.txt). This is a plain text file which implements the 'Robots Exclusion Protocol' by allowing or disallowing specific files within the directories of the site. The site administrator can disallow access to cgi, temporary or private directories by specifying robot user-agent names.
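
As an illustration of how this protocol is used in practice, the short sketch below shows how a well-behaved crawler consults robots.txt before fetching a page. This is only a minimal sketch using Python's standard urllib.robotparser module; the domain and the page path are the example names used in this article, not real URLs.

import urllib.robotparser

# Point the parser at the site's robots.txt file and download it.
parser = urllib.robotparser.RobotFileParser()
parser.set_url("http://www.anydomain.com/robots.txt")
parser.read()

# A polite crawler asks before every fetch whether its user-agent may visit the URL.
if parser.can_fetch("googlebot", "http://www.anydomain.com/cgi-bin/form.cgi"):
    print("Allowed to crawl this page")
else:
    print("Disallowed by robots.txt")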

The format of the robots.txt file is very simple. It consists of two kinds of field: a User-agent field and one or more Disallow fields.


What is User-agent?

This is the technical name for a software agent in the world wide networking environment, and it is used to name the specific search engine robot within the robots.txt file.
For example:

User-agent: googlebot

We can also use the wildcard character “*” to specify all robots:
User-agent: *

This means that all robots are allowed to visit.

What is Disallow?

The second field in the robots.txt file is known as Disallow. These lines tell the robots which files should be crawled and which should not. For example, to prevent email.htm from being downloaded, the syntax will be:

Disallow: /email.htm

To prevent crawling of a directory, the syntax will be:

Disallow: /cgi-bin/

White Space and Comments:

Any line in the robots.txt file that begins with # is treated as a comment only. A comment like the following example is commonly placed at the top of the file to indicate which site it applies to:

# robots.txt for www.anydomain.com

Entry Details for robots.txt:

1) User-agent: *
Disallow:

The asterisk (*) in the User-agent field denotes “all robots”. As nothing is disallowed, all robots are free to crawl through everything.

2) User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private/

All robots are allowed to crawl through all files except those in the cgi-bin, temp and private directories.

3) User-agent: dangerbot
Disallow: /
Dangerbot is not allowed to crawl through any of the directories; “/” stands for all directories.

4) User-agent: dangerbot
Disallow: /

User-agent: *
Disallow: /temp/

The blank line indicates the start of a new User-agent record. Except for dangerbot, all other bots are allowed to crawl through all directories except the “temp” directory.

5) User-agent: dangerbot
Disallow: /links/listing.html

User-agent: *
Disallow: /email.html

Dangerbot is not allowed to crawl the listing page of the links directory; otherwise all robots are allowed to crawl all directories, except for downloading the email.html page.

6) User-agent: abcbot
Disallow: /*.gif$

To block all files of a specific file type (e.g. .gif) from abcbot, we will use the above robots.txt entry.

7) User-agent: abcbot
Disallow: /*?

To restrict the web crawler from crawling dynamic pages (URLs containing a “?”), we will use the above robots.txt entry.

Note: The Disallow field may contain “*” to match any series of characters and may end with “$” to indicate the end of the name.

E.g.: To exclude all gif files within the image files from Google's image crawling, while allowing the others:
User-agent: Googlebot-Image
Disallow: /*.gif$
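
Not every robot understands these wildcard extensions, but the rough sketch below shows how a crawler that does support them might match such a pattern against a URL path. It is written in Python, and the pattern-to-regex translation is an assumption about typical behaviour rather than part of the original standard.

import re

def path_is_disallowed(pattern, path):
    # "*" matches any run of characters; a trailing "$" anchors the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    # robots.txt rules are prefix matches, so re.match (anchored at the start) is enough.
    return re.match(regex, path) is not None

print(path_is_disallowed("/*.gif$", "/images/photo.gif"))   # True: gif files are blocked
print(path_is_disallowed("/*.gif$", "/images/photo.jpg"))   # False: other image files are allowed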

Disadvantages of robots.txt:

Problem with Disallow field:

Disallow: /css/ /cgi-bin/ /images/
Different spiders will read the above field in different ways. Some will ignore the spaces and read it as /css//cgi-bin//images/, while others may only consider either /images/ or /css/ and ignore the rest.

The correct syntax should be:
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/
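
As a quick illustration of why the one-line form fails, here is a small sketch using Python's standard urllib.robotparser module (only one possible parser; other spiders may behave differently, as noted above).

import urllib.robotparser

bad = ["User-agent: *", "Disallow: /css/ /cgi-bin/ /images/"]
good = ["User-agent: *", "Disallow: /css/", "Disallow: /cgi-bin/", "Disallow: /images/"]

for name, lines in (("one-line form", bad), ("one path per line", good)):
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(lines)
    # The one-line form is treated as a single literal path, so /css/ stays crawlable;
    # with one path per line it is blocked as intended.
    print(name, parser.can_fetch("*", "/css/style.css"))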

All Files listing:

Specifying each and every file name within a directory is another commonly made mistake:
Disallow: /ab/cdef.html
Disallow: /ab/ghij.html
Disallow: /ab/klmn.html
Disallow: /op/qrst.html
Disallow: /op/uvwx.html

The above portion can be written as:
Disallow: /ab/
Disallow: /op/

A trailing slash means a lot: it indicates that an entire directory is off-limits.

Capitalization:

USER-AGENT: REDBOT
DISALLOW:

Though the field names are not case sensitive, the data, such as directory and file names, are case sensitive.

Conflicting syntax:

User-agent: *
Disallow: /
#
User-agent: Redbot
Disallow:

What will happen? Redbot's own record allows it to crawl everything, but will this permission override the general Disallow field, or will the Disallow override the permission? The answer depends on the crawler, which is why such conflicting syntax is best avoided; the sketch below shows how one widely used parser resolves it.
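
This is only one implementation, but Python's standard urllib.robotparser module gives a concrete answer for the file above; other crawlers may resolve the conflict differently.

import urllib.robotparser

rules = [
    "User-agent: *",
    "Disallow: /",
    "#",
    "User-agent: Redbot",
    "Disallow:",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# This parser prefers the most specific matching User-agent record,
# so Redbot may crawl while every other robot is shut out.
print(parser.can_fetch("Redbot", "/index.html"))    # True
print(parser.can_fetch("OtherBot", "/index.html"))  # False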

II. Search Engine Robots: Meta-tags Explained

What is the robots meta tag?

Besides robots.txt, search engines also have another tool for controlling how web pages are crawled: the robots META tag, which tells a web spider whether to index a page and whether to follow the links on it. It can be more helpful in some cases, as it works on a page-by-page basis. It is also helpful in case you don't have the requisite permission to access the server's root directory to control the robots.txt file.
We place this tag within the header portion of the HTML.

Format of the Robots Meta tag:

In the HTML document it is placed in the HEAD section.

<html>
<head>
<META NAME="robots" CONTENT="index,follow">
<META NAME="description" CONTENT="Welcome to…….">
<title>……………</title>
</head>
<body>

Robots Meta Tag options:

There are four options that can be used in the CONTENT portion of the robots META tag. These are index, noindex, follow and nofollow.

This tag allows search engine robots to index a specific page and follow all the links residing on it. If the site admin doesn't want a page to be indexed or any links to be followed, they can replace "index,follow" with "noindex,nofollow".
According to the requirements, the site admin can use the robots meta tag with the following options:

<META NAME="robots" CONTENT="index,follow"> Index this page, follow links from this page.
<META NAME="robots" CONTENT="noindex,follow"> Don't index this page, but follow links from this page.
<META NAME="robots" CONTENT="index,nofollow"> Index this page, but don't follow links from this page.
<META NAME="robots" CONTENT="noindex,nofollow"> Don't index this page, don't follow links from this page.
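
To show how these directives are typically consumed, here is a rough sketch, in Python with the standard library only, of a crawler reading the robots meta tag before deciding whether to index a page or follow its links. The class name and the sample page are made up for illustration.

from html.parser import HTMLParser

class RobotsMetaReader(HTMLParser):
    # Collect the comma-separated directives from a <META NAME="robots"> tag.
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives = {d.strip().lower() for d in content.split(",")}

page = '<html><head><META NAME="robots" CONTENT="noindex,follow"></head><body></body></html>'
reader = RobotsMetaReader()
reader.feed(page)

may_index = "noindex" not in reader.directives    # False: this page should not be indexed
may_follow = "nofollow" not in reader.directives  # True: its links may still be followed
print(may_index, may_follow)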



