MUST Creative Engineering Laboratory

6 methods to control what and how your content appears in search engines

Page Revision: 2010/06/27 13:31


1. Use a robots.txt robots exclusion file

A robots.txt file placed in a site’s root directory tells crawlers which paths they may not request. Directives are grouped by User-agent, and a crawler obeys the most specific group matching its name:

User-agent: *
Disallow: /sales/
Disallow: /images/

User-agent: googlebot
Disallow: /sales

The non-standard Allow directive, supported by the major engines, explicitly permits crawling:

User-Agent: *
Allow: /

Pattern matching

Some search engines support extensions to the original robots.txt specification which allow for URL pattern matching.

Pattern: *
Description: matches a sequence of characters
Example:
User-Agent: *
Disallow: /print*/
Search Engine Support: Google, Yahoo!, Bing

Pattern: $
Description: matches the end of a URL
Example:
User-Agent: *
Disallow: /*.pdf$
Search Engine Support: Google, Yahoo!, Bing

References: Google, Yahoo!, Bing (Sad note: Microsoft Bing’s help system is awful – they don’t allow direct linking to a topic section). At the time of this writing, Ask does not officially support these extensions.
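Exclusion rules like the ones above can be sanity-checked programmatically before deployment. A minimal sketch using Python’s standard urllib.robotparser module (note: it implements only the original prefix-matching specification, not the * and $ pattern extensions; the bot name is made up):

```python
import urllib.robotparser

# Rules from the example above, minus the pattern-matching extensions.
rules = """\
User-agent: *
Disallow: /sales/
Disallow: /images/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# An unnamed bot falls under the "*" group.
blocked = rp.can_fetch("somebot", "http://example.com/sales/report.html")
allowed = rp.can_fetch("somebot", "http://example.com/index.html")
print(blocked, allowed)  # False True
```

Running such a check against a staging copy of robots.txt catches typos before they block (or expose) the wrong paths in production.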

2. Use “noindex” page meta tags

Pages can be tagged using “meta data” to indicate they should not be indexed by search engines. Simply add the following code to any page you do not want a search engine to index:

<meta name="robots" content="noindex" />
Keep in mind that search engines will still spider these pages on a regular basis. They continue to crawl “noindex” pages in order to check the current status of each page’s robots meta tag.

Meta robots tags can also target a specific crawler by name, overriding the generic rule. The following combination keeps every search engine except Google from indexing a page:

<meta name="robots" content="noindex" />
<meta name="googlebot" content="index" />
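To verify that a deployed page actually carries the intended directives, its robots meta tags can be extracted and inspected. A sketch using Python’s standard html.parser (the class and page content are illustrative, not from any library):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [v.strip().lower() for v in content.split(",")]

page = '<html><head><meta name="robots" content="noindex, nofollow" /></head><body></body></html>'
parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)  # ['noindex', 'nofollow']
```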

3. Password protect sensitive content

Sensitive content is usually protected by requiring visitors to enter a username and password. Such secure content won’t be crawled by search engines. Passwords can be set at the web server level or at the application level. For server level logon setup, consult the Apache Authentication Documentation or the Microsoft IIS documentation.

4. Nofollow: tell search engines not to spider some or all links on a page

As a response to blog comment “spam”, search engines introduced a way for websites to tell a search engine spider to ignore one or more links on a page. In theory, the search engine won’t “follow”, or crawl, a link which has been “protected”. To keep all links on a page off-limits, use a nofollow meta tag:

<meta name="robots" content="nofollow" />
To specify nofollow at the link level, add the attribute rel with the value nofollow to the link:

<a href="mypage.html" rel="nofollow">my page</a>

5. Don’t link to pages you want to keep out of search engines

Search engines won’t index content unless they know about it. Thus, if no one links to pages nor submits them to a search engine, a search engine won’t find them. At least this is the theory. In reality, the web is so large, one can assume that sooner or later a search engine will find a page – someone will link to it.

6. Use X-Robots-Tag in your http headers

In solution 1 above, we noted that use of robots.txt explicitly exposes some of your site’s structure, something you may want to avoid. Unfortunately, solution 2, use of meta tags, only works for html documents – there’s no way to specify indexing instructions for PDF, odt, doc and other non-html files. In July 2007, Google officially introduced a solution to this problem: the ability to deliver indexing instructions in the http header information which is sent by the web server along with an object. Yahoo! joined Google by supporting this tag in December 2007. Microsoft first mentions x-robots-tag in a June 2008 blog post, although I don’t see their webmaster documentation updated. They do make one mention of X-Robots-Tag in their Bing guide for webmasters. The web server simply needs to add X-Robots-Tag and any of the Google or Yahoo! supported meta tag values to the http header for an object:

X-Robots-Tag: noindex
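For illustration, here is one way a web server can attach that header to a non-html object. This is a sketch using Python’s standard http.server; the handler name and stand-in content are made up for the example:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoIndexHandler(BaseHTTPRequestHandler):
    """Serve a document with an X-Robots-Tag header so crawlers won't index it."""
    def do_GET(self):
        body = b"(pretend this is a PDF)"  # stand-in content
        self.send_response(200)
        self.send_header("Content-Type", "application/pdf")
        self.send_header("X-Robots-Tag", "noindex")  # indexing instruction for crawlers
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("localhost", 0), NoIndexHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://localhost:%d/report.pdf" % server.server_address[1]
with urllib.request.urlopen(url) as resp:
    header = resp.headers["X-Robots-Tag"]
print(header)  # noindex
server.shutdown()
```

In practice the same effect is usually achieved with a one-line web server configuration (e.g. Apache’s mod_headers) rather than application code.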

Each search engine also provides a procedure for notifying it of copyright violations – a procedure to follow in the event the copyright violator proves non-responsive.

Automated Content Access Protocol

Several commercial publishing associations have united behind a project to allow for the specification of more granular restrictions on content use by search engines. The project, Automated Content Access Protocol, appears to be driven as much by a desire to share in the profits search engines accrue when presenting abstracts of a publisher’s content as by a response to limitations in the current robots.txt and meta tag solutions. At the time of this writing (February 2007), no search engines have yet announced support for this project.

Additional Search Engine Content Display Control

Several search engines also support ways for webmasters to further control the use of their content by search engines.

No archive

Most search engines allow a user to view a copy of the web page that was actually indexed by the search engine. This snapshot of a page in time is called the cache copy. Internet visitors can find this functionality to be really useful if the link is no longer available or the site is down. There are several reasons to consider disabling the cache view feature for a page or an entire website.
  • Web site owners may not want visitors viewing data, such as price lists, which are not necessarily up to date.
  • Web pages viewed in a search engine cache may not display properly if embedded images are unavailable and/or browser code such as CSS and JS does not properly execute.
  • Cached page views will not show up in web-log-based web analytics systems. Reporting in tag-based solutions may be incorrect as well, since the cached view is served from a third-party domain, not yours.
If you want a search engine to index your page without allowing a user to view a cached copy, use the noarchive attribute, which is officially supported by Google, Yahoo!, Bing and Ask:
{{<meta name="robots" content="noarchive" />}}

Microsoft also documents the nocache attribute, which is equivalent to noarchive; since Microsoft supports noarchive as well, there is no reason to use nocache.

No abstract option: nosnippet

Google offers an option to suppress the generation of page abstracts, called snippets, in the search results. Use the following meta tag in your pages:
{{<meta name="googlebot" content="nosnippet" />}}

They note that this also sets the noarchive option. We would suggest you set it explicitly if that is what you want.

Page title option: noodp

Search engines generally use a page’s html title when creating a search result title, the link a user clicks on to arrive at a website. In some cases, search engines may use an alternative title taken from a directory such as dmoz, the open directory, or the Yahoo! directory. Historically, many sites have had poor titles – i.e. just the company name, or worse, “default page title”. Use of a human-edited title from a well known directory was often a good solution. As webmasters improve the usability of their sites, page titles have become much more meaningful – and often better choices than the open directory title. The noodp meta tag, supported by Microsoft, Google and Yahoo!, allows a webmaster to indicate that a page’s title should be used rather than the dmoz title.
{{<meta name="robots" content="noodp" />}}

Similarly, Yahoo! offers a “noydir” option to keep Yahoo! from using Yahoo! Directory titles in search results for a site’s pages:
{{<meta name="slurp" content="noydir" />}}

Bing Site Preview

Microsoft’s Bing offers a thumbnail preview of most search results, what Bing calls Document Preview. This isn’t new; Live Search offered a preview of the first six search results in some geographies, and Ask.com offers a similar feature called binoculars. Bing’s preview can be disabled by specifying nopreview as a meta robots value for a page. Microsoft also notes support for x-robots-tag: nopreview in http headers, the first time I’ve noted Microsoft mentioning support for the x-robots-tag. Microsoft previously supported different methods to disable the thumbnail previews: the searchpreview robot in the robots.txt file,
{{User-agent: searchpreview
Disallow: /}}

or a meta tag containing “noimageindex,nomediaindex”:
{{<meta name="robots" content="noimageindex,nomediaindex" />}}

This meta tag was used by AltaVista at one point; it is not known to be used by any of the other major search engines.

Expires After with unavailable_after

One problem with search engines is the delay between when content is removed from a website and when that content actually disappears from search engine results. Typical time-dependent content includes event information and marketing campaigns. Pages removed from a website which still appear in search engine results generally make for a frustrating user experience – the Internet user clicks through to the website only to land on a “Page not found” error page.

In July 2007, Google introduced the “unavailable_after” tag, which allows a website to specify in advance when a page should be removed from search engine results, i.e. when it will expire. This tag can be specified as an html meta tag attribute value:
{{<meta name="robots" content="unavailable_after: 21-Jul-2037 14:30:00 CET" />}}

or in an X-Robots-Tag http header:
{{X-Robots-Tag: unavailable_after: 7 Jul 2037 16:30:00 GMT}}

Google says the date format should be one of those specified by the ambiguous and obsolete RFC 850. We hope Google clarifies what date formats their parser can read by referring to a current date standard, such as IETF Internet standard RFC 3339. We’d also like to see detailed page crawl information in Google’s Webmaster Tools. Not only could Google show when a page was last crawled, they could add expiration information, confirming proper use of the unavailable_after tag. At one point, Google did show an approximation of the number of pages crawled relative to the number specified in a sitemap, but that feature was removed. This is one case where Google should follow Yahoo’s example.

Pro
  • A nice way to ensure search engine results are synchronized with current website content.

Con
  • The old RFC 850 date specification is too ambiguous, thus subject to error.
  • unavailable_after support is currently limited to Google. We do hope the other major search engines embrace this approach as well.

Added 2007-07-27.
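Since the ambiguous date format is the main pitfall, generating the tag from a datetime object is safer than typing the date by hand. A sketch in Python (the helper function name is ours, not Google’s; the date layout mirrors the dd-Mon-yyyy example in Google’s announcement):

```python
from datetime import datetime, timezone

def unavailable_after_meta(expires):
    """Render an unavailable_after robots meta tag for the given datetime.
    The date layout follows the dd-Mon-yyyy example in Google's announcement."""
    stamp = expires.strftime("%d-%b-%Y %H:%M:%S %Z")
    return '<meta name="robots" content="unavailable_after: %s" />' % stamp

tag = unavailable_after_meta(datetime(2037, 7, 21, 14, 30, 0, tzinfo=timezone.utc))
print(tag)  # <meta name="robots" content="unavailable_after: 21-Jul-2037 14:30:00 UTC" />
```

Using an explicit timezone (UTC here) avoids the ambiguity the RFC 850 formats are prone to.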

Crawl Delay

While not directly related to content, I was asked about regulating crawling speed in an SEO class, so here’s the formal answer. Both Yahoo! and Microsoft’s Bing support the robots exclusion protocol value crawl-delay. Yahoo cites a delay value in the form x.x, where 5 or 10 is “high”. While Yahoo doesn’t specify the delay units, Microsoft uses seconds.
{{User-agent: Slurp
Crawl-delay: 0.5

User-agent: msnbot
Crawl-delay: 4}}

Google does not support Crawl-delay, and neither will imposter bots. For Google, the crawl rate is a setting which can be changed in Google’s Webmaster Tools for a site. Now that you know you can set a crawl delay, you probably shouldn’t. Search engine crawlers need to access your site’s contents to find any changes – new pages, deleted pages, changed pages. It is in your interest that they do this frequently. Except in rare occurrences, the major search engines won’t be hammering your site. Imposters might, but they won’t respect the robots.txt content anyway.
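If you are writing a well-behaved crawler yourself, Python’s standard urllib.robotparser also exposes crawl-delay values (from Python 3.6 on). A caveat worth noting: it only parses whole-number delays, so a fractional value like Yahoo’s 0.5 would be silently dropped. A short sketch:

```python
import urllib.robotparser

rules = """\
User-agent: msnbot
Crawl-delay: 4
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

delay = rp.crawl_delay("msnbot")
print(delay)  # 4
print(rp.crawl_delay("otherbot"))  # None: no matching group and no default group
```

A polite crawler would then sleep for `delay` seconds between requests to that host.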

Meta Tag Summary

The following table summarizes the page level meta tags which can be used to specify how a search engine crawls a page. Positive tags, such as follow, are not listed as they are the default. Tags are case insensitive and can usually be combined.

Tag – Description – Search Engine Support

noindex – Don’t index a page (implies noarchive and nocache) – Google, Yahoo!, Bing, Ask
nofollow – Don’t follow, i.e. crawl, the links on the page – Google, Yahoo!, Bing, Ask
noarchive – Don’t present a cached copy of the indexed page – Google, Yahoo!, Bing, Ask
nocache – Same as noarchive – Bing
nosnippet – Don’t display an abstract for this page; may also imply noarchive – Google
noodp – Don’t use an Open Directory title for this page – Google, Yahoo!, Bing
nopreview – Don’t display a site preview in search results – Bing
noimageindex, nomediaindex – Don’t crawl images / objects specified in this page – Windows Live, which used this to disable a page preview thumbnail
unavailable_after: <RFC 850 date> – Don’t offer in search results after this date and time. Google notes: “This information is treated as a removal request: it will take about a day after the removal date passes for the page to disappear from the search results. We currently only support unavailable_after for Google web search results.” – Google
notranslate – Don’t allow Google to automatically translate a page. This one was introduced, apparently without much thought: the syntax takes the name “google” instead of “robots” (2008-10-14) – Google


Copyright © 2010 MUST Corp. All rights reserved. must@must.or.kr
This Program is released under the GNU General Public License v2. View the GNU General Public License v2 or visit the GNU website.