

SEO Basics: Tweak your robots.txt & XML sitemap file(s)

What is a robots.txt file?

On most well-made sites you can examine the /robots.txt file, which gives a basic idea of which content the webmaster wants a search engine to focus on and which content NOT to index, as well as, in many cases, the location of the XML sitemap and sometimes even messages to human readers. Note that robots.txt rules can target specific crawlers or apply to all of them.

The basic directive telling a robot NOT to crawl content is "Disallow". Blocking an entire site looks like this:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. 

You can write rules so that, for example, "User-agent: Googlebot" (Google's main crawler is called Googlebot, but there are MANY others) is allowed to crawl given pages, while a "User-agent: *" group with "Disallow: /" (a single slash) blocks the whole site for every other user agent. In other words, you allow Google to crawl certain pages (or optionally block it from some) while blocking all crawling by non-Google crawlers, as sketched below.
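
A minimal sketch of that setup (the /public/ path is just a placeholder; "Allow" is a common extension to the original robots.txt standard, honored by Googlebot and most major crawlers):

# Googlebot may crawl /public/ only; everyone else is blocked entirely
User-agent: Googlebot
Allow: /public/
Disallow: /

User-agent: *
Disallow: /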

You can also EXPLAIN, in a # comment, why you are blocking a given tool:

# too many repeated hits, too quick
User-agent: Baiduspider
Disallow: /

Example taken on 09.13.2018 from https://archive.org/robots.txt:

Sitemap: https://archive.org/sitemap/sitemap.xml

##############################################
#
# Welcome to the Archive!
#
##############################################
# Please crawl our files.
# We appreciate if you can crawl responsibly.
# Stay open!
##############################################

User-agent: *
Disallow: /control/
Disallow: /report/

See: Robots.txt instructions are directives only | Test your robots.txt with the robots.txt Tester | Submit your updated robots.txt to Google


From support.google.com we read:

User agents in robots.txt

Where several user-agents are recognized in the robots.txt file, Google will follow the most specific. If you want all of Google to be able to crawl your pages, you don't need a robots.txt file at all. If you want to block or allow all of Google's crawlers from accessing some of your content, you can do this by specifying Googlebot as the user-agent. For example, if you want all your pages to appear in Google search, and if you want AdSense ads to appear on your pages, you don't need a robots.txt file. Similarly, if you want to block some pages from Google altogether, blocking the user-agent Googlebot will also block all Google's other user-agents.
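
For instance, in this sketch (the /archives/ path is just a placeholder), Googlebot-News follows only its own group and is blocked from the whole site, while every other crawler, including the rest of Google's, falls back to the generic group and is blocked only from /archives/:

User-agent: Googlebot-News
Disallow: /

User-agent: *
Disallow: /archives/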

But if you want more fine-grained control, you can get more specific. For example, you might want all your pages to appear in Google Search, but you don't want images in your personal directory to be crawled. In this case, use robots.txt to disallow the user-agent Googlebot-Image from crawling the files in your /personal directory (while allowing Googlebot to crawl all files), like this:

User-agent: Googlebot
Disallow:

User-agent: Googlebot-Image
Disallow: /personal

To take another example, say that you want ads on all your pages, but you don't want those pages to appear in Google Search. Here, you'd block Googlebot, but allow Mediapartners-Google, like this:

User-agent: Googlebot
Disallow: /

User-agent: Mediapartners-Google
Disallow:

See also: General robots questions | Robots.txt questions | Robots meta tag questions | X-Robots-Tag HTTP header questions



What is a sitemap?

Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.

Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.


Note: an XML sitemap can contain JUST the page URL (as in the example below), with optional tags for the last modification date, the page's priority, and its expected change frequency:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>https://www.new-web-domains.com/palo_alto_seo/search-engine-optimizations-tips/basics/seo-ranking.html</loc>
</url>
</urlset>
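
A sketch of the same entry with the optional tags filled in (the date and values here are placeholders; per the Sitemaps protocol, changefreq accepts always, hourly, daily, weekly, monthly, yearly, or never, and priority ranges from 0.0 to 1.0 with a default of 0.5):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.new-web-domains.com/palo_alto_seo/search-engine-optimizations-tips/basics/seo-ranking.html</loc>
<lastmod>2018-09-13</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>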

Now you are ready to create your sitemap.xml file. For this you may wish to search for a "sitemap generator tool".

A good rule to follow is to set up Google Search Console (AKA Webmaster Tools). Here you set your preferred domain (www or non-www).

Submit the location of your sitemap to Google (and the content of your robots.txt as well), and learn how Google discovers, crawls, and serves web pages.
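
Besides Search Console, you can also point crawlers to your sitemap from robots.txt itself, as the archive.org example above does; a one-line sketch with a placeholder domain:

Sitemap: https://www.example.com/sitemap.xml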

Note: your sitemap can provide valuable metadata associated with the pages you list in it. Metadata is information about a webpage, such as when the page was last updated, how often the page changes, and the importance of the page relative to other URLs on the site. More via: https://support.google.com/webmasters/answer/156184


Note: You can take similar steps to be indexed well on Bing by signing up and using: www.bing.com/webmaster/home/mysites

See More: XML tag definitions | Entity escaping | Using Sitemap index files | Other Sitemap formats | Sitemap file location | Validating your Sitemap | The Sitemaps protocol | Informing search engine crawlers

Frequently asked questions (opens to external site)

How do I represent URLs in the Sitemap?

Does it matter which character encoding method I use to generate my Sitemap files?

How do I specify time?

How do I compute lastmod date?

Where do I place my Sitemap?

How big can my Sitemap be?

My site has tens of millions of URLs; can I somehow submit only those that have changed recently?

What do I do after I create my Sitemap?

Do URLs in the Sitemap need to be completely specified?

My site has both "http" and "https" versions of URLs. Do I need to list both?




About this Made in Palo Alto SEO guide:

Content was last updated September 13th, 2018 by Ardan Michael Blum, CEO at A. Blum Localization Services.

See More: Localization.Company and join our next Palo Alto SEO Meetup via: iterate.live

A. Blum Localization Services is located at 345 Forest Avenue, Suite 204, Palo Alto, CA 94301, USA.