Understanding robots.txt: What to Allow and Disallow
The robots.txt file sits at the root of your domain and tells search engine crawlers which parts of your site they can and cannot access. It is one of the first files crawlers request, and mistakes here cut both ways: because the file is publicly readable, listing sensitive paths advertises them (robots.txt is not access control), while an overly broad rule can accidentally block your entire site from being indexed.
How robots.txt Works
When a search engine bot visits your site, it first checks yourdomain.com/robots.txt. The file contains rules (directives) that specify which URL paths are off-limits. A basic robots.txt looks like this:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
The User-agent: * line means the rules apply to all crawlers. You can also write rules for specific bots like Googlebot or Bingbot.
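For instance, a hypothetical file could give Googlebot different rules than everyone else (the paths here are illustrative):

```
User-agent: Googlebot
Disallow: /staging/

User-agent: *
Disallow: /staging/
Disallow: /beta/
```

Note that a crawler obeys only the most specific group that names it, so in this sketch Googlebot follows its own group and ignores the * group entirely, meaning /beta/ stays crawlable for Google. Rules do not accumulate across groups.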
What to Disallow
Block pages that waste crawl budget or should not appear in search results:
- Admin and login pages — /admin/, /wp-admin/, /login/
- Internal search results — /search? pages create infinite crawl traps
- Duplicate content paths — print versions, filtered views, sorted versions of the same content
- Staging and development paths — /staging/, /dev/
- Shopping cart and checkout — /cart/, /checkout/
- User-generated content areas — if they produce low-quality pages at scale
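Before deploying a rule set like this, you can sanity-check it locally. This sketch uses Python's standard-library urllib.robotparser against a hypothetical rule set modeled on the patterns above (example.com and the sample paths are assumptions, not from your site):

```python
import urllib.robotparser

# Hypothetical rules mirroring the disallow patterns discussed above.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Check a few representative URLs against the parsed rules.
for path in ["/admin/settings", "/cart/", "/blog/robots-guide"]:
    allowed = parser.can_fetch("*", f"https://example.com{path}")
    print(path, "->", "allowed" if allowed else "blocked")
```

One caveat: the stdlib parser's matching semantics are not identical to Google's, so treat this as a first-pass check and confirm important URLs in Google Search Console's robots.txt tester as well.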
What to Always Allow
Never block these resources — doing so prevents Google from properly rendering and evaluating your pages:
- CSS files — Google needs to render your page as users see it
- JavaScript files — Required for rendering dynamic content
- Images — Blocked images mean no image search traffic and poor rendering
- Your main content pages — This sounds obvious, but overly broad disallow rules accidentally block important sections
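Allow rules let you carve an exception out of an otherwise blocked directory, which is often how you keep a required resource crawlable. A common WordPress-style example (shown here as an illustration, not a universal recommendation):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```

Google resolves the overlap by applying the most specific (longest) matching rule, so the Allow wins for that one file while the rest of the directory stays blocked.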
Common Mistakes to Avoid
- Disallow: / under User-agent: * — this single line blocks your entire site from all crawlers; it should only ever appear deliberately, such as on a staging host
- Blocking CSS/JS — Legacy advice from the early 2000s that now actively hurts SEO
- Forgetting the sitemap directive — Always include your sitemap URL for faster discovery
- Using noindex in robots.txt — Google no longer supports the noindex directive in robots.txt. Use meta robots tags instead.
- Conflicting rules — when Allow and Disallow overlap, Google applies the most specific (longest) matching rule, but other crawlers may resolve conflicts differently. Test thoroughly.
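On the noindex point above: to keep a page out of search results, the page itself must carry the directive, for example a meta robots tag in its head (an X-Robots-Tag HTTP header works too):

```
<!-- In the page's <head>: keeps the page out of results
     while still letting crawlers follow its links -->
<meta name="robots" content="noindex, follow">
```

The common gotcha is combining this with a robots.txt Disallow for the same URL: if crawlers cannot fetch the page, they never see the noindex tag, and the URL can still end up indexed from external links.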
Generate a properly formatted robots.txt file with the Robots.txt Generator. It provides a template with common disallow patterns and reminds you to include your sitemap reference.
Speaking of sitemaps, create your XML sitemap alongside your robots.txt using the XML Sitemap Generator. The two files work together — robots.txt tells crawlers what to skip, and the sitemap tells them what to prioritize.
After deploying changes, verify that your important pages are still accessible by checking for redirect issues with the Redirect Code Generator — a redirect to a disallowed URL is a common and hard-to-diagnose problem.
Review your robots.txt after every major site change. A five-minute check can prevent months of invisible indexing problems.