Accessify Forum - Accessibility Discussion

MultiViews and Canonicalising URLs


I recently enabled MultiViews so I could remove the .html extension from the pages of Project Cerbera. Now I have two URLs which are equally valid for each PHP-augmented HTML file. For example, the Blog Archive:
  • http://projectcerbera.com/blog/archive.html
  • http://projectcerbera.com/blog/archive
Any ideas about how I can make the extensionless version the canonical one?

I tried this:
Code:
# Strip ".html" from end of any incoming request:
RedirectMatch permanent ^/(.*)\.html$ http://projectcerbera.com/$1
But this makes the server try to find an extensionless file which doesn't exist. It seems that MultiViews doesn't run on the result of a RedirectMatch?

Another approach I found was Dave Shea's "Rewrite woes", but I can't figure out what actually worked for him.
_________________
My CV type thing and my Life of Ben (Blog). Nigel Peck's Accessify Forum Requirements.


Last edited by Ben Millard on 14 Feb 2007 02:06 am; edited 1 time in total
unless i'm mistaken, i don't think there's any way around your dilemma. if you attempted a rewrite rule to redirect names with an extension to their extensionless equivalents, i think it would send the server into a loop (a "too many redirects" warning in the browser etc)...
_________________
Patrick H. Lauke / webmaster / University of Salford
co-lead: WaSP Accessibility Task Force
take it to the streets ... WaSP Street Team
personal: splintered | photographia | redux
co-author: Web Accessibility - Web Standards and Regulatory Compliance
Beware of confusing Googlebot if you enable MultiViews. It shouldn't be a problem with .html files.
_________________
Jim O'Donnell
work: Royal Observatory Greenwich
play: eatyourgreens
I think this should do it:
Code:
RewriteEngine On
RewriteRule ^(.*)+\.html$ $1 [R=301,L]

<IfModule mod_headers.c>
 ErrorHeader unset Content-Location
</IfModule>
The last few lines get rid of the Content-Location header, which doesn't really work on the Web anyway, and which spoils your secret about the file extension... and Opera versions prior to 9 have a bug that will cause a redirect to the Content-Location URL whenever you follow an in-page link. (Just Header unset Content-Location doesn't work for some reason.)
_________________
Simon Pieters
EYG: The PHP I include() at the start of each page sends a header() for Content-Type.

Zcorpan: When I use that, all requests end up at http://projectcerbera.com/home/cerbera/public_html/. I tried changing it to this:
Code:
RewriteRule ^(.*)+\.html$ http://projectcerbera.com$1 [R=301,L]

<IfModule mod_headers.c>
 ErrorHeader unset Content-Location
</IfModule>
I tried some slight variants. They all produced the infinite redirect Redux warned about. The same thing happens whether Options MultiViews is on or off (plus or minus).
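For reference, "plus or minus" here means the usual way of toggling MultiViews from .htaccess, roughly as in the sketch below. It assumes the host allows Options to be overridden (AllowOverride Options or All); that is an assumption, not something confirmed for this server.

```apache
# Turn content negotiation on for this directory tree:
Options +MultiViews

# ...or turn it off again by using this line instead:
# Options -MultiViews
```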

(EDIT) A slightly depressing piece of news: Googlebot and some other spiders have visited while my URLs weren't canonicalised.

Turning off MultiViews and using a set of RewriteConds looks like a possibility:
  • Extensioned URLs would be redirected to extensionless URLs.
  • Extensionless URLs would be silently rewritten back to extensioned URLs.
Hopefully the silently rewritten URLs wouldn't end up being rewritten again, endlessly looping.
_________________
My CV type thing and my Life of Ben (Blog). Nigel Peck's Accessify Forum Requirements.
Well, after another day of playing with my test server and crawling around the Intertubes, I'm still nowhere with this. Each route I try ends in the infinite redirects Redux predicted. Surely there's a way around this and I'm just too dumb to figure it out?

Doing a 301 Moved Permanently from /foo/bar.html to /foo/bar is easy:
Code:
# Redirect ".html" to extensionless URLs:
RewriteRule ^(.*)\.html$ /$1 [R=301,NC,L]
This triggers a new request which means the .htaccess is run again.

This new request for /foo/bar needs the extension added back on so the file /foo/bar.html is retrieved. I'm using this:
Code:
# Rewrite extensionless URLs to ".html" page:
RewriteCond %{REQUEST_URI} !^(.*)/$
RewriteCond %{REQUEST_URI} ^(.*)$
RewriteRule !(\.)(.*)$ %1.html [L,NC]
But this causes the .htaccess file to be checked again so the previous redirect happens and it loops between them. (Note that %1 uses the RewriteCond backreference after checking this isn't a directory. It's very hacky.)

Each of these samples works on its own. But I need both to work together for the URLs to be both technology-neutral and canonical. Is there a way to check if the request came from a redirect? Would moving the redirect into the PHP allow me to break the loop by detecting some sort of condition?
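One commonly suggested way to tell the first pass from a later one is the REDIRECT_STATUS environment variable, which Apache sets once it has performed an internal redirect for the request. The following is only a sketch under several assumptions: Apache 2.x, MultiViews switched off, and the .htaccess file sitting in the document root. It has not been tested on Apache 1.3 and it makes no special case for directory index files.

```apache
RewriteEngine On

# First pass only: REDIRECT_STATUS is empty until Apache has done an
# internal redirect for this request, so the rewritten request below
# is not redirected a second time.
RewriteCond %{ENV:REDIRECT_STATUS} ^$
RewriteRule ^(.+)\.html$ /$1 [R=301,NC,L]

# Extensionless URL: serve the matching .html file if one exists
# and the request is not for a directory.
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(.+)$ /$1.html [L]
```

The idea is that /foo/bar.html gets one external 301 to /foo/bar, and the internal rewrite back to /foo/bar.html is then skipped by the first rule because REDIRECT_STATUS is no longer empty.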
_________________
My CV type thing and my Life of Ben (Blog). Nigel Peck's Accessify Forum Requirements.
I had a problem with a single URL which I presume was caused by the same issues, and I couldn't get anything to work for a while. The following did what I wanted though:
Code:
RewriteEngine on
RewriteCond %{THE_REQUEST} "^GET /foo/bar.html"
RewriteRule (.*) http://www.example.org/foo/bar [R=301]

It avoids an infinite loop since "THE_REQUEST" is what the UA asks for, not what the server translates it to on the second pass. I imagine you can do some regexp matching to make it work generally on multiple files.
Ok, this works for me, and will also get rid of www.
Code:
RewriteEngine On

# get rid of www.
RewriteCond %{HTTP_HOST} ^www\.example\.org$ [NC]
RewriteRule ^(.*)$ http://example.org/$1 [R=301,L]

# get rid of .html
RewriteCond %{THE_REQUEST} ^(GET|POST)\ (.+)\.html(&[^\ ]+)*\ HTTP/
RewriteRule ^(.+)\.html$ /$1 [R=301,L]

# get rid of Content-Location
<IfModule mod_headers.c>
 ErrorHeader unset Content-Location
</IfModule>


If you want it to work with query strings then use this instead:
Code:
RewriteEngine On

# get rid of www.
RewriteCond %{HTTP_HOST} ^www\.example\.org$ [NC]
RewriteRule ^(.*)$ http://example.org/$1 [R=301,L]

# get rid of .html
RewriteCond %{THE_REQUEST} ^(GET|POST)\ (.+)\.html(\?.*)?(&[^\ ]+)*\ HTTP/
RewriteRule ^(.+)\.html(\?.*)?$ /$1$2 [R=301,L]

# get rid of Content-Location
<IfModule mod_headers.c>
 ErrorHeader unset Content-Location
</IfModule>

_________________
Simon Pieters
Aha, that's the sort of thing I've been looking for, Chaos. Zcorpan, I dropped that in and it works perfectly on my test server (Apache 2.0.59).

However, I get infinite redirect loops when I use it on Project Cerbera (Apache 1.3.37 (seriously, I'm not making a "1337" joke)). Specifically, requesting a URL with a .html extension matching a file which exists.

Apparently, the API Phases in 1.3 are different to API Phases in 2.0. So my best guess is we're doing something in a way which is incompatible with Apache 1.3's API phases?

I tried removing everything from my .htaccess apart from the essentials (like enabling MultiViews). No errors. But then adding either of zcorpan's samples, Project Cerbera loops. So it's not something else in there causing the problem.

Thanks to everyone so far. It's all helpful and we're getting closer each time!
_________________
My CV type thing and my Life of Ben (Blog). Nigel Peck's Accessify Forum Requirements.


Last edited by Ben Millard on 16 Feb 2007 09:49 am; edited 1 time in total
Hmm... try this:
Code:
RewriteCond %{REQUEST_URI} ^(.+)\.html(\?.*)?$
RewriteRule ^(.+)\.html(\?.*)?$ /$1$2 [R=301,L]


...or maybe you need to specify "http://example.org/$1$2" instead of just "/$1$2"?
_________________
Simon Pieters
Yay, this works!
Code:
RewriteCond %{REQUEST_URI} ^(.+)\.html(\?.*)?$
RewriteRule ^(.+)\.html(\?.*)?$ http://projectcerbera.com/$1$2 [R=301,L]
Example URLs:
I removed all the .html's from my redirects to make old URLs work.

However, there's a slightly undesirable side-effect. When visiting a folder, like /tutorials/gta1/info/, you get redirected to /tutorials/gta1/info/index. I've gotten around this by adding a preceding RewriteCond:
Code:
# get rid of .html
RewriteCond %{REQUEST_URI} !^(.+)index\.html(\?.*)?$
RewriteCond %{REQUEST_URI} ^(.+)\.html(\?.*)?$
RewriteRule ^(.+)\.html(\?.*)?$ http://projectcerbera.com/$1$2 [R=301,L]
This now seems to be working perfectly. Could it be written more concisely?
_________________
My CV type thing and my Life of Ben (Blog). Nigel Peck's Accessify Forum Requirements.
Well, you could make "./index" redirect to "./", so that also gets canonicalized. How to do so I'll leave as an exercise to the reader.


...just kidding.

The whole shebang should look like this:
Code:
RewriteEngine On

# get rid of www.
RewriteCond %{HTTP_HOST} ^www\.projectcerbera\.com$ [NC]
RewriteRule ^(.*)$ http://projectcerbera.com/$1 [R=301,L]

# get rid of .html
RewriteCond %{REQUEST_URI} \.html
RewriteRule ^(.+)\.html(\?.*)?$ http://projectcerbera.com/$1$2 [R=301,L]

# redirect index to ./
RewriteCond %{THE_REQUEST} index
RewriteRule ^(.*)index\.html(\?.*)?$ http://projectcerbera.com/$1$2 [R=301,L]

# get rid of Content-Location
<IfModule mod_headers.c>
 ErrorHeader unset Content-Location
</IfModule>

_________________
Simon Pieters
When I drop that in, I get infinite redirect loops with index pages. Making both RewriteConds use %{THE_REQUEST} fixed it:
Code:
# Get rid of 'www.':
RewriteCond %{HTTP_HOST} ^www\.projectcerbera\.com$ [NC]
RewriteRule ^(.*)$ http://projectcerbera.com/$1 [R=301,L]

# Get rid of '.html':
RewriteCond %{THE_REQUEST} \.html
RewriteRule ^(.+)\.html(\?.*)?$ http://projectcerbera.com/$1$2 [R=301,L]

# Get rid of 'index' in directories:
RewriteCond %{THE_REQUEST} index
RewriteRule ^(.*)index\.html(\?.*)?$ http://projectcerbera.com/$1$2 [R=301,L]

# get rid of Content-Location
<IfModule mod_headers.c>
 ErrorHeader unset Content-Location
</IfModule>

Homepage:
  • / - stays as this. Pass.
  • /index - redirects to /. Pass.
  • /index.html - redirects to /. Pass.
A deeper index page:
Some non-index URLs:
Wewt and pwnage! So once again, we've proved Simon Pieters > Internet.
_________________
My CV type thing and my Life of Ben (Blog). Nigel Peck's Accessify Forum Requirements.
Oops. ".html" or "index" in a query string results in a loop.

This should fix it:
Code:
# Get rid of 'www.':
RewriteCond %{HTTP_HOST} ^www\.projectcerbera\.com$ [NC]
RewriteRule ^(.*)$ http://projectcerbera.com/$1 [R=301,L]

# Get rid of '.html':
RewriteCond %{THE_REQUEST} \.html
RewriteCond %{THE_REQUEST} !\?.*\.html
RewriteRule ^([^\?]+)\.html(\?.*)?$ http://projectcerbera.com/$1$2 [R=301,L]

# Get rid of 'index' in directories:
RewriteCond %{THE_REQUEST} index
RewriteCond %{THE_REQUEST} !\?.*index
RewriteRule ^([^\?]*)index\.html(\?.*)?$ http://projectcerbera.com/$1$2 [R=301,L]

# get rid of Content-Location
<IfModule mod_headers.c>
 ErrorHeader unset Content-Location
</IfModule>
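
A possible tidy-up of the two THE_REQUEST rules above, offered as an untested sketch rather than what is running on Project Cerbera. mod_rewrite matches a RewriteRule pattern against the URL path only, so the (\?.*)? groups never see the query string; by default the query string is passed through to the redirect target unchanged. Anchoring the RewriteCond to the start of the request line and stopping at the first "?" or space keeps query-string text out of the match, so the separate negative conditions should no longer be needed. The character classes and the use of the %1 backreference are assumptions of this sketch, and the www rule would sit before these rules unchanged.

```apache
RewriteEngine On

# Redirect /dir/index and /dir/index.html to /dir/.
# The pattern is anchored to the request line and stops at "?" or the
# space before "HTTP/", so text in the query string is never examined.
# %1 is the directory part (empty for the homepage).
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?\ ]*/)?index(\.html)?[?\ ]
RewriteRule ^ http://projectcerbera.com/%1 [R=301,L]

# Redirect /page.html to /page. %1 is the path captured from the
# request line, without the leading slash and without ".html".
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?\ ]+)\.html[?\ ]
RewriteRule ^ http://projectcerbera.com/%1 [R=301,L]
```

With this, a request for /index.html?page=about.html should redirect once to /?page=about.html and then stop, because the second request line has neither "index" nor ".html" before the "?".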

_________________
Simon Pieters
