On Oct 16, 2009, at 3:11 AM, taliesin the storyteller wrote:

> Adam Walker wrote:
>> --- On Thu, 10/15/09, taliesin the storyteller <[log in to unmask]> wrote:
>>> From: taliesin the storyteller <[log in to unmask]>
>> I wrote:
>>>> Right now I'm just afraid that my front page is gonna go kerfutz
>>>> on the 26th since it was generated using one of Yahoo!'s tools
>>>> and includes a bajillion gifs stored at Geocities.
>
>>> Use a tool like wget (there are probably some for Windows too, and
>>> wget might already be on the Dreamhost account) to download/fetch
>>> that page plus all the images. The URLs can be rewritten
>>> automatically by wget itself.
>> Whunh?  How can it download ... ?  No what?  I'm confused.  You're  
>> saying that it can go find where Geocities has stored this or that  
>> particular gif and copy each of those gifs to my folders so I then  
>> have a copy?
>
> Yes.
>
>> I'd still have to re-write all the code pointing to the
>> gifs at www.geocities.com/blahblahblah though, right?  Or are you  
>> saying it re-does all of that too?
>
> Yes.
>
>> That's just a bit scary.  Then I'd just have to re-edit the text  
>> I've already changed since the move, and re-delete Yahoo!'s  
>> advertising code.  Or have I really misunderstood you?
>
> No. :)
>
> wget is a command-line tool, though; this is how I do backups of my
> site:
>
> wget -p -m -k http://taliesin.nvg.org/
>
> -m = makes a mirror of my publicly visible website (10 megs): it
> took 1.2 seconds
> -k = rewrites links/strips out the http://whatever/ part and just
> leaves relative links: this took 0.09 seconds
> -p = also fetches all images, CSS, etc.
>
> You might use -K (big K) as well, to keep a backup of the original
> files alongside the rewritten ones.
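>
> In full, that would be something like:
>
> wget -p -m -k -K http://taliesin.nvg.org/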
>
> BUT: it does this by following links, so anything that doesn't have  
> a link pointing to it won't be copied. Anything that's hidden behind  
> a password won't get copied etc. Wget sees the same things googlebot  
> does, but no more.


Could someone help me with wget? Several years ago I made a local  
mirror of something using wget. It consisted of two index pages and  
all the individual entries those index pages linked to. Both the index  
pages and the entry pages had images in them.

Wget seems to have correctly downloaded the images, but only the index  
pages refer to them on the local filesystem; on the entry pages they  
point back at the web site that originally hosted them. (Curiously,  
the images are still available on the web, even though the HTML files  
are not.)

I tried just now to use wget to correct the image links. I put the old  
mirror in my Apache doc dir and pointed wget at the localhost URL with  
the parameters 'wget -e robots=off -p -m -k {URL}'. It didn't fix the  
images, though -- the ones on the index pages (correctly) refer to  
local files, but the ones on all the individual entry pages still  
refer to the original web site.
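
Concretely (with a made-up path standing in for my real doc-dir URL),
that call looked something like:

wget -e robots=off -p -m -k http://localhost/old-mirror/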

What am I doing wrong? I even tried doing individual entry pages, and  
the Wikipedia main page (without using -m since I don't want it  
recursive), and it still didn't download the images or rewrite the  
links for them. I'm using wget 1.11.4 from Ubuntu 9.04.
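
For what it's worth, the single-page attempts were along these lines
(the exact URL here is just an illustration of what I pointed it at):

wget -e robots=off -p -k http://en.wikipedia.org/wiki/Main_Page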

Thanks in advance for the advice!