arguing a point: Anything online is searchable

message from darrel on 23 Jul 2004
I need to present my argument to some managers that anything you post online
is indexable/searchable.

What they want is a way to post content publicly on the site, but then not
have things like google be able to index the content so it can be searched
by others.

My argument as to why that's a silly request:

1) Anything posted online, that's not behind a security wall (pwd/login) can
be seen by pretty much anyone else.

2) While google will obey robots.txt files, plenty of less polite search
bots/screen scrapers will gladly ignore it.

3) Alternative file formats (PDFs, Images) are also indexable. Things like
archive.org can grab the image and google can parse PDFs.
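
To illustrate point 2, here is a minimal sketch of what "obeying robots.txt" amounts to, using Python's standard urllib.robotparser. The robots.txt contents, the /reports/ path and the mysite.invalid URLs are made up for illustration. The check runs inside the crawler itself, so a bot that never runs it is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to stay out of /reports/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /reports/
"""


def polite_bot_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a well-behaved crawler would consider the URL allowed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite bot asks first and skips the disallowed path.
print(polite_bot_may_fetch(ROBOTS_TXT, "Googlebot", "http://mysite.invalid/reports/q2.pdf"))
print(polite_bot_may_fetch(ROBOTS_TXT, "Googlebot", "http://mysite.invalid/index.html"))
```

Nothing on the server enforces the rule. The file is still served to anyone who requests it directly.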

Any other for/against arguments?

-Darrel
 
Mad Dog replied to darrel on 24 Jul 2004
Google searches them. Haven't you ever done a search and had it turn up PDF
files? Happens to me all the time.

MD

 
darrel replied to darrel on 23 Jul 2004
Along these lines, I remember recently seeing an article/study (from MIT?)
showing how it's fairly easy to circumvent those 'prove you are human'
log-ins (those screens where you have to enter a distorted/blurred series
of characters to 'prove' you are human).

Does anyone remember seeing that link?

-Darrel
 
Mad Dog replied to darrel on 23 Jul 2004
I'm curious how they'd do that, since those are graphics. Unless it has more
 
darrel replied to Mad Dog on 23 Jul 2004
I'll do some digging. It was pretty much just basic OCR with some fancier
algorithms to account for the distortions. I think the test got the program
up to the 80% accuracy point.

-Darrel
 
darrel replied to Mad Dog on 23 Jul 2004
http://www.cs.berkeley.edu/%7Emori/gimpy/gimpy.html

-Darrel
 
Joe Makowiec replied to darrel on 24 Jul 2004
Without doing more than skim the article - does the name gimpy have
anything to do with TheGIMP?

Sure - MIT, Berkeley... I can see how you got them confused...

<gdr />
 
Dan Vendel *GOF* replied to darrel on 24 Jul 2004
Not only PDF files by Adobe software, but any PDF.
 
Murray *TMM* replied to darrel on 23 Jul 2004
If I upload a file into my domain, and if that file is not linked to any
other file, or if no other file links to it, how can the search engines find
it?
 
darrel replied to Murray *TMM* on 23 Jul 2004
Well, that's not really publicly available, is it? ;o)

Now, you could send the link to someone, and only they could see it, but
they might then post the link on their blog, and then someone else links to
it, etc. I'm trying to argue that there is simply no way to guarantee that
a file you upload and make available to the public will not eventually be
indexed by some search tool/archiver/screen scraper out there somewhere.
You can lessen the likelihood of that (like saving all your documents as
big JPG images with watermarks), but you're still not guaranteeing anything,
and you're most likely making the content less accessible in the interim.

-Darrel
 
Murray *TMM* replied to darrel on 23 Jul 2004
I see what you mean.
 
Thierry Koblentz replied to Murray *TMM* on 23 Jul 2004
BTW, other files could link to it through JavaScript. I don't think SE bots
parse the location.href value.

Thierry
 
darrel replied to Thierry Koblentz on 23 Jul 2004
True, but then that becomes less accessible, and a competent programmer
could probably make a bot that WOULD read the JavaScript links.
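
As a rough sketch of how little that would take, the snippet below pulls location.href targets out of a page with one regular expression. The sample markup, the file names and the find_script_links helper are invented for illustration. A real crawler would need to handle far more JavaScript patterns than this.

```python
import re
from typing import List

# Matches assignments such as  location.href='page.html'
# or  window.location.href = "page.html"  (single or double quotes).
LOCATION_HREF = re.compile(r"""location\.href\s*=\s*(['"])(.*?)\1""")


def find_script_links(html: str) -> List[str]:
    """Return every URL assigned to location.href in the given markup."""
    return [match.group(2) for match in LOCATION_HREF.finditer(html)]


# Hypothetical page that "hides" its links from ordinary <a href> parsing.
SAMPLE = """<a href="#" onclick="location.href='hidden/report.pdf'; return false;">Report</a>
<script>function go() { window.location.href = "private/notes.html"; }</script>"""

print(find_script_links(SAMPLE))
```

A bot that adds these targets to its crawl queue reaches the same files a visitor would by clicking.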

I'm trying to convince them that they should either make them publicly
available online and leave it at that, or just not post them at all. Trying
to do something 'in between' is somewhat futile and going to cause more
problems/waste more time than it's worth.

-Darrel
 
Mad Dog replied to darrel on 23 Jul 2004
First, a big point is that even if something online isn't searchable today,
that doesn't mean it's not POTENTIALLY searchable. The rules change, spiders
change, someone could link to one of your pages and a bot follows that link.
Anything "could" be searchable, especially when you consider that virtually
everything is being archived and has been for years.

Now......what if those things were in password protected areas?

MD

 
darrel replied to Mad Dog on 23 Jul 2004
Exactly. Well put. That's worded much better than what I was going to use.
;o)

Right...that'd be the only way to guarantee that the files couldn't be
indexed without human intervention.

-Darrel
 
Thierry Koblentz replied to darrel on 23 Jul 2004
Hi Darrel,
I was just commenting on Murray's post.
I'd agree with you: if it is "online", sooner or later it becomes *public*.

Thierry
 
Joe Makowiec replied to Murray *TMM* on 23 Jul 2004
The only thing I can think of is if it's an easily guessed address (say,
mysite.invalid/contact.html). I don't think 'good' bots (Google,
almaden, etc) would find it, but others (address harvester bots) might
think to look in such places. But I don't recall any of my test pages,
which I leave for long periods of time in a /test/ directory, getting
touched by a search bot unless they're linked from somewhere.
 

Archived message: arguing a point: Anything online is searchable (Macromedia Dreamweaver)