If you want to find sensitive documents using Google search (more or less, documents containing consequential information that someone does not want revealed), I’ve found that, in addition to targeting queries at specific domains and file types, an alternative and potent approach is to restrict your results to files residing on an FTP server.
The rationale is that, while many FTP servers allow anonymous log-in and even more are indexed by Google, they are used more for uploading, downloading, and storing files than for viewing pages, and they typically house more office-type documents (as well as software). Limiting your searches to FTP servers also significantly reduces the overall number of results returned, so choice keywords combined with a query that tells Google to bring back files that have “ftp://” but NOT “http://” or “https://” in the URL yield a high density of relevant results. This search type is easily executed:
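The query itself can be pieced together from the description above. What follows is a minimal sketch, assuming Google's `inurl:` and `filetype:` operators behave as described; the function names `build_ftp_query` and `search_url` are hypothetical, and the keywords are placeholders.

```python
from urllib.parse import urlencode


def build_ftp_query(keywords, filetype=None):
    """Compose a Google query limited to ftp:// URLs (operator behavior is assumed)."""
    # Quote multi-word keywords so they are matched as a phrase.
    parts = [f'"{kw}"' if " " in kw else kw for kw in keywords]
    # Keep results with ftp:// in the URL and drop http:// and https:// ones.
    parts += ['inurl:"ftp://"', '-inurl:"http://"', '-inurl:"https://"']
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


def search_url(query):
    """Return the address a browser would open for this query."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    q = build_ftp_query(["quarterly report"], filetype="xls")
    print(q)
    print(search_url(q))
```

Pasting the printed query straight into the search box works just as well; the helper only keeps the three `inurl:` terms consistent from one search to the next.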
A caveat with this method is that before long Google will present you with a “captcha.” Many, many websites use captchas, and pretty much everyone who uses the internet has encountered one. The basic idea behind a captcha is to prevent people from using programs to send automated requests to a web server. Captchas are a main tool in fighting spam: they thwart bots that mine the internet for email addresses and other data, and that register for online accounts and other services en masse. The captcha presents the user with a natural language problem to which they must provide an answer.
Google is also continuously updating its code to make it difficult to exploit Google “dorks,” queries that use advanced operators much like the one described above (but usually more technical and specific). Dorks are mostly geared toward penetration testers looking for web application and other vulnerabilities, but the cracker’s tools can easily be adapted for open source research.
Unless you are in fact a machine (sometimes you’re a machine, in which case there are solutions), the captcha should be easily solved. Lately, however, instead of returning me to my search after I answer the captcha, Google has been sending me back to the first results page of my query, forcing me to more or less restart the browsing process and to encounter another captcha. I’m calling it a Google Governor, as it seems to throttle searchers’ ability to employ high-powered queries.
The good news is that the workaround is really just smart searching. One thing you’ll notice upon browsing your results is that dozens of files from the same irrelevant site will be presented. Eliminate these by adding -inurl:"websitenameistupid.com" (which tells Google to exclude any result with exactly “websitenameistupid.com” in the URL). Further restrict your results by omitting sites in foreign domains (especially useful with acronym-based keyword searches): -site:cz -site:nk.
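The pruning step above can be sketched as a small helper that appends exclusions to whatever query you are already running. This is an illustration only: `add_exclusions` is a hypothetical name, the base query is a placeholder, and the host and domain values are the examples from the text.

```python
def add_exclusions(query, noisy_hosts=(), skipped_tlds=()):
    """Append -inurl: and -site: exclusions to an existing search query."""
    parts = [query]
    # Drop results whose URL contains a host that floods the listing.
    parts += [f'-inurl:"{host}"' for host in noisy_hosts]
    # Drop whole top-level domains; a leading dot is tolerated.
    parts += [f"-site:{tld.lstrip('.')}" for tld in skipped_tlds]
    return " ".join(parts)


if __name__ == "__main__":
    base = 'budget inurl:"ftp://"'  # placeholder query
    print(add_exclusions(base, ["websitenameistupid.com"], ["cz", "nk"]))
```

Keeping the exclusions in a list makes it easy to grow them as you browse: each time an irrelevant site crowds a results page, add it and rerun the search, which also means fewer pages to click through before the next captcha.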
When you find an FTP site which looks interesting, copy and paste the URL into a client like FileZilla for easier browsing.
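If you would rather type the connection details than paste the whole URL, the pieces a client asks for can be pulled out of the link. A minimal sketch, assuming a standard `ftp://` URL; `ftp_connection_fields` is a hypothetical helper, `ftp.example.org` is a placeholder host, and the defaults (port 21, user `anonymous`) are the usual conventions for anonymous FTP rather than anything specific to one client.

```python
from urllib.parse import unquote, urlsplit


def ftp_connection_fields(url):
    """Split an ftp:// URL into the fields an FTP client typically asks for."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an ftp:// URL: {url!r}")
    # Decode percent-escapes (e.g. %20) so the path matches what the server shows.
    path = unquote(parts.path)
    directory, _, filename = path.rpartition("/")
    return {
        "host": parts.hostname,
        "port": parts.port or 21,                 # default FTP control port
        "user": parts.username or "anonymous",    # conventional anonymous login
        "directory": directory or "/",
        "file": filename,
    }


if __name__ == "__main__":
    print(ftp_connection_fields("ftp://ftp.example.org/pub/docs/report%202010.xls"))
```

Starting from the `directory` value rather than the file itself puts you in the containing folder, which is usually where the interesting neighbors of a search hit are.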
To give you an idea of the sensitivity of documents that can be found: one folder was titled “[Name] PW and Signature” and contained dozens of files with passwords as well as .crt, .pem, and .key files; another, titled “admin10,” contained the file “passwords.xls.” This was the site of a Department of Defense and Department of Homeland Security contractor, and the document contained the log-in credentials for bank accounts, utilities, and government portals. This particular document is of more interest to the penetration tester; for our purposes it serves as a meter for the sensitivity of the gigabytes of files that accompanied it on the server. The recklessness of the uploader exposed internal details of dozens of corporations and their business with government agencies.
*As of this writing, the FTP server mentioned above is no longer accessible.