Summary
eValid can be used
to assist in identifying orphan files on a server.
This page explains how this is done and makes recommendations.
Background
An orphan file is one that exists on a website's server,
but which cannot be accessed by a user from a browser.
Because an orphan file is by definition unavailable to a user via a browser request,
it cannot degrade a user's perception of the quality of the website.
Orphan files arise through normal website maintenance, through oversights or errors by the webmaster, or from a variety of other causes. In normal operation, orphan files are not a problem, except that too many of them (i) make website maintenance confusing and (ii) may increase the chance of errors.
eValid Solution
When eValid makes a complete site analysis of a website
in Browser Mode, it visits every page and image that can be reached.
On the server side,
the server operating system records each of these files as having been accessed.
If you have high confidence that every reachable file in your website was actually visited, you can be confident that deleting the files that were not visited (the candidate orphan files) won't cause any problems. However, it is good practice to run the complete eValid site analysis again after candidate orphan files are removed -- just to confirm the analysis.
Usage Recommendations
To be sure you visit all accessible files, it is best to run the eValid site
analysis using: (1) Full Browser Mode; (2) an empty cache, either (2a) by deleting the cache
manually before the run, or (2b) by deleting the cache at the start of the run; and (3) minimal
or no Excluded Files, so that the search is as thorough as possible.
These settings may require more time, but the result is more accurate.
UNIX-Based Servers
The "last accessed (used)" attribute for files in any particular folder can be
seen with the ls command using the u option.
You might try the commands ls -lust or ls -lrust (reverse order).
Consult the UNIX documentation for your machine with man ls for complete details on this command.
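The same "last accessed" view can be produced from a script. The following is a minimal Python sketch, not part of eValid: the function name list_by_access_time is hypothetical, and it assumes the server's file system actually records access times (some systems are mounted with options such as noatime that suppress them).

```python
import os
import time


def list_by_access_time(folder, newest_first=True):
    """Return (path, atime) pairs for regular files directly in `folder`,
    sorted by last-access time (roughly what `ls -lut` shows)."""
    entries = []
    with os.scandir(folder) as it:
        for entry in it:
            # Only regular files; subfolders are not descended into.
            if entry.is_file(follow_symlinks=False):
                atime = entry.stat(follow_symlinks=False).st_atime
                entries.append((entry.path, atime))
    entries.sort(key=lambda pair: pair[1], reverse=newest_first)
    return entries


def format_listing(entries):
    """Format the pairs as 'YYYY-MM-DD HH:MM:SS  path' lines in local time."""
    return [
        time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(atime)) + "  " + path
        for path, atime in entries
    ]
```

Calling format_listing(list_by_access_time(".", newest_first=False)) gives the reverse ordering, with the least recently accessed files first, which is usually where candidate orphan files appear.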
Locate all files not accessed within a specified time (the smallest interval is one day) with the command: find . -name "*" -atime +1 -print. The -atime +1 clause causes find to report those files which were last accessed more than one day ago. Because find counts whole 24-hour periods and discards any fraction, -atime +1 in practice matches files last accessed at least two days ago; use -atime +0 to include files last accessed between one and two days ago. If the eValid search was completed less than one day ago, such files are candidate orphan files.
Consult the UNIX documentation for your machine with man find for complete details on this command.
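Where a one-day granularity is too coarse, the comparison can be made against the moment the eValid run started. The following is a minimal Python sketch under stated assumptions: find_candidate_orphans and run_started_at are hypothetical names, you must record the start time of the run yourself (as seconds since the epoch), and the file system must update access times on every read (options such as noatime or relatime suppress or limit those updates and would make the result unreliable).

```python
import os


def find_candidate_orphans(root, run_started_at):
    """Return files under `root` whose last-access time is earlier than
    `run_started_at` (seconds since the epoch).

    Such files were not read during the site analysis and are therefore
    only *candidate* orphan files.
    """
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.stat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if atime < run_started_at:
                candidates.append(path)
    return sorted(candidates)
```

For example, storing time.time() in a variable immediately before starting the eValid run, and passing that value together with the document root once the run completes, lists the files the run did not touch.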
Windows-Based Servers
Windows servers also record the time at which any file was last accessed.
Use Windows File Explorer to display files.
Move to the folder in which you suspect there are orphan files.
Right-click on the menu bar to show the display options.
Click "Accessed" ON.
The display will now show when each file in that folder was last accessed; sort on this column to list the files in the order in which they were accessed.
If the eValid search was completed less than one day ago, files which were not accessed within that time
are candidate orphan files.
eValid Site Analysis Searching Limitation
The eValid SiteMap engine examines the URL string to determine if it is "searchable".
The following are not searchable, but are included in eValid mappings:
Protocols: JAVASCRIPT, MAILTO, NEWS
Suffixes: .gz, .tgz, .tar, .jar, .zip, .css, .xml, .pdf, .doc, .ppt, .gif, .png, .jpg, .jpeg
If an actual URL link exists within one of these types of files it is not visited by eValid because eValid does not scan these non-searchable files for possible links.
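The rule above can be illustrated with a short Python sketch. This is not eValid's implementation: is_searchable is a hypothetical function, and it assumes the suffix test applies to the path portion of the URL (ignoring any query string), which the description above does not specify.

```python
from urllib.parse import urlsplit

# Protocols and suffixes taken from the list above.
NON_SEARCHABLE_SCHEMES = {"javascript", "mailto", "news"}
NON_SEARCHABLE_SUFFIXES = (
    ".gz", ".tgz", ".tar", ".jar", ".zip", ".css", ".xml",
    ".pdf", ".doc", ".ppt", ".gif", ".png", ".jpg", ".jpeg",
)


def is_searchable(url):
    """Return True if `url` would be scanned for further links under the
    assumed rule, False if it would only be recorded in the mapping."""
    parts = urlsplit(url)
    if parts.scheme.lower() in NON_SEARCHABLE_SCHEMES:
        return False
    # Assumption: only the path is examined, case-insensitively.
    return not parts.path.lower().endswith(NON_SEARCHABLE_SUFFIXES)
```

Under this sketch, a link that appears only inside a .pdf or .css file is never followed, so the file it points to will look like a candidate orphan even though a user can reach it.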
The section below provides additional details
about orphan file identification procedures and related risks.
Warnings And Cautions
A file on a website server is only truly an orphan file if
its removal from the server file system
will not cause any failure
evident to the user who views or uses the site through a browser.
There are certain technical problems with all non-browser approaches to identifying orphan files:
Comparing the file names on the server against the files referenced by delivered pages normally requires either direct access to the server files or ftp access to the server files. [The dir command in ftp protocol sessions returns the complete folder filename contents.] For security reasons it may be unwise to provide either type of access.
Delivered pages may mention full URLs, and it is the responsibility of the server's dynamic page generation method to assemble the complete HTML document for each such URL from whatever pieces are required.
Consequently, file name matching as a method for identifying orphan files will generally identify many files as orphans that are essential to website operation, a significant error.
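The false positives described above can be seen in a small Python sketch. All file names here are hypothetical: the example assumes a page index.php that the server assembles from header.inc and footer.inc, pieces that never appear in any URL delivered to the browser.

```python
def naive_orphans(server_files, referenced_files):
    """File name matching: report every server file that no delivered
    page refers to by name."""
    return sorted(set(server_files) - set(referenced_files))


# Hypothetical server folder contents.
SERVER_FILES = ["index.php", "header.inc", "footer.inc", "old_promo.html"]

# Names that appear in URLs seen from the browser side.
REFERENCED_FILES = ["index.php"]

# header.inc and footer.inc are reported as orphans even though index.php
# cannot be generated without them; only old_promo.html is a true orphan.
FLAGGED = naive_orphans(SERVER_FILES, REFERENCED_FILES)
# FLAGGED == ["footer.inc", "header.inc", "old_promo.html"]
```

Two of the three flagged files in this example are essential to the site, which is the kind of error that name matching alone cannot avoid.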
The conclusion [and caution] to be understood here is that server-side orphan file identification and removal needs to be coupled with a systematic recheck of potentially affected page generation to be a completely reliable process. If removal of the file breaks the site [causes a page to download incorrectly] then the file is not an orphan.
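One way to make the recheck safe is to set candidate files aside instead of deleting them outright, then rerun the site analysis and put back anything whose absence breaks a page. The following Python sketch is only an illustration of that idea, not an eValid feature: quarantine, restore, and the holding folder are hypothetical, and the sketch assumes the holding folder lies outside the document root.

```python
import os
import shutil


def quarantine(root, candidates, holding_dir):
    """Move each candidate file from under `root` into `holding_dir`,
    preserving its relative path. Returns (original, moved_to) pairs."""
    moved = []
    for path in candidates:
        rel = os.path.relpath(path, root)
        dest = os.path.join(holding_dir, rel)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(path, dest)
        moved.append((path, dest))
    return moved


def restore(moved):
    """Undo quarantine() for files that turned out not to be orphans."""
    for original, dest in moved:
        parent = os.path.dirname(original)
        if parent:
            os.makedirs(parent, exist_ok=True)
        shutil.move(dest, original)
```

Files still in the holding folder after a clean rerun of the complete site analysis can then be deleted with much greater confidence.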