eValid -- Orphan File Identification

eValid™ -- Automated Web Quality Solution
Browser-Based, Client-Side, Functional Testing & Validation,
Load & Performance Tuning, Page Timing, Website Analysis,
and Rich Internet Application Monitoring.


Summary
eValid can help identify orphan files on a server. This page explains how this is done and makes recommendations.

Background
An orphan file is one that exists on a website's server but cannot be accessed by a user from a browser. Because an orphan file is by definition unavailable to a user via a browser request, it cannot by itself give a user a poor impression of the site's quality.

Orphan files arise through normal website maintenance, through oversights or errors by the webmaster, or from a variety of other causes. In normal operation, orphan files are not a problem, except that too many of them: (i) make website maintenance confusing; and (ii) may increase the chance of errors.

eValid Solution
When eValid makes a complete site analysis of a website in Browser Mode, it visits every page and image that can be reached. On the server side, the server operating system records each visited file as having been accessed.

If you have high confidence that every file your website actually uses was visited, you can be confident that deleting the remaining (orphan) files won't cause any problems. However, it is good practice to run the complete eValid site analysis again after candidate orphan files are removed -- just to confirm the analysis.

Usage Recommendations
To ensure that you visit all accessible files, it is best to run the eValid site analysis: (1) in Full Browser Mode; (2) with an empty cache, either (2a) by deleting the cache manually beforehand or (2b) by having the cache deleted at the start of the run; and (3) with minimal or no Excluded Files, so that the search is as thorough as possible. These settings may require more time, but the result is more accurate.

UNIX-Based Servers
The "last accessed (used)" attribute for files in any particular folder can be seen with the ls command using the u option. You might try the commands ls -lust or ls -lrust (reverse order).

Consult the UNIX documentation for your machine with man ls for complete details on this command.

Locate all files not accessed within a specified time (the smallest interval is one day) with the command: find . -name "*" -atime +1 -print. The -atime +1 clause causes find to report those files that were last accessed more than one day ago. Note that find counts whole 24-hour periods and ignores any fraction, so with GNU find a file must have been last accessed at least two days ago to match +1. If the eValid search was completed less than one day ago, such files are candidate orphan files.

Consult the UNIX documentation for your machine with man find for complete details on this command.

Windows-Based Servers
Windows servers also record the time at which any file was last accessed. Use Windows File Explorer to display the files. Move to the folder in which you suspect there are orphan files. Right-click on the menu bar to show the display options. Click "Accessed" ON. The display will now show the last-accessed time of each file in that folder, and the files can be ordered by it. If the eValid search was completed less than one day ago, files which were not accessed within that time are candidate orphan files.

eValid Site Analysis Searching Limitation
The eValid SiteMap engine examines the URL string to determine if it is "searchable". The following are not searchable, but are included in eValid mappings:

Protocols: JAVASCRIPT, MAILTO, NEWS
Suffixes: .gz, .tgz, .tar, .jar, .zip, .css, .xml, .pdf, .doc, .ppt, .gif, .png, .jpg, .jpeg

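If an actual URL link exists within one of these types of files, it is not visited by eValid, because eValid does not scan these non-searchable files for possible links. A file that can be reached only through such a link will therefore appear unvisited even though it is not an orphan.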
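
The rule above can be illustrated with a short Python sketch. This is only an approximation built from the two lists on this page, not eValid's actual implementation; the function name is_searchable and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Protocols and suffixes copied from the lists above.
NON_SEARCHABLE_PROTOCOLS = {"javascript", "mailto", "news"}
NON_SEARCHABLE_SUFFIXES = (
    ".gz", ".tgz", ".tar", ".jar", ".zip", ".css", ".xml",
    ".pdf", ".doc", ".ppt", ".gif", ".png", ".jpg", ".jpeg",
)


def is_searchable(url):
    """Return True if the URL string would be scanned for further links."""
    parts = urlsplit(url)
    if parts.scheme.lower() in NON_SEARCHABLE_PROTOCOLS:
        return False
    # Only the path is examined, so a query string does not hide the suffix.
    return not parts.path.lower().endswith(NON_SEARCHABLE_SUFFIXES)
```

Under these assumptions, is_searchable("http://www.example.com/index.html") is True, while a mailto: address or a link ending in .pdf gives False.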

The section below provides additional details about orphan file identification procedures and related risks.

Warnings And Cautions
A file on a website server is only truly an orphan file if its removal from the server file system will not cause any failure evident to the user who views or uses the site through a browser.

There are certain technical problems with all non-browser approaches to identifying orphan files:

WebSite Access Requirement. Some methods for identifying orphan files [but not those used in eValid] operate by checking whether the file names found in server folders appear within the static pages delivered to the client browser.
This comparison normally requires either direct access to the server files or ftp access to the server files. [The dir command in ftp protocol sessions returns the complete folder filename contents.] For security reasons it may be unwise to provide either type of access.
Static Sites. For 100% static sites this method (finding a filename string match within delivered HTML) can be reliable if used with extreme care. However, some files on a static site, e.g. name.cgi, would not necessarily appear in any other file but could be essential to website operation. Caution is advised.
Dynamic Sites. In a typical dynamically generated website, the names of files on the server do not necessarily correspond to the file names or URLs mentioned in delivered HTML pages.
Delivered pages may mention full URLs, and it is the dynamic server page generation method's responsibility to resolve each such URL into a complete HTML document from whatever pieces are required.
Consequently, file name matching as a method for identifying orphan files will generally identify as orphans many files that are essential to website operation, a significant error.
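
The weakness of file name matching can be seen in a small Python sketch. It is a deliberately naive illustration of the non-browser method described above, not something eValid does; the function name name_match_orphans, the file list, and the page content are all invented for the example.

```python
import posixpath


def name_match_orphans(server_files, delivered_pages):
    """Flag server files whose base name appears in no delivered page."""
    flagged = []
    for path in server_files:
        base = posixpath.basename(path)
        if not any(base in html for html in delivered_pages):
            flagged.append(path)
    return flagged


# Hypothetical server contents: name.cgi handles the /search form,
# but its file name never appears in the delivered HTML.
SERVER_FILES = [
    "index.html",
    "images/logo.gif",
    "cgi-bin/name.cgi",
    "old/draft.html",
]
DELIVERED_PAGES = [
    '<a href="index.html">Home</a> <img src="images/logo.gif">'
    ' <form action="/search"></form>',
]

FLAGGED = name_match_orphans(SERVER_FILES, DELIVERED_PAGES)
```

Here FLAGGED contains both old/draft.html, which may well be a true orphan, and cgi-bin/name.cgi, which the site needs. String matching alone cannot tell the two apart.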

The conclusion [and caution] to be understood here is that server-side orphan file identification and removal needs to be coupled with systematic recheck of potentially affected page generation to be a completely reliable process. If removal of the file breaks the site [causes a page to download incorrectly] then the file is not an orphan.
