Monday, February 11, 2008

Speeding Up Web Page Loading

Speeding Up Web Page Loading - Part I


As more and more businesses go online, just having a web presence is no longer enough to succeed. It takes a reliable, high-performance Web site that loads quickly too. After all, nothing makes an Internet user leave a site quicker than having to wait ages for a web page to load.

A previous post briefly identified the factors that determine how fast (or slow) your web pages load, namely:

* Size (of your web page)

* Connectivity (quality of your host's network connections and bandwidth)

* Number (of sites sharing your server).

This article will now discuss ways that webmasters can ensure their sites' pages load quickly and efficiently, by focusing on the first factor.

File size - the total of the file sizes of all the parts of your web page (graphics, music files, HTML, etc.) should be small enough to download quickly. A reasonably fast-loading page weighs in at around 50 - 70 KB, with up to 120 KB for more graphics-intensive pages. You can optimize your file size by:

1. Reducing page weight:
* Eliminate unnecessary whitespace (use tools like HTML Tidy to automatically strip leading whitespace and extra blank lines from valid HTML source) and comments
* Cut down on extras (buttons, graphics) and don't put a lot of graphics and big midi files on the same page
* Move webrings from your homepage to their own page
* Reduce the file size of some of your graphics (use GifBot, an on-line gif reducer at Net Mechanic)
* Redesign long pages so the content is spread over two pages instead of just one
2. Reducing the number of inline scripts or moving them into external files - inline scripts slow down page loading since the parser must assume that an inline script can modify the page structure. You can:
* Reduce the use of document.write to output content
* Use modern W3C DOM methods to manipulate page content for modern browsers rather than older approaches based on document.write
* Use modern CSS and valid markup - CSS reduces the amount of markup as well as the need for images for layout. It can also replace images that are really just pictures of text. Valid markup saves browsers from having to perform "error correction" when parsing the HTML and allows free use of other tools which can pre-process your web pages.
* Minimize the number of CSS and script files for performance, while keeping unrelated CSS and scripts in separate files for easier maintenance
* Use External HTML Loading - this involves loading content in an IFrame (for Internet Explorer and Netscape 6) and then shifting that content via innerHTML over to a regular tag such as a div. Benefits: it keeps initial load times down to a minimum and provides a way to easily manage your content. Downside: the content has to be loaded along with all the interface elements, which can severely impair the user experience of the page. A tutorial on externally loading HTML can be found here.
3. Minimizing the number of files referenced in a web page to lower the number of HTTP connections required to download a page
4. Reducing domain lookups (since each separate domain costs time in a DNS lookup) - be careful to use the minimum number of different domains in your pages
5. Chunking your content - the size of the full page is less important if the user can quickly start acting on some information. How?
* Replace table-based layout with divs
* Break tables into smaller ones that can be displayed without having to download the entire page's content
o Avoid nesting tables
o Keep tables short
o Avoid using tables to lay out elements on the page
o Exploit several coding techniques:
+ split the page layout into multiple independent tables to preserve the browsers' ability to render each of them step-by-step (use either vertically stacked or horizontally stacked tables)
+ use floating tables or regular HTML codes that flow around the floating objects
+ use the fixed table-layout CSS attribute
* Order page components optimally - successive transmission of the DHTML code enables the browser to render the page during loading
o download page content first (so users get the quickest apparent response for page loading) along with any CSS or script required for its display;
o initially disable any DHTML features that require the page to finish loading, and enable them only after the page loads;
o allow the DHTML scripts to be loaded after the page contents to improve the page load's overall appearance
6. Specifying image and table sizes - browsers are able to display web pages without having to reflow content if they can immediately determine the height and/or width of your images and tables
7. Using software and image compression technology
* Use tools that can "compress" JavaScript by reformatting or obfuscating the source and reducing long identifiers to shorter versions
* Use mod_gzip, a compression module based on the standard zlib compression library, to compress output - compressing the data sent out from the Web server, and having the browser decompress it on the fly, reduces the amount of data transferred and speeds up page display; HTTP compression has been reported to yield a 150-160% performance gain (web page sizes can be reduced by as much as 90%, and images by up to 50%); a small sketch after this list illustrates the kind of reduction gzip achieves
8. Caching previously received data/reused content - make sure that any content that can be cached is cached, with appropriate expiration times. Caching engines reduce page loading time and script execution by performing optimizations and various types of caching; this can cut latency by as much as 20-fold by preventing dynamic pages from repeating work and reducing the turnaround time for each request
9. Choosing your user agent requirements wisely - specify reasonable user agent requirements for projects; basic minimum requirements should be based upon modern browsers which support the relevant standards
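
To get a rough feel for the kind of reduction HTTP compression (item 7 above) can deliver, here is a minimal Python sketch that gzips a local copy of a page and reports the before/after sizes. The file name is just a placeholder, and the numbers you see will depend entirely on your own markup:

    # Minimal sketch: estimate how much gzip compression shrinks an HTML page.
    # The file name is only a placeholder -- point it at any page you have locally.
    import gzip

    PAGE = "index.html"  # hypothetical local copy of a page

    with open(PAGE, "rb") as f:
        raw = f.read()

    compressed = gzip.compress(raw)

    print(f"Uncompressed: {len(raw):,} bytes")
    print(f"Gzipped:      {len(compressed):,} bytes")
    print(f"Reduction:    {100 * (1 - len(compressed) / len(raw)):.1f}%")
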
The next post will focus on the other two factors, as well as other ways that webmasters can speed up their web page loading.

Speeding Up Web Page Loading - Part II


In Part I, we detailed how webmasters can speed up the loading of their web pages by optimizing their file sizes. Here, some additional tips to make pages load faster will be discussed.

Another factor to consider is the speed at which the pages are served. What happens is that servers get bogged down if too many web surfers ask for the same page at the same time, resulting in a slowdown in loading speed.

Although there is no way to predict exactly how many people will visit a site at once, it is always a good idea to choose a web hosting company that tunes its servers to make sure that enough computing power is given to the sites that get the most hits.

You can opt for hosts, like LyphaT Networks, that use caching and/or compression software to maximize the performance of their servers and minimize page loading times.

Another consideration is your host's connectivity or speed of Internet connection and bandwidth. Bandwidth refers to the amount of data that can be transmitted in a fixed amount of time and this actually fluctuates while you are surfing. Different users also have different access to the Internet (some might use dial-up or a dedicated T-1) so it is up to you to keep your file sizes down so that no matter who is viewing your site, they get as quick a download as possible.

Some ways you can do this are:

* Testing your page loading time with low bandwidth emulation - you can use the mod_bandwidth module for this if you're running an Apache Web server. This module enables you to set connection bandwidth limits to better emulate typical modem speeds.
* Pinging your site (reply time should be around 100 ms or less) and then running tracert - each hop/transit point should take less than 100 ms, and if it takes longer or times out, the connection could be slow at that point.

You can check your results against the table at the Living Internet site showing the number of seconds it takes to download data of various sizes at varying Internet connection speeds; the sketch after this list gives a rough way to estimate this yourself.
* Using the HTML Toolbox program at Net Mechanic, or the Web Page Analyzer - 0.82, a free web-based analyzer that calculates page size, composition and page download time.
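
If you just want a rough back-of-the-envelope figure rather than the full table, the following Python sketch estimates download times for a few assumed page weights and connection speeds (the sizes and speeds are illustrative, and real-world times also include DNS lookups, latency and server delays):

    # Rough estimate of page download time at different connection speeds.
    # Page sizes and speeds are illustrative assumptions, not measurements.
    PAGE_SIZES_KB = [50, 70, 120]                       # typical weights from Part I
    SPEEDS_KBPS = {"56k modem": 56, "ISDN 128k": 128, "T-1 (1.544 Mbps)": 1544}

    for size_kb in PAGE_SIZES_KB:
        for name, kbps in SPEEDS_KBPS.items():
            seconds = (size_kb * 8) / kbps              # kilobytes -> kilobits, over kbit/s
            print(f"{size_kb:4d} KB over {name:17s}: ~{seconds:5.1f} s")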

Tracking Web Site Traffic


When you establish an online presence, you're basically after one thing, to get your message across to Internet users. You don't set up a website just so people can ignore it, do you?

Whether or not you are running mission critical ecommerce sites or online marketing campaigns, as a webmaster, you're naturally curious about your site's visitors.

But first, it is important to distinguish what kind of visitors go to your site. According to Yari McGauley, in his article Web Tracking & Visitor Stats Articles, websites get two kinds of visitors: normal visitors (people) and robots (any kind of automatic 'web crawling' or 'spidering' program), ranging from search engines to link and website availability checkers to spam/email harvesters.

So how can you find out more information about your visitors? There are a number of ways.

1. install a counter at your site - a counter simply provides an indication of the number of visitors to a particular page; it usually counts hits (a hit is a single request from a browser to a server), which is not a reliable indicator of website traffic since many hits are generated by a single page visit (both for the request itself and for each component of the page)
2. use logfiles - if your server is enabled to do it (check with your web host), then every action on the server is logged in logfiles (basically text files describing actions on the site); in their raw form, logfiles can be unmanageable and even misleading because of their huge size and the fact that they record every 'hit' or individual download, so you need to analyze the data (a bare-bones log-parsing sketch follows this list)

There are 2 ways this can be done:
* Download the logfiles via FTP and then use a logfile analyzer to crunch the logfiles and produce nice easy to read charts and tables
* Use software that runs on the server that lets you look at the logfile data in real-time

Some logfile analyzers are available free from the Web (ex. Analog), though commercial analyzers (ex. Wusage, WebTrends, Sane Solutions' NetTracker, WebTracker) tend to offer more features and are more user-friendly in terms of presentation
3. use a tracker - generally, each tracker will require you to insert a small block of HTML or JavaScript into each page to be tracked; gives some indication of how visitors navigate through your site: how many visitors you had (per page); when they visited; where they came from; what search engine queries they used to find your site; what factors led them to your site (links, ads etc).

Tracking tools also:
* provide activity statistics - which pages are the most popular and which the most neglected
* aggregate visitor traffic data into meaningful reports to help make website management decisions on a daily basis (ex. content updates)
4. third party analysis - services exist which offer to analyze your traffic in real time for a monthly fee; this works as follows:
* you place a small section of code on any page you want to track
* the information generated whenever the page is viewed is stored by the third party's server
* the server makes the information available in real time for viewing as charts and tables
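
To make item 2 more concrete, here is a bare-bones Python sketch that walks a logfile and separates raw hits from HTML page views. The log file name and the common/combined Apache log format are assumptions for illustration; a real analyzer such as Analog does far more:

    # Count raw hits vs. HTML page views in a web server access log.
    # Assumes the common/combined Apache log format and a local "access.log";
    # both are assumptions made purely for illustration.
    import re
    from collections import Counter

    LOG_FILE = "access.log"
    # e.g. 127.0.0.1 - - [11/Feb/2008:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 5120
    REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')

    hits = 0
    page_views = Counter()

    with open(LOG_FILE) as log:
        for line in log:
            match = REQUEST_RE.search(line)
            if not match:
                continue
            hits += 1                                  # every logged request is a "hit"
            path = match.group(1).split("?")[0]
            if path.endswith((".html", ".htm", "/")):
                page_views[path] += 1                  # only pages count as page views

    print(f"Total hits: {hits}")
    print(f"Total page views: {sum(page_views.values())}")
    for path, count in page_views.most_common(10):
        print(f"{count:6d}  {path}")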

*OpenTracker is a live tracking system that falls somewhere between 3 and 4. You might notice, however, that tracking services report lower traffic numbers than log files. Why? Because good tracking services use browser cookies as their basis, and so do not count the following as unique visits or human events:

* repeat unique visitors (after 24 hours)
* hits
* robot and spider traffic
* rotating IP numbers (i.e. AOL)

A good tracking service also distinguishes how many unique visitors come from the same ISP, corporate firewall, or large organization; otherwise all these users would be counted as the same visitor. Log analyzers, on the other hand, record all measurable activity and do not distinguish between human and server activities.

So why are web traffic statistics important? Because they help you fine-tune your web marketing strategy by telling you:

* Which pages are most popular, which are least used
* Who is visiting your site
* Which browsers to optimize your pages for
* Which banner ads are bringing the most visitors
* Where errors or bad links may be occurring in your pages
* Which search engines are sending you traffic
* Which keywords are used to find your site
* Which factors affect your search engine rankings and results
* Where your traffic is coming from: search engines or other web sites
* Whether your efforts to generate new customers and sales leads (such as newsletter signups and free product trials) are working or not
* Which are your most common entry pages and exit pages

Broken Link Checkers


One of the basic things that webmasters need to master is the use of links. It's what makes the Internet go round, so to speak. Links are simple enough to learn and code. But sometimes, we make mistakes and end up with broken links (particularly if we're coding manually) or even dead ones (if we don't update content that often).

To an Internet user, there's nothing more frustrating than clicking on links that give nothing but error messages (alongside those pop-up ads, of course), and as a result, they may leave your site. That's not so bad if it's just a hobby site, but what if you're running e-commerce sites? Or if you're trying to get your website registered with search engines?

I know manually checking for broken/dead links can be time consuming, not to mention migraine-inducing. So what's your recourse? Automated link checkers, of course! There are a number of them available online (a bare-bones sketch of how such a checker works follows the list below).

Here are some (latest versions), available either for free or under GPL, for your consideration:

* LinkChecker v1.12.2 - a Python script for checking your HTML documents for broken links
* Checkbot v1.75 - written in Perl; a tool to verify links on a set of HTML pages; creates a report summarizing all links that caused some kind of warning or error
* Checklinks 1.0.1 - written in Perl; checks the validity of all HTML links on a Web site; start at one or more "seed" HTML files, and recursively test all URLs found at that site; doesn't follow URLs at other sites, but checks their existence; supports SSI (SHTML files), the latest Web standards, directory aliases, and other server options
* Dead Link Check v0.4.0 - simple HTTP link checker written in Perl; can process a link cache file to speed up repeated requests (link lifetimes are timestamp-enforced); initially created as an extension to Public Bookmark Generator, but can be used by itself as is
* gURLChecker v0.6.7 - written in C; a graphical web links checker for GNU/Linux and other POSIX OS; under GPL license
* JCheckLinks v0.4b - a Java application which validates hyperlinks in web sites; should run on any Java 1.1.7 virtual machine; licensing terms are LGPL, with the main app class being GPL
* Linklint v2.3.5 - an Open Source Perl program that checks links on web sites; licensed under the Gnu General Public License
* LinkStatus v0.1.1 - written in C++; an Open Source tool for checking links in a web page; discontinued and forked into KLinkStatus, a more powerful application for KDE (which makes it hard for Windows and Mac users to build); KLinkStatus v0.1-b1 is at KDE-Apps.org
* Xenu's Link Sleuth v1.2e - a free tool that checks Web sites for broken links; displays a continuously updated list of URLs sortable by categories; Platform(s): Win 95/98/ME/NT/2000/XP
* Echelon Link Checker - a free CGI & Perl script from Echelon Design; you simply edit a few variables at the top of the script, set the URL of the page you want, and it will go to that page, get all the links, and check each link to see whether it's "dead" or not; allows you to set what word or words define a dead page, such as 404 or 500; Platform(s): All
* Link Checker (CMD or Web v1.4) - the CMD version can check approximately 170 links in about 40 seconds; the CGI version takes about a minute and 10 seconds; very accurate; scans for dead links (not just 404 errors but any error that prevents the page from loading); Platform(s): All
* phplinkchecker - a modified freeware version of the old PHP Kung Foo Link Checker; reports the status (200, 404, 401, etc.) of a link and breaks the report down to show useful stats; used for finding broken or working links on any page; can be easily modified for any specific use; Platform(s): Unix, Windows
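
As promised above, here is a bare-bones Python sketch of what these tools do at their core: fetch one page, pull out its links, and probe each one. The start URL is a placeholder, and real checkers like those listed also handle recursion, redirects, robots.txt and rate limiting:

    # Bare-bones link checker sketch: fetch one page, extract its links and
    # flag any that fail to load. The start URL is a placeholder.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError

    START_URL = "http://www.example.com/"  # placeholder URL

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(START_URL, value))

    def status_of(url):
        try:
            with urlopen(Request(url, method="HEAD"), timeout=10) as resp:
                return resp.status
        except HTTPError as err:
            return err.code
        except URLError as err:
            return f"error: {err.reason}"

    page = urlopen(START_URL, timeout=10).read().decode("utf-8", errors="replace")
    extractor = LinkExtractor()
    extractor.feed(page)

    for link in sorted(set(extractor.links)):
        status = status_of(link)
        if status not in (200, 301, 302):              # tolerate simple redirects
            print(f"BROKEN? {status}  {link}")
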

You can also have your URL's links checked (for free) at the following sites:

* 2bone's LinkChecker 1.2 - allows site owners to quickly and easily check the links on their pages; allows users to add their link to 2bone's links section; added (as of Jan 2004) an option to see all results returned on a single page or use the quicker 10 links per results page
* Search Engine Optimising - via its Website Broken Links Checker Platform(s): All
* Dead-Links.com - via its Free Online Broken Link Checker from Dead-Links.com; spider-based technology and super fast online analysis

With all these resources available at no cost to you, there's really no reason why you should still have those broken and dead links around.

Caching Web Site for Speed


MarketingTerms.com defines caching as the 'storage of Web files for later re-use at a point more quickly accessed by the end user,' the main objective of which is to make efficient use of resources and speed the delivery of content to the end user.

How does it work?

Well, Guy Provost offers a more detailed explanation of How Caching Works, but simply put, a web cache sits between the origin Web server(s) and the client(s); it saves a copy of each HTML page, image and file (collectively known as objects) as they are requested, and uses these copies to fulfill subsequent requests for the same object(s) instead of asking the origin server again.
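
The lookup logic is simple enough to sketch in a few lines of Python. The fetch_from_origin() helper and the fixed 300-second freshness lifetime below are made-up stand-ins, not part of any real cache product:

    # Minimal sketch of the cache lookup described above: serve a stored copy
    # while it is still fresh, otherwise fetch from the origin and remember it.
    import time

    CACHE = {}                 # url -> (stored_at, body)
    FRESH_FOR_SECONDS = 300    # assumed freshness lifetime

    def fetch_from_origin(url):
        # Stand-in for a real HTTP request to the origin server.
        return f"<html>content of {url}</html>"

    def get(url):
        entry = CACHE.get(url)
        if entry is not None:
            stored_at, body = entry
            if time.time() - stored_at < FRESH_FOR_SECONDS:
                return body                      # cache hit: no trip to the origin
        body = fetch_from_origin(url)            # miss or stale: go to the origin
        CACHE[url] = (time.time(), body)
        return body

    print(get("http://www.example.com/index.html"))  # miss -> fetched from origin
    print(get("http://www.example.com/index.html"))  # hit  -> served from the cache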

Advantages:

* if planned well, caches can help your Web site load faster and be more responsive by reducing latency - since responses for cached requests are available immediately, and closer to the client being served, there is less time for the client to get the object and display it, which results in users visiting more often (since they appreciate a fast-loading site)
* can save load on your server - since there are fewer requests for a server to handle, it is taxed less and so reduces the cost and complexity of that datacenter (which is why web-hosting companies with large networks and multiple datacenters offer caching servers at various datacenters in their network; caching servers automatically update themselves when files are updated, which takes the load off the central server or cluster of servers)
* reduces traffic/bandwidth consumption - since each object is only gotten from the server once, there are fewer requests and responses that need to go over the network
* you don't have to pay for them

There are some concerns with its use, however:

* webmasters in particular fear losing control of their site, because a cache can 'hide' their users from them, making it difficult to see who's using the site
* could result in undercounts of page views and ad impressions (though this can be avoided by implementing various cache-busting techniques to better ensure that all performance statistics are accurately measured)
* danger of serving content that is out of date, or stale

There are two kinds:

* Browser Caches
o client applications built in to most web browsers
o let you set aside a section of your computer's hard disk to store objects that you've seen, just for you; the browser checks to make sure the objects are fresh, usually once a session (see the revalidation sketch after this list)
o settings can be found in the preferences dialog of any modern browser (like Internet Explorer or Netscape)
o useful when a client hits the 'back' button to go to a page they've already seen; also, if you use the same navigation images throughout your site, they'll be served from the browser cache almost instantaneously
* Proxy Caches
o serve many users (clients) with cached objects from many servers
o good at reducing latency and traffic (because popular objects are requested only once, and served to a large number of clients)
o usually deployed by large companies or ISPs (often on their firewalls) that want to reduce the amount of Internet bandwidth that they use
o can happen at many places, including proxies (i.e. the user's ISP) and the user's local machine but often located near network gateways to reduce the bandwidth required over expensive dedicated internet connections
o many proxy caches are part of cache hierarchies, in which a cache can inquire from neighboring caches for a requested document to reduce the need to fetch the object directly
o although some proxy caches can be placed directly in front of a particular server (to reduce the number of requests that the server must handle), these go by different names (reverse cache, inverse cache, or httpd accelerator) to reflect the fact that they cache objects for many clients but from (usually) only one server
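
The freshness check mentioned under browser caches can be seen in miniature with a conditional request: ask the server whether a stored copy is still current instead of re-downloading it. The URL below is a placeholder, and the server must send a Last-Modified header for the 304 path to trigger; this is a sketch of the mechanism, not of any particular cache implementation:

    # Revalidate a cached object with a conditional GET (If-Modified-Since).
    # A 304 reply means "your stored copy is still fresh".
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError

    URL = "http://www.example.com/logo.png"   # placeholder object

    # First request: fetch the object and note when it was last modified.
    with urlopen(URL, timeout=10) as resp:
        body = resp.read()
        last_modified = resp.headers.get("Last-Modified")

    # Later request: revalidate instead of downloading the whole object again.
    if last_modified:
        req = Request(URL, headers={"If-Modified-Since": last_modified})
        try:
            with urlopen(req, timeout=10) as resp:
                print("Object changed; re-downloaded", len(resp.read()), "bytes")
        except HTTPError as err:
            if err.code == 304:
                print("Not modified -- the cached copy is still fresh")
            else:
                raise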

Hacking Attacks - Prevention


The first three steps are suggested by security consultant Jay Beale in his interview with Grant Gross, when asked how administrators can protect themselves from system attacks.

1. Harden your systems (also called "lock-down" or "security tightening") by

* Configuring necessary software for better security
* Deactivating unnecessary software - disable any daemons that aren't needed or seldom used, as they're the most vulnerable to attacks
* Configuring the base operating system for increased security

2. Patch all your systems - Intruders can gain root access through vulnerabilities (or "holes") in your programs, so keep track of patches and/or new versions of all the programs you use (once a security hole is found, manufacturers usually offer patches and fixes quickly, before anyone can exploit the hole to any large extent), and avoid using new applications or those with previously documented vulnerabilities.

3. Install a firewall on the system, or at least on the network - Firewalls are software (ex. ZoneAlarm) and/or hardware (ex. Symantec-Axent's Firewall/VPN 100 Appliance) that block network traffic coming to and leaving a system, and give permission to transmit and receive only to user-authorized software. They work at the packet level and can not only detect scan attempts but also block them.

You don't even need to spend a lot of money on this. Steve Schlesinger expounds on the merits of using open source software for a firewall in his article, Open Source Security: Better Protection at a Lower Cost.

At the very least, you should have a packet-filtering firewall as it is the quickest way to enforce security at the border to the Internet.

EPLS offers the following suggestions/services for Stopping Unauthorized Access, using firewalls:

* Tighten the Routers at your border to the Internet in terms of packets that can be admitted or let out.
* Deploy Strong Packet Filtering Firewalls in your network (in either bridge or routing mode)
* Setup Proxy Servers for services you allow through your packet-filtering firewalls (can be client- or server-side/reverse proxy servers)
* Develop special custom-made server software or Internet-service client and server software

4. Assess your network security and degree of exposure to the Internet. You can do this by following the suggestions made by EPLS.

* portscan your own network from outside to see which services are exposed (TCP/IP services that shouldn't be exposed, such as FTP); a minimal scan sketch appears below
* run a vulnerability scanner against your servers (commercial and free scanners are available)
* monitor your network traffic (external and internal to your border firewalls)
* refer to your system log - it will reveal (unauthorized) services run on the system, and hacking attempts based on format string overflows usually leave traces here
* check your firewall logs - border firewalls log all packets dropped or rejected and persistent attempts should be visible

Portmapper, NetBIOS (ports 137-139) and other dangerous services exposed to the Internet should trigger some action if you check all of the above.

Also, more complex security checks will show whether your system is exposed through uncontrolled Internet Control Message Protocol (ICMP) packets or if it can be controlled as part of DDoS slaves through ICMP.
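
As a starting point for the "portscan your own network" suggestion, here is a minimal TCP connect-scan sketch in Python. The target address and port list are placeholders; only ever scan machines you own or administer, and note that serious audits use a dedicated scanner rather than a loop like this:

    # Minimal TCP connect scan for auditing your own hosts.
    # TARGET and PORTS are placeholders -- scan only machines you administer.
    import socket

    TARGET = "192.0.2.10"                              # placeholder (TEST-NET) address
    PORTS = [21, 23, 25, 80, 110, 111, 139, 443]       # FTP, Telnet, SMTP, HTTP, POP3,
                                                       # portmapper, NetBIOS session, HTTPS

    for port in PORTS:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(1.0)
            result = sock.connect_ex((TARGET, port))   # 0 means the port accepted a connection
            state = "OPEN" if result == 0 else "closed/filtered"
            print(f"port {port:5d}: {state}")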

5. When choosing passwords, don't use

* real words or combinations thereof
* numbers of significance (e.g. birthdates)
* the same or similar passwords for all your accounts

6. Use encrypted connections - encryption between client and server requires that both ends support the encryption method

* don't use Telnet, POP, or FTP programs unless passwords are strongly encrypted when passed over the Internet; encrypt remote shell sessions (like Telnet) if switching to other user IDs or the root ID
* use SSH (instead of Telnet or FTP)
* never send sensitive information over email

7. Do not install software from little-known sites - such programs can hide "trojans"; if you have to download a program, use a checksum, typically a PGP signature or MD5 hash, to verify its authenticity prior to installation (as sketched below)
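
A quick Python sketch of that verification step, using the hashlib module (the file name and published digest are placeholders; the article mentions MD5 and PGP, and hashlib.sha256() can be swapped in for a stronger check):

    # Verify a downloaded file against a published checksum before installing it.
    # DOWNLOADED_FILE and PUBLISHED_MD5 are placeholders -- take the real digest
    # from the vendor's site over a trusted channel.
    import hashlib

    DOWNLOADED_FILE = "package.tar.gz"
    PUBLISHED_MD5 = "replace-with-the-vendor's-published-digest"

    md5 = hashlib.md5()   # MD5 as mentioned above; hashlib.sha256() is stronger
    with open(DOWNLOADED_FILE, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)

    if md5.hexdigest() == PUBLISHED_MD5:
        print("Checksum matches -- the file appears authentic")
    else:
        print("Checksum MISMATCH -- do not install this file")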

8. Limit access to your server(s) - limit other users to certain areas of the filesystem or what applications they can run

9. Stop using systems that have already been compromised by hackers - reformat the hard disk(s) and re-install the operating system

10. Use Anti-Virus Software (ex. Norton Anti-Virus or McAfee) and keep your virus definitions up-to-date. Also, scan your system regularly for viruses.

Some of the ways by which Web hosting providers' Security Officers Face Challenges are discussed by Esther M. Bauer. These include:

* looking at new products/hacks
* regularly reviewing policies/procedures
* constant monitoring of well known ports, like port 80, that are opened in firewalls
* timely installation of patches
* customized setup of servers that isolate customers from each other - "In a hosting environment the biggest threat comes from inside - the customers themselves try to break into the system or into other customers' files"
* investment in firewall, VPN devices, and other security measures, including encrypted Secure Sockets Layer (SSL) communication in the server management and account management systems
* installation of secure certificates on web sites
* purchase and deployment of products according to identified needs
* monitoring suspicious traffic patterns and based on the customer's service plan, either shunting away such traffic as bad, or handling it through a content-distribution system that spreads across the network
