Government of Canada Web Archive

FAQ

What is the Government of Canada Web Archive (GCWA)?
How do I find the GCWA by navigating on Library and Archives Canada's website?
What in Library and Archives Canada's mandate allows it to copy and archive public websites, and then make them publicly accessible?
How does the acquisition and archiving of websites support other parts of its legislation, namely legal deposit, and/or the disposition of government records?
Is the GCWA freely accessible to the public via the Internet?
Are there other organizations also harvesting and archiving websites?
What is Canada's role in the IIPC?
Are there any plans to extend this form of archiving to other areas of the Web?
Are the contents of the GCWA googled?
Are Intranets and Extranets included in the GCWA crawls?
How do I know if I am viewing an archived website or the live website?
How do I report a website which is missing from the Government of Canada Web Archive?
Switching between English and French pages on an archived site does not always work properly. Sometimes I get the equivalent page in the other language, but sometimes a completely different page is presented. Why is that?
Are departments, etc informed that they have been "crawled"?
Can I view archived websites in the Government of Canada Web Archive if I don't have JavaScript enabled?
What can government departments do to support this program?
Can I link my department's site to your site, in order to refer people who are interested in older versions of the website back to your site?

Missing Content, Error Messages, and Crawling

Why are some websites missing?
Why is my organization's name not listed on the Department List?
Are websites crawled completely?
Part of our website is there, but there are some gaps. Why is that?
We currently have a section of the site which provides downloadable documents for a charge. Because you need a password, etc. to obtain these materials, are there issues with respect to LAC harvesting and making them available to the public for free?
Would it be possible to obtain information on the spider you are using to crawl the site so that we can exclude it from our statistical analysis and reporting tools?
Why aren't the images showing?
Why aren't some audio and video features working?
Your site states that a crawl was launched in December 2005. I looked in the archive for a document which I am certain existed at the original website in December 2005 but the document is not there. Why?
I received the message "Sorry, no documents with the given URL were found in this archive." Why does that happen?
I received an error message but the green banner at the top said "Archive time: 2006-01-16 16:59:55". Is the page in the archive or not?
I was looking at the different versions captured for a number of web pages. Some have many versions while others have only one or two. Why is that?
Do FLASH sites work?
Does the crawler capture dynamic site content?
Do forms work?

Removing Content from the Archive

I checked our website in the Government of Canada Web Archive and found that some links present me with error pages. There doesn't seem to be any content. Can these links be removed?
The information in the archived version of our website is outdated. Can the content be removed or changed?

What is the Government of Canada Web Archive (GCWA)?

The Government of Canada started using the Web to convey information to Canadians in the mid-1990s. Much of the information published that way has since been lost or changed, or survives only in part in scattered sources.

The GCWA contains snapshots of Canadian government websites. LAC started the first round of snapshots in late December 2005 and early winter 2006.

Resources (human and technical) permitting, LAC intends to crawl GoC websites twice a year. Each subsequent crawl will be added to the GCWA.

Although each snapshot generally contains all the content in a website, short-lived material (i.e. material that was online for less than 6 months) that was added to and removed from a website between crawls may regrettably not be captured.

How do I find the GCWA by navigating on Library and Archives Canada's website?

On Library and Archives Canada's website, the Government of Canada Web Archive is located in the "Politics and Government" collection.

Directions:

1. Start on the main LAC webpage (http://www.collectionscanada.gc.ca/index-e.html). In the left hand column, scroll down to "What we have". Click on "On our website".

2. That opens an "On our website" page (http://www.collectionscanada.gc.ca/website/index-e.html), which gives you several Browse options:

A) If you click on "Browse by topic" you will get to the header "Politics and Government". When you click on "Politics and Government", an alphabetical list of websites will be displayed, one of which is the "Government of Canada Web Archive".

B) If you click on "Browse alphabetically" and choose G you will get to the header "Government of Canada Web Archive". A click on P for "Politics and Government" is not an option here. Of course, it is always possible to bookmark the GCWA directly:

http://www.collectionscanada.gc.ca/webarchives/index-e.html

What in Library and Archives Canada's mandate allows it to copy and archive public websites, and then make them publicly accessible?

Section 8(2) of the Library and Archives of Canada Act gives LAC the mandate to sample publicly accessible websites that it deems of interest to Canadians.

(1) The Librarian and Archivist may do anything that is conducive to the attainment of the objects of the Library and Archives of Canada, including (a) acquire publications and records or obtain the care, custody or control of them;

(2) In exercising the powers referred to in paragraph (1)(a) and for the purpose of preservation, the Librarian and Archivist may take, at the times and in the manner that he or she considers appropriate, a representative sample of the documentary material of interest to Canada that is accessible to the public without restriction through the Internet or any similar medium.

In addition, the Preamble to the LAC Act clarifies what lies behind LAC's desire to collect, preserve and make accessible archived web information, in particular that of the GoC:

Preamble

WHEREAS it is necessary that
(a) the documentary heritage of Canada be preserved for the benefit of present and future generations;

(b) Canada be served by an institution that is a source of enduring knowledge accessible to all, contributing to the cultural, social and economic advancement of Canada as a free and democratic society;

(c) that institution facilitate in Canada cooperation among the communities involved in the acquisition, preservation and diffusion of knowledge; and

(d) that institution serve as the continuing memory of the government of Canada and its institutions;

How does the acquisition and archiving of websites support other parts of its legislation, namely legal deposit, and/or the disposition of government records?

Section 10 of the Library and Archives of Canada Act gives Library and Archives Canada (LAC) the mandate to require publishers (including the Government of Canada) to deposit all forms of publications found on the Internet, usually embedded in websites themselves, with LAC. LAC currently keeps these in a publicly accessible database separate from the one where websites are kept.

In the case of publishers in Canadian federal departments, agencies, commissions and the like, the onus to deposit publications with LAC falls on each department or organization. Because the harvesting tool used to ingest public websites is so effective at acquiring GoC publications embedded in these publicly accessible websites, LAC is investigating whether this tool can stand in for the more time- and resource-consuming deposit required of each department.

E-records that fall under Sections 12-13 of the LAC Act are transferred and/or disposed of in accordance with processes established by LAC. A distinction, however, is made between e-records and e-publications. The latter, which are covered by the harvesting and/or legal deposit provisions of the Act, are considered to be public because they appear on a public source; e-records are not found on the Internet. For further information, consult the Digital Collection Development Policy: http://www.collectionscanada.gc.ca/collection/003-200-e.html

Is the GCWA freely accessible to the public via the Internet?

As of Nov. 20, 2007, the Internet public has access to this archive. The contents of the collected websites have been indexed and can be searched through a departmental name index, a URL index, and full-text searching.

Are there other organizations also harvesting and archiving websites?

The Internet Archive in the USA has been archiving websites from around the world for a number of years. Government of Canada websites have been included in their publicly accessible web archive since its inception.

Similar archiving is also being undertaken around the world. As an active member of the International Internet Preservation Consortium, LAC works with other national institutions such as the Bibliothèque nationale de France, the Library of Congress, the British Library and many others. All are engaged in the same process, recognizing the importance of acquiring, archiving, and preserving information taken from the Web for the benefit of their citizens.

What is Canada's role in the IIPC?

Library and Archives Canada is a founding member of the International Internet Preservation Consortium (http://www.netpreserve.org). The goal of this organization is to collect, preserve and ensure long-term access to Internet content from around the world through the collaborative development of common tools and techniques for developing web archives. Library and Archives Canada has implemented this first significant Canadian web archive through the use of these open source tools. See also Technical details on the GCWA site.

Are there any plans to extend this form of archiving to other areas of the Web?

Library and Archives Canada has been doing selective web archiving for over a decade. Selected websites and web publications have been acquired on a one-by-one basis and have been made accessible through LAC's online catalogue AMICUS and archived in the LAC Electronic Collection. Examples of these sites include the Federal elections 2006, Canada's Digital Collections Schoolnet sites and the Olympics.

The GCWA is the first example of developing an archive of a complete web domain (i.e. the Government of Canada), using powerful Open Source harvesting, indexing and viewing software and an application that provides access to the archived websites on LAC's site.

Moving forward with this initiative requires that LAC resolve several issues (e.g. public access permissions, the method of display, technical improvements in the crawl and archiving processes, and storage improvements).

Are the contents of the GCWA googled?

No. Google only indexes the main page of the GCWA. It does not index the archived websites contained in the GCWA. LAC has blocked Google and similar Internet crawlers from crawling the contents.

Are Intranets and Extranets included in the GCWA crawls?

GoC Intranets and Extranets are not included in the crawls undertaken to add web content to the GCWA. The emphasis in this archive is on gathering information that has been made publicly accessible on the Web.

However, LAC will consider requests from departments to include information kept on their Intranet or Extranet site, provided that a) the department gives full and clear permission to make that content public for each successive crawl; b) the department understands that it is information for public consumption; and c) a crawl of that portion of the department's domain is technically feasible.

How do I know if I am viewing an archived website or the live website?

A bright green banner is displayed at the top of each archived web page. The date on which the archived website was harvested appears in this banner.

Due to certain inconsistencies, the "live site" rather than the archived one may occasionally be displayed. If the green banner does not appear at the top of a website display, then the site which you are viewing is not an archived website in the GCWA, but is, in fact, a "live" website.

It is important to note, however, that this green banner is only displayed on the archived web pages themselves and is not displayed on archived documents such as PDF or Word files. You can always confirm that the information you are viewing is part of the GCWA by looking at the URL of the displayed website or document. If it begins with www.collectionscanada.gc.ca/webarchives, then you can be sure that it is an archived document.
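The URL check described above can be sketched in a few lines of Python. This is only an illustration and not part of the GCWA itself: the function name is_gcwa_url is hypothetical, and the only rule it applies is the one stated above, namely that an archived address begins with www.collectionscanada.gc.ca/webarchives. The French entry page quoted elsewhere in this FAQ uses the path "archivesweb"; whether archived French pages share that prefix is not stated here, so the sketch does not cover it.

```python
from urllib.parse import urlsplit

# Prefix given in the answer above.
ARCHIVE_HOST = "www.collectionscanada.gc.ca"
ARCHIVE_PATH = "/webarchives"


def is_gcwa_url(url: str) -> bool:
    """Return True if the address begins with the GCWA prefix (hypothetical helper)."""
    # Accept addresses typed with or without a scheme.
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path
    # The path must be exactly /webarchives or continue below it.
    return host == ARCHIVE_HOST and (
        path == ARCHIVE_PATH or path.startswith(ARCHIVE_PATH + "/")
    )


if __name__ == "__main__":
    for address in (
        "http://www.collectionscanada.gc.ca/webarchives/index-e.html",
        "http://www.canada.gc.ca/document.pdf",
    ):
        print(address, "->", "archived" if is_gcwa_url(address) else "not in the GCWA")
```

A check like this is most useful for documents such as PDF files, where the green banner is not available as a visual cue.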

How do I report a website which is missing from the Government of Canada Web Archive?

Please advise Library and Archives Canada of any websites that have not been included in the Government of Canada Web Archive by sending an e-mail to web-archives-web@lac-bac.gc.ca that indicates both the name of the department (or agency) and all associated URLs.

Switching between English and French pages on an archived site does not always work properly. Sometimes I get the equivalent page in the other language, but sometimes a completely different page is presented. Why is that?

In most cases, the content of archived sites is available in both official languages, but navigating between English and French does not always work. We recommend that people perform a new search when looking for information in the other language.

Are departments, etc informed that they have been "crawled"?

Each time Library and Archives Canada harvests a website, the crawler leaves a "calling card" behind, informing the server/system administrator that the site was crawled and inviting the administrator who reviews the access log to visit the following pages:

English: http://www.collectionscanada.gc.ca/011/001/index-e.html
French: http://www.collectionscanada.gc.ca/011/001/index-f.html

If the administrator wishes to have further information, they can contact the web archiving team at web-archives-web@lac-bac.gc.ca.

Can I view archived websites in the Government of Canada Web Archive if I don't have JavaScript enabled?

JavaScript is required to enable the display of the archived web sites. As a result, it is possible that the archived sites will not be viewable when utilizing assistive technologies to access the GCWA.

There will be changes in this particular piece of functionality soon, as the IIPC has developed a W3C compliant Open Source viewer.

What can government departments do to support this program?

Departments should extensively review the GOC Web Archive, Department List and URL List at least once a year to ensure that we have captured all applicable URLs and that their site in the archive is complete and functioning properly. Forward any discrepancies to web-archives-web@lac-bac.gc.ca.

Let LAC crawlers into your department's website to crawl the contents.

Can I link my department's site to your site, in order to refer people who are interested in older versions of the website back to your site?

All departments should feel free to link their current website back to the Government of Canada Web Archive. That way people coming to the department's current website may be referred where appropriate (for example, to find older information) to Library and Archives Canada's web archive where they may continue their search.

Normal Internet protocol simply requires a department to notify LAC at web-archives-web@lac-bac.gc.ca of its intention to do so, so that LAC can keep track of these links. LAC strongly recommends that departments link only to the main page of the Government of Canada Web Archive (http://www.collectionscanada.gc.ca/webarchives/index-e.html) rather than to particular documents or information sources embedded within any website in the Archive.

LAC does not guarantee the persistence of any URLs linking to pages and documents embedded within the archived websites. Such links from the department's current website may become broken over time.

Another consideration that departments need to be aware of is that the application used to view the content of the GCWA requires JavaScript. As a result, it is possible that the archived sites will not be viewable when utilizing assistive technologies to access the GCWA. There will be changes in this particular piece of functionality soon, as the International Internet Preservation Consortium (IIPC) has agreed to develop a W3C compliant Open Source viewer.

Each department may word explanatory text to accompany the link to the GCWA at http://www.collectionscanada.gc.ca/webarchives/index-e.html in whatever fashion they wish. However, some examples are provided below:

Example 1

Library and Archives Canada archives older versions of this website in its Government of Canada Web Archive (http://www.collectionscanada.gc.ca/webarchives/index-e.html)

Bibliothèque et Archives Canada conserve dans ses Archives du Web du gouvernement du Canada (http://www.collectionscanada.gc.ca/archivesweb/index-f.html) les versions antérieures de ce site Web

Example 2

Versions of this website for previous years starting with 2006 are available in the Government of Canada Web Archive (http://www.collectionscanada.gc.ca/webarchives/index-e.html) at Library and Archives Canada

Des versions antérieures de ce site Web (2006- ) sont disponibles dans les Archives du Web du gouvernement du Canada (http://www.collectionscanada.gc.ca/archivesweb/index-f.html) de Bibliothèque et Archives Canada

An example of text referring people to the Government of Canada Web Archive can be found at Infrastructure Canada's website:

http://www.infrastructure.gc.ca/links-liens/index_e.shtml
http://www.infrastructure.gc.ca/links-liens/index_f.shtml

Missing Content, Error Messages, and Crawling

Why are some websites missing?

Each crawl has a "seed list" -- a list of URLs that the crawler will harvest. The seed lists are compiled from a variety of sources. The Government of Canada seed list was compiled primarily from the Government URL Registry and notifications from individual government organizations.

It is important to note that the Department List is not an exhaustive list of all the organizations included in the Archive. To find out whether a certain organization's website is in the archive, it is necessary to perform a keyword or URL search. If the organization is not found in the search, and it is a federal government organization, please advise us by email at web-archives-web@lac-bac.gc.ca and we will determine whether it should be included in a subsequent crawl.

Why is my organization's name not listed on the Department List?

The names on the Department List include departments, agencies, etc for which LAC was able to locate an Internet address from various key sources. Archived material from smaller organizations should be found by performing a keyword search with the organization's name. If the URL for your organization is not found anywhere on the Department List or URL List, please advise us by email to web-archives-web@lac-bac.gc.ca.

Are websites crawled completely?

We attempt to crawl in their entirety all areas of a website that are publicly accessible at the time of the crawl. All content captured is archived in the Government of Canada Web Archive.

LAC crawlers are programmed not to include any content from a department's Intranet or Extranet. In addition the crawler is stopped from proceeding further at registration points, for instance, at any point where the website requires a user to identify with a password or login. Due to technical limitations, the crawler also stops at databases that normally require the user to enter search terms, and at online webforms.

Part of our website is there, but there are some gaps. Why is that?

LAC compiles the "seed list" (the URLs) for a harvest from a variety of sources. We strive to have a complete list and we rely upon organizations to review the archive and inform us of any omissions.

Departments should extensively review the GOC Web Archive, Department List and URL List to ensure that we have captured all applicable URLs and that their site(s) in the archive is complete and functioning properly. Forward any discrepancies to web-archives-web@lac-bac.gc.ca. Please review the FAQs section before sending your email as the issues you are reporting may have already been identified.

We currently have a section of the site which provides downloadable documents for a charge. Because you need a password, etc. to obtain these materials, are there issues with respect to LAC harvesting and making them available to the public for free?

The crawler used by Library and Archives Canada only follows links on websites. It cannot interact with a website that requires user input (e.g. passwords), and therefore it would not have been able to harvest the documents.

Would it be possible to obtain information on the spider you are using to crawl the site so that we can exclude it from our statistical analysis and reporting tools?

In order to remove our crawler from your statistical analysis and reporting tools, you will need to exclude any http access entries with the word "heritrix" (the name of the crawler used by LAC).
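One way to apply this exclusion to a plain-text access log is sketched below in Python. It is a minimal illustration under stated assumptions: the function name exclude_crawler_entries is hypothetical, the match is made case-insensitive as a precaution, and the sample log lines are invented for the example (the exact text the crawler leaves in your logs may look different).

```python
from typing import Iterable, Iterator

CRAWLER_MARKER = "heritrix"  # crawler name given in the answer above


def exclude_crawler_entries(
    lines: Iterable[str], marker: str = CRAWLER_MARKER
) -> Iterator[str]:
    """Yield only the access-log lines that do not mention the crawler."""
    needle = marker.lower()
    for line in lines:
        # Case-insensitive comparison is an assumption made for safety.
        if needle not in line.lower():
            yield line


if __name__ == "__main__":
    # Invented sample entries, for illustration only.
    sample_log = [
        '192.0.2.10 - - [16/Jan/2006:16:59:55] "GET /index-e.html HTTP/1.1" 200 "ExampleBrowser/1.0"',
        '192.0.2.20 - - [16/Jan/2006:17:00:02] "GET /index-e.html HTTP/1.1" 200 "heritrix (sample agent text)"',
    ]
    for entry in exclude_crawler_entries(sample_log):
        print(entry)
```

Most statistics packages offer an equivalent setting for excluding a user agent, which may be simpler than filtering the log file yourself.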

Why aren't the images showing?

Menu images using "mouseover" scripting may not be captured because of crawler limitations. However, the links will usually work.

Why aren't some audio and video features working?

Audio and video features may not work because they are located on another server that was not included on the "seed" list of the original crawl. We do not take data from servers that are not registered to the Government of Canada.

Your site states that a crawl was launched in December 2005. I looked in the archive for a document which I am certain existed at the original website in December 2005 but the document is not there. Why?

The archive contains snapshots of individual web pages at various points in time. A crawl can take up to 3 months from launch to completion, and one or more snapshots of a page may be captured during that time. The page to which you refer may well have existed during the time frame of the crawl. However, it was probably taken down before the crawler visited the site.

The specific date that a web-page was captured is indicated by the "Archived" date on the green banner at the top of the archived web page being viewed.

You should also examine other versions of a relevant web page to determine if the document in question was available at another point in time. To access these other versions, click on "View other versions of this page" on the green banner.

The other versions will be listed by date captured.

Searching tip: If you have a bookmark for the original document, you can enter it into the search box, e.g. http://www.canada.gc.ca/document.pdf

I received the message "Sorry, no documents with the given URL were found in this archive." Why does that happen?

Attempting to view a page that the crawler has not captured will result in the message: "Sorry, no documents with the given URL were found in this archive."

There are various reasons why the crawler may not have captured a given page:

Collection policy:

The page is on a server which is not part of the seed list for the crawl. [See "Why is my organization's name not listed on the Department List?"]
The page was part of an Intranet or Extranet, which the crawler is programmed to exclude.

Technical barriers:

The crawler is not able to capture content that is:

behind a login page
inside a database which requires text be keyed into a search box or only accessible through a webform
accessed via drop-down boxes
blocked from crawler access by a robots.txt file
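For the last item in the list above, a department can check whether its own robots.txt rules would turn a crawler away. The Python sketch below uses the standard library's urllib.robotparser. It is an illustration only: the helper name is_blocked, the sample rules and the example.org addresses are hypothetical, and the use of "heritrix" as the user-agent name is an assumption based on the crawler name mentioned elsewhere in this FAQ.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules disallow this URL for this agent."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical rules: every crawler is kept out of /private/.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for address in (
        "http://www.example.org/private/report.html",
        "http://www.example.org/public/index.html",
    ):
        state = "blocked" if is_blocked(EXAMPLE_ROBOTS, "heritrix", address) else "allowed"
        print(address, "->", state)
```

Pages reported as blocked by a check like this are candidates for the "no documents with the given URL" message in the archive.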

I received an error message but the green banner at the top said "Archive time: 2006-01-16 16:59:55". Is the page in the archive or not?

The error message was generated by the original website.

The crawler captures the website in its condition at the time of the crawl. If a server at the website was down or a link was faulty, the crawler will capture the error message that the website presented. In such cases, the green banner will appear at the top of the screen, indicating that the crawler successfully captured the webpage as it existed at that point in time, albeit as an error page.

I was looking at the different versions captured for a number of web pages. Some have many versions while others have only one or two. Why is that?

The number of versions captured is dependent on the number of times the crawler was "sent" to a specific page. Initially, the crawler is sent out once. However, if a particular page is referenced by other sites, it will have more versions. For example, if 10 organizations have a link to the Prime Minister's home page, each link will result in a new capture of the home page and, thus, a new version.

Do FLASH sites work?

FLASH content was captured and is available for viewing. However, if a FLASH file is the only way to navigate within the site, the site will not be accessible. The information in the FLASH file that sends the user into the site cannot be corrected to redirect to the archived site. The user will either be directed to the live site or to a "404 Not Found" page if that site is no longer available.

Does the crawler capture dynamic site content?

Information on a site that has dynamic content may be available if an explicit link to that information was crawled. If the information was available on the site only after the user specified parameters, for example through a select box, it will not be available in the web archive collection.

Do forms work?

Many Web sites use embedded programs to perform form functions such as sending form information or conducting a search query within a site with dynamic content. Currently in the Web Archive, there have been no changes to Web site coding. If a form or search query is used, it is possible that the program will be activated by a current/existing site and the user will be directed to the actual site in response to the query.

Removing Content from the Archive

I checked our website in the Government of Canada Web Archive and found that some links present me with error pages. There doesn't seem to be any content. Can these links be removed?

There are various technical reasons for error pages or missing content [Please see the section "Missing Content, Error Messages, and Crawling"]. We are currently developing policies regarding "editing" the archive. Therefore, at the present time, we are not removing any content or links.

The information in the archived version of our website is outdated. Can the content be removed or changed?

The purpose of the archive is to preserve a precise representation of a website at a certain point in time for historical purposes. As such, we do not remove any content from it.

Please note that we make every effort to alert users to the fact that information may be outdated. A bright green banner appears at the top of the screen and states the date that the page was archived.

Newer versions of your website will be captured in future crawls and added to the archive. However, keep in mind that the version in our archive will always be dated as it takes approximately 6 months from the time we launch a crawl to the time the content is available in the archive.

Search tip: to verify if the page you are viewing is the most current in the archive, click "View other versions of this page" found in the green banner.