Web Archiving
Long-term preservation of, and access to, the most fragile documents (archived Czech websites) is the second major priority of the NDK project. These documents are published solely in digital form, which makes them highly dependent on technology; they are prone to degradation and technical obsolescence. Within the scope of the project we will harvest approximately 4 billion files.
Data volumes
The harvested data will be stored in the ARC or WARC format, in files of about 100 MB each. The project expects to run two complete harvests of the Czech web per year. Over 5 years the project will produce about 173 TB of data, i.e. roughly 1,730,000 files. Spread over 220 working days per year, this amounts to about 1,572 files, or roughly 0.15 TB, per day. Data will not be ingested into the long-term preservation repository every day but in batches. The approach to long-term preservation of this type of data has yet to be decided; so far the web archiving data are only backed up, and no active preservation is in place.
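The volume estimate above can be checked with a few lines of arithmetic. The following is only a minimal sketch: the input figures come from the text, while the function name estimate_volume and the use of decimal units (1 TB = 1,000,000 MB) are assumptions made for illustration.

```python
# Input figures taken from the text above.
TOTAL_TB = 173                 # total data produced by the project
FILE_SIZE_MB = 100             # approximate size of one ARC/WARC file
YEARS = 5                      # period covered by the estimate
WORKING_DAYS_PER_YEAR = 220    # working days per year

# Assumption: decimal units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def estimate_volume(total_tb=TOTAL_TB, file_size_mb=FILE_SIZE_MB,
                    years=YEARS, working_days_per_year=WORKING_DAYS_PER_YEAR):
    """Return (total_files, working_days, files_per_day, tb_per_day)."""
    total_files = total_tb * MB_PER_TB // file_size_mb
    working_days = years * working_days_per_year
    files_per_day = total_files // working_days   # whole files per working day
    tb_per_day = total_tb / working_days
    return total_files, working_days, files_per_day, tb_per_day


total_files, working_days, files_per_day, tb_per_day = estimate_volume()
print(f"files in total:        {total_files}")
print(f"working days:          {working_days}")
print(f"files per working day: {files_per_day}")
print(f"TB per working day:    {tb_per_day:.3f}")
```

With these inputs the sketch gives 1,730,000 files over 1,100 working days, i.e. 1,572 whole files and about 0.157 TB per working day, which the text gives as 0.15 TB. Because ingest happens in batches, the daily figures are averages rather than actual daily loads.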
More information about the technology used for web archiving and the legal conditions of access to the archived content can be found on the project portal: http://webarchiv.cz/
