Collateral use of Digital Preservation for other departments

What's the opposite of collateral damage? Collateral use? That's what our Digital Preservation workflows are for some other departments in our library. They put the digital content on their digital preservation platforms, for example the Open Access Repository EconStor, and we archive it during the next night. As this is all automated, it literally happens during our sleep. Of course we do all the mandatory checking, including format checks, validation checks and extraction of metadata. If a PDF (EconStor usually hosts PDF files) is broken, our tools notice that. We can give feedback the next day and our workmates can ask the data producer for a replacement.

As this check is done directly after the acquisition, the data producer is usually easy to reach and a replacement is quickly on the way.

What if it's much slower?

But there are other platforms and other departments which acquire digital content, and not everything is ingested the night after data acquisition. I pick one example for this blog post: the national and alliance licences. We are allowed to host this content and to ingest it into the Digital Archive, but usually this goes very slowly. The colleagues from the responsible acquisition team get the content (usually PDF files) in large ZIP files. It's necessary to check for completeness and integrity, of course, which takes some time. Long-term archiving comes last in the object processing pipeline. There can be months, sometimes years, between acquisition and ingest into the Archive. If corrupted objects are detected that late, it's usually too late to ask for a replacement. The ZBW has been operating a Digital Archive since 2015, and experience has shown that the success rate has been very poor when asking for a replacement more than a few weeks after acquisition of the data. We needed a good post-acquisition workflow to check the data quality, preferably fast and automatic, as staff time is always scarce.

We have established a workflow for now, as follows:

Shortly after acquiring the content, the teammates unpack all the ZIPs and check for completeness.

Afterwards, all the PDF files go through a rough validation check, using ExifTool, Grep and PDFInfo. We have decided against JHOVE for two reasons:

- It's much slower and we are usually dealing with more than 10,000 PDF files.
- It's much pickier and gives lots of false alarms and error messages we can ignore for now and fix later on (link to iPRES paper which examines this).

The check has three possible outcomes:

- PDF has no error message: no action necessary.
- PDF is encrypted: In this case, we contact the data provider and ask for a replacement with a decrypted PDF or for permission to crack the password protection. So far we have experienced good cooperation from the publishers in this case. The decryption will be the topic of another blog post.
- Tool throws an Error for the examined PDF files.

The third case is within the scope of this blog post, because (as is the case with most tools) not every error message has an impact on the ability of the PDF file to be archived.

Changing the settings of the script to point to the folder to be examined is the more time-consuming part.

Depending on the size of the PDF files, ExifTool needs between one minute (if the PDF files have an average size of 1 MB) and ten minutes (if the PDF files have an average size of 32 MB). Of course, the manual examination (done by the acquisition team), which is necessary for some Errors (see next chapter), and the documentation need some time as well, especially when the workflow has only recently been established and when the acquisition team runs into a new error message which has not been examined and documented yet. Once the workflow has been established, including all needed scripts, a batch of 1000 PDF files needs between 30 and 60 minutes in our experience, if there are not too many Errors or unknown error messages.

For the Digital Preservation Team, there is half an hour on top if something has to be examined further, which was always necessary at the beginning. Once the workflows have been established, this is an exception, and this is why that staff time can be neglected.

Conclusion: Up to one hour for a 1000-PDF-Files folder

ExifTool error messages
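As a rough illustration of the triage described in this post (no error message, encrypted, Error), the following Python sketch sorts ExifTool results into three buckets. It is not the script used at the ZBW. It assumes that ExifTool is on the PATH, that its JSON output (`-j`) contains an `Error` key for files it could not read properly and an `Encryption` key for encrypted PDFs, and the function names `run_exiftool`, `classify` and `triage` are hypothetical. Please verify the key names against your own ExifTool version.

```python
import json
import subprocess
from collections import defaultdict
from pathlib import Path


def run_exiftool(folder: Path) -> list[dict]:
    """Run ExifTool once over all PDFs below `folder` and return its JSON records."""
    result = subprocess.run(
        ["exiftool", "-j", "-r", "-ext", "pdf", str(folder)],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit status should not abort the triage
    )
    return json.loads(result.stdout) if result.stdout.strip() else []


def classify(record: dict) -> str:
    """Map one ExifTool record to one of the three cases from the workflow."""
    # Assumption: encrypted PDFs carry an "Encryption" key in the JSON output.
    if "Encryption" in record:
        return "encrypted"
    # Assumption: problem files carry an "Error" key in the JSON output.
    if "Error" in record:
        return "error"
    return "ok"


def triage(records: list[dict]) -> dict[str, list[str]]:
    """Group file names by case so each group can be handled separately."""
    buckets = defaultdict(list)
    for record in records:
        buckets[classify(record)].append(record.get("SourceFile", "<unknown>"))
    return dict(buckets)
```

A call such as `triage(run_exiftool(Path("unpacked_zips")))` (the folder name is made up) would return the file names per case. The `error` bucket is the one that still needs the manual look described in the post, since not every error message affects whether a PDF can be archived. The Grep and PDFInfo steps of the workflow are not covered by this sketch.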