ancient-ai

Training AI Models to Read 19th-Century Archival Documents

Purpose

The goal of our project is to use artificial intelligence and machine learning to accurately transcribe historical handwritten documents. The project timeline includes determining which tools and approach will best achieve this goal.

Week 1

For our initial meeting and consultation with the project sponsor, we worked to solidify the scope of work to be accomplished.
Although the task seemed easy at first, many questions had to be answered before we could fully determine that scope and whether the work would be achievable within our timeline.

By the end of our first week, we needed answers to these questions in order to narrow down the scope and feasibility of our project.

Week 2

For the second week, we researched several applications and how they could help accomplish our goal. One example is Transkribus, a popular platform for transcribing archival documents from photos. *** Of note, “transcribing” often refers to turning audio or video into typed text. In archival work the term also covers converting handwriting into typed text, and the automated form of that task is commonly called handwritten text recognition (HTR). We will use “transcribing” throughout to mean the conversion of handwritten cursive into typed text. ***

Transkribus

Transkribus.org is one of the leading apps for transcribing handwritten documents into searchable typed text. One of our first tasks was to understand and present the legal and privacy rules that the app operates under, to ensure that data uploaded to the application’s servers is not used for hidden or nefarious purposes.

The Transkribus project operates as a cooperative (READ-COOP). The statutes of that cooperative describe its mission, operational processes, member rights, and cooperative functions, but they make no mention of intellectual property rights over, or the legal ownership or use of, the data and documents uploaded to the system.
The next document we reviewed was the Privacy Policy, which does describe how data is handled and stored. Namely:

  1. READ-COOP collects data transferred to Transkribus, including uploaded images, recognized text, ground truth data, trained recognition models, and metadata.
  2. The data is hosted on servers in Germany and Austria.
  3. Users maintain control over the data they upload. The policy states that the data must not contain Personal Data unless the user signs a Data Processing Addendum.
  4. Uploaded data is temporarily stored only for the duration necessary to complete text recognition and is deleted promptly thereafter.

These are important points that we will share with the sponsor. They may determine how the data is processed, or which data is processed at all.

Finally, the General Terms and Conditions document the relationship between user and platform more explicitly. We summarize the key points as follows:

  1. READ-COOP stores uploaded material only to the extent necessary to provide its services and improve its products. The platform does not guarantee permanent storage unless otherwise agreed upon.
  2. Documents submitted via Transkribus or APIs are stored only temporarily for processing purposes. For APIs, uploaded data is automatically deleted shortly after processing is completed.
  3. Ownership: Users retain ownership of all uploaded content, including handwritten material, processed results, and training data.
  4. Licensing: Users grant READ-COOP a limited, non-exclusive, worldwide, royalty-free license to store, modify, and process user content for the sole purpose of providing and improving its services.
  5. Users maintain all intellectual property rights for their material and processed outputs unless explicitly stated otherwise.
  6. The terms explicitly state that "User Content remains yours," reaffirming user ownership over scanned and uploaded documents.
  7. Users can share custom-trained models with others via the platform, with two sharing options:
    - With Training Data: makes both the model and the underlying training data public.
    - Without Training Data: only the model is shared, and the training data is kept private.
  8. Users can withdraw consent for sharing at any time.

Handwriting OCR

Another AI transcription app that we found during our research is Handwriting OCR.

According to the frequently asked questions page on their website, "Handwriting OCR is a document automation service that specialises in digitizing documents containing handwriting. It uses a form of Optical Character Recognition (OCR) developed especially for reading handwriting" (Source).

Highlights about Handwriting OCR:

  1. Supports a wide range of file formats such as PDF, JPG, PNG, GIF, HEIC, and TIFF.
  2. Supports multiple languages for processing documents.
  3. AI models are pre-trained on a diverse set of public-domain and licensed datasets to ensure privacy and confidentiality of user data.
  4. Handwriting OCR has a comprehensive API that allows users to integrate their services directly into applications.

In terms of the legal and privacy rules that Handwriting OCR operates under, here are some key points we found during our research (Source):

  1. Handwriting OCR has explicitly stated that any information uploaded to its services belongs to the user, and that it only uses uploaded data to deliver its OCR services.
  2. Users have full control over any data uploaded to the service and can delete it at any time. By default, Handwriting OCR automatically deletes processed documents after 7 days, but users can shorten or lengthen that retention period. Once data is deleted, it is immediately and permanently removed from Handwriting OCR systems.
  3. Handwriting OCR will only retain user uploaded data as long as necessary to provide their services, emphasizing that users can delete their data at any time.
  4. Data in transit and at rest is protected with industry-standard encryption, along with rigorous access controls and security protocols. Any non-EU customer data is stored in the US.
  5. Handwriting OCR also complies with several data protection standards and regulations, such as GDPR (General Data Protection Regulation) and HIPAA (Health Insurance Portability and Accountability Act).
  6. Handwriting OCR views and accesses a user's documents only if the user requests support for those documents and grants permission to do so.
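
The retention rule in point 2 can be illustrated with a short sketch. This is a hypothetical helper of our own, not part of Handwriting OCR's API: the names `deletion_time` and `retention_days` are assumptions made for illustration, and only the 7-day default comes from the points above.

```python
from datetime import datetime, timedelta, timezone

# Default retention window described above; users may shorten or lengthen it.
DEFAULT_RETENTION_DAYS = 7


def deletion_time(processed_at: datetime,
                  retention_days: int = DEFAULT_RETENTION_DAYS) -> datetime:
    """Return when a processed document would be auto-deleted.

    Hypothetical helper for planning purposes only; it does not call
    any real service.
    """
    if retention_days < 0:
        raise ValueError("retention_days must not be negative")
    return processed_at + timedelta(days=retention_days)


if __name__ == "__main__":
    processed = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(deletion_time(processed))     # default 7-day window
    print(deletion_time(processed, 1))  # shortened 1-day window
```

A calculation like this may help when deciding how soon transcription results need to be downloaded before the service removes them.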

Sources

Our plan is to propose to the sponsor viable options for achieving the goal of transcribing archival documents using AI.

Lessons Learned

Information to follow...

GitHub Project Webpage