This project is a collaboration between the DataLab and the UC Davis Library. Funded by a Sloan grant, it extracted historical price data from an archive of wine catalogs held in the UC Davis Library. The primary goal was to create a database of historical price information that could help wine economists study wine markets over time. A secondary goal was to develop open-source software for extracting tables from images, building on the Rtesseract package, an R interface to the Tesseract OCR (Optical Character Recognition) system.
We ran the resulting R package for table extraction on Google Cloud using Docker, integrated with a PostgreSQL database. With this setup, we extracted about 365,000 prices from a larger set of historical data than we had originally planned to process.
A presentation of our work, entitled “Mining Historic Realia: Automatic Generation of Historic Wine Pricing,” was accepted and presented at the 13th Annual Conference of the American Association of Wine Economists.