How Dropbox is using machine learning to index text from billions of images

People have stored more than 20 billion image and PDF files in Dropbox. Of those files, 10-20% are photos of documents—like receipts and whiteboard images—as opposed to documents themselves. These are now candidates for automatic image text recognition. Similarly, 25% of these PDFs are scans of documents that are also candidates for automatic text recognition.

From a computer vision perspective, although a document and an image of a document might appear very similar to a person, there’s a big difference in the way computers see these files: a document can be indexed for search, allowing users to find it by entering some words from the file; an image is opaque to search indexing systems, since it appears as only a collection of pixels. Image formats (like JPEG, PNG, or GIF) are generally not indexable because they have no text content, while text-based document formats (like TXT, DOCX, or HTML) are generally indexable. PDF files fall in-between because they can contain a mixture of text and image content. Automatic image text recognition is able to intelligently distinguish between all of these documents to categorize data contained within.