Optical Font Recognition Using Typographical Features
Abdelwahab Zramdini and Rolf Ingold
Abstract—A new statistical approach based on global typographical features is proposed for the widely neglected problem of font recognition. It aims at identifying the typeface, weight, slope, and size of a text from an image block without any knowledge of the content of that text. The recognition is based on a multivariate Bayesian classifier and operates on a given set of known fonts. The effectiveness of the approach has been evaluated on a set of 280 fonts. Font recognition accuracies of about 97 percent were reached on high-quality images. In addition, rates higher than 99.9 percent were obtained for weight and slope detection. Experiments have also shown the system's robustness to document language and text content and its sensitivity to text length.
Index Terms—Optical font recognition, typographical features, font models, multivariate Bayesian classifier, document analysis, OCR.
1 INTRODUCTION
A considerable amount of research has been dedicated to optical character recognition (OCR) of printed texts. Early OCR systems, called monofont reading systems, were able to read a single font, sometimes even specific fonts that were designed for optical reading purposes (the OCR-A and OCR-B fonts). Recent developments have been oriented toward omnifont recognition methods, which aim at recognizing characters of any font and style [1], [2]. Some of the currently available OCR products are only able to distinguish two or three font styles such as italic, bold, seriffed, or sanserif, and the results of such tools are still not very accurate. To our knowledge, there has been no serious study of the optical font recognition (OFR) problem, which can be addressed through two complementary approaches [3]: the a priori approach, in which the characters of the analyzed text are not yet known, and the a posteriori approach, in which the content of the given text is used to recognize the font.
Only a few works have addressed automatic typeface recognition, with a focus on the identification of some font attributes, such as slope and weight, for OCR, document analysis, and image editing purposes [4], [5], [6], [7], [8]. Morris examined the applicability of human vision models to typeface discrimination; he used Fourier amplitude spectra of images to extract global feature vectors used by a Bayesian classifier [9]. Khoubyari and Hull presented an algorithm that identifies the predominant font in a document [10].
In this paper, we present a novel contribution to the a priori OFR approach. Our goal is to discriminate the font of a given piece of text among a set of several hundred known fonts constituting the so-called font model base. In our system, called ApOFIS (A priori Optical Font Identification System), global typographical features are extracted from the text image and used by a multivariate Bayesian classifier.
The rest of this paper presents our approach to OFR. In Section 2, the features used by the classifier are briefly presented and their power for font discrimination is highlighted. In Section 3, experimental results are discussed; they show the relevance of our approach on printed and then scanned documents. Appendices A and B formally present a classification of connected components and the features used.
2 THE APOFIS APPROACH TO OPTICAL FONT RECOGNITION
In this section, typographical notions and the features used in ApOFIS are presented.
2.1 Typographical Study and the ApOFIS Approach
The features used to model fonts have been derived from global typographical properties of text lines. These properties are presented below in order to justify the features used by ApOFIS. The notion of font is also specified.
2.1.1 Font Specification and Identification Attributes
Typographically, a font is a particular instantiation of a typeface design, often in a particular size, weight and style [11]. Typefaces are distinguished by their writing style (cursive, typesetter), shape of serifs, x-height proportion, character spacing (fixed, proportional, with/without kerning), and loop axes, etc. [12], [11]. Within ApOFIS, a font is fully specified by five attributes: typeface (Times, Courier, Helvetica, ...), weight (light, regular, demi, bold, heavy), slope (roman, italic), width (normal, expanded, condensed), and size.
2.1.2 Typographical Structure of Text Lines
As shown in Fig. 1, text line images are composed of three typographical zones: the upper, central, and lower zones, which are delimited by four virtual horizontal lines. While the upper and lower zones depend on the text content, the central zone is always occupied regardless of the characters that occur. The height of the central zone is commonly called x-height, and its proportion in the text height differs from one typeface to another.
Within printed Latin texts, characters are separated by two kinds of spaces: character-spaces and word-spaces. The former are an intrinsic aspect of typeface design. We assume that their values depend exclusively on the nature of the typeface (proportional or fixed spacing) and on character sequences, where negative spaces may occur with italic style or in case of kerning.¹ Character-spaces therefore have to be preserved in typeface discrimination. The latter depend exclusively on formatting parameters, such as margins and justification mode. They must be ignored during feature extraction, since they do not provide any relevant information on the font.
2.1.3 The ApOFIS Approach
ApOFIS aims at font identification from images of text lines, which are assumed to be homogeneously typeset, i.e., set in a single font. In practice, ApOFIS works as a multivariate Bayesian classifier based on feature distributions that are estimated independently of the content, structure, and language of texts, while taking into account the influence of the text length.
2.2 Feature Extraction
The ApOFIS prototype uses eight global features extracted from connected components and from horizontal and vertical projection profiles of text lines. Features have been carefully selected in order to discriminate fonts regardless of the text content and structure.
1. Character spacing within words can be adjusted for improved appearance or to fit text to a specific width.
Fig. 1. Typographical structure of text lines.
Fig. 2. Four typographical lines from vertical projection profiles.
Fig. 3. Typographical and morphological classifications of connected components.
Most of the feature vectors have been shown to follow normal laws [3], so that learning consists of the evaluation of the multivariate normal density parameters, i.e., the mean vector μ and the covariance matrix Σ of each font.
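To make the learning step and the resulting decision rule concrete, the following minimal pure-Python sketch estimates a mean vector and a covariance matrix per font and assigns a new feature vector to the font with the highest normal log-density. It is not the ApOFIS implementation: the function names, the equal-prior assumption, and the two-feature toy data are hypothetical and chosen only for illustration.

```python
import math

def fit_gaussian(samples):
    """Estimate the mean vector and covariance matrix of a list of feature vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(d)]
    cov = [[sum((s[a] - mean[a]) * (s[b] - mean[b]) for s in samples) / (n - 1)
            for b in range(d)] for a in range(d)]
    return mean, cov

def cholesky(m):
    """Lower-triangular factor low with low * low^T = m (m symmetric positive definite)."""
    d = len(m)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def log_density(x, mean, cov):
    """Logarithm of the multivariate normal density N(x; mean, cov)."""
    d = len(x)
    low = cholesky(cov)
    z = []  # solves low * z = (x - mean) by forward substitution
    for i in range(d):
        z.append((x[i] - mean[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(d))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + sum(v * v for v in z))

def classify(x, models):
    """Return the font whose model gives the highest log-density (equal priors assumed)."""
    return max(models, key=lambda name: log_density(x, *models[name]))

# Hypothetical training vectors (two made-up features per text line).
regular = [[0.20, 1.0], [0.22, 1.1], [0.19, 0.9], [0.21, 1.2]]
bold = [[0.35, 1.0], [0.37, 0.9], [0.34, 1.2], [0.36, 1.05]]
models = {"regular": fit_gaussian(regular), "bold": fit_gaussian(bold)}
print(classify([0.21, 1.0], models))
```

With equal priors, maximizing the log-density is equivalent to the Bayesian maximum a posteriori decision; unequal priors would simply add a per-font log-prior term.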
The feature extraction process, which assumes skew-free images, performs three steps:
1) determination of the typographical structure of the text line;
2) classification of connected components;
3) calculation of features using the typographical structure of the text and the classified connected components.
2.2.1 Determination of the Typographical Structure
The typographical structure of text lines is used to classify connected components and to delimit the feature extraction area. It is determined from the vertical projection profile, VP, as shown in Fig. 2. Each component VP[i] represents the number of black pixels on scanline i. The ul and bl scanlines, which estimate the upperline and the baseline, correspond to the main peaks of VP, such that:
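$$\mathit{ul} = i \quad \text{if } i \in \left[\mathit{to},\; \mathit{to} + \tfrac{\mathit{bo} - \mathit{to}}{2}\right] \;\&\; \max\left(VP[i+1] - VP[i]\right)$$
$$\mathit{bl} = i \quad \text{if } i \in \left[\mathit{to} + \tfrac{\mathit{bo} - \mathit{to}}{2},\; \mathit{bo}\right] \;\&\; \max\left(VP[i-1] - VP[i]\right)$$
where to and bo denote the top and bottom scanlines of the text line.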
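The determination of the four typographical lines can be sketched as follows. This is an illustrative reading, not the authors' code: the image is assumed to be a list of rows with 1 for black pixels, ul is taken as the scanline of the upper half after which the profile rises most sharply, and bl as the scanline of the lower half at which it has dropped most sharply. The function names and this exact index convention are assumptions.

```python
def projection_profile(image):
    """VP[i] = number of black pixels (value 1) on scanline i."""
    return [sum(row) for row in image]

def typographical_lines(vp):
    """Estimate (to, ul, bl, bo) from the profile of a single text line."""
    inked = [i for i, v in enumerate(vp) if v > 0]
    to, bo = inked[0], inked[-1]          # top and bottom scanlines of the text
    mid = to + (bo - to) // 2

    def at(i):
        # Profile value with zero padding outside the image.
        return vp[i] if 0 <= i < len(vp) else 0

    # Upper line: largest rise of the profile, searched in the upper half.
    ul = max(range(to, mid + 1), key=lambda i: at(i + 1) - at(i))
    # Baseline: largest drop of the profile, searched in the lower half.
    bl = max(range(mid, bo + 1), key=lambda i: at(i - 1) - at(i))
    return to, ul, bl, bo

# Toy text line: an ascender (rows 1-2), a dense central zone (rows 3-5),
# and a descender (rows 6-7).
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(typographical_lines(projection_profile(image)))
```

Under this convention, ul and bl bracket the dense central zone from the outside, so bl - ul slightly overestimates the x-height by construction.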
2.2.2 Classification of Connected Components
Within images, characters can be located by the rectangular envelopes of their connected components, which may, however, correspond to linked or broken characters. In order to extract some of the features, typographical and morphological classifications of connected components are performed.
As illustrated in Fig. 3, we use the positions of connected components within the typographical zones of the text line to distinguish between six typographical classes (Full, High, Deep, Short, SuperSc., SubSc.). Similarly, we distinguish between six morphological classes (Wide, Large, Squared, Tall, Thin, Small) using the dimensions of connected components, especially the width to height ratio (see Appendix A).
The proposed typographical and morphological classifications have not been assessed through exhaustive tests, but on the basis of empirical measurements. In contrast with OCR applications based on connected components, which assume a nearly 100 percent accurate component preclassification [13], [14], ApOFIS can tolerate some misclassifications, since it assumes that features are computed from relatively long strings.
2.2.3 Features and Their Contribution to Font Discrimination
Feature selection consists of searching for global aspects of the text that allow a reader to distinguish visually between font weights, slopes, sizes, and typefaces. In this section, we only give an intuitive description of the selected features, with a special focus on their power to discriminate font attributes. A detailed and formal description of each feature is given in Appendix B.
Weight and Slope Detection
As shown in Fig. 4, the font weight is reflected by the density of black pixels in the text line image. One can easily notice that bold texts have a higher density than regular ones. Therefore, the density is taken as a weight discrimination feature, at least when the typeface is known.
One can also observe from the horizontal profile that roman texts are characterized by a set of upright and tall peaks. For italic texts, the peaks are less tall, rounded, and broader. Taking the squared values of the profile derivative has been shown to be relevant for slope discrimination.
Furthermore, the width of vertical stems and the height of horizontal stems within characters depend on the typeface design and on the font weight and size. Estimating the stem width and height allows us to distinguish not only between regular and bold, but also between roman and italic for the same typeface.
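The density and profile-derivative features described above can be illustrated with a short sketch. It assumes a binary image given as a list of rows (1 for black), takes the horizontal profile as the count of black pixels per pixel column, and averages the squared first differences; the function names and this normalization are assumptions, since the formal definitions belong to Appendix B.

```python
def density(image):
    """Fraction of black pixels (value 1) in a binary text-line image."""
    total = sum(len(row) for row in image)
    return sum(sum(row) for row in image) / total

def horizontal_profile(image):
    """HP[j] = number of black pixels in pixel column j."""
    return [sum(col) for col in zip(*image)]

def squared_derivative(profile):
    """Mean squared first difference of a projection profile."""
    diffs = [b - a for a, b in zip(profile, profile[1:])]
    return sum(d * d for d in diffs) / len(diffs)

# Two toy images with the same amount of ink: upright stems and slanted stems.
upright = [[1, 0, 0, 0, 1, 0, 0, 0] for _ in range(4)]
slanted = [
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 0],
]
# A "bold" variant of the upright image: stems twice as wide.
bold = [[1, 1, 0, 0, 1, 1, 0, 0] for _ in range(4)]

print(density(upright), density(bold))
print(squared_derivative(horizontal_profile(upright)),
      squared_derivative(horizontal_profile(slanted)))
```

In the toy data the upright stems stack into tall, narrow peaks of the profile, whereas the slanted stems spread the same ink over neighboring columns, which flattens the profile and lowers the squared derivative.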
Fig. 4. Effects of font style on the horizontal projection profile and its first derivative.
Fig. 5. Difference between scanlines for seriffed and sanserif texts computed from connected components extremities.
Typeface Detection
Serifs are the most obvious features that distinguish seriffed from sanserif typefaces. They are mainly located at the ends of the main strokes of characters. As shown in Fig. 5, serifs are revealed by computing the difference between consecutive scanlines, especially in the regions around the top and bottom of connected components. The density of the resulting image has proved relevant for seriffed/sanserif discrimination.
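A minimal sketch of the scanline-difference idea follows. It simplifies the description above in one respect, which is an assumption of this example: the difference is taken over the whole image rather than only around the top and bottom of connected components. The function name is hypothetical.

```python
def scanline_difference_density(image):
    """Density of the image obtained by comparing each scanline with the next one.

    A pixel of the difference image is set when the two vertically adjacent
    pixels differ, so horizontal protrusions such as serifs leave a trace.
    """
    changed = 0
    for upper, lower in zip(image, image[1:]):
        changed += sum(a != b for a, b in zip(upper, lower))
    cells = (len(image) - 1) * len(image[0])
    return changed / cells

# Toy stems: a plain vertical stroke, and the same stroke ending in a foot serif.
sans_stem = [[0, 1, 0] for _ in range(4)]
serif_stem = [[0, 1, 0] for _ in range(3)] + [[1, 1, 1]]

print(scanline_difference_density(sans_stem))
print(scanline_difference_density(serif_stem))
```

The plain stroke produces an empty difference image, while the serif contributes two changed pixels where the foot widens.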
On the other hand, the intercharacter spacing mode (fixed, proportional) is a fundamental aspect of typeface design. The character spacing mode changes from one typeface to another. It is revealed by the average pixel distances between the rectangular envelopes of connected components within words. Character spacing is also significantly influenced by the text slope, with even negative values for italics.
Size Detection
The text size is obviously revealed by character heights and widths. The text height is globally characterized by the x-height, Xheight, and the total text height. These measures depend, however, on the text content and structure. A more sophisticated measure was therefore defined, based on a normalized height of the connected components. It uses the typographical class of each component defined in Section 2.2.2.
Similarly, the average width of the connected components that are squared and located in the central zone, e.g., those corresponding to the characters a, c, and u, estimates the character width.
3 EXPERIMENTS
This section presents the results of various classification experiments and discusses the strengths and weaknesses of the ApOFIS approach to font recognition. The classifier uses a font model base (FMB), which presently includes 280 font models representing 10 typefaces combined with seven sizes (8, 9, 10, 11, 12, 14, and 16 pt) and four styles (regular, bold, italic, bold-italic).² Each font model has been estimated from the feature vectors of about 100 text lines, each about 6 cm long. Texts were arbitrarily taken from English documents, produced by a 300-dpi laser printer, and scanned at 300 dpi.
2. This means that 28 fonts have been considered for each typeface.
The test set contained at least 100 French text lines for each font. Images were produced under the same conditions as for learning.
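The composition of the font model base can be checked with a few lines of code. The sketch below simply enumerates the combinations stated above, using the ten typeface names that appear in Table 1; the variable names are hypothetical.

```python
from itertools import product

# Typeface names as listed in Table 1.
TYPEFACES = [
    "Palatino", "New-Century-Schlbk", "Lucida-Bright", "Times",
    "Helvetica-Black", "Avant-Garde", "Helvetica", "Lucida-Sans",
    "Courier", "Lucida-Sans-Typewriter",
]
SIZES = [8, 9, 10, 11, 12, 14, 16]                    # point sizes
STYLES = ["regular", "bold", "italic", "bold-italic"]

# One font model per (typeface, size, style) combination.
font_model_base = list(product(TYPEFACES, SIZES, STYLES))

print(len(font_model_base))                            # total number of font models
print(sum(1 for f in font_model_base if f[0] == "Times"))  # fonts per typeface
```

With about 100 training lines per font, each typeface is represented by roughly 28 x 100 text lines, which is consistent with the figure of about 3,000 lines per typeface quoted for the recognition rates.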
3.1 Classification Results
Table 1 lists the average recognition rates of fonts and font attributes for each typeface. The classification was performed among the 280 font models of the FMB.
For each font, the measured recognition accuracy is expressed as an average rate over the processed lines. The typeface recognition accuracy corresponds to the average of its 28 font rates and is therefore estimated from about 3,000 text lines. We distinguish two kinds of accuracies: one is obtained for fonts considered as a “whole,” where any attribute misclassification leads to a recognition error; the other concerns the individual font attributes, i.e., the family, weight, slope, and size.
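The difference between the two kinds of accuracy can be made explicit with a small sketch. The record layout (one dictionary per text line) and the four toy lines are hypothetical; only the scoring rule follows the description above: a line counts as a correct font only if all four attributes are correct.

```python
ATTRIBUTES = ("typeface", "weight", "slope", "size")

def accuracies(truth, predicted):
    """Per-attribute and whole-font recognition rates, in percent."""
    n = len(truth)
    pairs = list(zip(truth, predicted))
    rates = {a: 100.0 * sum(t[a] == p[a] for t, p in pairs) / n for a in ATTRIBUTES}
    # A font is recognized only when every attribute is recognized.
    rates["font"] = 100.0 * sum(all(t[a] == p[a] for a in ATTRIBUTES)
                                for t, p in pairs) / n
    return rates

def font(typeface, weight, slope, size):
    return {"typeface": typeface, "weight": weight, "slope": slope, "size": size}

# Hypothetical ground truth and classifier output for four text lines.
truth = [
    font("Times", "regular", "roman", 10),
    font("Times", "bold", "italic", 12),
    font("Courier", "regular", "roman", 9),
    font("Helvetica", "bold", "roman", 14),
]
predicted = [
    font("Times", "regular", "roman", 10),
    font("Times", "bold", "italic", 11),         # size error only
    font("Courier", "regular", "roman", 9),
    font("Lucida-Sans", "bold", "roman", 14),    # typeface error only
]
print(accuracies(truth, predicted))
```

Because any single attribute error invalidates the whole font, the font rate can never exceed the lowest attribute rate.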
3.1.1 Attributes Discrimination
When focusing on “font” recognition, the results show that the classifier performed well, with an overall recognition rate near 97 percent. The hardest fonts to recognize were the seriffed ones (95.6 percent), since they are characterized by very complex character shapes.
Typewriter fonts, for which we obtained an accuracy of 98.2 percent, appear to be the easiest to recognize, probably because of their fixed-pitch aspect. A few recognition failures were registered for the Lucida-Sans and Times fonts.
The classification results also show that the classifier is very robust in slope detection, with an overall accuracy of 99.97 percent. It has, however, shown a few failures, especially for some Courier fonts. This can be explained by the small x-height of that typeface, which results in very short vertical stems. The lowest slope recognition rate remains higher than 97 percent.
The identification of the font weight is inherently more complex, since typefaces may exist in many weights. The considered typefaces were available either in light and demi or in regular and bold weights (however, four weights have been used with the Helvetica typeface: regular, bold, black, and heavy³). The system succeeded in weight discrimination regardless of the large size of the FMB, with a lowest rate of 97 percent observed for one Lucida-Sans font. In spite of the high accuracy observed for Helvetica, the weight-detection problem remains open when many weights of the same typeface are considered. We can, however, claim that if the typeface is correctly identified, then its weight is also. Furthermore, the weight has to be considered as a relative measurement, since a bold font of one typeface may be lighter than a regular font of another typeface.
3. In fact, we considered the Helvetica-Black typeface as a variant of Helvetica.

TABLE 1
AVERAGE RECOGNITION RATES OF FONTS AND FONT ATTRIBUTES WHEN THE CLASSIFICATION IS PERFORMED AMONG THE 280 FONTS
Seriffed typefaces: PL = Palatino, NC = New-Century-Schlbk, LB = Lucida-Bright, TM = Times. Sanserif typefaces: HB = Helvetica-Black, AG = Avant-Garde, HV = Helvetica, LS = Lucida-Sans. Typewriter typefaces: CR = Courier, LT = Lucida-Sans-Typewriter.

           PL     NC     LB     TM     HB     AG     HV     LS     CR     LT     average
Font       97.46  96.92  95.27  92.87  99.84  99.58  99.44  91.26  99.30  97.16  96.91
Typeface   97.82  97.92  95.81  93.72  99.99  99.77  99.68  91.57  99.97  97.20  97.35
Size       98.25  97.67  95.30  93.59  99.88  99.66  99.61  99.04  99.35  99.88  98.22
Weight     99.78  99.63  99.94  99.88  99.95  99.85  99.97  99.95  100    100    99.90
Slope      99.86  99.92  100    99.94  100    99.95  100    100    99.95  100    99.97

TABLE 2
TYPEFACE CONFUSION MATRIX WHEN THE CLASSIFICATION IS PERFORMED AMONG THE 280 FONTS
Rows: effective typeface. Columns: top-choice typeface. LB, NC, PL, and TM are seriffed; AG, HV, HB, and LS are sanserif; LT and CR are typewriter typefaces.

           LB     NC     PL     TM     AG     HV     HB     LS     LT     CR     FER
LB         95.81  1.86   0.91   1.36   0      0      0      0      0      0      4.19
NC         0.67   97.92  0.89   0      0      0      0      0      0      0      2.08
PL         1.09   0.81   97.82  0      0      0      0      0      0      0      2.18
TM         2.31   2.95   0.90   93.72  0      0      0      0      0      0      6.28
AG         0      0      0      0      99.77  0      0      0      0      0      0.23
HV         0      0      0      0      0      99.68  0      0      0      0      0.32
HB         0      0      0      0      0      0      99.99  0      0      0      0
LS         0      0      0      0      0      0      0      91.57  8.12   0      8.43
LT         0      0      0      0      0      0      0      2.61   97.20  0      2.80
CR         0      0      0      0      0      0      0      0      0      99.97  0
ERTCF      4.07   5.62   2.70   1.97   0      0      0      2.61   8.12   0      2.65
In spite of the presence of small and consecutive sizes, the system shows good size discrimination power, with an average accuracy rate of 98.22 percent. Similarly, except for Lucida-Sans and Times, typefaces are relatively well discriminated, with an accuracy rate of 97.35 percent. In fact, the identification of typeface and size is very difficult, since
1) typeface discrimination, especially from short texts, is a domain reserved to skilled typographers, and
2) even for a known typeface, a one-point difference in size is hardly noticeable, even by experienced readers.
In addition, recognition rates have shown that size and typeface misclassifications were often tightly coupled. This suggests that a priori knowledge of the typeface would certainly enhance size recognition and vice versa. Classification results have shown that knowledge of the size improved typeface discrimination, with rates increasing from 93.72 percent to 98.77 percent for Times.
3.1.2 Typeface Confusion
Table 2 shows the typeface confusion matrix, which is read as follows. Each [fi, fj] entry gives the percentage of effective fonts fi that were classified as the top-choice fonts fj. The last column gives the misclassification rates (FER). The last row lists the error rates for top-choice fonts (ERTCF). The table indicates the “noisy” typefaces, which affect the recognition rates of the other typefaces, e.g., Lucida-Bright and New-Century. Finally, the overall error rate is given in the bottom-right entry of the matrix, which indicates that 2.65 percent of the 30,000 text lines were misclassified.
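The two summary rates of the confusion matrix can be reproduced from its entries. The sketch below uses the Lucida-Sans (LS), Lucida-Sans-Typewriter (LT), and Courier (CR) entries of Table 2, arranged with rows as effective typefaces as described above. The formulas are an assumption inferred from the published figures rather than definitions stated in the text: FER is taken as 100 minus the diagonal entry, and ERTCF as the sum of the off-diagonal entries of a column.

```python
def confusion_rates(pct, labels):
    """Return (FER, ERTCF) dictionaries from a row-normalized confusion matrix.

    pct[i][j] is the percentage of lines of effective typeface i that were
    classified as top-choice typeface j.
    """
    n = len(labels)
    # Misclassification rate of each effective typeface.
    fer = {name: round(100.0 - pct[k][k], 2) for k, name in enumerate(labels)}
    # Error attracted by each typeface when it is the (wrong) top choice.
    ertcf = {name: round(sum(pct[i][k] for i in range(n) if i != k), 2)
             for k, name in enumerate(labels)}
    return fer, ertcf

labels = ["LS", "LT", "CR"]
pct = [
    [91.57, 8.12, 0.0],    # effective LS
    [2.61, 97.20, 0.0],    # effective LT
    [0.0, 0.0, 99.97],     # effective CR
]
fer, ertcf = confusion_rates(pct, labels)
print(fer)
print(ertcf)
```

A typeface with a high ERTCF is "noisy" in the sense used above: it is frequently chosen in place of other typefaces, even if its own lines are well recognized.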
The matrix shows the power of ApOFIS in discriminating between the seriffed, sanserif, and typewriter typefaces: misclassifications mainly occur within the same family. A discrimination rate of 99.65 percent was obtained between seriffed and sanserif. The matrix shows, however, that the system failed in discriminating between Lucida-Sans (LS) and Lucida-Sans-Typewriter (LT). The same behavior, which originates from the fact that LT is a fixed-width typeface highly stylized to look like LS, was also observed in Morris's experiments [9].
3.2 Effects of Text Length
The discussion so far has not considered text length. In the previous experiments, all text entities, used to create the FMB and to assess the classifier, have similar lengths. The study on the influence of text length is of importance because document analysis may require font identification from fragments of various lengths, e.g., words, lines, or even paragraphs.
In the following experiment, the classifier was applied to texts of four different lengths using the original FMB (generated from text lines of a default length L). The text lines to be recognized were either broken into smaller entities of length 1/4 L and 1/2 L, or merged⁴ together to build new entities of length 2L and 4L. Table 3 shows recognition rates for these text lengths. The results confirm the intuitive prediction that recognition accuracy improves with the length of the text line. While weight and slope detection remain robust on short texts, size and typeface detection are, as expected, clearly less accurate.
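According to the footnote on merging, the feature vectors of the longer entities were computed as averages of those of the default lines. A minimal sketch of that averaging is given below; the function name, the grouping of consecutive lines, and the choice to drop an incomplete last group are assumptions made for the example.

```python
def merge_feature_vectors(vectors, group_size):
    """Average consecutive groups of feature vectors.

    group_size = 2 emulates entities of length 2L, group_size = 4 those of
    length 4L. A trailing incomplete group is dropped.
    """
    merged = []
    for start in range(0, len(vectors) - group_size + 1, group_size):
        group = vectors[start:start + group_size]
        merged.append([sum(column) / group_size for column in zip(*group)])
    return merged

# Hypothetical two-feature vectors of five default-length lines.
lines = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
print(merge_feature_vectors(lines, 2))
```

Averaging reduces the variance of each feature, which is one plausible reason why recognition improves for the 2L and 4L entities.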
The classifier has also been applied to single words, using a limited FMB of 84 fonts. Experiments have confirmed that the size and typeface attributes are the hardest to identify on short texts. A model of text length was used to automatically adapt the classifier to the text length, and an improvement of recognition results was observed [3].
TABLE 3
EVOLUTION OF RECOGNITION RATES FOR FOUR TEXT LENGTHS USING THE DEFAULT FMB

           Weight  Slope   Size    Typeface  Font
1/4L       95.54   96.16   77.31   75.32     64.15
1/2L       98.41   99.0    90.73   89.11     84.62
L          99.90   99.97   98.22   97.35     96.91
2L         100     100     99.44   98.87     98.23
4L         100     100     99.54   99.51     99.02

TABLE 4
TYPOGRAPHICAL AND MORPHOLOGICAL CLASSES OF CONNECTED COMPONENTS
4 CONCLUSIONS
This paper has addressed the problem of optical font recognition (OFR), which has been widely ignored by the scientific community so far. The aim of the developed system⁵ is to analyze a text line image and to identify the typeface, the font style, and the size from a given set of already learned fonts. We have adopted a statistical approach based on the extraction of a few well-selected global features from a medium-resolution image of a scanned text.
The method has been extensively tested on more than thirty thousand two-column formatted text lines extracted from scanned documents, using a set of 280 distinct fonts. The experimental results are extremely encouraging: the measured overall recognition rate was close to 97 percent. The classifier obtained accuracy rates even higher than 99.9 percent for the more practical problem of identifying the font style of a given typeface. The method can be considered applicable to short texts of about ten characters.
The accuracy of the results suggests the use of such a tool not only for logical structure recognition, for which the system was originally planned, but also to improve OCR accuracy by combining it with monofont OCR systems [15]. The latter are known to perform better than omnifont ones.
4. In case of merging, feature vectors were computed as averages of those of the default lines of length L.
5. ApOFIS is available as a C++ library that can be downloaded from our web site.
Finally, the reported results were obtained on 300 dpi images of rather good quality, produced by scanning laser-printed pages. Some recent experiments have shown that the method is still applicable to slightly degraded documents, such as first- or second-generation photocopies, provided that the font model base has been built under the same conditions. In practice, such an assumption can hardly be satisfied; therefore, we have to consider a more realistic approach, which consists of adapting the font model base automatically according to the type of degradation.
APPENDIX A
CLASSIFICATION OF CONNECTED COMPONENTS
Within images, characters are located by the envelopes of their connected components (cc), which are defined by their top t(cc), bottom b(cc), left l(cc), and right r(cc) coordinates. Their heights are defined as h(cc) = b(cc) - t(cc) and their widths as w(cc) = r(cc) - l(cc).
Typographical classes are determined in practice using the positions of t(cc) and b(cc) relative to the typographical lines. A tolerance factor ε, empirically fixed from to and bo, is introduced to account for position fluctuations within the line. Six typographical classes (T(cc)) have been defined, as illustrated in Table 4 and Fig. 3.
The morphological classification is based on the cc dimensions, with classes distinguished by the ratio r = w(cc)/h(cc), as shown in Fig. 3. Six morphological classes (M(cc)) have been defined, as illustrated in Table 4. The boundaries between these classes were fixed empirically through a statistical analysis of the ratio r for various fonts.
APPENDIX B
FEATURES DESCRIPTION
Features are extracted from the horizontal projection profile of text images and from connected components. Spaces between words are ignored and replaced by a fixed value corresponding to a median character-space. Each space between two successive connected components that is larger than bl - ul is assumed to be a word-space and is therefore ignored.
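The handling of word-spaces can be sketched as follows. Component envelopes are given as hypothetical (left, right) pixel pairs, and the threshold argument stands for the distance between baseline and upperline; the function names are assumptions, and the sketch expects at least one character-space in the line.

```python
from statistics import median

def character_spaces(boxes, x_height):
    """Gaps between successive envelopes, with word-spaces replaced.

    boxes    -- list of (left, right) coordinates of connected components
    x_height -- distance between baseline and upperline; larger gaps are
                treated as word-spaces
    """
    boxes = sorted(boxes)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(boxes, boxes[1:])]
    char_gaps = [g for g in gaps if g <= x_height]   # may include negative gaps
    fill = median(char_gaps)                          # median character-space
    return [g if g <= x_height else fill for g in gaps]

def mean_character_space(boxes, x_height):
    """Average spacing after word-spaces have been neutralized."""
    spaces = character_spaces(boxes, x_height)
    return sum(spaces) / len(spaces)

# Two "words": three components, a wide gap, then two more components.
boxes = [(0, 10), (12, 20), (23, 30), (50, 60), (61, 70)]
print(character_spaces(boxes, x_height=12))
print(mean_character_space(boxes, x_height=12))
```

Replacing a word-space by the median rather than dropping it keeps the number of gaps unchanged, so the resulting statistics do not depend on how many words the line happens to contain.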