This is a memorandum for people who scrap (clip and archive) newspaper articles.
Figure 1: A newspaper scrapping scene.
----------------------------------
PREFACE
I am deeply attached to newspaper scrapping. Since 2001, I have been reading The Daily Yomiuri every day to practice reading English. When I started reading it, I had to study for a graduate school entrance exam.
Until then, I had worked at a construction consulting company
that operated only domestically, so I did not use English at work. As a result, I had a long gap in my use of English.
That was my motivation for scrapping newspaper articles every day. During my graduate school days, I was far from the business world,
so I felt a lack of business information. In 2004, I therefore started reading the Nihon Keizai Shimbun, known as the Nikkei in Japan.
After leaving university and joining a company, I have continued to scrap newspaper articles to this day.
Around 2008, the notebooks filled with pasted articles exceeded 70 volumes.
So I decided to move from the analog scrapping world to the digital one.
This memorandum records my path to digital archiving.
----------------------------------
TIMELINE SUMMARY
2001: Started reading The Daily Yomiuri.
2004: Started reading the Nihon Keizai Shimbun, known as the Nikkei.
2007: Started reading The Nikkei Weekly.
2007: Bought an A3 flatbed scanner with an auto-feeder.
2007: Moved to digital scrapping.
----------------------------------
ROUTINE WORK
I always get up around 5:00 a.m., because my black cat calls me to ask for breakfast. So I start checking the newspapers around 5:30 a.m.
First, I read the newspaper and mark articles with a red pencil, as shown below.
Figure 2: Marking articles of interest.
I mainly look for science and environmental articles, because I formerly majored in that discipline. In addition, I check information about my business field. The colored line that separates an article from its neighbors is very important for the scanning work.
After checking the newspaper, I start up my computer.
Figure 3: My computer and my scanner.
Figure 4: My flatbed scanner. I bought it on the used-device market for about 80,000 yen. This EPSON 'Offirio ES-7000H' supports an A3-wide scanning area and has an auto-feeder.
The scanning procedure is as follows:
1. Start up Linux and log in.
2. Turn on the flatbed scanner.
3. Open Nautilus.
4. Open GNOME Terminal.
5. Start the scanning software.
6. Scan.
Figure 5: Desktop screenshot of the scanning work.
Figure 5 shows the desktop during the scanning process.
I always adjust the scanning area so that it follows the hand-drawn
marked lines.
Figure 6: Scanning an article. The [Torikomi] button (Japanese) means [Scan] in English.
After I click the [Scan] button, a dialog asks for the file name and its directory path. File names are defined in the following manner.
20111231_The_Daily_Yomiuri_01.jpg
20111231_The_Daily_Yomiuri_02.jpg
.....
20111231_The_Daily_Yomiuri_0n.jpg
As you can see, file names must not contain blanks,
because these files are processed by shell scripts in an automated job.
The two-digit number at the tail of the file name is normally counted up by one for each scan.
----01.jpg
----02.jpg
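The naming rule above can be checked before the batch scripts run. The following is only a minimal sketch and is not one of the scripts in this memorandum: the function name check_scan_name and the exact pattern (eight digits, an underscore, then letters, digits, and underscores, ending in .jpg) are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not one of the scripts in this memo):
# returns 0 when a scan file name starts with YYYYMMDD_, contains no blanks,
# and ends in .jpg; otherwise returns 1.
check_scan_name() {
    local name pattern
    name=$(basename "$1")
    pattern='^[0-9]{8}_[A-Za-z0-9_]+\.jpg$'
    [[ "${name}" =~ ${pattern} ]]
}

# Report every .jpg in the current directory that breaks the rule.
for f in ./*.jpg; do
    [ -e "${f}" ] || continue      # nothing to check when no .jpg exists
    check_scan_name "${f}" || echo "bad file name: ${f}"
done
```

Running this in the scan directory before InsertCredit.sh would list any file whose name the later scripts might mishandle.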
Sometimes I scan multiple areas for the same article.
The scanning process cases are summarized in Figure 7.
Figure 7: Flow chart of the scanning process, which depends on the target article area.
Pattern A [ (1) - (2) ] ... normal sequence
The procedure from (1) to (2) is summarized as follows:
1. Trim to a rectangular frame.
2. Trim convex and concave areas.
3. Tune the color configuration.
4. Insert the newspaper name and the issue date stamp.
5. Convert the file from JPG to EPS.
6. Create the page file using a LaTeX layout.
7. Convert the file from DVI to PDF.
Figure 8: Scanning scene.
Figure 9: Trimming an image file with the gThumb image viewer.
Pattern B [ (3) - (4) ] ... joining with GIMP
The procedure from (3) to (4) is summarized as follows:
1. Trim each file to a rectangular frame.
2. Trim convex and concave areas in each file.
3. Tune the color configuration of each file.
4. Join two or more image files using GIMP.
5. Insert the newspaper name and the issue date stamp.
6. Convert the file from JPG to EPS.
7. Create the page file using a LaTeX layout.
8. Convert the file from DVI to PDF.
Figure 10: Desktop view of joining image files. First, the frame size is expanded to make room for the appended image.
Figure 11: Desktop view of appending and shifting an image file.
Figure 12: Two images joined into a single file.
Pattern C [ (5) - (6) ] ... joining with pdftk
The procedure from (5) to (6) is summarized as follows:
1. Trim each file to a rectangular frame.
2. Trim convex and concave areas in each file.
3. Tune the color configuration of each file.
4. Insert the newspaper name and the issue date stamp into each file.
5. Convert each file from JPG to EPS.
6. Create the page file using a LaTeX layout.
7. Convert the file from DVI to PDF.
8. Join two or more files using pdftk.
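The final pdftk step of Pattern C could look like the following. This is only a sketch under assumptions: the function name join_article_pdfs, the DRY_RUN switch, and the "_joined" output name are hypothetical and chosen for illustration. The pdftk call itself uses the standard "cat output" form.

```shell
#!/bin/bash
# Hypothetical helper (an assumption for illustration): join the PDF pages of
# one article into a single file with pdftk.
# Usage: join_article_pdfs OUTPUT.pdf INPUT1.pdf INPUT2.pdf [...]
# With DRY_RUN=1 the pdftk command is only printed, not executed.
join_article_pdfs() {
    local output=$1
    shift
    if [ "$#" -lt 2 ]; then
        echo "need at least two input files" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "pdftk $* cat output ${output}"
    else
        pdftk "$@" cat output "${output}"
    fi
}

# Example: preview the command for a two-part article.
DRY_RUN=1 join_article_pdfs 20111231_The_Daily_Yomiuri_01_joined.pdf \
    20111231_The_Daily_Yomiuri_01.pdf 20111231_The_Daily_Yomiuri_02.pdf
```

Without DRY_RUN=1, the same call runs pdftk and writes the joined file.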
Inserting the Newspaper Name and Issue Date Stamp
The shell script that inserts the stamp is shown below.
------------------------------------------------------
#!/bin/bash
#
# InsertCredit.sh
#
#
nPage=0
TmpFile="./tmp.jpg"
CiteFile="./cite.jpg"
for inputfile in ./2011*.jpg ; do
echo " "
echo "now converting : "${inputfile}
cp ${inputfile} ${TmpFile}
### Detect the image size from the image file (Yoko = width, Tate = height)
FigSize=`identify ${inputfile} | awk '{printf($3)}'`
FigSize=`echo ${FigSize} | sed -e "s/x/ /"`
YokoSize=`echo ${FigSize} | awk '{printf($1)}'`
TateSize=`echo ${FigSize} | awk '{printf($2)}'`
echo "Figure Size:"${FigSize}" Yoko Size:"${YokoSize}" Tate Size:"${TateSize}
### Extract the issue date from the file name
nYear=`echo ${inputfile} | cut -c 3-6`
nMonth=`echo ${inputfile} | cut -c 7-8`
nDay=`echo ${inputfile} | cut -c 9-10`
### Switch to the English locale
export LANG=en
DateStampE=`date --date ${nYear}/${nMonth}/${nDay} +"%A, %B %d, %Y"`
### Switch back to the Japanese locale, just in case
export LANG=ja_JP.UTF-8
DateStampJ=`date --date ${nYear}/${nMonth}/${nDay} +"%Y 年 %B %d 日 %A"`
### Detect the newspaper name from the file name
nPaper=`echo ${inputfile} | cut -c 11-19`
if [ ${nPaper} = "_nikkei_0" ]; then
# echo "日本経済新聞(朝刊)"
StampPaper="日本経済新聞(朝刊)"
DateStamp=`echo ${DateStampJ}`
elif [ ${nPaper} = "_nikkei_e" ]; then
# echo "日本経済新聞(夕刊)"
StampPaper="日本経済新聞(夕刊)"
DateStamp=`echo ${DateStampJ}`
elif [ ${nPaper} = "_The_Dail" ]; then
# echo "The Daily Yomiuri"
StampPaper="The Daily Yomiuri"
DateStamp=`echo ${DateStampE}`
elif [ ${nPaper} = "_The_Japa" ]; then
# echo "The Japan Times"
StampPaper="The Japan Times"
DateStamp=`echo ${DateStampE}`
elif [ ${nPaper} = "_Internat" ]; then
StampPaper="International_Herald_Tribune"
DateStamp=`echo ${DateStampE}`
elif [ ${nPaper} = "_Asahi_00" ]; then
StampPaper="朝日新聞"
DateStamp=`echo ${DateStampE}`
elif [ ${nPaper} = "_Asahi_e0" ]; then
StampPaper="朝日新聞(夕刊)"
DateStamp=`echo ${DateStampE}`
elif [ ${nPaper} = "_AsahiEle" ]; then
StampPaper="朝日小学生新聞"
DateStamp=`echo ${DateStampE}`
elif [ ${nPaper} = "_Yomiuri_" ]; then
StampPaper="読売新聞"
DateStamp=`echo ${DateStampE}`
elif [ ${nPaper} = "_The_Wall" ]; then
StampPaper="The Wall Street Journal"
DateStamp=`echo ${DateStampE}`
elif [ ${nPaper} = "_Taipei_T" ]; then
StampPaper="Taipei Times"
DateStamp=`echo ${DateStampE}`
elif [ ${nPaper} = "_The_Chin" ]; then
StampPaper="The China Post"
DateStamp=`echo ${DateStampE}`
elif [ ${nPaper} = "_kobe_000" ]; then
StampPaper="神戸新聞"
DateStamp=`echo ${DateStampE}`
elif [ ${nPaper} = "_Shizuoka" ]; then
StampPaper="静岡新聞"
DateStamp=`echo ${DateStampE}`
elif [ ${nPaper} = "_Shanghai" ]; then
StampPaper="Shanghai Daily"
DateStamp=`echo ${DateStampE}`
fi
### Build the string that is finally inserted
charAnnot=`echo " "${DateStamp} --- ${StampPaper} `
echo ${charAnnot} > ./DateStamp.txt
### Path of the font to use
### See, for example, the following URL
### http://www.imagemagick.org/Usage/fonts/
#FontPath="/usr/share/fonts/sazanami-fonts-gothic/sazanami-gothic.ttf"
FontPath="/usr/share/fonts/vlgothic/VL-PGothic-Regular.ttf"
TateSize=`echo ${TateSize}*0.05 | bc`
OutputSize=`echo ${YokoSize}"x"${TateSize}`
echo "-------------------> Yoko Size:"${YokoSize}" Tate Size:"${TateSize}
strSize=`echo ${TateSize}*0.25 | bc`
echo " pointsize:"${strSize}
convert -size ${OutputSize} -font ${FontPath} -pointsize ${strSize} \
-fill navy label:@DateStamp.txt ${CiteFile}
convert -append ${TmpFile} ${CiteFile} ${inputfile}
done
rm -f ${TmpFile}
rm -f ${CiteFile}
------------------------------------------------------
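The date and newspaper-name detection in InsertCredit.sh can also be written as small functions. The following is a sketch under assumptions, not the method used in the script above: the function names date_stamp_en and paper_name are hypothetical, GNU date is assumed, and a case statement with glob patterns replaces the fixed-column cut. Only two of the newspapers are listed to keep it short.

```shell
#!/bin/bash
# Sketch under assumptions: the helper names are hypothetical, and GNU date
# is assumed (as in InsertCredit.sh).

# Print the English date stamp for a file named YYYYMMDD_<paper>_NN.jpg.
date_stamp_en() {
    local name ymd
    name=$(basename "$1")
    ymd=${name:0:8}                       # leading YYYYMMDD
    LC_ALL=C date --date "${ymd:0:4}-${ymd:4:2}-${ymd:6:2}" +"%A, %B %d, %Y"
}

# Print the newspaper name by matching the same name prefixes that
# InsertCredit.sh tests, here with glob patterns instead of cut.
paper_name() {
    case "$(basename "$1")" in
        ????????_The_Dail*) echo "The Daily Yomiuri" ;;
        ????????_The_Japa*) echo "The Japan Times" ;;
        *)                  echo "unknown" ;;
    esac
}

f="./20111231_The_Daily_Yomiuri_01.jpg"
echo "$(date_stamp_en "${f}") --- $(paper_name "${f}")"
# prints: Saturday, December 31, 2011 --- The Daily Yomiuri
```

Because basename strips the directory part first, this variant does not depend on the file path starting with "./", which the cut column positions in the script do.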
Converting Images from JPEG to PDF
The script that creates PDF files from JPEG files is shown below.
------------------------------------------------------
#!/bin/bash
#
# Jpeg2PDF.sh
#
#
######################################################################
thresh_square=1100000
#thresh_aspect_ratio=0.7
#thresh_aspect_ratio=1.28
thresh_aspect_ratio_yoko=0.7
thresh_aspect_ratio_tate=1.28
######################################################################
### tmp.tex
TemplateFile="./tmp.tex"
echo "%%% TemplateFile.tex %%%" > ${TemplateFile}
echo "\documentclass[a4paper,10pt]{jsarticle}" >> ${TemplateFile}
echo "\usepackage{graphicx}" >> ${TemplateFile}
echo "\usepackage{float}" >> ${TemplateFile}
echo "\usepackage{ascmac}" >> ${TemplateFile}
echo "\pagestyle{empty}" >> ${TemplateFile}
echo "%%%%%% TEXT START %%%%%%" >> ${TemplateFile}
echo "\begin{document}" >> ${TemplateFile}
echo "" >> ${TemplateFile}
echo "\begin{figure}[t]" >> ${TemplateFile}
echo "\begin{center}" >> ${TemplateFile}
echo " \includegraphics[width=5.0cm,clip]{./input.eps}" >> ${TemplateFile}
echo "\end{center}" >> ${TemplateFile}
echo "\end{figure}" >> ${TemplateFile}
echo "" >> ${TemplateFile}
echo "\end{document}" >> ${TemplateFile}
######################################################################
FortranFile="./RotateCheck.f"
# Fixed-form Fortran: statements start in column 7, and '&' in column 6
# marks a continuation line, so the leading spaces below are significant.
echo "!!------ RotateCheck.f-------------------" > ${FortranFile}
echo "      program RotateCheck" >> ${FortranFile}
echo "      real*4 YokoSize, TateSize" >> ${FortranFile}
echo "!!---" >> ${FortranFile}
echo '      open(99,file="RotateCheck.ctl",status="old")' >> ${FortranFile}
echo "      read(99,*) !skip" >> ${FortranFile}
echo "      read(99,*) thresh_square" >> ${FortranFile}
echo "      read(99,*) thresh_aspect_ratio_yoko" >> ${FortranFile}
echo "      read(99,*) thresh_aspect_ratio_tate" >> ${FortranFile}
echo "      read(99,*) YokoSize" >> ${FortranFile}
echo "      read(99,*) TateSize" >> ${FortranFile}
echo "      close(99)" >> ${FortranFile}
echo "" >> ${FortranFile}
echo "      square=YokoSize*TateSize" >> ${FortranFile}
echo "      !! aspect_ratio=TateSize/float(YokoSize)" >> ${FortranFile}
echo "      aspect_ratio=TateSize/YokoSize" >> ${FortranFile}
echo "" >> ${FortranFile}
echo "      if(square < thresh_square)then" >> ${FortranFile}
echo "        if(aspect_ratio > thresh_aspect_ratio_yoko)then" >> ${FortranFile}
echo "          !angle=0.0" >> ${FortranFile}
echo "          nType=1" >> ${FortranFile}
echo "        else" >> ${FortranFile}
echo "          !angle=90.0" >> ${FortranFile}
echo "          nType=2" >> ${FortranFile}
echo "        end if" >> ${FortranFile}
echo "      else" >> ${FortranFile}
echo "        if(aspect_ratio > thresh_aspect_ratio_tate)then" >> ${FortranFile}
echo "          !angle=0.0" >> ${FortranFile}
echo "          nType=3" >> ${FortranFile}
echo "        else if(aspect_ratio > thresh_aspect_ratio_yoko" >> ${FortranFile}
echo "     &     .and. aspect_ratio < thresh_aspect_ratio_tate )then" >> ${FortranFile}
echo "          !angle=0.0" >> ${FortranFile}
echo "          nType=4" >> ${FortranFile}
echo "        else if(aspect_ratio < thresh_aspect_ratio_yoko )then" >> ${FortranFile}
echo "          !angle=0.0" >> ${FortranFile}
echo "          nType=5" >> ${FortranFile}
echo "        else" >> ${FortranFile}
echo "          !angle=90.0" >> ${FortranFile}
echo "          nType=5" >> ${FortranFile}
echo "        end if" >> ${FortranFile}
echo "      end if" >> ${FortranFile}
echo "" >> ${FortranFile}
echo '      write(*,"(i2)") nType' >> ${FortranFile}
echo "" >> ${FortranFile}
echo "      stop" >> ${FortranFile}
echo "      end program" >> ${FortranFile}
echo "" >> ${FortranFile}
rm -f ./RotateCheck
gfortran -O2 -Wall -o RotateCheck RotateCheck.f
######################################################################
nPage=0
for inputfile in ./*.jpg ; do
echo " "
echo "now converting : "${inputfile}
### Prepare the file names for each extension from the image file name
FileName=${inputfile%.jpg}
FileNameString=`basename ${FileName}`
FileEPS=${FileName}.eps
FileTEX=${FileName}.tex
FileDVI=${FileName}.dvi
FilePS=${FileName}.ps
FilePDF=${FileName}.pdf
### Check the file names
echo " "
echo "FileEPS : "${FileEPS}
echo "FileTEX : "${FileTEX}
echo "FileDVI : "${FileDVI}
echo "FilePS : "${FilePS}
### Detect the height and width from the image file (Tate = height, Yoko = width)
FigSize=`identify ${inputfile} | awk '{printf($3)}'`
FigSize=`echo ${FigSize} | sed -e "s/x/ /"`
YokoSize=`echo ${FigSize} | awk '{printf($1)}'`
TateSize=`echo ${FigSize} | awk '{printf($2)}'`
### Compute the aspect ratio and area from the image size
aspect_ratio=`echo ${TateSize}/${YokoSize} | bc -l `
aspect_ratio=`printf "%3.2f" ${aspect_ratio}`
square=`echo ${TateSize}*${YokoSize} | bc -l`
square=`printf "%d" ${square}`
echo " Yoko Size:"${YokoSize}" Tate Size:"${TateSize}
echo " aspect ratio:"${aspect_ratio}" square:"${square}
### Write the control file
echo "### ctrl file " > "./RotateCheck.ctl"
echo ${thresh_square} >> "./RotateCheck.ctl"
echo ${thresh_aspect_ratio_yoko} >> "./RotateCheck.ctl"
echo ${thresh_aspect_ratio_tate} >> "./RotateCheck.ctl"
echo ${YokoSize} >> "./RotateCheck.ctl"
echo ${TateSize} >> "./RotateCheck.ctl"
### Detect the layout (typesetting) type
nType=`./RotateCheck`
charType=`echo ${nType} | sed -e "s/ //g" `
echo "charType: "${charType}
### Replace the input file name and the size setting in the template TeX file (type 5 is also rotated)
if [ "${charType}" == "1" ]
then
convert ${inputfile} ${FileEPS}
sed -e 's/input/'${FileNameString}'/g' ${TemplateFile} | sed 's/width=5.0cm/height=12.0cm/g' > ${FileTEX}
elif [ "${charType}" == "2" ]
then
convert ${inputfile} ${FileEPS}
sed -e 's/input/'${FileNameString}'/g' ${TemplateFile} | sed 's/width=5.0cm/width=10.0cm/g' > ${FileTEX}
elif [ "${charType}" == "3" ]
then
convert ${inputfile} ${FileEPS}
sed -e 's/input/'${FileNameString}'/g' ${TemplateFile} | sed 's/width=5.0cm/height=20.0cm/g' > ${FileTEX}
elif [ "${charType}" == "4" ]
then
convert ${inputfile} ${FileEPS}
sed -e 's/input/'${FileNameString}'/g' ${TemplateFile} | sed 's/width=5.0cm/width=15.0cm/g' > ${FileTEX}
elif [ "${charType}" == "5" ]
then
convert -rotate -90 ${inputfile} ${FileEPS}
sed -e 's/input/'${FileNameString}'/g' ${TemplateFile} | sed 's/width=5.0cm/height=18.0cm/g' > ${FileTEX}
else
convert ${inputfile} ${FileEPS}
sed -e 's/input/'${FileNameString}'/g' ${TemplateFile} | sed 's/width=5.0cm/width=18.0cm/g' > ${FileTEX}
fi
platex ${FileTEX}
dvips ${FileDVI}
ps2pdf ${FilePS} ${FilePDF}
done
rm -f *.aux
rm -f *.tex
rm -f *.ps
rm -f *.log
rm -f *.dvi
rm -f *.eps
------------------------------------------------------
This script uses a small Fortran program, generated and compiled at run time, to choose the layout type from the image area and aspect ratio.
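The same layout decision could also be made without a compiler. The following is a sketch of an alternative, not the method used in Jpeg2PDF.sh: the function name classify_layout is hypothetical, awk is assumed to be available, and the thresholds are copied from the top of the script. awk computes in double precision, whereas RotateCheck.f uses single precision, so results may differ for sizes lying exactly on a threshold.

```shell
#!/bin/bash
# Hypothetical alternative to RotateCheck.f (a sketch, not the script's method):
# classify_layout WIDTH HEIGHT prints the layout type 1..5 using awk.
thresh_square=1100000
thresh_aspect_ratio_yoko=0.7
thresh_aspect_ratio_tate=1.28

classify_layout() {
    awk -v w="$1" -v h="$2" -v ts="${thresh_square}" \
        -v ty="${thresh_aspect_ratio_yoko}" -v tt="${thresh_aspect_ratio_tate}" '
    BEGIN {
        area  = w * h          # pixel area
        ratio = h / w          # height / width
        if (area < ts) {
            n = (ratio > ty) ? 1 : 2
        } else if (ratio > tt) {
            n = 3
        } else if (ratio > ty && ratio < tt) {
            n = 4
        } else {
            n = 5              # large, wide image; Jpeg2PDF.sh rotates this type
        }
        print n
    }'
}

classify_layout 800 1000    # small image, not wide: prints 1
classify_layout 3000 1000   # large, wide image: prints 5
```

In Jpeg2PDF.sh, the output of such a function could replace the value read from ./RotateCheck, which would also remove the need for the control file.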
OCR Process Using Adobe Acrobat
Unfortunately, there is currently no decent PDF OCR converter among free software. Up to this point, I use only free software, including the operating system. For OCR, I use Adobe Acrobat Standard version 8. This software converts an image PDF into a PDF with recognized text. However, the Standard edition can process only a single file at a time. The Adobe Acrobat Extended edition offers a batch mode for processing multiple files. The procedure is described below.
Figure 13: Preparing PDF files in the working directory.
Figure 14: Starting up Adobe Acrobat Extended.
Figure 15: Starting up the OCR function.
Figure 16: Selecting target PDF files. Files are selected depending on the article language, English or Japanese.
Figure 17: Selecting the output destination.
Figure 18: Selecting target languages.
Figure 19: Processing OCR text recognition.
----------------------------------
Free Text Retrieval SYSTEM for Newspaper Article Searching
........... description is currently underway .....