Formatting multiple web pages into a clean PDF
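The SAFE Stack documentation (link) is quite thorough. If I remember correctly, information was temporarily sparse right after the version moved to V3, but it now feels like there is more of it than before.
Rather than jumping straight into trial and error on things I don't understand, consulting the documentation would probably resolve a lot of them. And before reading the pages I care about closely, I wanted to skim through all the documentation on the site. And on top of that, if I'm going to skim, I want to flip through a PDF or a printout and write notes on it as I go.
So I made it possible to create a PDF from multiple URIs. I took a detour along the way, but in the end wkhtmltopdf (link) turned out to be very well made and did what I wanted. If you pass multiple URIs to wkhtmltopdf, it combines them into a single PDF. It can also generate a table of contents with links, and you can specify headers and footers for each page. Until now I had thought of wkhtmltopdf as little more than a way to automate the manual step of printing to PDF from a web browser, but it is more capable and flexible than that.
wkhtmltopdf provides the important functionality. Its usage gets a bit cumbersome, though, so I wrote a wrapper script.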
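To make the shape of such a command line concrete, here is a minimal sketch of how a wrapper might assemble it: global options first, then a `toc` object, then one `page` object per URI, and the output file last. The helper name `build_wkhtmltopdf_args` and the example.com URIs are hypothetical and are not part of uri2pdf.rb; the sketch only builds and prints the command and does not run wkhtmltopdf.

```ruby
require "shellwords"

# Hypothetical helper (not part of uri2pdf.rb): builds the argument vector
# in the order wkhtmltopdf expects: global options, objects, output file.
def build_wkhtmltopdf_args(uris, outfile, page_size: nil, toc: true)
  args = ["wkhtmltopdf"]
  args += ["--page-size", page_size] if page_size  # global option
  args << "toc" if toc                             # table-of-contents object
  uris.each { |uri| args += ["page", uri] }        # one page object per URI
  args << outfile                                  # output file comes last
  args
end

args = build_wkhtmltopdf_args(
  ["https://example.com/a", "https://example.com/b?x=1&y=2"],  # assumed URIs
  "out.pdf",
  page_size: "A5"
)

# Shellwords.join escapes each argument so that characters such as "&" and "?"
# in a URI are not interpreted by the shell.
cmd = Shellwords.join(args)
puts cmd
```

Keeping the arguments as an array until the last moment means each URI is escaped exactly once, which matters for URIs that contain query strings.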
uri2pdf.rb (a wrapper for wkhtmltopdf)
Source
#!/usr/bin/env ruby
CMD_NAME = File.basename $0
# DEBUG = true
DEBUG = false
def debug (opt_hash = {})
  return (DEBUG || opt_hash[:debug])
end
def abort_after_help (msg)
    puts msg
    puts
    puts @help_msg
    abort
end
require "pp"
require "optparse"
def main ()
  pp [:argv_before_parse, ARGV]  if debug()
  opt_hash = parse_options(ARGV)
  pp [:opt_hash, opt_hash]       if debug(opt_hash)
  pp [:argv_after_parse,  ARGV]  if debug(opt_hash)
  # check commands
  puts "ToDo: check: wkhtmltopdf commands"  if debug(opt_hash)
  # check output file
  outfile = opt_hash[:outfile]
  if outfile.nil? then
    abort_after_help "#{CMD_NAME}: outfile is not specified"
  end
  unless FileTest.writable? File.dirname(outfile) then
    abort_after_help "#{CMD_NAME}: #{outfile}: cannot create file"
  end
  # check input and get its content
  infile = opt_hash[:infile]
  uris =
    if infile.nil? then
      ARGV.dup
    else
      File.readlines(infile).map { |x| x.strip }
    end
  uris.reject! { |x| x.empty? }
  if uris.empty? then
    abort_after_help "#{CMD_NAME}: URI is not specified"
  end
  # pp [:uris,  uris]  if debug(opt_hash)
  # create PDF files in a temporary directory
  global_opts  = opt_hash[:wkhtmltopdf_global_opts]
  page_opts    = opt_hash[:wkhtmltopdf_page_opts]
  hdr_ftr_opts = opt_hash[:wkhtmltopdf_hdr_ftr_opts]
  toc_opts     = opt_hash[:wkhtmltopdf_toc_opts]
  require 'tmpdir'
  require 'fileutils'  # FileUtils.remove_entry is used in the ensure clause
  tmp_dir = nil
  begin
    tmp_dir = Dir.mktmpdir("#{CMD_NAME}_")
    # build command image
    cmd_img = "wkhtmltopdf"
    cache_dir_path = opt_hash[:cache_dir]
    if cache_dir_path.nil? || cache_dir_path.empty? then
      cache_dir_path = "#{tmp_dir}/cache"
    end
    Dir.mkdir(cache_dir_path)  unless FileTest.directory? cache_dir_path
    # escape the path so that spaces or shell metacharacters do not break the command
    cmd_img += " --cache-dir #{shesc cache_dir_path}"
    cmd_img += " --load-error-handling abort"
    paper_size = opt_hash[:paper_size]
    if paper_size then
      cmd_img += " --page-size #{paper_size}"
    end
    orientation = opt_hash[:orientation]
    if orientation then
      cmd_img += " --orientation #{orientation}"
    end
    style = opt_hash[:style]
    if style then
      style_path = "#{tmp_dir}/style.css"
      File.open(style_path, 'w') { |io| io.puts style }
      cmd_img += " --user-style-sheet #{shesc style_path}"
    end
    cmd_img += " #{global_opts} #{page_opts} #{hdr_ftr_opts}"
    cmd_img += " toc #{toc_opts}"
    uris.each do |uri|
      escaped_uri = shesc uri
      cmd_img += " page #{escaped_uri}"
    end
    # escape the output path as well; it comes straight from the command line
    cmd_img += " #{shesc outfile}"
    # rerun the page fetch until the cache stabilizes
    max_try_count = 5
    cur_try_count = 0
    latest_cache = latest_file(cache_dir_path)
    loop do
      cur_try_count += 1
      command_run(opt_hash, cmd_img)
      prev_latest_cache = latest_cache
      pp [:prev_latest_cache, prev_latest_cache]  if debug(opt_hash)
      latest_cache = latest_file(cache_dir_path)
      pp [:latest_cache, latest_cache]  if debug(opt_hash)
      if latest_cache[:time] == prev_latest_cache[:time] then
        break
      else
        if cur_try_count < max_try_count then
          puts "TRY NEXT (#{cur_try_count})"  if debug(opt_hash)
        else
          # report the failure even when debug mode is off
          abort "#{CMD_NAME}: cache did not stabilize after #{cur_try_count} tries"
        end
      end
    end
  ensure
    if tmp_dir then
      if opt_hash[:keep_tmpdir] then
        puts "keep temporary directory: #{tmp_dir}"
      else
        FileUtils.remove_entry(tmp_dir, true)
      end
    end
  end
end
class MyRuntimeError < RuntimeError
  def initialize (arg = nil)
    super
    @arg = arg
  end
  def name ()
    return "runtime error"
  end
  def desc ()
    return name + (@arg ? ": #{@arg}" : "")
  end
end
class InvalidNumberFormat < MyRuntimeError
  def name ()
    return "invalid number format"
  end
end
class InvalidCommandPath < MyRuntimeError
  def name ()
    return "invalid command path"
  end
end
def parse_options (argv)
  opt = OptionParser.new
  opt.summary_indent = " " * 2
  opt.summary_width = 36
  opt.banner = [
    "Usage:",
    "    #{File.basename($0)} -o FILE URI...",
    "    #{File.basename($0)} -o FILE -i FILE",
  ].join("\n")
  opt.separator ""
  opt.separator "Options:"
  infile_default = nil
  infile = infile_default
  infile_desc =  ["input file"]
  opt.on("-i", "--infile FILE", *infile_desc) do |x|
    infile = x
  end
  outfile_default = nil
  outfile = outfile_default
  outfile_desc =  ["output file"]
  opt.on("-o", "--outfile FILE", *outfile_desc) do |x|
    outfile = x
  end
  cache_dir_default = nil
  cache_dir = cache_dir_default
  cache_dir_desc = ["cache directory"]
  opt.on("-c", "--cache-dir DIR", *cache_dir_desc) do |x|
    cache_dir = x
  end
  keep_tmpdir_default = nil
  keep_tmpdir = keep_tmpdir_default
  keep_tmpdir_desc = ["keep temporary directory"]
  opt.on("-k", "--keep-tmpdir", *keep_tmpdir_desc) do
    keep_tmpdir = true
  end
  paper_size_default = nil
  paper_size = paper_size_default
  paper_size_desc = ["paper size: A4, B5, Letter, etc"]
  opt.on("--paper-size SIZE", *paper_size_desc) do |x|
    paper_size = x
  end
  orientation_default = nil
  orientation = orientation_default
  landscape_desc = ["landscape orientation"]
  opt.on("--landscape", *landscape_desc) do
    orientation = "landscape"
  end
  portrait_desc = ["portrait orientation"]
  opt.on("--portrait", *portrait_desc) do
    orientation = "portrait"
  end
  additional_style_default = nil
  additional_style = additional_style_default
  additional_style_desc = ["additional user style"]
  opt.on("-s", "--style STYLE", *additional_style_desc) do |x|
    additional_style = x
  end
  wkhtmltopdf_global_opts_default = ""
  wkhtmltopdf_global_opts = wkhtmltopdf_global_opts_default
  wkhtmltopdf_global_opts_desc =  ["wkhtmltopdf global options"]
  opt.on("-g", "--wkhtmltopdf-global-opts OPTS", *wkhtmltopdf_global_opts_desc) do |x|
    wkhtmltopdf_global_opts = x
  end
  wkhtmltopdf_page_opts_default = "--default-header"
  wkhtmltopdf_page_opts = wkhtmltopdf_page_opts_default
  wkhtmltopdf_page_opts_desc =  ["wkhtmltopdf page options (default: #{wkhtmltopdf_page_opts_default})"]
  opt.on("-p", "--wkhtmltopdf-page-opts OPTS", *wkhtmltopdf_page_opts_desc) do |x|
    wkhtmltopdf_page_opts = x
  end
  wkhtmltopdf_hdr_ftr_opts_default = ""
  wkhtmltopdf_hdr_ftr_opts = wkhtmltopdf_hdr_ftr_opts_default
  wkhtmltopdf_hdr_ftr_opts_desc =  ["wkhtmltopdf header and footer options"]
  opt.on("-r", "--wkhtmltopdf-hdr-ftr-opts OPTS", *wkhtmltopdf_hdr_ftr_opts_desc) do |x|
    wkhtmltopdf_hdr_ftr_opts = x
  end
  wkhtmltopdf_toc_opts_default = "--disable-dotted-lines"
  wkhtmltopdf_toc_opts = wkhtmltopdf_toc_opts_default
  wkhtmltopdf_toc_opts_desc =  ["wkhtmltopdf toc options"]
  opt.on("-t", "--wkhtmltopdf-toc-opts OPTS", *wkhtmltopdf_toc_opts_desc) do |x|
    wkhtmltopdf_toc_opts = x
  end
  debug_default = false
  debug = debug_default
  debug_desc = "debug mode (default: #{debug_default})"
  opt.on("-d", "--debug", debug_desc) do |v|
    debug = v
  end
  opt.separator ""
  @help_msg = opt.help
  begin
    opt.parse!(argv)
  rescue MyRuntimeError => evar
    puts "Error: #{evar.desc}"
    puts
    puts @help_msg
    exit 1
  # InvalidOption is a subclass of ParseError, so it must be rescued first
  rescue OptionParser::InvalidOption => evar
    puts "Error: invalid option: #{evar.args.join(' ')}"
    puts
    puts @help_msg
    exit 1
  rescue OptionParser::ParseError => evar
    puts "Error: #{evar.message}"
    puts
    puts @help_msg
    exit 1
  rescue => evar
    puts "Error: unexpected (#{evar.inspect})"
    abort
  end
  opt_hash = {
    :infile => infile,
    :outfile => outfile,
    :cache_dir => cache_dir,
    :keep_tmpdir => keep_tmpdir,
    :paper_size => paper_size,
    :orientation => orientation,
    :style => additional_style,
    :wkhtmltopdf_global_opts => wkhtmltopdf_global_opts,
    :wkhtmltopdf_page_opts => wkhtmltopdf_page_opts,
    :wkhtmltopdf_hdr_ftr_opts => wkhtmltopdf_hdr_ftr_opts,
    :wkhtmltopdf_toc_opts => wkhtmltopdf_toc_opts,
    :debug => debug,
  }
  return opt_hash
end
require 'find'
def latest_file (dir_path)
  latest_file_info = {:name => nil, :time => Time.utc(1970,1,1,0,0,0)}
  Find.find(dir_path) do |file|
    next  unless FileTest.file? file
    tm = File.mtime(file)
    if tm > latest_file_info[:time] then
      latest_file_info = {:name => file, :time => tm}
    end
  end
  return latest_file_info
end
require 'shellwords'
def shesc (s, allow_nil: false)
  return nil  if allow_nil && s.nil?
  raise "Unexpected class: #{s.class}"  unless s.is_a? String
  return (if s.empty? then s else Shellwords.shellescape s end)
end
def command_run (opt_hash, *cmd_img)
  (result, status) = command_status_and_output_of(opt_hash, *cmd_img)
  check_exitstatus(opt_hash, cmd_img, status)
end
def command_output_of (opt_hash, *cmd_img)
  (result, status) = command_status_and_output_of(opt_hash, *cmd_img)
  check_exitstatus(opt_hash, cmd_img, status)
  return result
end
def command_status_of (opt_hash, *cmd_img)
  (result, status) = command_status_and_output_of(opt_hash, *cmd_img)
  return status
end
def command_status_and_output_of (opt_hash, *cmd_img)
  # pp [:cmd_img, cmd_img]  if debug(opt_hash)
  result = `#{cmd_img.join(' ')}`
  return [result, $?.exitstatus]
end
def check_exitstatus (opt_hash, cmd_img, exitstatus)
  unless exitstatus == 0 then
    cmd_name = File.basename $0
    abort "#{cmd_name}: command #{cmd_img.inspect} failed (#{exitstatus})"
  end
end
main
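The retry loop in `main` reruns the command until the newest file in the cache directory stops changing. The same idea can be sketched in isolation as follows; `newest_mtime` and `run_until_stable` are hypothetical names introduced here, and a block that writes files with fixed timestamps stands in for the real wkhtmltopdf run. Unlike the script, which aborts when the limit is reached, this sketch returns `nil`.

```ruby
require "find"
require "tmpdir"

# Newest modification time among regular files under dir (nil if there are none).
def newest_mtime(dir)
  Find.find(dir).select { |path| File.file?(path) }.map { |path| File.mtime(path) }.max
end

# Hypothetical helper: yields repeatedly until one run leaves the newest mtime
# unchanged. Returns the number of runs, or nil if max_tries was not enough.
def run_until_stable(dir, max_tries: 5)
  before = newest_mtime(dir)
  1.upto(max_tries) do |attempt|
    yield attempt
    after = newest_mtime(dir)
    return attempt if after == before
    before = after
  end
  nil
end

# Stand-in for the real command: the first two runs add a cache entry with an
# explicit timestamp, the third run changes nothing.
tries = Dir.mktmpdir("stable_demo_") do |dir|
  run_until_stable(dir) do |attempt|
    next if attempt > 2
    path = File.join(dir, "entry#{attempt}")
    File.write(path, "x")
    stamp = Time.at(1_000 * attempt)
    File.utime(stamp, stamp, path)
  end
end
puts "stable after #{tries} runs"
```

The demo sets timestamps explicitly because two files written in quick succession could otherwise share the same mtime on a filesystem with coarse time resolution.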
Help
uri2pdf: outfile is not specified
Usage:
    uri2pdf -o FILE URI...
    uri2pdf -o FILE -i FILE
Options:
  -i, --infile FILE                    input file
  -o, --outfile FILE                   output file
  -c, --cache-dir DIR                  cache directory
  -k, --keep-tmpdir                    keep temporary directory
      --paper-size SIZE                paper size: A4, B5, Letter, etc
      --landscape                      landscape orientation
      --portrait                       portrait orientation
  -s, --style STYLE                    additional user style
  -g, --wkhtmltopdf-global-opts OPTS   wkhtmltopdf global options
  -p, --wkhtmltopdf-page-opts OPTS     wkhtmltopdf page options (default: --default-header)
  -r, --wkhtmltopdf-hdr-ftr-opts OPTS  wkhtmltopdf header and footer options
  -t, --wkhtmltopdf-toc-opts OPTS      wkhtmltopdf toc options
  -d, --debug                          debug mode (default: false)
Usage examples
Example 1: create a PDF for a 7-inch tablet
uri2pdf \
  --paper-size A6 \
  --landscape \
  -p "--header-font-size 9 \
      --header-left '[title]' \
      --header-right '[page]/[toPage]' \
      --header-line \
      --footer-line \
      --footer-font-size 9 \
      --footer-left '[webpage]' \
      --margin-top 0.9cm \
      --margin-bottom 0.9cm \
      --margin-left 0.5cm \
      --margin-right 0.5cm" \
  -s '* {line-height: 200%;}' \
  -c ./cache \
  -i __URI_LIST_FILE__ \
  -o __OUTPUT_PDF_FILE__
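Example 2: create a PDF for printing on A4
The command sets the paper size to A5, but with my eyesight, printing the result on A4 comes out just right. For people with good eyes, the text will probably be too large.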
uri2pdf \
  --paper-size A5 \
  --portrait \
  -p "--header-font-size 9 \
      --header-left '[title]' \
      --header-right '[page]/[toPage]' \
      --header-line \
      --footer-line \
      --footer-font-size 9 \
      --footer-left '[webpage]' \
      --margin-top 0.9cm \
      --margin-bottom 0.9cm \
      --margin-left 1.2cm \
      --margin-right 1.2cm" \
  -s '* {line-height: 200%;}' \
  -c ./cache \
  -i __URI_LIST_FILE__ \
  -o __OUTPUT_PDF_FILE__
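Both examples read their URIs from the file passed to `-i`. Judging from the script, that file holds one URI per line, with surrounding whitespace stripped and blank lines ignored. The sketch below mirrors that parsing on a throwaway file; `read_uri_list` is a hypothetical name and the example.com URIs are placeholders.

```ruby
require "tempfile"

# Hypothetical helper mirroring how the script reads the -i file:
# strip each line and drop the ones that end up empty.
def read_uri_list(path)
  File.readlines(path).map(&:strip).reject(&:empty?)
end

# Write a small list file (placeholder URIs) and read it back.
uris = Tempfile.create("uri_list") do |f|
  f.write("https://example.com/docs/intro\n\n  https://example.com/docs/setup  \n")
  f.flush
  read_uri_list(f.path)
end
p uris
```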