converting file formats
converting HTML to Markdown
Note
- foo.html represents an input HTML file.
- bar.md represents an output text file formatted with Pandoc Markdown.
Use pandoc -f html-native_divs-native_spans -t markdown-escaped_line_breaks-fenced_divs-header_attributes-fenced_code_attributes-inline_code_attributes-bracketed_spans-smart-grid_tables-multiline_tables-simple_tables --atx-headers --wrap=none "foo.html" -o "bar.md" to convert an HTML file to a Pandoc Markdown-formatted text file.
explanation
Note
This is an incomplete explanation.
- The --atx-headers option produces ATX-style headings for all heading levels, overriding the default behavior of producing Setext-style Markdown headings for levels 1 and 2. In pandoc 2.11.2 and later, ATX-style headings are the default and this option is deprecated in favor of --markdown-headings=atx.
- The --wrap=none option disables text wrapping.
- Pandoc Markdown output
  - -bracketed_spans disables the Pandoc Markdown extension for bracketed spans.
  - -escaped_line_breaks disables the Pandoc Markdown extension for backslash-escaped line breaks.
  - -fenced_code_attributes disables the Pandoc Markdown extension for assigning attribute lists to fenced code blocks.
  - -header_attributes disables the Pandoc Markdown extension for assigning attribute lists to headings.
  - -inline_code_attributes disables the Pandoc Markdown extension for assigning attribute lists to inline code.
  - -smart disables the extension for interpreting ASCII characters as curly quotes, em-dashes, en-dashes, and ellipses, and for inserting nonbreaking spaces. Backslash-escaped double-quotes (\") are also no longer produced.
  - -grid_tables-multiline_tables-simple_tables disables the Pandoc Markdown extensions for grid tables, multiline tables, and simple tables, leaving only pipe tables.
- HTML input
  - -native_divs disables the raw HTML extension for preserving native <div> HTML elements.
  - -native_spans disables the raw HTML extension for preserving some native <span> HTML elements. Some <span> HTML elements are still preserved as bracketed spans (see the admonition below).
Attention
Even when -native_spans is used for HTML input, some <span> HTML elements are preserved as bracketed spans, unless -bracketed_spans is used for Markdown output, in which case the <span> HTML elements are preserved without being converted to Markdown format.
Attention
Using -link_attributes to disable the Pandoc Markdown extension for assigning attribute lists to hyperlinks prevents hyperlinks with attributes from being converted to Markdown-style links.
Attention
Even when +fenced_code_blocks is used for Pandoc Markdown output, indented code blocks are still produced instead of fenced code blocks.
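The long -f and -t arguments are easier to audit when they are assembled one extension at a time. The following is a minimal sketch of that idea, not a replacement for the command above; the variable names and the html_to_md wrapper are hypothetical, and the pandoc options are the ones already listed in this section.

```shell
#!/bin/sh
# Build the reader format string: "html" minus each listed extension.
from="html"
for ext in native_divs native_spans; do
    from="${from}-${ext}"
done

# Build the writer format string: "markdown" minus each listed extension.
to="markdown"
for ext in escaped_line_breaks fenced_divs header_attributes \
           fenced_code_attributes inline_code_attributes bracketed_spans \
           smart grid_tables multiline_tables simple_tables; do
    to="${to}-${ext}"
done

# html_to_md is a hypothetical wrapper: html_to_md INPUT.html OUTPUT.md
# Newer pandoc releases use --markdown-headings=atx instead of --atx-headers.
html_to_md() {
    pandoc -f "$from" -t "$to" --atx-headers --wrap=none "$1" -o "$2"
}

# Show the assembled strings so they can be compared with the one-line command.
printf '%s\n%s\n' "$from" "$to"
```

With the function defined, html_to_md "foo.html" "bar.md" should run the same conversion as the one-line command, assuming pandoc is installed.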
prior work
The method of producing ATX-style headings for all heading levels was introduced to me by an answer on Stack Overflow by shawnhcorey.
converting Markdown to DOCX
Note
- foo.md represents an input text file formatted with Pandoc Markdown.
- bar.docx represents an output Word DOCX file.
- baz.docx represents a style reference for the output Word DOCX file.
Use pandoc -f markdown -t docx "foo.md" --reference-doc="baz.docx" --lua-filter=~/lua/pagebreak.lua -o "bar.docx" to convert a Pandoc Markdown-formatted text file to a Word DOCX file, using baz.docx as a style reference and pagebreak.lua to produce page breaks.
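The command above needs a style reference and an input that actually requests a page break. The sketch below shows one way to prepare both. It assumes that ~/lua/pagebreak.lua is the pagebreak filter from the pandoc/lua-filters collection, which treats a paragraph containing only \newpage as a page break; that is an assumption, not something stated here. The file name pagebreak-demo.md is hypothetical.

```shell
#!/bin/sh
# Write a small sample input. The \newpage paragraph is what the pagebreak
# filter from pandoc/lua-filters looks for (assumed to be the filter in use).
cat > pagebreak-demo.md <<'EOF'
# First page

Some text.

\newpage

# Second page

More text.
EOF

if command -v pandoc >/dev/null 2>&1; then
    # Export pandoc's built-in reference document as a starting point for
    # baz.docx, unless a style reference already exists.
    [ -e baz.docx ] || pandoc -o baz.docx --print-default-data-file reference.docx

    # Convert only if the Lua filter is where the command above expects it.
    if [ -f "$HOME/lua/pagebreak.lua" ]; then
        pandoc -f markdown -t docx pagebreak-demo.md \
            --reference-doc=baz.docx \
            --lua-filter="$HOME/lua/pagebreak.lua" \
            -o pagebreak-demo.docx
    fi
fi
```

The exported baz.docx can then be opened in a word processor and its styles edited before it is reused as the style reference.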
converting Markdown to HTML
Note
- foo.md represents an input text file formatted with Pandoc Markdown.
- bar.html represents an output HTML file.
Use pandoc -f markdown -t html "foo.md" -o "bar.html" to convert a Pandoc Markdown-formatted text file to an HTML file.
converting Markdown to plain text
Note
- foo.md represents an input text file formatted with Pandoc Markdown.
- bar.txt represents an output plain text file.
Use pandoc -f markdown -t plain --wrap=none "foo.md" | sed -e 's/—/-/g' -e "s/’/'/g" -e 's/\xC2\xA0/ /g' - | cat -s - | sponge "bar.txt" to convert a Pandoc Markdown-formatted text file to a plain text file.
explanation
Note
This is an incomplete explanation.
- The cat -s command produces a single blank line in place of multiple adjacent blank lines.
- The sed -e 's/—/-/g' -e "s/’/'/g" -e 's/\xC2\xA0/ /g' command does the following:
  - 's/—/-/g' replaces any em dashes (—) with hyphen-minuses (-).
  - "s/’/'/g" replaces any right single quotation marks (’) with apostrophes (').
  - 's/\xC2\xA0/ /g' replaces any non-breaking spaces (U+00A0) with ordinary spaces by matching the bytes 0xC2 0xA0, which are the UTF-8 encoding of U+00A0.
- The --wrap=none option disables text wrapping.
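The clean-up stage of the pipeline can be tried on its own, without pandoc. The sketch below feeds it a short sample containing an em dash, a right single quotation mark, a non-breaking space (U+00A0, bytes 0xC2 0xA0), and a run of blank lines. The function name clean_text is hypothetical. The non-breaking space is built with printf rather than the \xC2\xA0 escape, on the assumption that this is more portable across sed implementations.

```shell
#!/bin/sh
# clean_text is a hypothetical helper that mirrors the sed and cat -s stages
# of the pipeline above, reading standard input and writing standard output.
clean_text() {
    nbsp=$(printf '\302\240')   # UTF-8 bytes of U+00A0 (no-break space)
    sed -e 's/—/-/g' -e "s/’/'/g" -e "s/${nbsp}/ /g" | cat -s
}

# Sample input: em dash, right single quote, three blank lines, no-break space.
printf 'wait—don’t\n\n\n\nfoo\302\240bar\n' | clean_text
```

The sample should come out as "wait-don't", one blank line, and "foo bar".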
converting PDF to text
Note
- foo.pdf represents an input PDF file.
- bar.txt represents an output text file.
Use pdftotext -layout -nopgbrk foo.pdf bar.txt to convert a PDF file to a text file.
explanation
- The -layout option attempts to preserve the layout of the PDF when converting.
- The -nopgbrk option disables the insertion of form feed characters to indicate page breaks.
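One way to see what -nopgbrk changes is to count the form feed characters in the output with and without it. The following is a rough sketch; count_form_feeds and the file name with_breaks.txt are hypothetical, and the conversions run only if pdftotext and foo.pdf are present.

```shell
#!/bin/sh
# count_form_feeds is a hypothetical helper: it prints the number of form
# feed characters (U+000C) in the file given as its first argument.
count_form_feeds() {
    tr -cd '\f' < "$1" | wc -c | tr -d ' '
}

# Convert twice, with and without -nopgbrk, and compare the counts.
if command -v pdftotext >/dev/null 2>&1 && [ -f foo.pdf ]; then
    pdftotext -layout foo.pdf with_breaks.txt
    pdftotext -layout -nopgbrk foo.pdf bar.txt
    echo "form feeds without -nopgbrk: $(count_form_feeds with_breaks.txt)"
    echo "form feeds with -nopgbrk:    $(count_form_feeds bar.txt)"
fi
```

The second count is expected to be zero, since -nopgbrk is the option that suppresses the page-break characters.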
converting plain text to synthesized-speech-FLAC
converting plain text to synthesized-speech-FLAC using eSpeak NG
Note
- foo.txt represents an input plain text file.
- bar.flac represents an output FLAC file.
Use espeak-ng -f foo.txt --stdout | sox --no-clobber - bar.flac to convert a plain text file to a synthesized-speech-FLAC file.
explanation
Note
This is an incomplete explanation.
- The --no-clobber option prevents SoX from producing a FLAC output file if a file with the same name already exists.
prior work
The -f and --stdout options were introduced to me by an answer on Stack Overflow by user76204.
converting plain text to synthesized-speech-FLAC using the Festival Speech Synthesis System
Note
- foo.txt represents an input plain text file.
- bar.flac represents an output FLAC file.
Use text2wave -otype aiff foo.txt | sox --no-clobber - bar.flac to convert a plain text file to a synthesized-speech-FLAC file.
Attention
text2wave does not seem to handle contractions correctly, reading out each individual character if an apostrophe is encountered in the middle of a word. A workaround is to omit apostrophes (') from the plain text input file, eliminating any contractions that rely on apostrophes.
explanation
Note
This is an incomplete explanation.
- The -otype aiff option produces synthesized speech in AIFF format.
- The --no-clobber option prevents SoX from producing a FLAC output file if a file with the same name already exists.
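The apostrophe workaround described above can be applied as a preprocessing step instead of editing the input file by hand. This is a minimal sketch under that assumption; strip_apostrophes is a hypothetical helper, and it removes only the ASCII apostrophe ('), not the typographic right single quotation mark (’).

```shell
#!/bin/sh
# strip_apostrophes is a hypothetical helper: it deletes every ASCII
# apostrophe (') from standard input, so "don't" becomes "dont".
strip_apostrophes() {
    tr -d "'"
}

# Run the synthesis only if the tools and the sample input are available.
if command -v text2wave >/dev/null 2>&1 &&
   command -v sox >/dev/null 2>&1 &&
   [ -f foo.txt ]; then
    cleaned=$(mktemp)
    strip_apostrophes < foo.txt > "$cleaned"
    text2wave -otype aiff "$cleaned" | sox --no-clobber - bar.flac
    rm -f "$cleaned"
fi
```

Writing the cleaned text to a temporary file leaves foo.txt untouched, so the original contractions are still available for other conversions.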
converting plain text to synthesized-speech-OGG using the Festival Speech Synthesis System
Use text2wave -otype aiff foo.txt | sox --no-clobber - -C -1 bar.ogg to convert a plain text file to a synthesized-speech-Vorbis-Ogg file.
Attention
text2wave does not seem to handle contractions correctly, reading out each individual character if an apostrophe is encountered in the middle of a word. A workaround is to omit apostrophes (') from the plain text input file, eliminating any contractions that rely on apostrophes.
explanation
Note
This is an incomplete explanation.
- The -otype aiff option produces synthesized speech in AIFF format.
- The --no-clobber option prevents SoX from producing a Vorbis-Ogg output file if a file with the same name already exists.
licensing
No rights reserved: CC0 1.0.