Transition from MediaWiki Raw

## Resources

Used `python-markdown` as the converter.

* [Nice extension bundle](http://facelessuser.github.io/pymdown-extensions/).
    Used all of it.
* [List of 3^rd^ Party Extensions for `python-markdown`](https://github.com/waylan/Python-Markdown/wiki/Third-Party-Extensions)
* [Base16](https://chriskempson.github.io/base16/#default) [Pygments CSS](https://github.com/idleberg/base16-pygments) for code-highlighting
* [A discussion](http://lepture.com/en/2014/markdown-parsers-in-python) of
    popular Python Markdown parsers.
* [Basscss](http://www.basscss.com/docs/base-reset/) looks interesting

## Exporting MediaWiki Content

Run [dumpBackup.php](https://git.wikimedia.org/blob/mediawiki%2Fcore.git/HEAD/maintenance%2FdumpBackup.php)

* from the `maintenance` folder
* with the `--full` flag

to get all pages and all of their revisions. The result is an XML dump, which then has to be parsed.
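The export can also be scripted from Python. This is only a rough sketch: `build_dump_command`, `export_wiki`, and the paths in the final comment are made up for illustration, and it assumes `dumpBackup.php` prints the XML to standard output, which is why the output is redirected into a file.

```python
import os
import subprocess


def build_dump_command(mediawiki_root):
    """Return the argv list for a full-history dump (all pages, all revisions)."""
    script = os.path.join(mediawiki_root, 'maintenance', 'dumpBackup.php')
    return ['php', script, '--full']


def export_wiki(mediawiki_root, xml_path):
    """Run the dump and save whatever the script prints into xml_path."""
    with open(xml_path, 'wb') as xml_file:
        subprocess.check_call(build_dump_command(mediawiki_root), stdout=xml_file)


# Hypothetical paths, for illustration only:
# export_wiki('/var/www/mediawiki', '/tmp/wiki.xml')
```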

The scripts below use Python 2.7 because `gittle` has a [`urlparse`-related bug on Python 3](https://github.com/FriendCode/gittle/issues/49).
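For background (general Python knowledge, not a summary of the linked issue): Python 3 moved the `urlparse` module to `urllib.parse`, so a bare `import urlparse` fails there. A guarded import like the one below is a common way to support both versions. The URL is made up for illustration.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7

# Hypothetical remote URL, only to show the call works on either version
parts = urlparse('https://example.com/user/wiki.git')
print(parts.netloc)
print(parts.path)
```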

## Parsing XML and Making a `git` repo

```python
# -*- coding: utf-8 -*-

from __future__ import print_function
import shlex
import sys
import subprocess
from multiprocessing import Pool, cpu_count

import sh
from gittle import Gittle
from bs4 import BeautifulSoup

# Force UTF-8 as the default string encoding (Python 2 only)
reload(sys)
sys.setdefaultencoding('UTF8')

output_folder = '/Users/nikhil/Desktop/wiki/articles'
tmp_folder = '/Users/nikhil/Desktop/wiki/tmp'
path_to_xml_dump = '/Users/nikhil/Desktop/wiki/wiki.xml'
committer_name = 'Nikhil Anand'
committer_email = 'mail@nikhil.io'

# Clean all folders
sh.rm('-rf', output_folder, tmp_folder)
print('Removed folders')

# Create required folders
sh.mkdir(output_folder, tmp_folder)
print('Created folders')

# Initialize git repo
repo = Gittle.init(output_folder)
print('Initialized git repo')
repo.commit(name=committer_name, email=committer_email, message='Initial commit')

# Get all pages from XML dump (the file handle is closed once parsing is done)
with open(path_to_xml_dump) as xml_dump:
    dump = BeautifulSoup(xml_dump, "lxml").findAll('page')
print('Found', len(dump), 'pages')

page_counter = 0

for page_node in dump:

    page_title = page_node.title.text.replace('/', ' or ')
    revisions = page_node.findAll('revision')
    revision_counter = 1

    for revision in revisions:

        wikifile = '{}.{}.mediawiki'.format(
                        page_title,
                        revision.timestamp.text
                    )

        markdownfile = '{}.md'.format(page_title)

        # Get the content for the current revision
        revision_text = revision.find('text')

        # Write to the temp file
        with open('{}/{}'.format(tmp_folder, wikifile), 'w') as f:
            f.write(revision_text.text)
            print('Wrote {}'.format(wikifile))

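        # Use Pandoc to convert the file. Passing an argument list (instead of
        # a formatted string) keeps page titles with quotes or spaces intact.
        command = [
            'pandoc', '+RTS', '-K256m', '-RTS',
            '-f', 'mediawiki',
            '-t', 'markdown_mmd',
            '{}/{}'.format(tmp_folder, wikifile),
            '-o', '{}/{}'.format(output_folder, markdownfile)
        ]

        conversion_process = subprocess.Popen(command)
        conversion_process.wait()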
        print('Converted {}'.format(wikifile))

        # Stage the file
        repo.stage(markdownfile)

        # Commit
        if revision_counter == 1:
            commit_message = '{} : First Draft'.format(page_title)
        else:
            commit_message = '{} : v{}'.format(page_title, revision_counter)

        repo.commit(name=committer_name,
                    email=committer_email,
                    message=commit_message
                    )

        print('Committed {}'.format(commit_message))

        revision_counter += 1

    page_counter += 1

print('Processed', page_counter, 'pages')
```
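The script above loads the entire dump into BeautifulSoup at once. For large dumps, a streaming pass with the standard library's `xml.etree.ElementTree.iterparse` is one alternative. The following is a minimal sketch and not what was used for this migration: `local_name`, `iter_revisions`, and the inline sample XML are made up for illustration, and the helper assumes the dump's tags carry an XML namespace.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on namespaced tags."""
    return tag.rsplit('}', 1)[-1]


def iter_revisions(xml_source):
    """Yield (title, timestamp, text) for every revision, one page at a time."""
    for _event, elem in ET.iterparse(xml_source, events=('end',)):
        if local_name(elem.tag) != 'page':
            continue
        title = None
        for child in elem:
            name = local_name(child.tag)
            if name == 'title':
                title = child.text
            elif name == 'revision':
                fields = dict((local_name(c.tag), c.text or '') for c in child)
                yield title, fields.get('timestamp', ''), fields.get('text', '')
        elem.clear()  # free the finished page before moving on


# Tiny made-up dump, shaped like a MediaWiki export
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Home</title>
    <revision><timestamp>2014-01-01T00:00:00Z</timestamp><text>first</text></revision>
    <revision><timestamp>2014-02-01T00:00:00Z</timestamp><text>second</text></revision>
  </page>
</mediawiki>"""

revisions = list(iter_revisions(io.BytesIO(sample)))
for title, timestamp, text in revisions:
    print('{} @ {}: {}'.format(title, timestamp, text))
```

Each tuple could then feed the same write, convert, and commit steps as the loop above.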

## Cleaning up Output

```python
# -*- coding: utf-8 -*-

import sys
from glob import glob
import re

reload(sys)
sys.setdefaultencoding('utf-8')

for file in glob('./pages/*.md'):
    # Read the whole file; the handle is closed as soon as the block ends
    with open(file, 'r') as i:
        f_ = i.read()

    f_ = f_.replace('\n', '')
    f_ = f_.replace('', '')
    f_ = f_.replace('\n', '')
    f_ = f_.replace('', '')
    f_ = f_.replace('`\\', '`')
    # f_ = re.sub(r'^`(( )+)?(.*)`', r'    \3', f_, flags=re.MULTILINE)
    f_ = f_.replace('` `', '')
    f_ = f_.replace('', '')
    f_ = f_.replace('`**', '')
    f_ = f_.replace('**`', '')
    f_ = re.sub(r'\s{4}(`\*\*`)', r'    ', f_, flags=re.MULTILINE)
    f_ = re.sub(r'(-\s{3})', r'* ', f_, flags=re.MULTILINE)

    with open(file, 'w') as o:
        o.write(f_)
        print 'Wrote', file

```

Then had to review each converted page manually :/
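To make the last substitution in the cleanup script less opaque: pandoc (at least in versions from around that time) typically writes bullet items as a dash followed by three spaces, and the `-\s{3}` pattern rewrites those to `* `. A small check on a made-up snippet:

```python
import re

# Made-up sample in the shape pandoc emits for bullet lists
sample = "-   first item\n-   second item\n"

converted = re.sub(r'(-\s{3})', r'* ', sample, flags=re.MULTILINE)
print(converted)
```

The pattern is not anchored to the start of a line, so a dash followed by three whitespace characters anywhere in the text is rewritten too. That is worth watching for during the manual pass.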

## Other Notes

### Comparisons

Flask using Markdown to render each page

```
Transactions:                349 hits
Availability:             100.00 %
Elapsed time:               9.20 secs
Data transferred:           3.21 MB
Response time:              1.77 secs
Transaction rate:          37.93 trans/sec
Throughput:                 0.35 MB/sec
Concurrency:               67.14
Successful transactions:         349
Failed transactions:               0
Longest transaction:            2.27
Shortest transaction:           0.06
```

Statically generated HTML (as templates in Flask)

```
Transactions:               1671 hits
Availability:             100.00 %
Elapsed time:               9.02 secs
Data transferred:          15.37 MB
Response time:              0.01 secs
Transaction rate:         185.25 trans/sec
Throughput:                 1.70 MB/sec
Concurrency:                1.23
Successful transactions:        1671
Failed transactions:               0
Longest transaction:            0.04
Shortest transaction:           0.00
```

Built-in Gollum server

```
Transactions:                 71 hits
Availability:              86.59 %
Elapsed time:               9.32 secs
Data transferred:           0.82 MB
Response time:              2.78 secs
Transaction rate:           7.62 trans/sec
Throughput:                 0.09 MB/sec
Concurrency:               21.18
Successful transactions:          71
Failed transactions:              11
Longest transaction:            7.54
Shortest transaction:           0.00
```
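For a quick side-by-side, the transaction rates from the three summaries can be compared directly. The numbers are copied from the output above; the labels and the choice of the Gollum server as the baseline are just for illustration.

```python
# Transaction rates (trans/sec) copied from the three summaries above
rates = {
    'Flask + Markdown': 37.93,
    'Static HTML': 185.25,
    'Gollum': 7.62,
}

baseline = rates['Gollum']

# Relative throughput, using the slowest setup as the baseline
speedups = dict((name, rate / baseline) for name, rate in rates.items())

for name in sorted(speedups, key=speedups.get):
    print('{:<18} {:>7.2f} trans/sec  {:>5.1f}x'.format(
        name, rates[name], speedups[name]))
```

By this measure the static HTML setup handles roughly 24 times the Gollum rate and a little under 5 times the rate of rendering Markdown on each request.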