Flow based programming and ETL

For quite some time I've been searching for a resonable approch on Extract, Transform, Load (ETL) in php where I can define a workflow, based on e.g. a UML diagram and just "run" it asynchronously. A solution with a fully fledged ETL tool like MS SSIS or talend were out of the question, due to their high complexity and hardware requirements. Also the possible solution has to integrate into our existing php environment.

phpflo

If you have already read my other posts, you know me to already use RabbitMQ and php-amqp for asynchronously handlinge import processes. This goes one step further and introduces the flow based programming "design pattern".

In computer science, flow-based programming (FBP) is a programming paradigm that defines applications as networks of "black box" processes, which exchange data across predefined connections by message passing, where the connections are specified externally to the processes. These black box processes can be reconnected endlessly to form different applications without having to be changed internally. FBP is thus naturally component-oriented. (Wikipedia)

Developers used to the Unix philosophy should be immediately familiar with FBP:

This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.

It also fits well in Alan Kay's original idea of object-oriented programming:

I thought of objects being like biological cells and/or individual computers on a network, only able to communicate with messages (so messaging came at the very beginning -- it took a while to see how to do messaging in a programming language efficiently enough to be useful).

Sounds good, doesn't it?

improvements and status quo

Initially I worked with phpflo, adapted it for symfony to use dependency injection instead of a factory-like process and was kind of happy. After a short while, the first problems arose:

Having serveral long running processes introduced the problem of "state" within components and also the network. So, already initialized networks could not be reused and had to be destroyed. Using a compiler-pass approach with a registry of components, also introduced port states within the process.

Several ideas came to my mind: Just restart the processes after every message from the queue or even fork the single ETL processes per message - but everything just lead into more problems:

  • Restarting processes means framework initialization overhead
  • Forking processes needs some kind of lowlevel process management

Overall, the best approach was to integrate some stage-management into phpflo, split the library into several components and implement a parser for the (more convinient) FBP domain specific language (DSL). You can find the implementation here. The split into several libraries was necessary due to separation of concerns, maintenance and possible future contributions of generic components.

integration

Added to our technology stack, phpflo integrated fine with symfony and all components are loaded via DIC. This allows for easy configuration of processes:

CategoryCreator() out -> in MdbPersister()
CategoryCreator() route -> in RouteCreator()
CategoryCreator() media -> in MediaIterator()
MediaIterator(MediaIterator) out -> in MediaHandler()
CategoryCreator() bannerset -> in BannersetHandler()
BannersetHandler() out -> bannerset CategoryCreator()
CategoryCreator() tag -> tags TagCreator(TagCreator)
TagCreator() tags -> tag CategoryCreator()
CategoryCreator() hierarchy -> hierarchy TagCreator()
TagCreator(TagCreator) hierarchy -> hierarchy CategoryCreator()
CategoryCreator() sidebar -> in SidebarHandler()
SidebarHandler() out -> sidebar CategoryCreator()
SidebarHandler() build -> in JsonFieldFetcher()
JsonFieldFetcher() sidebar -> in SidebarCreator()
RouteCreator() out -> in MdbPersister()

This replaces a 450+ lines JSON-file!

So, given all processes are defined as symfony (private) services, they can use all dependencies they need and are even easier to test.

Thanks to the datatype checks I've introduced into phpflo, connections are checked for compatibility. For us this means: Every component with compatible ports could be stitched together and worked with. That removed a lot of inheritance, type-checks and so on.

If you need a similar solutions, I suggest you continue reading here: phpflo on GitHub

And last, but not least: Big thanks to James (@aretecode) for his code reviews and support concerning architectural descisions!

Tags: php, symfony2

Refactoring the ansible-symfony2 deployment

For a few months now, I've been maintaining the ServerGrove ansible-symfony2 deployment, which by the time was more of a proof-of-concept ansible galaxy role.

In search of deployment

When I was first looking into deploying our Symfony application, I've first tried Capistrano (who hasn't). Although I already worked with ansible at that time, using it for deployment, rather than provisioning was not the first idea I had in mind. After struggling with Capistrano and its ruby-based dependencies, I quickly realized: I need something else with more manageable dependencies and also easy configuration.

ansible-symfony2

After some research (read: using google), I ran into ServerGrove's deployment role. At that time I just copied it and rewrote most parts to fit my application. At that time the repo was already quite abandoned. Seeing the open issues and PRs, I decided to ask if I could maintain the repo and even develop it further - Added a few code pieces from my modified deployment, added testing and thought I was done.

Refactoring

Because some people seemed to use the role for their deployments - or at least as a base to start from - I decided to take it further and incorporate some of the key features of my custom build. This includes a RELEASE file for each release that holds the release's unique identifier, hooks for additional tasks, git shallow copy and splitting into multiple files. Full changelog here.

Updating tests to use docker infrastructure

Getting Travis.ci to do as I wanted for testing, took me quite a while, too ;-). Main problem here: If you use python as the docker container language for version constraining to 2.7 for ansible, only php 5.3.x is available. Using the "old" infrastructure with virtual machines also just brings 5.3.x around - and some more side effects like you really have to install everything yourself. So for current testing, I had to go with symfony 2.6.

Extra tasks: hooks

In my current production deployment for Motor Presse, I need some tasks done with gulp. We're heavily relying on gulp to compile our stylus files to CSS, build minified JS and generally handle asset versioning. So how to include tasks like that into the deployment when it's most likely only an edge case? One option would be to add another trigger like "symfony_project_trigger_gulp" - but more config means more tests and not every project uses something like gulp or grunt. The second problem I was facing was cache flushing. How to flush caches on/after deployment? Depending on the infrastructure this might be a php5-fpm restart - or a apache2 restart, because someone uses mod_php. In our special case, we're forking into the php process and manually trigger php code to flush APCu. These are only three examples which already put so much complexity into our deployment that it's almost not manageable. Thankgod there's the include statement. Using a configuration variable for a (role-external) task path, you can include this YAML file within a task of your role dynamically. In config:

---
symfony_project_post_folder_creation_tasks: "{{playbook_dir}}/hooks/post_folder_creation.yml"

and in the actual task:

---
- include: "{{ symfony_project_post_folder_creation_tasks | default('empty.yml') }}"

Nice about this approach is the errorhandling by including an empty default file. I'd have preferred a solution with something like "skip if no include is provided", but instead of a nice onliner, this would have been a file stat, checks if stat.exists and then include. At least 6 lines of code instead of just one, just to have a "skipped". Ok, I could add a debug to the empty file where it says "skipped: no hook defined", but... no real sense in that.

Future ideas

Since I'm actively using this role to deploy large platforms, I'm trying to improve it step by step. The 2.0.1-alpha release is a huge leap forwards to a clean and flexible role. For future releases, there's still a lot of work: As soon as ansible is released in a 2.x version and error handling is finaly upon us, I'd be glad to integrate it into this role. For my current production deployment I'm using a GELF message curl request to a graylog server to promote a deployment. But how to check if there's an error? Or how to add partial errors/skips to such a message? If you have further ideas, improvements, issues or comments, just let me know on github :-)

Tags: php, symfony2, deployment, ansible

Developing a scalable Symfony 2 application

Introduction

One year ago, when I joined the Motor Presse GmbH & Co. KG, a well established and renowned publishing house in Germany, most of the magazine's web presences based on a local CMS. The CMS is quite flexible and was - for the time it was developed around 4-6 years prior - a good choice to use. But the main concepts for caching and the programming paradigms used are quite ancient and there's no real hint on any improvements in the future. So my base setup was clear: Old CMS, really slow (SOAP) interface for decoupling, tons of business logic in templates and webpages that grew organically over a long timespan. Time for a change. One thing was clear: Since most editors were quite proficient using the established CMS for creating their content, this was not a thing we could easly change. But we needed new ways of doing things: A version control, continuous integration, Testing, different test/stage systems and most of all a common development VM with an according IDE.

When we got the task to build a relaunch for menshealth.de, we made it our mission to completely restructure our team, development processes and systems. And this is how it went.

First of all: We are a small development team, consisting of 4 developers and one designer who worked nearly exclusively for the relaunch project and the experience levels of working with more sophisticated frameworks like Symfony2 varies a lot!

First thoughts on infrastructure and design

On first sight the main performance bottlenecks were the (normalized) MySQL database and the non-maintainable templating structure/handling. Facing these facts and knowing we could not move to another CMS in time, we decided to split the future application into three parts. First, the CMS as it was, to keep our editors satisfied and the migration manageable - also this gives us the opportunity to keep our existing infrastructure. Second an API (REST... what else) for data transformation, access and also later for external data exchange plus user authentication. Third and last part is a frontend application to visualize the data from our API.

Frameworks/technologies

Because I am a great fan of Symfony 2, the choice concerning the framework was quite clear. And having a need for fast and structured storage of complex data, we decided to try MongoDB which I had in use in other projects before and was quite satisfied with. As we decided to split the application we've also gained the possibility to asynchronously work our data transformation tasks from CMS to API. RabbitMQ and the AMQP library were logical choices, since they are stable, performant and integrate greatly into a sf2 project.

Building the application infrastructure and deployment

Having defined our components and after long talks with our server provider, we came up with following setup: (image) A varnish proxy/cache with failover in front of two webservers for loadbalancing, two MongoDB instances with an arbiter, one Memcache instance each for session sharing, one API server and the already existing MySQL cluster with one CMS (backend) server. Images are delivered by a small "micro-image-server" from a mounted NFS, images uploaded via the backend CMS. This setup is managed by puppet.

QA, release & deployment

For release and deployment we've decided to go for "we do everything in the master" during the hot phase of the project. As soon as things cooled down, we've switched to the more (process-)secure gitflow. For Deployments over all systems we use ansible with a custom role, based on ServerGrove's symfony2 deployment.

Environments:

  • dev -> virtual machine with vagrant and ansible, local for each developer.
  • stage -> separate server.
  • preview -> separate server with production settings.
  • prod -> production servers.

environments overview

Our QA flow now consists of these components:

  • Jira ticket, approved and assigned by project management.
  • Development in feature branch (git flow) on a local VM.
  • develop branch is checked by Jenkins/PHPCI for every commit and then auto-deployed to test server. First acceptance test stage.
  • Release branches are also checked by Jenkins and then deployed to preview system, second acceptance test stage.
  • master branch is checked by jenkins after every merge, manually deployed to production.

What we did for scalability

We started with a short list of known performance bottlenecks/requirements for our special case:

  • Slow CMS backend due to loads of normalizing in MySQL and bad application architecture php-wise.
  • Frontends need no direct write access to database (security & speed!).
  • CMS and frontend can have slight asynchronicity.
  • Frontend needs to be fast without varnish and therfore allow for a higher content update frequency.
  • Varnish cache times should be manageable through frontend responses.
  • Make it easy to add new frontend servers if load increases (vertical scaling).
  • Keep backend decoupled but allow for vertical scaling there, too.

Looking at those requirements, we see: Nothing really new or unfamiliar. This list lead to following architecture:

  • CMS totally decoupled, triggers REST API calls for every CUD action with minimal payload. Noticed the missing "R"? Reads are done directly from DB for performance reasons.
  • API application processes messages with RabbitMQ/AMQP to keep load manageable.
  • API extracts and transforms data from MySQL server to MongoDB documents.
  • Data goes into a MongoDB cluster.
  • Frontends access MongoDB directly, but readonly.

simplified infrastructure overview

Some more ideas we've sucessfully to the test:

  • Symfony controllers are used as services (see Mathias Noback). This allows for easier testing and much better dependency visibility!
  • One-Bundle approach as advertised in best practices, modified to have src/CoreBundle and src/Core for our library classes.
  • gulp/stylus for css/js management.
  • Removed Assetic, since it is not necessary when using gulp.
  • Using Symfony Response object and custom configurations for ETag/cache lifetime.
  • APCu cache for userland caching.
  • All monolog logging is pulled by Graylog via logfile (GELF) -> Awesant -> RabbitMQ.
  • The library is connected to coreBundle via services (providers) which are only handlers for private services, collected during compiler passes. This has a huge impact on DIC size and performance.
  • All access to CMS data is hidden behind a facade with interfaces and a provider structure that allows for different API versions and access for a (possible) variety of CMS systems per client.

Pitfalls

During the time we've encountered some (from hindsight) funny errors:

  • Using APCu, always check how much memory consumption you have, how quickly the cache builds and flushes. Check hitrates! In our case the cache filled up and flushed so quickly, you couldn't see it on monitoring. But hitrates show ;-)
  • Logging! I can't emphasize enough: Every crucial action that fails should be logged - but with enough meta-information to be conclusive!
  • If you're using the reverse proxy cache kernel, make sure it uses the same strategy (e.g. key) as your varnish does. We had the user agent in varnish cache key, but not within the vary headers - so the reverse proxy cache eratically cached desktop/mobile sepcific stuff and people got a lot of wrong ads. Took me several hours to find :-(

Conclusion

When we first load-tested the application, we were quite surprised by the good overall performance. This behavior continued with the go-live - with some hickups with the cache keys for varnish. We're quite satisfied with our current architecture, especially since we decoupled most of the code from the framework itself and placed it into library structures. But there is still a lot of work to do: We've got a low UnitTest coverage, some of the library parts should be excluded as separate libraries and repositories and also - and this is true for nearly every project - a lot of refactoring has to be done in templating and some library parts.

And this is what we've built: menshealth.de

Tags: php, symfony2, software architecture