With the growing avalanche of big data pounding businesses each and every day, there are few professions more in demand than data scientist. What are data scientists? They are the superheroes of any organization: the people who have mastered the art of navigating complex sets of data stored across various systems and programs, information that is critical for any business not just to thrive, but to survive. In this infographic we explore both the need for data scientists and the education required to ride the avalanche.
Although Big Data technology has been advancing rapidly, we have not often seen completely new Big Data products hit the market. Usually it is the same old names releasing improved, upgraded versions of their applications. Now, however, a new SQL engine is entering the fray, and from an unlikely source.

Facebook, the social networking giant, in an act of unprecedented generosity, is releasing its custom-built SQL engine, named Presto, under an open source license. Several major web-based service providers are already testing Presto, with a view to switching their cloud-based offerings to this scalable, powerful SQL engine.

Facebook developed Presto to handle the incredible volume of queries that the social networking site generates each and every day. Facebook maintains a data repository that stores an amazing 300 petabytes of user data, and Presto allows Facebook to query this data in real time across a 1,000-node server cluster. Pretty impressive stuff. Presto supports data analysis and learning algorithms as well as more traditional transactional processing.

Originally, Facebook was built on Hadoop, but as the site grew in popularity it soon outstretched Hadoop's capabilities, and work on Presto began after several other existing solutions were researched, tested and discarded. By releasing Presto under an open source license, Facebook hopes that adoption by other companies will help it improve the product.

Facebook has already launched a website for Presto, which can be found at the following URL: http://prestodb.io/ On the front page of the site is a simple video of a command line query being run via Presto: 150 million rows of data, almost 23 GB in size, queried within just a couple of seconds. Really impressive.

We definitely recommend that interested parties check out the new Presto website; there is a great deal of technical documentation available, and of course, the source code is there for the taking.

Whether Presto will make a large impact on the Big Data technology market is hard to judge. However, at a grass-roots level, it does seem to offer some amazing performance. With companies such as Dropbox already putting Presto to good use, it is certainly off to a promising start.

By Mac Wheeler
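For readers who want a taste of querying Presto programmatically, here is a minimal sketch using the presto-python-client package. The coordinator host below is a placeholder, not a detail from Facebook's deployment; system.runtime.nodes is a built-in table that lists the nodes in the cluster.

    # Sketch only: querying a Presto coordinator from Python with the
    # presto-python-client package (pip install presto-python-client).
    # The host below is a placeholder, not a real deployment.
    import prestodb

    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.com",  # placeholder host
        port=8080,
        user="analyst",
        catalog="system",
        schema="runtime",
    )
    cur = conn.cursor()
    # system.runtime.nodes is a built-in table listing cluster nodes
    cur.execute("SELECT node_id, http_uri FROM system.runtime.nodes")
    for node_id, http_uri in cur.fetchall():
        print(node_id, http_uri)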
Testing the functionality of a piece of software, a product, or a website is an important part of the whole process; arguably it's even more important than the product itself. If you work hard to deliver a product and then release it without usability testing, the public will run into errors and difficulties while using it, get disappointed, and never come back. So it is better to spend some resources on testing before releasing the product. There are quite a few methods, but a relatively new one was introduced by Jakob Nielsen, one of the leading researchers in this area. Companies often test their products with a large number of users and a large number of tests, which is really expensive. What Nielsen offers is discount usability testing, which simply means testing a UI with just 5 users, running as many tests as you need. It costs far less, and the result is almost the same. There is also a reason why we only need 5 users. I have put up two links: one explaining discount usability testing itself, and one about why we only need 5 users. Overall, testing is better than not testing, even with one user and one test.
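The case for 5 users rests on Nielsen and Landauer's problem-discovery model: with n test users, the share of usability problems found is roughly 1 - (1 - L)^n, where L is the fraction of problems a single user uncovers (about 0.31 on average in their studies). A quick sketch of the arithmetic:

    # Nielsen & Landauer's problem-discovery model: the share of
    # usability problems found by n test users, assuming each user
    # uncovers a fraction L of all problems (about 0.31 on average
    # in their published studies).
    def problems_found(n, L=0.31):
        return 1 - (1 - L) ** n

    for n in (1, 3, 5, 10):
        print(f"{n:2d} users -> {problems_found(n):.0%} of problems found")
    # Five users already uncover roughly 84%; extra users add little.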
We humans, having evolved over thousands of years, are a social species. We like to be in a group, at least as long as the other members respect our privacy and we enjoy the time. We like working in groups: other members can point out problems we might not have seen, or show us a way to solve a problem. Or we can simply not do our assigned work and let the other members worry about it, so that in the end they have to do our part as well. The reason they do it is that they have been working hard for a while and have finished everything on their part; they don't want it ruined by some lazy smarty-pants, so they unwillingly do that person's work too. But at what cost? We have all faced this situation: playing games, writing a report at a company, doing an assignment in college or high school, and so on.

There are two points of view here: the lazy person's, and everyone else's. The lazy person always finds somebody to stick to and gets whatever they want. Not everybody sees through them. They are always very talkative, whether because of a lack of skills for the job, or laziness, or even a feeling of being smarter than others, since they think they can achieve more without trying as hard as everyone else does. On the other side are the good guys. They want to get the job done. They do the best they can, or almost the best, to deliver the jobs assigned to them, and they expect others to do the same. But we're not living in an ideal world, or are we?

Let me share my recent experience from the three group projects I've had this semester; sadly, I've been teammates with one of these creatures in two of them. In one project we were supposed to tune a given database, write the SQL commands, test the run times, and write a report on cost, time vs. space, and so on. It was time consuming, since the database was big: the largest table had 300k records, and there were 10 tables of different sizes. With two weeks until the due date, I had finished just under 50% of the assignment, and there was other work as well. So this guy, who is in my tutorial too, asked me about the assignment. I said it was OK so far, and then he asked whether I wanted to team up. I instantly said yes, thinking it would be good: we could finish earlier and better, and we had other assignments as well. Long story short, right up to the due date I had to walk the guy through doing this and doing that, and somehow he was pissed at me!? On the due day I finished my part and said: "OK, this is the tuning for each SQL part; you do the cost and space analysis. It should be easy, mainly copying lecture notes and writing something up." And that's exactly what he did: he copy-pasted the lecture notes and sent them to me 4 hours before the deadline. I had to go through each part to make it better, and in the end I submitted it, but it was awful. We only got a good mark for the tuning section.

In the other group project, a UI design for my HCI subject, we were supposed to create a website. It didn't need to be professional; it just needed to follow the requirements for the HCI material. Anyway, he was on our team too. He just talked and gave ideas: "This one is better, that one looks good, we should do that..." At one point I got angry and asked, "Are you going to do that? Who's going to do that in such a short time? Plus this is not our only assignment." He said, "Why are you talking like that?" I said sorry, whatever the team decides, I'll go along with. In the end I was assigned the coding part for the website, and he got the slides: 8 of them, mainly copied from our first presentation from an earlier week.
The result was that we got a good mark for the prototype and presentation, but on the individual report he got a better mark than me. Why? Because I spent too much time on the website, and my individual report, all of it actually, was done in the last 2 days. As a result I got just a passing mark for my report. Now, my question is this: I'm not the type of guy to behave the way he does, but what can I do in team work, especially in a college environment? Wow, I typed a lot. To whoever got here: well, good job reading it. Pamador out
In this tutorial we are going to create an AJAX file upload form that will let visitors upload files from their browsers via drag-and-drop or by selecting them individually. For this purpose, we will combine the powerful jQuery File Upload plugin with the neat jQuery Knob to present a slick CSS3/JS-driven interface.
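The plugin handles the browser side, but the files still have to land somewhere on the server. As a rough sketch of what that backend could look like (not part of this tutorial), here is a minimal handler in Python with Flask; the /upload route and the "files" field name are assumptions and must match however the jQuery File Upload plugin is configured on the client:

    # Sketch of a server endpoint the upload form could post to,
    # using Flask. The /upload route and the "files" field name are
    # assumptions, not details from the tutorial.
    import os
    from flask import Flask, jsonify, request
    from werkzeug.utils import secure_filename

    app = Flask(__name__)
    UPLOAD_DIR = "uploads"
    os.makedirs(UPLOAD_DIR, exist_ok=True)

    @app.route("/upload", methods=["POST"])
    def upload():
        saved = []
        for f in request.files.getlist("files"):
            name = secure_filename(f.filename)  # sanitize client-supplied name
            f.save(os.path.join(UPLOAD_DIR, name))
            saved.append(name)
        return jsonify({"uploaded": saved})

    if __name__ == "__main__":
        app.run(debug=True)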
Hadoop forces you to write every computation in terms of a map, a group-by, and an aggregate, or perhaps a sequence of such computations. Running computations in this manner is a straitjacket, and many calculations are better suited to some other model. The only reason to put on this straitjacket is that by doing so, you can scale up to extremely large data sets. Most likely your data is orders of magnitude smaller. But because "Hadoop" and "Big Data" are buzzwords, half the world wants to wear this straitjacket even if they don't need to.
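To make that shape concrete, here is the classic word count written as exactly those three steps in plain Python; this is a toy sketch, and "input.txt" is a placeholder file name:

    # Toy word count written as the three steps Hadoop imposes:
    # map, group by, aggregate. "input.txt" is a placeholder.
    from itertools import groupby

    with open("input.txt") as f:
        # map: emit a (word, 1) pair for every word
        pairs = [(word, 1) for line in f for word in line.split()]

    # group by: bring equal keys together (Hadoop's shuffle/sort phase)
    pairs.sort(key=lambda kv: kv[0])

    # aggregate: reduce each group of pairs to a single count
    counts = {word: sum(n for _, n in group)
              for word, group in groupby(pairs, key=lambda kv: kv[0])}
    print(counts)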
But my data is hundreds of megabytes! Excel won't load it.

Too big for Excel is not "Big Data". There are excellent tools out there; my favorite is Pandas, which is built on top of NumPy. You can load hundreds of megabytes into memory in an efficient vectorized format. On my 3-year-old laptop, it takes NumPy the blink of an eye to multiply 100,000,000 floating point numbers together. Matlab and R are also excellent tools.

Hundreds of megabytes is also typically amenable to a simple Python script that reads your file line by line, processes it, and writes to another file.
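As a rough illustration of that claim (timings will vary by machine; this is a sketch, not a benchmark):

    # Elementwise multiply of 100 million doubles (~800 MB per array).
    import time
    import numpy as np

    a = np.random.rand(100_000_000)
    b = np.random.rand(100_000_000)

    start = time.perf_counter()
    c = a * b  # one vectorized pass, no Python-level loop
    print(f"multiplied 100M floats in {time.perf_counter() - start:.2f}s")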
But my data is 10 gigabytes!

I just bought a new laptop. The 16GB of RAM I put in cost me $141.98 and the 256GB SSD was $200 extra (preinstalled by Lenovo). Additionally, if you load a 10GB csv file into Pandas, it will often be considerably smaller in memory, the result of storing the numerical string "17284932583" as a 4 or 8 byte integer, or storing "284572452.2435723" as an 8 byte double. Worst case, you might actually have to avoid loading everything into RAM simultaneously.

But my data is 100GB/500GB/1TB!

A 2 terabyte hard drive costs $94.99, 4 terabytes is $169.99. Buy one and stick it in a desktop computer or server. Then install Postgres on it.

Hadoop << SQL, Python Scripts

In terms of expressing your computations, Hadoop is strictly inferior to SQL. There is no computation you can write in Hadoop which you cannot write more easily in either SQL or a simple Python script that scans your files.

SQL is a straightforward query language with minimal leakage of abstractions, commonly used by business analysts as well as programmers. Queries in SQL are generally pretty simple. They are also usually very fast: if your database is properly indexed, multi-second queries will be uncommon.

Hadoop does not have any conception of indexing. Hadoop has only full table scans. Hadoop is full of leaky abstractions: at my last job I spent more time fighting with Java memory errors, file fragmentation and cluster contention than I spent actually worrying about the mostly straightforward analysis I wanted to perform.

If your data is not structured like a SQL table (e.g., plain text, json blobs, binary blobs), it's generally straightforward to write a small Python or Ruby script to process each row of your data. Store it in files, process each file, and move on. Under circumstances where SQL is a poor fit, Hadoop will be less annoying from a programming perspective. But it still provides no advantage over simply writing a Python script to read your data, process it, and dump it to disk.

In addition to being more difficult to code for, Hadoop will also nearly always be slower than the simpler alternatives. SQL queries can be made very fast by the judicious use of indexes: to compute a join, PostgreSQL will simply look at an index (if present) and look up the exact key that is needed (there is a short sketch of this at the end of this section). Hadoop requires a full table scan, followed by re-sorting the entire table. The sorting can be made faster by sharding across multiple machines, but on the other hand you are still required to stream data across multiple machines. In the case of processing binary blobs, Hadoop will require repeated trips to the namenode in order to find and process data. A simple Python script will require repeated trips to the filesystem.

But my data is more than 5TB!

Your life now sucks: you are stuck with Hadoop. You don't have many other choices (big servers with many hard drives might still be in play), and most of your other choices are considerably more expensive.

The only benefit to using Hadoop is scaling. If you have a single table containing many terabytes of data, Hadoop might be a good option for running full table scans on it. If you don't have such a table, avoid Hadoop like the plague. It isn't worth the hassle, and you'll get results with less effort and in less time if you stick to traditional methods.
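To make the indexing point concrete, here is a minimal sketch using SQLite from Python's standard library; the table and column names are made up:

    # An index turns a point lookup into a B-tree probe instead of the
    # full table scan Hadoop would do. Names here are illustrative.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, payload TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        ((i % 10_000, f"event-{i}") for i in range(1_000_000)),
    )

    conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

    # With the index in place, this touches only the rows for user 42.
    count = conn.execute(
        "SELECT COUNT(*) FROM events WHERE user_id = ?", (42,)
    ).fetchone()[0]
    print(count)  # 100 rows out of a million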
P.S. The Sales Pitch

I'm building a startup aiming to provide data analysis (big and small) and realtime recommendations and optimization to publishers and e-commerce sites. If you are interested in being a beta user, email me at email@example.com. I also do consulting. If your company needs a Big Cloudy Data Strategy (TM), I can help you. But be warned: there is a good chance I'll set you up with Pandas and tell you to A/B test, rather than giving you Hadoop in the cloud.

P.P.S. Hadoop is a fine tool

I don't intend to hate on Hadoop. I use Hadoop regularly for jobs I probably couldn't easily handle with other tools. (Tip: I recommend using Scalding rather than Hive or Pig. Scalding lets you use Scala, which is a decent programming language, and makes it easy to write chained Hadoop jobs without hiding the fact that it really is mapreduce on the bottom.) Hadoop is a fine tool; it makes certain tradeoffs to target certain specific use cases. The only point I'm pushing here is to think carefully rather than just running Hadoop on The Cloud in order to handle your 500mb of Big Data at an Enterprise Scale.