{"id":1635,"date":"2016-12-19T17:32:26","date_gmt":"2016-12-19T20:32:26","guid":{"rendered":"https:\/\/www.nachodelatorre.com.ar\/mosconi\/?p=1635"},"modified":"2016-12-19T17:32:26","modified_gmt":"2016-12-19T20:32:26","slug":"haciendo-manejable-a-big-data","status":"publish","type":"post","link":"https:\/\/www.fie.undef.edu.ar\/ceptm\/?p=1635","title":{"rendered":"Haciendo manejable a Big Data"},"content":{"rendered":"<p>Esta t\u00e9cnica reduce los conjuntos de datos mientras preserva sus relaciones matem\u00e1ticas fundamentales.<!--more--><\/p>\n<div class=\"field field-name-field-article-content field-type-text-long field-label-hidden\">\n<div class=\"field-items\">\n<div class=\"field-item even\">\n<p><img loading=\"lazy\" class=\" alignright\" title=\"\" draggable=\"false\" src=\"http:\/\/news.mit.edu\/sites\/mit.edu.newsoffice\/files\/styles\/news_article_image_top_slideshow\/public\/images\/2016\/MIT-Coreset_0.jpg?itok=rPiku97B\" alt=\"A new technique devised by MIT researchers can take data sets with huge numbers of variables and find approximations of them with far fewer variables.\n\" width=\"332\" height=\"221\" \/>One way to handle big data is to shrink it. If you can identify a small subset of your data set that preserves its salient mathematical relationships, you may be able to perform useful analyses on it that would be prohibitively time consuming on the full set.<\/p>\n<p>The methods for creating such \u201ccoresets\u201d vary according to application, however. 
Last week, at the Annual Conference on Neural Information Processing Systems, researchers from MIT\u2019s Computer Science and Artificial Intelligence Laboratory and the University of Haifa in Israel presented a new coreset-generation technique that\u2019s tailored to a whole family of data analysis tools with applications in natural-language processing, computer vision, signal processing, recommendation systems, weather prediction, finance, and neuroscience, among many others.<\/p>\n<p>\u201cThese are all very general algorithms that are used in so many applications,\u201d says Daniela Rus, the Andrew and Erna Viterbi Professor of Electrical Engineering and Computer Science at MIT and senior author on the new paper. \u201cThey\u2019re fundamental to so many problems. By figuring out the coreset for a huge matrix for one of these tools, you can enable computations that at the moment are simply not possible.\u201d<\/p>\n<p>As an example, in their paper the researchers apply their technique to a matrix \u2014 that is, a table \u2014 that maps every article on the English version of Wikipedia against every word that appears on the site. That\u2019s 1.4 million articles, or matrix rows, and 4.4 million words, or matrix columns.<\/p>\n<p>That matrix would be much too large to analyze using low-rank approximation, an algorithm that can deduce the topics of free-form texts. But with their coreset, the researchers were able to use low-rank approximation to extract clusters of words that denote the 100 most common topics on Wikipedia. 
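The low-rank approximation mentioned above can be sketched on a toy term-document matrix; the 4-article, 5-word matrix below is invented for illustration, and this is standard truncated SVD rather than the authors' coreset pipeline:

```python
import numpy as np

# Hypothetical toy term-document matrix: 4 "articles" x 5 "words".
# Rows 0-1 share wedding-like word counts, rows 2-3 share shooting-like ones.
X = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 1, 0],
    [0, 0, 3, 2, 2],
    [0, 1, 2, 3, 2],
], dtype=float)

# Rank-2 approximation via truncated SVD: keep only the two largest
# singular values and their singular vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

# Each kept row of Vt is a weighted mix of words -- a crude "topic"
# direction; the Frobenius error is exactly the discarded spectrum
# (Eckart-Young theorem).
err = np.linalg.norm(X - X2)
print(np.linalg.matrix_rank(X2))  # 2
```

On the full Wikipedia matrix this decomposition is infeasible, which is exactly where the coreset comes in.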
The cluster that contains \u201cdress,\u201d \u201cbrides,\u201d \u201cbridesmaids,\u201d and \u201cwedding,\u201d for instance, appears to denote the topic of weddings; the cluster that contains \u201cgun,\u201d \u201cfired,\u201d \u201cjammed,\u201d \u201cpistol,\u201d and \u201cshootings\u201d appears to designate the topic of shootings.<\/p>\n<p>Joining Rus on the paper are Mikhail Volkov, an MIT postdoc in electrical engineering and computer science, and Dan Feldman, director of the University of Haifa\u2019s Robotics and Big Data Lab and a former postdoc in Rus\u2019s group.<\/p>\n<p>The researchers\u2019 new coreset technique is useful for a range of tools with names like singular-value decomposition, principal-component analysis, and latent semantic analysis. But what they all have in common is dimension reduction: They take data sets with large numbers of variables and find approximations of them with far fewer variables.<\/p>\n<p>In this, these tools are similar to coresets. But coresets are application-specific, while dimension-reduction tools are general-purpose. That generality makes them much more computationally intensive than coreset generation \u2014 too computationally intensive for practical application to large data sets.<\/p>\n<p>The researchers believe that their technique could be used to winnow a data set with, say, millions of variables \u2014 such as descriptions of Wikipedia pages in terms of the words they use \u2014 to merely thousands. At that point, a widely used technique like principal-component analysis could reduce the number of variables to mere hundreds, or even lower.<\/p>\n<p>The researchers\u2019 technique works with what is called sparse data. Consider, for instance, the Wikipedia matrix, with its 4.4 million columns, each representing a different word. Any given article on Wikipedia will use only a few thousand distinct words. 
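A sparse row of that kind can be stored by keeping only its nonzero entries; a minimal sketch, assuming a plain Python dict from word-column index to count (the indices and counts below are invented, not from the paper):

```python
def sparse_dot(row_a, row_b):
    """Dot product of two sparse rows stored as {column_index: value} dicts.

    Iterating over the smaller dict makes the cost scale with the number
    of nonzero entries, not with the 4.4 million nominal columns.
    """
    if len(row_a) > len(row_b):
        row_a, row_b = row_b, row_a
    return sum(v * row_b.get(k, 0) for k, v in row_a.items())

# Two "articles": keys are word-column indices, values are word counts.
article_1 = {12: 3, 405: 1, 99813: 7}       # three distinct words
article_2 = {405: 2, 99813: 1, 2000000: 4}  # three distinct words

print(sparse_dot(article_1, article_2))  # 1*2 + 7*1 = 9
```

Only the two shared words contribute; the millions of zero columns never appear in the computation.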
So in any given row \u2014 representing one article \u2014 only a few thousand matrix slots out of 4.4 million will have any values in them. In a sparse matrix, most of the values are zero.<\/p>\n<p>Crucially, the new technique preserves that sparsity, which makes its coresets much easier to deal with computationally. Calculations become a lot easier if they involve a lot of multiplication by and addition of zero.<\/p>\n<p>The new coreset technique uses what\u2019s called a merge-and-reduce procedure. It starts by taking, say, 20 data points in the data set and selecting 10 of them as most representative of the full 20. Then it performs the same procedure with another 20 data points, giving it two reduced sets of 10, which it merges to form a new set of 20. Then it does another reduction, from 20 down to 10.<\/p>\n<p>Even though the procedure examines every data point in a huge data set, because it deals with only small collections of points at a time, it remains computationally efficient. And in their paper, the researchers prove that, for applications involving an array of common dimension-reduction tools, their reduction method provides a very good approximation of the full data set.<\/p>\n<p>That method depends on a geometric interpretation of the data, involving something called a hypersphere, which is the multidimensional analogue of a circle. Any piece of multivariable data can be thought of as a point in a multidimensional space. In the same way that the pair of numbers (1, 1) defines a point in a two-dimensional space \u2014 the point one step over on the X-axis and one step up on the Y-axis \u2014 a row of the Wikipedia table, with its 4.4 million numbers, defines a point in a 4.4-million-dimensional space.<\/p>\n<p>The researchers\u2019 reduction algorithm begins by finding the average value of the subset of data points \u2014 let\u2019s say 20 of them \u2014 that it\u2019s going to reduce. 
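The merge-and-reduce control flow described above can be sketched as follows. This is only an illustration of the streaming structure: the points are one-dimensional for simplicity, and the reduce step here is a placeholder (it keeps the 10 points closest to the chunk mean) standing in for the paper's actual coreset construction:

```python
import random

def reduce_chunk(points, k):
    """Placeholder reduction: keep the k points closest to the chunk mean.

    The paper's real reduction uses the hypersphere-projection
    construction; this stand-in only shows the merge-and-reduce flow.
    """
    mean = sum(points) / len(points)
    return sorted(points, key=lambda p: abs(p - mean))[:k]

def merge_and_reduce(stream, chunk=20, k=10):
    """Process the stream 20 points at a time: reduce each chunk to 10,
    merge with the running reduced set of 10 to form 20, reduce again.
    Memory use stays bounded by the chunk size, not the stream length."""
    carry = []
    for i in range(0, len(stream), chunk):
        reduced = reduce_chunk(stream[i:i + chunk], k)
        carry = reduce_chunk(carry + reduced, k) if carry else reduced
    return carry

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)]
print(len(merge_and_reduce(data)))  # prints 10
```

Every point in the stream is examined once, but no step ever holds more than 20 points at a time, which is what keeps the procedure efficient on huge data sets.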
This, too, defines a point in a high-dimensional space; call it the origin. Each of the 20 data points is then \u201cprojected\u201d onto a hypersphere centered at the origin. That is, the algorithm finds the unique point on the hypersphere that\u2019s in the direction of the data point.<\/p>\n<p>The algorithm selects one of the 20 data projections on the hypersphere. It then selects the projection on the hypersphere farthest away from the first. It finds the point midway between the two and then selects the data projection farthest away from the midpoint; then it finds the point midway between those two points and selects the data projection farthest away from it; and so on.<\/p>\n<p>The researchers were able to prove that the midpoints selected through this method will converge very quickly on the center of the hypersphere. The method will quickly select a subset of points whose average value closely approximates that of the 20 initial points. That makes them particularly good candidates for inclusion in the coreset.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<p><strong>Source:<\/strong> <em><a href=\"http:\/\/news.mit.edu\/2016\/making-big-data-manageable-1214\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/news.mit.edu<\/a><\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This technique shrinks data sets while preserving their fundamental mathematical relationships.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[23,29],"tags":[],"_links":{"self":[{"href":"https:\/\/www.fie.undef.edu.ar\/ceptm\/index.php?rest_route=\/wp\/v2\/posts\/1635"}],"collection":[{"href":"https:\/\/www.fie.undef.edu.ar\/ceptm\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.fie.undef.edu.ar\/ceptm\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.fie.undef.edu.ar\/ceptm\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.fie.undef.edu.ar\/ceptm\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1635"}],"version-history":[{"count":0,"href":"https:\/\/www.fie.undef.edu.ar\/ceptm\/index.php?rest_route=\/wp\/v2\/posts\/1635\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.fie.undef.edu.ar\/ceptm\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1635"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.fie.undef.edu.ar\/ceptm\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1635"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.fie.undef.edu.ar\/ceptm\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1635"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}