
Open Source Daily

  • August 16, 2018: Open Source Daily No. 161

    August 16, 2018

    Every day we recommend one quality open source project from GitHub and one selected English article on technology or programming. Follow Open Source Daily. QQ group: 202790710; Weibo: https://weibo.com/openingsource; Telegram group: https://t.me/OpeningSourceOrg


    Today's recommended open source project: ow, a function argument validator. Link: GitHub

    Why we recommend it: ow is a function argument validator written in TypeScript. It lets you check arguments, or any other values that need checking, wherever you need to, and dropping it in can help while debugging. That said, being careful as you write the code in the first place is still the best way to save time.


    Today's recommended English article: "How to solve 90% of NLP problems: a step-by-step guide" by Emmanuel Ameisen

    Original link: https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e

    Why we recommend it: NLP stands for Natural Language Processing. This article works as a step-by-step NLP guide, recommended for anyone interested in the field.

    How to solve 90% of NLP problems: a step-by-step guide

    Text data is everywhere

    Whether you are an established company or working to launch a new service, you can always leverage text data to validate, improve, and expand the functionalities of your product. The science of extracting meaning and learning from text data is an active topic of research called Natural Language Processing (NLP).

    NLP produces new and exciting results on a daily basis, and is a very large field. However, having worked with hundreds of companies, the Insight team has seen a few key practical applications come up much more frequently than any other:

    • Identifying different cohorts of users/customers (e.g. predicting churn, lifetime value, product preferences)
    • Accurately detecting and extracting different categories of feedback (positive and negative reviews/opinions, mentions of particular attributes such as clothing size/fit…)
    • Classifying text according to intent (e.g. request for basic help, urgent problem)

    While many NLP papers and tutorials exist online, we have found it hard to find guidelines and tips on how to approach these problems efficiently from the ground up.

    How this article can help

    After leading hundreds of projects a year and gaining advice from top teams all over the United States, we wrote this post to explain how to build Machine Learning solutions to solve problems like the ones mentioned above. We’ll begin with the simplest method that could work, and then move on to more nuanced solutions, such as feature engineering, word vectors, and deep learning.

    After reading this article, you’ll know how to:

    • Gather, prepare and inspect data
    • Build simple models to start, and transition to deep learning if necessary
    • Interpret and understand your models, to make sure you are actually capturing information and not noise

    We wrote this post as a step-by-step guide; it can also serve as a high level overview of highly effective standard approaches.


    This post is accompanied by an interactive notebook demonstrating and applying all these techniques. Feel free to run the code and follow along!

    Step 1: Gather your data

    Example data sources

    Every Machine Learning problem starts with data, such as a list of emails, posts, or tweets. Common sources of textual information include:

    • Product reviews (on Amazon, Yelp, and various App Stores)
    • User-generated content (Tweets, Facebook posts, StackOverflow questions)
    • Troubleshooting (customer requests, support tickets, chat logs)

    “Disasters on Social Media” dataset

    For this post, we will use a dataset generously provided by Figure Eight, called “Disasters on Social Media”, where:

    Contributors looked at over 10,000 tweets culled with a variety of searches like “ablaze”, “quarantine”, and “pandemonium”, then noted whether the tweet referred to a disaster event (as opposed to a joke with the word or a movie review or something non-disastrous).

    Our task will be to detect which tweets are about a disastrous event as opposed to an irrelevant topic such as a movie. Why? A potential application would be to exclusively notify law enforcement officials about urgent emergencies while ignoring reviews of the most recent Adam Sandler film. A particular challenge with this task is that both classes contain the same search terms used to find the tweets, so we will have to use subtler differences to distinguish between them.

    In the rest of this post, we will refer to tweets that are about disasters as “disaster”, and tweets about anything else as “irrelevant”.

    Labels

    We have labeled data and so we know which tweets belong to which categories. As Richard Socher outlines below, it is usually faster, simpler, and cheaper to find and label enough data to train a model on, rather than trying to optimize a complex unsupervised method.

    Richard Socher’s pro-tip

    Step 2: Clean your data

    The number one rule we follow is: “Your model will only ever be as good as your data.”

    One of the key skills of a data scientist is knowing whether the next step should be working on the model or the data. A good rule of thumb is to look at the data first and then clean it up. A clean dataset will allow a model to learn meaningful features and not overfit on irrelevant noise.

    Here is a checklist to use to clean your data (see the code for more details):

    1. Remove all irrelevant characters such as any non-alphanumeric characters
    2. Tokenize your text by separating it into individual words
    3. Remove words that are not relevant, such as “@” Twitter mentions or URLs
    4. Convert all characters to lowercase, in order to treat words such as “hello”, “Hello”, and “HELLO” the same
    5. Consider combining misspelled or alternately spelled words into a single representation (e.g. “cool”/”kewl”/”cooool”)
    6. Consider lemmatization (reduce words such as “am”, “are”, and “is” to a common form such as “be”)
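
    As a rough illustration, here is a minimal Python sketch of the first few steps; the regular expressions and the example tweet are illustrative rather than the notebook's exact code:

    import re

    def clean_text(text):
        """Lightly normalize a tweet before tokenization."""
        text = re.sub(r"http\S+", " ", text)          # drop URLs
        text = re.sub(r"@\w+", " ", text)             # drop @mentions
        text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)   # keep only alphanumeric characters
        return text.lower()                           # normalize case

    def tokenize(text):
        """Split a cleaned string into individual word tokens."""
        return text.split()

    print(tokenize(clean_text("@user Huge fire near the station!! http://example.com")))
    # ['huge', 'fire', 'near', 'the', 'station']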

    After following these steps and checking for additional errors, we can start using the clean, labelled data to train models!

    Step 3: Find a good data representation

    Machine Learning models take numerical values as input. Models working on images, for example, take in a matrix representing the intensity of each pixel in each color channel.

    A smiling face represented as a matrix of numbers.

    Our dataset is a list of sentences, so in order for our algorithm to extract patterns from the data, we first need to find a way to represent it that our algorithm can understand, i.e. as a list of numbers.

    One-hot encoding (Bag of Words)

    A natural way to represent text for computers is to encode each character individually as a number (ASCII for example). If we were to feed this simple representation into a classifier, it would have to learn the structure of words from scratch based only on our data, which is impossible for most datasets. We need to use a higher level approach.

    For example, we can build a vocabulary of all the unique words in our dataset, and associate a unique index to each word in the vocabulary. Each sentence is then represented as a list that is as long as the number of distinct words in our vocabulary. At each index in this list, we mark how many times the given word appears in our sentence. This is called a Bag of Words model, since it is a representation that completely ignores the order of words in our sentence. This is illustrated below.

    Representing sentences as a Bag of Words. Sentences on the left, representation on the right. Each index in the vectors represents one particular word.
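
    A quick sketch of this representation using scikit-learn's CountVectorizer; the three sentences are stand-ins, not the tweet dataset:

    from sklearn.feature_extraction.text import CountVectorizer

    sentences = ["the quick brown fox", "the lazy dog", "the quick dog"]  # stand-in corpus

    # Build the vocabulary and the count-based sentence vectors.
    count_vectorizer = CountVectorizer()
    bow_vectors = count_vectorizer.fit_transform(sentences)   # sparse matrix: one row per sentence

    print(sorted(count_vectorizer.vocabulary_))   # one vector index per distinct word
    print(bow_vectors.toarray())                  # word counts per sentence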

    Visualizing the embeddings

    We have around 20,000 words in our vocabulary in the “Disasters on Social Media” example, which means that every sentence will be represented as a vector of length 20,000. The vector will contain mostly 0s because each sentence contains only a very small subset of our vocabulary.

    In order to see whether our embeddings are capturing information that is relevant to our problem (i.e. whether the tweets are about disasters or not), it is a good idea to visualize them and see if the classes look well separated. Since vocabularies are usually very large and visualizing data in 20,000 dimensions is impossible, techniques like PCA will help project the data down to two dimensions. This is plotted below.

    Visualizing Bag of Words embeddings.
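
    A rough sketch of such a projection; the article mentions PCA, and TruncatedSVD is substituted here only because it accepts sparse matrices directly. bow_vectors and labels are assumed to come from the earlier steps:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import TruncatedSVD

    # Project the sparse Bag of Words vectors down to two dimensions for plotting.
    svd = TruncatedSVD(n_components=2)
    projected = svd.fit_transform(bow_vectors)

    plt.scatter(projected[:, 0], projected[:, 1], c=labels, alpha=0.4)  # labels: 0 = irrelevant, 1 = disaster
    plt.title("Bag of Words vectors projected to 2D")
    plt.show()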

    The two classes do not look very well separated, which could be a feature of our embeddings or simply of our dimensionality reduction. In order to see whether the Bag of Words features are of any use, we can train a classifier based on them.

    Step 4: Classification

    When first approaching a problem, a general best practice is to start with the simplest tool that could do the job. When it comes to classifying data, a common favorite for its versatility and explainability is Logistic Regression. It is very simple to train and the results are interpretable, as you can easily extract the most important coefficients from the model.

    We split our data into a training set used to fit our model and a test set to see how well it generalizes to unseen data. After training, we get an accuracy of 75.4%. Not too shabby! Guessing the most frequent class (“irrelevant”) would give us only 57%. However, even if 75% accuracy were good enough for our needs, we should never ship a model without trying to understand it.
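
    A minimal sketch of that split-and-train step, again assuming bow_vectors and labels from the sketches above and leaving hyperparameters at scikit-learn defaults:

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Hold out 20% of the tweets to measure generalization to unseen data.
    X_train, X_test, y_train, y_test = train_test_split(bow_vectors, labels, test_size=0.2, random_state=0)

    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")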

    Step 5: Inspection

    Confusion Matrix

    A first step is to understand the types of errors our model makes, and which kind of errors are least desirable. In our example, false positives are classifying an irrelevant tweet as a disaster, and false negatives are classifying a disaster as an irrelevant tweet. If the priority is to react to every potential event, we would want to lower our false negatives. If we are constrained in resources however, we might prioritize a lower false positive rate to reduce false alarms. A good way to visualize this information is using a Confusion Matrix, which compares the predictions our model makes with the true label. Ideally, the matrix would be a diagonal line from top left to bottom right (our predictions match the truth perfectly).

    Confusion Matrix (Green is a high proportion, blue is low)
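
    A minimal way to compute the raw matrix with scikit-learn, reusing y_test and y_pred from the sketch above; the 0/1 label encoding is an assumption:

    from sklearn.metrics import confusion_matrix

    # Rows are true labels, columns are predicted labels (0 = irrelevant, 1 = disaster).
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    # cm[1, 0]: false negatives (real disasters predicted as irrelevant)
    # cm[0, 1]: false positives (irrelevant tweets predicted as disasters)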

    Our classifier creates more false negatives than false positives (proportionally). In other words, our model’s most common error is inaccurately classifying disasters as irrelevant. If false positives represent a high cost for law enforcement, this could be a good bias for our classifier to have.

    Explaining and interpreting our model

    To validate our model and interpret its predictions, it is important to look at which words it is using to make decisions. If our data is biased, our classifier will make accurate predictions in the sample data, but the model would not generalize well in the real world. Here we plot the most important words for both the disaster and irrelevant class. Plotting word importance is simple with Bag of Words and Logistic Regression, since we can just extract and rank the coefficients that the model used for its predictions.

    Bag of Words: Word importance
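
    One possible way to produce such a ranking, assuming the count_vectorizer and clf objects from the earlier sketches and that class 1 corresponds to "disaster":

    import numpy as np

    def most_important_words(vectorizer, classifier, n=10):
        """Rank vocabulary words by their Logistic Regression coefficient."""
        words = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))
        coefs = classifier.coef_[0]                      # one coefficient per vocabulary index
        order = np.argsort(coefs)
        return {
            "disaster": list(words[order[-n:]][::-1]),   # most positive coefficients
            "irrelevant": list(words[order[:n]]),        # most negative coefficients
        }

    print(most_important_words(count_vectorizer, clf))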

    Our classifier correctly picks up on some patterns (hiroshima, massacre), but clearly seems to be overfitting on some meaningless terms (heyoo, x1392). Right now, our Bag of Words model is dealing with a huge vocabulary of different words and treating all words equally. However, some of these words are very frequent, and are only contributing noise to our predictions. Next, we will try a way to represent sentences that can account for the frequency of words, to see if we can pick up more signal from our data.

    Step 6: Accounting for vocabulary structure

    TF-IDF

    In order to help our model focus more on meaningful words, we can use a TF-IDF score (Term Frequency, Inverse Document Frequency) on top of our Bag of Words model. TF-IDF weighs words by how rare they are in our dataset, discounting words that are too frequent and just add to the noise. Here is the PCA projection of our new embeddings.

    Visualizing TF-IDF embeddings.
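
    Switching representations is a small change with scikit-learn; in this sketch, clean_tweets stands for the cleaned tweet strings from Step 2:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Same Bag of Words idea, but counts are re-weighted by inverse document frequency,
    # so words that appear in almost every tweet contribute less.
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_vectors = tfidf_vectorizer.fit_transform(clean_tweets)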

    We can see above that there is a clearer distinction between the two colors. This should make it easier for our classifier to separate both groups. Let’s see if this leads to better performance. Training another Logistic Regression on our new embeddings, we get an accuracy of 76.2%.

    A very slight improvement. Has our model started picking up on more important words? If we are getting a better result while preventing our model from “cheating” then we can truly consider this model an upgrade.

    TF-IDF: Word importance

    The words it picked up look much more relevant! Although our metrics on our test set only increased slightly, we have much more confidence in the terms our model is using, and thus would feel more comfortable deploying it in a system that would interact with customers.

    Step 7: Leveraging semantics

    Word2Vec

    Our latest model managed to pick up on high signal words. However, it is very likely that if we deploy this model, we will encounter words that we have not seen in our training set before. The previous model will not be able to accurately classify these tweets, even if it has seen very similar words during training.

    To solve this problem, we need to capture the semantic meaning of words, meaning we need to understand that words like ‘good’ and ‘positive’ are closer than ‘apricot’ and ‘continent.’ The tool we will use to help us capture meaning is called Word2Vec.

    Using pre-trained words

    Word2Vec is a technique to find continuous embeddings for words. It learns from reading massive amounts of text and memorizing which words tend to appear in similar contexts. After being trained on enough data, it generates a 300-dimensional vector for each word in a vocabulary, with words of similar meaning being closer to each other.

    The authors of the paper open sourced a model that was pre-trained on a very large corpus which we can leverage to include some knowledge of semantic meaning into our model. The pre-trained vectors can be found in the repository associated with this post.
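
    One common way to load such pre-trained vectors is gensim's KeyedVectors; the file name below is only a placeholder for whichever vector file you download:

    import gensim

    # Load pre-trained 300-dimensional word vectors stored in binary word2vec format.
    word2vec = gensim.models.KeyedVectors.load_word2vec_format(
        "pretrained-word2vec-vectors.bin", binary=True
    )
    print(word2vec["disaster"].shape)   # (300,)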

    Sentence level representation

    A quick way to get a sentence embedding for our classifier is to average Word2Vec scores of all words in our sentence. This is a Bag of Words approach just like before, but this time we only lose the syntax of our sentence, while keeping some semantic information.

    Word2Vec sentence embedding
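
    A small sketch of that averaging, reusing the tokenizer from Step 2 and the word2vec vectors loaded above; tweets stands for the list of raw tweet strings:

    import numpy as np

    def sentence_to_vector(tokens, word2vec, dim=300):
        """Average the Word2Vec vectors of a tweet's words; unknown words are skipped."""
        vectors = [word2vec[w] for w in tokens if w in word2vec]
        if not vectors:                    # no known word: fall back to the zero vector
            return np.zeros(dim)
        return np.mean(vectors, axis=0)

    embeddings = np.array([sentence_to_vector(tokenize(clean_text(t)), word2vec) for t in tweets])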

    Here is a visualization of our new embeddings using previous techniques:

    Visualizing Word2Vec embeddings.

    The two groups of colors look even more separated here; our new embeddings should help our classifier find the separation between both classes. After training the same model a third time (a Logistic Regression), we get an accuracy score of 77.7%, our best result yet! Time to inspect our model.

    The Complexity/Explainability trade-off

    Since our embeddings are not represented as a vector with one dimension per word as in our previous models, it’s harder to see which words are the most relevant to our classification. While we still have access to the coefficients of our Logistic Regression, they relate to the 300 dimensions of our embeddings rather than the indices of words.

    For such a low gain in accuracy, losing all explainability seems like a harsh trade-off. However, with more complex models we can leverage black box explainers such as LIME in order to get some insight into how our classifier works.

    LIME

    LIME is available on Github through an open-sourced package. A black-box explainer allows users to explain the decisions of any classifier on one particular example by perturbing the input (in our case removing words from the sentence) and seeing how the prediction changes.
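
    Here is a rough sketch of explaining a single prediction with LIME; clf_w2v stands for a classifier trained on the averaged Word2Vec embeddings, and the example sentence is invented:

    import numpy as np
    from lime.lime_text import LimeTextExplainer

    def predict_proba(texts):
        """Map raw strings to class probabilities so LIME can freely perturb the input."""
        vectors = np.array([sentence_to_vector(tokenize(clean_text(t)), word2vec) for t in texts])
        return clf_w2v.predict_proba(vectors)

    explainer = LimeTextExplainer(class_names=["irrelevant", "disaster"])
    explanation = explainer.explain_instance("Forest fire spreading near the highway", predict_proba, num_features=6)
    print(explanation.as_list())   # (word, weight) pairs for this one prediction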

    Let’s see a couple explanations for sentences from our dataset.

    Correct disaster words are picked up to classify as “relevant”.
    Here, the contribution of the words to the classification seems less obvious.

    However, we do not have time to explore the thousands of examples in our dataset. What we’ll do instead is run LIME on a representative sample of test cases and see which words keep coming up as strong contributors. Using this approach we can get word importance scores like we had for previous models and validate our model’s predictions.

    Word2Vec: Word importance

    The model picks up on highly relevant words, implying that it makes understandable decisions. These seem like the most relevant words out of all previous models, and therefore we’re more comfortable deploying it to production.

    Step 8: Leveraging syntax using end-to-end approaches

    We’ve covered quick and efficient approaches to generate compact sentence embeddings. However, by omitting the order of words, we are discarding all of the syntactic information of our sentences. If these methods do not provide sufficient results, you can utilize more complex models that take in whole sentences as input and predict labels without the need to build an intermediate representation. A common way to do that is to treat a sentence as a sequence of individual word vectors using either Word2Vec or more recent approaches such as GloVe or CoVe. This is what we will do below.

    A highly effective end-to-end architecture (source)

    Convolutional Neural Networks for Sentence Classification train very quickly and work well as an entry level deep learning architecture. While Convolutional Neural Networks (CNN) are mainly known for their performance on image data, they have been providing excellent results on text related tasks, and are usually much quicker to train than most complex NLP approaches (e.g. LSTMs and Encoder/Decoder architectures). This model preserves the order of words and learns valuable information on which sequences of words are predictive of our target classes. Contrary to previous models, it can tell the difference between “Alex eats plants” and “Plants eat Alex.”
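
    As one concrete shape such a model could take, here is a small Keras sketch in that spirit; the vocabulary size, sequence length, and layer sizes are illustrative placeholders, not the post's exact configuration:

    from tensorflow.keras import layers, models

    VOCAB_SIZE, MAX_LEN, EMBED_DIM = 20000, 35, 300    # illustrative placeholders

    model = models.Sequential([
        layers.Input(shape=(MAX_LEN,)),                # each tweet padded/truncated to MAX_LEN word indices
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),       # could be initialized with the Word2Vec vectors
        layers.Conv1D(128, 3, activation="relu"),      # learns patterns over 3-word windows, preserving order
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),         # disaster vs. irrelevant
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()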

    Training this model does not require much more work than previous approaches (see code for details) and gives us a model that is much better than the previous ones, getting 79.5% accuracy! As with the models above, the next step should be to explore and explain the predictions using the methods we described to validate that it is indeed the best model to deploy to users. By now, you should feel comfortable tackling this on your own.

    Final Notes

    Here is a quick recap of the approach we’ve successfully used:

    • Start with a quick and simple model
    • Explain its predictions
    • Understand the kind of mistakes it is making
    • Use that knowledge to inform your next step, whether that is working on your data, or a more complex model.

    These approaches were applied to a particular example case using models tailored towards understanding and leveraging short text such as tweets, but the ideas are widely applicable to a variety of problems. I hope this helped you, we’d love to hear your comments and questions! Feel free to comment below or reach out to @EmmanuelAmeisen here or on Twitter.



  • August 15, 2018: Open Source Daily No. 160

    August 15, 2018



    Today's recommended open source project: Black, a Python code formatter. Link: GitHub

    Why we recommend it: this tool turns your Python code into... another piece of Python code, a more readable one of course. No matter how messy your code is, as long as it is still valid Python, it will look the same after formatting: aligned where it should be aligned, wrapped where it should be wrapped, and at the very least turned into code that anyone can read. If you work on a team with many people it may be worth a try, and I suspect similar tools for C++, JS and the like may show up in the future...


    Today's recommended English article: "Weekend Reading: All Things Bash" by Carlie Fairchild

    Original link: https://www.linuxjournal.com/content/weekend-reading-all-things-bash

    Why we recommend it: a roundup of Bash-related articles, recommended for anyone playing with Bash.

    Weekend Reading: All Things Bash

    Bash is a shell and command language. It is distributed widely as the default login shell for most Linux distributions. We’ve rounded up some of the most popular Bash-related articles for your weekend reading.

    Create Dynamic Wallpaper with a Bash Script

    By Patrick Wheelan

    Harness the power of Bash and learn how to scrape websites for exciting new images every morning.

     

    Developing Console Applications with Bash

    By Andy Carlson

    Bring the power of the Linux command line into your application development process.

     

    Parsing an RSS News Feed with a Bash Script

    By Jim Hall

    I can automate an hourly job to retrieve a copy of an RSS feed, parse it, and save the news items to a local file that the website can incorporate. That reduces complexity on the website, with only a little extra work by parsing the RSS news feed with a Bash script.

     

    Hacking a Safe with Bash

    By Adam Kosmin

    Being a minimalist, I have little interest in dealing with GUI applications that slow down my work flow or application-specific solutions (such as browser password vaults) that are applicable only toward a subset of my sensitive data. Working with text files affords greater flexibility over how my data is structured and provides the ability to leverage standard tools I can expect to find most anywhere.

     

    Graph Any Data with Cacti!

    By Shawn Powers

    Cacti is not a new program. It’s been around for a long time, and in its own way, it’s a complicated beast itself. I finally really took the time to figure it out, however, and I realized that it’s not too difficult to use. The cool part is that Cacti makes RRDtool manipulation incredibly convenient. It did take me the better part of a day to understand Cacti fully, so hopefully this article will save you some time.

     

    Reading Web Comics via Bash Script

    By Jim Hall

    I follow several Web comics. I used to open my Web browser and check out each comic’s Web site. That method was fine when I read only a few Web comics, but it became a pain to stay current when I followed more than about ten comics. These days, I read around 20 Web comics. It takes a lot of time to open each Web site separately just to read a Web comic. I could bookmark the Web comics, but I figured there had to be a better way—a simpler way for me to read all of my Web comics at once.

     

    My Favorite bash Tips and Tricks

    By Prentice Bisbal

    Save a lot of typing with these handy bash features you won’t find in an old-fashioned UNIX shell.



  • August 14, 2018: Open Source Daily No. 159

    August 14, 2018



    Today's recommended open source project: RIFM, an input formatting component for React. Link: GitHub

    Why we recommend it: this React component lets you restrict, transform, and enhance what gets typed into a text field in all sorts of ways, from simply limiting input to digits, to keeping two decimal places, to formatting dates and other kinds of input. If you have special input requirements and you are already using React, this component is an excellent choice.

     


    Today's recommended English article: "5 open source role-playing games for Linux" by Joshua Allen Holm

    Original link: https://opensource.com/article/18/8/role-playing-games-linux

    Why we recommend it: open source role-playing games for Linux. Gaming on Linux may still be a bit uncommon, but it is far from impossible, and here are some games you can play on it.

    5 open source role-playing games for Linux

    Gaming has traditionally been one of Linux’s weak points. That has changed somewhat in recent years thanks to Steam, GOG, and other efforts to bring commercial games to multiple operating systems, but those games are often not open source. Sure, the games can be played on an open source operating system, but that is not good enough for an open source purist.


    So, can someone who only uses free and open source software find games that are polished enough to present a solid gaming experience without compromising their open source ideals? Absolutely. While open source games are unlikely ever to rival some of the AAA commercial games developed with massive budgets, there are plenty of open source games, in many genres, that are fun to play and can be installed from the repositories of most major Linux distributions. Even if a particular game is not packaged for a particular distribution, it is usually easy to download the game from the project’s website in order to install and play it.

    This article looks at role-playing games. I have already written about arcade-style games, board & card games, puzzle games, and racing & flying games. In the final article in this series, I plan to cover strategy and simulation games.

    Endless Sky

    Endless Sky is an open source clone of the Escape Velocity series from Ambrosia Software. Players captain a spaceship and travel between worlds delivering trade goods or passengers, taking on other missions along the way, or they can turn to piracy and steal from cargo ships. The game lets the player decide how they want to experience the game, and the extremely large map of solar systems is theirs to explore as they see fit. Endless Sky is one of those games that defies normal genre classifications, but this action, role-playing, space simulation, trading game is well worth checking out.

    To install Endless Sky, run the following command:

    On Fedora: dnf install endless-sky

    On Debian/Ubuntu: apt install endless-sky

    FreeDink

    FreeDink is the open source version of Dink Smallwood, an action role-playing game released by RTSoft in 1997. Dink Smallwood became freeware in 1999, and the source code was released in 2003. In 2008 the game’s data files, minus a few sound files, were also released under an open license. FreeDink replaces those sound files with alternatives to provide a complete game. Gameplay is similar to Nintendo’s The Legend of Zelda series. The player’s character, the eponymous Dink Smallwood, explores an over-world map filled with hidden items and caves as he moves from one quest to another. Due to its age, FreeDink is not going to stand up to modern commercial games, but it is still a fun game with an amusing story. The game can be expanded by using D-Mods, which are add-on modules that provide additional quests, but the D-Mods do vary greatly in complexity, quality, and age-appropriateness; the main game is suitable for teenagers, but some of the add-ons are for adult audiences.

    To install FreeDink, run the following command:

    On Fedora: dnf install freedink

    On Debian/Ubuntu: apt install freedink

    ManaPlus

    Technically not a game in itself, ManaPlus is a client for accessing various massive multiplayer online role-playing games. The Mana World and Evol Online are two of the open source games available, but other servers are out there. The games feature 2D sprite graphics reminiscent of Super Nintendo games. While none of the games supported by ManaPlus are as popular as some of the commercial alternatives, they do have interesting worlds and at least a few players are online most of the time. Players are unlikely to run into massive groups of other players, but there are usually enough people around to make the games MMORPGs, not single-player games that require a connection to a server. The Mana World and Evol Online developers have joined together for future development, but for now, The Mana World’s legacy server and Evol Online offer different experiences.

    To install ManaPlus, run the following command:

    On Fedora: dnf install manaplus

    On Debian/Ubuntu: apt install manaplus

    Minetest

    Explore and build in an open-ended world with Minetest, a clone of Minecraft. Just like the game it is based on, Minetest provides an open-ended world where players can explore and build whatever they wish. Minetest provides a wide variety of block types and tools, making it a good alternative to Minecraft for anyone wanting a more open alternative. Beyond what comes with the basic game, Minetest can be extended with add-on modules, which add even more options.

    To install Minetest, run the following command:

    On Fedora: dnf install minetest

    On Debian/Ubuntu: apt install minetest

    NetHack

    NetHack is a classic Roguelike role-playing game. Players explore a multi-level dungeon as one of several different character races, classes, and alignments. The object of the game is to retrieve the Amulet of Yendor. Players begin on the first level of the dungeon and try to work their way towards the bottom, with each level being randomly generated, which makes for a unique game experience each time. While this game features either ASCII graphics or basic tile graphics, the depth of game-play more than makes up for the primitive graphics. Players who want less primitive graphics might want to check out Vulture for NetHack, which offers better graphics along with sound effects and background music.

    To install NetHack, run the following command:

    On Fedora: dnf install nethack

    On Debian/Ubuntu: apt install nethack-x11 or apt install nethack-console

    Did I miss one of your favorite open source role-playing games? Share it in the comments below.



  • August 13, 2018: Open Source Daily No. 158

    August 13, 2018



    Today's recommended open source project: face-api.js, a face recognition library. Link: GitHub

    Why we recommend it: as the name suggests, this is a TensorFlow-based face recognition JavaScript library. If you want to learn more about it, check our earlier Open Source Daily issue (link), which collected articles about this library. If you want to play with it yourself, take a look at the examples it provides. You can feed in your own photos, but note that the demos expect photos of real people rather than pictures of your favorite 2D anime characters.

    Demo: https://justadudewhohacks.github.io/face-api.js/face_detection


    Today's recommended English article: "10 Things You Will Eventually Learn About JavaScript Projects" by The Cat with a Dragon Tattoo

    Original link: https://blog.usejournal.com/10-things-you-will-eventually-learn-about-javascript-projects-efd7646b958a

    Why we recommend it: ten things you will eventually learn on JavaScript projects; they may serve as lessons learned the hard way to keep in mind.

    10 Things You Will Eventually Learn About JavaScript Projects

    JavaScript is an adventure. After almost a decade of both amateur and professional development in various industries, I believe everybody would agree on this one statement.

    Frontend projects give us, programmers, a lot of freedom of choice, of flexibility, and plenty of space for creativity — but in return they also demand a bit of knowledge, planning, and self-responsibility.

    Having gone through projects with jQuery, require.js, Angulars, React, ExtJs, and maybe a dozen others I may not recall (or wish to not recall) — I have seen things unimaginable to 2018’s Frontend. And we all probably did at some point!

    But there have always been some common patterns that made working on even the most uncoordinated projects somehow manageable. Below you will find a list of the 10 most vital of those — driven by personal experience, these are surely opinionated, although I believe many experienced developers may agree. These patterns provide a stable foundation for a project, of any framework, any methodology, any team size — reducing the need for documentation, refactoring, and the amount of tears shed by the developers’ eyes.

    I hope you will learn something new today, find it helpful, and use those to create something amazing!

    1. Divide and conquer

    Most of us heard it somewhere, but many seem to underestimate this rule. CommonJS, Webpack, and Node give us an ability to separate code into multiple files —but why would we even care?

    Consistency. Dividing your project into single-export files will make searching and dependency management significantly easier when the codebase grows. Naming each file after the only thing it exports makes it intuitive and puts no strain on the brain when traversing the architecture.

    Management. Separating each export into its own file allows you to quickly move it when necessary, and promotes decoupling¹. When your helper function is needed in a different part of the application, you may simply create a /shared directory, and drag it there — making it accessible to other parts of your code.

    2. Make things embarrassingly obvious

    Every variable, every function, every file —take your time and name them as if you were naming your newborn. You may save 0.3 seconds today by calling that variable “x”, but in a month you will spend 2 days trying to figure out what it means, then 4 more on refactoring. Think ahead, and don’t be afraid of long names.

    Avoid hacks and things that make you think about applying to MIT straight away. Your solution may indeed be smart and complex —and sometime in the future you, or somebody in your team, will agree on that and then proceed to spend a huge chunk of time trying to figure out what is going on in the code. Focus on making things simple, without a need for documentation or comments².

    3. Resolve magic numbers and strings

    Similarly to naming — however tempting it may be, don’t use magic numbers or strings in your code. No matter how small or non-extraordinary the value seems, put it into a variable with a meaningful name and move it to the top of its scope.

    Most of the time, any explicit value you put into the code will be reused somewhere else. Putting them in variables right away reduces code duplication, makes adjustments easier, and gives these values a meaning.

    4. Fight nesting

    If your code goes beyond 120 characters to the right, beyond 500 lines downwards, or your if-statement goes 3 levels deep — do your best to divide it.

    You can resolve conditionals’ complexity by dividing code within deeply nested if-statements into separate functions, Promises, or Observables. If you use a lot of asynchronous calls, async/await can also significantly simplify your code.

    5. Configure hard

    If your application uses global values, API endpoints, feature toggles, or third-party credentials³ — put those in a separate config file.

    There is a bunch of packages that help manage configs both on the web and in Node, like config. At some point your application will be available both on the server and locally for development. Creating a config file early is much easier than doing it at later stages, and it will allow you to adjust how these environments behave, which credentials they should use, which features are available, et cetera.

    6. Frameworks are there to help

    Too often you see a framework used just because someone knew it, or because it is popular.

    Take your time to think about whether you need a framework for your project, and which one it should be. The end user couldn’t care less whether your website or application is created in this one framework that has 100,000 stars on GitHub. From experience I would separate frameworks and libraries as:

    • React: when you need total control over the architecture and build, but only for webapps built with components. React-ecosystem development takes time and requires a lot of planning beforehand. React pays back plenty, but only if you know what you are doing.
    • Angular / VueJS / Ember: when you need a webapp done quickly and reliably, in exchange allowing a big black box instead of an architecture. These frameworks do a lot for you — taking away both the pros and cons of architecture planning. Their strict structure will also forgive more mistakes than the freedom of React would.
    • jQuery / lodash / or similar⁴: when you need a webpage done quickly, and you can spare a few kB. Those can significantly reduce development time, but require care, since they allow you to write unmaintainable code — use those as helpers, not as a foundation.
    • Vanilla / No framework: for both webpages and webapps, when you can spend a lot of time on development and planning. Pure JavaScript is a good choice when your project does something experimental — introduces WebGL, Workers, in-depth optimisations, or browser animations — you will end up creating your own kind of framework. With transpilers it can also be used as a better and lighter alternative to jQuery.

    Treat this list only as a suggestion — take time to decide which framework, if any, will be the best for your project.

    7. Unless it is a prototype — write tests

    Unit tests. Smoke tests. End-to-end tests. Sanity checks. Unless your project is only a prototype that will be rewritten soon, write tests. With increasing complexity your codebase will become much harder to maintain and control — tests will do it for you.

    Sometime in the future, you will encounter a bug, look onto the cloudless, blue skies, and thank your past self for writing tests — as you would have never realised how many things quietly broke down in the background after you added your brand new feature.

    8. Use version control

    No matter if it is a prototype, full scale enterprise webapp, or a little, happy side project — use git, or other version control, from the very moment you write down the first line of code. Commit daily, use branches, learn how to merge, resolve conflicts, and return to previous commits. Give meaningful commit messages.

    Version control allows you to travel through time, save broken things, see changes introduced in the past. If there is one thing you take away from this article, it is learning basics of version control and using it on a daily basis. Why? Because even if you ignore the rest, and accidentally go wrong on the way, with version control you can fix it — without it, you are usually doomed to start over.

    9. Manage state responsibly

    Find a pattern or a library for state management, and hang on to it like your life depended on it — because at some point it just may.

    As Frontend developers, we usually face only two significant challenges — to display data and to store data. The latter is far harder to maintain over time, as it is so convenient to ignore it — that is, until your project becomes virtually unmaintainable several months later.

    Storing data, i.e. state management, is tricky. Our applications usually have to remain in sync both with what the client sees on their screen and with what the server has stored in its databases. Our goal is not to add any more complexity in the middle — in our JavaScript structure. Components should deliver the same set of data, synchronise changes made by the user, and react to any changes on the server. How can we solve these issues?

    • Since it’s a very open ecosystem, for React there are plenty of solutions — Redux for a Flux architecture, MobX for an observable-based one. Each has its pros and cons — make sure to learn and understand the basics of the library before you use it.
    • Angular, Ember, and VueJS ship their own built-in state management solutions — based on the idea of Observables. While not necessary, there are also additional libraries⁵ like NgRx, Akita and Vuex.
    • For any other framework or Vanilla JavaScript, you can use Redux, Mobx, or your own state management solution. The main goal is to ensure that whole application has the same source of truth — this can be a service, a library, or a simple state Observable.

    10. Question trends

    In the very end, listen and learn from the community — but consider and question everything you read, every comment, every lengthy article on Medium written by a cat, any feedback to your code. Be open to new ideas — as those appear in our Frontend ecosystem quite dynamically — but make sure you are not following the hype just to follow the hype — this already led quite a few projects straight into oblivion.

    A project written in an older, mature framework is often far better and more stable than a project written in two frameworks at once — just because a new one came out. While new trends may improve application and development performance a bit, they rarely beat consistency. Stick to your choices to preserve maintainability, and adjust when necessary — but only when necessary.


    There we go —at this point I would like to thank you for reading, and would love to hear your opinions and stories below in the comments! As mentioned, these were only the very essential experiences I encountered during my rendezvous with JavaScript and Frontend development in its entirety — only a mere drop in an ocean of challenges we all encounter day by day!

    If you’d like to add something, endorse, or maybe point out where I was completely wrong — let me know in the comments section, or ping me on Twitter at @thefrontendcat!

    Once again, thank you,

    I hope you’ve found one or more of these valuable!

    TheFrontendCat @ InstantFrontend.io


